How to accelerate data prep with Data Wrangler in Microsoft Fabric

Статия
08/13/2024

The Data Wrangler tool is a notebook-based resource that provides an immersive interface for exploratory data analysis. It combines a grid-like data display with dynamic summary statistics, built-in visualizations, and a library of common data-cleaning operations. You can apply each operation with a few steps. You can update the data display in real time, and generate code in pandas or PySpark that you can save back to the notebook as a reusable function. This article focuses on exploration and transformation of pandas DataFrames. For more information about using Data Wrangler on Spark DataFrames, visit this resource.

Prerequisites

Get a Microsoft Fabric subscription. Or, sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
Use the experience switcher on the left side of your home page to switch to the Synapse Data Science experience.

Limitations

Custom code operations are currently supported only for pandas DataFrames.
The Data Wrangler display works best on large monitors, although you can minimize or hide different portions of the interface, to accommodate smaller screens.

Launching Data Wrangler

You can launch Data Wrangler directly from a Microsoft Fabric notebook to explore and transform any pandas or Spark DataFrame. For more information about using Data Wrangler with Spark DataFrames, visit this companion article. This code snippet shows how to read sample data into a pandas DataFrame:

import pandas as pd

# Read a CSV into a Pandas DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/titanic.csv")
display(df)

In the notebook ribbon "Home" tab, use the Data Wrangler dropdown prompt to browse the active DataFrames available for editing. Select the one you want to open in Data Wrangler.

Tip

Data Wrangler cannot be opened while the notebook kernel is busy. An executing cell must finish its execution before Data Wrangler can launch, as shown in this screenshot:

Choosing custom samples

To open a custom sample of any active DataFrame with Data Wrangler, select "Choose custom sample" from the dropdown, as shown in this screenshot:

This launches a pop-up with options to specify the size of the desired sample (number of rows) and the sampling method (first records, last records, or a random set). The first 5,000 rows of the DataFrame serve as the default sample size, as shown in this screenshot:

Viewing summary statistics

When Data Wrangler loads, it displays a descriptive overview of the chosen DataFrame in the "Summary" panel. This overview includes information about the DataFrame dimensions, its missing values, and more. Selection of any column in the Data Wrangler grid prompts the "Summary" panel to update and display descriptive statistics about that specific column. Quick insights about every column are also available in its header.

Tip

Column-specific statistics and visuals (both in the "Summary" panel and in the column headers) depend on the column datatype. For instance, a binned histogram of a numeric column appears in the column header only if the column is cast as a numeric type, as shown in this screenshot:

Browsing data-cleaning operations

A searchable list of data-cleaning steps can be found in the "Operations" panel. From the "Operations" panel, selection of a data-cleaning step prompts you to provide a target column or columns, along with any necessary parameters to complete the step. For example, the prompt to numerically scale a column requires a new range of values, as shown in this screenshot:

Tip

You can apply a smaller selection of operations from the menu of each column header, as shown in this screenshot:

Previewing and applying operations

The Data Wrangler display grid automatically previews the results of a selected operation, and the corresponding code automatically appears in the panel below the grid. To commit the previewed code, select "Apply" in either place. To delete the previewed code and try a new operation, select "Discard" as shown in this screenshot:

Once an operation is applied, the Data Wrangler display grid and summary statistics update to reflect the results. The code appears in the running list of committed operations, located in the "Cleaning steps" panel, as shown in this screenshot:

Tip

You can always undo the most recently applied step. In the "Cleaning steps" panel, a trash can icon will appear if you hover your cursor over that most recently applied step, as shown in this screenshot:

This table summarizes the operations that Data Wrangler currently supports:

Operation	Description
Sort	Sort a column in ascending or descending order
Filter	Filter rows based on one or more conditions
One-hot encode	Create new columns for each unique value in an existing column, indicating the presence or absence of those values per row
One-hot encode with delimiter	Split and one-hot encode categorical data using a delimiter
Change column type	Change the data type of a column
Drop column	Delete one or more columns
Select column	Choose one or more columns to keep, and delete the rest
Rename column	Rename a column
Drop missing values	Remove rows with missing values
Drop duplicate rows	Drop all rows that have duplicate values in one or more columns
Fill missing values	Replace cells with missing values with a new value
Find and replace	Replace cells with an exact matching pattern
Group by column and aggregate	Group by column values and aggregate results
Strip whitespace	Remove whitespace from the beginning and end of text
Split text	Split a column into several columns based on a user-defined delimiter
Convert text to lowercase	Convert text to lowercase
Convert text to uppercase	Convert text to UPPERCASE
Scale min/max values	Scale a numerical column between a minimum and maximum value
Flash Fill	Automatically create a new column based on examples derived from an existing column

Modify your display

At any time, you can customize the interface with the "Views" tab in the toolbar located above the Data Wrangler display grid. This can hide or show different panes based on your preferences and screen size, as shown in this screenshot:

Saving and exporting code

The toolbar above the Data Wrangler display grid provides options to save the generated code. You can copy the code to the clipboard, or export it to the notebook as a function. Exporting the code closes Data Wrangler and adds the new function to a code cell in the notebook. You can also download the cleaned DataFrame as a csv file.

Tip

Data Wrangler generates code that is applied only when you manually run the new cell, and it won't overwrite your original DataFrame, as shown in this screenshot:

You can then run that exported code, as shown in this screenshot:

To try out Data Wrangler on Spark DataFrames, visit this companion article
For a live-action demo of Data Wrangler in Fabric, check out this video from our friends at Guy in a Cube
To try out Data Wrangler in Visual Studio Code, head to Data Wrangler in VS Code
Did we miss a feature you need? Let us know! Suggest it at the Fabric Ideas forum

Споделяне чрез

How to accelerate data prep with Data Wrangler in Microsoft Fabric

Prerequisites

Limitations

Launching Data Wrangler

Choosing custom samples

Viewing summary statistics

Browsing data-cleaning operations

Previewing and applying operations

Modify your display

Saving and exporting code

Обратна връзка

Обратна връзка

Допълнителни ресурси

Споделяне чрез

How to accelerate data prep with Data Wrangler in Microsoft Fabric

Prerequisites

Limitations

Launching Data Wrangler

Choosing custom samples

Viewing summary statistics

Browsing data-cleaning operations

Previewing and applying operations

Modify your display

Saving and exporting code

Related content

Обратна връзка

Обратна връзка

Допълнителни ресурси