In this post, I’d like to share my journey of learning how to transform data into the desired structure in Fabric, using tools like Fabric Notebook, Data Pipeline, and more, after initially loading the data into a Fabric Lakehouse.
Power BI developers who have mainly or exclusively used Power Query Editor might feel overwhelmed when tasked with performing data transformations in Fabric. This is especially true when they are advised to avoid using Data Flow Gen 2 for very large tables.
When I first started learning and using Power Query Editor for data transformations quite some time ago, I preferred clicking buttons to execute transformations rather than writing M code directly in the formula bar. In my early stages of learning Power BI and data modeling, I probably wrote M code directly in Power Query Editor less than 30% of the time. Now, as a Power BI developer working in a large team, I find myself doing even less data transformation in Power Query Editor. Most of my focus is on creating various parameters in Power Query Editor, as the majority of transformations are handled upstream by our team’s data engineers.
In my case, however, this approach usually applies to existing reports and to new reports after the Proof of Concept (POC) stage.
When developing new reports, particularly during the POC stage, I rely heavily on Power Query Editor for transformations. This enables me to experiment with various table structures and data model designs, helping to identify the most effective setup for the semantic model. Once the team has determined the optimal structure for both fact and dimension tables, I work closely with data engineers to transform the source data into a table structure that delivers the best performance for the semantic model.
Working in Fabric, the process might be similar when dealing with an import mode semantic model, and many tasks can still be done in Power Query Editor. However, there will be numerous occasions where I choose to work with Direct Lake mode. In these cases, at the POC stage, I want to ensure that I can seamlessly experiment within Fabric to create and optimize high-performing data models.
For example, when working with a large dataset that doesn’t fit well into an import mode due to size constraints or performance issues, Direct Lake mode becomes essential. In such scenarios, I need to be equipped to experiment with different data transformation approaches using various Fabric tools like Data Flow Gen 2, Data Pipeline, or Fabric Notebook. This is crucial not just for building an effective data model but also for iterating quickly and refining the model to meet performance requirements.
To achieve this, I aim to develop my skills in using these tools beyond a basic level. For instance, understanding how to leverage Data Flow Gen 2 for efficient data transformations, or using Fabric Notebook to automate and scale complex data tasks, is vital. Being comfortable with these tools allows me to experiment more freely and confidently, leading to better decision-making and ultimately, more robust data models.
Throughout my journey of learning Fabric, I plan to share my experiences, not just in this post but also in future ones. I’ll take readers through my path from being a complete beginner with these tools to reaching a foundational level of proficiency, with the ultimate goal of progressing to an intermediate skill level (or beyond). Along the way, I’ll describe the pros and cons of each method, delve into the challenges I encountered and the strategies I employed to overcome them, and share the practical steps I took to become more proficient in using Data Flow Gen 2, Data Pipeline, and Fabric Notebook in my work.
Firstly, here’s a brief comparison of the pros and cons of using Data Flow Gen 2, Data Pipelines, and Fabric Notebooks:
Data Flow Gen 2
Pros:
User-Friendly Interface: Ideal for users familiar with Power Query Editor, offering a highly visual, no-code/low-code environment.
Extensive Transformation Options: Provides over 300 data transformation functions.
Integration: Easily integrates with various data sources and destinations.
Cons:
Performance: May not be as performant as other tools for very large datasets or complex transformations.
Data Pipelines
Pros:
Scalability: Can handle large-scale data ingestion and transformation, suitable for petabyte-scale data.
Flexibility: Supports a wide range of data sources and destinations.
Scheduling: Allows for scheduled data processing, making it ideal for regular data updates.
Cons:
Complexity: May require more technical expertise to set up and manage compared to Data Flow Gen 2.
Less Visual: Typically involves more coding and less visual interface, which might be less intuitive for some users.
Fabric Notebooks
Pros:
Powerful and Flexible: Supports complex data transformations and processing using languages like Python and Spark SQL.
Integration with Spark: Leverages the power of Spark for high-performance data processing.
Extensive Libraries: Access to a wide range of open-source libraries for data manipulation and analysis.
Cons:
Steeper Learning Curve: Requires knowledge of programming languages and Spark, which might be challenging for non-technical users.
Less Visual: More code-centric, which can be less user-friendly compared to Data Flow Gen 2.
Each tool has its strengths and is suited for different use cases. The choice depends on specific needs, technical expertise, and the scale of data operations.
Data Flow Gen 2
Let’s start with Data Flow Gen 2, whose user interface will be quite familiar to those experienced with Power Query Editor.
I set up a Lakehouse and uploaded several CSV files to it.
I attempted to transform some of the CSV files using Data Flow Gen 2.
I looked for the Lakehouse that holds the CSV files.
I renamed the Data Flow Gen 2 and began transforming data. Since I opened the new Data Flow Gen 2 within the Lakehouse, the destination for each query was already set to the Lakehouse. If I had opened Data Flow Gen 2 in the Workspace instead, I would likely have needed to assign a destination for each query individually, which might not be efficient. In the image below, step 7 might be unnecessary because the destination for each query is pre-set, but I could still perform step 7 if I wanted a different destination for each query.
Now, the destination Lakehouse contains the date_src, product_src, and store_src tables, all transformed exactly as I wanted.
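As a quick aside, one way I can sanity-check the output is to query the new tables from a Fabric notebook attached to the same Lakehouse. The snippet below is just a minimal verification sketch, assuming the destination Lakehouse is set as the notebook’s default Lakehouse; the table names come from my example above.
# List the tables in the default Lakehouse to confirm the Data Flow Gen 2 output landed
display(spark.sql("SHOW TABLES"))
# Peek at one of the transformed tables (date_src, product_src, or store_src)
df_check = spark.read.table("product_src")
display(df_check.limit(10))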
Data Pipeline
I set up a new Lakehouse and began utilizing Data Pipeline.
I chose a Lakehouse that holds the source file I wanted to transform.
I selected a source and clicked “Next.” The following step, as shown in the image below, is to choose the data destination. However, since I initiated the Data Pipeline operation within the Lakehouse, the destination is automatically set to the same Lakehouse.
I opted to change the data type of one column and then clicked “Next”.
Then, the transformation process began.
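To double-check that the copy operation applied the data type change, one option is to inspect the destination table’s schema from a notebook. This is only a rough sketch, assuming the destination Lakehouse is attached to the notebook; copied_table_name is a placeholder for whatever table name the pipeline actually wrote to.
# Read the table created by the Data Pipeline copy activity (placeholder name)
df_copied = spark.read.table("copied_table_name")
# printSchema lists each column with its data type, so the changed column can be verified
df_copied.printSchema()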
Fabric Notebook
One of the Lakehouses I recently created contains many tables with the same structure. I wanted to union some of them, so I tried to use the UNION function in a Fabric notebook.
df = spark.sql("SELECT * FROM bronze_lakehouse.yellow_tripdata_2024_01")
display(df)
df_02 = spark.sql("SELECT * FROM bronze_lakehouse.yellow_tripdata_2024_02")
display(df_02)
df_03 = spark.sql("SELECT * FROM bronze_lakehouse.yellow_tripdata_2024_03")
display(df_03)
result = df.union(df_02).union(df_03)
# Load the data into a Delta table
table_name = "union_data_test" # Replace with your desired table name
result.write.format("delta").mode("overwrite").saveAsTable(table_name)
The code above might not be written in the most optimal way, but I aimed to describe each step in as much detail as possible. I’m still learning to write PySpark code more efficiently and effectively, and I hope to improve soon.
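For reference, here is one way the same union could be written more compactly, assuming all the monthly tables share the same schema. It loops over the table names and combines them with unionByName, which matches columns by name rather than position; I haven’t benchmarked it against the version above.
from functools import reduce

# Monthly source tables that share the same structure
table_names = [
    "bronze_lakehouse.yellow_tripdata_2024_01",
    "bronze_lakehouse.yellow_tripdata_2024_02",
    "bronze_lakehouse.yellow_tripdata_2024_03",
]

# Read each table and union them by column name
dfs = [spark.read.table(name) for name in table_names]
result = reduce(lambda left, right: left.unionByName(right), dfs)

# Write the combined data to a Delta table in the Lakehouse
result.write.format("delta").mode("overwrite").saveAsTable("union_data_test")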
As shown below, the new table named union_data_test has been created. It is now ready for further development and transformation, and can be used in the POC stage for initial analysis.
To summarize, learning to transform data in Fabric using tools like Fabric Notebook, Data Pipeline, and Data Flow Gen 2 can be challenging but rewarding. Initially, it might feel overwhelming, especially for those accustomed to only Power Query Editor. However, experimenting with different data transformation approaches during the Proof of Concept (POC) stage can help identify the most effective setup for your data models. By developing proficiency in these tools, you can create and optimize high-performing data models, ensuring seamless experimentation and better decision-making.
I hope this makes it enjoyable to start developing good skills in using Fabric for data transformations and data modeling.