Developing Large Power BI Datasets – Part 2 – Partitioning

Previous post in this series: Developing Large Power BI Datasets – Part 1

Partitioning Large Tables

Table partitioning has long been an important task in semantic model design with SQL Server Analysis Services. Because SSAS Tabular and Power BI models are built on the SSAS architecture, the partitioning pattern remains much as it has been for twenty years, although the specific implementation methods have been fine-tuned and improved. The main reasons to partition large fact tables are:

  • Improve refresh speed
  • Avoid reloading historical records
  • Capture updates to historical records
  • Reduce resource load on the source database

You don’t need massive tables to benefit from partitioning. Even tables with a few hundred thousand records can see faster data refresh and more reliable detection of source data changes. There is little maintenance overhead, so the benefits usually outweigh the cost in effort and management.

Crossing the Bridge to Enterprise Power BI Development

One thing to keep in mind is that once a dataset has been partitioned, you can no longer download it as a PBIX file from the service. A number of model properties have this effect; in fact, any change you make to a deployed model through the XMLA endpoint disables downloading the file. Honestly, if you are managing an enterprise-class Power BI solution, you should be keeping your development files in a source control repository, so this shouldn’t be a real restriction.

Why Partition?

Partitions are completely hidden from report and data model users. They see only the whole table, with no indication that it is composed of multiple partitions. For administrators, partitions make it possible to process data for a single time period (e.g., year, month, week, or day). Each partition can also use a different storage mode. For example, the past five years of sales history can be stored using Import mode, while current records (say, everything since the start of yesterday) use DirectQuery so we always see real-time data for that brief period.
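As a minimal sketch (with hypothetical server, database, and column names, trimmed to just the relevant properties), the partitions collection of such a hybrid table might look like this in the model’s JSON definition, with history in Import mode and current data in DirectQuery mode:

    "partitions": [
      {
        "name": "Online Sales History",
        "mode": "import",
        "source": {
          "type": "m",
          "expression": "let Source = Sql.Database(\"sqlserver01\", \"ContosoDW\"), Fact = Source{[Schema = \"dbo\", Item = \"FactOnlineSales\"]}[Data], History = Table.SelectRows(Fact, each [OrderDate] < #date(2022, 1, 1)) in History"
        }
      },
      {
        "name": "Online Sales Current",
        "mode": "directQuery",
        "source": {
          "type": "m",
          "expression": "let Source = Sql.Database(\"sqlserver01\", \"ContosoDW\"), Fact = Source{[Schema = \"dbo\", Item = \"FactOnlineSales\"]}[Data], Current = Table.SelectRows(Fact, each [OrderDate] >= #date(2022, 1, 1)) in Current"
        }
      }
    ]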

The following image shows the TOM Explorer (Tabular Object Model) pane in Tabular Editor, with a set of table partitions automatically generated from an Incremental Refresh policy by the Power BI service:

Here is a representative example of the XMLA script used to create a quarterly partition on the Fact Online Sales table for Q1 of 2022 (the server, database, and column names below are illustrative):
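    {
      "create": {
        "parentObject": {
          "database": "Contoso Sales",
          "table": "Fact Online Sales"
        },
        "partition": {
          "name": "Fact Online Sales - 2022 Q1",
          "mode": "import",
          "source": {
            "type": "m",
            "expression": [
              "let",
              "    Source = Sql.Database(\"sqlserver01\", \"ContosoDW\"),",
              "    Fact = Source{[Schema = \"dbo\", Item = \"FactOnlineSales\"]}[Data],",
              "    Q1 = Table.SelectRows(Fact, each [OrderDate] >= #date(2022, 1, 1) and [OrderDate] < #date(2022, 4, 1))",
              "in",
              "    Q1"
            ]
          }
        }
      }
    }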

You might notice that this “XMLA” definition is actually in JSON format. The product team kept the old name from the days when we used XML-formatted scripts to define objects, but moved to the more modern JSON format back in 2016. The scripting language is actually called TMSL (Tabular Model Scripting Language), and it is sent over the XMLA protocol. Regardless, this is how you create objects through the XMLA endpoint when connected to a Power BI dataset in a Premium capacity workspace.

Back when we had to create and manage semantic model objects using script, developers would spend weeks generating and modifying scripts like this and then automating their partition management (creating, dropping, and merging partitions) using tools like SSIS. Those days are mostly behind us, thanks to the magic of Incremental Refresh policies. Scripts like the previous example are generated by Incremental Refresh in the Power BI service. An administrator could use such a script as a starting point to manage the process manually, but this is unnecessary in most implementations today.

Implementing Incremental Refresh

Incremental Refresh is a feature of the Power BI service that automatically generates and manages partitions based on a simple policy designed in Power BI Desktop. To enable this feature, follow the steps in the previous post to enable parameterization in Power Query: create RangeStart and RangeEnd parameters and add a date range filter to the fact table query.
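For reference, here is a minimal sketch of what the fact table query might look like with the filter in place (the server, database, and column names are hypothetical). RangeStart and RangeEnd must be Date/Time parameters, and one boundary should use >= while the other uses < so that adjacent partitions never load the same row twice:

    let
        // Connect to the source database and reference the fact table (illustrative names)
        Source = Sql.Database("sqlserver01", "ContosoDW"),
        Fact = Source{[Schema = "dbo", Item = "FactOnlineSales"]}[Data],
        // Filter on the RangeStart/RangeEnd parameters required by Incremental Refresh;
        // >= on one end and < on the other prevents overlapping partition boundaries
        Filtered = Table.SelectRows(Fact, each [OrderDateTime] >= RangeStart and [OrderDateTime] < RangeEnd)
    in
        Filtered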

Create an Incremental Refresh policy from the ellipsis menu for the fact table in Power BI Desktop. If you have set up the parameters and date range filter correctly, you will see the following dialog:

The settings shown here will generate partitions for five years of data. The two months prior to the refresh date are stored in monthly partitions, and older records are stored in yearly partitions. Additionally, with the detect-data-changes option, the policy checks the partition queries to see whether any records have changed since the last refresh and, if so, processes the partition(s) containing those records.
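Under the covers, the dialog stores these choices as a refresh policy on the table. Viewed through the Tabular Object Model (in Tabular Editor, for example), a policy matching these settings would serialize roughly as follows; the pollingExpression shown, which implements the detect-data-changes option, is a hypothetical query against a last-modified audit column:

    "refreshPolicy": {
      "policyType": "basic",
      "rollingWindowGranularity": "year",
      "rollingWindowPeriods": 5,
      "incrementalGranularity": "month",
      "incrementalPeriods": 2,
      "pollingExpression": "let MaxLastUpdate = List.Max(#\"Fact Online Sales\"[LastModifiedDate]) in MaxLastUpdate",
      "sourceExpression": [
        "let",
        "    Source = Sql.Database(\"sqlserver01\", \"ContosoDW\"),",
        "    Fact = Source{[Schema = \"dbo\", Item = \"FactOnlineSales\"]}[Data],",
        "    Filtered = Table.SelectRows(Fact, each [OrderDateTime] >= RangeStart and [OrderDateTime] < RangeEnd)",
        "in",
        "    Filtered"
      ]
    }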

At the 2022 PASS Data Community Summit in Seattle last week, Shabnam Watson covered this topic in great detail in her presentation: A look at automatic partition maintenance by Power BI Service for datasets with Incremental Refresh (Shabnam Watson’s Blog).

One of the things I learned in her session was that it takes two refresh cycles for partition merging to take place. This means that at the end of the month, the last monthly partition won’t be merged into the first yearly partition until the next refresh cycle. The service keeps the data consistent throughout the process, but it takes time to clean up after itself internally. In other words, don’t worry about this unless you are orchestrating unusually complex automation over the partitions outside of scheduled refresh.

Incremental Refresh works with both Pro and Premium licensing, but datasets cannot exceed 1 GB of compressed data unless the workspace is backed by Premium capacity or a Premium Per User license.

Next post in this series: Developing Large Power BI Datasets – Part 3 – Detail tables

Paul Turley

Microsoft Data Platform MVP, Principal Consultant for 3Cloud Solutions Specializing in Business Intelligence, SQL Server solutions, Power BI, Analysis Services & Reporting Services.

3 thoughts on “Developing Large Power BI Datasets – Part 2 – Partitioning”

  1. Great post, thanks. Are there any planned future developments that will make incremental refresh more effective for data that changes over a long period of time? Or is there any way to set up our policy using Tabular Editor to overcome the scenario below?

    In our case we have data updating from two years ago, for example, meaning that our incremental refresh window needs to extend that far back.

    In addition, we have data updating in almost every partition (not many rows, only a handful), meaning that an incremental refresh takes basically as long as a full refresh, because every partition detects changes even though only a small number of rows have changed in each. For a scenario like this, it would make more sense to use a primary key to check against the source for updates on the last_modified column and update only the rows that have changed.
