Developing Large Power BI Datasets – Part 2 – Partitioning

Previous post in this series: Developing Large Power BI Datasets – Part 1

Partitioning Large Tables

Table partitioning has long been an important task in semantic model design with SQL Server Analysis Services. Because SSAS Tabular and Power BI models are built on the SSAS architecture, the partitioning pattern remains much as it has been for twenty years, although the specific implementation methods have been fine-tuned and improved. The main reasons to partition large fact tables are:

  • Improve refresh speed
  • Avoid reloading historical records
  • Capture updates to historical records
  • Reduce resource load on the source database

You don’t need massive tables to benefit from partitioning. Even tables with a few hundred thousand records can see faster data refresh and more reliable detection of source data changes. There is little maintenance overhead, so the benefits usually outweigh the cost in effort and management.

Crossing the Bridge to Enterprise Power BI Development

One thing to keep in mind is that once a dataset has been partitioned, you can no longer download it as a PBIX file from the service. A number of model properties have this effect; in fact, any change you make to a deployed model through the XMLA endpoint disables downloading the file. Honestly, if you are managing an enterprise-class Power BI solution, you should be keeping your development files in a source control repository, so this shouldn’t be a real restriction.

Why Partition?

Partitions are completely hidden from report and data model users. They see only the whole table, with no indication that it is composed of multiple partitions. For administrators, partitions make it possible to process data for a single time period (e.g., year, month, week, or day). Each partition can also use a different storage mode. For example, the past five years of sales history can be stored using Import mode, while current records (say, everything since the start of yesterday) use DirectQuery so we always see real-time data for that brief period.
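As a minimal sketch (with hypothetical server, database, and column names, trimmed to just the relevant properties), the partitions collection of such a hybrid table might look like this in the model’s JSON definition, with history in Import mode and current data in DirectQuery mode:

    "partitions": [
      {
        "name": "Online Sales History",
        "mode": "import",
        "source": {
          "type": "m",
          "expression": "let Source = Sql.Database(\"sqlserver01\", \"ContosoDW\"), Fact = Source{[Schema = \"dbo\", Item = \"FactOnlineSales\"]}[Data], History = Table.SelectRows(Fact, each [OrderDate] < #date(2022, 1, 1)) in History"
        }
      },
      {
        "name": "Online Sales Current",
        "mode": "directQuery",
        "source": {
          "type": "m",
          "expression": "let Source = Sql.Database(\"sqlserver01\", \"ContosoDW\"), Fact = Source{[Schema = \"dbo\", Item = \"FactOnlineSales\"]}[Data], Current = Table.SelectRows(Fact, each [OrderDate] >= #date(2022, 1, 1)) in Current"
        }
      }
    ]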

The following image shows the TOM Explorer (Tabular Object Model) pane in Tabular Editor, with a set of table partitions automatically generated from an Incremental Refresh policy by the Power BI service:

Here is a representative example of the XMLA script used to create a quarterly partition on the Fact Online Sales table for Q1 of 2022 (the server, database, and column names below are illustrative):
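    {
      "create": {
        "parentObject": {
          "database": "Contoso Sales",
          "table": "Fact Online Sales"
        },
        "partition": {
          "name": "Fact Online Sales - 2022 Q1",
          "mode": "import",
          "source": {
            "type": "m",
            "expression": [
              "let",
              "    Source = Sql.Database(\"sqlserver01\", \"ContosoDW\"),",
              "    Fact = Source{[Schema = \"dbo\", Item = \"FactOnlineSales\"]}[Data],",
              "    Q1 = Table.SelectRows(Fact, each [OrderDate] >= #date(2022, 1, 1) and [OrderDate] < #date(2022, 4, 1))",
              "in",
              "    Q1"
            ]
          }
        }
      }
    }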

You might notice that this “XMLA” definition is actually in JSON format. The product team kept the old name from the days when we used XML-formatted scripts to define objects, but moved to the more modern JSON format back in 2016. The scripting language is actually called TMSL (Tabular Model Scripting Language), and it is sent over the XMLA protocol. Regardless, this is how you create objects through the XMLA endpoint when connected to a Power BI dataset in a Premium capacity workspace.

Back when we had to create and manage semantic model objects using script, developers would spend weeks generating and modifying scripts like this and then automating their partition management (creating, dropping, and merging partitions) using tools like SSIS. Those days are mostly behind us, thanks to the magic of Incremental Refresh policies. Scripts like the previous example are generated by Incremental Refresh in the Power BI service. An administrator could use such a script as a starting point to manage the process manually, but this is unnecessary in most implementations today.

Implementing Incremental Refresh

Incremental Refresh is a feature of the Power BI service that automatically generates and manages partitions based on a simple policy designed in Power BI Desktop. To enable this feature, follow the steps in the previous post to enable parameterization in Power Query: create RangeStart and RangeEnd parameters and add a date range filter to the fact table query.
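For reference, here is a minimal sketch of what the fact table query might look like with the filter in place (the server, database, and column names are hypothetical). RangeStart and RangeEnd must be Date/Time parameters, and one boundary should use >= while the other uses < so that adjacent partitions never load the same row twice:

    let
        // Connect to the source database and reference the fact table (illustrative names)
        Source = Sql.Database("sqlserver01", "ContosoDW"),
        Fact = Source{[Schema = "dbo", Item = "FactOnlineSales"]}[Data],
        // Filter on the RangeStart/RangeEnd parameters required by Incremental Refresh;
        // >= on one end and < on the other prevents overlapping partition boundaries
        Filtered = Table.SelectRows(Fact, each [OrderDateTime] >= RangeStart and [OrderDateTime] < RangeEnd)
    in
        Filtered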

Create an Incremental Refresh policy from the ellipsis menu for the fact table in Power BI Desktop. If you have set up the parameters and date range filter correctly, you will see the following dialog:

The settings shown here will generate partitions for five years of data. The two months prior to the refresh date are stored in monthly partitions, and older records are stored in yearly partitions. Additionally, with the detect-data-changes option, the policy checks the partition queries to see whether any records have changed since the last refresh and, if so, processes the partition(s) containing those records.
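Under the covers, the dialog stores these choices as a refresh policy on the table. Viewed through the Tabular Object Model (in Tabular Editor, for example), a policy matching these settings would serialize roughly as follows; the pollingExpression shown, which implements the detect-data-changes option, is a hypothetical query against a last-modified audit column:

    "refreshPolicy": {
      "policyType": "basic",
      "rollingWindowGranularity": "year",
      "rollingWindowPeriods": 5,
      "incrementalGranularity": "month",
      "incrementalPeriods": 2,
      "pollingExpression": "let MaxLastUpdate = List.Max(#\"Fact Online Sales\"[LastModifiedDate]) in MaxLastUpdate",
      "sourceExpression": [
        "let",
        "    Source = Sql.Database(\"sqlserver01\", \"ContosoDW\"),",
        "    Fact = Source{[Schema = \"dbo\", Item = \"FactOnlineSales\"]}[Data],",
        "    Filtered = Table.SelectRows(Fact, each [OrderDateTime] >= RangeStart and [OrderDateTime] < RangeEnd)",
        "in",
        "    Filtered"
      ]
    }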

At the 2022 PASS Data Community Summit in Seattle last week, Shabnam Watson covered this topic in great detail in her presentation: A look at automatic partition maintenance by Power BI Service for datasets with Incremental Refresh (Shabnam Watson’s Blog).

One of the things I learned in her session was that it takes two refresh cycles for partition merging to take place. This means that at the end of the month, the last monthly partition won’t be merged into the first yearly partition until the next refresh cycle. The service keeps the data consistent throughout the process, but it takes time to clean up after itself internally. In other words, don’t worry about this unless you are orchestrating unusually complex automation over the partitions outside of scheduled refresh.

Incremental Refresh works with both Pro and Premium licensing, but datasets cannot exceed 1 GB of compressed data unless the workspace is backed by Premium capacity or a Premium Per User license.

Next post in this series: Developing Large Power BI Datasets – Part 3 – Detail tables

Paul Turley

Microsoft Data Platform MVP, Principal Consultant for 3Cloud Solutions Specializing in Business Intelligence, SQL Server solutions, Power BI, Analysis Services & Reporting Services.

3 thoughts on “Developing Large Power BI Datasets – Part 2 – Partitioning”

  1. Great post, thanks. Are there any planned future developments that will make incremental refresh more effective for data that changes over a long period of time? Or is there any way to set up our policy using Tabular Editor to overcome the scenario below?

    In our case we have data updating from two years ago, for example, meaning that our incremental refresh window needs to extend that far back.

    In addition, we have data updating in almost every partition (not many rows, only a handful), meaning that an incremental refresh takes basically as long as a full refresh, because every partition detects changes even though only a small number of rows have changed in each. For a scenario like this, it would make more sense to use a primary key to check against the source for updates on the last_modified column and update only the rows that have changed.
