Doing Power BI the Right Way: 10. Designing and Managing Large Datasets

Part of the series: Doing Power BI the Right Way

Later in this post, you will find a 27-minute video featuring a series of interviews and discussions with a few of my colleagues, who share perspectives and recent experiences related to this topic. The best approach to designing and maintaining large data models depends on several factors, but many years of experience have proven that there are best practices to embrace and common mistakes to avoid.

I was just talking to a consulting client about the best approach to build a data model, and he told me something very interesting about the way they were loading data into Power BI. He said, “We don’t use facts and dimensions, we load all of our data into one huge table.” He said that their data model performs well and that it meets their reporting needs. It is a difficult point to argue when something is working at the time, even if the design doesn’t follow the accepted rules. Life is like that, and there are plenty of analogies to make the point that a practice, even a really bad practice, might solve a problem for a period of time and under certain conditions. You can drive a car at excessive speed to get to your destination faster; you might not get caught by the police that day and you might not crash, but if you make it a habit, the practice will eventually catch up with you. Data is like that. If you don’t play by the rules, you limit your options. Bending the rules lets you move faster and sometimes with less hassle. But as the project scope expands, and after enough data or other complexity is added to the solution, it will not endure. The data model won’t perform well, won’t load the correct data, or it just won’t be reliable.

This post explores the realities of best-practice design for large data models, along with some important considerations and trade-off decisions when working with both “big data” and “large data”.

One of the great challenges with Power BI is that it is so easy to build simple data models and reports quickly. But often, when those projects need to grow up, they aren’t ready for enterprise use. This isn’t a problem with the tool; it is a problem with the way the tool is often used. If someone says, “Hey, Paul, I need you to create a report quickly. Can you import data from these three sources into Power BI and create a quick dashboard report?” I might say, “Sure, let’s just do this:”

  • Merge records from multiple files and queries into a single table in the model.
  • Use in-line SQL queries in Power Query to join source tables together and apply transformation logic.
  • In reports, summarize numeric columns from a table in the data model, rather than defining DAX measures.
  • Reload the entire model when source data changes.

There is nothing inherently wrong or bad about using these quick-and-easy techniques in a smallish project, but if it needs to grow, scale and be future-proof, you may need to follow a longer path to achieve a more durable result.
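To illustrate one of those longer-path habits, here is a minimal sketch of explicit DAX measures taking the place of implicit column summarization in report visuals. The table and column names (Sales, Sales[SalesAmount], Sales[OrderQuantity]) are hypothetical and used only for illustration.

  Total Sales = SUM ( Sales[SalesAmount] )

  Total Quantity = SUM ( Sales[OrderQuantity] )

  Average Sale Amount = DIVIDE ( [Total Sales], COUNTROWS ( Sales ) )

Unlike dragging a numeric column into a visual, explicit measures can be formatted once, reused across reports, and redefined in one place when the business logic changes.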

The answer to a question we often ask at the beginning of a BI project, “What information do we need to include in the data model?”, is usually the same. I think Freddie Mercury said it best: “I want it all and I want it now!” More specifically, stakeholders might say that they want:

  1. A lot of detail rows or historical fact records (let’s say tens of millions or billions of rows).
  2. Many columns from the source table(s) so they can slice, filter or group values on anything imaginable.
  3. Reports must perform well and return results quickly.

…to these requests, we might say, “Pick any two, but you can’t have all three.” Maybe you can have all three up to a certain point, but not to an extreme.

Tall and Wide Tables

A “tall” table contains many rows or records. A million rows might be a substantial fact table in a desktop Power BI data model, and row counts in the billions are not out of reach in an enterprise-scale solution built in either Power BI or Analysis Services. The VertiPaq analytic data processing engine (the data modeling engine behind SSAS, AAS, Power Pivot and Power BI) is very good at handling very tall tables when the columns use conservative, numeric data types, especially when they contain repeating, non-unique values that aid columnar compression. In rare cases, Power BI data models can even read and process petabytes of data! The point is that tall tables are very achievable.
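As a rough, hypothetical illustration of why column design matters more than row count, consider the two DAX measures below. The first scans a single compressed, low-cardinality numeric column, which is exactly the kind of work VertiPaq does very well over hundreds of millions of rows. The second performs a distinct count over a high-cardinality column (a transaction GUID), which compresses poorly and is far more expensive to scan. The table and column names are illustrative only.

  Total Quantity = SUM ( Sales[Quantity] )  -- low-cardinality numeric column: compresses and scans well

  Distinct Transactions = DISTINCTCOUNT ( Sales[TransactionGUID] )  -- high-cardinality column: poor compression, costly to scan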

Tables with many columns can be managed by the modeling engine, but tall (many rows) and wide (many columns) are often not a good combination. The fact is that analytic reporting solutions shouldn’t include the whole kitchen sink. The problem often stems from developers not building data models based on business requirements, but simply exposing all the available data just in case someone might find it valuable. This is just not the way BI works. If you don’t need certain columns to meet the current report requirements, leave them out. Columns can usually be added later when they are needed. Additional tables can also be added until the model exceeds its design goals. Herein lies the need for balance.

The thing is that simple data models are relatively easy to build and often don’t take up a lot of space. It may be more cost-effective in the long run to create another data model with some duplicate design elements than to spend all of your time and energy trying to craft an all-encompassing one.

Large Data and Big Data

By “Large Data”, we generally mean tables that contain a high number of rows. “Big Data”, by contrast, refers to a set of technologies engineered to manage large volumes of data records using distributed storage and processing.

Big Data

In the first wave of cloud computing, we saw the introduction of “Big Data”, which was a new concept for data professionals. Technologies like Hadoop and Apache Spark use distributed computing to process large sets of data in parallel, using a highly scaled-out architecture. The term “Big Data” doesn’t just mean that there is a lot of data; it really means “distributed data processing”, which can potentially be used to handle very large data sets. Like any architecture, there are strengths and weaknesses. Big Data implementations like Spark and Databricks can transform and process an enormous amount of information very quickly, with portions of the data processed on distributed nodes.

Conversely, the VertiPaq engine at the center of a Power BI or SSAS tabular data model is not a distributed architecture. Because VertiPaq keeps all of the data in memory on a single computing node, it can perform calculations extremely fast; for this to work, data must reside in contiguous memory on a single computer. So, Power BI is not Big Data, but it can be used in conjunction with Big Data sources when you use these respective technologies the way they were designed and architected to work.

Complex Models

Unique problems require unique data model designs. Sometimes a Power BI report and its accompanying data model just need to address a few business reporting objectives. As time goes by, additional requirements are introduced and the model grows. Along the way, certain data challenges might require additional data massaging, or new bridging tables to address special needs. Over time, a model can get complicated. This often raises the question: should we create one big, complex data model to address many requirements, or multiple, less-complicated data models? There are trade-off decisions either way. It is often possible to engineer larger, high-value models. However, you have to weigh the cost of maintaining a complex model against the relative simplicity and cost savings of less-complicated models. Sometimes starting over is cheaper than layering more messiness onto a data model with a shaky foundation.

Partitioning

Once considered an essential activity in the design of any data model, partitioning breaks a table into multiple storage units that can be managed and loaded separately. Partitions containing unchanged records don’t need to be reprocessed, and the smaller partitions containing new or changed records can be processed quickly and in parallel, so they don’t block other operations.
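As a sketch of what partition-level processing can look like against a model hosted in Analysis Services or a Power BI Premium workspace, the TMSL refresh command below (issued over the XMLA endpoint) reprocesses a single partition and leaves the rest of the table untouched. The database, table, and partition names are hypothetical.

  {
    "refresh": {
      "type": "dataOnly",
      "objects": [
        {
          "database": "SalesModel",
          "table": "Fact Sales",
          "partition": "Fact Sales - 2024"
        }
      ]
    }
  }

In a Power BI dataset, defining an incremental refresh policy achieves a similar result automatically: the service generates date-range partitions and refreshes only the most recent ones.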

Video Interviews

This video runs about 27 minutes and includes a series of interviews and discussions with some of my colleagues, who share their thoughts and experiences on this topic.

Paul Turley

Microsoft Data Platform MVP and Principal Consultant for 3Cloud Solutions, specializing in Business Intelligence, SQL Server solutions, Power BI, Analysis Services & Reporting Services.
