Choosing the right Data Warehouse SQL Engine: Apache Hive LLAP vs Apache Impala

by David Dichmann

Posted in Technical | September 24, 2020 4 min read

Aren’t two superheroes better than one?

Some of the most powerful results come from combining complementary superpowers, and the “dynamic duo” of Apache Hive LLAP and Apache Impala, both included in Cloudera Data Warehouse, is further evidence of this. Both Impala and Hive can operate at an unprecedented and massive scale, with many petabytes of data. Both are 100% Open source, so you can avoid vendor lock-in while you use your favorite BI tools, and benefit from community-driven innovation.

Both Impala and Hive LLAP each sound like they will work great for my data warehouse use cases, so why do I really need to decide between the two? The answer is simple, each has its own unique specialties, and depending on the type of analytics you want to do, you might find one is better suited than the other. However, there is a secret I am keeping to the end of the blog, which makes the decision even easier for the user: so easy in fact, you do not even have to decide yourself.

Before I get into the differences between these SQL engines, it is important to note that both Impala and Hive LLAP share the same data and metadata (through the Hive Metastore) so not only can you switch from one to the other if you change your mind, you can even run different workloads using different engine choices on the same data, at the same time. A true “best of both worlds” situation.

So, why choose? Well, generally speaking, Impala works best when you are interacting with a data mart, which is typically a large dataset with a schema that is limited in scope. Meanwhile, Hive LLAP is a better choice for dealing with use cases across the broader scope of an enterprise data warehouse. These use cases often involve multiple departments and a variety of downstream applications, both of which result in a wider array of query patterns. We also see that Impala is a good choice for interactive, ad-hoc queries, especially if you have hundreds or thousands of users working on their own.

You can also mix and match, using Impala for some queries and some tables, and Hive LLAP for other queries and other tables.

Impala was designed for speed.

Written in C++, which is very CPU efficient, with a very fast query planner and metadata caching, Impala is optimized for low latency queries. Because of this, Impala is an ideal engine for use with a data mart, since people working with data marts are mostly running read-only queries and not large scale writes.

Impala also has a very efficient run-time execution framework, using code generation, process-to-process communication, massive parallelism, and metadata caching. Because of this, Impala is also great when working with ad-hoc queries, like when exploring by iteratively digging into data. You’ll want to change your query over and over again, at a moment’s notice, and have very fast response times so you’re not waiting forever for each iteration.

Hive LLAP was designed for sophistication.

Hive LLAP has many sophisticated capabilities that may make it a little harder for developers to get started and use effectively. In Hive LLAP, sometimes a query takes longer to go through the planning and ramp-up for execution. However, Hive is designed to be very fault-tolerant. If a fragment of a long-running query fails, Hive will reassign it and try again. Hive caches data files as well as query results, with sophisticated algorithms, meaning more frequently requested data stays cached with LLAP. Hive LLAP supports query federation, by allowing queries to run across multiple components and databases. Therefore, Hive LLAP makes up for any “slow start” in EDW use cases as it is much more robust, and has greater performance, in the long run.

Because of this sophistication and flexibility, Hive LLAP is better suited for enterprise data warehouse, or EDW, use cases. With an EDW, you are supporting Business Intelligence reports and dashboards, dependent data marts, other enterprise applications, external systems, and more. These workloads are often taking multiple dimensions into account, and as a result, EDWs often have to process more complex SQL requirements than data marts, with a greater need for complex data types, more scheduled queries, and query orchestration to populate data marts or generate regular data extracts.

Hive’s ability to more robustly handle longer running, more complex queries, on massive-scale data sets, make it often the better choice for these types of applications. In fast action ad-hoc queries, Hive LLAP’s start-up times may slow it down compared with Impala, yet with longer running queries, this start-up cost is a relatively inconsequential part of the total run time. Hive LLAP becomes a better choice for EDW also because of its fault tolerance (who wants a query to fail if you are waiting a long time for the result?) and better performance on more complex queries.

Using Impala and Hive LLAP

Impala	Hive LLAP
Data mart	Enterprise data warehouse
Good choice for interactive and ad-hoc analysis, especially with high concurrency self-service	Good choice for long-running queries requiring heavy transformations or multiple joins Good choice for interactive and ad-hoc analysis using features not available in Impala
Good choice for Business Intelligence tools that allow users to quickly change queries	Good choice for Dashboards that are pre-defined and not customizable by the viewer
Uses Parquet as the preferred file format	Uses ORC as the preferred file format Does better with JSON than Impala does

As massive data sets combine with growth of use cases, choosing the right Data Warehouse SQL Engine to get timely results makes all the difference.

Join us for Racing for Results! Data Warehouse – Impala vs. Hive LLAP, a lively debate among experts, on October 20, 2020, 10:00am US pacific time, 1:00pm US eastern time, complete with customer use case examples, and followed by a live q&a.

David Dichmann

Senior Director Product Management

More by this author

Editor's Choice

Business

Generative AI for the Enterprise

Technical

Building Trust in Public Sector AI Starts with Trusting Your Data

1 Comments

by Naveen Chigurupalli on Sep 10, 2021 @ 3:53 am EDT

It’s the most comprehensive and on to the point article to understand the difference between Impala and hive. This helps the Data Architects to optimal design the data model as per the use case requirements.