Choosing the right Data Warehouse SQL Engine: Apache Hive LLAP vs Apache Impala

Choosing the right Data Warehouse SQL Engine: Apache Hive LLAP vs Apache Impala

Aren’t two superheroes better than one?

Some of the most powerful results come from combining complementary superpowers, and the “dynamic duo” of Apache Hive LLAP and Apache Impala, both included in Cloudera Data Warehouse, is further evidence of this.  Both Impala and Hive can operate at an unprecedented and massive scale, with many petabytes of data. Both are 100% Open source, so you can avoid vendor lock-in while you use your favorite BI tools, and benefit from community-driven innovation.

Both Impala and Hive LLAP each sound like they will work great for my data warehouse use cases, so why do I really need to decide between the two?  The answer is simple, each has its own unique specialties, and depending on the type of analytics you want to do, you might find one is better suited than the other.  However, there is a secret I am keeping to the end of the blog, which makes the decision even easier for the user: so easy in fact, you do not even have to decide yourself.

Before I get into the differences between these SQL engines, it is important to note that both Impala and Hive LLAP share the same data and metadata (through the Hive Metastore) so not only can you switch from one to the other if you change your mind, you can even run different workloads using different engine choices on the same data, at the same time.  A true “best of both worlds” situation.

So, why choose?  Well, generally speaking, Impala works best when you are interacting with a data mart, which is typically a large dataset with a schema that is limited in scope. Meanwhile, Hive LLAP is a better choice for dealing with use cases across the broader scope of an enterprise data warehouse.  These use cases often involve multiple departments and a variety of downstream applications, both of which result in a wider array of query patterns.  We also see that Impala is a good choice for interactive, ad-hoc queries, especially if you have hundreds or thousands of users working on their own. 

You can also mix and match, using Impala for some queries and some tables, and Hive LLAP for other queries and other tables.

Impala was designed for speed. 

Written in C++, which is very CPU efficient, with a very fast query planner and metadata caching, Impala is optimized for low latency queries.  Because of this, Impala is an ideal engine for use with a data mart, since people working with data marts are mostly running read-only queries and not large scale writes.  

Impala also has a very efficient run-time execution framework, using code generation, process-to-process communication, massive parallelism, and metadata caching. Because of this, Impala is also great when working with ad-hoc queries, like when exploring by iteratively digging into data.  You’ll want to change your query over and over again, at a moment’s notice, and have very fast response times so you’re not waiting forever for each iteration.  

Hive LLAP was designed for sophistication.

Hive LLAP has many sophisticated capabilities that may make it a little harder for developers to get started and use effectively.  In Hive LLAP, sometimes a query takes longer to go through the planning and ramp-up for execution.  However, Hive is designed to be very fault-tolerant.  If a fragment of a long-running query fails, Hive will reassign it and try again. Hive caches data files as well as query results, with sophisticated algorithms, meaning more frequently requested data stays cached with LLAP.  Hive LLAP supports query federation, by allowing queries to run across multiple components and databases.  Therefore, Hive LLAP makes up for any “slow start” in EDW use cases as it is much more robust, and has greater performance, in the long run.

Because of this sophistication and flexibility, Hive LLAP is better suited for enterprise data warehouse, or EDW, use cases.  With an EDW, you are supporting Business Intelligence reports and dashboards, dependent data marts, other enterprise applications, external systems, and more. These workloads are often taking multiple dimensions into account, and as a result, EDWs often have to process more complex SQL requirements than data marts, with a greater need for complex data types, more scheduled queries, and query orchestration to populate data marts or generate regular data extracts.

Hive’s ability to more robustly handle longer running, more complex queries, on massive-scale data sets, make it often the better choice for these types of applications.  In fast action ad-hoc queries, Hive LLAP’s start-up times may slow it down compared with Impala, yet with longer running queries, this start-up cost is a relatively inconsequential part of the total run time.  Hive LLAP becomes a better choice for EDW also because of its fault tolerance (who wants a query to fail if you are waiting a long time for the result?) and better performance on more complex queries.

Using Impala and Hive LLAP

Impala Hive LLAP
Data mart Enterprise data warehouse
  • Good choice for interactive and ad-hoc analysis, especially with high concurrency self-service
  • Good choice for long-running queries requiring heavy transformations or multiple joins
  • Good choice for interactive and ad-hoc analysis using features not available in Impala
  • Good choice for Business Intelligence tools that allow users to quickly change queries
  • Good choice for Dashboards that are pre-defined and not customizable by the viewer
  • Uses Parquet as the preferred file format
  • Uses ORC as the preferred file format
  • Does better with JSON than Impala does

 

As massive data sets combine with growth of use cases, choosing the right Data Warehouse SQL Engine to get timely results makes all the difference.  

Join us for Racing for Results! Data Warehouse – Impala vs. Hive LLAP, a lively debate among experts, on October 20, 2020, 10:00am US pacific time, 1:00pm US eastern time, complete with customer use case examples, and followed by a live q&a.  

David Dichmann
Senior Director Product Management
More by this author

1 Comments

by Naveen Chigurupalli on

It’s the most comprehensive and on to the point article to understand the difference between Impala and hive. This helps the Data Architects to optimal design the data model as per the use case requirements.

Leave a comment

Your email address will not be published. Links are not permitted in comments.