Enabling Self-Service Business Insights with Cloudera Data Warehouse

Requests to Central IT for data warehousing services can take weeks or months to deliver. Central IT teams at large organizations face a proliferation of IT projects arising from the complexities of markets and from the needs of internal lines of business (LoBs). At the same time, Central IT must juggle cost and risk. In data-driven organizations, to fulfill its charter to democratize data and provide on-demand, quality computing services in a secure, compliant environment, IT must replace legacy approaches and update technologies. What is needed is a data-first, self-service replacement for these old systems.

Cloudera customers have described the data challenges they face. A large multinational pharmaceutical organization’s plan to bring a drug to market took over “12 years and 4.3 billion dollars.” To make that kind of investment, they need to determine:

  • Which drugs to focus on? 
  • Which drugs will have the largest patient population impact? 
  • Which demographic and pre-existing conditions should be targeted?
  • How likely are the clinical trials conducted for each disease to have valuable and reliable outcomes?
  • Which geographic areas (postal codes) are most suitable for conducting trials with available trained staff? 

This business depends on data and on making it available to its LoBs, which are researching which drugs can safely be brought to market. In addition to the significant upfront costs of bringing a drug to market, competition between pharmaceutical companies to choose and launch the most impactful drugs is fierce. The core business fails if data and compute services aren’t fast, reliable, and scalable.

Rigorous management processes often evolve into lengthy internal processes. Requests for IT resources for data and compute services can’t wait three to six months, which is how long the typical procurement cycle, machine configuration, and software installation take. Delays mean losing to the competition or missing the window for a perfect trial. Typical cries from LoBs sound something like this:

“Give us the resources quickly. Not in weeks but in hours or at max days.”

“I need to run 100 complex and concurrent queries over billions of rows, where can I get an environment to do that?”

“We have this new data set, actually it is sensor data. And we want to model it quickly with some historic customer usage data…and oh yeah, it should be about 100TB, per day.”

‘Shadow IT’ projects crop up in LoBs to overcome rigidity that stifles innovation and progress. These projects are neither visible to nor operated by Central IT.

As Central IT tries to reduce risks and costs, the shadow IT projects that spring up increase risks and drive up the cost. LoB users who run these projects often lack expertise in security and governance requirements. They often don’t realize that infrastructure for BI must be scalable and shared with external partners who need to collaborate on projects. 

How self-service data warehousing frees IT resources

Cloudera Data Warehouse (CDW) is a cloud service and an integral part of the newly released Cloudera Data Platform (CDP). Key features are:

  • Highly scalable and performant open-source engines for BI and data warehousing workloads
  • Modern architecture
    • Separation of compute from storage
    • Containerization
    • Hybrid and multi-cloud environments
  • Pay-as-you-go pricing model

The CDP Shared Data Experience (SDX) service underlying Cloudera Data Warehouse helps Central IT provide security and governance. SDX is also the key to serving multiple workloads on the same data. CDP supports Cloudera Machine Learning (CML) (see link below) and other compute options.

Simplified provisioning

Cloudera Data Warehouse can relieve Central IT of the work involved in getting LoB projects up and running. The simplicity and ease of provisioning data warehouses, illustrated below, make self-service possible.

Elastic architecture

Virtual Warehouses in CDW simplify capacity planning because you can scale up, shrink down, and auto-suspend warehouses as needed. There is no need to spend months analyzing projects for accurate capacity planning. Start small and grow as needed. Use maximum settings as guardrails to prevent runaway costs, a common mistake when moving to the cloud.
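
To make the guardrail idea concrete, here is a minimal Python sketch of a scaling policy with minimum and maximum node limits and an idle timeout for auto-suspend. It is not CDW’s API or its actual scaling algorithm. The class name, fields, and thresholds (ScalingPolicy, max_nodes, queries_per_node, idle_suspend_s) are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical guardrails for one virtual warehouse (not a CDW API)."""
    min_nodes: int = 1          # smallest size while the warehouse is running
    max_nodes: int = 10         # hard cap that prevents runaway cost
    queries_per_node: int = 4   # assumed concurrent queries one node can serve
    idle_suspend_s: int = 300   # suspend after this many idle seconds


def target_nodes(policy: ScalingPolicy, running_queries: int, idle_s: int) -> int:
    """Return the node count to run; 0 means the warehouse is suspended."""
    if running_queries == 0 and idle_s >= policy.idle_suspend_s:
        return 0  # auto-suspend: nothing is running and the idle timeout passed
    # Nodes needed to serve the current load, rounded up.
    needed = math.ceil(running_queries / policy.queries_per_node)
    # Clamp to the guardrails: never below min_nodes, never above max_nodes.
    return max(policy.min_nodes, min(policy.max_nodes, needed))


if __name__ == "__main__":
    policy = ScalingPolicy()
    print(target_nodes(policy, running_queries=0, idle_s=600))   # idle past timeout
    print(target_nodes(policy, running_queries=9, idle_s=0))     # sized to the load
    print(target_nodes(policy, running_queries=500, idle_s=0))   # held at the cap
```

The point of the sketch is the clamp in the last line of target_nodes: however large the load gets, the warehouse never grows beyond the maximum that was set up front.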

Contention elimination

In a multi-tenant environment, many users need to access the same data sources. Isolated compute resources make it easier to adhere to SLAs, control costs, and be agile. Experimental and production workloads access the same data without users impacting each other’s SLAs.

High performance

Cloudera Data Warehouse has two high-performance, massively parallel processing (MPP) query engines — Impala and Hive LLAP. These engines have a proven history of powering mission-critical data warehouses at the largest companies in the world.

Cloudera Data Platform Architecture

The Cloudera Data Platform architecture overcomes the barriers of cost, rigidity, and inflexibility. Most new architectures of this kind are based on the ability to:

  • Separate storage and compute. Isolates workloads, while permitting a shared “data lake.” Data can be served to the business with ease.
  • Use containerization. Delivers agility of provisioning and the ability to scale resources to the right size for each workload on-demand.
  • Deploy in the cloud. Deploy select workloads in the cloud, providing a “pay-as-you-go” model that gives you more control to rein in costs.
  • Combine public cloud with on-premises deployments. Offers a hybrid model that enables you to optimize for cost and investment. You can start projects out in the cloud where there is minimum overhead for procurement and configuration. When projects look like they are long-term, you have the option to bring them back “on-prem,” taking advantage of your on-premises data center investments.

For all the benefits of these architectures, there are drawbacks. It can be harder to maintain cross-environment visibility and traceability of data to satisfy compliance requirements. In addition, it can be challenging to keep tight control of costs and to know where your data resides so that you can serve your business well.

Centralized security and governance

Shared Data Experience (SDX), a shared persistent layer of access models, lineage and audit traces, and all metadata, is the key to the Cloudera data lake implementation. This data lake offers separate compute but shared storage across multiple environments.

Workload intelligence

CDP also includes Workload Manager, a cloud-native or on-premises tool for gaining insight into your workloads, optimizing them for efficiency, and identifying workloads that are “cloud-ready.”

Safely migrate data

CDP’s Replication Manager also helps quickly determine which workloads are suited for the cloud and helps you safely migrate data in optimal ways and control costs. 

How CDP and CDW solve IT problems

Using the CDW service in CDP, customers like the large pharmaceutical organization can address their needs for data on a self-serve basis instead of requesting the valuable time of IT professionals. CDW provides flexibility, cost-savings, and agility.

The challenges, before and after CDW:

FLEXIBILITY
  • Before CDW: Central IT provides a fixed-size cluster template for all LoB customers. Any change to this template requires massive time and effort, and any new template or modification to an existing template needs approval from top-level executives in the organization.
  • After CDW: With CDP, Central IT can build templates as small or as large as needed by a particular LoB and its use case. LoBs are enabled to use the tools and the engines best suited for their use cases. There is no need to carry engines or services that are not required. For example, if BI is the only use case, Central IT can provide just those engines. If the LoB needs a complex solution requiring components like Kafka + HBase + Hive + Spark + Impala, then Central IT can easily build it with the Data Hub service in CDP.

COST
  • Before CDW: Since the stock solution that Central IT offers cannot be amortized across multiple LoBs, it leads to higher costs for LoBs. When LoBs seek out shadow IT cloud-based solutions, Central IT feels powerless to rein them in.
  • After CDW: With CDP, it’s easy to amortize costs for shared data, schema, security, and governance with the underlying SDX service. You can eliminate the cost of copying data and syncing it across multiple LoB silos. And you can eliminate the cost of managing security and governance across those silos, too, even if they potentially have very different security and governance frameworks. For more details on SDX, see the link below. Also, you no longer waste CPU on things you don’t need for an LoB’s use case. Start with only the engine and services needed for a particular use case. The templates in CDP make things easier and faster. You can use the templates to kickstart LoB solutions, helping you get started in minutes rather than weeks.

AGILITY
  • Before CDW: To meet the fast-changing demands of the most successful LoBs and their use cases, Central IT tries to offer its stock solutions. Occasionally, the stock solutions do meet an LoB’s requirements, but then the requirements shift and Central IT and the LoB are at odds again. The darling LoB quickly becomes Central IT’s nemesis.
  • After CDW: A 100-node Cloudera Data Warehouse is capable of sustaining 80 queries per second, and you can provision it in 2.33 minutes. If the load requires more resources, auto-scaling kicks in within 20 seconds.

Summary

CDP empowers Central IT to meet the needs for flexibility, cost control, and agility. Our customers have told us that with their traditional solutions, a 10TB system instance could take weeks to provision. With CDP, one such instance can be provisioned in a matter of hours. Once up and running with CDP, a data mart instance of CDW can be up in seconds to minutes. Furthermore, our customers also tell us that with their traditional solutions, scaling up a 10TB instance by a factor of two (to 20TB) also takes weeks. With CDW, you can scale up the same 10TB instance to 20TB, automatically, in less than 20 seconds. Traditional solutions offer no ability to scale down. With CDW, CDP’s data mart service, auto-suspend and resume occur automatically, and auto-shrink is built into the product.

The shared data experience supports multiple types of workloads and avoids siloed point solutions. CDW and CDP are cloud-native, with options to run in public or private cloud and to safely migrate between them. Pay-as-you-go, auto-scale/shrink, and auto-suspend functionality enable you to control costs. You predict your needs and optimize “on-the-fly” so you can further control costs, no matter what the environment.
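
As a rough illustration of why auto-suspend matters under pay-as-you-go pricing, the Python sketch below compares an always-on warehouse with one that is billed only for its active hours. The hourly rate and usage figures are made-up assumptions, not Cloudera pricing, and the function names are hypothetical.

```python
def monthly_cost(nodes: int, rate_per_node_hour: float, hours: float) -> float:
    """Cost of running `nodes` nodes for `hours` hours at a flat hourly rate."""
    return nodes * rate_per_node_hour * hours


def savings_fraction(active_hours: float, total_hours: float) -> float:
    """Fraction of the always-on cost avoided when idle hours are not billed."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 1.0 - active_hours / total_hours


if __name__ == "__main__":
    # All numbers below are illustrative assumptions, not real prices.
    nodes, rate = 10, 1.0
    total_hours = 30 * 24      # a 30-day month, billed around the clock
    active_hours = 22 * 8      # 22 working days with 8 busy hours each
    always_on = monthly_cost(nodes, rate, total_hours)
    suspended = monthly_cost(nodes, rate, active_hours)
    print(f"always on: {always_on:.2f}, with auto-suspend: {suspended:.2f}")
    print(f"saved: {savings_fraction(active_hours, total_hours):.0%}")
```

Under these assumed figures the saving comes entirely from the idle hours, so the more intermittent an LoB’s workload, the more auto-suspend is worth.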

Cloudera sponsors open source, ensuring that you are not locked into proprietary file formats and do not lose control over the auditability of your own data. Open source excels in innovation and lets you contribute yourself if you want a specific feature. Choose Cloudera’s hybrid data platform for data engineering, data warehousing, and machine learning to successfully grow your business and help your Central IT operation succeed.

Related Links:

Bill Zhang
Senior Director Product Management, Data Warehousing