by George Corugedo

How to overcome the bottlenecks between data lakes and analytics for customer engagement

Opinion
May 24, 2018
Analytics | Business Intelligence

Create the complete view of the customer, and engage in real time through the best channels.


Competing in the business world today is all about knowing your customers. Consumers have come to expect personalized service and a satisfying experience, and anything less from the brands they interact with might cause them to take their business elsewhere.

Many organizations in a variety of industries struggle to access the customer data they need to provide personalized and contextual experiences across all touchpoints. Recently, data lakes have been touted as the best way to manage the variety of collected customer data, with many big data and analytics solutions focused on a self-service approach to leveraging the value of the data lake.

It is not enough to dump all customer data into a data lake, especially when analytics engines are only as good as the quality of the data they receive. According to a recent report by research firm Forrester Inc., only 25 percent of business and technology decision-makers report seeing increased revenue from their implementation of big data solutions. That means the large majority of companies are not effectively harnessing insights from their customer data to better serve and retain their customers, Forrester says.

What’s inhibiting the use of big data in customer engagement? Two bottlenecks stand between the data lake and using analytics for customer engagement. One is creating the golden record, the accurate and complete view of the customer, and the other is overcoming latency within the process – at the data, analytical and execution levels – to engage with the customer in real time through the correct channel or touchpoint.

Data lakes and the golden record

Let’s look at the challenge of creating the golden record first. The concept of the golden record goes beyond what has been called the 360-degree customer view, or the single view of the customer. It’s a much richer, more comprehensive collection of data from across the enterprise: everything that is knowable about the customer, brought into a central point of data control that is accessible throughout the organization.

A well-constructed golden record is vital because the quality of data-driven customer engagement is highly dependent on the quality (accuracy, completeness, relevance and timeliness) of the data itself. Incorrect data leads to irrelevant offers, redundant approaches, and out-of-sequence or out-of-date offers. The data lake is a great repository, but the very reasons organizations choose data lakes (broad variety of raw data, large volumes of data, ease of data capture, etc.) magnify the quality problems they face.

The solution is to apply a rigorous, well-thought-out data matching strategy that uses heuristic, probabilistic and machine learning approaches to master the data and create a persistent key structure, and then to interject automated processes that manage the data and produce appropriate “projections” for customer engagement. Adding automated processes streamlines the creation of the golden record, making it easier to maintain over time. This is what a customer data platform (CDP) should do, although not many of them are up to the task. True data quality requires things such as cross-source data extraction, name and address normalization, tuned deterministic and probabilistic matching, and as-needed human workflows for resolution, auditing and compliance. It’s not for the faint of heart, and most CDP providers will try to gloss over this level of complexity. Trust me, however: skip these steps and the resulting analytics will produce very inaccurate results.
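To make that concrete, here is a minimal sketch of the layered matching described above: a deterministic pass on normalized fields, a probabilistic fallback, and a persistent key minted only when no confident match exists. The field names, scoring method and threshold are illustrative assumptions, not any particular CDP’s implementation.

import uuid
from difflib import SequenceMatcher

def normalize(record):
    # Normalize name, email and postcode so deterministic keys compare cleanly.
    return {
        "name": " ".join(record.get("name", "").lower().split()),
        "email": record.get("email", "").strip().lower(),
        "postcode": record.get("postcode", "").replace(" ", "").upper(),
    }

def similarity(a, b):
    # Simple string similarity as a stand-in for probabilistic scoring.
    return SequenceMatcher(None, a, b).ratio()

def match_record(incoming, golden_records, threshold=0.85):
    # Return the persistent key of the best match, or mint a new one.
    norm = normalize(incoming)

    # Deterministic pass: exact match on a normalized natural key (email here).
    for key, existing in golden_records.items():
        if norm["email"] and norm["email"] == existing["email"]:
            return key

    # Probabilistic pass: fuzzy name match, with a small boost when the
    # postcode corroborates the candidate.
    best_key, best_score = None, 0.0
    for key, existing in golden_records.items():
        score = similarity(norm["name"], existing["name"])
        if norm["postcode"] and norm["postcode"] == existing["postcode"]:
            score += 0.05
        if score > best_score:
            best_key, best_score = key, score
    if best_score >= threshold:
        return best_key

    # No confident match: create a new persistent key for this customer.
    new_key = str(uuid.uuid4())
    golden_records[new_key] = norm
    return new_key

In practice the probabilistic pass would use trained models and blocking strategies rather than a brute-force string comparison, but the structure (normalize, match deterministically, fall back to scoring, persist the key) is the part that matters.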

For big data implementations, the CDP must natively handle all the source formats commonly found in data lakes, including NoSQL and document formats such as MongoDB, Avro and Parquet, across multiple big data environments. The CDP should also take advantage of distributed compute resources, which are often overlooked even though they are a primary value of NoSQL databases.
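As a rough illustration of what “natively handle” means in practice, the sketch below pulls customer records out of Parquet, Avro and MongoDB sources into one common shape. It assumes the pyarrow, fastavro and pymongo packages and uses illustrative paths and connection strings; a production CDP would do this work inside the cluster rather than on a single machine.

import pyarrow.parquet as pq
from fastavro import reader as avro_reader
from pymongo import MongoClient

def read_parquet_customers(path):
    # Columnar Parquet files load directly into an Arrow table.
    return pq.read_table(path).to_pylist()

def read_avro_customers(path):
    # Avro files carry their schema, so rows come back as plain dicts.
    with open(path, "rb") as fh:
        return list(avro_reader(fh))

def read_mongo_customers(uri, db_name, collection):
    # Document stores such as MongoDB expose records as dicts as well.
    client = MongoClient(uri)
    return list(client[db_name][collection].find({}, {"_id": 0}))

# Downstream matching code can then treat every source as a list of dicts.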

With a well-tuned customer data platform in a big data environment, an organization can handle the variety, velocity, and volume of data lake information to produce accurate customer profiles for analytics and engagement.

Facing down latency in the data lake

The other key challenge is overcoming latency, which shows up in a data lake/analytics process in three fundamental ways. The first is process initiation latency: data processing is intentionally decoupled from data arriving in the lake. That decoupling is part of data lake design, but it can also delay updates to customer information.

When “data warehouse thinking” is brought to the data lake, information might be treated as a collection of semi-static data, and updates will be performed at scheduled times, such as overnight, to produce analytics-ready information in the morning. That’s fine for time-independent processes such as producing reports. It’s not suitable for processes that are supposed to match the cadence of the customer.

For these engagement-focused, real-time processes, data needs to be updated more regularly. This requires timing modifications (for example, data-triggered processes) and architectural or model changes to handle update-driven changes to the data efficiently, rather than wholesale rebuilds of profiles.
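A minimal sketch of that update-driven pattern, assuming change events arrive with a persistent key already resolved: each event patches only the profile it touches and hands the result straight downstream, instead of waiting for a nightly rebuild. The event shape and function names are illustrative.

from datetime import datetime, timezone

profiles = {}  # persistent_key -> profile dict; stands in for the profile store

def apply_event(event):
    # Patch the one profile this event touches rather than rebuilding it.
    key = event["persistent_key"]
    profile = profiles.setdefault(key, {"persistent_key": key})
    profile.update(event["changed_fields"])
    profile["last_updated"] = datetime.now(timezone.utc).isoformat()
    return profile

def publish_for_engagement(profile):
    # Placeholder for the downstream hand-off (queue, API call, etc.).
    print("profile ready:", profile["persistent_key"])

def on_new_data(event):
    # Data-triggered entry point: runs when a record lands, not at 2 a.m.
    publish_for_engagement(apply_event(event))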

The second type of latency involves the CDP. The performance of the customer data platform or other software might add a direct time latency to the process. If the CDP can only handle a few thousand updates per hour, it’s not going to be pumping profile changes out fast enough to match customer cadence.

Many CDP vendors quote high performance figures for producing customer records, but don’t include complex matching or updates in the process. By not including matching, they only measure simple production of new records based on existing customer keys.

For medium-to-large businesses, performance generally needs to be in the millions of records per hour, with an actual input-to-output time for the full process in the two-to-five seconds range. Performance needs to meet this tight criterion because the upstream feed processes and downstream action processes themselves inject additional latency.
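One way to keep a vendor honest on both numbers is to measure throughput with matching included in the path and to track end-to-end latency per record, along the lines of the sketch below. The process_record stub and the figures it produces are placeholders, not benchmarks of any real product.

import time

def process_record(record):
    # Stand-in for the full normalize + match + update path being measured.
    time.sleep(0.0005)
    return record

def benchmark(records, latency_budget_s=5.0):
    latencies = []
    start = time.perf_counter()
    for record in records:
        t0 = time.perf_counter()
        process_record(record)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    per_hour = len(records) / elapsed * 3600
    worst = max(latencies)
    print(f"throughput: {per_hour:,.0f} records/hour")
    status = "within" if worst <= latency_budget_s else "over"
    print(f"worst input-to-output latency: {worst:.3f}s ({status} budget)")

benchmark([{"id": i} for i in range(2000)])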

The third type is analysis/orchestration latency. Once the customer profile is ready, there might also be downstream latency introduced architecturally, due to mediocre performance in the analytics or orchestration software, or due to limited human resources.

Typically, this is a problem of matching use case requirements to measured latency. For many analytics tasks, the downstream latency might be made worse by a dearth of capable analysts. For this type of latency, a combination of better data quality automation; better identification of timing/performance requirements; and careful design, measurement and tuning of software and processes will ensure the downstream process latency is small enough to meet project requirements.

Data lakes are designed for the world of varied data structures and cadences that the modern brand finds itself in. Deploying one is a good first step, but brands also need to understand that adding a data lake is only a step toward engaging the always-connected customer. They must also eliminate the latency barriers and streamline the process of creating the golden customer record. By accomplishing those two goals, brands can truly connect with the modern consumer and deliver the personalized experiences customers so desire.