Navigating the Data Provider Jungle

Data Basics Hugo Schmitt

We speak a lot about the ways we can use data, transform it, and create powerful models based on advanced machine learning techniques, but we sometimes forget where the data comes from in the first place. In an organization, data is sourced either internally, like the information flowing out of a CRM database, or externally, as with any data provided by a third party.

There’s even a definition for it: “External data refers to any type of data that has been captured, processed, and provided from outside the company” (Krasikov, Eurich, and Legner, 2020). Whether they provide free data (or at least data that seems free), “open” data, or paid data, these third parties constitute the realm of data providers. Who are they? What exactly differentiates them? And who are the paid data providers, which are less familiar to a general public more inclined to use freely available data? The aim of this post is to answer these common questions.

Data Providers: Supplier or Vendor?

When data from internal sources is not sufficient for an analytics exercise, it is tempting to look over the fence and seek anything else that could help. Today, this external data is commonly found on the internet, and like the internet itself, sources of external data vary immensely. Let’s first try to define what a data provider is.

According to the OECD glossary, a data provider is “an organization which produces data or metadata.” It’s a good start, but to give it more depth, we could add that a data provider is an organization whose goal is to collect or create data and provide it in a more or less prepared, compiled, or transformed format. We commonly hear the terms “data vendors” or “data suppliers” to refer to data providers.

Among the family of data providers, we should distinguish between paid and unpaid services. When data is provided for free, there is a good chance that it is open data. Providers of open data are widely used and well known, but let’s make a quick stop to look at them.

Setting the Stage: Open Data Providers

When we refer to data providers, we usually have in mind organizations that sell data as a business, although many organizations, usually NGOs or government entities, provide data for free. For example, the WHO (World Health Organization) is a renowned source of open data. Sourcing data from there is free, and reusing it is generally permitted, although it is worth checking the terms attached to each dataset.

Free data doesn’t mean it’s “free” to collect and use. There is a big difference between open data and public data, such as the data you will find on any website. For instance, an online article from your local newspaper might be public, but don’t take it for granted that you can freely reuse and transform it. The whole debate on the legality of web scraping is a good demonstration of this: Scraping publicly accessible pages is often permissible, but beware of the way you use this data, because copyright might apply. This article from Tom Waterman gives more details if you’re interested in the topic.

To close the loop on open data providers, something else to know is that open data is very often used by data providers that sell their services for profit. In other words, by gathering and transforming data from diverse sources, including open data sources, a lot of data providers end up selling data… that they obtained for free!

Where Do Data Providers Source Their Data?

As we just mentioned, data providers can take advantage of open data to build the collections that they end up selling. That’s usually one slice of their sources. But data providers usually also hold data that people are ready to pay for, meaning data that you won’t be able to find easily online. Depending on the type of data provided, it can come from a mix of proprietary sources. One option is to buy data from the companies that generate it: for example, data providers specialized in retail market trends could buy data directly from retailers that are willing to sell it. Another option is for data providers to produce the data themselves by running surveys (customer satisfaction surveys are a common case) or by tracking certain online activities.

Data providers are usually very quiet about where they go to feed their data collections. This is understandable, since their business model rests on the fact that people cannot find the information elsewhere for free. For them, protecting the source is like protecting a secret recipe.

This secrecy raises questions about the legitimacy of the data they sell: If I don’t know the exact source, why should I trust this data? Worse, using data from unverified sources could expose my business to reputational risks and ethical scandals. However, there are a number of things that can be done to avoid falling into that trap. Selecting an accredited data provider could be one way, as the organization Data HQ advises.

Types of Data Providers

In the sphere of data providers, many categories could be imagined to try to come up with a clear classification. However, all actors sell data in very specific ways. To give a general overview and try to be as clear as possible, let us take the example of the financial industry to illustrate the different types. Data is an essential resource in this industry, as it is often used to make decisions that involve significant amounts of money.

Pure Data Providers

This first type of data provider represents the simplest way a data provider can operate: they sell the data they produce. Most of them are actually data providers as a side business. Their core activity generates data as a by-product, and instead of letting this data sit on their hard drives, they profit from selling it to other market actors.

In the financial sector, stock exchanges themselves produce and record millions of financial transactions every day. Many different companies are keen on getting this information, either to drive their own operations or to resell their conclusions on the financial situation. For instance, data from NASDAQ, an American stock exchange specialized in technology companies, is available for sale at any moment.

Aggregators

The other type of data provider, and the one most people have in mind when they refer to data providers, is the aggregator or consolidator. Aggregators don’t produce the data themselves; they consolidate different sources to obtain new information. They can knock on the door of a pure data provider to negotiate access to its data (sometimes exclusive access) and/or do their shopping at open data providers and other external data sources to enrich their databases. These actors mainly add value through the relevant consolidation, preparation, transformation, calculation, analytics, modeling, etc., which provides their customers with digested and refined sources of data.

Sometimes they sell the data after minimal formatting operations and quality checks, and sometimes they carry out extensive transformation and information processing. This depends on the step at which they decide to stop in the data transformation value chain.
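To make the "minimal formatting operations and quality checks" end of that spectrum more concrete, here is a small Python sketch of what consolidating two feeds could look like. Everything in it is hypothetical and chosen for illustration only: the `PriceRecord` structure, the `consolidate` function, the source labels, and the sample values do not describe how any particular provider actually works.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PriceRecord:
    """Hypothetical record shape, assumed for this illustration."""
    ticker: str
    date: str      # ISO date string, e.g. "2021-03-01"
    close: float
    source: str    # hypothetical label for where the record came from


def consolidate(*feeds):
    """Merge several feeds, keeping the first valid record per (ticker, date)."""
    merged = {}
    for feed in feeds:
        for record in feed:
            # Minimal quality check: drop records with a non-positive price.
            if record.close <= 0:
                continue
            # Minimal formatting: normalize tickers to upper case.
            cleaned = replace(record, ticker=record.ticker.upper())
            key = (cleaned.ticker, cleaned.date)
            # Feeds passed first take priority over feeds passed later.
            merged.setdefault(key, cleaned)
    return sorted(merged.values(), key=lambda r: (r.ticker, r.date))


# Made-up sample data standing in for a purchased feed and an open data feed.
exchange_feed = [
    PriceRecord("ABC", "2021-03-01", 10.5, "exchange"),
    PriceRecord("ABC", "2021-03-02", -1.0, "exchange"),   # fails the check
]
open_data_feed = [
    PriceRecord("abc", "2021-03-01", 10.4, "open_data"),  # duplicate key
    PriceRecord("abc", "2021-03-02", 10.8, "open_data"),
    PriceRecord("XYZ", "2021-03-01", 55.0, "open_data"),
]

consolidated = consolidate(exchange_feed, open_data_feed)
for row in consolidated:
    print(row.ticker, row.date, row.close, row.source)
```

In this sketch the purchased feed wins whenever both sources cover the same ticker and date, and the open data feed only fills the gaps. A real aggregator would have to make that priority decision explicitly, and it is one of the places where the added value (or the added risk) comes from.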

Morningstar is a good example of this type of data provider. The company started in the mid-1980s with the idea of making it easier for fund managers to get their hands on many different sources of financial information in the same place. Nowadays it provides a very wide range of data offerings, from basic market data to extensive analytical reports on certain industries.

We could go further along the data transformation value chain and distinguish a third type of data provider, but companies offering reports, market analysis, and other very sophisticated outputs offer something that goes beyond data. It becomes inaccurate to refer to these actors as “data providers,” as the core of their offering is insights and information, not data.

Emerging Type: Alternative Data Providers

Now that we can better distinguish between different regular types of data providers, we can have a look at a new player that more and more companies get to work with: alternative data providers. 

Alternative data is a source of data we use when regular sources of information are not sufficient to establish truth on a subject. Mainly used in the field of financial services, this type of data can be extremely varied and find its sources in very new ways of capturing data, including unstructured data like images. Certain companies specialize in alternative data to offer a solution to companies that could not rely on “mainstream” data alone.

As an example, the company Kayrros (a partner of Dataiku) captures and transforms raw data into relevant analytics for companies interested in alternative data: financial services companies as well as the oil and gas industry or public entities. Their platform is capable of leveraging data from many different sources: satellite images, anonymized geolocation data, social data, etc. For instance, an asset manager could cross-check information between regular sources and Kayrros’s insights to evaluate the environmental risks inherent in a certain asset.

While the road to open data is clearly paved with many available resources compiled on the internet, the jungle of paid data providers remains relatively obscure. Understanding the difference between pure data providers and aggregators, as well as the way they usually source their data, can help you make more informed choices when you need to shop for external data.
