Of Muffins and Machine Learning Models

Cloudera

In the case of CDP Public Cloud, this includes virtual networking constructs and the data lake as provided by a combination of a Cloudera Shared Data Experience (SDX) and the underlying cloud storage. Each project consists of a declarative series of steps or operations that define the data science workflow.

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. You can use either the AWS Glue Data Catalog (recommended) or a Hive catalog for Iceberg tables.

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

The utility for cloning and experimentation is available in the open-sourced GitHub repository. This solution only replicates metadata in the Data Catalog, not the actual underlying data. This ensures that the data lake will still be functional in another Region if Lake Formation has an availability issue.

Unlock data across organizational boundaries using Amazon DataZone – now generally available

AWS Big Data

An Amazon DataZone domain contains an associated business data catalog for search and discovery, a set of metadata definitions to decorate the data assets that are used for discovery purposes, and data projects with integrated analytics and ML tools for users and groups to consume and publish data assets.

Regeneron turns to IT to accelerate drug discovery

CIO Business Intelligence

The company’s multicloud infrastructure has since expanded to include Microsoft Azure for business applications and Google Cloud Platform to provide its scientists with a greater array of options for experimentation. Much of Regeneron’s data, of course, is confidential. That’s hard to do when you have 30 years of data.”

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

Terminology: Let’s first discuss some of the terminology used in this post. Research data lake on Amazon S3 – A data lake is a large, centralized repository that allows you to manage all your structured and unstructured data at any scale. This is where the tagging feature in Apache Iceberg comes in handy.