Big Data, Data Integration, Data Lake and Machine Learning

Big Data

Data Integration

Data Lake

Machine Learning

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Five steps to jumpstart your data integration journey

IBM Big Data Hub

JUNE 26, 2020

Organizations need to collect, organize, and analyze their data across multi-cloud, hybrid cloud, and data lakes. In turn, enterprises are increasingly looking for machine-learning-powered integration tools to synchronize data for analytics, improve employee productivity, and prepare data for analytics.

Data Integration

Data Integration Data Lake Machine Learning Enterprise

Join 52,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

The Product Manager’s Guide to Optimizing DX for Systemic Impact

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

How To Get Promoted In Product Management

MORE WEBINARS

Trending Sources

Talend Data Fabric Simplifies Data Life Cycle Management

David Menninger's Analyst Perspectives

NOVEMBER 16, 2021

Talend is a data integration and management software company that offers applications for cloud computing, big data integration, application integration, data quality and master data management.

Management

Management Data Warehouse Data Quality Data Integration

Webinars

The Product Manager’s Guide to Optimizing DX for Systemic Impact

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

How To Get Promoted In Product Management

MORE WEBINARS

Migrate data from Azure Blob Storage to Amazon S3 using AWS Glue

AWS Big Data

OCTOBER 20, 2023

Today, we are pleased to announce new AWS Glue connectors for Azure Blob Storage and Azure Data Lake Storage that allow you to move data bi-directionally between Azure Blob Storage, Azure Data Lake Storage, and Amazon Simple Storage Service (Amazon S3). option("header","true").load("wasbs://yourblob@youraccountname.blob.core.windows.net/loadingtest-input/100mb")

Data Lake

Data Lake Big Data Consulting Data Warehouse

Unlocking the Potential of Machine Learning in a Data Lake

Data Virtualization

MARCH 27, 2019

With data becoming the brain food to the intelligence of every organization, regardless of size or sector, it has become crucial to harness this data to achieve the best results, make the most informed decisions and improve productivity. However, with.

Data Lake

Data Lake Machine Learning IT Data Integration

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

AWS Big Data

NOVEMBER 10, 2023

Amazon Redshift offers seamless integration with Apache Spark, allowing you to easily access your Redshift data on both Amazon Redshift provisioned clusters and Amazon Redshift Serverless. These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime. options(**read_config).option("query",

Data Processing

Data Processing Data Lake Data Warehouse Optimization

Automate schema evolution at scale with Apache Hudi in AWS Glue

AWS Big Data

FEBRUARY 7, 2023

This post focuses on such schema changes in file-based tables and shows how to automatically replicate the schema evolution of structured data from table formats in databases to the tables stored as files in cost-effective way. Apache Hudi supports ACID transactions and CRUD operations on a data lake. and save it.

Data Lake

Data Lake Testing Big Data Structured Data

AWS Glue Data Quality is Generally Available

AWS Big Data

JUNE 6, 2023

We are excited to announce the General Availability of AWS Glue Data Quality. Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. Brian Ross is a Senior Software Development Manager at AWS.

Data Quality

Data Quality Statistics Data Lake Visualization

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.

Metadata

Metadata Data Lake Visualization Data Transformation

Accelerate analytics on Amazon OpenSearch Service with AWS Glue through its native connector

AWS Big Data

DECEMBER 21, 2023

As the volume and complexity of analytics workloads continue to grow, customers are looking for more efficient and cost-effective ways to ingest and analyse data. AWS Glue provides both visual and code-based interfaces to make data integration effortless.

Analytics

Analytics IT Data Lake Visualization

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

However, enterprise data generated from siloed sources combined with the lack of a data integration strategy creates challenges for provisioning the data for generative AI applications. As part of the transformation, the objects need to be treated to ensure data privacy (for example, PII redaction).

Data Governance

Data Governance Unstructured Data Metadata Data Lake

With a zero-ETL approach, AWS is helping builders realize near-real-time analytics

AWS Big Data

JUNE 28, 2023

In case the data sources change, data engineers have to manually make changes in their code and deploy it again. Furthermore, the time required to build or change pipelines makes the data unfit for near-real-time use cases such as detecting fraudulent transactions, placing online ads, and tracking passenger train schedules.

Analytics

Analytics Data Warehouse Data Lake Data-driven

Automate the archive and purge data process for Amazon RDS for PostgreSQL using pg_partman, Amazon S3, and AWS Glue

AWS Big Data

AUGUST 22, 2023

This post proposes an automated solution by using AWS Glue for automating the PostgreSQL data archiving and restoration process, thereby streamlining the entire procedure. Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 14 AWS Certifications.

Data Processing

Data Processing Testing Data Lake Data Integration

Migrate data from Google Cloud Storage to Amazon S3 using AWS Glue

AWS Big Data

JULY 19, 2023

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning, and application development. Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team.

Big Data

Big Data Software Consulting Unstructured Data

Extract data from SAP ERP using AWS Glue and the SAP SDK

AWS Big Data

FEBRUARY 8, 2023

Vyaire developed a custom data integration platform, iDataHub, powered by AWS services such as AWS Glue , AWS Lambda , and Amazon API Gateway. In this post, we share how we extracted data from SAP ERP using AWS Glue and the SAP SDK. Prahalathan M is the Data Integration Architect at Vyaire Medical Inc.

Testing

Testing Data Integration Data Lake Enterprise

Breaking down Business Intelligence

BizAcuity

MAY 16, 2022

To stay relevant in the market and to increase brand awareness, organizations use big data analytics and business intelligence to navigate their way after getting a full understanding of their ideal customers and their behavior before and during the buying journey. So, make sure you have a data strategy in place. Data mining.

Business Intelligence

Business Intelligence Data mining Visualization Data Lake

Handle UPSERT data operations using open-source Delta Lake and AWS Glue

AWS Big Data

JANUARY 30, 2023

Many customers need an ACID transaction (atomic, consistent, isolated, durable) data lake that can log change data capture (CDC) from operational data sources. There is also demand for merging real-time data into batch data. Delta Lake framework provides these two capabilities. option("header",True).schema(schema).load("s3://"+

Insurance

Insurance Data Lake Data-driven Management

Five benefits of a data catalog

IBM Big Data Hub

DECEMBER 16, 2022

For example, data catalogs have evolved to deliver governance capabilities like managing data quality and data privacy and compliance. It uses metadata and data management tools to organize all data assets within your organization. After all, Alex may not be aware of all the data available to her.

Metadata

Metadata Data Quality Data-driven Data Governance

Harmonize data using AWS Glue and AWS Lake Formation FindMatches ML to build a customer 360 view

AWS Big Data

JUNE 26, 2023

Companies are faced with the daunting task of ingesting all this data, cleansing it, and using it to provide outstanding customer experience. Typically, companies ingest data from multiple sources into their data lake to derive valuable insights from the data. This will open the ML transforms page.

Insurance

Insurance Visualization Data Lake Metrics

Themes and Conferences per Pacoid, Episode 8

Domino Data Lab

APRIL 3, 2019

It includes perspectives about current issues, themes, vendors, and products for data governance. My interest in data governance (DG) began with the recent industry surveys by O’Reilly Media about enterprise adoption of “ABC” (AI, Big Data, Cloud). Those days are long gone if they ever existed. the flywheel effect.

Data Governance

Data Governance Machine Learning Metadata Big Data

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Optimization

A Smart Approach to Logical Data Warehousing, with Azure Synapse and the Denodo Platform

Data Virtualization

SEPTEMBER 16, 2020

Organizations are leveraging cloud analytics to extract useful insights from big data, which draws from a variety of sources such as mobile phones, Internet of. Organizations all over the world are migrating their IT infrastructures and applications to the cloud.

Big Data

Big Data Internet of Things Analytics IT

Introducing Apache Hudi support with AWS Glue crawlers

AWS Big Data

NOVEMBER 22, 2023

Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance. Under Administration , choose Data catalog settings.

Data Lake

Data Lake Snapshot Metadata Optimization

The Ten Standard Tools To Develop Data Pipelines In Microsoft Azure

DataKitchen

JULY 27, 2023

Let’s go through the ten Azure data pipeline tools Azure Data Factory : This cloud-based data integration service allows you to create data-driven workflows for orchestrating and automating data movement and transformation. You can use it for big data analytics and machine learning workloads.

Machine Learning

Machine Learning Cost-Benefit Data Transformation Testing

Prepare and load Amazon S3 data into Teradata using AWS Glue through its native connector for Teradata Vantage

AWS Big Data

NOVEMBER 30, 2023

In this post, we explore how to use the AWS Glue native connector for Teradata Vantage to streamline data integrations and unlock the full potential of your data. Businesses often rely on Amazon Simple Storage Service (Amazon S3) for storing large amounts of data from various data sources in a cost-effective and secure manner.

IT Visualization Machine Learning Data Integration

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

AWS Big Data

FEBRUARY 22, 2023

In this post, we show how Ruparupa implemented an incrementally updated data lake to get insights into their business using Amazon Simple Storage Service (Amazon S3), AWS Glue , Apache Hudi , and Amazon QuickSight. An AWS Glue ETL job, using the Apache Hudi connector, updates the S3 data lake hourly with incremental data.

Data Lake

Data Lake Dashboards Cost-Benefit Metadata

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

This cloud service was a significant leap from the traditional data warehousing solutions, which were expensive, not elastic, and required significant expertise to tune and operate. Use one click to access your data lake tables using auto-mounted AWS Glue data catalogs on Amazon Redshift for a simplified experience.

Data Warehouse

Data Warehouse Data Lake Analytics Machine Learning

Automatically detect Personally Identifiable Information in Amazon Redshift using AWS Glue

AWS Big Data

DECEMBER 15, 2023

Even after identification, it’s cumbersome to implement redaction, masking, or encryption of sensitive data at scale. In this post, we provide an automated solution to detect PII data in Amazon Redshift using AWS Glue. For our solution, we use Amazon Redshift to store the data.

Data Lake

Data Lake Data Warehouse Big Data Structured Data

How GamesKraft uses Amazon Redshift data sharing to support growing analytics workloads

AWS Big Data

NOVEMBER 13, 2023

Amazon Redshift is a fully managed data warehousing service that offers both provisioned and serverless options, making it more efficient to run and scale analytics without having to manage your data warehouse. Additionally, data is extracted from vendor APIs that includes data related to product, marketing, and customer experience.

Data Warehouse

Data Warehouse Data Lake Analytics Data Science

Improve healthcare services through patient 360: A zero-ETL approach to enable near real-time data analytics

AWS Big Data

MARCH 27, 2024

AWS has invested in a zero-ETL (extract, transform, and load) future so that builders can focus more on creating value from data, instead of having to spend time preparing data for analysis. This means you no longer have to create an external schema in Amazon Redshift to use the data lake tables cataloged in the Data Catalog.

Data Analytics

Data Analytics Analytics Data Warehouse Data Lake

Scale your AWS Glue for Apache Spark jobs with new larger worker types G.4X and G.8X

AWS Big Data

MAY 9, 2023

Hundreds of thousands of customers use AWS Glue , a serverless data integration service, to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue for Apache Spark jobs work with your code and configuration of the number of data processing units (DPU).

Data Lake

Data Lake Cost-Benefit Data Integration Data Transformation

Compose your ETL jobs for MongoDB Atlas with AWS Glue

AWS Big Data

MAY 3, 2023

In today’s data-driven business environment, organizations face the challenge of efficiently preparing and transforming large amounts of data for analytics and data science purposes. Businesses need to build data warehouses and data lakes based on operational data.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Unlock scalable analytics with AWS Glue and Google BigQuery

AWS Big Data

OCTOBER 27, 2023

Data integration is the foundation of robust data analytics. It encompasses the discovery, preparation, and composition of data from diverse sources. In the modern data landscape, accessing, integrating, and transforming data from diverse sources is a vital process for data-driven decision-making.

Analytics

Analytics Visualization Data Integration Cost-Benefit

It’s not your data. It’s how you use it. Unlock the power of data & build foundations of a data driven organisation

CIO Business Intelligence

MAY 24, 2022

Data analytics, applied effectively, can provide extremely valuable guidance to identify trends and inform business decision making, but the data has to be accessible to these data analytics tools if they are to deliver actionable insights. analyse the data, using business intelligence, visualisation or data science tools.

Data-driven

Data-driven Data Lake Data Warehouse Cost-Benefit

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization. For this, Cargotec built an Amazon Simple Storage Service (Amazon S3) data lake and cataloged the data assets in AWS Glue Data Catalog.

Metadata

Metadata Data Lake Machine Learning Big Data

Using Synapse Services with Dynamics? These Tools Make it Easier

Jet Global

MAY 27, 2022

Synapse services are powerful tools for bringing data together for analytics, machine learning, reporting needs, and more. How Synapse works with Data Lakes and Warehouses. Synapse services, data lakes, and data warehouses are often discussed together. Streamline Data with Atlas.

Data Lake

Data Lake IT Recreation/Entertainment Data Warehouse

Unlocking the value of data as your differentiator

AWS Big Data

NOVEMBER 29, 2023

You also need services to store data for analysis and machine learning (ML) like Amazon Simple Storage Service (Amazon S3). Customers have created hundreds of thousands of data lakes on Amazon S3. We also announced that Amazon Q can help you create data integration pipelines using natural language.

Data Warehouse

Data Warehouse Data Lake Data Integration Dashboards

Introducing MongoDB Atlas metadata collection with AWS Glue crawlers

AWS Big Data

FEBRUARY 6, 2023

For data lake customers who need to discover petabytes of data, AWS Glue crawlers are a popular way to discover and catalog data in the background. This allows users to search and find relevant data from multiple data sources. Sandeep Adwankar is a Senior Technical Product Manager at AWS.

Metadata

Metadata Data Lake Machine Learning Management

Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog

AWS Big Data

JUNE 6, 2023

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning (ML), and application development. Hundreds of thousands of customers use data lakes for analytics and ML to make data-driven business decisions.

Data Quality

Data Quality Data Lake Data-driven Metrics

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

AWS Big Data

FEBRUARY 1, 2023

With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes.

Optimization

Optimization Forecasting Data Lake Metadata

Introducing AWS Glue serverless Spark UI for better monitoring and troubleshooting

AWS Big Data

NOVEMBER 20, 2023

In AWS, hundreds of thousands of customers use AWS Glue , a serverless data integration service, to discover, combine, and prepare data for analytics and machine learning. The sample job reads data from MySQL database and writes to S3 in Parquet format. He works based in Tokyo, Japan.

Visualization

Visualization Optimization Data Lake Management

What is Data Pipeline? A Detailed Explanation

Smart Data Collective

OCTOBER 17, 2022

Big data is shaping our world in countless ways. Data powers everything we do. Exactly why, the systems have to ensure adequate, accurate and most importantly, consistent data flow between different systems. A point of data entry in a given pipeline. Data Pipeline: Use Cases. Destination.

Data Warehouse

Data Warehouse Data Lake Visualization Big Data

Augmented data management: Data fabric versus data mesh

IBM Big Data Hub

APRIL 27, 2022

The data fabric architectural approach can simplify data access in an organization and facilitate self-service data consumption at scale. Read: The first capability of a data fabric is a semantic knowledge data catalog, but what are the other 5 core capabilities of a data fabric? What’s a data mesh?

Management

Management Metadata Data Architecture Data Lake

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

Data ingestion You have to build ingestion pipelines based on factors like types of data sources (on-premises data stores, files, SaaS applications, third-party data), and flow of data (unbounded streams or batch data). Data exploration Data exploration helps unearth inconsistencies, outliers, or errors.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

Five steps to jumpstart your data integration journey

Webinars

Trending Sources

Talend Data Fabric Simplifies Data Life Cycle Management

Webinars

Migrate data from Azure Blob Storage to Amazon S3 using AWS Glue

Unlocking the Potential of Machine Learning in a Data Lake

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

Automate schema evolution at scale with Apache Hudi in AWS Glue

AWS Glue Data Quality is Generally Available

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

Accelerate analytics on Amazon OpenSearch Service with AWS Glue through its native connector

Data governance in the age of generative AI

With a zero-ETL approach, AWS is helping builders realize near-real-time analytics

Automate the archive and purge data process for Amazon RDS for PostgreSQL using pg_partman, Amazon S3, and AWS Glue

Migrate data from Google Cloud Storage to Amazon S3 using AWS Glue

Extract data from SAP ERP using AWS Glue and the SAP SDK

Breaking down Business Intelligence

Handle UPSERT data operations using open-source Delta Lake and AWS Glue

Five benefits of a data catalog

Harmonize data using AWS Glue and AWS Lake Formation FindMatches ML to build a customer 360 view

Themes and Conferences per Pacoid, Episode 8

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

A Smart Approach to Logical Data Warehousing, with Azure Synapse and the Denodo Platform

Introducing Apache Hudi support with AWS Glue crawlers

The Ten Standard Tools To Develop Data Pipelines In Microsoft Azure

Prepare and load Amazon S3 data into Teradata using AWS Glue through its native connector for Teradata Vantage

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

Automatically detect Personally Identifiable Information in Amazon Redshift using AWS Glue

How GamesKraft uses Amazon Redshift data sharing to support growing analytics workloads

Improve healthcare services through patient 360: A zero-ETL approach to enable near real-time data analytics

Scale your AWS Glue for Apache Spark jobs with new larger worker types G.4X and G.8X

Compose your ETL jobs for MongoDB Atlas with AWS Glue

Unlock scalable analytics with AWS Glue and Google BigQuery

It’s not your data. It’s how you use it. Unlock the power of data & build foundations of a data driven organisation

How Cargotec uses metadata replication to enable cross-account data sharing

Using Synapse Services with Dynamics? These Tools Make it Easier

Unlocking the value of data as your differentiator

Introducing MongoDB Atlas metadata collection with AWS Glue crawlers

Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

Introducing AWS Glue serverless Spark UI for better monitoring and troubleshooting

What is Data Pipeline? A Detailed Explanation

Augmented data management: Data fabric versus data mesh

Create an end-to-end data strategy for Customer 360 on AWS

Stay Connected