Data Lake, Data Transformation and Information

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Optimization

Perform upserts in a data lake using Amazon Athena and Apache Iceberg

AWS Big Data

APRIL 27, 2023

Amazon Athena supports the MERGE command on Apache Iceberg tables, which allows you to perform inserts, updates, and deletes in your data lake at scale using familiar SQL statements that are compliant with ACID (Atomic, Consistent, Isolated, Durable).

Data Lake

Data Lake Snapshot Optimization Data Transformation

Build a data lake with Apache Flink on Amazon EMR

AWS Big Data

JANUARY 27, 2023

For example, the Flink FileSystem connector has FileSystemTableFactory to read/write data in Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), the Flink HBase connector has HBase2DynamicTableFactory to read/write data in HBase, and the Flink Kafka connector has KafkaDynamicTableFactory to read/write data in Kafka.

Data Lake

Data Lake Metadata Business Analysis Data-driven

Webinars

Leading the Development of Profitable and Sustainable Products

How To Get Promoted In Product Management

MORE WEBINARS

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.

Metadata

Metadata Data Lake Visualization Data Transformation

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

What is a Data Pipeline?

Jet Global

MAY 9, 2024

The key components of a data pipeline are typically: Data Sources : The origin of the data, such as a relational database , data warehouse, data lake , file, API, or other data store. This can include tasks such as data ingestion, cleansing, filtering, aggregation, or standardization.

Data Lake

Data Lake Data Warehouse Business Intelligence Machine Learning

Navigating the Chaos of Unruly Data: Solutions for Data Teams

DataKitchen

NOVEMBER 10, 2023

The Perilous State of Today’s Data Environments Data teams often navigate a labyrinth of chaos within their databases. Extrinsic Control Deficit: Many of these changes stem from tools and processes beyond the immediate control of the data team.

Data Quality

Data Quality Testing Data Lake Data Integration

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

AWS Big Data

NOVEMBER 29, 2023

dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by data warehouses (such as Amazon Redshift ) customers who are looking to keep their data transform logic separate from storage and engine.

Data Lake

Data Lake Management Metrics Data Warehouse

How the BMW Group analyses semiconductor demand with AWS Glue

AWS Big Data

APRIL 26, 2023

The manufacturers need to know BMW Group’s exact current and future semiconductor volume information, which will effectively help steer the available worldwide supply. To enable this use case, we used the BMW Group’s cloud-native data platform called the Cloud Data Hub.

Forecasting

Forecasting Manufacturing Data Lake Big Data

Straumann Group is transforming dentistry with data, AI

CIO Business Intelligence

FEBRUARY 16, 2023

“Digitizing was our first stake at the table in our data journey,” he says. That step, primarily undertaken by developers and data architects, established data governance and data integration. That step, primarily undertaken by developers and data architects, established data governance and data integration.

Unstructured Data

Unstructured Data Data Lake Prescriptive Analytics Digital Transformation

The Ten Standard Tools To Develop Data Pipelines In Microsoft Azure

DataKitchen

JULY 27, 2023

Here are a few examples that we have seen of how this can be done: Batch ETL with Azure Data Factory and Azure Databricks: In this pattern, Azure Data Factory is used to orchestrate and schedule batch ETL processes. Azure Blob Storage serves as the data lake to store raw data. Azure Machine Learning). So go ahead.

Machine Learning

Machine Learning Cost-Benefit Data Transformation Testing

How GamesKraft uses Amazon Redshift data sharing to support growing analytics workloads

AWS Big Data

NOVEMBER 13, 2023

Amazon Redshift is a fully managed data warehousing service that offers both provisioned and serverless options, making it more efficient to run and scale analytics without having to manage your data warehouse. Additionally, data is extracted from vendor APIs that includes data related to product, marketing, and customer experience.

Data Warehouse

Data Warehouse Data Lake Analytics Data Science

Data Mesh 101: How Data Mesh Helps Organizations Be Data-Driven and Achieve Velocity

Ontotext

FEBRUARY 12, 2024

Organizational culture is a key factor in determining the viability and success of a data mesh initiative. Empowering individual domains and integrating them as a cohesive whole requires a reevaluation of various aspects of the business, including how data and information are managed and shared.

Data-driven

Data-driven Data Lake Data Quality Business Objectives

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

To bring their customers the best deals and user experience, smava follows the modern data architecture principles with a data lake as a scalable, durable data store and purpose-built data stores for analytical processing and data consumption.

Data Lake

Data Lake Data Warehouse Data-driven B2B

Connecting the Data Lifecycle

Cloudera

NOVEMBER 29, 2021

Data transforms businesses. That’s where the data lifecycle comes into play. Managing data and its flow, from the edge to the cloud, is one of the most important tasks in the process of gaining data intelligence. . The firm also worked on creating a solid pipeline from the data warehouse to the data lake.

Data Lake

Data Lake Data Warehouse Data Architecture Reporting

Accelerate Your Data Mesh in the Cloud with Cloudera Data Engineering and Modak NabuTM

Cloudera

OCTOBER 11, 2021

Modak Nabu reliably curates datasets for any line of business and personas, from business analysts to data scientists. Customers using Modak Nabu with CDP today have deployed Data Lakes and. An executive dashboard shows the status of crawlers, pipelines, and data profiling. Modak Nabu TM and CDE’s Spark-on-Kubernetes.

Data Lake

Data Lake Cost-Benefit Data-driven Dashboards

Self-Serve Data Prep CAN Be Easy AND Sophisticated!

Smarten

AUGUST 4, 2023

If your team has easy-to-use tools and features, you are much more likely to experience the user adoption you want and to improve data literacy and data democratization across the organization. Machine learning capability determines the best techniques, and the best fit transformations for data so that the outcome is clear and concise.

Data Lake

Data Lake Machine Learning Data Integration Optimization

Create a modern data platform using the Data Build Tool (dbt) in the AWS Cloud

AWS Big Data

NOVEMBER 9, 2023

In this post, we delve into a case study for a retail use case, exploring how the Data Build Tool (dbt) was used effectively within an AWS environment to build a high-performing, efficient, and modern data platform. It does this by helping teams handle the T in ETL (extract, transform, and load) processes.

Data Warehouse

Data Warehouse Testing Data Quality Reporting

Reference guide to build inventory management and forecasting solutions on AWS

AWS Big Data

APRIL 11, 2023

By collecting data from store sensors using AWS IoT Core , ingesting it using AWS Lambda to Amazon Aurora Serverless , and transforming it using AWS Glue from a database to an Amazon Simple Storage Service (Amazon S3) data lake, retailers can gain deep insights into their inventory and customer behavior.

Forecasting

Forecasting Management IoT Data-driven

Accelerate analytics on Amazon OpenSearch Service with AWS Glue through its native connector

AWS Big Data

DECEMBER 21, 2023

As the volume and complexity of analytics workloads continue to grow, customers are looking for more efficient and cost-effective ways to ingest and analyse data. For more information, see Setting up networking for development for AWS Glue. This enables organizations to streamline data integration and analytics with OpenSearch Service.

Analytics

Analytics IT Data Lake Visualization

Connect your data for faster decisions with AWS

AWS Big Data

NOVEMBER 7, 2023

Second, organizations still need transformations like cleansing, deduplication, and combining datasets for analysis and machine learning (ML). For these, AWS Glue provides fast, scalable data transformation. Prior to his current role, he was VP of Analytics at AWS, where he worked across the entire AWS database portfolio.

Dashboards

Dashboards Data-driven Data Integration Data Lake

Introducing the AWS ProServe Hadoop Migration Delivery Kit TCO tool

AWS Big Data

FEBRUARY 6, 2023

Finally, we show you the technical information to use the tool. Use case overview Migrating Hadoop workloads to Amazon EMR accelerates big data analytics modernization, increases productivity, and reduces operational cost. Refactoring coupled compute and storage to a decoupling architecture is a modern data solution.

Cost-Benefit

Cost-Benefit Data Lake Dashboards Big Data

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes. Choose Create stack.

Data Lake

Data Lake Dashboards Metrics Metadata

What is Data Mapping?

Jet Global

FEBRUARY 23, 2024

This field guide to data mapping will explore how data mapping connects volumes of data for enhanced decision-making. Why Data Mapping is Important Data mapping is a critical element of any data management initiative, such as data integration, data migration, data transformation, data warehousing, or automation.

Data Warehouse

Data Warehouse Reporting Data Transformation Sales

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

AWS Big Data

APRIL 25, 2024

In the era of data, organizations are increasingly using data lakes to store and analyze vast amounts of structured and unstructured data. Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making.

Optimization

Optimization Data Lake Cost-Benefit Reporting

Happy Birthday, CDP Public Cloud

Cloudera

OCTOBER 13, 2020

CDP Data Hub: a VM/Instance-based service that allows IT and developers to build custom business applications for a diverse set of use cases with secure, self-service access to enterprise data. . Predict – Data Engineering (Apache Spark). This is Now. New Services.

Data Warehouse

Data Warehouse Machine Learning Visualization Data Lake

Deep dive into the AWS ProServe Hadoop Migration Delivery Kit TCO tool

AWS Big Data

FEBRUARY 6, 2023

Then it retrieves the job history information (YARN logs from application managers) by calling the YARN ResourceManager application API. For more information on how to use the YARN log organizer, refer to the yarn-log-organizer GitHub repo. This information helps you understand who submits an application to a queue.

Dashboards

Dashboards Optimization Data Lake Cost-Benefit

How Tricentis unlocks insights across the software development lifecycle at speed and scale using Amazon Redshift

AWS Big Data

MARCH 3, 2023

Every event in the data source can be relevant, and our customers don’t tolerate data loss, poor data quality, or discrepancies between the source and Tricentis Analytics. While aggregating, summarizing, and aligning to a common information model, all transformations must not affect the integrity of data from its source.

Software

Software Data Lake Testing Cost-Benefit

Turning the page

Cloudera

JUNE 1, 2021

After all, we invented the whole idea of Big Data. Well, at Cloudera, we envision a world where everyone can quickly and easily access the data-powered information and insights they need – in just a few clicks. . The mission is to “Make data and analytics easy and accessible, for everyone.” 650-644-3900.

Uncertainty

Uncertainty Cost-Benefit Risk Strategy

Automate alerting and reporting for AWS Glue job resource usage

AWS Big Data

MAY 25, 2023

Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small and large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset.

Reporting

Reporting Metrics Optimization Data Lake

An AI Chat Bot Wrote This Blog Post …

DataKitchen

DECEMBER 9, 2022

By streamlining data-related workflows and enabling real-time collaboration, DataOps can help organizations to quickly turn data into insights, and to put those insights into action. This can include analyzing data to identify patterns and trends, and using predictive analytics to forecast future outcomes and guide decision making.

Machine Learning

Machine Learning Data-driven Optimization Modeling

How to use foundation models and trusted governance to manage AI workflow risk

IBM Big Data Hub

OCTOBER 16, 2023

Most of today’s largest foundation models, including the large language model (LLM) powering ChatGPT, have been trained on information culled from the internet. But how trustworthy is that training data? It helps you streamline data engineering with reduced data pipelines, simplified data transformation and enriched data.

Risk

Risk Modeling Management Metadata

Texas Rangers data transformation modernizes stadium operations

CIO Business Intelligence

OCTOBER 18, 2022

Resultant recommended a new, on-prem data infrastructure, complete with data lakes to provide stake holders with a better way to manage data reliability, accuracy, and timeliness. I think you have to toot your own horn that, yes, we have this information available.”. They want that information,” she says.

Data Transformation

Data Transformation Consulting Data Lake Reporting

Turnkey Cloud DataOps: Solution from Alation and Accenture

Alation

MARCH 22, 2022

The table details are extracted from the IDF pipeline information, which then syncs details like column, table, business, and technical metadata. This produces end-to-end lineage so business and technology users alike can understand the state of a data lake and/or lake house. Transparency is key.

Metadata

Metadata Cost-Benefit Data Quality Data Lake

Unlock scalable analytics with AWS Glue and Google BigQuery

AWS Big Data

OCTOBER 27, 2023

Efficiency : Data transformation tasks that previously took weeks or months can now be accomplished within minutes, optimizing efficiency. In this tutorial, we walk through the entire process, from setting up the connection to running the data transfer flow. For more information on AWS Glue, visit AWS Glue.

Analytics

Analytics Visualization Data Integration Cost-Benefit

Data Preparation and Data Mapping: The Glue Between Data Management and Data Governance to Accelerate Insights and Reduce Risks

erwin

JANUARY 11, 2019

Organizations have spent a lot of time and money trying to harmonize data across diverse platforms , including cleansing, uploading metadata, converting code, defining business glossaries, tracking data transformations and so on. Creating a High-Quality Data Pipeline.

Data Governance

Data Governance Risk Metadata Management

Showpad accelerates data maturity to unlock innovation using Amazon QuickSight

AWS Big Data

APRIL 5, 2023

The platform also provides analytics and insights to support successful information sharing and fuel continuous improvement. In 2021, Showpad decided to take the next step in its data evolution and set forth the vision to power innovation, product decisions, and customer engagement using data-driven insights.

Dashboards

Dashboards Reporting Cost-Benefit Visualization

Use fuzzy string matching to approximate duplicate records in Amazon Redshift

AWS Big Data

FEBRUARY 8, 2023

This approach doesn’t solve for data quality issues in source systems, and doesn’t remove the need to have a wholistic data quality strategy. For addressing data quality challenges in Amazon Simple Storage Service (Amazon S3) data lakes and data pipelines, AWS has announced AWS Glue Data Quality (preview).

Data Quality

Data Quality Testing Data Warehouse Unstructured Data

Migrate from Amazon Kinesis Data Analytics for SQL Applications to Amazon Kinesis Data Analytics Studio

AWS Big Data

JUNE 29, 2023

In our solution, we create a notebook to access automotive sensor data, enrich the data, and send the enriched output from the Kinesis Data Analytics Studio notebook to an Amazon Kinesis Data Firehose delivery stream for delivery to an Amazon Simple Storage Service (Amazon S3) data lake.

Data Analytics

Data Analytics Analytics IoT Data Lake

Building Better Data Models to Unlock Next-Level Intelligence

Sisense

MAY 11, 2021

The reasons for this are simple: Before you can start analyzing data, huge datasets like data lakes must be modeled or transformed to be usable. According to a recent survey conducted by IDC , 43% of respondents were drawing intelligence from 10 to 30 data sources in 2020, with a jump to 64% in 2021!

Modeling

Modeling Big Data IoT Data Warehouse

Data platform trinity: Competitive or complementary?

IBM Big Data Hub

JANUARY 18, 2023

In another decade, the internet and mobile started the generate data of unforeseen volume, variety and velocity. It required a different data platform solution. Hence, Data Lake emerged, which handles unstructured and structured data with huge volume. Data lakehouse was created to solve these problems.

Data Lake

Data Lake Data Warehouse Data-driven Metadata

MLOps and DevOps: Why Data Makes It Different

O'Reilly on Data

OCTOBER 19, 2021

ML use cases rarely dictate the master data management solution, so the ML stack needs to integrate with existing data warehouses. You want to carefully manage consistency of data between training and predictions, as well as make sure that there’s no leakage of information when models are being trained and tested with historical data.

IT

IT Testing Experimentation Software

Exploring the AI and data capabilities of watsonx

IBM Big Data Hub

JULY 17, 2023

These encoder-only architecture models are fast and effective for many enterprise NLP tasks, such as classifying customer feedback and extracting information from large documents. While they require task-specific labeled data for fine tuning, they also offer clients the best cost performance trade-off for non-generative use cases.

Machine Learning

Machine Learning Data Warehouse Modeling Cost-Benefit

Fabrics, Meshes & Stacks, oh my! Q&A with Sanjeev Mohan

Alation

AUGUST 11, 2022

Today, the brightest minds in our industry are targeting the massive proliferation of data volumes and the accompanying but hard-to-find value locked within all that data. But there are only so many data engineers available in the market today; there’s a big skills shortage.

Metadata

Metadata Data Warehouse Data Quality Data Lake

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Perform upserts in a data lake using Amazon Athena and Apache Iceberg

Webinars

Trending Sources

Build a data lake with Apache Flink on Amazon EMR

Webinars

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

What is a Data Pipeline?

Navigating the Chaos of Unruly Data: Solutions for Data Teams

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

How the BMW Group analyses semiconductor demand with AWS Glue

Straumann Group is transforming dentistry with data, AI

The Ten Standard Tools To Develop Data Pipelines In Microsoft Azure

How GamesKraft uses Amazon Redshift data sharing to support growing analytics workloads

Data Mesh 101: How Data Mesh Helps Organizations Be Data-Driven and Achieve Velocity

How smava makes loans transparent and affordable using Amazon Redshift Serverless

Connecting the Data Lifecycle

Accelerate Your Data Mesh in the Cloud with Cloudera Data Engineering and Modak NabuTM

Self-Serve Data Prep CAN Be Easy AND Sophisticated!

Create a modern data platform using the Data Build Tool (dbt) in the AWS Cloud

Reference guide to build inventory management and forecasting solutions on AWS

Accelerate analytics on Amazon OpenSearch Service with AWS Glue through its native connector

Connect your data for faster decisions with AWS

Introducing the AWS ProServe Hadoop Migration Delivery Kit TCO tool

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

What is Data Mapping?

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

Happy Birthday, CDP Public Cloud

Deep dive into the AWS ProServe Hadoop Migration Delivery Kit TCO tool

How Tricentis unlocks insights across the software development lifecycle at speed and scale using Amazon Redshift

Turning the page

Automate alerting and reporting for AWS Glue job resource usage

An AI Chat Bot Wrote This Blog Post …

How to use foundation models and trusted governance to manage AI workflow risk

Texas Rangers data transformation modernizes stadium operations

Turnkey Cloud DataOps: Solution from Alation and Accenture

Unlock scalable analytics with AWS Glue and Google BigQuery

Data Preparation and Data Mapping: The Glue Between Data Management and Data Governance to Accelerate Insights and Reduce Risks

Showpad accelerates data maturity to unlock innovation using Amazon QuickSight

­­Use fuzzy string matching to approximate duplicate records in Amazon Redshift

Migrate from Amazon Kinesis Data Analytics for SQL Applications to Amazon Kinesis Data Analytics Studio

Building Better Data Models to Unlock Next-Level Intelligence

Data platform trinity: Competitive or complementary?

MLOps and DevOps: Why Data Makes It Different

Exploring the AI and data capabilities of watsonx

Fabrics, Meshes & Stacks, oh my! Q&A with Sanjeev Mohan

Stay Connected

Use fuzzy string matching to approximate duplicate records in Amazon Redshift