Blog - Data Leaders Brief

Generic orchestration framework for data warehousing workloads using Amazon Redshift RSQL

AWS Big Data

APRIL 3, 2023

DynamoDB configuration table: The DynamoDB configuration table (rsql-blog-rsql-config-table) is the basic building block of this solution. It stores all the RSQL jobs, their restart information, the run mode (sequential or parallel), and the sequence in which the jobs are to be run. sh", "rsql_blog_script_2.sh"
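As a rough illustration of how such a configuration table could drive a run, the sketch below turns configuration records into an ordered list of batches. The attribute names (sequence, run_mode, scripts) and the script names are hypothetical and not taken from the original post; the records are plain dictionaries rather than real DynamoDB items.

```python
# Hypothetical stage records, standing in for items read from the
# configuration table. Attribute and script names are illustrative only.
EXAMPLE_STAGES = [
    {"sequence": 2, "run_mode": "sequential", "scripts": ["c.sh", "d.sh"]},
    {"sequence": 1, "run_mode": "parallel", "scripts": ["a.sh", "b.sh"]},
]


def plan_batches(stages, completed=frozenset()):
    """Return a list of batches; scripts in one batch may run together.

    Stages are processed in ascending `sequence` order. Scripts listed in
    `completed` are skipped, which mimics restarting after a partial run.
    """
    batches = []
    for stage in sorted(stages, key=lambda s: s["sequence"]):
        pending = [s for s in stage["scripts"] if s not in completed]
        if not pending:
            continue
        if stage["run_mode"] == "parallel":
            # All pending scripts of the stage form a single batch.
            batches.append(pending)
        elif stage["run_mode"] == "sequential":
            # Each script becomes its own batch, preserving list order.
            batches.extend([script] for script in pending)
        else:
            raise ValueError(f"unknown run_mode: {stage['run_mode']!r}")
    return batches


if __name__ == "__main__":
    print(plan_batches(EXAMPLE_STAGES))
    print(plan_batches(EXAMPLE_STAGES, completed={"a.sh", "c.sh"}))
```

Keeping the planning step separate from execution means the same records can be checked without touching a cluster; an actual orchestrator would hand each batch to whatever runs the RSQL scripts.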

Data Warehouse

Data Warehouse Testing Data Lake Data-driven

Migrate from Google BigQuery to Amazon Redshift using AWS Glue and Custom Auto Loader Framework

AWS Big Data

JUNE 2, 2023

Solution overview: The solution provides a scalable and managed data migration workflow to migrate data from Google BigQuery to Amazon Simple Storage Service (Amazon S3), and then from Amazon S3 to Amazon Redshift. This pre-built solution scales to load data in parallel using input parameters. Do not change the default.

Metadata

Metadata Data Warehouse Big Data Testing

Advanced patterns with AWS SDK for pandas on AWS Glue for Ray

AWS Big Data

JUNE 5, 2023

It allows easy integration and data movement between 22 types of data stores, including Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and Amazon OpenSearch Service. To illustrate these capabilities, we explored examples of writing Parquet files to Amazon S3 at scale and querying data in parallel with Athena.

Measurement

Measurement Management Interactive Analytics

Dynamic DAG generation with YAML and DAG Factory in Amazon MWAA

AWS Big Data

APRIL 22, 2024

It allows default customizations and is open-source, making it simple to create and customize new functionalities. Make sure the AWS Identity and Access Management (IAM) user or role used for setting up the environment has IAM policies attached for the following permissions: Read and write access to Amazon Simple Storage Service (Amazon S3).

Data-driven

Data-driven Management Cost-Benefit Testing

Optimizing Cloudera Data Engineering Autoscaling Performance

Cloudera

SEPTEMBER 2, 2021

ETL/analytics jobs arriving in waves and run periodically: A simple SparkPi job triggered every minute to have something that’s constantly running on the system; 3 jobs that are wrapped TPC-DS queries triggered every 5 minutes in parallel for stable load; and. To achieve this, a new virtual cluster with 200 r5d.4xlarge

Optimization

Optimization Testing Cost-Benefit Measurement

A Trick, a Tip and a Thing to Try in Your Next Presentation

Depict Data Studio

OCTOBER 26, 2021

In this blog post, you’ll learn from Elizabeth Dove. For example, a simple Venn diagram with two parts could be your visual framework, and it would be communicating that two things are being discussed as well as their critical overlapping region. Simple enough, right? It’s simple and apparent once you pick your visual framework.

Visualization

Visualization Experimentation Software Strategy

One Big Cluster Stuck: The Right Tool for the Right Job

Cloudera

JUNE 26, 2023

NiFi’s data provenance capability makes it simple to enhance, test, and trust data that is in motion. The post One Big Cluster Stuck: The Right Tool for the Right Job appeared first on Cloudera Blog. Visit our Data and IT Leaders page to learn more.

Testing

Testing Data Processing Visualization Data Science

Admission Control Architecture for Cloudera Data Platform

Cloudera

OCTOBER 8, 2021

Apache Impala is a massively parallel in-memory SQL engine supported by Cloudera, designed for analytics and ad hoc queries against data stored in Apache Hive, Apache HBase and Apache Kudu tables. This blog post will endeavour to: Explain Impala’s admission control mechanism. Query Parallelism. Introduction. Memory Consumption.

Data Processing

Data Processing Statistics Risk Optimization

A Practitioner’s Guide to Deep Learning with Ludwig

Domino Data Lab

JULY 10, 2019

This blog post considers Ludwig, offering a brief overview of the package and providing tips for practitioners such as when to use Ludwig’s command-line syntax and when to use its Python API. This blog also provides code examples with a Jupyter notebook that you can download or run via hosting provided by Domino.

Deep Learning

Deep Learning Visualization Recreation/Entertainment Data Processing

What Is ‘Equity As Code,’ And How Can It Eliminate AI Bias?

DataKitchen

JUNE 7, 2021

Figure 1: 4 simple projects to get started with DataOps. With version control, your team will be better able to reuse code, work in parallel and trace bugs back to source code changes. We explain each of these types of tests in our recent blog on impact view. Measure Your Process. Lower Error Rates.

Testing

Testing IT Measurement Data-driven

Snowflake: Data Ingestion Using Snowpipe and AWS Glue

BizAcuity

NOVEMBER 22, 2022

You can create 1 to 99 parallel threads. So, parallelism is not guaranteed. However, with AWS Glue, Snowflake customers now have a simple option to manage their programmatic data integration processes without worrying about servers, Spark clusters, or the ongoing maintenance traditionally associated with these systems.

Data Warehouse

Data Warehouse Cost-Benefit Data Lake Internet of Things

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

In this post, we delve into the ways in which you can use Amazon Athena connectors to efficiently query data files residing across Azure Data Lake Storage (ADLS) Gen2, Google Cloud Storage (GCS), and Amazon Simple Storage Service (Amazon S3). An Athena connector is an extension of the Athena query engine.

Data Lake

Data Lake Analytics Cost-Benefit Management

Successfully conduct a proof of concept in Amazon Redshift

AWS Big Data

MARCH 27, 2024

Amazon Redshift auto-copy The Amazon Redshift auto-copy (preview) feature can automate data ingestion from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift with a simple SQL command. Beyond the professional sphere, he finds joy in travel and shares his experiences through insightful blogging on social media platforms.

Testing

Testing Data Warehouse Metrics Cost-Benefit

Snowflake: Data Ingestion Using Snowpipe and AWS Glue

BizAcuity

APRIL 1, 2023

You can create 1 to 99 parallel threads. So, parallelism is not guaranteed. However, with AWS Glue, Snowflake customers now have a simple option to manage their programmatic data integration processes without worrying about servers, Spark clusters, or the ongoing maintenance traditionally associated with these systems.

Data Warehouse

Data Warehouse Cost-Benefit Data Lake Internet of Things

How Cloud-Based Contact Centers Conquer Call Overload

CDW Research Hub

FEBRUARY 15, 2021

A simple, cloud-based solution. Luckily, cloud-based contact center solutions can provide a new, more effective approach to handling increased call volumes like those due to COVID-19 vaccine distribution by allowing organizations to easily scale up capacity while reducing call time with one simple, customizable solution.

Cost-Benefit

Cost-Benefit Interactive Reporting Management

GraphDB Users Ask: How To Optimize My Inference?

Ontotext

NOVEMBER 9, 2021

If you need customization, our recommendation is to start simple and build up one rule at a time. Use the serial pipeline, because debugging doesn’t work with parallel inference. The best option is to ask the hard modeling questions first, and worry about optimizations later. Avoid recursion and duplications.

Optimization

Optimization Statistics Modeling Management

Choosing the right Data Warehouse SQL Engine: Apache Hive LLAP vs Apache Impala

Cloudera

SEPTEMBER 24, 2020

The answer is simple, each has its own unique specialties, and depending on the type of analytics you want to do, you might find one is better suited than the other. However, there is a secret I am keeping to the end of the blog, which makes the decision even easier for the user: so easy in fact, you do not even have to decide yourself.

Data Warehouse

Data Warehouse Metadata Interactive Dashboards

Now Available: Cloudera Data Science Workbench Release 1.2

Cloudera

NOVEMBER 16, 2017

Simple installation, configuration, and administration mean faster time to value and lower maintenance costs. For more detail on user monitoring, read this article on the Cloudera Engineering Blog. appeared first on Cloudera Blog. Improved sharing options. The post Now Available: Cloudera Data Science Workbench Release 1.2

Data Science

Data Science Deep Learning Big Data Software

Of Muffins and Machine Learning Models

Cloudera

FEBRUARY 16, 2022

In this simple example, we can see that modelName1 is associated with tables table1 and table2. Support for multiple sessions within a project allows data scientists, engineers and operations teams to work independently alongside each other on experimentation, pipeline development, deployment and monitoring activities in parallel.

Machine Learning

Machine Learning Modeling Metadata Recreation/Entertainment

Improve power utility operational efficiency using smart sensor data and Amazon QuickSight

AWS Big Data

MAY 16, 2023

This blog post is co-written with Steve Alexander at PG&E. To effectively solve smart sensor management issues and improve operational efficiency, distribution engineers need a BI application that is simple to use and has a powerful data processing and analytics engine.

Dashboards

Dashboards Statistics Data Collection Business Intelligence

Unlocking Data Storage: The Traditional Data Warehouse vs. Cloud Data Warehouse

Sisense

NOVEMBER 12, 2020

Cloud data warehouses took the benefits of the cloud and applied them to data warehouses, bringing massive parallel processing to data teams of all sizes. Scaling the warehouse as business analytics needs grow is as simple as clicking a few buttons (and in some cases, it is even automatic). with a cloud data warehouse is simple.

Data Warehouse

Data Warehouse Data Lake OLAP Data-driven

Simply Install: PostgreSQL

Insight

APRIL 4, 2019

Simply Install is a series of blogs covering installation instructions for simple tools related to data engineering. This blog covers basic steps to install and configure PostgreSQL (a popular relational database) and expose it as a service.

Data Processing

Data Processing Testing Publishing IT

Cloudera DataFlow’s key milestones and wins in 2020

Cloudera

FEBRUARY 17, 2021

The key highlight of the event was that it had 8 parallel tracks for different industry-specific sessions. While Eventador was already supporting cloud services for Kafka and Flink, one of its key products was SQLStream Builder, which enabled analysts and personas like those to access real-time streaming data with just simple SQL.

Metrics

Metrics Digital Transformation Manufacturing Technology

Using Transfer Learning for NLP with Small Data

Insight

MAY 7, 2019

Here is another great blog post on BERT by a former Insight Fellow. Flask is an easy to use web framework written in Python and very popular for building simple web applications and APIs. Celery is a Python-based framework used to run multiple tasks in parallel in the background and can be thought of as a job scheduler.

Modeling

Modeling Testing Machine Learning Interactive

Our quest for robust time series forecasting at scale

The Unofficial Google Data Science Blog

APRIL 17, 2017

Facebook in a recent blog post unveiled Prophet, which is also a regression-based forecasting tool. A simple solution is to forecast only at the bottom of the hierarchy and simply sum the forecasts to produce an overall parent forecast (and indeed forecasts all throughout the hierarchy). Why do simple methods perform well?
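The bottom-up idea mentioned in this excerpt can be sketched in a few lines. This is only an illustration under assumed inputs: the node names, the forecast values, and the dictionary-based hierarchy are invented for the example and do not come from the original post.

```python
def bottom_up(leaf_forecasts, hierarchy):
    """Aggregate leaf-level forecasts up a hierarchy by summation.

    leaf_forecasts: dict mapping leaf name -> list of forecast values
    hierarchy: dict mapping parent name -> list of child names
    Returns a dict with a forecast series for every node.
    """
    totals = dict(leaf_forecasts)

    def resolve(node):
        # Leaves (and parents already computed) are returned directly.
        if node in totals:
            return totals[node]
        child_series = [resolve(child) for child in hierarchy[node]]
        # Sum the children period by period.
        totals[node] = [sum(values) for values in zip(*child_series)]
        return totals[node]

    for parent in hierarchy:
        resolve(parent)
    return totals


# Hypothetical two-level hierarchy with made-up forecasts for two periods.
LEAVES = {"north": [10, 12], "south": [5, 7], "east": [1, 2]}
HIERARCHY = {"total": ["west_region", "east"], "west_region": ["north", "south"]}

if __name__ == "__main__":
    print(bottom_up(LEAVES, HIERARCHY))
```

Because every parent is just the sum of its children, the forecasts are consistent across levels by construction, which is the main appeal of the approach.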

Forecasting

Forecasting Modeling Statistics Uncertainty

How DataOps Kitchens Enable Version Control

DataKitchen

FEBRUARY 4, 2021

This blog builds on earlier posts that defined Kitchens and showed how they map to technical environments. In this blog, we’re going to focus on version control, i.e. the branch and merge component of continuous integration. We’ll cover the remaining elements of the development and deployment process in future blogs. Branch 1.1

Testing

Testing Software Management Data Analytics

Simply Install: Spark (Cluster Mode)

Insight

JUNE 2, 2019

Simply Install is a series of blogs covering installation instructions for simple tools related to data engineering. This blog covers basic steps to install and configure Apache Spark (a popular distributed computing framework) as a cluster on AWS using EC2 instances.

Testing

Testing Management Data Processing Interactive

Running Ray in Cloudera Machine Learning to Power Compute-Hungry LLMs

Cloudera

APRIL 27, 2023

In the blog we will cover how Ray can be used in Cloudera Machine Learning’s open-by-design architecture to bring fast distributed AI compute to CDP. Furthermore, Ray’s unique approach to parallelism, which focuses on fine-grained task scheduling, enables it to handle a wider range of workloads compared to Spark.

Machine Learning

Machine Learning Dashboards Deep Learning Testing

NVMe vs. SATA: What’s the difference?

IBM Big Data Hub

NOVEMBER 3, 2023

The reason is simple: Better storage technologies mean faster, higher-performing compute environments. NVMe (non-volatile memory express) is a protocol for highly parallel data transfer with reduced system overheads per input/output per second (I/O, or iops) that is used in flash storage and solid-state drives (SSDs). 2 NVMe drives M.2

Cost-Benefit

Cost-Benefit Enterprise Internet of Things Technology

NVIDIA RAPIDS in Cloudera Machine Learning

Cloudera

MAY 19, 2021

In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. Simple Exploration and Model. As with all Machine Learning problems, let’s start with a simple model. Some simple filtering out of columns with a lot of missing values.

Machine Learning

Machine Learning Data Science Data Lake Modeling

Spectrum Scale on IBM Cloud performance

IBM Big Data Hub

JUNE 27, 2023

In this blog post, we will discuss all pieces that come together to make this near-instant infrastructure a reality. Whenever possible, the provision steps are carried out in parallel. The controller Terraform illustrates how successfully we can parallelize the Terraform provisions.

Testing

Testing Data Processing Software Cost-Benefit

Can we identify 3-D images using very little training data?

Insight

OCTOBER 25, 2019

In this blog, I’ll illustrate an approach by walking you through my project during my Data Science Fellowship at Insight, followed by a quick discussion pertaining to broader application. Meta-learning algorithms often prefer simple and fast base learners (e.g.

Snapshot

Snapshot Testing Deep Learning Visualization

New Multithreading Model for Apache Impala

Cloudera

OCTOBER 20, 2020

Today we are introducing a new series of blog posts that will take a look at recent enhancements to Apache Impala. Two of the key tenets of Impala’s design philosophy are: Parallelism – for each part of query execution, run it in parallel on as many resources as possible. Introduction. But first, some context.

Modeling

Modeling Broadcasting Cost-Benefit Data Warehouse

Academic Research Done on Arc Diagrams

The Data Visualisation Catalogue

FEBRUARY 11, 2022

This paper explores the visualisation of American Football data through the use of two chart types: a Parallel Coordinates Plot for analysing season-long data and an Arc Diagram to visualise all of the plays (discrete actions) that occur in a single football game. Jankun-Kelly Topic: Data Visualisation.

Visualization

Visualization Interactive IT

Why Replicating HBase Data Using Replication Manager is the Best Choice

Cloudera

JULY 13, 2022

The service provides simple, easy-to-use, and feature-rich data movement capability to deliver data and metadata where it is needed, and has secure data backup and disaster recovery functionality. The job used a parallel factor of 100 and 1,800 YARN containers. In the use case, 125 nodes wrote approximately 70 TB of data in a day.

Snapshot

Snapshot Management Cost-Benefit Metadata

Sankey Diagrams, Parallel Sets & Alluvial Diagrams… What’s the Difference?

The Data Visualisation Catalogue

OCTOBER 18, 2021

For ages, the names Sankey Diagrams, Parallel Sets, and Alluvial Diagrams have been used interchangeably. But how do Sankey Diagrams, Parallel Sets and Alluvial Diagrams differ? Parallel Sets. Are these visualisations that different from one another and is it a bad thing that some misnaming is taking place?

Visualization

Visualization IT

The Promise & Challenge of Behavior Targeting (& Two Prerequisites)

Occam's Razor

JULY 31, 2007

Behavior targeting done with the right tool for you means that you can overcome the Scale, Data and Diversity problem by, simply put, automatically understanding your visitors as they interact with your web presence and showing them the most relevant content. God's answer to all problems you could have on your websites.

Testing

Testing Marketing Analytics Sales

5 signs you need a premium DNS service

IBM Big Data Hub

FEBRUARY 1, 2024

The need for resilience often leads growing businesses to adopt multiple DNS solutions in parallel. DNS is easier to understand when you only perform simple tasks with it. Learn more about IBM NS1 Connect The post 5 signs you need a premium DNS service appeared first on IBM Blog.

Enterprise

Enterprise Optimization Data Processing Measurement

Using Apache Solr REST API in CDP Public Cloud

Cloudera

OCTOBER 27, 2022

Information in this blog post can be useful for engineers developing Apache Solr client applications. Based on these credentials Knox will forward the requests to Solr servers in round-robin, using Kerberos and Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO) on behalf of the authenticated end user. See Figure 1).

Data Processing

Data Processing Testing Measurement Visualization

IBM Watson Orchestrate: Unlocking new levels of productivity for every employee

IBM Big Data Hub

MAY 9, 2023

It can be as simple as adding a row to Excel or as complex as onboarding a new employee with the many tasks involved: collecting I-9 information, ordering a new computer and even setting up a welcome meeting with the team. Watson can now work multiple requests in parallel. To truly transform work, we need tools that can keep up.

Statistics

Statistics Enterprise Technology Interactive

8 Modeling Tools to Build Complex Algorithms

Domino Data Lab

AUGUST 9, 2021

Ray is an open-source library framework that offers a simple API for scaling applications from a single computer to large clusters. It offers an uncomplicated user interface for multiple programming clusters with parallel data. The post 8 Modeling Tools to Build Complex Algorithms appeared first on Data Science Blog by Domino.

Modeling

Modeling Deep Learning Machine Learning Statistics

Selecting a Chart Based on the Number of Variables

The Data Visualisation Catalogue

SEPTEMBER 28, 2022

However, deciding on what chart to use isn’t as simple as identifying which chart can display the right number of variables, but it can be a helpful starting point. Parallel Coordinates Plot. The post Selecting a Chart Based on the Number of Variables appeared first on The Data Visualisation Catalogue Blog. Source: Wikipedia.

Visualization

Visualization IT Modeling

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

To ingest data from operational databases to an Amazon Simple Storage Service (Amazon S3) staging bucket of the data lake, either AWS Database Migration Service (AWS DMS) or any AWS partner solution from AWS Marketplace that has support for change data capture (CDC) can fulfill the requirement.

Data Lake

Data Lake Data Processing Metadata Snapshot

Towards optimal experimentation in online systems

The Unofficial Google Data Science Blog

APRIL 23, 2024

This blog post discusses such a comprehensive approach that is used at YouTube. Experiments, Parameters and Models: At YouTube, the relationships between system parameters and metrics often seem simple, and straight-line models sometimes fit our data well. And we can keep repeating this approach, relying on intuition and luck.

Experimentation

Experimentation Optimization Uncertainty Metrics
