AWS Big Data Blog

Perform time series forecasting using Amazon Redshift ML and Amazon Forecast

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.

Many businesses use different software tools to analyze historical data and past patterns to forecast future demand and trends to make more accurate financial, marketing, and operational decisions. Forecasting acts as a planning tool to help enterprises prepare for the uncertainty that can occur in the future.

Amazon Redshift ML makes it easy for data analysts and database developers to create, train, and apply machine learning (ML) models using familiar SQL commands in Amazon Redshift.

With Redshift ML, you can take advantage of Amazon SageMaker, a fully managed ML service, without learning new tools or languages. Simply use SQL statements to create and train SageMaker ML models using your Redshift data and then use these models to make predictions. For more information on how to use Redshift ML, refer to Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

With Redshift ML, you can now use Amazon Forecast, an ML-based time series forecasting service, without learning any new tools or having to create pipelines to move your data. You can use SQL statements to create and train forecasting models from your time series data in Amazon Redshift and use these models to generate forecasts about revenue, inventory, resource usage, or demand forecasting in your queries and reports.

For example, businesses use forecasting to do the following:

  • Use resources more efficiently
  • Time the launch of new products or services
  • Estimate recurring costs
  • Predict future events like sales volumes and earnings

In this post, we demonstrate how you can create forecasting models using Redshift ML and generate future forecasts using simple SQL commands.

When you use forecasting in Amazon Redshift, Redshift ML uses Forecast to train the forecasting model and to generate forecasts. You pay only the associated Forecast costs. There are no additional costs associated with Amazon Redshift for creating or using Forecast models to generate predictions. View Amazon Forecast pricing for details.

Solution overview

Amazon Forecast is a fully managed time series forecasting service based on machine learning. Forecast uses different ML algorithms to perform complex ML tasks for your datasets. Using historical data, Forecast automatically trains multiple algorithms and produces a forecasting model, also known as a predictor. Amazon Redshift provides a simple SQL command to create forecasting models. It seamlessly integrates with Forecast to create a dataset, predictor, and forecast automatically without you worrying about any of these steps. Redshift ML supports target time series data and related time series data.

As the following diagram demonstrates, Amazon Redshift will call Forecast, and data needed for Forecast model creation and training will be pushed from Amazon Redshift to Forecast through Amazon Simple Storage Service (Amazon S3). When the model is ready, it can be accessed using SQL from within Amazon Redshift using any business intelligence (BI) tool. In our case, we use Amazon Redshift Query Editor v2.0 to create forecast tables and visualize the data.

To show this capability, we demonstrate two use cases:

  • Forecast electricity consumption by customer
  • Predict bike sharing rentals

What is time series data?

Time series data is any dataset that collects information at various time intervals. This data is distinct because it orders data points by time. Time series data is plottable on a line graph and such time series graphs are valuable tools for visualizing the data. Data scientists use them to identify forecasting data characteristics.

Time series forecasting is a data science technique that uses machine learning and other computer technologies to study past observations and predict future values of time series data.

Prerequisites

Complete the following prerequisites before starting:

  1. Make sure you have an Amazon Redshift Serverless endpoint or a Redshift cluster.
  2. Have access to Amazon Redshift Query Editor v2.
  3. On the Amazon S3 console, create an S3 bucket that Redshift ML uses for uploading the training data that Forecast uses to train the model.
  4. Create an AWS Identity and Access Management (IAM role). For more information, refer to Creating an IAM role as the default.

Although it’s easy to get started with AmazonS3FullAccess, AmazonForecastFullAccess, AmazonRedshiftAllCommandsFullAccess, and AmazonSageMakerFullAccess, we recommend using the minimal policy that we have provided (if you already have an existing IAM role, just add it to that role). If you need to use AWS Key Management Service (AWS KMS) or VPC routing, refer to Cluster and configure setup for Amazon Redshift ML administration.

To use Forecast, you need to have the AmazonForecastFullAccess policy. For more restrictive IAM permissions, you can use the following IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "forecast:DescribeDataset",
                "forecast:DescribeDatasetGroup",
                "forecast:DescribeAutoPredictor",
                "forecast:CreateDatasetImportJob",
                "forecast:CreateForecast",
                "forecast:DescribeForecast",
                "forecast:DescribeForecastExportJob",
                "forecast:CreateMonitor",
                "forecast:CreateForecastExportJob",
                "forecast:CreateAutoPredictor",
                "forecast:DescribeDatasetImportJob",
                "forecast:CreateDatasetGroup",
                "forecast:CreateDataset",
                "forecast:TagResource",
                "forecast:UpdateDatasetGroup"
            ],
            "Resource": "*"
        } ,
		{
			"Effect": "Allow",
			"Action": [
				"iam:PassRole"
			],
			"Resource":"arn:aws:iam::<aws_account_id>:role/service-role/<Amazon_Redshift_cluster_iam_role_name>"
		}
    ]
}

To allow Amazon Redshift and Forecast to assume the role to interact with other services, add the following trust policy to the IAM role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "redshift.amazonaws.com",
           "redshift-serverless.amazonaws.com",
           "forecast.amazonaws.com",
           "sagemaker.amazonaws.com" 
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}[

Use case 1: Forecast electricity consumption

In our first use case, we demonstrate forecasting electricity consumption for individual households. Predicting or forecasting usage could help utility companies better manage their resources and keep them ahead on planning the distribution and supply. Typically, utility companies use software tools to perform the forecasting and perform a lot of steps to create the forecasting data. We show you how to use the data in your Redshift data warehouse to perform predictive analysis or create forecasting models.

For this post, we use a modified version of the individual household electric power consumption dataset. For more information, see ElectricityLoadDiagrams20112014 Data Set (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science).

Prepare the data

Refer to the following notebook for the steps needed to create this use case.

Using Query Editor V2, connect to your cluster and open a new notebook.

The data contains measurements of electric power consumption in different households for the year 2014. We aggregated the usage data hourly. Each row represents the total electricity usage for a given household at an hourly granularity.

For our use case, we use a subset of the source data’s attributes:

  • Usage_datetime – Electricity usage time
  • Consumptioninkw – Hourly electricity consumption data in kW
  • Customer_id – Household customer ID

Create the table electricity_consumption and load data using the COPY command:

CREATE TABLE electricity_consumption
(usage_datetime timestamp, 
consumptioninkw float, 
customer_id varchar(24)
);

COPY electricity_consumption
FROM 's3://redshift-blogs/amazon-forecast-blog/electricityusagedata/electricityusagedata.csv'
IAM_ROLE default
REGION 'us-east-1' delimiter ',' IGNOREHEADER 1;

You can verify the dataset by running a SQL query on your table.

As you can notice, the dataset has electricity consumption in the target field (consumptioninkw) at hourly intervals for individual consumers (customer_id).

Create a forecasting model in Redshift ML

We use the Create Model command to create and train a forecast model. For our forecasting dataset, we use electricity consumption data within the FROM clause. See the following code:

CREATE MODEL forecast_electricity_consumption
FROM electricity_consumption 
TARGET consumptioninkw 
IAM_ROLE 'arn:aws:your-IAM-Role'
AUTO ON MODEL_TYPE FORECAST
SETTINGS (S3_BUCKET 'your-S3-bucket-name',
 HORIZON 24,
 FREQUENCY 'H',
 S3_GARBAGE_COLLECT OFF);

Here, the model name is forecast_electricity_consumption. We use the following settings to create the model:

  • Target – The name of the field for prediction.
  • HORIZON – The number of time steps in the future to forecast.
  • FREQUENCY – The forecast frequency, which should match the input frequency in our case (H meaning hourly). Other acceptable frequency values are Y | M | W | D | H | 30min | 15min | 10min | 5min | 1min. For more details, refer to CREATE MODEL with Forecast.

The Create Model command must include one VARCHAR (customer_id) and a timestamp dimension (usage_datetime). All other related time series feature data must be INT or FLOAT data types.

For the Redshift ML forecasting model, make sure that when you issue a CREATE MODEL statement, you specify MODEL_TYPE as FORECAST. When Redshift ML trains a model or predictor on Forecast, it has a fixed forecast, meaning there is not a physical model to compile and run. Therefore, an inference function is not needed for Forecast models. Instead, we show you how you can pull an exported forecast from the training output location in Amazon S3 into a table locally in your Redshift data warehouse.

When using Forecast, the create model command is run in synchronous mode. This means that after the command is run, it will take 10–15 minutes to set up the required Forecast artifacts. The model will then start training in asynchronous mode, meaning that the training is done behind the scenes by Forecast. You can check when the model training is complete by running the show model command:

SHOW MODEL forecast_electricity_consumption;

The following screenshot shows the results.

The model is trained and deployed when status is shown as READY. If you see TRAINING, that means the model is still training and you need to wait for it to complete.

Generate a forecast

After a model has finished training, you can run a simple create table as command to instantiate all the forecast results into a table. This command will get all the forecast results from the S3 bucket where Forecast exported them.

Create the table locally and load the data in the new table:

CREATE TABLE forecast_electricty_predictions AS SELECT FORECAST(forecast_electricity_consumption);

Here, FORECAST is a function that takes your model’s name as input.

Next, check the forecasted data for the next 24 hours:

Select * from forecast_electricity_predictions;

The following screenshot shows the results.

As shown in the preceding screenshot, our forecast is generated for 24 hours because the HORIZON and FREQUENCY parameters at the model creation and training time were defined as 24H, and that can’t change after the model is trained.

Use case 2: Predict bike sharing rentals

Redshift ML supports historical related time series (RTS) datasets. Historical RTS datasets contain data points up to the forecast horizon, and don’t contain any data points within the forecast horizon.

For this use case, we use a modified version of the Bike Sharing Dataset (Fanaee-T,Hadi. (2013). Bike Sharing Dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C5W894).

Our time series dataset contains the event_timestamp and item_id dimensions. It also contains additional attributes, including season, holiday, temperature, and workingday. These features are RTS because they may impact the no_of_bikes_rented target attribute.

For this post, we only include the workingday feature as RTS to help forecast the no_of_bikes_rented target. Based on following chart, we can see a correlation where the number of bikes rented has a direct relationship with working day.

Prepare the data

Refer to the following notebook for the steps needed to create this use case.

Load the dataset into Amazon Redshift using the following SQL. You can use Query Editor v2 or your preferred SQL tool to run these commands.

To create the table, use the following commands:

create table bike_sampledata
(
event_timestamp timestamp,
season float , 
holiday float , 
workingday float , 
weather float , 
temperature float , 
atemperature float, 
humidity float , 
windspeed float , 
casual float , 
registered float , 
no_of_bikes_rented float,
item_id varchar(255)
);

To load data into Amazon Redshift, use the following COPY command:

copy bike_sampledata
from 's3://redshift-blogs/amazon-forecast-blog/bike-data/bike.csv'
IAM_ROLE default
format as csv
region 'us-east-1';

Create a model in Redshift ML using Forecast

For this example, we are not considering any other RTS features and the goal is to forecast the number of bike rentals for the next 24 hours by accounting for the working day only. You can perform analysis and include additional RTS features in the SELECT query as desired.

Run the following SQL command to create your model—note our target is no_of_bikes_rented, which contains the number of total rentals, and we use item_id, event_timestamp, and workingday as inputs from our training set:

CREATE MODEL forecast_bike_consumption 
FROM (
     select
     s.item_id , s.event_timestamp, s.no_of_bikes_rented, s.workingday
     from     
     bike_sampledata s
     )
TARGET no_of_bikes_rented
IAM_ROLE 'arn:aws:your-IAM-Role'
AUTO ON MODEL_TYPE FORECAST
OBJECTIVE 'AverageWeightedQuantileLoss'
SETTINGS (S3_BUCKET 'your-s3-bucket-name',
          HORIZON 24,
          FREQUENCY 'H',
          PERCENTILES '0.25,0.50,0.75,mean',
          S3_GARBAGE_COLLECT ON);

The Create Model command must include one VARCHAR (item_id) and a timestamp dimension (event_timestamp). All other RTS feature data must be INT or FLOAT data types.

The OBJECTIVE parameter specifies a metric to minimize or maximize the objective of a job. For more details, refer to AutoMLJobObjective.

As in the previous use case, the Create Model command will take 10–15 minutes to set up the required Forecast artifacts and then will start the training in asynchronous mode so model training is done behind the scenes by Forecast. You can check if the model is in the Ready state by running the show model command:

SHOW MODEL forecast_bike_consumption;

Generate predictions

After a model has finished training, you can run a Create table command to instantiate all the forecast results into a table. This command gets all the forecast results from the S3 bucket where Forecast exported them.

Create the table locally and load the data in the new table:

CREATE TABLE forecast_bike_consumption_results 
AS SELECT FORECAST(forecast_bike_consumption);

Run following SQL to inspect the generated forecast results:

select * from forecast_bike_consumption_results;

To visualize the data to help us understand it more, select Chart. For the X axis, choose the time attribute and for the Y axis, choose mean.

You can also visualize all the three forecasts together to understand the differences between them:

  1. Choose Trace and choose Time for the X axis and for p50 for the Y axis.
  2. Choose Trace again and choose Time for the X axis and p75 for the Y axis.
  3. Edit the chart title and legend and provide suitable labels.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the Redshift Serverless workgroup or namespace you have for this post (this will also drop all the objects created).
  2. If you used an existing Redshift Serverless workgroup or namespace, use the following code to drop these objects:
    DROP TABLE forecast_electricty_predictions;
    DROP MODEL forecast_electricity_consumption;
    DROP TABLE electricity_consumption;
    DROP TABLE forecast_bike_consumption_results;
    DROP MODEL forecast_bike_consumption;
    DROP TABLE bike_sampledata;

Conclusion

Redshift ML makes it easy for users of all skill levels to use ML technology. With no prior ML knowledge, you can use Redshift ML to gain business insights for your data.

With Forecast, you can use time series data and related data to forecast different business outcomes using familiar Amazon Redshift SQL commands.

We encourage you to start using this amazing new feature and give us your feedback. For more details, refer to CREATE MODEL with Forecast.


About the authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 15 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based on Toronto. He has more than two decades of experience helping customers modernize their data platforms, Ahmed is passionate about helping customers build efficient, performant and scalable Analytic solutions.

Nikos Koulouris is a Software Development Engineer at AWS. He received his PhD from University of California, San Diego and he has been working in the areas of databases and analytics.