Automate your time series forecasting in Snowflake using Amazon Forecast

This post is a joint collaboration with Andries Engelbrecht and James Sun of Snowflake, Inc.

The cloud computing revolution has enabled businesses to capture and retain corporate and organizational data without capacity planning or data retention constraints. Now, with diverse and vast reserves of longitudinal data, companies are increasingly able to find novel and impactful ways to use their digital assets to make better-informed short-term and long-term planning decisions. Time series forecasting is a unique and essential science that allows companies to make surgical planning decisions to help balance customer service levels against often competing goals of optimal profitability.

At AWS, we sometimes work with customers who have selected our technology partner Snowflake to deliver a cloud data platform experience. Having a platform that can recall years and years of historical data is powerful—but how can you use this data to look ahead and use yesterday’s evidence to plan for tomorrow? Imagine not only having what has happened available in Snowflake—your single version of the truth—but also an adjacent set of non-siloed data that offers a probabilistic forecast for days, weeks, or months into the future.

In a collaborative supply chain, sharing information between partners can improve performance, increase competitiveness, and reduce wasted resources. Sharing your future forecasts can be facilitated with Snowflake Data Sharing, which enables you to seamlessly collaborate with your business partners securely and identify business insights. If many partners share their forecasts, it can help control the bullwhip effect in the connected supply chain. You can effectively use Snowflake Marketplace to monetize your predictive analytics from datasets produced in Amazon Forecast.

In this post, we discuss how to implement an automated time series forecasting solution using Snowflake and Forecast.

Essential AWS services that enable this solution

Forecast provides several state-of-the-art time series algorithms and manages the allocation of enough distributed computing capacity to meet the needs of nearly any workload. With Forecast, you don’t get one model; you get the strength of many models that are further optimized into a uniquely weighted model for each time series in the set. In short, the service delivers all the science, data handling, and resource management into a simple API call.

AWS Step Functions provides a process orchestration mechanism that manages the overall workflow. The service encapsulates API calls to Amazon Athena, AWS Lambda, and Forecast to create an automated solution that harvests data from Snowflake, uses Forecast to convert historical data into future predictions, and then makes the predictions available inside Snowflake.

Athena federated queries can connect to several enterprise data sources, including Amazon DynamoDB, Amazon Redshift, Amazon OpenSearch Service, MySQL, PostgreSQL, Redis, and other popular third-party data stores, such as Snowflake. Data connectors run as Lambda functions—you can use this source code to help launch the Amazon Athena Lambda Snowflake Connector and connect with AWS PrivateLink or through a NAT Gateway.

Solution overview

One of the things we often do at AWS is work to help customers realize their goals while also removing the burden of the undifferentiated heavy lifting. With this in mind, we propose the following solution to help AWS and Snowflake customers perform the following steps:

Export data from Snowflake. You can use flexible metadata to unload the necessary historical data driven by a ready-to-go workflow.
Import data into Forecast. No matter the use case, industry, or scale, importing prepared data inputs is easy and automated.
Train a state-of-the-art time series model. You can automate time series forecasting without managing the underlying data science or hardware provisioning.
Generate inference against the trained model. Forecast-produced outputs are easy to consume for any purpose. They’re available as simple CSV or Parquet files on Amazon Simple Storage Service (Amazon S3).
Use history and future predictions side by side directly in Snowflake.

The following diagram illustrates how to implement an automated workflow that enables Snowflake customers to benefit from highly accurate time series predictions supported by Forecast, an AWS managed service. Transcending use case and industry, the design offered here first extracts historical data from Snowflake. Next, the workflow submits the prepared data for time series computation. Lastly, future period predictions are available natively in Snowflake, creating a seamless user experience for joint AWS and Snowflake customers.

Although this architecture only highlights the key technical details, the solution is simple to put together, sometimes within 1–2 business days. We provide you with working sample code to help remove the undifferentiated heavy lifting of creating the solution alone and without a head start. After you discover how to implement this pattern for one workload, you can repeat the forecasting process for any data held in Snowflake. In the sections that follow, we outline the key steps that enable you to build an automated pipeline.

Extract historical data from Snowflake

In this first step, you use SQL to define what data you want forecasted and let an Athena Federated Query connect to Snowflake, run your customized SQL, and persist the resulting record set on Amazon S3. Forecast requires historical training data to be available on Amazon S3 before ingestion; therefore, Amazon S3 serves as an intermediate storage buffer between Snowflake and Forecast. We feature Athena in this design to enable Snowflake and other heterogeneous data sources. If you prefer, another approach is using the Snowflake COPY command and storage integration to write query results to Amazon S3.

Regardless of the transport mechanism used, we now outline the kind of data Forecast needs and how data is defined, prepared, and extracted. In the section that follows, we describe how to import data into Forecast.

The following screenshot depicts what a set of data might look like in its native Snowflake schema.

Although this screenshot shows how the data looks in its natural state, Forecast requires data to be shaped into three different datasets:

Target time series – This is a required dataset containing the target variable and is used to train and predict a future value. Alone, this dataset serves as a univariate time series model.
Related time series – This is an optional dataset that contains temporal variables that should have a relationship to the target variable. Examples include variable pricing, promotional efforts, hyperlocal event traffic, and economic outlook data: anything you feel might help explain variance in the target time series and produce a better forecast. The related time series dataset turns your univariate model into a multivariate model to help improve accuracy.
Item metadata – This is an optional dataset containing categorical data about the forecasted item. Item metadata often helps boost performance for newly launched products, which we term a cold start.

With the scope of each of the Forecast datasets defined, you can write queries in Snowflake that source the correct data fields from the necessary source tables with the proper filters to get the desired subset of data. The following are three example SQL queries used to generate each dataset that Forecast needs for a specific food demand planning scenario.

We start with the target time series query:

select LOCATION_ID, ITEM_ID, 
DATE_DEMAND as TIMESTAMP, QTY_DEMAND as TARGET_VALUE 
from DEMO.FOOD_DEMAND

The optional related time series query pulls covariates such as price and promotion indicators:

select LOCATION_ID,ITEM_ID, DATE_DEMAND as TIMESTAMP,
CHECKOUT_PRICE, BASE_PRICE,
EMAILER_FOR_PROMOTION, HOMEPAGE_FEATURED
from DEMO.FOOD_DEMAND

The item metadata query fetches distinct categorical values that help give dimension and further define the forecasted item:

select DISTINCT ITEM_ID, FOOD_CATEGORY, FOOD_CUISINE
from DEMO.FOOD_DEMAND

With the source queries defined, we can connect to Snowflake through an Athena Federated Query to submit the queries and persist the resulting datasets for forecasting use. For more information, refer to Query Snowflake using Athena Federated Query and join with data in your Amazon S3 data lake.
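As a rough illustration of this step, the following Python sketch assembles the parameters for Athena's StartQueryExecution API using the target time series query from above. The catalog name snowflake, the database name DEMO, and the S3 output path are assumptions made for this example; your connector's catalog name and bucket will differ. The sketch only builds the request and does not call AWS.

```python
def build_athena_request(sql, catalog, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution API.

    Athena writes the query result set to output_location on Amazon S3.
    """
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Target time series query from this post; catalog, database, and bucket
# names below are hypothetical placeholders.
TARGET_SQL = (
    "select LOCATION_ID, ITEM_ID, "
    "DATE_DEMAND as TIMESTAMP, QTY_DEMAND as TARGET_VALUE "
    "from FOOD_DEMAND"
)

athena_request = build_athena_request(
    sql=TARGET_SQL,
    catalog="snowflake",
    database="DEMO",
    output_location="s3://bucket/folder/target-time-series/",
)

# With boto3 installed and credentials configured, the request could be
# submitted as: boto3.client("athena").start_query_execution(**athena_request)
```

The provided Forecast MLOps solution handles this orchestration for you; the sketch is only meant to show which pieces of information the extract step needs.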

The Athena Snowflake Connector GitHub repo helps install the Snowflake connector. The Forecast MLOps GitHub repo helps orchestrate all macro steps defined in this post, and makes them repeatable without writing code.

Import data into Forecast

After we complete the previous step, a target time series dataset is in Amazon S3 and ready for import into Forecast. In addition, the optional related time series and item metadata datasets may also be prepared and ready for ingestion. With the provided Forecast MLOps solution, all you have to do here is initiate the Step Functions state machine responsible for importing data—no code is necessary. Forecast launches a cluster for each of the datasets you have provided and makes the data ready for the service to use for ML model building and model inference.
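For readers curious about what the state machine submits on their behalf, here is a minimal Python sketch of the two payloads involved: a dataset schema and a CreateDatasetImportJob request. It assumes the CUSTOM dataset domain, the column order of the target time series query shown earlier, and a yyyy-MM-dd timestamp format. The ARNs, job name, and S3 path are hypothetical placeholders, and nothing here calls AWS.

```python
def target_schema(dimensions=("location_id",)):
    """Schema for a target time series dataset in the CUSTOM domain.

    Attribute order must match the column order of the CSV on Amazon S3.
    Here that is: dimensions, item_id, timestamp, target_value.
    """
    attributes = [
        {"AttributeName": name, "AttributeType": "string"} for name in dimensions
    ]
    attributes += [
        {"AttributeName": "item_id", "AttributeType": "string"},
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "target_value", "AttributeType": "float"},
    ]
    return {"Attributes": attributes}


def import_job_request(job_name, dataset_arn, s3_path, role_arn,
                       timestamp_format="yyyy-MM-dd"):
    """Keyword arguments for the Forecast CreateDatasetImportJob API."""
    return {
        "DatasetImportJobName": job_name,
        "DatasetArn": dataset_arn,
        "DataSource": {"S3Config": {"Path": s3_path, "RoleArn": role_arn}},
        "TimestampFormat": timestamp_format,
    }


schema = target_schema()
import_request = import_job_request(
    job_name="food_demand_tts_import",  # hypothetical name
    dataset_arn="arn:aws:forecast:us-east-1:111122223333:dataset/food_demand_tts",
    s3_path="s3://bucket/folder/target-time-series/",
    role_arn="arn:aws:iam::111122223333:role/forecast-s3-access-role",
)

# With boto3, these could be passed as
# boto3.client("forecast").create_dataset(..., Schema=schema) and
# boto3.client("forecast").create_dataset_import_job(**import_request).
```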

Create a time series ML model with accuracy statistics

After data has been imported, highly accurate time series models are created simply by calling an API. This step is encapsulated inside a Step Functions state machine that calls the Forecast API to start model training. After the predictor model is trained, the state machine exports the model statistics and the predictions made during the backtest window to Amazon S3. Backtest exports are queryable by Snowflake as an external stage, as shown in the following screenshot. If you prefer, you can store the data in an internal stage. The point is to use the backtest metrics to evaluate how forecast accuracy varies across the time series in your dataset.
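To make the backtest metrics more concrete, the following Python sketch computes two of them, WAPE and weighted quantile loss (wQL), for a single time series. The formulas follow the commonly documented definitions (WAPE as summed absolute error over summed absolute actuals, and wQL as twice the summed quantile loss over summed absolute actuals). The sample numbers are invented for illustration and are not output from Forecast.

```python
def wape(actuals, forecasts):
    """Weighted absolute percentage error: sum|y - y_hat| / sum|y|."""
    total = sum(abs(a) for a in actuals)
    if total == 0:
        raise ValueError("WAPE is undefined when all actuals are zero")
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / total


def weighted_quantile_loss(actuals, quantile_forecasts, tau):
    """wQL at quantile tau, where 0 < tau < 1.

    Under-forecasts are penalized by tau and over-forecasts by (1 - tau).
    """
    total = sum(abs(a) for a in actuals)
    if total == 0:
        raise ValueError("wQL is undefined when all actuals are zero")
    loss = sum(
        tau * max(a - q, 0.0) + (1 - tau) * max(q - a, 0.0)
        for a, q in zip(actuals, quantile_forecasts)
    )
    return 2 * loss / total


# Invented sample data for one item over three periods.
actuals = [10.0, 20.0, 30.0]
p50 = [12.0, 18.0, 33.0]

wape_value = wape(actuals, p50)
wql_50 = weighted_quantile_loss(actuals, p50, 0.5)
wql_90 = weighted_quantile_loss(actuals, p50, 0.9)
```

At the 0.5 quantile the two penalties are equal, so wQL computed on the median forecast matches WAPE computed on the same values. Lower values indicate a better fit for both metrics.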

Create future predictions

With the model trained from the previous step, a purpose-built Step Functions state machine calls the Forecast API to create future-dated forecasts. Forecast provisions a cluster to perform the inference and pulls the imported target time series, related time series, and item metadata datasets through a named predictor model created in the previous step. After the predictions are generated, the state machine writes them to Amazon S3, where, once again, they can be queried in place as a Snowflake external stage or moved into Snowflake as an internal stage.
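The two API calls behind this step can be sketched in Python as follows. The sketch builds parameters for CreateForecast and CreateForecastExportJob without calling AWS. The quantiles 0.1, 0.5, and 0.9 are an assumed choice, and the names, ARNs, and S3 path are hypothetical placeholders.

```python
def forecast_request(forecast_name, predictor_arn,
                     quantiles=("0.1", "0.5", "0.9")):
    """Keyword arguments for the Forecast CreateForecast API."""
    for q in quantiles:
        # Each forecast type is "mean" or a quantile strictly between 0 and 1.
        if q != "mean" and not 0.0 < float(q) < 1.0:
            raise ValueError(f"invalid forecast type: {q}")
    return {
        "ForecastName": forecast_name,
        "PredictorArn": predictor_arn,
        "ForecastTypes": list(quantiles),
    }


def export_request(job_name, forecast_arn, s3_path, role_arn,
                   file_format="CSV"):
    """Keyword arguments for the Forecast CreateForecastExportJob API."""
    if file_format not in ("CSV", "PARQUET"):
        raise ValueError("file_format must be CSV or PARQUET")
    return {
        "ForecastExportJobName": job_name,
        "ForecastArn": forecast_arn,
        "Destination": {"S3Config": {"Path": s3_path, "RoleArn": role_arn}},
        "Format": file_format,
    }


create_forecast = forecast_request(
    forecast_name="food_demand_forecast",  # hypothetical name
    predictor_arn="arn:aws:forecast:us-east-1:111122223333:predictor/food_demand",
)
create_export = export_request(
    job_name="food_demand_forecast_export",  # hypothetical name
    forecast_arn="arn:aws:forecast:us-east-1:111122223333:forecast/food_demand_forecast",
    s3_path="s3://bucket/folder/forecast",
    role_arn="arn:aws:iam::111122223333:role/forecast-s3-access-role",
)
```

The export path in this example matches the first location allowed by the storage integration later in this post, so the exported files would be readable from a Snowflake external stage.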

Use the future-dated prediction data directly in Snowflake

AWS hasn’t built a fully automated solution for this step; however, with the solution in this post, data was already produced by Forecast in the previous two steps. You may treat the outputs as actionable events or build business intelligence dashboards on the data. You may also use the data to create future manufacturing plans and purchase orders, estimate future revenue, build staffing resource plans, and more. Every use case is different, but the point of this step is to deliver the predictions to the correct consuming systems in your organization or beyond.

The following code snippet shows how to query Amazon S3 data directly from within Snowflake:

CREATE or REPLACE FILE FORMAT mycsvformat
type = 'CSV'
field_delimiter = ','
empty_field_as_null = TRUE
ESCAPE_UNENCLOSED_FIELD = None
skip_header = 1;

CREATE or REPLACE STORAGE INTEGRATION amazon_forecast_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::nnnnnnnnnn:role/snowflake-forecast-poc-role'
ENABLED = true
STORAGE_ALLOWED_LOCATIONS = (
's3://bucket/folder/forecast',
's3://bucket/folder/backtest-export/accuracy-metrics-values',
's3://bucket/folder/backtest-export/forecasted-values');

CREATE or REPLACE STAGE backtest_accuracy_metrics
storage_integration = amazon_forecast_integration
url = 's3://bucket/folder/backtest-export/accuracy-metrics-values'
file_format = mycsvformat;

CREATE or REPLACE EXTERNAL TABLE FOOD_DEMAND_BACKTEST_ACCURACY_METRICS (
ITEM_ID varchar AS (value:c1::varchar),
LOCATION_ID varchar AS (value:c2::varchar),
backtest_window varchar AS (value:c3::varchar),
backtestwindow_start_time varchar AS (value:c4::varchar),
backtestwindow_end_time varchar AS (value:c5::varchar),
wQL_10 varchar AS (value:c6::varchar),
wQL_30 varchar AS (value:c7::varchar),
wQL_50 varchar AS (value:c8::varchar),
wQL_70 varchar AS (value:c9::varchar),
wQL_90 varchar AS (value:c10::varchar),
AVG_wQL varchar AS (value:c11::varchar),
RMSE varchar AS (value:c12::varchar),
WAPE varchar AS (value:c13::varchar),
MAPE varchar AS (value:c14::varchar),
MASE varchar AS (value:c15::varchar)
)
with location = @backtest_accuracy_metrics
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);

For more information about setting up permissions, refer to Option 1: Configuring a Snowflake Storage Integration to Access Amazon S3. Additionally, you can use the AWS Service Catalog to configure Amazon S3 storage integration; more information is available on the GitHub repo.

Initiate a schedule-based or event-based workflow

After you install a solution for your specific workload, your final step is to automate the process on a schedule that makes sense for your unique requirement, such as daily or weekly. The main thing is to decide how to start the process. One method is to use Snowflake to invoke the Step Functions state machine and then orchestrate the steps serially. Another approach is to chain state machines together and start the overall run through an Amazon EventBridge rule, which you can configure to run from an event or scheduled task—for example, at 9:00 PM GMT-8 each Sunday night.
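EventBridge cron expressions are evaluated in UTC, so a local schedule such as the Sunday night example needs to be converted first. The following Python sketch shows one way to do that for a weekly schedule with a fixed UTC offset. The helper name weekly_cron_utc is made up for this example, and a fixed offset ignores daylight saving time changes.

```python
from datetime import datetime, timedelta

DAYS = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]


def weekly_cron_utc(day, hour, minute, utc_offset_hours):
    """Convert a weekly local time to an EventBridge cron expression in UTC.

    EventBridge cron fields are: minutes, hours, day-of-month, month,
    day-of-week, year. Day-of-month is "?" because day-of-week is set.
    """
    day_index = DAYS.index(day.upper())
    # 2024-01-01 was a Monday, so adding day_index lands on the requested weekday.
    local = datetime(2024, 1, 1 + day_index, hour, minute)
    utc = local - timedelta(hours=utc_offset_hours)
    return f"cron({utc.minute} {utc.hour} ? * {DAYS[utc.weekday()]} *)"


# 9:00 PM at GMT-8 on Sunday is 5:00 AM UTC on Monday.
schedule_expression = weekly_cron_utc("SUN", 21, 0, -8)
print(schedule_expression)  # cron(0 5 ? * MON *)
```

The resulting string could be used as the schedule expression of an EventBridge rule that targets the state machine.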

Conclusion

With the most experience; the most reliable, scalable, and secure cloud; and the most comprehensive set of services and solutions, AWS is the best place to unlock value from your data and turn it into insight. In this post, we showed you how to create an automated time series forecasting workflow. Better forecasting can lead to higher customer service outcomes, less waste, less idle inventory, and more cash on the balance sheet.

If you’re ready to automate and improve forecasting, we’re here to help support you on your journey. Contact your AWS or Snowflake account team to get started today and ask for a forecasting workshop to see what kind of value you can unlock from your data.

About the Authors

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped technology companies design and implement data analytics solutions and products.

Frank Dallezotte is a Sr. Solutions Architect at AWS and is passionate about working with independent software vendors to design and build scalable applications on AWS. He has experience creating software, implementing build pipelines, and deploying these solutions in the cloud.

Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake and works with strategic partners. He is actively engaged with strategic partners like AWS supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics.

Charles Laughlin is a Principal AI/ML Specialist Solutions Architect and works on the Time Series ML team at AWS. He helps shape the Amazon Forecast service roadmap and collaborates daily with diverse AWS customers to help transform their businesses using cutting-edge AWS technologies and thought leadership. Charles holds an M.S. in Supply Chain Management and has spent the past decade working in the consumer packaged goods industry.

James Sun is a Senior Partner Solutions Architect at Snowflake. James has over 20 years of experience in storage and data analytics. Prior to Snowflake, he held several senior technical positions at AWS and MapR. James holds a PhD from Stanford University.
