Control formality in machine translated text using Amazon Translate

Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. Amazon Translate now supports formality customization, a feature that lets you control the level of formality in your translation output to suit your communication needs. At the time of writing, formality customization is available for six target languages: French, German, Hindi, Italian, Japanese, and Spanish.

You have three options to control the level of formality in the output:

  • Default – No control over formality; the neural machine translation operates without any influence on the output
  • Formal – Useful in the insurance and healthcare industry, where you may prefer a more formal translation
  • Informal – Useful for customers in gaming and social media who prefer an informal translation

Formality customization is available in real-time translation operations in commercial AWS Regions where Amazon Translate is available. In this post, we walk you through how to use the formality customization feature and get a customized translated output securely.

Solution overview

To get formal or informal words and phrases in your translation output, you can enable the Formality setting under Additional settings on the Amazon Translate console when you run real-time translation requests. The following sections describe using formality customization via the Amazon Translate console, AWS Command Line Interface (AWS CLI), or the Amazon Translate SDK (Python Boto3).

Amazon Translate console

To demonstrate formality customization with real-time translation, we use the sample text “Good morning, how are you doing today?” in English:

  1. On the Amazon Translate console, choose English (en) for Source language.
  2. Choose Spanish (es) for Target language.
  3. Enter the quoted text in the Source language text field.
  4. In the Additional settings section, enable Formality, and select Informal on the drop-down menu.

The translated output is “Buenos días, ¿cómo te va hoy?” which is a casual way of speaking in Spanish.

English to Spanish informal translation

  1. Now, select Formal on the drop-down Formality menu.

The translated output changes to “Buenos días, ¿cómo le va hoy?” which is a more formal way of speaking in Spanish.

English to Spanish formal translation

You can follow the preceding steps to change the target language to other supported languages and note the difference between the informal and formal translations. Let’s try some more sample text.

In the following examples, we translate “So what do you think?” from English to German. The first screenshot shows an informal translation.

English to German informal translation

The following screenshot shows the formal translation.

English to German formal translation

In another example, we translate “Can I help you?” from English to Japanese. The first screenshot shows an informal translation.

English to Japanese informal translation

The following screenshot shows the formal translation.

English to Japanese formal translation

AWS CLI

The translate-text AWS CLI command with the --settings Formality=FORMAL | INFORMAL option applies the requested level of formality to the words and phrases in your translated text.

The following AWS CLI commands are formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^).

In the following code, we translate “How are you?” from English to Hindi, using the FORMAL setting:

aws translate translate-text \
--text "How are you?" \
--source-language-code "en" \
--target-language-code "hi" \
--settings Formality=FORMAL

You get a response like the following snippet:

{     "TranslatedText": "आप कैसे हो?", 
       "SourceLanguageCode": "en",      
       "TargetLanguageCode": "hi", 
       "AppliedSettings": {         
                            "Formality": "FORMAL"
                           } 
}

The following code translates the same text into informal Hindi:

aws translate translate-text \
--text "How are you?" \
--source-language-code "en" \
--target-language-code "hi" \
--settings Formality=INFORMAL

You get a response like the following snippet:

{     "TranslatedText": "तुम कैसे हो?",      
      "SourceLanguageCode": "en",      
      "TargetLanguageCode": "hi",     
      "AppliedSettings": {         
                          "Formality": "INFORMAL"     
                          } 
}

Amazon Translate SDK (Python Boto3)

The following Python Boto3 code uses the real-time translation call with both formality settings to translate “How are you?” from English to Hindi.

import boto3
import json

translate = boto3.client(service_name='translate', region_name='us-west-2')

result = translate.translate_text(Text="How are you?", SourceLanguageCode="en", TargetLanguageCode="hi", Settings={"Formality": "INFORMAL"})
print('TranslatedText: ' + result.get('TranslatedText'))
print('SourceLanguageCode: ' + result.get('SourceLanguageCode'))
print('TargetLanguageCode: ' + result.get('TargetLanguageCode'))
print('AppliedSettings: ' + json.dumps(result.get('AppliedSettings')))

print('')

result = translate.translate_text(Text="How are you?", SourceLanguageCode="en", TargetLanguageCode="hi", Settings={"Formality":"FORMAL"})
print('TranslatedText: ' + result.get('TranslatedText'))
print('SourceLanguageCode: ' + result.get('SourceLanguageCode'))
print('TargetLanguageCode: ' + result.get('TargetLanguageCode'))
print('AppliedSettings: ' + json.dumps(result.get('AppliedSettings')))

Conclusion

You can use the formality customization feature in Amazon Translate to control the level of formality in machine translated text to meet your application context and business requirements. You can customize your translations using Amazon Translate in multiple ways, including custom terminology, profanity masking, and active custom translation.


About the Authors

Siva Rajamani is a Boston-based Enterprise Solutions Architect at AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are serverless, application integration, and security. Outside of work, he enjoys outdoors activities and watching documentaries.

Sudhanshu Malhotra is a Boston-based Enterprise Solutions Architect for AWS. He’s a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are DevOps, machine learning, and security. When he’s not working with customers on their journey to the cloud, he enjoys reading, hiking, and exploring new cuisines.

Watson G. Srivathsan is the Sr. Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends you will find him exploring the outdoors in the Pacific Northwest.

Read More

How SIGNAL IDUNA operationalizes machine learning projects on AWS

This post is co-authored with Jan Paul Assendorp, Thomas Lietzow, Christopher Masch, Alexander Meinert, Dr. Lars Palzer, Jan Schillemans of SIGNAL IDUNA.

At SIGNAL IDUNA, a large German insurer, we are currently reinventing ourselves with our transformation program VISION2023 to become even more customer oriented. Two aspects are central to this transformation: the reorganization of large parts of the workforce into cross-functional and agile teams, and becoming a truly data-driven company. Here, the motto “You build it, you run it” is an important requirement for a cross-functional team that builds a data or machine learning (ML) product. This places tight constraints on how much work a team can spend to productionize and run a product.

This post shows how SIGNAL IDUNA tackles this challenge and utilizes the AWS Cloud to enable cross-functional teams to build and operationalize their own ML products. To this end, we first introduce the organizational structure of agile teams, which sets the central requirements for the cloud infrastructure used to develop and run a product. Next, we show how three central teams at SIGNAL IDUNA enable cross-functional teams to build data products in the AWS Cloud with minimal assistance, by providing a suitable workflow and infrastructure solutions that can easily be used and adapted. Finally, we review our approach and compare it with a more classical approach where development and operation are separated more strictly.

Agile@SI – the Foundation of Organizational Change

Since the start of 2021, SIGNAL IDUNA has begun placing its strategy Agile@SI into action and establishing agile methods for developing customer-oriented solutions across the entire company [1]. Previous tasks and goals are now undertaken by cross-functional teams, called squads. These squads employ agile methods (such as the Scrum framework), make their own decisions, and build customer-oriented products. Typically, the squads are located in business divisions, such as marketing, and many have a strong emphasis on building data-driven and ML powered products. As an example, typical use cases in insurance are customer churn prediction and product recommendation.

Due to the complexity of ML, it is challenging for a single squad to create an ML solution on its own, so collaboration between different squads is required.

SIGNAL IDUNA has three essential teams that support creating ML solutions. Surrounded by these three squads is the team that is responsible for the development and the long-term operation of the ML solution. This approach follows the AWS shared responsibility model [3].

The preceding image shows an overview of all the squads.

Cloud Enablement

The underlying cloud infrastructure for the entire organization is provided by the squad Cloud Enablement. It is their task to enable the teams to build products upon cloud technologies on their own. This improves time to market for new products, such as ML products, and follows the principle of “You build it, you run it”.

Data Office/Data Lake

Moving data into the cloud, as well as finding the right dataset, is supported by the squad Data Office/Data Lake. They set up a data catalogue that can be used to search and select required datasets. Their aim is to establish data transparency and governance. Additionally, they are responsible for establishing and operating a Data Lake that helps teams to access and process relevant data.

Data Analytics Platform

Our squad Data Analytics Platform (DAP) is a cloud and ML focused team at SIGNAL IDUNA that is proficient in ML engineering, data engineering, and data science. We enable internal teams to use the public cloud for ML by providing infrastructure components and knowledge. Our products and services are presented in detail in the following section.

Enabling Cross-Functional Teams to Build ML Solutions

To enable cross-functional teams at SIGNAL IDUNA to build ML solutions, we need a fast and versatile way to provision reusable cloud infrastructure as well as an efficient workflow for onboarding teams to utilize the cloud capabilities.

To this end, we created a standardized onboarding and support process, and provided modular infrastructure templates as Infrastructure as Code (IaC). These templates contain infrastructure components designed for common ML use cases that can be easily tailored to the requirements of a specific use case.

The Workflow of Building ML Solutions

There are three main technical roles involved in building and operating ML solutions: the data scientist, the ML engineer, and the data engineer. Each role is part of the cross-functional squad and has different responsibilities. The data scientist has the required domain knowledge of the functional as well as technical requirements of the use case. The ML engineer specializes in building automated ML solutions and model deployment. The data engineer makes sure that data flows from on premises into the cloud and between services within the cloud.

The process of providing the platform is as follows:

The infrastructure of the specific use case is defined in IaC and versioned in a central project repository. This also includes pipelines for model training and deployment, as well as other data science related code artifacts. Data scientists, ML engineers, and data engineers have access to the project repository and can configure and update all of the infrastructure code autonomously. This enables the team to rapidly alter the infrastructure if needed. However, the ML engineer can always support in developing and updating infrastructure or ML models.

Reusable and Modular Infrastructure Components

The hierarchical and modular IaC resources are implemented in Terraform and include infrastructure for common data science and ETL use cases. This lets us reuse infrastructure code and enforce required security and compliance policies, such as using AWS Key Management Service (KMS) encryption for data, as well as encapsulating infrastructure in Amazon Virtual Private Cloud (VPC) environments without direct internet access.

The hierarchical IaC structure is as follows:

  • Modules encapsulate basic AWS services with the required configuration for security and access management. This includes best practice configurations such as the prevention of public access to Amazon Simple Storage Service (S3) buckets, or enforcing encryption for all files stored.
  • In some cases, you need a variety of services to automate processes, such as to deploy ML models in different stages. Therefore, we defined Solutions as a bundle of different modules in a joint configuration for different types of tasks.
  • In addition, we offer complete Blueprints that combine solutions in different environments to meet the many potential needs of a project. In our MLOps blueprint, we define a deployable infrastructure for training, provisioning, and monitoring ML models that are integrated and distributed in AWS accounts. We discuss further details in the next section.

These products are versioned in a central repository by the DAP squad. This lets us continuously improve our IaC and consider new features from AWS, such as Amazon SageMaker Model Registry. Each squad can reference these resources, parameterize them as needed, and finally deploy them in their own AWS accounts.

MLOps Architecture

We provide a ready-to-use blueprint with specific solutions to cover the entire MLOps process. The blueprint contains infrastructure distributed over four AWS accounts for building and deploying ML models. This lets us isolate resources and workflows for the different steps in the MLOps process. The following figure shows the multi-account architecture, and we describe how the responsibility over specific steps of the process is divided between the different technical roles.

The modeling account includes services for the development of ML models. First, the data engineer employs an ETL process to provide relevant data from the SIGNAL IDUNA data lake, the centralized gateway for data-driven workflows in the AWS Cloud. Subsequently, the dataset can be utilized by the data scientist to train and evaluate model candidates. Once ready for extensive experiments, a model candidate is integrated into an automated training pipeline by the ML engineer. We use Amazon SageMaker Pipelines to automate training, hyperparameter tuning, and model evaluation at scale. This also includes model lineage and a standardized approval mechanism for models to be staged for deployment into production. Automated unit tests and code analysis ensure quality and reliability of the code for each step of the pipeline, such as data preprocessing, model training, and evaluation. Once a model is evaluated and approved, we use Amazon SageMaker ModelPackages as an interface to the trained model and relevant meta data.
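
The exact pipeline definition depends on the use case, but the following minimal sketch illustrates what such a SageMaker pipeline can look like: a single training step followed by registration of the resulting model package. The image URI, IAM role, S3 paths, and model package group name are placeholder values, and a real pipeline at SIGNAL IDUNA contains additional steps for preprocessing, hyperparameter tuning, evaluation, and approval.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.step_collections import RegisterModel

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/modeling-role"  # placeholder role ARN

# Estimator for the training job (image URI and S3 paths are placeholders)
estimator = Estimator(
    image_uri="<training-image-uri>",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-modeling-bucket/models",
    sagemaker_session=session,
)

# Training step that consumes the dataset prepared by the ETL process
train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput("s3://my-modeling-bucket/prepared/train")},
)

# Register the trained model as a model package so it can be approved for deployment
register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="my-model-package-group",  # placeholder group name
)

pipeline = Pipeline(name="mlops-training-pipeline", steps=[train_step, register_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()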

The tooling account contains automated CI/CD pipelines with different stages for testing and deployment of trained models. In the test stage, models are deployed into the serving-nonprod account. Although model quality is evaluated in the training pipeline prior to the model being staged for production, here we run performance and integration tests in an isolated testing environment. After passing the testing stage, models are deployed into the serving-prod account to be integrated into production workflows.

Separating the stages of the MLOps workflow into different AWS accounts lets us isolate development and testing from production. Therefore, we can enforce a strict access and security policy. Furthermore, tailored IAM roles ensure that each service can only access the data and other services required for its scope, following the principle of least privilege. Services within the serving environments can additionally be made accessible to external business processes. For example, a business process can query an endpoint within the serving-prod environment for model predictions.

Benefits of our Approach

This process has many advantages compared to a strict separation of development and operation, for both the ML models and the required infrastructure:

  • Isolation: Every team receives their own set of AWS accounts that are completely isolated from other teams’ environments. This makes it easy to manage access rights and keep the data private to those who are entitled to work with it.
  • Cloud enablement: Team members with little prior experience in cloud DevOps (such as many data scientists) can easily follow the whole process of designing and managing infrastructure, because (almost) nothing is hidden from them behind a central service. This creates a better understanding of the infrastructure, which can in turn help them create data science products more efficiently.
  • Product ownership: The use of preconfigured infrastructure solutions and managed services keeps the barrier to managing an ML product in production very low. Therefore, a data scientist can easily take ownership of a model that is put into production. This minimizes the well-known risk of failing to put a model into production after development.
  • Innovation: Since ML engineers are involved long before a model is ready to put into production, they can create infrastructure solutions suitable for new use cases while the data scientists develop an ML model.
  • Adaptability: Because the IaC solutions developed by DAP are freely available, any team can easily adapt them to the specific needs of their use case.
  • Open source: All new infrastructure solutions can easily be made available via the central DAP code repo to be used by other teams. Over time, this will create a rich code base with infrastructure components tailored to different use cases.

Summary

In this post, we illustrated how cross-functional teams at SIGNAL IDUNA are being enabled to build and run ML products on AWS. Central to our approach is the usage of a dedicated set of AWS accounts for each team in combination with bespoke IaC blueprints and solutions. These two components enable a cross-functional team to create and operate production quality infrastructure. In turn, they can take full end-to-end ownership of their ML products.

Refer to Amazon SageMaker Model Building Pipelines to learn more.

Find more information on ML on AWS on our official page.

References

[1] https://www.handelsblatt.com/finanzen/versicherungsbranche-vorbild-spotify-signal-iduna-wird-von-einer-handwerker-versicherung-zum-agilen-konzern/27381902.html

[2] https://blog.crisp.se/wp-content/uploads/2012/11/SpotifyScaling.pdf

[3] https://aws.amazon.com/compliance/shared-responsibility-model/


About the Authors

Jan Paul Assendorp is an ML engineer with a strong data science focus. He builds ML models and automates model training and the deployment into production environments.

Thomas Lietzow is the Scrum Master of the squad Data Analytics Platform.

Christopher Masch is the Product Owner of the squad Data Analytics Platform with knowledge in data engineering, data science, and ML engineering.

Alexander Meinert is part of the Data Analytics Platform team and works as an ML engineer. He started with statistics, grew through data science projects, and found a passion for ML methods and architecture.

Dr. Lars Palzer is a data scientist and part of the Data Analytics Platform team. After helping to build the MLOps architecture components, he is now using them to build ML products.

Jan Schillemans is an ML engineer with a software engineering background. He focuses on applying software engineering best practices to ML environments (MLOps).

Read More

Bongo Learn provides real-time feedback to improve learning outcomes with Amazon Transcribe

Real-time feedback helps drive learning. This is especially important for designing presentations, learning new languages, and strengthening other essential skills that are critical to succeed in today’s workplace. However, many students and lifelong learners lack access to effective face-to-face instruction to hone these skills. In addition, with the rapid adoption of remote learning, educators are seeking more effective ways to engage their students and provide feedback and guidance in online learning environments. Bongo is filling that gap using video-based engagement and personalized feedback.

Bongo is a video assessment solution that enables experiential learning and soft skill development at scale. Their Auto Analysis™ is an automated reporting feature that provides deeper insight into an individual’s performance and progress. Organizations around the world—both corporate and higher education institutions—use Bongo’s Auto Analysis™ to facilitate automated feedback for a variety of use cases, including individual presentations, objection handling, and customer interaction training. The Auto Analysis™ platform, which runs on AWS and uses Amazon Transcribe, allows learners to demonstrate what they can do on video and helps evaluators get an authentic representation of a learner’s competency across a range of skills.

When users complete a video assignment, Bongo uses Amazon Transcribe, a deep learning-powered automatic speech recognition (ASR) service, to convert speech into text. Bongo analyzes the transcripts to identify the use of keywords and filler words, and to assess the clarity and effectiveness of the individual’s delivery. Bongo then auto-generates personalized feedback reports based on these performance insights, which learners can utilize as they practice iteratively. Learners can then submit their recording for feedback from evaluators and peers. Learners have reported a strong preference for receiving private and detailed feedback prior to submitting their work for evaluation or peer review.
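
Bongo's production integration isn't shown in this post; the following boto3 sketch simply illustrates the kind of Amazon Transcribe calls involved in transcribing a recorded video. The bucket, file, job name, and Region are placeholder values.

import time
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

job_name = "learner-presentation-001"  # placeholder job name

# Start an asynchronous transcription job for a recorded video stored in S3
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": "s3://my-bucket/recordings/presentation.mp4"},  # placeholder URI
    MediaFormat="mp4",
    LanguageCode="en-US",  # alternatively, pass IdentifyLanguage=True instead of a language code
)

# Poll until the job completes, then print the transcript location
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

if status == "COMPLETED":
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])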

Why Bongo chose Amazon Transcribe

During the technical evaluation process, Bongo looked at several speech-to-text vendors and machine learning services. Bruce Fischer, CTO at Bongo, says, “When choosing a vendor, AWS’ breadth and depth of services enabled us to build a complete solution through a single vendor. That saved us valuable development and deployment time. In addition, Amazon Transcribe produces high-quality transcripts with timestamps that allow Bongo Auto Analysis™ to provide accurate feedback to learners and improve learning outcomes. We are excited with how the service has evolved and how its new capabilities enable us to innovate faster.”

Since launch, Bongo has added the custom vocabulary feature of Amazon Transcribe, which lets the service recognize, for example, business jargon that is common in sales presentations. Foreign language learning is another important use case for Bongo customers. The automatic language detection feature in Amazon Transcribe and overall language support (37 different languages for batch processing) allows Bongo to deliver Auto Analysis™ in several languages, such as French, Spanish, German, and Portuguese.

Recently, Bongo launched auto-captioning for their on-demand videos. Powered by Amazon Transcribe, captions help address the accessibility needs of Bongo users with learning disabilities and impairments.

Amazon Transcribe enables Bongo’s Auto Analysis™ to quickly and accurately transcribe learner videos and provide feedback on the video that helps a learner employ a ‘practice, reflect, improve’ loop. This enables learners to increase content comprehension, retention, and learning outcomes, and reduces instructor assessment time since they are viewing a better work product. Teachers can focus on providing insightful feedback without spending time on the metrics the Auto Analysis™ produces automatically.

– Josh Kamrath, Bongo’s CEO.

Recently, Dr. Lynda Randall and Dr. Jessica Jaynes from California State University, Fullerton, conducted a research study to analyze the effectiveness of Bongo on student engagement and learning outcomes in an actual classroom setting.[1] The study results showed how the use of Bongo helped increase student comprehension and retention of concepts.

Conclusion

The Bongo team is now looking at how to incorporate other AWS AI services, such as Amazon Comprehend to do further language processing and Amazon Rekognition for visual analysis of videos. Bongo and their AWS team will continue working together to create the best experience for learners and instructors alike. To learn more about Amazon Transcribe and test it yourself, visit the Amazon Transcribe console.

[1] Randall, L.E., & Jaynes, J. A comparison of three assessment types of student engagement and content knowledge in online instruction. Online Learning Journal. (Status: Accepted. Publication date TBD)


About Bongo

Bongo is an embedded solution that drives meaningful assessment, experiential learning, and skill development at scale through video-based engagement and personalized feedback. Organizations use our video workflows to create opportunities for practice, demonstration, analysis, and collaboration. When individuals show what they can do within a real-world learning environment, evaluators get an authentic representation of their competency.


About the Author

Roshni Madaiah is an Account Manager on the AWS EdTech team, where she helps Education Technology customers build cutting edge solutions to transform learning and enrich student experience. Prior to AWS, she worked with enterprises and commercial customers to drive business outcomes via technical solutions. Outside of work, she enjoys traveling, reading and cooking without recipes.

Read More

Prepare time series data with Amazon SageMaker Data Wrangler

Time series data is widely present in our lives. Stock prices, house prices, weather information, and sales data captured over time are just a few examples. As businesses increasingly look for new ways to gain meaningful insights from time series data, the ability to visualize data and apply desired transformations are fundamental steps. However, time series data possesses unique characteristics and nuances compared to other kinds of tabular data, and requires special consideration. For example, standard tabular or cross-sectional data is collected at a specific point in time. In contrast, time series data is captured repeatedly over time, with each successive data point dependent on its past values.

Because most time series analyses rely on the information gathered across a contiguous set of observations, missing data and inherent sparseness can reduce the accuracy of forecasts and introduce bias. Additionally, most time series analysis approaches rely on equal spacing between data points, in other words, periodicity. Therefore, the ability to fix data spacing irregularities is a critical prerequisite. Finally, time series analysis often requires the creation of additional features that can help explain the inherent relationship between input data and future predictions. All these factors differentiate time series projects from traditional machine learning (ML) scenarios and demand a distinct approach to its analysis.

This post walks through how to use Amazon SageMaker Data Wrangler to apply time series transformations and prepare your dataset for time series use cases.

Use cases for Data Wrangler

Data Wrangler provides a no-code/low-code solution to time series analysis with features to clean, transform, and prepare data faster. It also enables data scientists to prepare time series data in adherence to their forecasting model’s input format requirements. The following are a few ways you can use these capabilities:

  • Descriptive analysis – Usually, step one of any data science project is understanding the data. When we plot time series data, we get a high-level overview of its patterns, such as trend, seasonality, cycles, and random variations. It helps us decide the correct forecasting methodology for accurately representing these patterns. Plotting can also help identify outliers, preventing unrealistic and inaccurate forecasts. Data Wrangler comes with a seasonality-trend decomposition visualization for representing components of a time series, and an outlier detection visualization to identify outliers.
  • Explanatory analysis – For multi-variate time series, the ability to explore, identify, and model the relationship between two or more time series is essential for obtaining meaningful forecasts. The Group by transform in Data Wrangler creates multiple time series by grouping data for specified cells. Additionally, Data Wrangler time series transforms, where applicable, allow specification of additional ID columns to group on, enabling complex time series analysis.
  • Data preparation and feature engineering – Time series data is rarely in the format expected by time series models. It often requires data preparation to convert raw data into time series-specific features. You may want to validate that time series data is regularly or equally spaced prior to analysis. For forecasting use cases, you may also want to incorporate additional time series characteristics, such as autocorrelation and statistical properties. With Data Wrangler, you can quickly create time series features such as lag columns for multiple lag periods, resample data to multiple time granularities, and automatically extract statistical properties of a time series, to name a few capabilities.

Solution overview

This post elaborates on how data scientists and analysts can use Data Wrangler to visualize and prepare time series data. We use the bitcoin cryptocurrency dataset from cryptodatadownload with bitcoin trading details to showcase these capabilities. We clean, validate, and transform the raw dataset with time series features, and also generate bitcoin volume forecasts using the transformed dataset as input.

The sample of bitcoin trading data is from January 1 – November 19, 2021, with 464,116 data points. The dataset attributes include the timestamp of the price record, the opening (first) price at which the coin was exchanged on a particular day, the highest price at which the coin was exchanged on the day, the last price at which the coin was exchanged on the day, the volume exchanged on the day in BTC, and the corresponding volume in USD.

Prerequisites

Download the Bitstamp_BTCUSD_2021_minute.csv file from cryptodatadownload and upload it to Amazon Simple Storage Service (Amazon S3).

Import bitcoin dataset in Data Wrangler

To start the ingestion process to Data Wrangler, complete the following steps:

  1. On the SageMaker Studio console, on the File menu, choose New, then choose Data Wrangler Flow.
  2. Rename the flow as desired.
  3. For Import data, choose Amazon S3.
  4. Upload the Bitstamp_BTCUSD_2021_minute.csv file from your S3 bucket.

You can now preview your data set.

  1. In the Details pane, choose Advanced configuration and deselect Enable sampling.

This is a relatively small data set, so we don’t need sampling.

  1. Choose Import.

You have successfully created the flow diagram and are ready to add transformation steps.

Add transformations

To add data transformations, choose the plus sign next to Data types and choose Edit data types.

Ensure that Data Wrangler automatically inferred the correct data types for the data columns.

In our case, the inferred data types are correct. However, if a data type were incorrect, you could easily modify it through the UI, as shown in the following screenshot.

edit and review data types

Let’s kick off the analysis and start adding transformations.

Data cleaning

We first perform several data cleaning transformations.

Drop column

Let’s start by dropping the unix column, because we use the date column as the index.

  1. Choose Back to data flow.
  2. Choose the plus sign next to Data types and choose Add transform.
  3. Choose + Add step in the TRANSFORMS pane.
  4. Choose Manage columns.
  5. For Transform, choose Drop column.
  6. For Column to drop, choose unix.
  7. Choose Preview.
  8. Choose Add to save the step.

Handle missing

Missing data is a well-known problem in real-world datasets. Therefore, it’s a best practice to verify the presence of any missing or null values and handle them appropriately. Our dataset doesn’t contain missing values. But if there were, we would use the Handle missing time series transform to fix them. Commonly used strategies for handling missing data include dropping rows with missing values or filling the missing values with reasonable estimates. Because time series data relies on a sequence of data points across time, filling missing values is the preferred approach. The process of filling missing values is referred to as imputation. The Handle missing time series transform allows you to choose from multiple imputation strategies.

  1. Choose + Add step in the TRANSFORMS pane.
  2. Choose the Time Series transform.
  3. For Transform, choose Handle missing.
  4. For Time series input type, choose Along column.
  5. For Method for imputing values, choose Forward fill.

The Forward fill method replaces the missing values with the non-missing values preceding the missing values.

handle missing time series transform

Backward fill, Constant Value, Most common value, and Interpolate are other imputation strategies available in Data Wrangler. Interpolation techniques rely on neighboring values for filling missing values. Time series data often exhibits correlation between neighboring values, making interpolation an effective filling strategy. For additional details on the functions you can use for applying interpolation, refer to pandas.DataFrame.interpolate.
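
Data Wrangler applies the imputation for you, but if you want to prototype the same strategies in a notebook, the pandas equivalents look roughly like the following sketch (the toy DataFrame stands in for the bitcoin dataset):

import pandas as pd

# Toy stand-in for the bitcoin data; the real values come from the imported CSV
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:01", "2021-01-01 00:02"]),
    "Volume USD": [1000.0, None, 1250.0],
})

# Forward fill: replace each missing value with the last observed value before it
df["Volume USD"] = df["Volume USD"].ffill()

# Other strategies offered by the transform have direct pandas counterparts:
# df["Volume USD"].bfill()                       # Backward fill
# df["Volume USD"].fillna(0)                     # Constant value
# df["Volume USD"].interpolate(method="linear")  # Interpolate from neighboring values
print(df)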

Validate timestamp

In time series analysis, the timestamp column acts as the index column, around which the analysis revolves. Therefore, it’s essential to make sure the timestamp column doesn’t contain invalid or incorrectly formatted time stamp values. Because we’re using the date column as the timestamp column and index, let’s confirm its values are correctly formatted.

  1. Choose + Add step in the TRANSFORMS pane.
  2. Choose the Time Series transform.
  3. For Transform, choose Validate timestamps.

The Validate timestamps transform allows you to check that the timestamp column in your dataset doesn’t have values with an incorrect timestamp or missing values.

  1. For Timestamp Column, choose date.
  2. On the Policy drop-down menu, choose Indicate.

The Indicate policy option creates a Boolean column indicating if the value in the timestamp column is a valid date/time format. Other options for Policy include:

  • Error – Throws an error if the timestamp column is missing or invalid
  • Drop – Drops the row if the timestamp column is missing or invalid
  1. Choose Preview.

A new Boolean column named date_is_valid was created, with true values indicating correct format and non-null entries. Our dataset doesn’t contain invalid timestamp values in the date column. But if it did, you could use the new Boolean column to identify and fix those values.

Validate Timestamp time series transform

  1. Choose Add to save this step.
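
For reference, the same check can be approximated in pandas by attempting to parse the column and flagging the rows that fail; the following sketch uses a toy DataFrame with one malformed timestamp:

import pandas as pd

# Toy data with one malformed timestamp
df = pd.DataFrame({"date": ["2021-01-01 00:00", "not-a-date", "2021-01-01 00:02"],
                   "Volume USD": [1000.0, 1100.0, 1250.0]})

# Attempt to parse the column; unparseable values become NaT (not a time)
parsed = pd.to_datetime(df["date"], errors="coerce")

# Equivalent of the Indicate policy: a Boolean column flagging valid timestamps
df["date_is_valid"] = parsed.notna()

# Equivalent of the Drop policy would be: df = df[parsed.notna()]
print(df)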

Time series visualization

After we clean and validate the dataset, we can better visualize the data to understand its different components.

Resample

Because we’re interested in daily predictions, let’s transform the frequency of data to daily.

The Resample transformation changes the frequency of the time series observations to a specified granularity, and comes with both upsampling and downsampling options. Applying upsampling increases the frequency of the observations (for example from daily to hourly), whereas downsampling decreases the frequency of the observations (for example from hourly to daily).

Because our dataset is at minute granularity, let’s use the downsampling option.

  1. Choose + Add step.
  2. Choose the Time Series transform.
  3. For Transform, choose Resample.
  4. For Timestamp, choose date.
  5. For Frequency unit, choose Calendar day.
  6. For Frequency quantity, enter 1.
  7. For Method to aggregate numeric values, choose mean.
  8. Choose Preview.

The frequency of our dataset has changed from per minute to daily.

  1. Choose Add to save this step.
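
If you prefer to prototype the same downsampling in code, the equivalent pandas operation looks roughly like the following sketch, which aggregates toy minute-level data to daily means:

import pandas as pd

# Toy minute-level data spanning two days
idx = pd.date_range("2021-01-01", periods=2880, freq="min")
df = pd.DataFrame({"date": idx, "Volume USD": range(2880)})

# Downsample from minute to daily granularity, aggregating numeric values with the mean
daily = df.set_index("date").resample("1D").mean()
print(daily)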

Seasonal-Trend decomposition

After resampling, we can visualize the transformed series and its associated STL (Seasonal and Trend decomposition using LOESS) components using the Seasonal-Trend decomposition visualization. This breaks down the original time series into distinct trend, seasonality, and residual components, giving us a good understanding of how each pattern behaves. We can also use this information when modeling forecasting problems.

Data Wrangler uses LOESS, a robust and versatile statistical method for modeling trend and seasonal components. Its underlying implementation uses polynomial regression for estimating nonlinear relationships present in the time series components (seasonality, trend, and residual).
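
Data Wrangler performs the decomposition for you; if you want to experiment with a similar LOESS-based decomposition in a notebook, the STL implementation in statsmodels is one option. The following sketch decomposes a synthetic daily series with a weekly pattern and is only an approximation of what Data Wrangler runs internally:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Synthetic daily series with a trend and a weekly seasonal pattern, standing in for Volume USD
idx = pd.date_range("2021-01-01", periods=120, freq="D")
values = 100 + np.arange(120) * 0.5 + 10 * np.sin(2 * np.pi * np.arange(120) / 7)
series = pd.Series(values, index=idx)

# Decompose into trend, seasonal, and residual components
result = STL(series, period=7).fit()
print(result.trend.head())
print(result.seasonal.head())
print(result.resid.head())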

  1. Choose Back to data flow.
  2. Choose the plus sign next to the Steps on Data Flow.
  3. Choose Add analysis.
  4. In the Create analysis pane, for Analysis type, choose Time Series.
  5. For Visualization, choose Seasonal-Trend decomposition.
  6. For Analysis Name, enter a name.
  7. For Timestamp column, choose date.
  8. For Value column, choose Volume USD.
  9. Choose Preview.

The analysis allows us to visualize the input time series and decomposed seasonality, trend, and residual.

  1. Choose Save to save the analysis.

With the seasonal-trend decomposition visualization, we can generate four patterns, as shown in the preceding screenshot:

  • Original – The original time series re-sampled to daily granularity.
  • Trend – The polynomial trend with an overall negative trend pattern for the year 2021, indicating a decrease in Volume USD value.
  • Season – The multiplicative seasonality represented by the varying oscillation patterns. We see a decrease in seasonal variation, characterized by decreasing amplitude of oscillations.
  • Residual – The remaining residual or random noise. The residual series is the resulting series after the trend and seasonal components have been removed. Looking closely, we observe spikes between January and March, and between April and June, suggesting that such events could be modeled using historical data.

These visualizations give data scientists and analysts valuable leads on existing patterns and can help you choose a modeling strategy. However, it’s always a good practice to validate the output of STL decomposition with the information gathered through descriptive analysis and domain expertise.

To summarize, we observe a downward trend consistent with the original series visualization, which increases our confidence in incorporating the information conveyed by the trend visualization into downstream decision-making. In contrast, although the seasonality visualization confirms the presence of seasonality and the need to remove it by applying techniques such as differencing, it doesn’t provide the desired level of detailed insight into the various seasonal patterns present, thereby requiring deeper analysis.

Feature engineering

After we understand the patterns present in our dataset, we can start to engineer new features aimed to increase the accuracy of the forecasting models.

Featurize datetime

Let’s start the feature engineering process with more straightforward date/time features. Date/time features are created from the timestamp column and provide an optimal avenue for data scientists to start the feature engineering process. We begin with the Featurize datetime time series transformation to add the month, day of the month, day of the year, week of the year, and quarter features to our dataset. Because we’re providing the date/time components as separate features, we enable ML algorithms to detect signals and patterns for improving prediction accuracy.

  1. Choose + Add step.
  2. Choose the Time Series transform.
  3. For Transform, choose Featurize datetime.
  4. For Input Column, choose date.
  5. For Output Column, enter date (this step is optional).
  6. For Output mode, choose Ordinal.
  7. For Output format, choose Columns.
  8. For date/time features to extract, select Month, Day, Week of year, Day of year, and Quarter.
  9. Choose Preview.

The dataset now contains new columns named date_month, date_day, date_week_of_year, date_day_of_year, and date_quarter. The information retrieved from these new features can help data scientists derive additional insights from the data and better understand the relationship between the input features and the prediction target.

featurize datetime time series transform

  1. Choose Add to save this step.
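
For comparison, the same date/time components can be derived in pandas with the dt accessor; the following sketch mirrors the columns created by the transform (column names chosen to match the ones listed above):

import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2021-01-15", "2021-04-01", "2021-11-19"])})

# Extract the same date/time components the Featurize datetime transform produces
df["date_month"] = df["date"].dt.month
df["date_day"] = df["date"].dt.day
df["date_week_of_year"] = df["date"].dt.isocalendar().week
df["date_day_of_year"] = df["date"].dt.dayofyear
df["date_quarter"] = df["date"].dt.quarter
print(df)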

Encode categorical

Date/time features aren’t limited to integer values. You may also choose to consider certain extracted date/time features as categorical variables and represent them as one-hot encoded features, with each column containing binary values. The newly created date_quarter column contains values between 0 and 3, and can be one-hot encoded using four binary columns. Let’s create four new binary features, each representing the corresponding quarter of the year.

  1. Choose + Add step.
  2. Choose the Encode categorical transform.
  3. For Transform, choose One-hot encode.
  4. For Input column, choose date_quarter.
  5. For Output style, choose Columns.
  6. Choose Preview.
  7. Choose Add to add the step.
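
The equivalent operation in pandas is a one-liner with get_dummies, shown in the following sketch on a toy date_quarter column:

import pandas as pd

df = pd.DataFrame({"date_quarter": [0, 1, 2, 3, 1]})

# One-hot encode the quarter into four binary indicator columns
encoded = pd.get_dummies(df, columns=["date_quarter"], prefix="date_quarter")
print(encoded)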

Lag feature

Next, let’s create lag features for the target column Volume USD. Lag features in time series analysis are values at prior timestamps that are considered helpful in inferring future values. They also help identify autocorrelation (also known as serial correlation) patterns in the residual series by quantifying the relationship of an observation with observations at previous time steps. Autocorrelation is similar to regular correlation, but between the values in a series and its past values. It forms the basis for the autoregressive forecasting models in the ARIMA family.

With the Data Wrangler Lag feature transform, you can easily create lag features n periods apart. Additionally, we often want to create multiple lag features at different lags and let the model decide the most meaningful features. For such a scenario, the Lag features transform helps create multiple lag columns over a specified window size.

  1. Choose Back to data flow.
  2. Choose the plus sign next to the Steps on Data Flow.
  3. Choose + Add step.
  4. Choose Time Series transform.
  5. For Transform, choose Lag features.
  6. For Generate lag features for this column, choose Volume USD.
  7. For Timestamp Column, choose date.
  8. For Lag, enter 7.
  9. Because we’re interested in observing up to the previous seven lag values, let’s select Include the entire lag window.
  10. To create a new column for each lag value, select Flatten the output.
  11. Choose Preview.

Seven new columns are added, suffixed with the lag_number keyword for the target column Volume USD.

Lag feature time series transform

  1. Choose Add to save the step.
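
Under the hood, a lag feature is simply the series shifted by n periods. The following pandas sketch builds a comparable set of seven lag columns on a toy daily series; the exact column names Data Wrangler generates may differ:

import pandas as pd

idx = pd.date_range("2021-01-01", periods=10, freq="D")
df = pd.DataFrame({"Volume USD": [float(v) for v in range(100, 110)]}, index=idx)

# Create one column per lag value, covering the entire 7-step lag window
for lag in range(1, 8):
    df[f"Volume USD_lag_{lag}"] = df["Volume USD"].shift(lag)
print(df)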

Rolling window features

We can also calculate meaningful statistical summaries across a range of values and include them as input features. Let’s extract common statistical time series features.

Data Wrangler implements automatic time series feature extraction capabilities using the open source tsfresh package. With the time series feature extraction transforms, you can automate the feature extraction process. This eliminates the time and effort otherwise spent manually implementing signal processing libraries. For this post, we extract features using the Rolling window features transform. This method computes statistical properties across a set of observations defined by the window size.

  1. Choose + Add step.
  2. Choose the Time Series transform.
  3. For Transform, choose Rolling window features.
  4. For Generate rolling window features for this column, choose Volume USD.
  5. For Timestamp Column, choose date.
  6. For Window size, enter 7.

Specifying a window size of 7 computes features by combining the value at the current timestamp and values for the previous seven timestamps.

  1. Select Flatten to create a new column for each computed feature.
  2. Choose Minimal subset as your strategy.

This strategy extracts eight features that are useful in downstream analyses. Other strategies include Efficient Subset, Custom subset, and All features. For the full list of features available for extraction, refer to Overview on extracted features.

  1. Choose Preview.

We can see eight new columns, with the specified window size of 7 in their names, appended to our dataset.

  1. Choose Add to save the step.
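
Data Wrangler relies on tsfresh for its feature catalog, so the extracted feature set is richer than plain rolling statistics. Still, the following pandas sketch conveys the idea of rolling window features by computing a few statistics over the current value and the previous seven observations of a toy series:

import pandas as pd

idx = pd.date_range("2021-01-01", periods=30, freq="D")
df = pd.DataFrame({"Volume USD": [float(v) for v in range(100, 130)]}, index=idx)

# Rolling statistics over the current value plus the previous seven observations
window = df["Volume USD"].rolling(window=8)
df["Volume USD_roll_mean"] = window.mean()
df["Volume USD_roll_std"] = window.std()
df["Volume USD_roll_min"] = window.min()
df["Volume USD_roll_max"] = window.max()
print(df.tail())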

Export the dataset

We have transformed the time series dataset and are ready to use the transformed dataset as input for a forecasting algorithm. The last step is to export the transformed dataset to Amazon S3. In Data Wrangler, you can choose Export step to automatically generate a Jupyter notebook with Amazon SageMaker Processing code for processing and exporting the transformed dataset to an S3 bucket. However, because our dataset contains just over 300 records, let’s take advantage of the Export data option in the Add Transform view to export the transformed dataset directly to Amazon S3 from Data Wrangler.

  1. Choose Export data.

  1. For S3 location, choose Browse and choose your S3 bucket.
  2. Choose Export data.

Now that we have successfully transformed the bitcoin dataset, we can use Amazon Forecast to generate bitcoin predictions.

Clean up

If you’re done with this use case, clean up the resources you created to avoid incurring additional charges. For Data Wrangler, you can shut down the underlying instance when finished. Refer to the Shut Down Data Wrangler documentation for details. Alternatively, you can continue to Part 2 of this series to use this dataset for forecasting.

Summary

This post demonstrated how to utilize Data Wrangler to simplify and accelerate time series analysis using its built-in time series capabilities. We explored how data scientists can easily and interactively clean, format, validate, and transform time series data into the desired format, for meaningful analysis. We also explored how you can enrich your time series analysis by adding a comprehensive set of statistical features using Data Wrangler. To learn more about time series transformations in Data Wrangler, see Transform Data.


About the Author

Roop Bains is a Solutions Architect at AWS focusing on AI/ML. He is passionate about helping customers innovate and achieve their business objectives using Artificial Intelligence and Machine Learning. In his spare time, Roop enjoys reading and hiking.

Nikita Ivkin is an Applied Scientist, Amazon SageMaker Data Wrangler.

Read More