Amazon SageMaker Data Wrangler for dimensionality reduction

In machine learning (ML), the quality of the dataset strongly affects a model's predictive performance. Although more data is usually better, large datasets with a high number of features can sometimes lead to suboptimal model performance due to the curse of dimensionality. Analysts can spend a significant amount of time transforming data to improve model performance. Additionally, models trained on large datasets cost more and take longer to train. If time is a constraint, model performance may be limited as a result.

Dimensionality reduction techniques can help reduce the size of your data while retaining most of its information, resulting in quicker training times, lower cost, and potentially higher-performing models.

Amazon SageMaker Data Wrangler is a purpose-built data aggregation and preparation tool for ML. Data Wrangler simplifies data preparation and feature engineering tasks such as data selection, cleansing, exploration, and visualization from a single visual interface. Data Wrangler has more than 300 preconfigured data transformations that you can use to transform your data. In addition, you can write custom transformations in PySpark, SQL, and pandas.

Today, we’re excited to add a new transformation technique that is commonly used in the ML world to the list of Data Wrangler pre-built transformations: dimensionality reduction using Principal Component Analysis. With this new feature, you can reduce a high number of dimensions in your datasets to a number that works well with popular ML algorithms, with just a few clicks on the Data Wrangler console. This can significantly improve your model performance with minimal effort.

In this post, we provide an overview of this new feature and show how to use it in your data transformation. We will show how to use dimensionality reduction on large sparse datasets.

Overview of Principal Component Analysis

Principal Component Analysis (PCA) is a method that transforms a dataset with many numerical features into one with fewer features while retaining as much information as possible from the original dataset. It does this by finding a new set of features called components, which are composites of the original features and are uncorrelated with one another. Several features in a dataset often have little impact on the final result and may increase the processing time of ML models. High-dimensional problems can also be difficult for humans to understand and solve. Dimensionality reduction techniques like PCA can help with both issues.
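The idea of components as uncorrelated composites of the original features can be illustrated with a minimal NumPy sketch. This is a textbook eigendecomposition of the covariance matrix on synthetic data, not the implementation Data Wrangler uses; the function name pca_components and the synthetic dataset are assumptions made for illustration.

```python
import numpy as np


def pca_components(X, n_components):
    """Project X (rows = samples, columns = numeric features) onto its
    top principal components via the covariance eigendecomposition."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained_ratio = eigvals / eigvals.sum()
    scores = X_centered @ eigvecs[:, :n_components]
    return scores, explained_ratio[:n_components]


# Synthetic data (hypothetical): three features, two of them strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([
    base,
    2.0 * base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

scores, ratio = pca_components(X, n_components=2)
print("reduced shape:", scores.shape)
print("explained variance ratio:", np.round(ratio, 3))
# Off-diagonal entries are numerically zero: the components are uncorrelated
print(np.round(np.corrcoef(scores, rowvar=False), 6))
```

Because the two correlated features carry nearly the same information, two components are enough to capture most of the variance of the three original columns.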

Solution overview

In this post, we show how you can use the dimensionality reduction transform in Data Wrangler on the MNIST dataset to reduce the number of features by 85% while achieving accuracy similar to that of the original dataset. The MNIST (Modified National Institute of Standards and Technology) dataset, the de facto “hello world” dataset in computer vision, is a dataset of handwritten digit images. Each row of the dataset corresponds to a single image that is 28 x 28 pixels, for a total of 784 pixels. Each pixel is represented by a single feature in the dataset, with a pixel value ranging from 0–255.

To learn more about the new dimensionality reduction feature, refer to Reduce Dimensionality within a Dataset.

Prerequisites

This post assumes that you have an Amazon SageMaker Studio domain set up. For details on how to set it up, refer to Onboard to Amazon SageMaker Domain Using Quick setup.

To get started with the new capabilities of Data Wrangler, open Studio after upgrading to the latest release and choose the File menu, New, and Flow, or choose New data flow from the Studio launcher.

Perform a Quick Model analysis

The dataset we use in this post contains 60,000 training examples and labels. Each row consists of 785 values: the first value is the label (a number from 0–9) and the remaining 784 values are the pixel values (a number from 0–255). First, we perform a Quick Model analysis on the raw data to get performance metrics and compare them with the model metrics post-PCA transformations for evaluation. Complete the following steps:

Download the MNIST training dataset.
Extract the data from the .zip file and upload into an Amazon Simple Storage Service (Amazon S3) bucket.
In Studio, choose New and Data Wrangler Flow to create a new Data Wrangler flow.
Choose Import data to load the data from Amazon S3.
Choose Amazon S3 as your data source.
Select the dataset uploaded to your S3 bucket.
Leave the default settings and choose Import.
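Before importing, it can help to sanity-check the row layout described at the start of this section: one label followed by 784 pixel values. The following is a minimal sketch that assumes a CSV file with the label in the first position and pixels in row-major order; the function name parse_mnist_row and the one-row sample are hypothetical. If your file has a header row, skip it before parsing.

```python
import csv
import io

IMAGE_SIDE = 28
N_PIXELS = IMAGE_SIDE * IMAGE_SIDE  # 784 pixel features per image


def parse_mnist_row(row):
    """Split one CSV row into (label, 28 x 28 pixel grid) and validate ranges."""
    values = [int(v) for v in row]
    if len(values) != N_PIXELS + 1:
        raise ValueError(f"expected {N_PIXELS + 1} values, got {len(values)}")
    label, pixels = values[0], values[1:]
    if not 0 <= label <= 9:
        raise ValueError(f"label out of range: {label}")
    if any(p < 0 or p > 255 for p in pixels):
        raise ValueError("pixel value outside 0-255")
    # Assumption: pixels are stored row by row (row-major order)
    grid = [pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE] for r in range(IMAGE_SIDE)]
    return label, grid


# Hypothetical one-row sample: label 7, all pixels 0 except the last one
sample_csv = ",".join(["7"] + ["0"] * 783 + ["255"]) + "\n"
for row in csv.reader(io.StringIO(sample_csv)):
    label, grid = parse_mnist_row(row)
    print(label, len(grid), len(grid[0]), grid[27][27])
```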

After the data is imported, Data Wrangler automatically validates the dataset and detects the data types for all the columns based on a sample of the data. In the MNIST dataset, because all the columns are of type long, we leave this step as is and go back to the data flow.

Choose Data flow at the top of the Data types page to return to the main data flow.

The flow editor now shows two blocks, indicating that the data was imported from a source and that the data types were recognized. You can also edit the data types if needed.

After confirming that the data quality is acceptable, we go back to the data flow and use Data Wrangler’s Data Quality and Insights Report. This report performs an analysis on the imported dataset and provides information about missing values, outliers, target leakage, imbalanced data, and a Quick Model analysis. Refer to Get Insights On Data and Data Quality for more information.

For this analysis, we only focus on the Quick Model part of the Data Quality report.

Choose the plus sign next to Data types, then choose Add analysis.
For Analysis type, choose Data Quality And Insights Report.
For Target column, choose label.
For Problem type, select Classification (this step is optional).
Choose Create.

For this post, we use the Data Quality and Insights Report to show how the model performance is mostly preserved using PCA. We recommend that you use a deep learning-based approach for better performance.

The following screenshot shows a summary of the dataset from the report. Fortunately, we don’t have any missing values. The time taken for the report to generate depends on the size of the dataset, number of features, and the instance size used by Data Wrangler.

The following screenshot shows how the model performed on the raw dataset. Here we notice that the model has an accuracy of 93.7% utilizing 784 features.

Use the Data Wrangler dimensionality reduction transform

Now let’s use the Data Wrangler dimensionality reduction transform to reduce the number of features in this dataset.

On the data flow page, choose the plus sign next to Data types, then choose Add transform.
Choose Add step.
Choose Dimensionality Reduction.

If you don’t see the dimensionality reduction option listed, you need to update Data Wrangler. For instructions, refer to Update Data Wrangler.

Configure the key variables that go into PCA:
1. For Transform, choose the dimensionality reduction technique that you want to use. For this post, we choose Principal component analysis.
2. For Input Columns, choose the columns that you want to include in the PCA analysis. For this example, we choose all the features except the target column label (you can also use the Select all feature to select all features and deselect features not needed). These columns need to be of numeric data type.
3. For Number of principal components, specify the number of target dimensions.
4. For Variance threshold percentage, specify the percentage of variation in the data that you want to explain by the principal components. The default value is 95; for this post, we use 80.
5. Select Center to center the data with the mean before scaling.
6. Select Scale to scale the data with the unit standard deviation.
  PCA gives more emphasis to variables with high variance. Therefore, if the features are not scaled, the results depend on the units and ranges of the individual features. For example, the values of one variable might lie in the range 50–100 and those of another in the range 5–10. In this case, PCA gives more weight to the first variable. You can resolve such issues by scaling the dataset before applying PCA.
7. For Output Format, specify if you want to output components into separate columns or vectors. For this post, we choose Columns.
8. For Output column, enter a prefix for column names generated by PCA. For this post, we enter PCA80_.
Choose Preview to preview the data, then choose Update.
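The effect of the Center, Scale, and Variance threshold percentage settings can be sketched outside Data Wrangler with NumPy. This is an illustrative approximation under stated assumptions, not the code Data Wrangler runs: the function name pca_by_variance_threshold is hypothetical, and the two synthetic features mirror the 50–100 and 5–10 ranges from the scaling example above.

```python
import numpy as np


def pca_by_variance_threshold(X, threshold_pct=80.0, center=True, scale=True):
    """Keep the smallest number of components whose cumulative explained
    variance reaches threshold_pct. Returns (projected data, k, cumulative ratio)."""
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    if scale:
        std = X.std(axis=0)
        std[std == 0] = 1.0  # leave constant columns (e.g. blank pixels) unchanged
        X = X / std
    # Squared singular values are proportional to the variance along each component
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, threshold_pct / 100.0)) + 1
    k = min(k, len(s))
    return X @ vt[:k].T, k, float(cumulative[k - 1])


# Two independent synthetic features on very different ranges
rng = np.random.default_rng(0)
wide = rng.uniform(50, 100, size=1000)
narrow = rng.uniform(5, 10, size=1000)
X = np.column_stack([wide, narrow])

_, k_unscaled, cum_unscaled = pca_by_variance_threshold(X, 80.0, scale=False)
projected, k_scaled, cum_scaled = pca_by_variance_threshold(X, 80.0, scale=True)
print("components kept without scaling:", k_unscaled)
print("components kept with scaling:", k_scaled)
```

Without scaling, the wide-range feature dominates the total variance, so a single component passes the 80% threshold and the narrow-range feature is effectively ignored. With scaling, both features contribute comparably and both components are kept.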

After applying PCA, the number of feature columns is reduced from 784 to 115, which is an 85% reduction in the number of features.

We can now use the transformed dataset and generate another Data Quality and Insights Report as shown in the following screenshot to observe the model performance.

In the second analysis, the Quick Model accuracy is 91.8%, compared to 93.7% in the first Quick Model report. PCA reduced the number of features in our dataset by 85% while keeping the model accuracy at a similar level. For better results, you can try deep learning models, which might offer even better performance.

We found the following comparison in training time using Amazon SageMaker Autopilot with and without PCA dimensionality reduction:

With PCA dimensionality reduction – 25 minutes
Without PCA dimensionality reduction – 45 minutes
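The reported figures can be put side by side with a few lines of arithmetic. The numbers below are taken from this post; the helper name percent_reduction is only for illustration.

```python
def percent_reduction(before, after):
    """Relative reduction from before to after, as a percentage."""
    return 100.0 * (before - after) / before


feature_reduction = percent_reduction(784, 115)   # feature columns before and after PCA
time_reduction = percent_reduction(45, 25)        # Autopilot training minutes
accuracy_change = round(93.7 - 91.8, 1)           # Quick Model accuracy, percentage points

print(f"features: {feature_reduction:.1f}% fewer")              # 85.3% fewer
print(f"training time: {time_reduction:.1f}% shorter")          # 44.4% shorter
print(f"accuracy: {accuracy_change} percentage points lower")   # 1.9 percentage points lower
```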

Operationalizing PCA

As data changes over time, it’s often desirable to refit our parameters on new, unseen data. Data Wrangler offers this capability through refitting parameters. For more information on refitting trained parameters, refer to Refit trained parameters on large datasets using Amazon SageMaker Data Wrangler.

Previously, we applied PCA to a sample of the MNIST dataset containing 50,000 rows. Consequently, our flow file contains a model that has been trained on this sample and is used for all created jobs unless we specify that we want to relearn those parameters.

To refit your model parameters on the MNIST training dataset, complete the following steps:

Create a destination for our flow file in Amazon S3 so we can create a Data Wrangler processing job.
Create a job and select Refit to learn new training parameters.

The Trained parameters section shows that there are 784 parameters. That is one parameter for each input column, because we excluded the label column from our PCA reduction.

Note that if we don’t select Refit in this step, the trained parameters learned during interactive mode will be used.

Create the job.
Choose the processing job link to monitor the job and find the location of the resulting flow file on Amazon S3.

This flow file contains the model learned on the entire MNIST train dataset.

Load this file into Data Wrangler.
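The difference between reusing parameters learned on a sample and refitting them on the full dataset can be illustrated conceptually. The following sketch uses a simple center-and-scale step as a stand-in for the trained parameters; it is not how Data Wrangler stores or refits parameters, and the names fit_scaler and apply_scaler as well as the toy data are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn centering and scaling parameters from the given values."""
    std = pstdev(values)
    return {"mean": fmean(values), "std": std if std > 0 else 1.0}


def apply_scaler(params, values):
    """Apply previously learned parameters without relearning them."""
    return [(v - params["mean"]) / params["std"] for v in values]


# Toy stand-ins: the "full" dataset and the smaller sample seen interactively
full = [float(v) for v in range(1, 61)]
sample = full[:50]

sample_params = fit_scaler(sample)   # parameters learned in interactive mode
refit_params = fit_scaler(full)      # parameters relearned on all the data

print("mean learned on sample:", sample_params["mean"])
print("mean after refit:", refit_params["mean"])
# Only the refit parameters center the full dataset exactly
print("full-data mean with sample params:", round(fmean(apply_scaler(sample_params, full)), 3))
print("full-data mean with refit params:", round(fmean(apply_scaler(refit_params, full)), 3))
```

When the sample is not fully representative, parameters learned on it are slightly off for the full dataset, which is the motivation for refitting.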

Clean up

To clean up the environment so you don’t incur additional charges, delete the datasets and artifacts in Amazon S3. Additionally, delete the data flow file in Studio and shut down the instance it runs on. Refer to Shut Down Data Wrangler for more information.

Conclusion

Dimensionality reduction is a useful technique for removing unwanted variables from a model. It can be used to reduce the model complexity and noise in the data, thereby mitigating the common problem of overfitting in machine learning and deep learning models. In this post, we demonstrated that after reducing the number of features by 85%, we were still able to achieve similar accuracy for our models.

For more information about using PCA, refer to Principal Component Analysis (PCA) Algorithm. To learn more about the dimensionality reduction transform, refer to Reduce Dimensionality within a Dataset.

About the authors

Adeleke Coker is a Global Solutions Architect with AWS. He works with customers globally to provide guidance and technical assistance in deploying production workloads at scale on AWS. In his spare time, he enjoys learning, reading, gaming, and watching sporting events.

Abigail is a Software Development Engineer at Amazon SageMaker. She is passionate about helping customers prepare their data in Data Wrangler and building distributed machine learning systems. In her free time, Abigail enjoys traveling, hiking, skiing, and baking.

Vishaal Kapoor is a Senior Applied Scientist with AWS AI. He is passionate about helping customers understand their data in Data Wrangler. In his spare time, he mountain bikes, snowboards, and spends time with his family.

Raviteja Yelamanchili is an Enterprise Solutions Architect with Amazon Web Services based in New York. He works with large financial services enterprise customers to design and deploy highly secure, scalable, reliable, and cost-effective applications on the cloud. He brings over 11 years of risk management, technology consulting, data analytics, and machine learning experience. When he is not helping customers, he enjoys traveling and playing PS5.
