Manage AutoML workflows with AWS Step Functions and AutoGluon on Amazon SageMaker

Running machine learning (ML) experiments in the cloud can span many services and components. The ability to structure, automate, and track ML experiments is essential to enable rapid development of ML models. With the latest advancements in the field of automated machine learning (AutoML), namely the area of ML dedicated to the automation of ML processes, you can build accurate decision-making models without needing deep ML knowledge. In this post, we look at AutoGluon, an open-source AutoML framework that allows you to build accurate ML models with just a few lines of Python.

AWS offers a wide range of services to manage and run ML workflows, allowing you to select a solution based on your skills and application. For example, if you already use AWS Step Functions to orchestrate the components of distributed applications, you can use the same service to build and automate your ML workflows. Other MLOps tools offered by AWS include Amazon SageMaker Pipelines, which enables you to build ML models in Amazon SageMaker Studio with MLOps capabilities (such as CI/CD compatibility, model monitoring, and model approvals). Open-source tools, such as Apache Airflow (available on AWS through Amazon Managed Workflows for Apache Airflow) and Kubeflow, as well as hybrid solutions, are also supported. For example, you can manage data ingestion and processing with Step Functions while training and deploying your ML models with SageMaker Pipelines.

In this post, we show how even developers without ML expertise can easily build and maintain state-of-the-art ML models using AutoGluon on Amazon SageMaker and Step Functions to orchestrate workflow components.

After an overview of the AutoGluon algorithm, we present the workflow definitions along with examples and a code tutorial that you can apply to your own data.

AutoGluon

AutoGluon is an open-source AutoML framework that accelerates the adoption of ML by training accurate ML models with just a few lines of Python code. Although this post focuses on tabular data, AutoGluon also allows you to train state-of-the-art models for image classification, object detection, and text classification. AutoGluon tabular creates and combines different models to find the optimal solution.

The AutoGluon team at AWS released a paper that presents the principles that structure the library:

Simplicity – You can create classification and regression models directly from raw data without having to analyze the data or perform feature engineering
Robustness – The overall training process should succeed even if some of the individual models fail
Predictable timing – You can get the best results achievable within the time that you choose to invest in training
Fault tolerance – You can stop the training and resume it at any time, which reduces costs when the process runs on Spot Instances in the cloud

For more details about the algorithm, refer to the paper released by the AutoGluon team at AWS.
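
To make the robustness and predictable timing principles concrete, the following toy sketch trains a set of candidate models under a time budget and skips any candidate that fails. It is not AutoGluon's implementation: the function name fit_within_budget, the candidate callables, and the scores are hypothetical placeholders chosen for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def fit_within_budget(
    candidates: Dict[str, Callable[[], float]],
    time_limit: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[str], Dict[str, float]]:
    """Train candidates in order until the time budget runs out.

    Each candidate is a hypothetical callable that trains one model and
    returns its validation score (higher is better).
    """
    start = clock()
    scores: Dict[str, float] = {}
    for name, train in candidates.items():
        if clock() - start >= time_limit:
            break  # predictable timing: do not start new models past the budget
        try:
            scores[name] = train()
        except Exception:
            continue  # robustness: one failing model does not abort the run
    best = max(scores, key=scores.get) if scores else None
    return best, scores


def broken_model() -> float:
    """Stand-in for a model whose training crashes."""
    raise RuntimeError("simulated training failure")


best, scores = fit_within_budget(
    {"linear": lambda: 0.71, "broken": broken_model, "boosted_trees": lambda: 0.83},
    time_limit=60.0,
)
print(best, scores)
```

The run completes and reports the best surviving candidate even though one of the three raised an exception.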

After you install the AutoGluon package and its dependencies, training a model is as easy as writing three lines of code:

```
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset('s3://my-bucket/datasets/my-csv.csv')
predictor = TabularPredictor(label="my-label", path="my-output-folder").fit(train_data)
```

The AutoGluon team demonstrated the strength of the framework by reaching the top 10 on the leaderboard in multiple Kaggle competitions.

Solution overview

We use Step Functions to implement an ML workflow that covers training, evaluation, and deployment. The pipeline design enables fast and configurable experiments by modifying the input parameters that you feed into the pipeline at runtime.

You can configure the pipeline to implement different workflows, such as the following:

Train a new ML model and store it in the SageMaker model registry, if no deployment is needed at this point
Deploy a pre-trained ML model, either for online (SageMaker endpoint) or offline (SageMaker batch transform) inference
Run a complete pipeline to train, evaluate, and deploy an ML model from scratch
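
As a rough illustration, the three workflows above could map to execution inputs like the ones built below. The field names (Flags, Train, Evaluate, Deploy, Mode) are hypothetical and are not taken from the repository; refer to the sample input files there for the actual schema.

```python
import json
from typing import Any, Dict


def build_input(train: bool, evaluate: bool, deploy: bool,
                deploy_mode: str = "endpoint") -> Dict[str, Any]:
    """Build a hypothetical execution input for the main state machine."""
    if deploy_mode not in ("endpoint", "batch"):
        raise ValueError(f"unknown deploy mode: {deploy_mode}")
    params: Dict[str, Any] = {
        "Flags": {"Train": train, "Evaluate": evaluate, "Deploy": deploy}
    }
    if deploy:
        # Only deployment runs need to say which inference mode to use
        params["Deploy"] = {"Mode": deploy_mode}
    return params


WORKFLOWS = {
    # Train a new model and register it, without deploying
    "train_and_register": build_input(train=True, evaluate=False, deploy=False),
    # Deploy a pre-trained model for offline (batch) inference
    "deploy_pretrained": build_input(train=False, evaluate=False, deploy=True,
                                     deploy_mode="batch"),
    # Train, evaluate, and deploy to an endpoint
    "full_pipeline": build_input(train=True, evaluate=True, deploy=True),
}

for name, params in WORKFLOWS.items():
    print(name, json.dumps(params))
```

Each JSON string printed here is the kind of document you would pass as the execution input when starting the state machine.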

The solution consists of a general state machine (see the following diagram) that orchestrates the set of actions to be run based on a set of input parameters.

The steps of the state machine are as follows:

The first step, IsTraining, decides whether we’re using a pre-trained model or training a model from scratch. If using a pre-trained model, the state machine skips directly to the IsDeploy step.
When a new ML model is required, TrainSteps triggers a second state machine that performs all the necessary actions and returns the result to the current state machine. We go into more detail about the training state machine in the next section.
When training is finished, PassModelName stores the training job name in a specified location of the state machine context to be reused in the following states.
If an evaluation phase is selected, IsEvaluation redirects the state machine towards the evaluation branch. Otherwise, it skips directly to the IsDeploy step.
The evaluation phase is then implemented using an AWS Lambda function invoked by the ModelValidation step. The Lambda function retrieves the model’s performance on a test set and compares it with a user-configurable threshold specified in the input parameters. The following code is an example of evaluation results:
```
"Payload":{
   "IsValid":true,
   "Scores":{
      "accuracy":0.9187,
      "balanced_accuracy":0.7272,
      "mcc":0.5403,
      "roc_auc":0.9489,
      "f1":0.5714,
      "precision":0.706,
      "recall":0.4799
   }
}
```
If the model evaluation at EvaluationResults is successful, the state machine continues with the deployment steps, if any are selected. If the model performs below the user-defined threshold, the state machine stops and deployment is skipped.
If deployment is selected, IsDeploy starts a third state machine through DeploySteps, which we describe later in this post. If deployment is not needed, the state machine stops here.
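
The routing described in the preceding steps can be sketched in plain Python. This is a simulation of the control flow only, not the actual state machine definition: the state names come from this post, whereas the input fields (Flags, Metric, Threshold) and the single-metric comparison in model_validation are assumptions made for illustration.

```python
from typing import Any, Dict, List


def model_validation(scores: Dict[str, float], metric: str,
                     threshold: float) -> Dict[str, Any]:
    """Mimic the ModelValidation Lambda: compare one test metric with a threshold."""
    return {"IsValid": scores[metric] >= threshold, "Scores": scores}


def trace_main_state_machine(params: Dict[str, Any],
                             scores: Dict[str, float]) -> List[str]:
    """Return the states visited for a given input (field names are hypothetical)."""
    flags = params["Flags"]
    visited = ["IsTraining"]
    if flags["Train"]:
        visited += ["TrainSteps", "PassModelName", "IsEvaluation"]
        if flags["Evaluate"]:
            visited.append("ModelValidation")
            result = model_validation(scores, params["Metric"], params["Threshold"])
            visited.append("EvaluationResults")
            if not result["IsValid"]:
                return visited  # below threshold: stop and skip deployment
    visited.append("IsDeploy")
    if flags["Deploy"]:
        visited.append("DeploySteps")
    return visited


# Scores copied from the example evaluation payload above
test_scores = {"accuracy": 0.9187, "roc_auc": 0.9489, "f1": 0.5714}
full_run = {"Flags": {"Train": True, "Evaluate": True, "Deploy": True},
            "Metric": "roc_auc", "Threshold": 0.9}
strict_run = {**full_run, "Metric": "f1", "Threshold": 0.6}
pretrained = {"Flags": {"Train": False, "Evaluate": False, "Deploy": True}}

for run in (full_run, strict_run, pretrained):
    print(" -> ".join(trace_main_state_machine(run, test_scores)))
```

The second run stops at EvaluationResults because the F1 score of 0.5714 is below the 0.6 threshold, while the third run jumps from IsTraining straight to IsDeploy.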

A set of input parameter samples is available on the GitHub repo.

Training state machine

The state machine for training a new ML model using AutoGluon consists of two steps, as illustrated in the following diagram. The first step is a SageMaker training job that creates the model. The second registers the resulting model in the SageMaker model registry.

You can run these steps either automatically as part of the main state machine, or as a standalone process.
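
A two-step workflow of this shape could be expressed in Amazon States Language roughly as follows, written here as a Python dictionary. This is a sketch under assumptions rather than the definition used in the repository: the state names, input paths, instance settings, and the use of a Lambda function (with the placeholder name register-model-function) for the registry step are all hypothetical.

```python
import json
from typing import Any, Dict, List

training_definition: Dict[str, Any] = {
    "Comment": "Sketch of a two-step training workflow (placeholder values)",
    "StartAt": "TrainModel",
    "States": {
        "TrainModel": {
            "Type": "Task",
            # SageMaker service integration; .sync waits for the job to finish
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {
                "TrainingJobName.$": "$.TrainingJobName",
                "RoleArn.$": "$.RoleArn",
                "AlgorithmSpecification": {
                    "TrainingImage.$": "$.TrainingImage",
                    "TrainingInputMode": "File",
                },
                "InputDataConfig": [{
                    "ChannelName": "training",
                    "DataSource": {"S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri.$": "$.TrainDataUri",
                        "S3DataDistributionType": "FullyReplicated",
                    }},
                }],
                "OutputDataConfig": {"S3OutputPath.$": "$.OutputUri"},
                "ResourceConfig": {"InstanceCount": 1,
                                   "InstanceType": "ml.m5.2xlarge",
                                   "VolumeSizeInGB": 30},
                "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
            },
            "ResultPath": "$.TrainingResult",
            "Next": "RegisterModel",
        },
        "RegisterModel": {
            "Type": "Task",
            # Assumption: a Lambda function writes the model registry entry
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "register-model-function",
                           "Payload.$": "$"},
            "End": True,
        },
    },
}


def linear_order(definition: Dict[str, Any]) -> List[str]:
    """Follow Next links from StartAt until a state marked End is reached."""
    order, name = [], definition["StartAt"]
    while True:
        order.append(name)
        state = definition["States"][name]
        if state.get("End"):
            return order
        name = state["Next"]


print(linear_order(training_definition))
definition_json = json.dumps(training_definition, indent=2)
```

The definition_json string is what you would supply as the state machine definition when creating it.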

Deployment state machine

Let’s now look at the state machine dedicated to the deployment phase (see the following diagram). As mentioned earlier, the architecture supports both online and offline deployment. The former consists of deploying a SageMaker endpoint, whereas the latter runs a SageMaker batch transform job.

The implementation steps are as follows:

ChoiceDeploymentMode looks into the input parameters to define which deployment mode is needed and directs the state machine towards the corresponding branch.
If an endpoint is chosen, the EndpointConfig step defines its configuration, while CreateEndpoint starts the process of allocating the required computing resources. This allocation can take several minutes, so the state machine pauses at WaitForEndpoint and uses a Lambda function to poll the endpoint status.
While the endpoint is being configured, ChoiceEndpointStatus returns to the WaitForEndpoint state; otherwise, it continues to either DeploymentFailed or DeploymentSucceeded.
If offline deployment is selected, the state machine runs a SageMaker batch transform job, after which the state machine stops.
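
The wait-and-poll loop on the endpoint branch can be sketched as follows. The state names come from this post, and the status values (Creating, InService, Failed) are SageMaker endpoint statuses; the get_status callable, the polling interval, the poll limit, and the decision to treat a timeout as a failure are assumptions for illustration. In a real Lambda function, get_status would typically wrap a call that describes the endpoint and reads its status.

```python
import time
from typing import Callable


def choice_endpoint_status(status: str) -> str:
    """Map an endpoint status to the next state, as ChoiceEndpointStatus does."""
    if status == "InService":
        return "DeploymentSucceeded"
    if status == "Failed":
        return "DeploymentFailed"
    return "WaitForEndpoint"  # for example "Creating": keep polling


def wait_for_endpoint(
    get_status: Callable[[], str],
    wait_seconds: float = 30.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the endpoint status until it reaches a terminal state."""
    for _ in range(max_polls):
        sleep(wait_seconds)  # WaitForEndpoint: pause before checking again
        next_state = choice_endpoint_status(get_status())
        if next_state != "WaitForEndpoint":
            return next_state
    return "DeploymentFailed"  # assumption: treat a timeout as a failure


# Simulated status sequence; no AWS calls are made and no real waiting happens
statuses = iter(["Creating", "Creating", "InService"])
outcome = wait_for_endpoint(lambda: next(statuses), sleep=lambda seconds: None)
print(outcome)  # DeploymentSucceeded
```

Passing the status lookup and the sleep function as arguments keeps the loop testable without creating an endpoint.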

Conclusion

This post presents an easy-to-use pipeline to orchestrate AutoML workflows and enable fast experiments in the cloud, allowing for accurate ML solutions without requiring advanced ML knowledge.

We provide a general pipeline as well as two modular ones that allow you to perform training and deployment separately if needed. Moreover, the solution is fully integrated with SageMaker, benefitting from its features and computational resources.

Get started now with this code tutorial to deploy the resources presented in this post into your AWS account and run your first AutoML experiments.

About the Authors

Federico Piccinini is a Deep Learning Architect for the Amazon Machine Learning Solutions Lab. He is passionate about machine learning, explainable AI, and MLOps. He focuses on designing ML pipelines for AWS customers. Outside of work, he enjoys sports and pizza.

Paolo Irrera is a Data Scientist at the Amazon Machine Learning Solutions Lab, where he helps customers address business problems with ML and cloud capabilities. He holds a PhD in Computer Vision from Telecom ParisTech, Paris.
