How to Build an Interactive Chat-Generation Model using DialoGPT and PyTorch

The focus on interactive chat-generation (or conversational response-generation) models has greatly increased in the past several months. Conversational response-generation models such as ChatGPT and Google Bard have taken the AI world by storm. The purpose of interactive chat generation is to answer various questions posed by humans, and these AI-based models use natural language processing (NLP) to generate conversations almost indistinguishable from those generated by humans.

This article showcases a code sample on how to create interactive chats based on a pre-trained DialoGPT model from Hugging Face with the addition of the Intel® Extension for PyTorch to perform dynamic quantization on the model.

Get Started

Why DialoGPT?

DialoGPT (Dialogue Generative Pre-trained Transformer) is a large-scale, pre-trained dialogue-response-generation model trained on 147M conversation-like exchanges extracted from Reddit comment chains and discussion threads. DialoGPT was proposed by Microsoft in 2019. The main goal was to create open-domain chatbots capable of producing natural responses to a variety of conversational topics. The conversational response-generation systems that leverage DialoGPT generate more applicable, resourceful, diverse, and context-specific replies.

DialoGPT Architecture

DialoGPT architecture is based on the GPT-2 model. It is formulated as an autoregressive language model and uses a multi-layer transformer as the model architecture. GPT-2 was proposed by OpenAI. GPT-2 models are trained on general text data whereas DialoGPT is trained on Reddit discussion threads.

Let’s look at the GPT-2 architecture. There are two types of blocks in general transformer architecture:

  • Encoder – contains self-attention layer and feed-forward neural network
  • Decoder – similar to encoder, but the self-attention layer is masked

The self-attention layer allows a position to peek at tokens to the right of the current word (the successive words in text), whereas the masked self-attention layer prevents that from happening.
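To make the difference concrete, here is a minimal pure-Python sketch (not taken from the GPT-2 or DialoGPT source) that turns one row of attention scores into weights with and without a causal mask. The function name `attention_weights` and the score values are illustrative assumptions.

```python
import math

def attention_weights(scores, position, masked):
    """Softmax over attention scores for the token at `position`.

    When `masked` is True, tokens to the right of `position` receive zero
    weight, as in a decoder's masked self-attention.
    """
    visible = [
        s if (not masked or j <= position) else float("-inf")
        for j, s in enumerate(scores)
    ]
    peak = max(visible)
    exps = [math.exp(s - peak) for s in visible]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw scores of the token at index 1 against a 4-token sequence.
scores = [2.0, 1.0, 0.5, 3.0]
full = attention_weights(scores, position=1, masked=False)
causal = attention_weights(scores, position=1, masked=True)
print([round(w, 3) for w in full])
print([round(w, 3) for w in causal])
```

In the masked case the weights for positions 2 and 3 are exactly zero, so the token at position 1 can only draw on itself and the token before it.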

self-attention layer vs masked self-attention layer

GPT-2 is built using transformer decoder blocks. This means that the following layers are used in the architecture:

  1. Embedding Layer – responsible for converting input text into embeddings (each word is converted to a fixed-length vector representation)
  2. Transformer Decoder – includes multiple decoder blocks with masked self-attention and feed forward neural network layers
  3. Output Layer – responsible for converting embeddings obtained from the decoder into words

GPT-2 architecture (and DialoGPT architecture) is shown below.

GPT-2 architecture

As the model is based on the transformer architecture, it tends to repeat itself and copy its inputs. To avoid repetition, we can use Top-K sampling and Top-p sampling.

  • Top-K sampling – filters the K most likely next words and redistributes the probability mass among only those K next words.
  • Top-p sampling – rather than selecting only the most likely K words, selects the smallest possible set of words whose cumulative probability exceeds the probability p.

The probability mass is then redistributed among the words in the set. As a result, the size of the set of words can be dynamically increased and decreased based on the probability distribution of the next word.
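The two filters can be sketched in plain Python as follows. This is an illustration of the idea rather than the Hugging Face implementation; the helper names and the toy distribution are assumptions made for the example.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability exceeds p, then renormalize."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative > p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical next-token distribution.
next_token_probs = {"the": 0.5, "a": 0.25, "cat": 0.125, "dog": 0.0625, "zebra": 0.0625}
top_k = top_k_filter(next_token_probs, k=2)
top_p = top_p_filter(next_token_probs, p=0.8)
print(sorted(top_k))
print(sorted(top_p))
```

With this distribution, Top-K with K=2 always keeps two tokens, while Top-p with p=0.8 keeps three because the first two only cover 0.75 of the probability mass.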

Quantization using Intel® Extension for PyTorch

What is Quantization?

Quantization is a systematic reduction of the precision of all or several layers within the model. This means a higher-precision type, such as the single-precision floating-point (FP32) mostly used in deep learning, is converted into a lower-precision type such as FP16 (16 bits) or INT8 (8 bits).

This helps achieve:

  • lower memory bandwidth
  • lower storage
  • higher performance with minimum-to-zero accuracy loss

Quantization is especially important with large models such as those based on the Transformer architecture like BERT or GPT.
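As a rough illustration of what reducing precision means, the sketch below maps a few floating-point values to INT8 with a single scale factor and back. It assumes a simple symmetric, per-tensor scheme chosen for clarity; it is not necessarily the scheme Intel Extension for PyTorch applies, and the weight values are made up.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the INT8 values."""
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.05, 0.4]  # made-up FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value now fits in one byte instead of four, and the round-trip error stays within half a quantization step.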

There are two types of quantization:

  • Static – Static quantization quantizes the weights and activations of the model. This quantization is used when both memory bandwidth and compute savings are important.
  • Dynamic – In dynamic quantization, the weights are quantized ahead of time, but the activations are dynamically quantized during inference.

Intel Extension for PyTorch: The Intel Extension extends PyTorch with up-to-date features and optimizations for an extra performance boost on Intel® hardware. Learn how to install it standalone or get it as part of the Intel® AI Analytics Toolkit.

The extension can be loaded as a Python* module or linked as a C++ library. Python users can enable it dynamically by importing intel_extension_for_pytorch.

  • This CPU tutorial gives detailed information about Intel Extension for PyTorch for Intel CPUs. Source code is available at the master branch.
  • This GPU tutorial gives detailed information about Intel Extension for PyTorch for Intel GPUs. Source code is available at the xpu-master branch.

How to perform dynamic quantization using Intel Extension for PyTorch?

Here are the steps to quantize the existing FP32 model to INT8 model using dynamic quantization:

  1. Prepare quantization configuration – We can use default dynamic quantization configuration with ipex.quantization.default_dynamic_qconfig.
  2. Prepare the FP32 model – Use the ipex.quantization.prepare method, providing input parameters such as the FP32 model to quantize, the prepared configuration, example inputs, and whether the quantization should be done in place.
  3. Convert the model from FP32 to INT8 – Use ipex.quantization.convert method for conversion. The input model will be the model prepared in step 2.
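The three steps above can be sketched as a small helper. The wrapper name `quantize_dynamic_int8` is hypothetical, and the keyword arguments `example_inputs` and `inplace` are assumptions based on the description in step 2; check them against the version of the extension you have installed.

```python
def quantize_dynamic_int8(model, example_inputs):
    """Return an INT8 version of an FP32 PyTorch model via dynamic quantization."""
    # Imported lazily so this module loads even where the extension is absent.
    import intel_extension_for_pytorch as ipex

    # Step 1: default dynamic quantization configuration.
    qconfig = ipex.quantization.default_dynamic_qconfig
    # Step 2: prepare the FP32 model (inplace=False is assumed to leave the
    # original model untouched).
    prepared_model = ipex.quantization.prepare(
        model, qconfig, example_inputs=example_inputs, inplace=False
    )
    # Step 3: convert the prepared model from FP32 to INT8.
    return ipex.quantization.convert(prepared_model)
```

A call would pass the loaded FP32 model together with a representative input tensor, for example one encoded sentence from the tokenizer.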

We also encourage you to check out the Intel® Neural Compressor tool that automates popular model-compression technologies such as quantization, pruning, and knowledge distillation across multiple deep learning frameworks.

Code Sample

The following steps are implemented in the code sample:

  1. Load model and tokenizer: The Transformers library (check out Intel® Extension for Transformers) and the Auto Classes available in the Hugging Face Main Classes are used in this step. These let us automatically find the relevant model by the given name and make it easy to swap the model without major code changes on the developer’s side, as shown below:
    tokenizer = AutoTokenizer.from_pretrained(model)
    model = AutoModelForCausalLM.from_pretrained(model)

    The model parameter, passed to both the tokenizer and the model initialization, is just the path to the pre-trained DialoGPT model. In this sample, we are using ‘microsoft/DialoGPT-large’. If you have limited resources, you can use the ‘microsoft/DialoGPT-medium’ or ‘microsoft/DialoGPT-small’ models and receive comparable results.

  2. Perform dynamic quantization of the model:
    1. Create the configuration using the default dynamic quantization configuration from Intel Extension for PyTorch library.
    2. Prepare the model.
    3. Convert the model from FP32 to INT8.
      The steps are explained in detail in the above section.
  3. Response generation: The first step in response generation is to encode the input sentence as shown in the code below:
    new_input_ids = tokenizer.encode(input(">> You:") + tokenizer.eos_token, return_tensors='pt')

    In this sample, we want our model to save history, so we are adding input sentences in the form of tokens to the chat history:

    bot_input_ids = torch.cat([chat_history_ids, new_input_ids], dim=-1) if chat_round > 0 else new_input_ids

    The text generation can be done by the model.generate function, where we can specify all important parameters like saved chat history, length of the response in tokens, and usage of both Top-K and Top-p sampling.

    chat_history_ids = model.generate(bot_input_ids, do_sample=True, max_length=2000, top_k=50, top_p=0.95, pad_token_id=tokenizer.eos_token_id) 

    The last step is to decode and print the response:

  4. Preparation for interactive conversation: After response generation, the last step is to add interaction. This can be done by using a simple for loop. Based on the initialized tokenizer, model, and empty chat history, responses are generated for a number of rounds:
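    for chat_round in range(n):
        # Argument list assumed from the description above; match it to your generate_response signature.
        chat_history_ids = generate_response(tokenizer, model, chat_round, chat_history_ids)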

    An example of interactive chat generation will look like the one shown in the picture below.

An example of interactive chat generation

What’s Next?

Get started with interactive chat-generation models using Intel Extension for PyTorch and DialoGPT. Download and try the Intel AI Analytics Toolkit and Intel Extension for PyTorch for yourself to build various end-to-end AI applications.

We encourage you to also check out and incorporate Intel’s other AI/ML Framework optimizations and end-to-end portfolio of tools into your AI workflow and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions.

For more details about the new 4th Gen Intel® Xeon® Scalable processors, visit Intel’s AI Solution Platform portal where you can learn how Intel is empowering developers to run end-to-end AI pipelines on these powerful CPUs.

Scaling up learning across many different robot types

We are launching a new set of resources for general-purpose robotics learning across different robot types, or embodiments. Together with partners from 34 academic labs we have pooled data from 22 different robot types to create the Open X-Embodiment dataset. We also release RT-1-X, a robotics transformer (RT) model derived from RT-1 and trained on our dataset, that shows skills transfer across many robot embodiments.

Today, we are excited to announce that Code Llama foundation models, developed by Meta, are available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. Code Llama is a state-of-the-art large language model (LLM) capable of generating code and natural language about code from both code and natural language prompts. Code Llama is free for research and commercial use. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. In this post, we walk through how to discover and deploy the Code Llama model via SageMaker JumpStart.

What is Code Llama

Code Llama is a model released by Meta that is built on top of Llama 2 and is a state-of-the-art model designed to improve productivity for programming tasks for developers by helping them create high quality, well-documented code. The models show state-of-the-art performance in Python, C++, Java, PHP, C#, TypeScript, and Bash, and have the potential to save developers’ time and make software workflows more efficient. It comes in three variants, engineered to cover a wide variety of applications: the foundational model (Code Llama), a Python specialized model (Code Llama-Python), and an instruction-following model for understanding natural language instructions (Code Llama-Instruct). All Code Llama variants come in three sizes: 7B, 13B, and 34B parameters. The 7B and 13B base and instruct variants support infilling based on surrounding content, making them ideal for code assistant applications.

The models were designed using Llama 2 as the base and then trained on 500 billion tokens of code data, with the Python specialized version trained on an incremental 100 billion tokens. The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens.

The model is made available under the same community license as Llama 2.

What is SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances within a network isolated environment and customize models using SageMaker for model training and deployment.

You can now discover and deploy Code Llama models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping ensure data security. Code Llama models are discoverable and can be deployed in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions.

Customers must accept the EULA to deploy the model via the SageMaker SDK.

Discover models

You can access Code Llama foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

On the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources. You can find Code Llama models in the Foundation Models: Text Generation carousel.

You can also find other model variants by choosing Explore all Text Generation Models or searching for Code Llama.

You can choose the model card to view details about the model such as license, data used to train, and how to use. You will also find two buttons, Deploy and Open Notebook, which will help you use the model.


When you choose Deploy and acknowledge the terms, deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, we start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker with the following code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-codellama-7b")
predictor = model.deploy()

This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "inputs": "<s>[INST] How do I deploy a model on Amazon SageMaker? [/INST]",
    "parameters": {"max_new_tokens": 512, "temperature": 0.2, "top_p": 0.9}
}
predictor.predict(payload, custom_attributes="accept_eula=true")

Note that by default, accept_eula is set to false. You need to set accept_eula=true to invoke the endpoint successfully. By doing so, you accept the user license agreement and acceptable use policy as mentioned earlier. You can also download the license agreement.

The custom_attributes used to pass the EULA acceptance are key/value pairs. The key and value are separated by = and pairs are separated by ;. If the user passes the same key more than once, the last value is kept and passed to the script handler (in this case, used for conditional logic). For example, if accept_eula=false; accept_eula=true is passed to the server, then accept_eula=true is kept and passed to the script handler.
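The last-value-wins rule can be illustrated with a short parser. This is only a sketch of the behavior described above, not the code the endpoint actually runs; the function name `parse_custom_attributes` is hypothetical.

```python
def parse_custom_attributes(raw):
    """Parse 'key=value;key=value' pairs; a repeated key keeps its last value."""
    attributes = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes

parsed = parse_custom_attributes("accept_eula=false; accept_eula=true")
print(parsed)  # {'accept_eula': 'true'}
```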

Inference parameters control the text generation process at the endpoint. The max_new_tokens parameter limits the size of the output generated by the model. Note that this is not the same as the number of words because the vocabulary of the model is not the same as the English language vocabulary, and each token may not be an English language word. Temperature controls the randomness in the output. A higher temperature results in more creative but more hallucination-prone outputs. All the inference parameters are optional.
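To see why temperature changes randomness, the sketch below applies the usual temperature-scaled softmax to a few made-up logits. It assumes the endpoint follows this standard formulation; the function name and values are illustrative only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
cool = softmax_with_temperature(logits, temperature=0.2)
warm = softmax_with_temperature(logits, temperature=1.5)
print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
```

At the lower temperature almost all probability lands on the top-scoring token, while the higher temperature spreads it across the alternatives, which is what makes sampled outputs more varied.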

The following table lists all the Code Llama models available in SageMaker JumpStart along with the model IDs, default instance types, and the maximum supported tokens (the sum of the number of input tokens and the number of generated tokens across all concurrent requests) for each of these models.

Model Name | Model ID | Default Instance Type | Max Supported Tokens
CodeLlama-7b | meta-textgeneration-llama-codellama-7b | ml.g5.2xlarge | 10000
CodeLlama-7b-Instruct | meta-textgeneration-llama-codellama-7b-instruct | ml.g5.2xlarge | 10000
CodeLlama-7b-Python | meta-textgeneration-llama-codellama-7b-python | ml.g5.2xlarge | 10000
CodeLlama-13b | meta-textgeneration-llama-codellama-13b | ml.g5.12xlarge | 32000
CodeLlama-13b-Instruct | meta-textgeneration-llama-codellama-13b-instruct | ml.g5.12xlarge | 32000
CodeLlama-13b-Python | meta-textgeneration-llama-codellama-13b-python | ml.g5.12xlarge | 32000
CodeLlama-34b | meta-textgeneration-llama-codellama-34b | ml.g5.48xlarge | 48000
CodeLlama-34b-Instruct | meta-textgeneration-llama-codellama-34b-instruct | ml.g5.48xlarge | 48000
CodeLlama-34b-Python | meta-textgeneration-llama-codellama-34b-python | ml.g5.48xlarge | 48000

While the Code Llama models were trained on a context length of 16,000 tokens, the models have reported good performance on even larger context windows. The maximum supported tokens column in the preceding table is the upper limit on the supported context window on the default instance type. Since the Code Llama 7B model can only support 10,000 tokens on an ml.g5.2xlarge instance, we recommend deploying a 13B or 34B model version if larger contexts are required for your application.

By default, all models work for code generation tasks. The base and instruct models both respond to infilling tasks, though the base model had better quality output for the majority of sample queries. Finally, only instruct models work on instruct tasks. The following table illustrates which models had good performance (Good) and moderate performance (Moderate) on example queries in the demo notebooks.

Model Name | Code Generation | Code Infilling | Code Instructions
CodeLlama-7b | Good | Good | N/A
CodeLlama-7b-Instruct | Good | Moderate | Good
CodeLlama-7b-Python | Good | N/A | N/A
CodeLlama-13b | Good | Good | N/A
CodeLlama-13b-Instruct | Good | Moderate | Good
CodeLlama-13b-Python | Good | N/A | N/A
CodeLlama-34b | Good | N/A | N/A
CodeLlama-34b-Instruct | Good | N/A | Good
CodeLlama-34b-Python | Good | N/A | N/A

Code generation

The following examples were run on the CodeLlama-34b-Instruct model with payload parameters "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9}:

import socket
import time

def ping_exponential_backoff(host: str):
    """
    Ping a host using exponential backoff.
    """
    for i in range(1, 10):
        try:
            socket.gethostbyname(host)
            return True
        except socket.gaierror:
            time.sleep(2 ** i)
    return False

import argparse

def main(string: str):
    print(string)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("string", help="String to be printed")
    args = parser.parse_args()
    main(args.string)

Code infilling

Code infilling involves returning generated code given surrounding context. This differs from the code generation task because, in addition to a prefix code segment, the model is also provided with a code segment suffix. Special tokens were used during fine-tuning to mark the beginning of the prefix (<PRE>), the beginning of the suffix (<SUF>), and the beginning of the middle (<MID>). Input sequences to the model should be in one of the following formats:

  • prefix-suffix-middle – <PRE> {prefix} <SUF>{suffix} <MID>
  • suffix-prefix-middle – <PRE> <SUF>{suffix} <MID> {prefix}
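A small helper can assemble either layout from a prefix and a suffix. This is a sketch that copies the spacing of the two formats listed above; the function name `build_infill_prompt` and the sample code fragment are assumptions for illustration.

```python
def build_infill_prompt(prefix, suffix, order="psm"):
    """Build an infilling prompt in prefix-suffix-middle ('psm') or
    suffix-prefix-middle ('spm') order."""
    if order == "psm":
        return f"<PRE> {prefix} <SUF>{suffix} <MID>"
    if order == "spm":
        return f"<PRE> <SUF>{suffix} <MID> {prefix}"
    raise ValueError("order must be 'psm' or 'spm'")

# Hypothetical fragment: the model is asked to fill in the function body.
prompt = build_infill_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n    return result",
)
print(prompt)
```

The resulting string would be sent as the "inputs" field of the request payload.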

The following examples use the prefix-suffix-middle format on the CodeLlama-7b model with payload parameters {"max_new_tokens": 256, "temperature": 0.05, "top_p": 0.9}:

<PRE> def remove_non_ascii(s: str) -> str:
    """ <SUF>
    return result

    Remove non-ASCII characters from a string.

    :param s: The string to remove non-ASCII characters from.
    :return: The string with non-ASCII characters removed.
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c

# Installation instructions:
This downloads the LLaMA inference code and installs the repository as a local pip package.

    git clone
    cd LLaMA-inference
    pip install -e .

class InterfaceManagerFactory(AbstractManagerFactory):
    def __init__(<FILL>
def main():
    factory = InterfaceManagerFactory(
    managers = []
    for i in range(10):

self, start=None):
        self.start = start

    def build(self, id):
        return InterfaceManager(id, self.start)

/-- A quasi-prefunctoid is 1-connected iff all its etalisations are 1-connected. -/
theorem connected_iff_etalisation [C D : precategoroid] (P : quasi_prefunctoid C D) :
  π₁ P = 0 ↔ <FILL> = 0 :=
  { intros h f,
    rw pi_1_etalisation at h,
    simp [h],
  { intro h,
    have := @quasi_adjoint C D P,
    simp [←pi_1_etalisation, this, h],

∀ f : C ⟶ D, π₁ (P.etalise f)

Code instructions

Meta also provided an instruction-tuned variant of Code Llama. Example queries in this section can only be applied to these instruction-tuned Code Llama models, which are the models whose model ID has an instruct suffix. The Code Llama format for instructions is the same as the Llama-2-chat prompt format, which we detail in Llama 2 foundation models are now available in SageMaker JumpStart.

A simple user prompt may look like the following:

<s>[INST] {user_prompt} [/INST]

You may also add a system prompt with the following syntax:

<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_prompt} [/INST]

Finally, you can have a conversational interaction with the model by including all previous user prompts and assistant responses in the input:

<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_prompt_1} [/INST] {assistant_response_1} </s><s>[INST] {user_prompt_2} [/INST]
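The templates in this section can be generated programmatically. The following sketch assumes the Llama-2-chat layout, with the system prompt wrapped in `<<SYS>>` and `<</SYS>>` inside the first instruction; the function name `build_chat_prompt` and its argument shape are hypothetical.

```python
def build_chat_prompt(turns, system_prompt=None):
    """Build a Llama-2-chat style prompt.

    `turns` is a list of (user_prompt, assistant_response) tuples. The last
    assistant_response should be None because the model has yet to answer.
    """
    parts = []
    for index, (user, assistant) in enumerate(turns):
        if index == 0 and system_prompt:
            user = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user}"
        parts.append(f"<s>[INST] {user} [/INST]")
        if assistant is not None:
            parts.append(f" {assistant} </s>")
    return "".join(parts)

single = build_chat_prompt([("How do I reverse a list in Python?", None)])
with_system = build_chat_prompt(
    [("Write a hello world program.", None)],
    system_prompt="Provide answers in JavaScript",
)
print(single)
print(with_system)
```

For a multi-turn conversation, pass every earlier exchange followed by the new user prompt, for example `[("q1", "a1"), ("q2", None)]`.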

These examples were run on the CodeLlama-13b-Instruct model with payload parameters "parameters": {"max_new_tokens": 512, "temperature": 0.2, "top_p": 0.9}:

<s>[INST] In Bash, how do I list all text files in the current directory (excluding subdirectories) that have been modified in the last month? [/INST] 

 You can use the `find` command in Bash to list all text files in the current directory (excluding subdirectories) that have been modified in the last month. Here's an example command:
find . -type f -name "*.txt" -mtime -30
Here's a breakdown of the command:

* `find .`: This tells `find` to search the current directory and all its subdirectories.
* `-type f`: This tells `find` to only search for files (not directories).
* `-name "*.txt"`: This tells `find` to only search for files with the `.txt` extension.
* `-mtime -30`: This tells `find` to only search for files that have been modified in the last 30 days (2,592,000 seconds).

The output of the command will be a list of all text files in the current directory (excluding subdirectories) that have been modified in the last month.

<s>[INST] What is the difference between inorder and preorder traversal? Give an example in Python. [/INST] 

 Inorder and preorder traversal are two different ways of traversing a binary tree.

Inorder traversal is a way of visiting the nodes of a binary tree in a specific order. In an inorder traversal, the left subtree is visited first, then the current node, and finally the right subtree is visited. This means that the nodes are visited in a left-to-right, top-to-bottom order.

Preorder traversal, on the other hand, is a way of visiting the nodes of a binary tree in a different order. In a preorder traversal, the current node is visited first, followed by the left subtree, and finally the right subtree. This means that the nodes are visited in a top-down, left-to-right order.

Here is an example of an inorder traversal of a binary tree in Python:
def inorder_traversal(root):
    if root is None:
        return []
    return inorder_traversal(root.left) + [root.val] + inorder_traversal(root.right)
And here is an example of a preorder traversal of a binary tree in Python:
def preorder_traversal(root):
    if root is None:
        return []
    return [root.val] + preorder_traversal(root.left) + preorder_traversal(root.right)
In this example, the `inorder_traversal` function returns a list of the values in the tree in the order of the keys, from smallest to largest. The `preorder_traversal` function returns a list of the values in the tree in the order of the keys, but with the current node visited before its children.

<s>[INST] <<SYS>>
Provide answers in JavaScript
<</SYS>>

Write a function that computes the set of sums of all contiguous sublists of a given list. [/INST] 

function contiguousSublistSums(list) {
  let sums = [];
  for (let i = 0; i < list.length; i++) {
    let sum = 0;
    for (let j = i; j < list.length; j++) {
      sum += list[j];
      sums.push(sum);
    }
  }
  return sums;
}

console.log(contiguousSublistSums([1, 2, 3, 4, 5])); // [1, 3, 6, 10, 15]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6])); // [1, 3, 6, 10, 15, 21]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6, 7])); // [1, 3, 6, 10, 15, 21, 28]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6, 7, 8])); // [1, 3, 6, 10, 15, 21, 28, 36]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6, 7, 8, 9])); // [1, 3, 6, 10, 15, 21, 28, 36, 45]

Clean up

After you’re done running the notebook, make sure to delete all resources that you created in the process so your billing is stopped. Use the following code:
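A minimal sketch of that cleanup, assuming `predictor` is the object returned by `model.deploy()` earlier in this post. The wrapper name `clean_up` is hypothetical; `delete_model` and `delete_endpoint` are methods of the SageMaker Python SDK predictor.

```python
def clean_up(predictor):
    """Delete the model and the endpoint behind a deployed predictor."""
    predictor.delete_model()
    predictor.delete_endpoint()
```

Call `clean_up(predictor)` once you no longer need the endpoint.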



In this post, we showed you how to get started with Code Llama models in SageMaker Studio and deploy the model for generating code and natural language about code from both code and natural language prompts. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.


About the authors

Gabriel Synnaeve is a Research Director on the Facebook AI Research (FAIR) team at Meta. Prior to Meta, Gabriel was a postdoctoral fellow in Emmanuel Dupoux’s team at École Normale Supérieure in Paris, working on reverse-engineering the acquisition of language in babies. Gabriel received his PhD in Bayesian modeling applied to real-time strategy games AI from the University of Grenoble.

Eissa Jamil is a Partner Engineer RL, Generative AI at Meta.

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Vivek Singh is a product manager with SageMaker JumpStart. He focuses on enabling customers to onboard SageMaker JumpStart to simplify and accelerate their ML journey to build Generative AI applications.

A successful deployment of a machine learning (ML) model in a production environment heavily relies on an end-to-end ML pipeline. Although developing such a pipeline can be challenging, it becomes even more complex when dealing with an edge ML use case. Machine learning at the edge is a concept that brings the capability of running ML models locally to edge devices. In order to deploy, monitor, and maintain these models at the edge, a robust MLOps pipeline is required. An MLOps pipeline lets you automate the full ML lifecycle from data labeling to model training and deployment.

Implementing an MLOps pipeline at the edge introduces additional complexities that make the automation, integration, and maintenance processes more challenging due to the increased operational overhead involved. However, using purpose-built services like Amazon SageMaker and AWS IoT Greengrass allows you to significantly reduce this effort. In this series, we walk you through the process of architecting and building an integrated end-to-end MLOps pipeline for a computer vision use case at the edge using SageMaker, AWS IoT Greengrass, and the AWS Cloud Development Kit (AWS CDK).

This post focuses on designing the overall MLOps pipeline architecture; Part 2 and Part 3 of this series focus on the implementation of the individual components. We have provided a sample implementation in the accompanying GitHub repository for you to try yourself. If you’re just getting started with MLOps at the edge on AWS, refer to MLOps at the edge with Amazon SageMaker Edge Manager and AWS IoT Greengrass for an overview and reference architecture.

Use case: Inspecting the quality of metal tags

As an ML engineer, it’s important to understand the business case you are working on. So before we dive into the MLOps pipeline architecture, let’s look at the sample use case for this post. Imagine a production line of a manufacturer that engraves metal tags to create customized luggage tags. The quality assurance process is costly because the raw metal tags need to be inspected manually for defects like scratches. To make this process more efficient, we use ML to detect faulty tags early in the process. This helps avoid costly defects at later stages of the production process. The model should identify possible defects like scratches in near-real time and mark them. In manufacturing shop floor environments, you often have to deal with no connectivity or constrained bandwidth and increased latency. Therefore, we want to implement an on-edge ML solution for visual quality inspection that can run inference locally on the shop floor and reduce the connectivity requirements. To keep our example straightforward, we train a model that marks detected scratches with bounding boxes. The following image is an example of a tag from our dataset with three scratches marked.

Metal tag with scratches

Defining the pipeline architecture

We have now gained clarity into our use case and the specific ML problem we aim to address, which revolves around object detection at the edge. Now it’s time to draft an architecture for our MLOps pipeline. At this stage, we aren’t looking at technologies or specific services yet, but rather the high-level components of our pipeline. In order to quickly retrain and deploy, we need to automate the whole end-to-end process: from data labeling, to training, to inference. However, there are a few challenges that make setting up a pipeline for an edge case particularly hard:

  • Building different parts of this process requires different skill sets. For instance, data labeling and training has a strong data science focus, edge deployment requires an Internet of Things (IoT) specialist, and automating the whole process is usually done by someone with a DevOps skill set.
  • Depending on your organization, this whole process might even be implemented by multiple teams. For our use case, we’re working under the assumption that separate teams are responsible for labeling, training, and deployment.
  • More roles and skill sets mean different requirements when it comes to tooling and processes. For instance, data scientists might want to monitor and work with their familiar notebook environment. MLOps engineers want to work using infrastructure as code (IaC) tools and might be more familiar with the AWS Management Console.

What does this mean for our pipeline architecture?

Firstly, it’s crucial to clearly define the major components of the end-to-end system so that different teams can work independently. Secondly, the interfaces between teams must be well defined to make collaboration efficient. These interfaces help minimize disruptions between teams, enabling them to modify their internal processes as needed as long as they adhere to the defined interfaces. The following diagram illustrates what this could look like for our computer vision pipeline.

MLOps pipeline scribble

Let’s examine the overall architecture of the MLOps pipeline in detail:

  1. The process begins with a collection of raw images of metal tags, which are captured using an edge camera device in the production environment to form an initial training dataset.
  2. The next step involves labeling these images and marking defects using bounding boxes. It’s essential to version the labeled dataset, ensuring traceability and accountability for the utilized training data.
  3. After we have a labeled dataset, we can proceed with training, fine-tuning, evaluating, and versioning our model.
  4. When we’re happy with our model performance, we can deploy the model to an edge device and run live inferences at the edge.
  5. While the model operates in production, the edge camera device generates valuable image data containing previously unseen defects and edge cases. We can use this data to further enhance our model’s performance. To accomplish this, we save images for which the model predicts with low confidence or makes erroneous predictions. These images are then added back to our raw dataset, initiating the entire process again.
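The selection described in step 5 can be sketched as a simple filter. This is a minimal illustration rather than the implementation used in the sample: the `Prediction` structure, the function names, and the 0.5 threshold are assumptions made for the example. Erroneous predictions that the model makes with high confidence cannot be caught this way and would need feedback from a downstream check.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    """One model output for an image (hypothetical structure)."""
    image_path: str
    confidences: List[float]  # one score per detected defect


def needs_review(pred: Prediction, threshold: float = 0.5) -> bool:
    """Flag an image when any detection falls below the confidence threshold."""
    return any(score < threshold for score in pred.confidences)


def select_for_relabeling(preds: List[Prediction], threshold: float = 0.5) -> List[str]:
    """Return the paths of images that should be added back to the raw dataset."""
    return [p.image_path for p in preds if needs_review(p, threshold)]


predictions = [
    Prediction("tag_001.jpg", [0.93, 0.88]),
    Prediction("tag_002.jpg", [0.41]),
    Prediction("tag_003.jpg", []),
]
print(select_for_relabeling(predictions))
```

Only the second image is selected here, because it is the only one with a detection below the threshold.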

It’s important to note that the raw image data, labeled dataset, and trained model serve as well-defined interfaces between the distinct pipelines. MLOps engineers and data scientists have the flexibility to choose the technologies within their pipelines as long as they consistently produce these artifacts. Most significantly, we have established a closed feedback loop. Faulty or low-confidence predictions made in production can be used to regularly augment our dataset and automatically retrain and enhance the model.

Target architecture

Now that the high-level architecture is established, it’s time to go one level deeper and look at how we could build this with AWS services. Note that the architecture shown in this post assumes you want to take full control of the whole data science process. However, if you’re just getting started with quality inspection at the edge, we recommend Amazon Lookout for Vision. It provides a way to train your own quality inspection model without having to build, maintain, or understand ML code. For more information, refer to Amazon Lookout for Vision now supports visual inspection of product defects at the edge.

However, if you want to take full control, the following diagram shows what an architecture could look like.

MLOps pipeline architecture

Similar to before, let’s walk through the workflow step by step and identify which AWS services suit our requirements:

  1. Amazon Simple Storage Service (Amazon S3) is used to store raw image data because it provides us with a low-cost storage solution.
  2. The labeling workflow is orchestrated using AWS Step Functions, a serverless workflow engine that makes it easy to orchestrate the steps of the labeling workflow. As part of this workflow, we use Amazon SageMaker Ground Truth to fully automate the labeling using labeling jobs and managed human workforces. AWS Lambda is used to prepare the data, start the labeling jobs, and store the labels in Amazon SageMaker Feature Store.
  3. SageMaker Feature Store stores the labels. It allows us to centrally manage and share our features and provides us with built-in data versioning capabilities, which makes our pipeline more robust.
  4. We orchestrate the model building and training pipeline using Amazon SageMaker Pipelines. It integrates with the other SageMaker features required via built-in steps. SageMaker Training jobs are used to automate the model training, and SageMaker Processing jobs are used to prepare the data and evaluate model performance. In this example, we’re using the Ultralytics YOLOv8 Python package and model architecture to train and export an object detection model to the ONNX ML model format for portability.
  5. If the performance is acceptable, the trained model is registered in Amazon SageMaker Model Registry with an incremental version number attached. It acts as our interface between the model training and edge deployment steps. We also manage the approval state of models here. Similar to the other services used, it’s fully managed, so we don’t have to take care of running our own infrastructure.
  6. The edge deployment workflow is automated using Step Functions, similar to the labeling workflow. We can use the API integrations of Step Functions to easily call the various required AWS service APIs like AWS IoT Greengrass to create new model components and afterwards deploy the components to the edge device.
  7. AWS IoT Greengrass is used as the edge device runtime environment. It manages the deployment lifecycle for our model and inference components at the edge. It allows us to easily deploy new versions of our model and inference components using simple API calls. In addition, ML models at the edge usually don’t run in isolation; we can use the various AWS and community provided components of AWS IoT Greengrass to connect to other services.

The architecture outlined resembles our high-level architecture shown before. Amazon S3, SageMaker Feature Store, and SageMaker Model Registry act as the interfaces between the different pipelines. To minimize the effort to run and operate the solution, we use managed and serverless services wherever possible.

Merging into a robust CI/CD system

The data labeling, model training, and edge deployment steps are core to our solution. As such, any change related to the underlying code or data in any of those parts should trigger a new run of the whole orchestration process. To achieve this, we need to integrate this pipeline into a CI/CD system that allows us to automatically deploy code and infrastructure changes from a versioned code repository into production. Similar to the previous architecture, team autonomy is an important aspect here. The following diagram shows what this could look like using AWS services.

CI/CD pipeline

Let’s walk through the CI/CD architecture:

  1. AWS CodeCommit acts as our Git repository. For the sake of simplicity, in our provided sample, we separated the distinct parts (labeling, model training, edge deployment) via subfolders in a single git repository. In a real-world scenario, each team might use different repositories for each part.
  2. Infrastructure deployment is automated using the AWS CDK and each part (labeling, training, and edge) gets its own AWS CDK app to allow independent deployments.
  3. The AWS CDK pipeline feature uses AWS CodePipeline to automate the infrastructure and code deployments.
  4. The AWS CDK deploys two code pipelines for each step: an asset pipeline and a workflow pipeline. We separated the workflow from the asset deployment to allow us to start the workflows separately in case there are no asset changes (for example, when there are new images available for training).
    • The asset code pipeline deploys all infrastructure required for the workflow to run successfully, such as AWS Identity and Access Management (IAM) roles, Lambda functions, and container images used during training.
    • The workflow code pipeline runs the actual labeling, training, or edge deployment workflow.
  5. Asset pipelines are automatically triggered on commit as well as when a previous workflow pipeline is complete.
  6. The whole process is triggered on a schedule using an Amazon EventBridge rule for regular retraining.

With the CI/CD integration, the whole end-to-end chain is now fully automated. The pipeline is triggered whenever code changes in our git repository as well as on a schedule to accommodate data changes.

Thinking ahead

The solution architecture described represents the basic components to build an end-to-end MLOps pipeline at the edge. However, depending on your requirements, you might think about adding additional functionality. The following are some examples:


In this post, we outlined our architecture for building an end-to-end MLOps pipeline for visual quality inspection at the edge using AWS services. This architecture streamlines the entire process, encompassing data labeling, model development, and edge deployment, enabling us to swiftly and reliably train and implement new versions of the model. With serverless and managed services, we can direct our focus towards delivering business value rather than managing infrastructure.

In Part 2 of this series, we will delve one level deeper and look at the implementation of this architecture in more detail, specifically labeling and model building. If you want to jump straight to the code, you can check out the accompanying GitHub repo.

About the authors

Michael Roth is a Senior Solutions Architect at AWS supporting Manufacturing customers in Germany to solve their business challenges through AWS technology. Besides work and family he’s interested in sports cars and enjoys Italian coffee.

Jörg Wöhrle is a Solutions Architect at AWS, working with manufacturing customers in Germany. With a passion for automation, Jörg has worked as a software developer, DevOps engineer, and Site Reliability Engineer in his pre-AWS life. Beyond cloud, he’s an ambitious runner and enjoys quality time with his family. So if you have a DevOps challenge or want to go for a run: let him know.

Johannes Langer is a Senior Solutions Architect at AWS, working with enterprise customers in Germany. Johannes is passionate about applying machine learning to solve real business problems. In his personal life, Johannes enjoys working on home improvement projects and spending time outdoors with his family.

In Part 1 of this series, we drafted an architecture for an end-to-end MLOps pipeline for a visual quality inspection use case at the edge. It is architected to automate the entire machine learning (ML) process, from data labeling to model training and deployment at the edge. The focus on managed and serverless services reduces the need to operate infrastructure for your pipeline and allows you to get started quickly.

In this post, we delve deep into how the labeling and model building and training parts of the pipeline are implemented. If you’re particularly interested in the edge deployment aspect of the architecture, you can skip ahead to Part 3. We also provide an accompanying GitHub repo if you want to deploy and try this yourself.

Solution overview

The sample use case used for this series is a visual quality inspection solution that can detect defects on metal tags, which could be deployed as part of a manufacturing process. The following diagram shows the high-level architecture of the MLOps pipeline we defined in the beginning of this series. If you haven’t read it yet, we recommend checking out Part 1.

Architecture diagram

Automating data labeling

Data labeling is an inherently labor-intensive task that requires humans (labelers) to label the data. Labeling for our use case means inspecting an image and drawing bounding boxes for each defect that is visible. This may sound straightforward, but we need to take care of a number of things in order to automate this:

  • Provide a tool for labelers to draw bounding boxes
  • Manage a workforce of labelers
  • Ensure good label quality
  • Manage and version our data and labels
  • Orchestrate the whole process
  • Integrate it into the CI/CD system

We can do all of this with AWS services. To facilitate the labeling and manage our workforce, we use Amazon SageMaker Ground Truth, a data labeling service that allows you to build and manage your own data labeling workflows and workforce. You can manage your own private workforce of labelers, or use the power of external labelers via Amazon Mechanical Turk or third-party providers.

On top of that, the whole process can be configured and managed via the AWS SDK, which is what we use to orchestrate our labeling workflow as part of our CI/CD pipeline.

Labeling jobs are used to manage labeling workflows. SageMaker Ground Truth provides out-of-the-box templates for many different labeling task types, including drawing bounding boxes. For more details on how to set up a labeling job for bounding box tasks, check out Streamlining data labeling for YOLO object detection in Amazon SageMaker Ground Truth. For our use case, we adapt the task template for bounding box tasks and use human annotators provided by Mechanical Turk to label our images by default. The following screenshot shows what a labeler sees when working on an image.

Labeling UI

Let’s talk about label quality next. The quality of our labels will affect the quality of our ML model. When automating the image labeling with an external human workforce like Mechanical Turk, it’s challenging to ensure a good and consistent label quality due to the lack of domain expertise. Sometimes a private workforce of domain experts is required. In our sample solution, however, we use Mechanical Turk to implement automated labeling of our images.

There are many ways to ensure good label quality. For more information about best practices, refer to the AWS re:Invent 2019 talk, Build accurate training datasets with Amazon SageMaker Ground Truth. As part of this sample solution, we decided to focus on the following:

Finally, we need to think about how to store our labels so they can be reused for training later and enable traceability of used model training data. The output of a SageMaker Ground Truth labeling job is a file in JSON-lines format containing the labels and additional metadata. We decided to use the offline store of Amazon SageMaker Feature Store to store our labels. Compared to simply storing the labels on Amazon Simple Storage Service (Amazon S3), it provides us with a few distinct advantages:

  • It stores a complete history of feature values, combined with point-in-time queries. This allows us to easily version our dataset and ensure traceability.
  • As a central feature store, it promotes reusability and visibility of our data.

For an introduction to SageMaker Feature Store, refer to Getting started with Amazon SageMaker Feature Store. SageMaker Feature Store supports storing features in tabular format. In our example, we store the following features for each labeled image:

  • The location where the image is stored on Amazon S3
  • Image dimensions
  • The bounding box coordinates and class values
  • A status flag indicating whether the label has been approved for use in training
  • The labeling job name used to create the label
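To make the mapping from labeling output to these features concrete, the following sketch flattens one line of a bounding box output manifest into a record with the fields listed above. It is an illustration under assumptions: the label attribute name (`labels`), the record field names, the `unverified` status value, and the bucket and job names are all hypothetical, and a real feature group additionally needs a record identifier and an event time feature. Nested annotations are serialized to a JSON string because the feature store holds tabular values.

```python
import json
from typing import Any, Dict


def manifest_line_to_record(line: str, label_attribute: str, job_name: str) -> Dict[str, Any]:
    """Flatten one JSON-lines manifest entry into a tabular feature record."""
    entry = json.loads(line)
    label = entry[label_attribute]
    size = label["image_size"][0]
    return {
        "source_ref": entry["source-ref"],                # image location on Amazon S3
        "image_width": size["width"],
        "image_height": size["height"],
        "annotations": json.dumps(label["annotations"]),  # boxes and class IDs as a string
        "status": "unverified",                           # hypothetical status value
        "labeling_job": job_name,
    }


# Example manifest line with one scratch (values invented for illustration).
sample_line = json.dumps({
    "source-ref": "s3://example-bucket/images/tag_001.jpg",
    "labels": {
        "image_size": [{"width": 800, "height": 400, "depth": 3}],
        "annotations": [{"class_id": 0, "left": 100, "top": 50, "width": 200, "height": 100}],
    },
})

record = manifest_line_to_record(sample_line, "labels", "example-labeling-job")
print(record)
```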

The following screenshot shows what a typical entry in the feature store might look like.

Feature store

With this format, we can easily query the feature store and work with familiar tools like Pandas to construct a dataset to be used for training later.
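As a rough illustration of how versioned records turn into a training dataset, the sketch below picks the latest record per image as of a given point in time and keeps only approved labels. Plain Python stands in for the actual feature store query and Pandas code, and the field names, status values, and numeric timestamps are assumptions made for the example.

```python
from typing import Dict, List


def latest_approved(records: List[Dict], as_of: float) -> List[Dict]:
    """Return the newest record per image up to `as_of`, keeping approved labels only."""
    latest: Dict[str, Dict] = {}
    for rec in records:
        if rec["event_time"] > as_of:
            continue  # ignore label versions written after the requested point in time
        current = latest.get(rec["source_ref"])
        if current is None or rec["event_time"] > current["event_time"]:
            latest[rec["source_ref"]] = rec
    return [rec for rec in latest.values() if rec["status"] == "approved"]


history = [
    {"source_ref": "s3://example-bucket/a.jpg", "status": "unverified", "event_time": 100.0},
    {"source_ref": "s3://example-bucket/a.jpg", "status": "approved", "event_time": 200.0},
    {"source_ref": "s3://example-bucket/b.jpg", "status": "approved", "event_time": 150.0},
    {"source_ref": "s3://example-bucket/b.jpg", "status": "rejected", "event_time": 300.0},
]

for rec in latest_approved(history, as_of=250.0):
    print(rec["source_ref"], rec["status"])
```

Querying with a later `as_of` value drops the second image, because its most recent label version at that time is no longer approved. This is the property that makes a training dataset reproducible from a timestamp.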

Orchestrating data labeling

Finally, it’s time to automate and orchestrate each of the steps of our labeling pipeline! For this we use AWS Step Functions, a serverless workflow service that provides us with API integrations to quickly orchestrate and visualize the steps in our workflow. We also use a set of AWS Lambda functions for some of the more complex steps, specifically the following:

  • Check if there are new images that require labeling in Amazon S3
  • Prepare the data in the required input format and start the labeling job
  • Prepare the data in the required input format and start the label verification job
  • Write the final set of labels to the feature store
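The four steps above could be wired together in a state machine definition along the following lines. This is a simplified sketch in Amazon States Language, built as a Python dictionary. The state names, the Lambda function names, the `REGION` and `ACCOUNT_ID` placeholders, and the `new_image_count` field are hypothetical, and the waiting logic that a real workflow needs for asynchronous labeling jobs is left out.

```python
import json


def lambda_arn(name: str) -> str:
    """Build a placeholder Lambda ARN (REGION and ACCOUNT_ID are not real values)."""
    return f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{name}"


definition = {
    "Comment": "Simplified labeling workflow (illustrative only)",
    "StartAt": "CheckForNewImages",
    "States": {
        "CheckForNewImages": {
            "Type": "Task",
            "Resource": lambda_arn("check-new-images"),
            "Next": "AnyNewImages",
        },
        "AnyNewImages": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.new_image_count", "NumericGreaterThan": 0, "Next": "StartLabelingJob"}
            ],
            "Default": "NothingToLabel",
        },
        "StartLabelingJob": {
            "Type": "Task",
            "Resource": lambda_arn("start-labeling-job"),
            "Next": "StartVerificationJob",
        },
        "StartVerificationJob": {
            "Type": "Task",
            "Resource": lambda_arn("start-verification-job"),
            "Next": "WriteLabelsToFeatureStore",
        },
        "WriteLabelsToFeatureStore": {
            "Type": "Task",
            "Resource": lambda_arn("write-labels"),
            "End": True,
        },
        "NothingToLabel": {"Type": "Succeed"},
    },
}


def referenced_states(defn: dict) -> set:
    """Collect every state name that is used as a transition target."""
    targets = {defn["StartAt"]}
    for state in defn["States"].values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return targets


# Sanity check: every transition points at a state that exists.
missing = referenced_states(definition) - set(definition["States"])
print("missing states:", sorted(missing))
print(json.dumps(definition, indent=2)[:120], "...")
```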

The following figure shows what the full Step Functions labeling state machine looks like.

Labeling StepFunction

Labeling: Infrastructure deployment and integration into CI/CD

The final step is to integrate the Step Functions workflow into our CI/CD system and ensure that we deploy the required infrastructure. To accomplish this task, we use the AWS Cloud Development Kit (AWS CDK) to create all of the required infrastructure, like the Lambda functions and Step Functions workflow. With CDK Pipelines, a module of AWS CDK, we create a pipeline in AWS CodePipeline that deploys changes to our infrastructure and triggers an additional pipeline to start the Step Functions workflow. The Step Functions integration in CodePipeline makes this task very easy. We use Amazon EventBridge and CodePipeline Source actions to make sure that the pipeline is triggered on a schedule as well as when changes are pushed to git.

The following diagram shows what the CI/CD architecture for labeling looks like in detail.

Labeling CDK

Recap automating data labeling

We now have a working pipeline to automatically create labels from unlabeled images of metal tags using SageMaker Ground Truth. The images are picked up from Amazon S3 and fed into a SageMaker Ground Truth labeling job. After the images are labeled, we do a quality check using a label verification job. Finally, the labels are stored in a feature group in SageMaker Feature Store. If you want to try the working example yourself, check out the accompanying GitHub repository. Let’s look at how to automate model building next!

Automating model building

Similar to labeling, let’s have an in-depth look at our model building pipeline. At a minimum, we need to orchestrate the following steps:

  • Pull the latest features from the feature store
  • Prepare the data for model training
  • Train the model
  • Evaluate model performance
  • Version and store the model
  • Approve the model for deployment if performance is acceptable

The model building process is usually driven by a data scientist and is the outcome of a set of experiments done using notebooks or Python code. We can follow a simple three-step process to convert an experiment to a fully automated MLOps pipeline:

  1. Convert existing preprocessing, training, and evaluation code to command line scripts.
  2. Create a SageMaker pipeline definition to orchestrate model building. Use the scripts created in step one as part of the processing and training steps.
  3. Integrate the pipeline into your CI/CD workflow.
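For step 1, a notebook cell with hard-coded values typically becomes a script with a small argument parser. The skeleton below is a sketch under assumptions: the argument names and defaults are hypothetical, and it relies on the `SM_CHANNEL_TRAINING` and `SM_MODEL_DIR` environment variables that SageMaker training containers commonly set, with local fallbacks so the script also runs on a laptop.

```python
import argparse
import os


def parse_args(argv=None) -> argparse.Namespace:
    """Parse hyperparameters and paths that were hard-coded in the notebook."""
    parser = argparse.ArgumentParser(description="Train an object detection model")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--image-size", type=int, default=640)
    # Fall back to local folders when the SageMaker environment variables are absent.
    parser.add_argument("--train-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "data/train"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    return parser.parse_args(argv)


# In the real script, parse_args() would be called without arguments to read sys.argv.
args = parse_args(["--epochs", "5"])
print(f"Training for {args.epochs} epochs at image size {args.image_size}")
print(f"Reading data from {args.train_dir}, writing the model to {args.model_dir}")
```

Keeping all paths and hyperparameters behind arguments is what lets the same script run unchanged inside a pipeline step and on a developer machine.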

This three-step process is generic and can be used for any model architecture and ML framework of your choice. Let’s follow it and start with Step 1 to create the following scripts:

  • Preprocessing script – This pulls labeled images from SageMaker Feature Store, splits the dataset, and transforms it into the required format for training our model, in our case the input format for YOLOv8
  • Training script – This trains an Ultralytics YOLOv8 object detection model using PyTorch to detect scratches on images of metal tags
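The two core operations of the preprocessing script can be sketched as follows. The YOLO label format expects one line per box with a class ID followed by the box center and size, all normalized by the image dimensions. The input box layout (left, top, width, height in pixels), the split ratios, and the function names are assumptions made for this example, not the code from the sample repository.

```python
import random
from typing import Dict, List, Sequence


def to_yolo_line(class_id: int, left: float, top: float, width: float, height: float,
                 image_width: int, image_height: int) -> str:
    """Convert a pixel box (left, top, width, height) to a normalized YOLO label line."""
    x_center = (left + width / 2) / image_width
    y_center = (top + height / 2) / image_height
    return (f"{class_id} {x_center:.6f} {y_center:.6f} "
            f"{width / image_width:.6f} {height / image_height:.6f}")


def split_dataset(items: Sequence[str], train: float = 0.8, val: float = 0.1,
                  seed: int = 42) -> Dict[str, List[str]]:
    """Shuffle deterministically and split into training, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # the remainder goes to the test set
    }


print(to_yolo_line(0, left=100, top=50, width=200, height=100, image_width=800, image_height=400))
splits = split_dataset([f"tag_{i:03d}.jpg" for i in range(10)])
print({name: len(files) for name, files in splits.items()})
```

A fixed seed keeps the split reproducible between pipeline runs, which matters when comparing model versions trained on the same dataset version.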

Orchestrating model building

In Step 2, we bundle these scripts up into training and processing jobs and define the final SageMaker pipeline, which looks like the following figure.

SageMaker Pipeline

It consists of the following steps:

  1. A ProcessingStep to load the latest features from SageMaker Feature Store; split the dataset into training, validation, and test sets; and store the datasets as tarballs for training.
  2. A TrainingStep to train the model using the training, validation, and test datasets and export the mean Average Precision (mAP) metric for the model.
  3. A ConditionStep to evaluate if the mAP metric value of the trained model is above a configured threshold. If so, a RegisterModel step is run that registers the trained model in the SageMaker Model Registry.

If you are interested in the detailed pipeline code, check out the pipeline definition in our sample repository.

Training: Infrastructure deployment and integration into CI/CD

Now it’s time for Step 3: integration into the CI/CD workflow. Our CI/CD pipeline follows the same pattern illustrated in the labeling section before. We use the AWS CDK to deploy the required pipelines from CodePipeline. The only difference is that we use Amazon SageMaker Pipelines instead of Step Functions. The SageMaker pipeline definition is constructed and triggered as part of a CodeBuild action in CodePipeline.

Training CDK


We now have a fully automated labeling and model training workflow using SageMaker. We started by creating command line scripts from the experiment code. Then we used SageMaker Pipelines to orchestrate each of the model training workflow steps. The command line scripts were integrated as part of the training and processing steps. At the end of the pipeline, the trained model is versioned and registered in SageMaker Model Registry.

Check out Part 3 of this series, where we will take a closer look at the final step of our MLOps workflow. We will create the pipeline that compiles and deploys the model to an edge device using AWS IoT Greengrass!


This is Part 3 of our series where we design and implement an MLOps pipeline for visual quality inspection at the edge. In this post, we focus on how to automate the edge deployment part of the end-to-end MLOps pipeline. We show you how to use AWS IoT Greengrass to manage model inference at the edge and how to automate the process using AWS Step Functions and other AWS services.

Solution overview

In Part 1 of this series, we laid out an architecture for our end-to-end MLOps pipeline that automates the entire machine learning (ML) process, from data labeling to model training and deployment at the edge. In Part 2, we showed how to automate the labeling and model training parts of the pipeline.

The sample use case used for this series is a visual quality inspection solution that can detect defects on metal tags, which you can deploy as part of a manufacturing process. The following diagram shows the high-level architecture of the MLOps pipeline we defined in the beginning of this series. If you haven’t read it yet, we recommend checking out Part 1.

Architecture diagram

Automating the edge deployment of an ML model

After an ML model has been trained and evaluated, it needs to be deployed to a production system to generate business value by making predictions on incoming data. This process can quickly become complex in an edge setting where models need to be deployed and run on devices that are often located far away from the cloud environment in which the models have been trained. The following are some of the challenges unique to machine learning at the edge:

  • ML models often need to be optimized due to resource constraints on edge devices
  • Edge devices can’t simply be redeployed or replaced like a server in the cloud, so you need a robust model deployment and device management process
  • Communication between devices and the cloud needs to be efficient and secure because it often traverses untrusted low-bandwidth networks

Let’s see how we can tackle these challenges with AWS services in addition to exporting the model in the ONNX format, which allows us to, for example, apply optimizations like quantization to reduce the model size for constrained devices. The ONNX ecosystem also provides optimized runtimes, such as ONNX Runtime, for the most common edge hardware platforms.

Breaking the edge deployment process down, we require two components:

  • A deployment mechanism for the model delivery, which includes the model itself and some business logic to manage and interact with the model
  • A workflow engine that can orchestrate the whole process to make this robust and repeatable

In this example, we use different AWS services to build our automated edge deployment mechanism, which integrates all the required components we discussed.

Firstly, we simulate an edge device. To make it straightforward for you to go through the end-to-end workflow, we use an Amazon Elastic Compute Cloud (Amazon EC2) instance and install the AWS IoT Greengrass Core software on it. You can also use EC2 instances to validate the different components in a QA process before deploying to an actual edge production device.

AWS IoT Greengrass is an open-source Internet of Things (IoT) edge runtime and cloud service that helps you build, deploy, and manage edge device software in a secure and scalable way. After you install the AWS IoT Greengrass Core software on your device, you can add or remove features and components, and manage your IoT device applications using AWS IoT Greengrass. It offers many built-in components, such as the stream manager and MQTT broker components, which you can use to communicate securely with the cloud, supporting end-to-end encryption. You can use those features to upload inference results and images efficiently.

In a production environment, you would typically have an industrial camera delivering images for which the ML model should produce predictions. For our setup, we simulate this image input by uploading a preset of images into a specific directory on the edge device. We then use these images as inference input for the model.

We divided the overall deployment and inference process into three consecutive steps to deploy a cloud-trained ML model to an edge environment and use it for predictions:

  1. Prepare – Package the trained model for edge deployment.
  2. Deploy – Transfer the model and inference components from the cloud to the edge device.
  3. Inference – Load the model and run inference code for image predictions.

The following architecture diagram shows the details of this three-step process and how we implemented it with AWS services.

Inference Process

In the following sections, we discuss the details for each step and show how to embed this process into an automated and repeatable orchestration and CI/CD workflow for both the ML models and corresponding inference code.


Prepare

Edge devices often come with limited compute and memory compared to a cloud environment where powerful CPUs and GPUs can run ML models easily. Different model-optimization techniques allow you to tailor a model for a specific software or hardware platform to increase prediction speed, ideally with little or no loss of accuracy.

In this example, we exported the trained model in the training pipeline to the ONNX format for portability, possible optimizations, as well as optimized edge runtimes, and registered the model within Amazon SageMaker Model Registry. In this step, we create a new Greengrass model component including the latest registered model for subsequent deployment.
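A Greengrass model component is described by a recipe that points to the model artifact. The following sketch builds such a recipe as a Python dictionary. The component name, publisher, bucket path, and the mapping from a model registry version number to a `1.0.N` semantic version are assumptions made for illustration, and a real recipe may also need a lifecycle section and further platform settings.

```python
import json


def build_model_recipe(component_name: str, model_version: int, artifact_uri: str) -> dict:
    """Build a minimal component recipe that ships one model artifact."""
    return {
        "RecipeFormatVersion": "2020-01-25",
        "ComponentName": component_name,
        # Assumption: map the incremental registry version to a semantic version.
        "ComponentVersion": f"1.0.{model_version}",
        "ComponentDescription": "ONNX model for scratch detection (illustrative)",
        "ComponentPublisher": "Example",
        "Manifests": [
            {
                "Platform": {"os": "linux"},
                "Artifacts": [{"URI": artifact_uri}],
            }
        ],
    }


recipe = build_model_recipe(
    "com.example.ScratchModel",                # hypothetical component name
    7,                                         # hypothetical model registry version
    "s3://example-bucket/models/model.onnx",   # hypothetical artifact location
)
# The serialized recipe could then be registered, for example by passing it as
# inlineRecipe to create_component_version of the boto3 greengrassv2 client.
print(json.dumps(recipe, indent=2))
```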


Deploy

A secure and reliable deployment mechanism is key when deploying a model from the cloud to an edge device. Because AWS IoT Greengrass already incorporates a robust and secure edge deployment system, we’re using this for our deployment purposes. Before we look at our deployment process in detail, let’s do a quick recap on how AWS IoT Greengrass deployments work. At the core of the AWS IoT Greengrass deployment system are components, which define the software modules deployed to an edge device running AWS IoT Greengrass Core. These can either be private components that you build or public components that are provided either by AWS or the broader Greengrass community. Multiple components can be bundled together as part of a deployment. A deployment configuration defines the components included in a deployment and the deployment’s target devices. It can either be defined in a deployment configuration file (JSON) or via the AWS IoT Greengrass console when creating a new deployment.
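To illustrate what a deployment configuration contains, the sketch below assembles the request for a deployment that bundles two components for one target. The component names, versions, deployment name, and target ARN placeholders are hypothetical. The resulting dictionary has the shape accepted by `create_deployment` of the boto3 `greengrassv2` client, which is not called here so that the example runs without AWS credentials.

```python
from typing import Dict


def build_deployment_request(target_arn: str, component_versions: Dict[str, str]) -> dict:
    """Assemble a deployment request listing every component the target should run."""
    return {
        "targetArn": target_arn,
        "deploymentName": "quality-inspection-edge",  # hypothetical name
        "components": {
            name: {"componentVersion": version}
            for name, version in component_versions.items()
        },
    }


request = build_deployment_request(
    "arn:aws:iot:REGION:ACCOUNT_ID:thing/EdgeDevice",  # placeholder target
    {
        "com.example.ScratchModel": "1.0.7",       # hypothetical model component
        "com.example.ScratchInference": "1.2.0",   # hypothetical inference component
    },
)
# With credentials configured, this could be submitted as:
#   boto3.client("greengrassv2").create_deployment(**request)
print(request)
```

Because a deployment describes the full set of components for its target, a request that updates only the model version should still list the inference component.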

We create the following two Greengrass components, which are then deployed to the edge device via the deployment process:

  • Packaged model (private component) – This component contains the trained ML model in ONNX format.
  • Inference code (private component) – Aside from the ML model itself, we need to implement some application logic to handle tasks like data preparation, communication with the model for inference, and postprocessing of inference results. In our example, we’ve developed a Python-based private component to handle the following tasks:
    • Install the required runtime components like the Ultralytics YOLOv8 Python package.
    • Instead of taking images from a camera live stream, we simulate this by loading prepared images from a specific directory and preparing the image data according to the model input requirements.
    • Make inference calls against the loaded model with the prepared image data.
    • Check the predictions and upload inference results back to the cloud.

If you want to have a deeper look at the inference code we built, refer to the GitHub repo.


Inference

The model inference process on the edge device automatically starts after deployment of the aforementioned components is finished. The custom inference component periodically runs the ML model with images from a local directory. The inference result per image returned from the model is a tensor with the following content:

  • Confidence scores – How confident the model is regarding the detections
  • Object coordinates – The scratch object coordinates (x, y, width, height) detected by the model in the image

In our case, the inference component takes care of sending inference results to a specific MQTT topic on AWS IoT where it can be read for further processing. These messages can be viewed via the MQTT test client on the AWS IoT console for debugging. In a production setting, you can decide to automatically notify another system that takes care of removing faulty metal tags from the production line.
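The postprocessing step between the model output and the MQTT message can be sketched as follows. The row layout (x, y, width, height, confidence), the 0.5 threshold, and the message fields are assumptions made for this example. Publishing the payload to the topic is omitted, because it depends on the Greengrass messaging setup on the device.

```python
import json
from typing import Sequence, Tuple

Detection = Tuple[float, float, float, float, float]  # x, y, width, height, confidence


def build_message(image_name: str, detections: Sequence[Detection],
                  min_confidence: float = 0.5) -> str:
    """Drop weak detections and serialize the rest as a JSON payload."""
    kept = [
        {"x": x, "y": y, "width": w, "height": h, "confidence": round(conf, 3)}
        for x, y, w, h, conf in detections
        if conf >= min_confidence
    ]
    return json.dumps({"image": image_name, "defect_count": len(kept), "defects": kept})


payload = build_message(
    "tag_001.jpg",
    [(10, 20, 30, 5, 0.91), (50, 60, 8, 8, 0.20)],
)
print(payload)
```

A downstream system that removes faulty tags only needs `defect_count` to make its decision, while the coordinates remain available for debugging and retraining.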


As seen in the preceding sections, multiple steps are required to prepare and deploy an ML model, the corresponding inference code, and the required runtime or agent to an edge device. Step Functions is a fully managed service that allows you to orchestrate these dedicated steps and design the workflow in the form of a state machine. The serverless nature of this service and native Step Functions capabilities like AWS service API integrations allow you to quickly set up this workflow. Built-in capabilities like retries or logging are important points to build robust orchestrations. For more details regarding the state machine definition itself, refer to the GitHub repository or check the state machine graph on the Step Functions console after you deploy this example in your account.

Infrastructure deployment and integration into CI/CD

The CI/CD pipeline to integrate and build all the required infrastructure components follows the same pattern illustrated in Part 1 of this series. We use the AWS Cloud Development Kit (AWS CDK) to deploy the required pipelines from AWS CodePipeline.

Deployment CDK


There are multiple ways to build an architecture for an automated, robust, and secure ML model edge deployment system, which are often very dependent on the use case and other requirements. However, here are a few lessons learned that we would like to share with you:

  • Evaluate in advance if the additional AWS IoT Greengrass compute resource requirements fit your case, especially with constrained edge devices.
  • Establish a deployment mechanism that integrates a verification step of the deployed artifacts before running on the edge device to ensure that no tampering happened during transmission.
  • It’s good practice to keep the deployment components on AWS IoT Greengrass as modular and self-contained as possible so that you can deploy them independently. For example, if you have a relatively small inference code module but a large ML model, you don’t always want to deploy them both if only the inference code has changed. This is especially important when edge device connectivity is bandwidth-limited or expensive.
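The verification step in the second learning could look roughly like the following sketch, which compares the SHA-256 digest of a downloaded artifact with a digest published alongside it. This is one possible approach rather than the mechanism used in this series, and the function names are hypothetical. A plain digest only helps against tampering if the expected value reaches the device through a trusted channel; otherwise a cryptographic signature is needed.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file's digest matches the expected hex digest."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        artifact = Path(workdir) / "model.tar.gz"
        artifact.write_bytes(b"pretend model weights")
        published = sha256_of_file(artifact)  # stands in for the published digest

        print("untouched artifact accepted:", verify_artifact(artifact, published))

        artifact.write_bytes(b"tampered weights")
        print("modified artifact accepted:", verify_artifact(artifact, published))
```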


This concludes our three-part series on building an end-to-end MLOps pipeline for visual quality inspection at the edge. We looked at the additional challenges that come with deploying an ML model at the edge, such as model packaging and complex deployment orchestration. We implemented the pipeline in a fully automated way so we can put our models into production in a robust, secure, repeatable, and traceable fashion. Feel free to use the architecture and implementation developed in this series as a starting point for your next ML-enabled project. If you have any questions about how to architect and build such a system for your environment, please reach out. For other topics and use cases, refer to our Machine Learning and IoT blogs.

About the authors

Michael Roth is a Senior Solutions Architect at AWS supporting Manufacturing customers in Germany to solve their business challenges through AWS technology. Besides work and family he’s interested in sports cars and enjoys Italian coffee.

Jörg Wöhrle is a Solutions Architect at AWS, working with manufacturing customers in Germany. With a passion for automation, Jörg has worked as a software developer, DevOps engineer, and Site Reliability Engineer in his pre-AWS life. Beyond cloud, he’s an ambitious runner and enjoys quality time with his family. So if you have a DevOps challenge or want to go for a run: let him know.

Johannes Langer is a Senior Solutions Architect at AWS, working with enterprise customers in Germany. Johannes is passionate about applying machine learning to solve real business problems. In his personal life, Johannes enjoys working on home improvement projects and spending time outdoors with his family.

Google at ICCV 2023

Google is proud to be a Platinum Sponsor of the International Conference on Computer Vision (ICCV 2023), a premier biennial conference, which is being held this week in Paris, France. As a leader in computer vision research, Google has a strong presence at this year’s conference with 60 accepted papers and active involvement in 27 workshops and tutorials. Google is also proud to be a Platinum Sponsor for the LatinX in CV workshop. We look forward to sharing some of our extensive computer vision research and expanding our partnership with the broader research community.

Attending ICCV 2023? We hope you’ll visit the Google booth to chat with researchers who are actively pursuing the latest innovations in computer vision, and check out some of the scheduled booth activities (e.g., demos and Q&A sessions listed below). Visit the @GoogleAI Twitter account to find out more about the Google booth activities at ICCV 2023.

Take a look below to learn more about the Google research being presented at ICCV 2023 (Google affiliations in bold).

Board and Organizing Committee

General Chair: Cordelia Schmid

Finance Chair: Ramin Zabih

Industrial Relations Chair: Rahul Sukthankar

Publicity and Social Media Co-Chair: Boqing Gong

Google Research booth activities

Title: ImagenThings: Instant Personalized Image-to-Image Generation

Presenters: Xuhui Jia, Suraj Kothawade

Wednesday, October 4th at 12:30 PM CEST

Title: Open Images V7 (paper, dataset, blog post)

Presenters: Rodrigo Benenson, Jasper Uijlings, Jordi Pont-Tuset

Wednesday, October 4th at 3:30 PM CEST

Title: AI4Design (paper)

Presenters: Andrew Marmon, Peggy Chi, C.K. Ng

Thursday, October 5th at 10:30 AM CEST

Title: Preface: A Data-driven Volumetric Prior for Few-shot Ultra High-resolution Face Synthesis

Presenters: Marcel Bühler, Kripasindhu Sarkar

Thursday, October 5th at 12:30 PM CEST

Title: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Presenters: Yonatan Bitton

Thursday, October 5th at 1:00 PM CEST

Title: Image Search in Fact Check Explorer (blog post)

Presenters: Yair Alon, Avneesh Sud

Thursday, October 5th at 3:30 PM CEST

Title: UnLoc: A Unified Framework for Video Localization Tasks (paper)

Presenters: Arsha Nagrani, Xuehan Xiong

Friday, October 6th at 10:30 AM CEST

Title: Prompt-Tuning Latent Diffusion Models for Inverse Problems

Presenters: Hyungjin Chung

Friday, October 6th at 12:30 PM CEST

Title: Neural Implicit Representations for Real World Applications

Presenters: Federico Tombari, Fabian Manhardt, Marie-Julie Rakotosaona

Friday, October 6th at 3:30 PM CEST

Accepted papers

Multi-Modal Neural Radiance Field for Monocular Dense SLAM with a Light-Weight ToF Sensor

Xinyang Liu, Yijin Li, Yanbin Teng, Hujun Bao, Guofeng Zhang, Yinda Zhang, Zhaopeng Cui

ITI-GEN: Inclusive Text-to-Image Generation

Cheng Zhang, Xuanbai Chen, Siqi Chai, Chen Henry Wu, Dmitry Lagun, Thabo Beeler, Fernando De la Torre

ASIC: Aligning Sparse in-the-wild Image Collections

Kamal Gupta, Varun Jampani, Carlos Esteves, Abhinav Shrivastava, Ameesh Makadia, Noah Snavely, Abhishek Kar

VQ3D: Learning a 3D-Aware Generative Model on ImageNet

Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, Deqing Sun

Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities

Hexiang Hu, Yi Luan, Yang Chen*, Urvashi Khandelwal, Mandar Joshi, Kenton Lee, Kristina Toutanova, Ming-Wei Chang

Sigmoid Loss for Language Image Pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer

Tracking Everything Everywhere All at Once

Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li, Bharath Hariharan, Aleksander Holynski, Noah Snavely

Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields

Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, Peter Hedman

Delta Denoising Score

Amir Hertz*, Kfir Aberman, Daniel Cohen-Or*

DreamBooth3D: Subject-Driven Text-to-3D Generation

Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, Yuanzhen Li, Varun Jampani

Encyclopedic VQA: Visual Questions about Detailed Properties of Fine-grained Categories

Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel*, Felipe Cadar*, Howard Zhou, Fei Sha, André Araujo, Vittorio Ferrari

GECCO: Geometrically-Conditioned Point Diffusion Models

Michał J. Tyszkiewicz, Pascal Fua, Eduard Trulls

Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition

Qitong Wang, Long Zhao, Liangzhe Yuan, Ting Liu, Xi Peng

Neural Microfacet Fields for Inverse Rendering

Alexander Mai, Dor Verbin, Falko Kuester, Sara Fridovich-Keil

Rosetta Neurons: Mining the Common Units in a Model Zoo

Amil Dravid, Yossi Gandelsman, Alexei A. Efros, Assaf Shocher

Teaching CLIP to Count to Ten

Roni Paiss*, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, Tali Dekel

Vox-E: Text-guided Voxel Editing of 3D Objects

Etai Sella, Gal Fiebelman, Peter Hedman, Hadar Averbuch-Elor

CC3D: Layout-Conditioned Generation of Compositional 3D Scenes

Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Xingguang Yan, Gordon Wetzstein, Leonidas Guibas, Andrea Tagliasacchi

Delving into Motion-Aware Matching for Monocular 3D Object Tracking

Kuan-Chih Huang, Ming-Hsuan Yang, Yi-Hsuan Tsai

Generative Multiplane Neural Radiance for 3D-Aware Image Generation

Amandeep Kumar, Ankan Kumar Bhunia, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

M2T: Masking Transformers Twice for Faster Decoding

Fabian Mentzer, Eirikur Agustsson, Michael Tschannen

MULLER: Multilayer Laplacian Resizer for Vision

Zhengzhong Tu, Peyman Milanfar, Hossein Talebi

SVDiff: Compact Parameter Space for Diffusion Fine-Tuning

Ligong Han*, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, Feng Yang

Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond

Yang Zhao, Tingbo Hou, Yu-Chuan Su, Xuhui Jia, Yandong Li, Matthias Grundmann

Unified Visual Relationship Detection with Vision and Language Models

Long Zhao, Liangzhe Yuan, Boqing Gong, Yin Cui, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu

3D Motion Magnification: Visualizing Subtle Motions from Time-Varying Radiance Fields

Brandon Y. Feng, Hadi Alzayer, Michael Rubinstein, William T. Freeman, Jia-Bin Huang

Global Features are All You Need for Image Retrieval and Reranking

Shihao Shao, Kaifeng Chen, Arjun Karpur, Qinghua Cui, André Araujo, Bingyi Cao

Introducing Language Guidance in Prompt-Based Continual Learning

Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, Muhammad Zeshan Afzal

Multiscale Structure Guided Diffusion for Image Deblurring

Mengwei Ren*, Mauricio Delbracio, Hossein Talebi, Guido Gerig, Peyman Milanfar

Robust Monocular Depth Estimation under Challenging Conditions

Stefano Gasperini, Nils Morbitzer, HyunJun Jung, Nassir Navab, Federico Tombari

Score-Based Diffusion Models as Principled Priors for Inverse Imaging

Berthy T. Feng*, Jamie Smith, Michael Rubinstein, Huiwen Chang, Katherine L. Bouman, William T. Freeman

Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations

Nikolaos-Antonios Ypsilantis, Kaifeng Chen, Bingyi Cao, Mario Lipovsky, Pelin Dogan-Schonberger, Grzegorz Makosa, Boris Bluntschli, Mojtaba Seyedhosseini, Ondrej Chum, André Araujo

U-RED: Unsupervised 3D Shape Retrieval and Deformation for Partial Point Clouds

Yan Di, Chenyangguang Zhang, Ruida Zhang, Fabian Manhardt, Yongzhi Su, Jason Rambach, Didier Stricker, Xiangyang Ji, Federico Tombari

AvatarCraft: Transforming Text into Neural Human Avatars with Parameterized Shape and Pose Control

Ruixiang Jiang, Can Wang, Jingbo Zhang, Menglei Chai, Mingming He, Dongdong Chen, Jing Liao

Learning Versatile 3D Shape Generation with Improved AR Models

Simian Luo, Xuelin Qian, Yanwei Fu, Yinda Zhang, Ying Tai, Zhenyu Zhang, Chengjie Wang, Xiangyang Xue

Novel-view Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views

Wentian Qu, Zhaopeng Cui, Yinda Zhang, Chenyu Meng, Cuixia Ma, Xiaoming Deng, Hongan Wang

PreSTU: Pre-Training for Scene-Text Understanding

Jihyung Kil*, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, Radu Soricut

Self-supervised Learning of Implicit Shape Representation with Dense Correspondence for Deformable Objects

Baowen Zhang, Jiahe Li, Xiaoming Deng, Yinda Zhang, Cuixia Ma, Hongan Wang

Self-regulating Prompts: Foundational Model Adaptation without Forgetting

Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

Spectral Graphormer: Spectral Graph-Based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images

Tze Ho Elden Tse*, Franziska Mueller, Zhengyang Shen, Danhang Tang, Thabo Beeler, Mingsong Dou, Yinda Zhang, Sasa Petrovic, Hyung Jin Chang, Jonathan Taylor, Bardia Doosti

Synthesizing Diverse Human Motions in 3D Indoor Scenes

Kaifeng Zhao, Yan Zhang, Shaofei Wang, Thabo Beeler, Siyu Tang

Tracking by 3D Model Estimation of Unknown Objects in Videos

Denys Rozumnyi, Jiri Matas, Marc Pollefeys, Vittorio Ferrari, Martin R. Oswald

UnLoc: A Unified Framework for Video Localization Tasks

Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang*, Weina Ge, David Ross, Cordelia Schmid

Verbs in Action: Improving Verb Understanding in Video-language Models

Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid

VLSlice: Interactive Vision-and-Language Slice Discovery

Eric Slyman, Minsuk Kahng, Stefan Lee

Yes, we CANN: Constrained Approximate Nearest Neighbors for Local Feature-Based Visual Localization

Dror Aiger, André Araujo, Simon Lynen

Audiovisual Masked Autoencoders

Mariana-Iuliana Georgescu*, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

CLR: Channel-wise Lightweight Reprogramming for Continual Learning

Yunhao Ge, Yuecheng Li, Shuo Ni, Jiaping Zhao, Ming-Hsuan Yang, Laurent Itti

LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs

Zezhou Cheng*, Carlos Esteves, Varun Jampani, Abhishek Kar, Subhransu Maji, Ameesh Makadia

Multiscale Representation for Real-Time Anti-Aliasing Neural Rendering

Dongting Hu, Zhenkai Zhang, Tingbo Hou, Tongliang Liu, Huan Fu, Mingming Gong

Nerfbusters: Removing Ghostly Artifacts from Casually Captured NeRFs

Frederik Warburg, Ethan Weber, Matthew Tancik, Aleksander Holynski, Angjoo Kanazawa

Segmenting Known Objects and Unseen Unknowns without Prior Knowledge

Stefano Gasperini, Alvaro Marcos-Ramiro, Michael Schmidt, Nassir Navab, Benjamin Busam, Federico Tombari

SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection

Yichen Xie, Chenfeng Xu, Marie-Julie Rakotosaona, Patrick Rim, Federico Tombari, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan

SwiftFormer: Efficient Additive Attention for Transformer-Based Real-time Mobile Vision Applications

Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

Agile Modeling: From Concept to Classifier in Minutes

Otilia Stretcu, Edward Vendrow, Kenji Hata, Krishnamurthy Viswanathan, Vittorio Ferrari, Sasan Tavakkol, Wenlei Zhou, Aditya Avinash, Enming Luo, Neil Gordon Alldrin, MohammadHossein Bateni, Gabriel Berger, Andrew Bunner, Chun-Ta Lu, Javier A Rey, Giulia DeSalvo, Ranjay Krishna, Ariel Fuxman

CAD-Estate: Large-Scale CAD Model Annotation in RGB Videos

Kevis-Kokitsi Maninis, Stefan Popov, Matthias Niessner, Vittorio Ferrari

Counting Crowds in Bad Weather

Zhi-Kai Huang, Wei-Ting Chen, Yuan-Chun Chiang, Sy-Yen Kuo, Ming-Hsuan Yang

DreamPose: Fashion Video Synthesis with Stable Diffusion

Johanna Karras, Aleksander Holynski, Ting-Chun Wang, Ira Kemelmacher-Shlizerman

InfiniCity: Infinite-Scale City Synthesis

Chieh Hubert Lin, Hsin-Ying Lee, Willi Menapace, Menglei Chai, Aliaksandr Siarohin, Ming-Hsuan Yang, Sergey Tulyakov

SAMPLING: Scene-Adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image

Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang


Learning with Noisy and Unlabeled Data for Large Models beyond Categorization

Sifei Liu, Hongxu Yin, Shalini De Mello, Pavlo Molchanov, Jose M. Alvarez, Jan Kautz, Xiaolong Wang, Anima Anandkumar, Ming-Hsuan Yang, Trevor Darrell

Speaker: Varun Jampani


LatinX in AI

Platinum Sponsor

Panelists: Daniel Castro Chin, André Araujo

Invited Speaker: Irfan Essa

Volunteers: Ming-Hsuan Yang, Liangzhe Yuan, Pedro Velez, Vincent Etter

Scene Graphs and Graph Representation Learning

Organizer: Federico Tombari

International Workshop on Analysis and Modeling of Faces and Gestures

Speaker: Todd Zickler

3D Vision and Modeling Challenges in eCommerce

Speaker: Leonidas Guibas

BigMAC: Big Model Adaptation for Computer Vision

Organizer: Mathilde Caron

Adversarial Robustness In the Real World (AROW)

Organizer: Yutong Bai

GeoNet: 1st Workshop on Robust Computer Vision across Geographies

Speaker: Sara Beery

Organizer: Tarun Kalluri

Quo Vadis, Computer Vision?

Speaker: Bill Freeman

To NeRF or not to NeRF: A View Synthesis Challenge for Human Heads

Speaker: Thabo Beeler

Organizer: Stefanos Zafeiriou

New Ideas in Vision Transformers

Speaker: Cordelia Schmid

Organizer: Ming-Hsuan Yang

Representation Learning with Very Limited Images: The Potential of Self, Synthetic and Formula Supervision

Speaker: Manel Baradad Jurjo

Resource Efficient Deep Learning for Computer Vision

Speaker: Prateek Jain

Organizer: Jiahui Yu, Rishabh Tiwari, Jai Gupta

Computer Vision Aided Architectural Design

Speaker: Noah Snavely

AV4D: Visual Learning of Sounds in Spaces

Organizer: David Harwath

Vision-and-Language Algorithmic Reasoning

Speaker: François Chollet

Neural Fields for Autonomous Driving and Robotics

Speaker: Jon Barron

International Challenge on Compositional and Multimodal Perception

Organizer: Ranjay Krishna

Open-Vocabulary 3D Scene Understanding (OpenSUN3D)

Speaker: Thomas Funkhouser

Organizer: Francis Engelmann, Johanna Wald, Federico Tombari, Leonidas Guibas

Frontiers of Monocular 3D Perception: Geometric Foundation Models

Speaker: Leonidas Guibas

PerDream: PERception, Decision Making and REAsoning Through Multimodal Foundational Modeling

Organizer: Daniel McDuff

Recovering 6D Object Pose

Speaker: Fabian Manhardt, Martin Sundermeyer

Organizer: Martin Sundermeyer

Women in Computer Vision (WiCV)

Panelist: Arsha Nagrani

Language for 3D Scenes

Organizer: Leonidas Guibas

AI for 3D Content Creation

Speaker: Kai-Hung Chang

Organizer: Leonidas Guibas

Computer Vision for Metaverse

Speaker: Jon Barron, Thomas Funkhouser

Towards the Next Generation of Computer Vision Datasets

Speaker: Tom Duerig

* Work done while at Google

Announcing PyTorch Docathon H2 2023

We are excited to announce that we will be holding a Docathon for PyTorch on November 1, 2023! This event is an opportunity for our community to come together and improve the quality of our documentation.

During the Docathon, we will focus on updating and improving existing content, as well as adding new tutorials and docstrings. We encourage all members of the community to participate and contribute their expertise to make our documentation even better. This is a great opportunity to learn and collaborate together.

Check out our previous docathon success story here.

Why Participate

One of the best things about the Docathon is that you can make a tangible, positive impact on the quality of documentation in real time. This collaborative event brings together diverse team members from various companies, backgrounds, and roles, united to work towards a common goal. This event not only fosters team building and knowledge sharing but also presents an opportunity for individuals to acquire new skills, such as writing, editing, and utilizing documentation tools. Participating in a docathon can be particularly beneficial for team members who may lack experience in these areas.

And of course all participants will be recognized for their contributions. Top participants will receive special awards.

Event Details

  • Nov 1: Kick-off
  • Nov 1 – Nov 12: Submissions and Feedback
  • Nov 13 – Nov 15: Final Reviews
  • Nov 15: Winner Announcements

Details for the Docathon will be announced at the kick-off call on November 1.

To participate in the Docathon and receive updates about the event, register here: RSVP

We are excited to see the improvements that will come out of this Docathon, and we look forward to your participation!

When Does Optimizing a Proper Loss Yield Calibration?

Optimizing proper loss functions is popularly believed to yield predictors with good calibration properties, the intuition being that for such losses the global optimum is to predict the ground-truth probabilities, which is indeed calibrated. However, typical machine learning models are trained to approximately minimize loss over restricted families of predictors that are unlikely to contain the ground truth. Under what circumstances does optimizing a proper loss over a restricted family yield calibrated models? What precise calibration guarantees does it give? In this work, we provide a…

Apple Machine Learning Research
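The intuition behind proper losses can be checked numerically. The sketch below is an illustration and not part of the paper: for a binary outcome with true probability p, it searches a grid of predictions q for the one with the lowest expected loss. Log loss and the Brier score (both proper) are minimized at q = p, while absolute error (not proper) pushes the prediction toward an extreme. The function names and the grid resolution are choices made for this example.

```python
import math


def expected_log_loss(p: float, q: float) -> float:
    """Expected log loss when the outcome is 1 with probability p and we predict q."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))


def expected_brier(p: float, q: float) -> float:
    """Expected squared error (Brier score) for prediction q."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2


def expected_abs_error(p: float, q: float) -> float:
    """Expected absolute error; included as an example of a non-proper loss."""
    return p * (1 - q) + (1 - p) * q


def best_prediction(loss, p: float, steps: int = 999) -> float:
    """Grid-search the prediction q in (0, 1) that minimizes the expected loss."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]
    return min(grid, key=lambda q: loss(p, q))


if __name__ == "__main__":
    p = 0.3
    for name, loss in [
        ("log loss", expected_log_loss),
        ("Brier score", expected_brier),
        ("absolute error", expected_abs_error),
    ]:
        print(f"{name}: best q = {best_prediction(loss, p):.3f}")
```

This only covers the unrestricted case in which any prediction q is allowed; the question raised in the abstract is what remains of this guarantee once the predictor is confined to a restricted family.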