Fast and cost-effective LLaMA 2 fine-tuning with AWS Trainium

Large language models (LLMs) have captured the imagination and attention of developers, scientists, technologists, entrepreneurs, and executives across several industries. These models can be used for question answering, summarization, translation, and more in applications such as conversational agents for customer support, content creation for marketing, and coding assistants.

Recently, Meta released Llama 2 for both researchers and commercial entities, adding to the list of other LLMs, including MosaicML MPT and Falcon. In this post, we walk through how to fine-tune Llama 2 on AWS Trainium, a purpose-built accelerator for LLM training, to reduce training times and costs. We review the fine-tuning scripts provided by the AWS Neuron SDK (using NeMo Megatron-LM), the various configurations we used, and the throughput results we saw.

About the Llama 2 model

Similar to the previous Llama 1 model and other models like GPT, Llama 2 uses the decoder-only Transformer architecture. It comes in three sizes: 7 billion, 13 billion, and 70 billion parameters. Compared to Llama 1, Llama 2 doubles the context length from 2,000 to 4,000 tokens and uses grouped-query attention (only for 70B). Llama 2 pre-trained models are trained on 2 trillion tokens, and its fine-tuned models have been trained on over 1 million human annotations.

Distributed training of Llama 2

To accommodate Llama 2 at sequence lengths of 2,000 and 4,000, we implemented the training script using NeMo Megatron for Trainium, which supports data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP). Specifically, we added implementations of untied word embeddings, rotary embeddings, RMSNorm, and SwiGLU activation, and reused the generic GPT Neuron Megatron-LM script to support Llama 2 training.
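As a rough illustration of how the three parallelism degrees relate, the sketch below uses the common Megatron-style convention that the total number of workers equals TP × PP × DP. The function name and the figure of 32 workers on a single instance are assumptions made for this example, not values taken from the Neuron scripts.

```python
def data_parallel_degree(world_size, tensor_parallel, pipeline_parallel):
    """Derive the DP degree from the worker count and the model-parallel degrees.

    Assumes the usual convention: world_size = TP * PP * DP.
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by TP * PP")
    return world_size // model_parallel


# Hypothetical single instance exposing 32 workers, with TP=8 and PP=1.
dp = data_parallel_degree(world_size=32, tensor_parallel=8, pipeline_parallel=1)
print(f"data parallel degree: {dp}")
```

Under these assumptions, each group of 8 workers holds one sharded copy of the model, and the 4 groups process different batches in parallel.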

Our high-level training procedure is as follows. For the training environment, we use a multi-instance cluster managed by SLURM for distributed training and scheduling under the NeMo framework.

First, download the Llama 2 model and training datasets and preprocess them using the Llama 2 tokenizer. For example, to use the RedPajama dataset, use the following command:

wget https://data.together.xyz/redpajama-data-1T/v1.0.0/book/book.jsonl

python nemo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py

For detailed guidance on downloading the models and on the arguments of the preprocessing script, refer to Download LlamaV2 dataset and tokenizer.

Next, compile the model:

sbatch --nodes 4 compile.slurm ./llama_7b.sh

After the model is compiled, launch the training job with the following script that is already optimized with the best configuration and hyperparameters for Llama 2 (included in the example code):

sbatch --nodes 4 run.slurm ./llama_7b.sh

Lastly, we monitor TensorBoard to keep track of training progress:

tensorboard --logdir ./

For the complete example code and scripts we mentioned, refer to the Llama 7B tutorial and NeMo code in the Neuron SDK to walk through more detailed steps.

Fine-tuning experiments

We fine-tuned the 7B model on the OSCAR (Open Super-large Crawled ALMAnaCH coRpus) and QNLI (Question-answering NLI) datasets in a Neuron 2.12 environment (PyTorch). For each sequence length (2,000 and 4,000), we optimized some configurations, such as batchsize and gradient_accumulation, for training efficiency. As a fine-tuning strategy, we adopted full fine-tuning of all parameters (about 500 steps), which can be extended to pre-training with more steps and larger datasets (for example, 1T RedPajama). Sequence parallelism can also be enabled to allow NeMo Megatron to successfully fine-tune models with a larger sequence length of 4,000. The following table shows the configuration and throughput results of the Llama 7B fine-tuning experiment. The throughput scales almost linearly as the number of instances increases up to 4.

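| Distributed Library | Dataset | Sequence Length | Number of Instances | Tensor Parallel | Data Parallel | Pipeline Parallel | Global Batch Size | Throughput (seq/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Neuron NeMo Megatron | OSCAR | 4096 | 1 | 8 | 4 | 1 | 256 | 3.7 |
| Neuron NeMo Megatron | OSCAR | 4096 | 2 | 8 | 4 | 1 | 256 | 7.4 |
| Neuron NeMo Megatron | OSCAR | 4096 | 4 | 8 | 4 | 1 | 256 | 14.6 |
| Neuron NeMo Megatron | QNLI | 4096 | 4 | 8 | 4 | 1 | 256 | 14.1 |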
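
To make the near-linear scaling concrete, the following sketch converts the OSCAR throughput figures from the table into tokens per second and computes scaling efficiency relative to a single instance. The helper name is ours, and efficiency is defined here as measured speedup divided by ideal linear speedup, which is one common convention rather than a metric reported in the experiments.

```python
# Throughput (seq/s) for the OSCAR rows of the table, keyed by instance count.
throughput = {1: 3.7, 2: 7.4, 4: 14.6}
SEQ_LEN = 4096  # sequence length used in every row of the table


def scaling_efficiency(instances, base=1):
    """Measured speedup divided by the ideal (linear) speedup."""
    speedup = throughput[instances] / throughput[base]
    return speedup / (instances / base)


for n, seq_per_s in throughput.items():
    tokens_per_s = seq_per_s * SEQ_LEN
    print(f"{n} instance(s): {tokens_per_s:,.0f} tokens/s, "
          f"efficiency {scaling_efficiency(n):.1%}")
```

An efficiency close to 100% at 4 instances is what "almost linearly" means in practice: quadrupling the hardware nearly quadruples the sequences processed per second.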

The last step is to verify the accuracy with the base model. We implemented a reference script for GPU experiments and confirmed the training curves for GPU and Trainium matched as shown in the following figure. The figure illustrates loss curves over the number of training steps on the QNLI dataset. Mixed-precision was adopted for GPU (blue), and bf16 with default stochastic rounding for Trainium (orange).

Training Curve
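
For readers unfamiliar with stochastic rounding, the sketch below shows the general idea on a single value: when a float32 is narrowed to bf16 (which keeps the top 16 bits of the float32 pattern), the value is rounded up or down at random, with probability proportional to the discarded bits, so the result is correct on average. This is a generic illustration written for this post; it is not the Neuron implementation, and it ignores NaN and overflow handling.

```python
import random
import struct


def float_to_bits(x):
    """Return the float32 bit pattern of x as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits):
    """Interpret an unsigned 32-bit integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_truncate(x):
    """Narrow to bf16 by dropping the low 16 bits (always rounds toward zero)."""
    return bits_to_float(float_to_bits(x) & 0xFFFF0000)


def bf16_stochastic_round(x, rng):
    """Narrow to bf16, rounding up with probability equal to the dropped fraction."""
    bits = (float_to_bits(x) + rng.getrandbits(16)) & 0xFFFFFFFF
    return bits_to_float(bits & 0xFFFF0000)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
x = 1.001
samples = [bf16_stochastic_round(x, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(f"truncated: {bf16_truncate(x)}, stochastic mean: {mean:.5f}")
```

Truncation always returns the lower neighbor, so small updates can be lost systematically, whereas the average of the stochastically rounded samples stays close to the original value.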

Conclusion

In this post, we showed that Trainium delivers high-performance, cost-effective fine-tuning of Llama 2. For more resources on using Trainium for distributed pre-training and fine-tuning your generative AI models using NeMo Megatron, refer to AWS Neuron Reference for NeMo Megatron.


About the Authors

Hao Zhou is a Research Scientist with Amazon SageMaker. Before that, he worked on developing machine learning methods for fraud detection for Amazon Fraud Detector. He is passionate about applying machine learning, optimization, and generative AI techniques to various real-world problems. He holds a PhD in Electrical Engineering from Northwestern University.

Karthick Gopalswamy is an Applied Scientist with AWS. Before AWS, he worked as a scientist in Uber and Walmart Labs with a major focus on mixed integer optimization. At Uber, he focused on optimizing the public transit network with on-demand SaaS products and shared rides. At Walmart Labs, he worked on pricing and packing optimizations. Karthick has a PhD in Industrial and Systems Engineering with a minor in Operations Research from North Carolina State University. His research focuses on models and methodologies that combine operations research and machine learning.

Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Youngsuk Park is a Sr. Applied Scientist at AWS Annapurna Labs, working on developing and training foundation models on AI accelerators. Prior to that, Dr. Park worked on R&D for Amazon Forecast in AWS AI Labs as a lead scientist. His research lies in the interplay between machine learning, foundational models, optimization, and reinforcement learning. He has published over 20 peer-reviewed papers in top venues, including ICLR, ICML, AISTATS, and KDD, with the service of organizing workshop and presenting tutorials in the area of time series and LLM training. Before joining AWS, he obtained a PhD in Electrical Engineering from Stanford University.

Yida Wang is a principal scientist in the AWS AI team of Amazon. His research interest is in systems, high-performance computing, and big data analytics. He currently works on deep learning systems, with a focus on compiling and optimizing deep learning models for efficient training and inference, especially large-scale foundation models. The mission is to bridge the high-level models from various frameworks and low-level hardware platforms including CPUs, GPUs, and AI accelerators, so that different models can run in high performance on different devices.

Jun (Luke) Huan is a Principal Scientist at AWS AI Labs. Dr. Huan works on AI and Data Science. He has published more than 160 peer-reviewed papers in leading conferences and journals and has graduated 11 PhD students. He was a recipient of the NSF Faculty Early Career Development Award in 2009. Before joining AWS, he worked at Baidu Research as a distinguished scientist and the head of Baidu Big Data Laboratory. He founded StylingAI Inc., an AI start-up, and worked as the CEO and Chief Scientist in 2019–2021. Before joining the industry, he was the Charles E. and Mary Jane Spahr Professor in the EECS Department at the University of Kansas. From 2015–2018, he worked as a program director at the US NSF in charge of its big data program.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Simplify medical image classification using Amazon SageMaker Canvas

Analyzing medical images plays a crucial role in diagnosing and treating diseases. The ability to automate this process using machine learning (ML) techniques allows healthcare professionals to more quickly diagnose certain cancers, coronary diseases, and ophthalmologic conditions. However, one of the key challenges faced by clinicians and researchers in this field is the time-consuming and complex nature of building ML models for image classification. Traditional methods require coding expertise and extensive knowledge of ML algorithms, which can be a barrier for many healthcare professionals.

To address this gap, we used Amazon SageMaker Canvas, a visual tool that allows medical clinicians to build and deploy ML models without coding or specialized knowledge. This user-friendly approach eliminates the steep learning curve associated with ML, which frees up clinicians to focus on their patients.

Amazon SageMaker Canvas provides a drag-and-drop interface for creating ML models. Clinicians can select the data they want to use, specify the desired output, and then watch as it automatically builds and trains the model. Once the model is trained, it generates accurate predictions.

This approach is ideal for medical clinicians who want to use ML to improve their diagnosis and treatment decisions. With Amazon SageMaker Canvas, they can use the power of ML to help their patients, without needing to be an ML expert.

Medical image classification directly impacts patient outcomes and healthcare efficiency. Timely and accurate classification of medical images allows for early detection of diseases, which aids in effective treatment planning and monitoring. Moreover, the democratization of ML through accessible interfaces like Amazon SageMaker Canvas enables a broader range of healthcare professionals, including those without extensive technical backgrounds, to contribute to the field of medical image analysis. This inclusive approach fosters collaboration and knowledge sharing and ultimately leads to advancements in healthcare research and improved patient care.

In this post, we’ll explore the capabilities of Amazon SageMaker Canvas in classifying medical images, discuss its benefits, and highlight real-world use cases that demonstrate its impact on medical diagnostics.

Use case

Skin cancer is a serious and potentially deadly disease, and the earlier it is detected, the better the chance of successful treatment. Statistically, skin cancer (for example, basal and squamous cell carcinomas) is one of the most common cancer types and leads to hundreds of thousands of deaths worldwide each year. It manifests itself through the abnormal growth of skin cells.

However, early diagnosis drastically increases the chances of recovery. Moreover, it may render surgical, radiographic, or chemotherapeutic therapies unnecessary or lessen their overall usage, helping to reduce healthcare costs.

The process of diagnosing skin cancer starts with a procedure called a dermoscopy[1], which inspects the general shape, size, and color characteristics of skin lesions. Suspected lesions then undergo further sampling and histological tests for confirmation of the cancer cell type. Doctors use multiple methods to detect skin cancer, starting with visual detection. The American Center for the Study of Dermatology developed a guide for the possible shape of melanoma, which is called ABCD (asymmetry, border, color, diameter) and is used by doctors for initial screening of the disease. If a suspected skin lesion is found, then the doctor takes a biopsy of the visible lesion on the skin and examines it microscopically for a benign or malignant diagnosis and the type of skin cancer. Computer vision models can play a valuable role in helping to identify suspicious moles or lesions, which enables earlier and more accurate diagnosis.

Creating a cancer detection model is a multi-step process, as outlined below:

  1. Gather a large dataset of images from healthy skin and skin with various types of cancerous or precancerous lesions. This dataset needs to be carefully curated to ensure accuracy and consistency.
  2. Use computer vision techniques to preprocess the images and extract relevant features to differentiate between healthy and cancerous skin.
  3. Train an ML model on the preprocessed images, using a supervised learning approach to teach the model to distinguish between different skin types.
  4. Evaluate the performance of the model using a variety of metrics, such as precision and recall, to ensure that it accurately identifies cancerous skin and minimizes false positives.
  5. Integrate the model into a user-friendly tool that could be used by dermatologists and other healthcare professionals to aid in the detection and diagnosis of skin cancer.

Overall, the process of developing a skin cancer detection model from scratch typically requires significant resources and expertise. This is where Amazon SageMaker Canvas can help reduce the time and effort required for steps 2–5.
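
To make the evaluation step (step 4) concrete, here is a minimal sketch of how precision and recall are computed for one class from a list of true and predicted labels. The function name and the six example labels are made up for illustration; they are not results from the model built in this post.

```python
def precision_recall(actual, predicted, positive):
    """Compute precision and recall for one class treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: "mel" (melanoma) is the class we care most about catching.
actual = ["mel", "mel", "nv", "nv", "mel", "nv"]
predicted = ["mel", "nv", "nv", "mel", "mel", "nv"]

precision, recall = precision_recall(actual, predicted, positive="mel")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Recall drops whenever a true melanoma is predicted as something else, which is why it is the metric to watch when a missed detection is the costlier mistake.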

Solution overview

To demonstrate the creation of a skin cancer computer vision model without writing any code, we use a dermatoscopy skin cancer image dataset published by Harvard Dataverse. We use the dataset, which can be found at HAM10000 and consists of 10,015 dermatoscopic images, to build a skin cancer classification model that predicts skin cancer classes. A few key points about the dataset:

  • The dataset serves as a training set for academic ML purposes.
  • It includes a representative collection of all important diagnostic categories in the realm of pigmented lesions.
  • A few categories in the dataset are: Actinic keratoses and intraepithelial carcinoma / Bowen’s disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc)
  • More than 50% of the lesions in the dataset are confirmed through histopathology (histo).
  • The ground truth for the rest of the cases is determined through follow-up examination (follow_up), expert consensus (consensus), or confirmation by in vivo confocal microscopy (confocal).
  • The dataset includes lesions with multiple images, which can be tracked using the lesion_id column within the HAM10000_metadata file.

We showcase how to simplify image classification for multiple skin cancer categories without writing any code using Amazon SageMaker Canvas. Given an image of a skin lesion, SageMaker Canvas image classification automatically classifies an image into benign or possible cancer.

Prerequisites

  • Access to an AWS account with permissions to create the resources described in the steps section.
  • An AWS Identity and Access Management (AWS IAM) user with full permissions to use Amazon SageMaker.

Walkthrough

  1. Set up SageMaker domain
    1. Create an Amazon SageMaker domain using steps outlined here.
    2. Download the HAM10000 dataset.
  2. Set up datasets
    1. Create an Amazon Simple Storage Service (Amazon S3) bucket with a unique name, such as image-classification-<ACCOUNT_ID>, where ACCOUNT_ID is your AWS account number.

      Figure 1 Creating bucket

    2. In this bucket create two folders: training-data and test-data.

      Figure 2 Create folders

    3. Under training-data, create seven folders for each of the skin cancer categories identified in the dataset: akiec, bcc, bkl, df, mel, nv, and vasc.

      Figure 3 Folder View

    4. The dataset includes lesions with multiple images, which can be tracked by the lesion_id column within the HAM10000_metadata file. Using the lesion_id column, copy the corresponding images into the matching category folder (for example, you may start with 100 images for each category).

      Figure 4 Listing Objects to import (Sample Images)

  3. Use Amazon SageMaker Canvas
    1. Go to the Amazon SageMaker service in the console and select Canvas from the list. Once you are on the Canvas page, choose the Open Canvas button.

      Figure 5 Navigate to Canvas

    2. Once you are on the Canvas page, select My models and then choose New Model on the right of your screen.

      Figure 6 Creation of Model

    3. A pop-up window opens, where we enter image_classify as the model name and select Image analysis under Problem type.
  4. Import the dataset
    1. On the next page, please select Create dataset and in the pop-up box name the dataset as image_classify and select the Create button.

      Figure 7 Creating dataset

    2. On the next page, change the Data Source to Amazon S3. You can also directly upload the images (i.e., Local upload).

      Figure 8 Import Dataset from S3 buckets

    3. When you select Amazon S3, you’ll get the list of buckets present in your account. Select the parent bucket that holds the dataset in subfolders (for example, image-classify-2023) and choose the Import data button. This allows Amazon SageMaker Canvas to quickly label the images based on the folder names.
    4. Once the dataset is successfully imported, you’ll see the value in the Status column change from Processing to Ready.
    5. Now select your dataset by choosing Select dataset at the bottom of your page.
  5. Build your model
    1. On the Build page, you should see your data imported and labelled as per the folder name in Amazon S3.

      Figure 9 Labelling of Amazon S3 data

    2. Select the Quick build button (highlighted in red in the following image) and you’ll see two options to build the model: Quick build and Standard build. As the name suggests, Quick build favors speed over accuracy and takes around 15 to 30 minutes to build the model. Standard build prioritizes accuracy over speed, with model building taking from 45 minutes to 4 hours to complete. It runs experiments using different combinations of hyperparameters, generates many models in the backend (using SageMaker Autopilot functionality), and then picks the best model.
    3. Select Standard build to start building the model. It takes around 2–5 hours to complete.

      Figure 10 Doing Standard build

    4. Once the model build is complete, you can see an estimated accuracy as shown in Figure 11.

      Figure 11 Model prediction

    5. Select the Scoring tab for insights into the model accuracy. You can also select the Advanced metrics button on the Scoring tab to view the precision, recall, and F1 score (a balanced measure of accuracy that takes class balance into account).
    6. The advanced metrics that Amazon SageMaker Canvas shows you depend on whether your model performs numeric, categorical, image, text, or time series forecasting predictions on your data. Categorical prediction, such as 2-category or 3-category prediction, refers to the mathematical concept of classification. In this case, we believe recall is more important than precision because missing a cancer is far more dangerous than raising a false alarm. Recall is the fraction of true positives (TP) out of all actual positives (TP + false negatives), that is, the proportion of positive instances that the model correctly predicted as positive. For a deep dive on the advanced metrics, refer to A deep dive into Amazon SageMaker Canvas advanced metrics.

      Figure 12 Advanced metrics

      This completes the model creation step in Amazon SageMaker Canvas.

  6. Test your model
    1. You can now choose the Predict button, which takes you to the Predict page, where you can upload your own images through Single prediction or Batch prediction. Please set the option of your choice and select Import to upload your image and test the model.

      Figure 13 Test your own images

    2. Let’s start by doing a single image prediction. Make sure you are on the Single prediction tab and choose Import image. This takes you to a dialog box where you can choose to upload your image from Amazon S3 or do a Local upload. In our case, we select Amazon S3, browse to the directory that holds the test images, and select any image. Then select Import data.

      Figure 14 Single Image Prediction

    3. Once selected, the screen shows Generating prediction results. You should have your results in a few minutes as shown below.
    4. Now let’s try the Batch prediction. Select Batch prediction under Run predictions and select the Import new dataset button and name it BatchPrediction and hit the Create button.

      Figure 15 Single image prediction results

    5. On the next window, make sure you have selected Amazon S3 upload and browse to the directory where we have our test set and select the Import data button.

      Figure 16 Batch Image Prediction

    6. Once the images are in Ready status, select the radio button for the created dataset and choose Generate predictions. The status of the batch prediction changes to Generating predictions. Wait a few minutes for the results.
    7. Once the status is Ready, choose the dataset name to open a page showing the detailed predictions for all the images.

      Figure 17 Batch image prediction results

    8. Batch prediction also lets you verify the results and download the predictions as a zip or CSV file for further use or sharing.

      Figure 18 Download prediction

With this, you have successfully created a model, trained it, and tested its predictions with Amazon SageMaker Canvas.
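
Step 2 of the walkthrough sorts images into per-category folders by hand, and that sort could also be scripted. The sketch below builds a copy plan from the metadata file. It assumes the file is a CSV whose diagnosis and image identifier columns are named dx and image_id (only lesion_id is named in this post), and it keeps one image per lesion, which is a design choice of this example rather than a requirement. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def plan_training_folders(metadata_csv, per_class=100):
    """Return {category: [image_id, ...]} with at most per_class images each."""
    plan = defaultdict(list)
    seen_lesions = set()
    for row in csv.DictReader(io.StringIO(metadata_csv)):
        # Keep one image per lesion so near-duplicates do not crowd a folder.
        if row["lesion_id"] in seen_lesions:
            continue
        if len(plan[row["dx"]]) < per_class:
            plan[row["dx"]].append(row["image_id"])
            seen_lesions.add(row["lesion_id"])
    return dict(plan)


# Made-up rows using the assumed column names.
sample = """lesion_id,image_id,dx
HAM_1,ISIC_1,bkl
HAM_1,ISIC_2,bkl
HAM_2,ISIC_3,mel
"""

plan = plan_training_folders(sample)
for category, image_ids in plan.items():
    for image_id in image_ids:
        # Target key under the training-data folder created in step 2.
        print(f"training-data/{category}/{image_id}.jpg")
```

The printed keys match the training-data/<category> folder layout that SageMaker Canvas uses to label images on import.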

Cleaning up

Choose Log out in the left navigation pane to log out of the Amazon SageMaker Canvas application to stop the consumption of SageMaker Canvas workspace instance hours and release all resources.

Citation

[1]Fraiwan M, Faouri E. On the Automatic Detection and Classification of Skin Cancer Using Deep Transfer Learning. Sensors (Basel). 2022 Jun 30;22(13):4963. doi: 10.3390/s22134963. PMID: 35808463; PMCID: PMC9269808.

Conclusion

In this post, we showed you how medical image analysis using ML techniques can expedite the diagnosis of skin cancer, and its applicability to diagnosing other diseases. However, building ML models for image classification is often complex and time-consuming, requiring coding expertise and ML knowledge. Amazon SageMaker Canvas addressed this challenge by providing a visual interface that eliminates the need for coding or specialized ML skills. This empowers healthcare professionals to use ML without a steep learning curve, allowing them to focus on patient care.

The traditional process of developing a cancer detection model is cumbersome and time-consuming. It involves gathering a curated dataset, preprocessing images, training an ML model, evaluating its performance, and integrating it into a user-friendly tool for healthcare professionals. Amazon SageMaker Canvas simplified the steps from preprocessing to integration, which reduced the time and effort required for building a skin cancer detection model.

In this post, we delved into the powerful capabilities of Amazon SageMaker Canvas in classifying medical images, shedding light on its benefits and presenting real-world use cases that showcase its profound impact on medical diagnostics. One such compelling use case we explored was skin cancer detection and how early diagnosis often significantly enhances treatment outcomes and reduces healthcare costs.

It is important to acknowledge that the accuracy of the model can vary depending on factors, such as the size of the training dataset and the specific type of model employed. These variables play a role in determining the performance and reliability of the classification results.

Amazon SageMaker Canvas can serve as an invaluable tool that assists healthcare professionals in diagnosing diseases with greater accuracy and efficiency. However, it is vital to note that it isn’t intended to replace the expertise and judgment of healthcare professionals. Rather, it empowers them by augmenting their capabilities and enabling more precise and expedient diagnoses. The human element remains essential in the decision-making process, and the collaboration between healthcare professionals and artificial intelligence (AI) tools, including Amazon SageMaker Canvas, is pivotal in providing optimal patient care.


About the authors

 Ramakant Joshi is an AWS Solutions Architect, specializing in the analytics and serverless domain. He has a background in software development and hybrid architectures, and is passionate about helping customers modernize their cloud architecture.

Jake Wen is a Solutions Architect at AWS, driven by a passion for Machine Learning, Natural Language Processing, and Deep Learning. He assists Enterprise customers in achieving modernization and scalable deployment in the Cloud. Beyond the tech world, Jake finds delight in skateboarding, hiking, and piloting air drones.

Sonu Kumar Singh is an AWS Solutions Architect, with a specialization in analytics domain. He has been instrumental in catalyzing transformative shifts in organizations by enabling data-driven decision-making thereby fueling innovation and growth. He enjoys it when something he designed or created brings a positive impact. At AWS his intention is to help customers extract value out of AWS’s 200+ cloud services and empower them in their cloud journey.

Dariush Azimi is a Solution Architect at AWS, with specialization in Machine Learning, Natural Language Processing (NLP), and microservices architecture with Kubernetes. His mission is to empower organizations to harness the full potential of their data through comprehensive end-to-end solutions encompassing data storage, accessibility, analysis, and predictive capabilities.

Create an HCLS document summarization application with Falcon using Amazon SageMaker JumpStart

Healthcare and life sciences (HCLS) customers are adopting generative AI as a tool to get more from their data. Use cases include document summarization to help readers focus on key points of a document and transforming unstructured text into standardized formats to highlight important attributes. With unique data formats and strict regulatory requirements, customers are looking for choices to select the most performant and cost-effective model, as well as the ability to perform necessary customization (fine-tuning) to fit their business use case. In this post, we walk you through deploying a Falcon large language model (LLM) using Amazon SageMaker JumpStart and using the model to summarize long documents with LangChain and Python.

Solution overview

Amazon SageMaker is built on Amazon’s two decades of experience developing real-world ML applications, including product recommendations, personalization, intelligent shopping, robotics, and voice-assisted devices. SageMaker is a HIPAA-eligible managed service that provides tools that enable data scientists, ML engineers, and business analysts to innovate with ML. Within SageMaker is Amazon SageMaker Studio, an integrated development environment (IDE) purpose-built for collaborative ML workflows, which, in turn, contains a wide variety of quickstart solutions and pre-trained ML models in an integrated hub called SageMaker JumpStart. With SageMaker JumpStart, you can use pre-trained models, such as the Falcon LLM, with pre-built sample notebooks and SDK support to experiment with and deploy these powerful transformer models. You can use SageMaker Studio and SageMaker JumpStart to deploy and query your own generative model in your AWS account.

You can also ensure that the inference payload data doesn’t leave your VPC. You can provision models as single-tenant endpoints and deploy them with network isolation. Furthermore, you can curate and manage the selected set of models that satisfy your own security requirements by using the private model hub capability within SageMaker JumpStart and storing the approved models there. SageMaker is in scope for HIPAA BAA, SOC 1/2/3, and HITRUST CSF.

The Falcon LLM is a large language model, trained by researchers at Technology Innovation Institute (TII) on over 1 trillion tokens using AWS. Falcon has many different variations, with its two main constituents Falcon 40B and Falcon 7B, comprised of 40 billion and 7 billion parameters, respectively, with fine-tuned versions trained for specific tasks, such as following instructions. Falcon performs well on a variety of tasks, including text summarization, sentiment analysis, question answering, and conversing. This post provides a walkthrough that you can follow to deploy the Falcon LLM into your AWS account, using a managed notebook instance through SageMaker JumpStart to experiment with text summarization.

The SageMaker JumpStart model hub includes complete notebooks to deploy and query each model. As of this writing, there are six versions of Falcon available in the SageMaker JumpStart model hub: Falcon 40B Instruct BF16, Falcon 40B BF16, Falcon 180B BF16, Falcon 180B Chat BF16, Falcon 7B Instruct BF16, and Falcon 7B BF16. This post uses the Falcon 7B Instruct model.

In the following sections, we show how to get started with document summarization by deploying Falcon 7B on SageMaker JumpStart.

Prerequisites

For this tutorial, you’ll need an AWS account with a SageMaker domain. If you don’t already have a SageMaker domain, refer to Onboard to Amazon SageMaker Domain to create one.

Deploy Falcon 7B using SageMaker JumpStart

To deploy your model, complete the following steps:

  1. Navigate to your SageMaker Studio environment from the SageMaker console.
  2. Within the IDE, under SageMaker JumpStart in the navigation pane, choose Models, notebooks, solutions.
  3. Deploy the Falcon 7B Instruct model to an endpoint for inference.

Choosing Falcon-7B-Instruct from SageMaker JumpStart

This will open the model card for the Falcon 7B Instruct BF16 model. On this page, you can find the Deploy or Train options as well as links to open the sample notebooks in SageMaker Studio. This post will use the sample notebook from SageMaker JumpStart to deploy the model.

  4. Choose Open notebook.

SageMaker JumpStart Model Deployment Page

  5. Run the first four cells of the notebook to deploy the Falcon 7B Instruct endpoint.

You can see your deployed JumpStart models on the Launched JumpStart assets page.

  6. In the navigation pane, under SageMaker JumpStart, choose Launched JumpStart assets.
  7. Choose the Model endpoints tab to view the status of your endpoint.

SageMaker JumpStart Launched Model Page

With the Falcon LLM endpoint deployed, you are ready to query the model.

Run your first query

To run a query, complete the following steps:

  1. On the File menu, choose New and Notebook to open a new notebook.

You can also download the completed notebook here.

Create SageMaker Studio notebook

  1. Select the image, kernel, and instance type when prompted. For this post, we choose the Data Science 3.0 image, Python 3 kernel, and ml.t3.medium instance.

Setting SageMaker Studio Notebook Kernel

  1. Import the Boto3 and JSON modules by entering the following two lines into the first cell:
import json
import boto3
  1. Press Shift + Enter to run the cell.
  2. Next, you can define a function that will call your endpoint. This function takes a dictionary payload and uses it to invoke the SageMaker runtime client. Then it deserializes the response and prints the input and generated text.
newline, bold, unbold = '\n', '\033[1m', '\033[0m'
endpoint_name = 'ENDPOINT_NAME'  # replace with the name of your deployed endpoint

def query_endpoint(payload):
    # Invoke the endpoint through the SageMaker runtime client
    client = boto3.client('sagemaker-runtime')
    response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=json.dumps(payload).encode('utf-8'))
    # Deserialize the response and pull out the generated text
    model_predictions = json.loads(response['Body'].read())
    generated_text = model_predictions[0]['generated_text']
    print(
        f"Input Text: {payload['inputs']}{newline}"
        f"Generated Text: {bold}{generated_text}{unbold}{newline}")
    return generated_text

The payload includes the prompt as inputs, together with the inference parameters that will be passed to the model.

  1. You can use these parameters with the prompt to tune the output of the model for your use case:
payload = {
    "inputs": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "parameters":{
        "max_new_tokens": 50,
        "return_full_text": False,
        "do_sample": True,
        "top_k":10
        }
}

Query with a summarization prompt

This post uses a sample research paper to demonstrate summarization. The example paper concerns automatic text summarization in biomedical literature. Complete the following steps:

  1. Download the PDF and copy the text into a file named document.txt.
  2. In SageMaker Studio, choose the upload icon and upload the file to your SageMaker Studio instance.

Uploading File to SageMaker Studio

Out of the box, the Falcon LLM provides support for text summarization.

  1. Let’s create a function that uses prompt engineering techniques to summarize document.txt:
def summarize(text_to_summarize):
    # Use an f-string so the document text is actually substituted into the prompt
    summarization_prompt = f"""Process the following text and then perform the instructions that follow:

{text_to_summarize}

Provide a short summary of the preceding text.

Summary:"""
    payload = {
        "inputs": summarization_prompt,
        "parameters":{
            "max_new_tokens": 150,
            "return_full_text": False,
            "do_sample": True,
            "top_k":10
            }
    }
    # query_endpoint prints the input and the generated summary
    query_endpoint(payload)
    
with open("document.txt") as f:
    text_to_summarize = f.read()

summarize(text_to_summarize)

You’ll notice that for longer documents, an error appears. Falcon, like other LLMs, has a limit on the number of tokens that can be passed as input. We can get around this limit using LangChain’s summarization chains, which allow a much larger input to be processed by the LLM.

Import and run a summarization chain

LangChain is an open-source software library that allows developers and data scientists to quickly build, tune, and deploy custom generative applications without managing complex ML interactions. It is commonly used to abstract many of the common use cases for generative AI language models into just a few lines of code. LangChain’s support for AWS services includes support for SageMaker endpoints.

LangChain provides an accessible interface to LLMs. Its features include tools for prompt templating and prompt chaining. These chains can be used to summarize text documents that are longer than what the language model supports in a single call. You can use a map-reduce strategy to summarize a long document by breaking it down into manageable chunks, summarizing each chunk, and combining the summaries (which are summarized again, if needed).
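To make the map-reduce idea concrete, here is a minimal sketch in plain Python that does not use LangChain. The names split_into_chunks and map_reduce_summarize are hypothetical, character counts stand in for token counts, and the summarize argument is any function that turns a string into a shorter string (for example, a call to your endpoint). It illustrates the strategy only, not how LangChain implements it.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def map_reduce_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 500,
    overlap: int = 20,
    max_combined_chars: int = 700,
) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    chunks = split_into_chunks(text, chunk_size, overlap)
    summaries = [summarize(chunk) for chunk in chunks]  # map step
    combined = "\n".join(summaries)
    # If the joined summaries are still too long (and shorter than the input,
    # so the recursion is guaranteed to end), run another map-reduce round.
    if max_combined_chars < len(combined) < len(text):
        return map_reduce_summarize(
            combined, summarize, chunk_size, overlap, max_combined_chars
        )
    return summarize(combined)  # final reduce step
```

Passing a stub such as lambda s: s[:100] for summarize is a quick way to watch how many rounds a long document needs before the combined summaries fit in a single call.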

  1. Let’s install LangChain to begin:
%pip install langchain
  1. Import the relevant modules and break down the long document into chunks:
import langchain
from langchain import SagemakerEndpoint, PromptTemplate
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

text_splitter = RecursiveCharacterTextSplitter(
                    chunk_size = 500,
                    chunk_overlap  = 20,
                    separators = [" "],
                    length_function = len
                )
input_documents = text_splitter.create_documents([text_to_summarize])
  1. To make LangChain work effectively with Falcon, you need to define the default content handler classes for valid input and output:
class ContentHandlerTextSummarization(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

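    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Serialize the prompt and any model kwargs into the JSON body sent to the endpoint
        input_str = json.dumps({"inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # The endpoint returns a list whose first item holds the generated text
        response_json = json.loads(output.read().decode("utf-8"))
        generated_text = response_json[0]['generated_text']
        # Keep only the text after the last "summary:" marker, if one is present
        return generated_text.split("summary:")[-1]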
    
content_handler = ContentHandlerTextSummarization()
  1. You can define custom prompts as PromptTemplate objects, the main vehicle for prompting with LangChain, for the map-reduce summarization approach. This is an optional step because default map and combine prompts are used if the corresponding parameters are not passed in the call that loads the summarization chain (load_summarize_chain).
map_prompt = """Write a concise summary of this text in a few complete sentences:

{text}

Concise summary:"""

map_prompt_template = PromptTemplate(
                        template=map_prompt, 
                        input_variables=["text"]
                      )


combine_prompt = """Combine all these following summaries and generate a final summary of them in a few complete sentences:

{text}

Final summary:"""

combine_prompt_template = PromptTemplate(
                            template=combine_prompt, 
                            input_variables=["text"]
                          )      
  1. LangChain supports LLMs hosted on SageMaker inference endpoints, so instead of using the AWS Python SDK, you can initialize the connection through LangChain for greater accessibility:
summary_model = SagemakerEndpoint(
                    endpoint_name = endpoint_name,
                    region_name= "us-east-1",
                    model_kwargs= {},
                    content_handler=content_handler
                )
  1. Finally, you can load in a summarization chain and run a summary on the input documents using the following code:
summary_chain = load_summarize_chain(llm=summary_model,
                                     chain_type="map_reduce", 
                                     map_prompt=map_prompt_template,
                                     combine_prompt=combine_prompt_template,
                                     verbose=True
                                    ) 
summary = summary_chain({"input_documents": input_documents, 'token_max': 700}, return_only_outputs=True)
print(summary["output_text"])   

Because the verbose parameter is set to True, you’ll see all of the intermediate outputs of the map-reduce approach. This is useful for following the sequence of events to arrive at a final summary. With this map-reduce approach, you can effectively summarize documents much longer than is normally allowed by the model’s maximum input token limit.

Clean up

After you’ve finished using the inference endpoint, it’s important to delete it to avoid incurring unnecessary costs. You can do so with the following lines of code:

# delete_endpoint belongs to the SageMaker client, not the runtime client
client = boto3.client('sagemaker')
client.delete_endpoint(EndpointName=endpoint_name)

Using other foundation models in SageMaker JumpStart

Using other foundation models available in SageMaker JumpStart for document summarization requires minimal overhead to set up and deploy. LLMs sometimes differ in the structure of their input and output formats, so as new models and pre-made solutions are added to SageMaker JumpStart, you may have to make the following code changes, depending on the task implementation:

  • If you are performing summarization via the summarize() function (the approach without LangChain), you may have to change the JSON structure of the payload parameter, as well as the handling of the response variable in the query_endpoint() function.
  • If you are performing summarization via LangChain’s load_summarize_chain() method, you may have to modify the ContentHandlerTextSummarization class, specifically the transform_input() and transform_output() functions, to correctly handle the payload that the LLM expects and the output the LLM returns.

Foundation models vary not only in factors such as inference speed and quality, but also input and output formats. Refer to the LLM’s relevant information page on expected input and output.
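One way to keep those changes small is to isolate the model-specific formats in two functions. The following is a minimal sketch that uses the Falcon request and response shapes shown earlier in this post. The function names build_payload and parse_response are hypothetical, and the formats for any other model are something you would need to take from that model’s documentation.

```python
import io
import json
from typing import Any, Dict


def build_payload(prompt: str, parameters: Dict[str, Any]) -> bytes:
    """Serialize a request in the shape this post uses for Falcon."""
    return json.dumps({"inputs": prompt, "parameters": parameters}).encode("utf-8")


def parse_response(body) -> str:
    """Extract the generated text from a response shaped like Falcon's.

    `body` is any file-like object with a read() method, such as the
    streaming body returned by the SageMaker runtime client.
    """
    predictions = json.loads(body.read())
    return predictions[0]["generated_text"]


# Local check with a fake response body; no endpoint is needed.
fake_body = io.BytesIO(json.dumps([{"generated_text": "A short summary."}]).encode("utf-8"))
print(parse_response(fake_body))
```

With this split, moving to a different model would ideally mean editing only these two functions, while the prompt construction and chunking code stay the same.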

Conclusion

The Falcon 7B Instruct model is available on the SageMaker JumpStart model hub and performs well on a number of use cases. This post demonstrated how you can deploy your own Falcon LLM endpoint into your environment using SageMaker JumpStart and do your first experiments from SageMaker Studio, allowing you to rapidly prototype your models and seamlessly transition to a production environment. With Falcon and LangChain, you can effectively summarize long-form healthcare and life sciences documents at scale.

For more information on working with generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS. You can start experimenting and building document summarization proofs of concept for your healthcare and life science-oriented GenAI applications using the method outlined in this post. When Amazon Bedrock is generally available, we will publish a follow-up post showing how you can implement document summarization using Amazon Bedrock and LangChain.


About the Authors

John Kitaoka is a Solutions Architect at Amazon Web Services. John helps customers design and optimize AI/ML workloads on AWS to help them achieve their business goals.

Josh Famestad is a Solutions Architect at Amazon Web Services. Josh works with public sector customers to build and execute cloud based approaches to deliver on business priorities.

Automate prior authorization using CRD with CDS Hooks and AWS HealthLake

Prior authorization is a crucial process in healthcare that involves the approval of medical treatments or procedures before they are carried out. This process is necessary to ensure that patients receive the right care and that healthcare providers are following the correct procedures. However, prior authorization can be a time-consuming and complex process that requires a lot of paperwork and communication between healthcare providers, insurance companies, and patients.

The prior authorization process for electronic health records (EHRs) consists of five steps:

  1. Determine whether prior authorization is required.
  2. Gather information necessary to support the prior authorization request.
  3. Submit the request for prior authorization.
  4. Monitor the prior authorization request for resolution.
  5. If needed, supplement the prior authorization request with additional required information (and resume at Step 4).

The Da Vinci Burden Reduction project has rearranged these steps for prior authorization into three interrelated implementation guides that are focused on reducing the clinician and payer burden:

  1. Coverage Requirements Discovery (CRD) – This provides decision support to providers at the time they’re ordering diagnostics, specifying treatments, making referrals, scheduling appointments, and so on.
  2. Documentation Templates and Rules (DTR) – This allows providers to download smart questionnaires and rules, such as Clinical Quality Language (CQL), and provides a SMART on FHIR app or EHR app that runs the questionnaires and rules to gather information relevant to a performed or planned service. Running the questionnaires and rules may also be performed by an application that is part of the provider’s EHR.
  3. Prior Authorization Support (PAS) – This allows provider systems to send (and payer systems to receive) prior authorization requests using FHIR, while still meeting regulatory mandates to have X12 278 used, where required, to transport the prior authorization, potentially simplifying processing for either exchange partner (or both).

In this post, we focus on the CRD implementation guide to determine prior authorization requirements and explain how CDS (Clinical Decision Support) Hooks uses AWS HealthLake to determine if prior authorization is required or not.

Solution overview

CRD is a protocol within the electronic prior authorization workflow that facilitates calls between EHRs and the payers using CDS services. When utilized, it provides information on coverage requirements to providers while patient care decisions are in progress. This enables provider staff to make more informed decisions and meet the requirements of their patient’s insurance coverage. Interaction between providers and payers is done seamlessly using CDS Hooks.

CDS Hooks is a Health Level Seven International (HL7) specification. CDS Hooks provides a way to embed additional, near-real-time functionality within a clinician’s workflow of an EHR. With CDS Hooks, eligibility practices like prior authorization can be properly optimized, along with other pre-certification requirements like the physician’s network participation. This function assists providers in making informed decisions by providing them with information on their patient’s condition, treatment options, and the forms that must be completed to facilitate their care. The strategic use of CDS Hooks allows clinicians to quickly develop more patient-centered care plans and assist the prior authorization process by disclosing critical administrative and clinical requirements. For more information on CDS Hooks and its specification, refer to the CDS Hooks website.

The following diagram illustrates how the CRD workflow is automated using HealthLake.

The workflow steps are as follows:

  1. A provider staff member logs into the EHR system to open the patient chart.
  2. The EHR system validates user credentials and invokes the patient-view hook to retrieve patient condition information.
  3. Amazon API Gateway invokes the Patient View Hooks AWS Lambda function.
  4. The Lambda function validates and retrieves the patient ID from the request and gets the patient condition information from HealthLake.
  5. After reviewing the patient condition, the user invokes the order-select hook to retrieve coverage requirements information for the respective drug.
  6. API Gateway invokes the Coverage Requirements Hooks Lambda function.
  7. The Lambda function retrieves claims information for the patient, runs CQL rules based on the medication submitted and claims information retrieved from HealthLake, and determines whether prior authorization is required.

The solution is available in the Determine Coverage Requirements Discovery using CDS Hooks with AWS HealthLake GitHub repo.
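To give a feel for steps 2 through 4 of the workflow, here is a minimal sketch of a Lambda-style handler for the patient-view hook. It is not the code from the repository. The handler name, the in-memory SAMPLE_CONDITIONS lookup (a stand-in for a HealthLake query), and the card wording are all assumptions made for illustration. The request fields (hook, context.patientId) and the card fields (summary, indicator, source.label) follow the CDS Hooks specification.

```python
import json
from typing import Dict, List

# Hypothetical in-memory stand-in for HealthLake; a deployed solution would
# query FHIR Condition resources from the data store instead.
SAMPLE_CONDITIONS: Dict[str, List[str]] = {"patient-123": ["Sinus infection"]}


def fetch_conditions(patient_id: str) -> List[str]:
    """Return condition names for a patient (stubbed for illustration)."""
    return SAMPLE_CONDITIONS.get(patient_id, [])


def patient_view_handler(event: dict, context: object = None) -> dict:
    """Handle a CDS Hooks patient-view request delivered by API Gateway."""
    request = json.loads(event.get("body") or "{}")
    patient_id = request.get("context", {}).get("patientId")

    # Validate the hook name and patient ID before doing any lookup.
    if request.get("hook") != "patient-view" or not patient_id:
        return {
            "statusCode": 400,
            "body": json.dumps(
                {"message": "Expected a patient-view request with context.patientId"}
            ),
        }

    # Build one informational card per condition found for the patient.
    cards = [
        {
            "summary": f"Patient condition: {condition}",
            "indicator": "info",
            "source": {"label": "Example CRD service"},
        }
        for condition in fetch_conditions(patient_id)
    ]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"cards": cards}),
    }
```

The handler can be exercised locally by passing an event whose body is a JSON string containing the hook name and a context object with a patientId.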

Prerequisites

This post assumes familiarity with the following services:

Deploy the application using the AWS SAM CLI

You can deploy the template using the AWS Management Console or the AWS SAM CLI. To use the CLI, complete the following steps:

  1. Install the AWS SAM CLI.
  2. Download the sample code from the AWS samples repository to your local system:
git clone https://github.com/aws-samples/aws-crd-hooks-with-awshealthlake-api
cd aws-crd-hooks-with-awshealthlake-api/
  1. Build the application using AWS SAM:
sam build
  1. Deploy the application using the guided process:
sam deploy --guided
# Replace MY_VALUE with proper resource names
Configuring SAM deploy
======================
Looking for config file [samconfig.toml] : Not found
Setting default arguments for 'sam deploy'
     =========================================
     Stack Name [sam-app]: aws-cds-hooks-with-healthlake
     AWS Region [us-east-1]: us-east-2
     #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
     Confirm changes before deploy [y/N]:
     #SAM needs permission to be able to create roles to connect to the resources in your template
     Allow SAM CLI IAM role creation [Y/n]:
     #Preserves the state of previously provisioned resources when an operation fails
     Disable rollback [y/N]:
     cdsDemoServicesFunction has no authentication. Is this okay? [y/N]: y
     cqlQueryFunction has no authentication. Is this okay? [y/N]: y
     cqlQueryOrderFunction has no authentication. Is this okay? [y/N]: y
     Save arguments to configuration file [Y/n]: y
     SAM configuration file [samconfig.toml]:
     SAM configuration environment [default]:

The deployment may take 30 minutes or more while AWS creates a HealthLake data store and related resources in your AWS account. AWS SAM may time out and return you to your command line. This timeout stops AWS SAM from showing you the progress in the cloud, but doesn’t stop the deployment happening in the cloud. If you see a timeout, go to the AWS CloudFormation console and verify the CloudFormation stack deployment status. Integrate CDS Hooks with your clinical workflow when the CloudFormation stack deployment is complete.

Determine coverage requirements for prior authorization

The solution has two hooks, patient-view and order-select, to determine whether prior authorization is required based on prior authorization rules from the payer. CQL is used to evaluate the prior authorization rules.
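The decision made by the order-select hook can be pictured with a small sketch. The rule below is entirely hypothetical and written in Python rather than CQL: it assumes a payer list of medications that need prior authorization and waives the requirement when the patient already has a claim for the same medication. The medication code, the function names, and the card text are illustrative only. Real payer rules live in the CQL files of the solution.

```python
from typing import Iterable, Set

# Hypothetical payer rule data; a real payer would express this logic in CQL.
PRIOR_AUTH_MEDICATIONS: Set[str] = {"example-drug-a"}


def prior_auth_required(medication_code: str, prior_claim_codes: Iterable[str]) -> bool:
    """Illustrative rule: listed drugs need prior authorization unless already claimed."""
    if medication_code not in PRIOR_AUTH_MEDICATIONS:
        return False
    return medication_code not in set(prior_claim_codes)


def coverage_card(medication_code: str, prior_claim_codes: Iterable[str]) -> dict:
    """Build a CDS Hooks card describing the coverage requirement."""
    required = prior_auth_required(medication_code, prior_claim_codes)
    return {
        "summary": (
            "Prior authorization required"
            if required
            else "No prior authorization required"
        ),
        "indicator": "warning" if required else "info",
        "source": {"label": "Example payer coverage rules"},
    }
```

In this sketch, the medication code would come from the draft order in the hook request and the prior claim codes from the claims data retrieved from HealthLake.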

CDS Hooks can be integrated with any EHR that supports CDS Hooks. Alternatively, if you don’t have an EHR available for testing, you can use the publicly available sandbox as described in the GitHub repo. Note that the CDS Hooks sandbox is being used solely for the purpose of testing.

After your hooks are integrated with the EHR, when a user navigates to the clinical workflow, the patient-view hook is run for the configured patient. Note that the patient ID from the clinical workflow should exist in HealthLake. The cards returned from the API indicate that the patient has a sinus infection health condition and the doctor may need to order a prescription.

You can navigate to the RX View tab to order a prescription. Acting as the doctor, choose the appropriate medication and enter other details as shown in the following screenshot.

The order-select hook returns the prior authorization eligibility card.

The next step is to submit a prior authorization using the SMART app or other mechanisms available to the provider.

Clean up

If you no longer need the AWS resources that you created by running this example, you can remove them by deleting the CloudFormation stack that you deployed:

sam delete --stack-name <<your-stack-name>>

Conclusion

In this post, we showed how HealthLake with CDS Hooks can help reduce the burden on providers and improve the member experience by determining coverage requirements for prior authorization as part of the prescription order clinical workflow. CDS Hooks along with HealthLake can help providers at the time they’re ordering diagnostics, specifying treatments, making referrals, and scheduling appointments.

If you are interested in implementing coverage requirements discovery on AWS using this solution or want to learn more about implementing prior authorization on AWS, you can contact an AWS Representative.


About the Authors

Manish Patel is a Global Partner Solutions Architect supporting Healthcare and Life Sciences at AWS. He has more than 20 years of experience building solutions for Medicare, Medicaid, Payers, Providers and Life Sciences customers. He drives go-to-market strategies along with partners to accelerate solution developments in areas such as Electronic Health Records, Medical Imaging, multi-model data solutions and Generative AI. He is passionate about using technology to transform the healthcare industry to drive better patient care outcomes.

Shravan Vurputoor is a Senior Solutions Architect at AWS. As a trusted customer advocate, he helps organizations understand best practices around advanced cloud-based architectures, and provides advice on strategies to help drive successful business outcomes across a broad set of enterprise customers through his passion for educating, training, designing, and building cloud solutions.

Code Llama code generation models from Meta are now available via Amazon SageMaker JumpStart

Today, we are excited to announce Code Llama foundation models, developed by Meta, are available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. Code Llama is a state-of-the-art large language model (LLM) capable of generating code and natural language about code from both code and natural language prompts. Code Llama is free for research and commercial use. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. In this post, we walk through how to discover and deploy the Code Llama model via SageMaker JumpStart.

What is Code Llama

Code Llama is a model released by Meta that is built on top of Llama 2 and is a state-of-the-art model designed to improve productivity for programming tasks for developers by helping them create high quality, well-documented code. The models show state-of-the-art performance in Python, C++, Java, PHP, C#, TypeScript, and Bash, and have the potential to save developers’ time and make software workflows more efficient. It comes in three variants, engineered to cover a wide variety of applications: the foundational model (Code Llama), a Python specialized model (Code Llama-Python), and an instruction-following model for understanding natural language instructions (Code Llama-Instruct). All Code Llama variants come in three sizes: 7B, 13B, and 34B parameters. The 7B and 13B base and instruct variants support infilling based on surrounding content, making them ideal for code assistant applications.

The models were designed using Llama 2 as the base and then trained on 500 billion tokens of code data, with the Python specialized version trained on an incremental 100 billion tokens. The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens.

The model is made available under the same community license as Llama 2.

What is SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances within a network isolated environment and customize models using SageMaker for model training and deployment.

You can now discover and deploy Code Llama models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping ensure data security. Code Llama models are discoverable and can be deployed in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions.

Customers must accept the EULA to deploy the model via the SageMaker SDK.

Discover models

You can access Code Llama foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

On the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources. You can find Code Llama models in the Foundation Models: Text Generation carousel.

You can also find other model variants by choosing Explore all Text Generation Models or searching for Code Llama.

You can choose the model card to view details about the model such as license, data used to train, and how to use. You will also find two buttons, Deploy and Open Notebook, which will help you use the model.

Deploy

When you choose Deploy and acknowledge the terms, deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, we start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker with the following code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-codellama-7b")
predictor = model.deploy()

This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
   "inputs": "<s>[INST] How do I deploy a model on Amazon SageMaker? [/INST]",
   "parameters": {"max_new_tokens": 512, "temperature": 0.2, "top_p": 0.9}
}
predictor.predict(payload, custom_attributes="accept_eula=true")

Note that by default, accept_eula is set to false. You need to set accept_eula=true to invoke the endpoint successfully. By doing so, you accept the user license agreement and acceptable use policy as mentioned earlier. You can also download the license agreement.

The custom_attributes string used to pass the EULA acceptance consists of key/value pairs. The key and value are separated by = and pairs are separated by ;. If the user passes the same key more than once, the last value is kept and passed to the script handler (in this case, used for conditional logic). For example, if accept_eula=false; accept_eula=true is passed to the server, then accept_eula=true is kept and passed to the script handler.
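The last-value-wins behavior described above can be sketched in a few lines. This is an illustration of the described parsing rules, not the actual server code, and the function name parse_custom_attributes is hypothetical.

```python
from typing import Dict


def parse_custom_attributes(raw: str) -> Dict[str, str]:
    """Parse 'key=value;key=value' pairs; a repeated key keeps its last value."""
    attributes: Dict[str, str] = {}
    for pair in raw.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed segments
        key, value = pair.split("=", 1)
        # Assigning to the same key again overwrites the earlier value.
        attributes[key.strip()] = value.strip()
    return attributes


print(parse_custom_attributes("accept_eula=false; accept_eula=true"))
```

Because a dictionary assignment replaces any earlier entry for the same key, the example string resolves to accept_eula being true.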

Inference parameters control the text generation process at the endpoint. The maximum new tokens control refers to the size of the output generated by the model. Note that this is not the same as the number of words because the vocabulary of the model is not the same as the English language vocabulary, and each token may not be an English language word. Temperature controls the randomness in the output. Higher temperature results in more creative and hallucinated outputs. All the inference parameters are optional.
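To see why temperature affects randomness, consider the common definition in which the model’s raw scores (logits) are divided by the temperature before being turned into probabilities. The following sketch assumes that definition; the exact sampling code used by the endpoint is not shown in this post, and the function name and example logits are made up for illustration.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # sharply peaked on the top token
print(softmax_with_temperature(logits, 2.0))  # flatter, so sampling is more varied
```

A low temperature concentrates probability on the highest-scoring token, and a high temperature spreads it across more tokens, which is why higher values produce more varied output.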

The following table lists all the Code Llama models available in SageMaker JumpStart along with their model IDs, default instance types, and maximum supported tokens (the sum of the number of input tokens and the number of generated tokens for all concurrent requests).

| Model Name | Model ID | Default Instance Type | Max Supported Tokens |
| --- | --- | --- | --- |
| CodeLlama-7b | meta-textgeneration-llama-codellama-7b | ml.g5.2xlarge | 10000 |
| CodeLlama-7b-Instruct | meta-textgeneration-llama-codellama-7b-instruct | ml.g5.2xlarge | 10000 |
| CodeLlama-7b-Python | meta-textgeneration-llama-codellama-7b-python | ml.g5.2xlarge | 10000 |
| CodeLlama-13b | meta-textgeneration-llama-codellama-13b | ml.g5.12xlarge | 32000 |
| CodeLlama-13b-Instruct | meta-textgeneration-llama-codellama-13b-instruct | ml.g5.12xlarge | 32000 |
| CodeLlama-13b-Python | meta-textgeneration-llama-codellama-13b-python | ml.g5.12xlarge | 32000 |
| CodeLlama-34b | meta-textgeneration-llama-codellama-34b | ml.g5.48xlarge | 48000 |
| CodeLlama-34b-Instruct | meta-textgeneration-llama-codellama-34b-instruct | ml.g5.48xlarge | 48000 |
| CodeLlama-34b-Python | meta-textgeneration-llama-codellama-34b-python | ml.g5.48xlarge | 48000 |

While the Code Llama models were trained on a context length of 16,000 tokens, the models have reported good performance on even larger context windows. The maximum supported tokens column in the preceding table is the upper limit on the supported context window on the default instance type. Since the Code Llama 7B model can only support 10,000 tokens on an ml.g5.2xlarge instance, we recommend deploying a 13B or 34B model version if larger contexts are required for your application.
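The sizing guidance above can be turned into a small helper. This sketch copies the base-model rows from the preceding table and returns the smallest model whose default instance type supports the required number of tokens. The helper name smallest_model_for is hypothetical, and the limits apply only to the default instance types listed in the table.

```python
from typing import Optional

# (model ID, default instance type, max supported tokens), copied from the table.
CODE_LLAMA_BASE_MODELS = [
    ("meta-textgeneration-llama-codellama-7b", "ml.g5.2xlarge", 10000),
    ("meta-textgeneration-llama-codellama-13b", "ml.g5.12xlarge", 32000),
    ("meta-textgeneration-llama-codellama-34b", "ml.g5.48xlarge", 48000),
]


def smallest_model_for(required_tokens: int) -> Optional[str]:
    """Return the smallest base model that supports the requested token count."""
    for model_id, _instance_type, max_tokens in CODE_LLAMA_BASE_MODELS:
        if required_tokens <= max_tokens:
            return model_id
    return None  # no default configuration supports this many tokens


print(smallest_model_for(16000))
```

For a 16,000-token requirement, the 7B model’s 10,000-token limit is too small, so the helper moves on to the 13B model.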

By default, all models work for code generation tasks. The base and instruct models both respond to infilling tasks, though the base model had better quality output for the majority of sample queries. Finally, only instruct models work on instruct tasks. The following table illustrates which models had good performance (Good) and moderate performance (Moderate) on example queries in the demo notebooks.

| Model | Code Generation | Code Infilling | Code Instructions |
| --- | --- | --- | --- |
| CodeLlama-7b | Good | Good | N/A |
| CodeLlama-7b-Instruct | Good | Moderate | Good |
| CodeLlama-7b-Python | Good | N/A | N/A |
| CodeLlama-13b | Good | Good | N/A |
| CodeLlama-13b-Instruct | Good | Moderate | Good |
| CodeLlama-13b-Python | Good | N/A | N/A |
| CodeLlama-34b | Good | N/A | N/A |
| CodeLlama-34b-Instruct | Good | N/A | Good |
| CodeLlama-34b-Python | Good | N/A | N/A |

Code generation

The following examples were run on the CodeLlama-34b-Instruct model with payload parameters "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9}:

Input
import socket

def ping_exponential_backoff(host: str):


Output
    """
    Ping a host using exponential backoff.
    """
    for i in range(1, 10):
        try:
            socket.gethostbyname(host)
            return True
        except socket.gaierror:
            time.sleep(2 ** i)
    return False

Input
import argparse

def main(string: str):
    print(string)
    print(string[::-1])

if __name__ == "__main__":


Output
    parser = argparse.ArgumentParser()
    parser.add_argument("string", help="String to be printed")
    args = parser.parse_args()
    main(args.string)

Code infilling

Code infilling involves returning generated code given surrounding context. This differs from the code generation task because, in addition to a prefix code segment, the model is also provided with a code segment suffix. Special tokens were used during fine-tuning to mark the beginning of the prefix (<PRE>), the beginning of the suffix (<SUF>), and the beginning of the middle (<MID>). Input sequences to the model should be in one of the following formats:

  • prefix-suffix-middle – <PRE> {prefix} <SUF>{suffix} <MID>
  • suffix-prefix-middle – <PRE> <SUF>{suffix} <MID> {prefix}
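
As a minimal sketch, the two formats can be assembled with a small helper. The function name `build_infilling_prompt` and its `mode` argument are illustrative assumptions, not part of the Code Llama or SageMaker APIs; the spacing simply mirrors the two templates listed above.

```python
def build_infilling_prompt(prefix: str, suffix: str, mode: str = "psm") -> str:
    """Assemble an infilling prompt from the code before and after the gap.

    mode="psm" gives prefix-suffix-middle, mode="spm" gives suffix-prefix-middle.
    Hypothetical helper; the spacing mirrors the templates in this post.
    """
    if mode == "psm":
        return f"<PRE> {prefix} <SUF>{suffix} <MID>"
    if mode == "spm":
        return f"<PRE> <SUF>{suffix} <MID> {prefix}"
    raise ValueError(f"unknown mode: {mode!r}")


# Example: ask the model to fill in a docstring and function body.
prompt = build_infilling_prompt(
    prefix='def remove_non_ascii(s: str) -> str:\n    """ ',
    suffix="\n    return result\n",
)
print(prompt)
```

The resulting string would be sent as the `inputs` field of the request payload, alongside the generation parameters.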

The following examples use the prefix-suffix-middle format on the CodeLlama-7b model with payload parameters {"max_new_tokens": 256, "temperature": 0.05, "top_p": 0.9}:

Input
<PRE> def remove_non_ascii(s: str) -> str:
    """ <SUF>
    return result
 <MID>


Output
    Remove non-ASCII characters from a string.

    :param s: The string to remove non-ASCII characters from.
    :return: The string with non-ASCII characters removed.
    """
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c

Input
# Installation instructions:
    ```bash
<FILL>
    ```
This downloads the LLaMA inference code and installs the repository as a local pip package.


Output
    git clone https://github.com/LLaMA-AI/LLaMA-inference.git
    cd LLaMA-inference
    pip install -e .

Input
class InterfaceManagerFactory(AbstractManagerFactory):
    def __init__(<FILL>
def main():
    factory = InterfaceManagerFactory(start=datetime.now())
    managers = []
    for i in range(10):
        managers.append(factory.build(id=i))


Output
self, start=None):
        self.start = start

    def build(self, id):
        return InterfaceManager(id, self.start)

Input
/-- A quasi-prefunctoid is 1-connected iff all its etalisations are 1-connected. -/
theorem connected_iff_etalisation [C D : precategoroid] (P : quasi_prefunctoid C D) :
  π₁ P = 0 ↔ <FILL> = 0 :=
begin
  split,
  { intros h f,
    rw pi_1_etalisation at h,
    simp [h],
    refl
  },
  { intro h,
    have := @quasi_adjoint C D P,
    simp [←pi_1_etalisation, this, h],
    refl
  }
end


Output
∀ f : C ⟶ D, π₁ (P.etalise f)

Code instructions

Meta also provided an instruction-tuned variant of Code Llama. Example queries in this section can only be applied to these instruction-tuned Code Llama models, which are the models with an instruct suffix in the model ID. The Code Llama format for instructions is the same as the Llama-2-chat prompt format, which we detail in Llama 2 foundation models are now available in SageMaker JumpStart.

A simple user prompt may look like the following:

<s>[INST] {user_prompt} [/INST]

You may also add a system prompt with the following syntax:

<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_prompt} [/INST]

Finally, you can have a conversational interaction with the model by including all previous user prompts and assistant responses in the input:

<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_prompt_1} [/INST] {assistant_response_1} </s><s>[INST] {user_prompt_2} [/INST]
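
The three templates above can be produced by one helper, sketched here under the assumption that earlier turns are passed as (user prompt, assistant response) pairs. The name `build_instruct_prompt` and its arguments are hypothetical and not part of any SDK; the whitespace follows the templates shown in this section.

```python
from typing import List, Optional, Tuple


def build_instruct_prompt(
    user_prompt: str,
    system_prompt: Optional[str] = None,
    history: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Build a Llama-2-chat style prompt string.

    history holds earlier (user_prompt, assistant_response) pairs.
    Hypothetical helper; whitespace mirrors the templates in this post.
    """
    history = history or []
    users = [user for user, _ in history] + [user_prompt]
    responses = [response for _, response in history]

    # The system prompt, if any, is wrapped into the first user message.
    if system_prompt is not None:
        users[0] = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{users[0]}"

    prompt = ""
    for i, user in enumerate(users):
        prompt += f"<s>[INST] {user} [/INST]"
        # Completed turns are closed with the assistant response and </s>.
        if i < len(responses):
            prompt += f" {responses[i]} </s>"
    return prompt


print(build_instruct_prompt(
    "Now do the same in Rust.",
    system_prompt="Provide answers in JavaScript",
    history=[("Write a hello world program.", "console.log('hello world');")],
))
```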

These examples were run on the CodeLlama-13b-Instruct model with payload parameters "parameters": {"max_new_tokens": 512, "temperature": 0.2, "top_p": 0.9}:

Input
<s>[INST] In Bash, how do I list all text files in the current directory (excluding subdirectories) that have been modified in the last month? [/INST] 

Output
 You can use the `find` command in Bash to list all text files in the current directory (excluding subdirectories) that have been modified in the last month. Here's an example command:
```
find . -type f -name "*.txt" -mtime -30
```
Here's a breakdown of the command:

* `find .`: This tells `find` to search the current directory and all its subdirectories.
* `-type f`: This tells `find` to only search for files (not directories).
* `-name "*.txt"`: This tells `find` to only search for files with the `.txt` extension.
* `-mtime -30`: This tells `find` to only search for files that have been modified in the last 30 days (2,592,000 seconds).

The output of the command will be a list of all text files in the current directory (excluding subdirectories) that have been modified in the last month.

Input
<s>[INST] What is the difference between inorder and preorder traversal? Give an example in Python. [/INST] 

Output
 Inorder and preorder traversal are two different ways of traversing a binary tree.

Inorder traversal is a way of visiting the nodes of a binary tree in a specific order. In an inorder traversal, the left subtree is visited first, then the current node, and finally the right subtree is visited. This means that the nodes are visited in a left-to-right, top-to-bottom order.

Preorder traversal, on the other hand, is a way of visiting the nodes of a binary tree in a different order. In a preorder traversal, the current node is visited first, followed by the left subtree, and finally the right subtree. This means that the nodes are visited in a top-down, left-to-right order.

Here is an example of an inorder traversal of a binary tree in Python:
```
def inorder_traversal(root):
    if root is None:
        return []
    return inorder_traversal(root.left) + [root.val] + inorder_traversal(root.right)
```
And here is an example of a preorder traversal of a binary tree in Python:
```
def preorder_traversal(root):
    if root is None:
        return []
    return [root.val] + preorder_traversal(root.left) + preorder_traversal(root.right)
```
In this example, the `inorder_traversal` function returns a list of the values in the tree in the order of the keys, from smallest to largest. The `preorder_traversal` function returns a list of the values in the tree in the order of the keys, but with the current node visited before its children.

Input
<s>[INST] <<SYS>>
Provide answers in JavaScript
<</SYS>>

Write a function that computes the set of sums of all contiguous sublists of a given list. [/INST] 

Output
 function contiguousSublistSums(list) {
  let sums = [];
  for (let i = 0; i < list.length; i++) {
    let sum = 0;
    for (let j = i; j < list.length; j++) {
      sum += list[j];
      sums.push(sum);
    }
  }
  return sums;
}

console.log(contiguousSublistSums([1, 2, 3, 4, 5])); // [1, 3, 6, 10, 15]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6])); // [1, 3, 6, 10, 15, 21]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6, 7])); // [1, 3, 6, 10, 15, 21, 28]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6, 7, 8])); // [1, 3, 6, 10, 15, 21, 28, 36]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6, 7, 8, 9])); // [1, 3, 6, 10, 15, 21, 28, 36, 45]

Clean up

After you’re done running the notebook, make sure to delete all resources that you created in the process so your billing is stopped. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Code Llama models in SageMaker Studio and deploy the model for generating code and natural language about code from both code and natural language prompts. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.


About the authors


Gabriel Synnaeve is a Research Director on the Facebook AI Research (FAIR) team at Meta. Prior to Meta, Gabriel was a postdoctoral fellow in Emmanuel Dupoux’s team at École Normale Supérieure in Paris, working on reverse-engineering the acquisition of language in babies. Gabriel received his PhD in Bayesian modeling applied to real-time strategy games AI from the University of Grenoble.

Eissa Jamil is a Partner Engineer RL, Generative AI at Meta.

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Vivek Singh is a product manager with SageMaker JumpStart. He focuses on enabling customers to onboard SageMaker JumpStart to simplify and accelerate their ML journey to build Generative AI applications.


Build an end-to-end MLOps pipeline for visual quality inspection at the edge – Part 1

A successful deployment of a machine learning (ML) model in a production environment heavily relies on an end-to-end ML pipeline. Although developing such a pipeline can be challenging, it becomes even more complex when dealing with an edge ML use case. Machine learning at the edge is a concept that brings the capability of running ML models locally to edge devices. In order to deploy, monitor, and maintain these models at the edge, a robust MLOps pipeline is required. An MLOps pipeline allows you to automate the full ML lifecycle from data labeling to model training and deployment.

Implementing an MLOps pipeline at the edge introduces additional complexities that make the automation, integration, and maintenance processes more challenging due to the increased operational overhead involved. However, using purpose-built services like Amazon SageMaker and AWS IoT Greengrass allows you to significantly reduce this effort. In this series, we walk you through the process of architecting and building an integrated end-to-end MLOps pipeline for a computer vision use case at the edge using SageMaker, AWS IoT Greengrass, and the AWS Cloud Development Kit (AWS CDK).

This post focuses on designing the overall MLOps pipeline architecture; Part 2 and Part 3 of this series focus on the implementation of the individual components. We have provided a sample implementation in the accompanying GitHub repository for you to try yourself. If you’re just getting started with MLOps at the edge on AWS, refer to MLOps at the edge with Amazon SageMaker Edge Manager and AWS IoT Greengrass for an overview and reference architecture.

Use case: Inspecting the quality of metal tags

As an ML engineer, it’s important to understand the business case you are working on. So before we dive into the MLOps pipeline architecture, let’s look at the sample use case for this post. Imagine a production line of a manufacturer that engraves metal tags to create customized luggage tags. The quality assurance process is costly because the raw metal tags need to be inspected manually for defects like scratches. To make this process more efficient, we use ML to detect faulty tags early in the process. This helps avoid costly defects at later stages of the production process. The model should identify possible defects like scratches in near-real time and mark them. In manufacturing shop floor environments, you often have to deal with no connectivity or constrained bandwidth and increased latency. Therefore, we want to implement an on-edge ML solution for visual quality inspection that can run inference locally on the shop floor and reduce the connectivity requirements. To keep our example straightforward, we train a model that marks detected scratches with bounding boxes. The following image is an example of a tag from our dataset with three scratches marked.

Metal tag with scratches

Defining the pipeline architecture

We have now gained clarity into our use case and the specific ML problem we aim to address, which revolves around object detection at the edge. Now it’s time to draft an architecture for our MLOps pipeline. At this stage, we aren’t looking at technologies or specific services yet, but rather the high-level components of our pipeline. In order to quickly retrain and deploy, we need to automate the whole end-to-end process: from data labeling, to training, to inference. However, there are a few challenges that make setting up a pipeline for an edge case particularly hard:

  • Building different parts of this process requires different skill sets. For instance, data labeling and training has a strong data science focus, edge deployment requires an Internet of Things (IoT) specialist, and automating the whole process is usually done by someone with a DevOps skill set.
  • Depending on your organization, this whole process might even be implemented by multiple teams. For our use case, we’re working under the assumption that separate teams are responsible for labeling, training, and deployment.
  • More roles and skill sets mean different requirements when it comes to tooling and processes. For instance, data scientists might want to monitor and work with their familiar notebook environment. MLOps engineers want to work using infrastructure as code (IaC) tools and might be more familiar with the AWS Management Console.

What does this mean for our pipeline architecture?

Firstly, it’s crucial to clearly define the major components of the end-to-end system so that different teams can work independently. Secondly, the interfaces between teams must be well defined to make collaboration efficient. These interfaces help minimize disruptions between teams, enabling them to modify their internal processes as needed as long as they adhere to the defined interfaces. The following diagram illustrates what this could look like for our computer vision pipeline.

MLOps pipeline scribble

Let’s examine the overall architecture of the MLOps pipeline in detail:

  1. The process begins with a collection of raw images of metal tags, which are captured using an edge camera device in the production environment to form an initial training dataset.
  2. The next step involves labeling these images and marking defects using bounding boxes. It’s essential to version the labeled dataset, ensuring traceability and accountability for the utilized training data.
  3. After we have a labeled dataset, we can proceed with training, fine-tuning, evaluating, and versioning our model.
  4. When we’re happy with our model performance, we can deploy the model to an edge device and run live inferences at the edge.
  5. While the model operates in production, the edge camera device generates valuable image data containing previously unseen defects and edge cases. We can use this data to further enhance our model’s performance. To accomplish this, we save images for which the model’s predictions are erroneous or have low confidence. These images are then added back to our raw dataset, initiating the entire process again.
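
The feedback step can be illustrated with a short sketch that flags images whose detections fall below a confidence threshold. The `Detection` class, the `needs_review` function, and the 0.5 threshold are assumptions made for illustration; they are not taken from the sample implementation, and detecting erroneous predictions would additionally require human feedback.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Detection:
    label: str
    confidence: float  # model score in [0, 1]


def needs_review(detections: List[Detection], threshold: float = 0.5) -> bool:
    """Return True if any detection scores below the (assumed) threshold."""
    return any(d.confidence < threshold for d in detections)


# Hypothetical inference results keyed by image name.
batch: Dict[str, List[Detection]] = {
    "tag_001.jpg": [Detection("scratch", 0.93)],
    "tag_002.jpg": [Detection("scratch", 0.91), Detection("scratch", 0.32)],
    "tag_003.jpg": [],
}

# Images selected here would be copied back to the raw dataset for relabeling.
to_relabel = [name for name, dets in batch.items() if needs_review(dets)]
print(to_relabel)
```

In this sketch only `tag_002.jpg` is selected, because it contains one detection below the threshold.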

It’s important to note that the raw image data, labeled dataset, and trained model serve as well-defined interfaces between the distinct pipelines. MLOps engineers and data scientists have the flexibility to choose the technologies within their pipelines as long as they consistently produce these artifacts. Most significantly, we have established a closed feedback loop. Faulty or low-confidence predictions made in production can be used to regularly augment our dataset and automatically retrain and enhance the model.

Target architecture

Now that the high-level architecture is established, it’s time to go one level deeper and look at how we could build this with AWS services. Note that the architecture shown in this post assumes you want to take full control of the whole data science process. However, if you’re just getting started with quality inspection at the edge, we recommend Amazon Lookout for Vision. It provides a way to train your own quality inspection model without having to build, maintain, or understand ML code. For more information, refer to Amazon Lookout for Vision now supports visual inspection of product defects at the edge.

However, if you want to take full control, the following diagram shows what an architecture could look like.

MLOps pipeline architecture

Similar to before, let’s walk through the workflow step by step and identify which AWS services suit our requirements:

  1. Amazon Simple Storage Service (Amazon S3) is used to store raw image data because it provides us with a low-cost storage solution.
  2. The labeling workflow is orchestrated using AWS Step Functions, a serverless workflow engine that makes it easy to orchestrate the steps of the labeling workflow. As part of this workflow, we use Amazon SageMaker Ground Truth to fully automate the labeling using labeling jobs and managed human workforces. AWS Lambda is used to prepare the data, start the labeling jobs, and store the labels in Amazon SageMaker Feature Store.
  3. SageMaker Feature Store stores the labels. It allows us to centrally manage and share our features and provides us with built-in data versioning capabilities, which makes our pipeline more robust.
  4. We orchestrate the model building and training pipeline using Amazon SageMaker Pipelines. It integrates with the other SageMaker features required via built-in steps. SageMaker Training jobs are used to automate the model training, and SageMaker Processing jobs are used to prepare the data and evaluate model performance. In this example, we’re using the Ultralytics YOLOv8 Python package and model architecture to train and export an object detection model to the ONNX ML model format for portability.
  5. If the performance is acceptable, the trained model is registered in Amazon SageMaker Model Registry with an incremental version number attached. It acts as our interface between the model training and edge deployment steps. We also manage the approval state of models here. Similar to the other services used, it’s fully managed, so we don’t have to take care of running our own infrastructure.
  6. The edge deployment workflow is automated using Step Functions, similar to the labeling workflow. We can use the API integrations of Step Functions to easily call the various required AWS service APIs like AWS IoT Greengrass to create new model components and afterwards deploy the components to the edge device.
  7. AWS IoT Greengrass is used as the edge device runtime environment. It manages the deployment lifecycle for our model and inference components at the edge. It allows us to easily deploy new versions of our model and inference components using simple API calls. In addition, ML models at the edge usually don’t run in isolation; we can use the various AWS and community provided components of AWS IoT Greengrass to connect to other services.

The architecture outlined resembles our high-level architecture shown before. Amazon S3, SageMaker Feature Store, and SageMaker Model Registry act as the interfaces between the different pipelines. To minimize the effort to run and operate the solution, we use managed and serverless services wherever possible.

Merging into a robust CI/CD system

The data labeling, model training, and edge deployment steps are core to our solution. As such, any change related to the underlying code or data in any of those parts should trigger a new run of the whole orchestration process. To achieve this, we need to integrate this pipeline into a CI/CD system that allows us to automatically deploy code and infrastructure changes from a versioned code repository into production. Similar to the previous architecture, team autonomy is an important aspect here. The following diagram shows what this could look like using AWS services.

CI/CD pipeline

Let’s walk through the CI/CD architecture:

  1. AWS CodeCommit acts as our Git repository. For the sake of simplicity, in our provided sample, we separated the distinct parts (labeling, model training, edge deployment) via subfolders in a single git repository. In a real-world scenario, each team might use different repositories for each part.
  2. Infrastructure deployment is automated using the AWS CDK and each part (labeling, training, and edge) gets its own AWS CDK app to allow independent deployments.
  3. The AWS CDK pipeline feature uses AWS CodePipeline to automate the infrastructure and code deployments.
  4. The AWS CDK deploys two code pipelines for each step: an asset pipeline and a workflow pipeline. We separated the workflow from the asset deployment to allow us to start the workflows separately in case there are no asset changes (for example, when there are new images available for training).
    • The asset code pipeline deploys all infrastructure required for the workflow to run successfully, such as AWS Identity and Access Management (IAM) roles, Lambda functions, and container images used during training.
    • The workflow code pipeline runs the actual labeling, training, or edge deployment workflow.
  5. Asset pipelines are automatically triggered on commit as well as when a previous workflow pipeline is complete.
  6. The whole process is triggered on a schedule using an Amazon EventBridge rule for regular retraining.

With the CI/CD integration, the whole end-to-end chain is now fully automated. The pipeline is triggered whenever code changes in our git repository as well as on a schedule to account for data changes.

Thinking ahead

The solution architecture described represents the basic components to build an end-to-end MLOps pipeline at the edge. However, depending on your requirements, you might think about adding additional functionality. The following are some examples:

Conclusion

In this post, we outlined our architecture for building an end-to-end MLOps pipeline for visual quality inspection at the edge using AWS services. This architecture streamlines the entire process, encompassing data labeling, model development, and edge deployment, enabling us to swiftly and reliably train and implement new versions of the model. With serverless and managed services, we can direct our focus towards delivering business value rather than managing infrastructure.

In Part 2 of this series, we will delve one level deeper and look at the implementation of this architecture in more detail, specifically labeling and model building. If you want to jump straight to the code, you can check out the accompanying GitHub repo.


About the authors

Michael Roth is a Senior Solutions Architect at AWS supporting Manufacturing customers in Germany to solve their business challenges through AWS technology. Besides work and family he’s interested in sports cars and enjoys Italian coffee.

Jörg Wöhrle is a Solutions Architect at AWS, working with manufacturing customers in Germany. With a passion for automation, Joerg has worked as a software developer, DevOps engineer, and Site Reliability Engineer in his pre-AWS life. Beyond cloud, he’s an ambitious runner and enjoys quality time with his family. So if you have a DevOps challenge or want to go for a run: let him know.

Johannes Langer is a Senior Solutions Architect at AWS, working with enterprise customers in Germany. Johannes is passionate about applying machine learning to solve real business problems. In his personal life, Johannes enjoys working on home improvement projects and spending time outdoors with his family.


Build an end-to-end MLOps pipeline for visual quality inspection at the edge – Part 2

In Part 1 of this series, we drafted an architecture for an end-to-end MLOps pipeline for a visual quality inspection use case at the edge. It is architected to automate the entire machine learning (ML) process, from data labeling to model training and deployment at the edge. The focus on managed and serverless services reduces the need to operate infrastructure for your pipeline and allows you to get started quickly.

In this post, we delve deep into how the labeling and model building and training parts of the pipeline are implemented. If you’re particularly interested in the edge deployment aspect of the architecture, you can skip ahead to Part 3. We also provide an accompanying GitHub repo if you want to deploy and try this yourself.

Solution overview

The sample use case used for this series is a visual quality inspection solution that can detect defects on metal tags, which could be deployed as part of a manufacturing process. The following diagram shows the high-level architecture of the MLOps pipeline we defined in the beginning of this series. If you haven’t read it yet, we recommend checking out Part 1.

Architecture diagram

Automating data labeling

Data labeling is an inherently labor-intensive task that requires humans (labelers) to label the data. Labeling for our use case means inspecting an image and drawing bounding boxes for each defect that is visible. This may sound straightforward, but we need to take care of a number of things in order to automate this:

  • Provide a tool for labelers to draw bounding boxes
  • Manage a workforce of labelers
  • Ensure good label quality
  • Manage and version our data and labels
  • Orchestrate the whole process
  • Integrate it into the CI/CD system

We can do all of this with AWS services. To facilitate the labeling and manage our workforce, we use Amazon SageMaker Ground Truth, a data labeling service that allows you to build and manage your own data labeling workflows and workforce. You can manage your own private workforce of labelers, or use the power of external labelers via Amazon Mechanical Turk or third-party providers.

On top of that, the whole process can be configured and managed via the AWS SDK, which is what we use to orchestrate our labeling workflow as part of our CI/CD pipeline.

Labeling jobs are used to manage labeling workflows. SageMaker Ground Truth provides out-of-the-box templates for many different labeling task types, including drawing bounding boxes. For more details on how to set up a labeling job for bounding box tasks, check out Streamlining data labeling for YOLO object detection in Amazon SageMaker Ground Truth. For our use case, we adapt the task template for bounding box tasks and use human annotators provided by Mechanical Turk to label our images by default. The following screenshot shows what a labeler sees when working on an image.

Labeling UI

Let’s talk about label quality next. The quality of our labels will affect the quality of our ML model. When automating the image labeling with an external human workforce like Mechanical Turk, it’s challenging to ensure a good and consistent label quality due to the lack of domain expertise. Sometimes a private workforce of domain experts is required. In our sample solution, however, we use Mechanical Turk to implement automated labeling of our images.

There are many ways to ensure good label quality. For more information about best practices, refer to the AWS re:Invent 2019 talk, Build accurate training datasets with Amazon SageMaker Ground Truth. As part of this sample solution, we decided to focus on the following:

Finally, we need to think about how to store our labels so they can be reused for training later and enable traceability of used model training data. The output of a SageMaker Ground Truth labeling job is a file in JSON-lines format containing the labels and additional metadata. We decided to use the offline store of Amazon SageMaker Feature Store to store our labels. Compared to simply storing the labels on Amazon Simple Storage Service (Amazon S3), it provides us with a few distinct advantages:

  • It stores a complete history of feature values, combined with point-in-time queries. This allows us to easily version our dataset and ensure traceability.
  • As a central feature store, it promotes reusability and visibility of our data.

For an introduction to SageMaker Feature Store, refer to Getting started with Amazon SageMaker Feature Store. SageMaker Feature Store supports storing features in tabular format. In our example, we store the following features for each labeled image:

  • The location where the image is stored on Amazon S3
  • Image dimensions
  • The bounding box coordinates and class values
  • A status flag indicating whether the label has been approved for use in training
  • The labeling job name used to create the label
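
To illustrate how one line of the labeling output could be flattened into such a tabular record, here is a minimal sketch. It assumes the manifest line follows the bounding box layout of SageMaker Ground Truth output (a `source-ref` key plus a label attribute holding `image_size` and `annotations`, with the job name under the matching `-metadata` key). The feature names, the `scratch-labels` attribute name, and the function name are hypothetical and may differ from the sample repository.

```python
import json
from typing import Any, Dict


def manifest_line_to_record(
    line: str, label_attribute: str, approved: bool = False
) -> Dict[str, Any]:
    """Flatten one JSON-lines manifest entry into a tabular feature record."""
    entry = json.loads(line)
    label = entry[label_attribute]
    metadata = entry[f"{label_attribute}-metadata"]
    size = label["image_size"][0]
    return {
        "source_ref": entry["source-ref"],           # image location on Amazon S3
        "image_width": size["width"],
        "image_height": size["height"],
        # Nested boxes are serialized so the record stays flat (tabular).
        "annotations": json.dumps(label["annotations"]),
        "approved": approved,                         # status flag for training use
        "labeling_job": metadata["job-name"],
    }


# Hypothetical manifest line with a single scratch annotation.
sample = json.dumps({
    "source-ref": "s3://example-bucket/images/tag_001.jpg",
    "scratch-labels": {
        "image_size": [{"width": 640, "height": 480, "depth": 3}],
        "annotations": [
            {"class_id": 0, "left": 100, "top": 50, "width": 40, "height": 20}
        ],
    },
    "scratch-labels-metadata": {"job-name": "labeling-job/example-job"},
})

record = manifest_line_to_record(sample, "scratch-labels")
print(record)
```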

The following screenshot shows what a typical entry in the feature store might look like.

Feature store

With this format, we can easily query the feature store and work with familiar tools like Pandas to construct a dataset to be used for training later.

Orchestrating data labeling

Finally, it’s time to automate and orchestrate each of the steps of our labeling pipeline! For this we use AWS Step Functions, a serverless workflow service that provides us with API integrations to quickly orchestrate and visualize the steps in our workflow. We also use a set of AWS Lambda functions for some of the more complex steps, specifically the following:

  • Check if there are new images that require labeling in Amazon S3
  • Prepare the data in the required input format and start the labeling job
  • Prepare the data in the required input format and start the label verification job
  • Write the final set of labels to the feature store
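
The first of these steps boils down to comparing the image keys in the bucket with the keys that already have labels. The following sketch shows only that comparison; in a real Lambda function the two inputs would come from an Amazon S3 listing and a feature store query, which are left out here. The function name and the accepted file suffixes are assumptions.

```python
from typing import Iterable, List, Set

# Assumed set of image file suffixes; adjust to the actual dataset.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png")


def find_unlabeled_images(bucket_keys: Iterable[str], labeled_keys: Set[str]) -> List[str]:
    """Return image keys that have no label yet, in a stable sorted order."""
    return sorted(
        key
        for key in bucket_keys
        if key.lower().endswith(IMAGE_SUFFIXES) and key not in labeled_keys
    )


# Hypothetical inputs standing in for the S3 listing and the existing labels.
bucket_keys = ["images/tag_003.jpg", "images/tag_001.jpg", "images/readme.txt", "images/tag_002.JPG"]
labeled_keys = {"images/tag_001.jpg"}

new_images = find_unlabeled_images(bucket_keys, labeled_keys)
print(new_images)
```

If the returned list is empty, the workflow can end early instead of starting a labeling job.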

The following figure shows what the full Step Functions labeling state machine looks like.

Labeling StepFunction

Labeling: Infrastructure deployment and integration into CI/CD

The final step is to integrate the Step Functions workflow into our CI/CD system and ensure that we deploy the required infrastructure. To accomplish this task, we use the AWS Cloud Development Kit (AWS CDK) to create all of the required infrastructure, like the Lambda functions and Step Functions workflow. With CDK Pipelines, a module of AWS CDK, we create a pipeline in AWS CodePipeline that deploys changes to our infrastructure and triggers an additional pipeline to start the Step Functions workflow. The Step Functions integration in CodePipeline makes this task very easy. We use Amazon EventBridge and CodePipeline Source actions to make sure that the pipeline is triggered on a schedule as well as when changes are pushed to git.

The following diagram shows what the CI/CD architecture for labeling looks like in detail.

Labeling CDK

Recap automating data labeling

We now have a working pipeline to automatically create labels from unlabeled images of metal tags using SageMaker Ground Truth. The images are picked up from Amazon S3 and fed into a SageMaker Ground Truth labeling job. After the images are labeled, we do a quality check using a label verification job. Finally, the labels are stored in a feature group in SageMaker Feature Store. If you want to try the working example yourself, check out the accompanying GitHub repository. Let’s look at how to automate model building next!

Automating model building

Similar to labeling, let’s have an in-depth look at our model building pipeline. At a minimum, we need to orchestrate the following steps:

  • Pull the latest features from the feature store
  • Prepare the data for model training
  • Train the model
  • Evaluate model performance
  • Version and store the model
  • Approve the model for deployment if performance is acceptable

The model building process is usually driven by a data scientist and is the outcome of a set of experiments done using notebooks or Python code. We can follow a simple three-step process to convert an experiment to a fully automated MLOps pipeline:

  1. Convert existing preprocessing, training, and evaluation code to command line scripts.
  2. Create a SageMaker pipeline definition to orchestrate model building. Use the scripts created in step one as part of the processing and training steps.
  3. Integrate the pipeline into your CI/CD workflow.

This three-step process is generic and can be used for any model architecture and ML framework of your choice. Let’s follow it and start with Step 1 to create the following scripts:

  • preprocess.py – This pulls labeled images from SageMaker Feature Store, splits the dataset, and transforms it into the required format for training our model, in our case the input format for YOLOv8
  • train.py – This trains an Ultralytics YOLOv8 object detection model using PyTorch to detect scratches on images of metal tags
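
One part of this preprocessing is converting pixel-based bounding boxes into the YOLO label format, where each line holds a class ID followed by the box center, width, and height, all normalized by the image dimensions. The sketch below assumes boxes are given as `left`, `top`, `width`, and `height` in pixels; the function name is hypothetical and the code is not taken from the sample repository.

```python
from typing import Dict


def to_yolo_line(box: Dict[str, float], image_width: int, image_height: int) -> str:
    """Convert a pixel box (left, top, width, height) to a YOLO label line."""
    x_center = (box["left"] + box["width"] / 2) / image_width
    y_center = (box["top"] + box["height"] / 2) / image_height
    width = box["width"] / image_width
    height = box["height"] / image_height
    return f"{box['class_id']} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Example: one scratch on a 640x480 image.
box = {"class_id": 0, "left": 100, "top": 50, "width": 40, "height": 20}
print(to_yolo_line(box, image_width=640, image_height=480))
# prints "0 0.187500 0.125000 0.062500 0.041667"
```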

Orchestrating model building

In Step 2, we bundle these scripts up into training and processing jobs and define the final SageMaker pipeline, which looks like the following figure.

SageMaker Pipeline

It consists of the following steps:

  1. A ProcessingStep to load the latest features from SageMaker Feature Store; split the dataset into training, validation, and test sets; and store the datasets as tarballs for training.
  2. A TrainingStep to train the model using the training, validation, and test datasets and export the mean Average Precision (mAP) metric for the model.
  3. A ConditionStep to evaluate if the mAP metric value of the trained model is above a configured threshold. If so, a RegisterModel step is run that registers the trained model in the SageMaker Model Registry.
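The gating logic behind the condition in step 3 can be sketched in plain Python. This is not the SageMaker Pipelines API; it only illustrates the decision the pipeline makes. The JSON layout, the `mAP` key, and the function name are assumptions for this example, and the strict comparison follows the "above a configured threshold" wording.

```python
import json


def should_register(metrics_json: str, threshold: float, metric_name: str = "mAP") -> bool:
    """Return True if the reported metric is strictly above the threshold."""
    metrics = json.loads(metrics_json)
    return float(metrics[metric_name]) > threshold


# Example: a training job that reported a mAP of 0.91 passes a 0.8 threshold.
report = json.dumps({"mAP": 0.91})
register = should_register(report, threshold=0.8)
```

In the actual pipeline, the registration step only runs on the branch where this condition holds, so a model that misses the threshold never reaches the model registry.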

If you are interested in the detailed pipeline code, check out the pipeline definition in our sample repository.

Training: Infrastructure deployment and integration into CI/CD

Now it’s time for Step 3: integration into the CI/CD workflow. Our CI/CD pipeline follows the same pattern illustrated earlier in the labeling section. We use the AWS CDK to deploy the required pipelines from CodePipeline. The only difference is that we use Amazon SageMaker Pipelines instead of Step Functions. The SageMaker pipeline definition is constructed and triggered as part of a CodeBuild action in CodePipeline.

Training CDK

Conclusion

We now have a fully automated labeling and model training workflow using SageMaker. We started by creating command line scripts from the experiment code. Then we used SageMaker Pipelines to orchestrate each of the model training workflow steps. The command line scripts were integrated as part of the training and processing steps. At the end of the pipeline, the trained model is versioned and registered in SageMaker Model Registry.

Check out Part 3 of this series, where we will take a closer look at the final step of our MLOps workflow. We will create the pipeline that compiles and deploys the model to an edge device using AWS IoT Greengrass!


About the authors

Michael Roth is a Senior Solutions Architect at AWS supporting Manufacturing customers in Germany to solve their business challenges through AWS technology. Besides work and family, he’s interested in sports cars and enjoys Italian coffee.

Jörg Wöhrle is a Solutions Architect at AWS, working with manufacturing customers in Germany. With a passion for automation, Jörg has worked as a software developer, DevOps engineer, and Site Reliability Engineer in his pre-AWS life. Beyond cloud, he’s an ambitious runner and enjoys quality time with his family. So if you have a DevOps challenge or want to go for a run: let him know.

Johannes Langer is a Senior Solutions Architect at AWS, working with enterprise customers in Germany. Johannes is passionate about applying machine learning to solve real business problems. In his personal life, Johannes enjoys working on home improvement projects and spending time outdoors with his family.


Build an end-to-end MLOps pipeline for visual quality inspection at the edge – Part 3

This is Part 3 of our series where we design and implement an MLOps pipeline for visual quality inspection at the edge. In this post, we focus on how to automate the edge deployment part of the end-to-end MLOps pipeline. We show you how to use AWS IoT Greengrass to manage model inference at the edge and how to automate the process using AWS Step Functions and other AWS services.

Solution overview

In Part 1 of this series, we laid out an architecture for our end-to-end MLOps pipeline that automates the entire machine learning (ML) process, from data labeling to model training and deployment at the edge. In Part 2, we showed how to automate the labeling and model training parts of the pipeline.

The sample use case used for this series is a visual quality inspection solution that can detect defects on metal tags, which you can deploy as part of a manufacturing process. The following diagram shows the high-level architecture of the MLOps pipeline we defined in the beginning of this series. If you haven’t read it yet, we recommend checking out Part 1.

Architecture diagram

Automating the edge deployment of an ML model

After an ML model has been trained and evaluated, it needs to be deployed to a production system to generate business value by making predictions on incoming data. This process can quickly become complex in an edge setting where models need to be deployed and run on devices that are often located far away from the cloud environment in which the models have been trained. The following are some of the challenges unique to machine learning at the edge:

  • ML models often need to be optimized due to resource constraints on edge devices
  • Edge devices can’t be redeployed or even replaced like a server in the cloud, so you need a robust model deployment and device management process
  • Communication between devices and the cloud needs to be efficient and secure because it often traverses untrusted low-bandwidth networks

Let’s see how we can tackle these challenges with AWS services, combined with exporting the model in the ONNX format. ONNX allows us to, for example, apply optimizations like quantization to reduce the model size for constrained devices. ONNX also provides optimized runtimes for the most common edge hardware platforms.
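To make the idea of quantization concrete, here is a minimal, library-free sketch of 8-bit affine quantization applied to a handful of weights. It is an illustration of the technique only, not the ONNX tooling used in the pipeline, and the function names are made up for this example. Storing each weight as one byte instead of a 4-byte float32 is where the roughly fourfold size reduction comes from.

```python
def quantize_uint8(weights):
    """Affine-quantize floats to 8-bit unsigned ints.

    Returns (quantized_values, scale, zero_point).
    """
    lo, hi = min(weights), max(weights)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # make sure 0.0 is exactly representable
    scale = (hi - lo) / 255 or 1.0       # fall back to 1.0 if all weights are 0
    zero_point = round(-lo / scale)
    quantized = [
        min(255, max(0, round(w / scale) + zero_point))  # clamp to the uint8 range
        for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map 8-bit values back to approximate floats."""
    return [(v - zero_point) * scale for v in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The reconstruction error per weight stays on the order of one quantization step (`scale`), which is why quantization can shrink a model considerably while often changing its predictions only slightly. Whether that holds for a given model has to be validated on its own evaluation data.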

Breaking the edge deployment process down, we require two components:

  • A deployment mechanism for the model delivery, which includes the model itself and some business logic to manage and interact with the model
  • A workflow engine that can orchestrate the whole process to make this robust and repeatable

In this example, we use different AWS services to build our automated edge deployment mechanism, which integrates all the required components we discussed.

First, we simulate an edge device. To make it straightforward for you to go through the end-to-end workflow, we use an Amazon Elastic Compute Cloud (Amazon EC2) instance to simulate an edge device by installing the AWS IoT Greengrass Core software on the instance. You can also use EC2 instances to validate the different components in a QA process before deploying to an actual edge production device. AWS IoT Greengrass is an Internet of Things (IoT) open-source edge runtime and cloud service that helps you build, deploy, and manage edge device software in a secure and scalable way. After you install the AWS IoT Greengrass Core software on your device, you can add or remove features and components, and manage your IoT device applications using AWS IoT Greengrass. It offers many built-in components, such as the StreamManager and MQTT broker components, which you can use to securely communicate with the cloud, supporting end-to-end encryption. You can use those features to upload inference results and images efficiently.

In a production environment, you would typically have an industrial camera delivering images for which the ML model should produce predictions. For our setup, we simulate this image input by uploading a preset of images into a specific directory on the edge device. We then use these images as inference input for the model.
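A simulated camera input like this can be as simple as listing the image files in the input directory. The following sketch assumes a hypothetical directory layout and a fixed set of file extensions; neither is taken from the sample repository.

```python
from pathlib import Path

# File extensions treated as images in this sketch (an assumption).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_input_images(directory):
    """Return the image files in a directory, sorted by name."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

An inference component could call a function like this on each run and pass the returned paths to the model one by one.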

We divided the overall deployment and inference process into three consecutive steps to deploy a cloud-trained ML model to an edge environment and use it for predictions:

  1. Prepare – Package the trained model for edge deployment.
  2. Deploy – Transfer the model and inference components from the cloud to the edge device.
  3. Inference – Load the model and run inference code for image predictions.

The following architecture diagram shows the details of this three-step process and how we implemented it with AWS services.

Inference Process

In the following sections, we discuss the details for each step and show how to embed this process into an automated and repeatable orchestration and CI/CD workflow for both the ML models and corresponding inference code.

Prepare

Edge devices often come with limited compute and memory compared to a cloud environment where powerful CPUs and GPUs can run ML models easily. Different model-optimization techniques allow you to tailor a model for a specific software or hardware platform to increase prediction speed, ideally with little or no loss in accuracy.

In this example, we exported the trained model in the training pipeline to the ONNX format for portability, possible optimizations, as well as optimized edge runtimes, and registered the model within Amazon SageMaker Model Registry. In this step, we create a new Greengrass model component including the latest registered model for subsequent deployment.
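For orientation, a Greengrass component recipe for a model-only component might look roughly like the structure built below. The component name, version, publisher, and S3 URI are placeholders invented for this example; the actual recipe used in this series lives in the GitHub repository and may differ.

```python
import json


def build_model_recipe(component_name, version, artifact_uri):
    """Build a minimal recipe for a component that only ships a model artifact."""
    return {
        "RecipeFormatVersion": "2020-01-25",
        "ComponentName": component_name,
        "ComponentVersion": version,
        "ComponentDescription": "Packaged ONNX model for edge inference (example)",
        "ComponentPublisher": "Example publisher",  # placeholder
        "Manifests": [
            {
                "Platform": {"os": "linux"},
                # The artifact is a ZIP archive that is unpacked on the device.
                "Artifacts": [{"URI": artifact_uri, "Unarchive": "ZIP"}],
                # No lifecycle scripts: this component only delivers files.
                "Lifecycle": {},
            }
        ],
    }


recipe = build_model_recipe(
    "com.example.qualityinspection.model",        # hypothetical component name
    "1.0.0",
    "s3://example-bucket/models/1.0.0/model.zip",  # hypothetical artifact location
)
recipe_json = json.dumps(recipe, indent=2)
```

Creating a new component version for each newly registered model keeps the model traceable from the registry to the device.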

Deploy

A secure and reliable deployment mechanism is key when deploying a model from the cloud to an edge device. Because AWS IoT Greengrass already incorporates a robust and secure edge deployment system, we’re using this for our deployment purposes. Before we look at our deployment process in detail, let’s do a quick recap on how AWS IoT Greengrass deployments work. At the core of the AWS IoT Greengrass deployment system are components, which define the software modules deployed to an edge device running AWS IoT Greengrass Core. These can either be private components that you build or public components that are provided either by AWS or the broader Greengrass community. Multiple components can be bundled together as part of a deployment. A deployment configuration defines the components included in a deployment and the deployment’s target devices. It can either be defined in a deployment configuration file (JSON) or via the AWS IoT Greengrass console when creating a new deployment.
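A deployment configuration in JSON form can be sketched as follows. The target ARN, deployment name, component names, and versions are placeholders chosen for this example; the field names follow the shape of the AWS IoT Greengrass V2 deployment document as we understand it, so verify them against the service documentation before use.

```python
import json


def build_deployment_config(target_arn, deployment_name, component_versions):
    """Build a deployment document from a {component name: version} mapping."""
    return {
        "targetArn": target_arn,
        "deploymentName": deployment_name,
        "components": {
            name: {"componentVersion": version}
            for name, version in component_versions.items()
        },
    }


config = build_deployment_config(
    # Placeholder thing group ARN for the target edge devices.
    "arn:aws:iot:eu-central-1:111122223333:thinggroup/ExampleEdgeDevices",
    "quality-inspection-deployment",
    {
        "com.example.qualityinspection.model": "1.0.0",      # hypothetical
        "com.example.qualityinspection.inference": "1.2.0",  # hypothetical
    },
)
config_json = json.dumps(config, indent=2)
```

Because each component carries its own version, a later deployment can bump only the component that changed.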

We create the following two Greengrass components, which are then deployed to the edge device via the deployment process:

  • Packaged model (private component) – This component contains the trained ML model in ONNX format.
  • Inference code (private component) – Aside from the ML model itself, we need to implement some application logic to handle tasks like data preparation, communication with the model for inference, and postprocessing of inference results. In our example, we’ve developed a Python-based private component to handle the following tasks:
    • Install the required runtime components like the Ultralytics YOLOv8 Python package.
    • Load prepared images from a specific directory, which simulates a camera live stream, and prepare the image data according to the model input requirements.
    • Make inference calls against the loaded model with the prepared image data.
    • Check the predictions and upload inference results back to the cloud.

If you want to have a deeper look at the inference code we built, refer to the GitHub repo.

Inference

The model inference process on the edge device automatically starts after deployment of the aforementioned components is finished. The custom inference component periodically runs the ML model with images from a local directory. The inference result per image returned from the model is a tensor with the following content:

  • Confidence scores – How confident the model is regarding the detections
  • Object coordinates – The scratch object coordinates (x, y, width, height) detected by the model in the image

In our case, the inference component takes care of sending inference results to a specific MQTT topic on AWS IoT where it can be read for further processing. These messages can be viewed via the MQTT test client on the AWS IoT console for debugging. In a production setting, you can decide to automatically notify another system that takes care of removing faulty metal tags from the production line.
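The following sketch shows one way to turn the per-image detections into a JSON message that could be published to an MQTT topic. The message fields, the confidence threshold, and the tuple layout of a detection are assumptions for this example; the actual message format is defined by the inference component in the GitHub repo.

```python
import json


def build_inference_message(image_name, detections, min_confidence=0.5):
    """Filter detections by confidence and serialize them as a JSON message.

    Each detection is assumed to be a tuple (x, y, width, height, confidence).
    """
    kept = [
        {
            "confidence": round(confidence, 4),
            "box": {"x": x, "y": y, "width": width, "height": height},
        }
        for (x, y, width, height, confidence) in detections
        if confidence >= min_confidence
    ]
    return json.dumps(
        {
            "image": image_name,
            "scratch_detected": bool(kept),  # True if any detection survived
            "detections": kept,
        }
    )


message = build_inference_message(
    "tag_0001.jpg",  # placeholder file name
    [(120.0, 80.0, 40.0, 12.0, 0.93), (10.0, 15.0, 5.0, 5.0, 0.21)],
)
```

A downstream consumer only needs the `scratch_detected` flag to decide whether a tag is faulty, while the boxes remain available for debugging.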

Orchestration

As seen in the preceding sections, multiple steps are required to prepare and deploy an ML model, the corresponding inference code, and the required runtime or agent to an edge device. Step Functions is a fully managed service that allows you to orchestrate these dedicated steps and design the workflow in the form of a state machine. The serverless nature of this service and native Step Functions capabilities like AWS service API integrations allow you to quickly set up this workflow. Built-in capabilities like retries and logging are important building blocks for robust orchestrations. For more details regarding the state machine definition itself, refer to the GitHub repository or check the state machine graph on the Step Functions console after you deploy this example in your account.
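To give an impression of what such a state machine looks like, here is a reduced two-state definition in Amazon States Language, built as a Python dictionary. The state names, the retry settings, and the Lambda function ARNs are placeholders for this example and do not reflect the actual definition in the GitHub repository.

```python
import json


def build_state_machine(prepare_arn, deploy_arn):
    """Build a two-step workflow in which each task retries with backoff."""
    retry = [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 5,   # wait before the first retry
            "MaxAttempts": 3,
            "BackoffRate": 2.0,     # double the wait after each attempt
        }
    ]
    return {
        "Comment": "Prepare and deploy a model to the edge (example)",
        "StartAt": "PrepareModelComponent",
        "States": {
            "PrepareModelComponent": {
                "Type": "Task",
                "Resource": prepare_arn,
                "Retry": retry,
                "Next": "DeployToEdge",
            },
            "DeployToEdge": {
                "Type": "Task",
                "Resource": deploy_arn,
                "Retry": retry,
                "End": True,
            },
        },
    }


definition = build_state_machine(
    # Placeholder ARNs of hypothetical Lambda functions.
    "arn:aws:lambda:eu-central-1:111122223333:function:example-prepare",
    "arn:aws:lambda:eu-central-1:111122223333:function:example-deploy",
)
definition_json = json.dumps(definition, indent=2)
```

Declaring retries in the definition keeps transient failures, such as a temporarily unreachable device, out of the task code itself.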

Infrastructure deployment and integration into CI/CD

The CI/CD pipeline to integrate and build all the required infrastructure components follows the same pattern illustrated in Part 1 of this series. We use the AWS Cloud Development Kit (AWS CDK) to deploy the required pipelines from AWS CodePipeline.

Deployment CDK

Learnings

There are multiple ways to build an architecture for an automated, robust, and secure ML model edge deployment system, which are often very dependent on the use case and other requirements. However, here are a few learnings we would like to share with you:

  • Evaluate in advance if the additional AWS IoT Greengrass compute resource requirements fit your case, especially with constrained edge devices.
  • Establish a deployment mechanism that integrates a verification step of the deployed artifacts before running on the edge device to ensure that no tampering happened during transmission.
  • It’s good practice to keep the deployment components on AWS IoT Greengrass as modular and self-contained as possible to be able to deploy them independently. For example, if you have a relatively small inference code module but a big ML model in terms of size, you don’t always want to deploy them both if just the inference code has changed. This is especially important when you have limited bandwidth or high-cost edge device connectivity.
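The verification step mentioned in the learnings can be as simple as comparing a checksum of the downloaded artifact with a value published alongside it. The sketch below uses SHA-256 from the Python standard library; the function names are chosen for this example, and how the expected digest is distributed to the device is left open.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """Return True if the file matches the expected SHA-256 digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())
```

A component could run a check like this before loading a model and refuse to start inference if the digest does not match.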

Conclusion

This concludes our three-part series on building an end-to-end MLOps pipeline for visual quality inspection at the edge. We looked at the additional challenges that come with deploying an ML model at the edge, like model packaging or complex deployment orchestration. We implemented the pipeline in a fully automated way so we can put our models into production in a robust, secure, repeatable, and traceable fashion. Feel free to use the architecture and implementation developed in this series as a starting point for your next ML-enabled project. If you have any questions about how to architect and build such a system for your environment, please reach out. For other topics and use cases, refer to our Machine Learning and IoT blogs.


About the authors

Michael Roth is a Senior Solutions Architect at AWS supporting Manufacturing customers in Germany to solve their business challenges through AWS technology. Besides work and family, he’s interested in sports cars and enjoys Italian coffee.

Jörg Wöhrle is a Solutions Architect at AWS, working with manufacturing customers in Germany. With a passion for automation, Jörg has worked as a software developer, DevOps engineer, and Site Reliability Engineer in his pre-AWS life. Beyond cloud, he’s an ambitious runner and enjoys quality time with his family. So if you have a DevOps challenge or want to go for a run: let him know.

Johannes Langer is a Senior Solutions Architect at AWS, working with enterprise customers in Germany. Johannes is passionate about applying machine learning to solve real business problems. In his personal life, Johannes enjoys working on home improvement projects and spending time outdoors with his family.
