Research Focus: Week of July 17, 2023

Microsoft Research Focus 20 | Week of July 17, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

RetroRanker: leveraging reaction changes to improve retrosynthesis prediction through re-ranking

Retrosynthesis is an important task in organic chemistry: given a product, the goal is to propose a list of candidate reactants that are likely to lead to it. Recent data-driven approaches to retrosynthesis have achieved promising results. However, they tend to make predictions that follow the training data distribution, a phenomenon known as frequency bias, which can lower prediction quality.

In a new paper: RetroRanker: leveraging reaction changes to improve retrosynthesis prediction through re-ranking, researchers from Microsoft and academic colleagues introduce RetroRanker, a ranking model built upon graph neural networks, which is designed to mitigate frequency bias in predictions of existing retrosynthesis models. In order to lower the rankings of chemically unreasonable predictions, RetroRanker incorporates potential reaction changes of each set of predicted reactants in obtaining the given product. The predicted re-ranked results on publicly available retrosynthesis benchmarks show that RetroRanker can improve results on most state-of-the-art models. Preliminary studies also indicate that RetroRanker can enhance the performance of multi-step retrosynthesis.

Spotlight: Microsoft Research Podcast

AI Frontiers: AI for health and the future of research with Peter Lee

Peter Lee, head of Microsoft Research, and Ashley Llorens, AI scientist and engineer, discuss the future of AI research and the potential for GPT-4 as a medical copilot.

NEW RESEARCH

Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is one of the most widely used. Yet despite its popularity, PPO may suffer from mode collapse, instability, and poor sample efficiency.

In a new paper: Fine-Tuning Language Models with Advantage-Induced Policy Alignment, researchers from Microsoft show that these issues can be alleviated by a novel algorithm called Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. This research demonstrates empirically that APA consistently outperforms PPO in language tasks by a large margin, when a separate reward model is employed as the evaluator. In addition, compared with PPO, APA offers a more stable form of control over the deviation from the model’s initial policy, ensuring that the model improves its performance without collapsing to deterministic output. In addition to empirical results, the researchers also provide a theoretical justification supporting the design of their loss function.


NEW RESEARCH

A project-driven distributed energy resource dataset for the U.S. grid

Designing future energy systems to accommodate variable renewable energy and third-party owned devices requires information with high spatial and temporal granularity. Existing public datasets focus on specific resource classes (e.g., bulk generators, residential solar, or electric vehicles), and are not useful for informing holistic planning or policy decisions. Further, with the growing presence of distributed energy resources (DERs) located in the distribution grid, datasets and models that focus only on the bulk system will no longer be sufficient.

In a new paper: Towards closing the data gap: A project-driven distributed energy resource dataset for the U.S. Grid, researchers from Microsoft address this modelling need with a project-driven dataset of DERs for the contiguous U.S., generated using only publicly available data. They integrate the resources into a high-resolution test system of the U.S. grid. This model, and the DER dataset, enable planners, operators, and policy makers to pose questions and conduct data-driven analysis of rapid decarbonization pathways for the electricity system. They further pose a set of research questions in their research project database.


NEW RESEARCH

End-to-end Privacy Preserving Training and Inference for Air Pollution Forecasting with Data from Rival Fleets

Privacy-preserving machine learning promises to train machine learning models by combining data spread across multiple data silos. Theoretically, secure multiparty computation (MPC) allows multiple data owners to train models on their joint data without revealing data to each other. However, prior implementations have had limitations affecting accuracy, breadth of supported models, and latency overheads that impact their relevance.
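To make the MPC idea concrete, here is a minimal sketch of additive secret sharing, one of the standard building blocks behind MPC. It is an illustration only and is not the protocol used in the paper; the modulus and the function names `share`, `reconstruct`, and `secure_sum` are assumptions chosen for this example. Each data owner splits its private value into random shares, and only the sum of all values is ever reconstructed.

```python
import secrets

# Modulus for the arithmetic; an arbitrary choice for this sketch.
PRIME = 2**61 - 1


def share(value: int, n_parties: int) -> list[int]:
    """Split a non-negative value (< PRIME) into n shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares: list[int]) -> int:
    """Recombine shares; any strict subset of them looks uniformly random."""
    return sum(shares) % PRIME


def secure_sum(private_values: list[int]) -> int:
    """Sum the owners' values without any party seeing another owner's value."""
    n = len(private_values)
    # Owner i splits its value; share j is sent to party j.
    distributed = [share(v, n) for v in private_values]
    # Party j adds the shares it received, one from each owner.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the partial sums are published and combined.
    return reconstruct(partial_sums)
```

For example, `secure_sum([12, 30, 7])` returns 49 even though no single party holds more than random-looking shares of the other owners' inputs. Real MPC training builds multiplication, comparison, and fixed-point arithmetic on top of primitives like this, which is where the latency overheads mentioned above come from.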

In a new paper: End-to-end Privacy Preserving Training and Inference for Air Pollution Forecasting with Data from Rival Fleets, researchers from Microsoft address the practical problem of secure training and inference of models for urban sensing problems. This includes traffic congestion estimation and air pollution monitoring in large cities, where data can be contributed by rival fleet companies while balancing the latency-accuracy trade-offs using MPC-based techniques.

This work includes a custom ML model that can be efficiently trained with MPC within a desirable latency, and an end-to-end system of private training and inference that provably matches the training accuracy of cleartext ML training. This trained model allows users to make sensitive queries in a privacy-preserving manner while carefully handling potentially invalid queries.


NEW RESEARCH

ASL Citizen – A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition

About 70 million deaf people worldwide use a sign language as their primary language, and at least 71 countries mandate the provision of services in sign language. Nonetheless, most existing information resources (like search engines or news sites) are written, and do not offer equitable access. Intelligent sign language systems could help expand access, but development has been impeded by a severe lack of appropriate data.

To help advance the state of sign language modeling, a team at Microsoft collaborated with colleagues at multiple institutions to create ASL Citizen, the first crowdsourced isolated sign language dataset. It contains about 84,000 videos of 2,700 distinct signs from American Sign Language (ASL), making it the largest isolated sign language recognition (ISLR) dataset available. Unlike prior datasets, it features everyday signers in everyday recording scenarios, and was collected with Deaf community involvement, consent, and compensation. The dataset improves state-of-the-art performance in single-sign recognition from about 30% accuracy to 63% accuracy, over a large vocabulary and tested on participants unseen in training.

This dataset is released alongside a new paper: ASL Citizen: A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition, which reframes ISLR as a dictionary retrieval task and establishes state-of-the-art baselines. Code and a searchable dictionary view of the crowdsourced dataset are also provided.


NEW RESOURCE

MABIM: Multi-agent Benchmark for Inventory Management

Multi-agent reinforcement learning (MARL) empowers multiple agents to accomplish shared objectives through collaboration and competition in specific environments. This approach has applications in diverse fields such as robotics, autonomous driving, gaming, economics, finance, and healthcare. The success of reinforcement learning algorithms depends on a variety of interactive learning environments. These environments enable agents to optimize decision-making strategies across numerous complex scenarios. Despite the emergence of various learning environments in the MARL domain, there remains a shortage of environments that address multiple challenges while offering flexible customization and expansion.

To tackle various MARL challenges, researchers from Microsoft recently released a versatile learning environment: Multi-agent Benchmark for Inventory Management (MABIM). Based on inventory management problems found in operations research, MABIM establishes a MARL benchmark evaluation framework that supports multi-echelon, multi-product inventory networks. This framework allows for the customization of diverse environments, simulating an array of challenging scenarios.

MABIM comprises 51 challenging tasks and includes features such as high operational efficiency, a Gym standard interface, comprehensive strategy visualization tools, and real-data-based capabilities to facilitate MARL research. Initial experiments using MABIM have revealed intriguing findings. For example, as the number of agents increases, the Independent Proximal Policy Optimization (IPPO) algorithm experiences difficulty training and the QTRAN algorithm becomes unstable. IPPO displays short-sighted behavior in resource-limited competitive environments, adopting long-term unprofitable strategies to evade immediate losses. Pure MARL algorithms have difficulty learning effective upstream and downstream strategies in environments that necessitate cooperation. In non-stationary environments, MARL strategies outperform conventional operations research algorithms.


NEW RESEARCH

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation

Recently, visual synthesis has attracted a great deal of interest in the field of generative models. Existing work has demonstrated the ability to generate high-quality images. However, videos in real applications are more challenging than images due to their length. A feature film typically runs more than 90 minutes. Cartoons often run for 30 minutes. Even for short video applications like TikTok, the recommended length is 21 to 34 seconds.

In a recent paper, NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation, researchers from Microsoft propose a novel architecture for extremely long video generation. Most current work generates long videos sequentially, segment by segment, which typically creates a gap between training on short videos and inference on long videos, and makes generation inefficient. Instead, this new approach adopts a coarse-to-fine process, in which the video can be generated in parallel at the same granularity. A global diffusion model is applied to generate the keyframes across the entire time range, and then local diffusion models recursively fill in the content between nearby frames. This simple yet effective strategy allows direct training on long videos to reduce the training-inference gap, and makes it possible to generate all segments in parallel.
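The coarse-to-fine idea can be sketched as a frame-index schedule. The snippet below is only an illustration under stated assumptions: it computes which frame indices would be produced at each level and contains no diffusion model at all. The function names and the `pieces` parameter (how many parts each gap is split into per pass) are hypothetical and do not come from the paper.

```python
def fill_between(a: int, b: int, pieces: int) -> list[int]:
    """Interior indices that split the gap (a, b) into at most `pieces` parts."""
    gap = b - a
    k = min(pieces, gap)
    return [a + (gap * i) // k for i in range(1, k)]


def coarse_to_fine_schedule(num_frames: int, pieces: int) -> list[list[int]]:
    """Return, level by level, the frame indices generated at that level."""
    if num_frames < 2 or pieces < 2:
        raise ValueError("need at least 2 frames and pieces >= 2")
    last = num_frames - 1
    # Level 0 stands in for the global model: sparse keyframes over the whole range.
    known = [0] + fill_between(0, last, pieces) + [last]
    levels = [known]
    while len(known) < num_frames:
        new_frames = []
        # Each adjacent pair is an independent job, so a local model could
        # fill all of these gaps in parallel.
        for a, b in zip(known, known[1:]):
            new_frames.extend(fill_between(a, b, pieces))
        levels.append(new_frames)
        known = sorted(known + new_frames)
    return levels
```

With 17 frames and `pieces=4`, level 0 yields keyframes `[0, 4, 8, 12, 16]` and a single further level fills every remaining frame. The number of levels grows logarithmically with video length, whereas segment-by-segment generation needs a number of sequential steps proportional to the length.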

The post Research Focus: Week of July 17, 2023 appeared first on Microsoft Research.

Sailing Seas of Data: Startup Charts Autonomous Oceanic Monitoring

Saildrone is making a splash in autonomous oceanic monitoring.

The startup’s nautical data collection technology has tracked hurricanes up close in the North Atlantic, discovered a 3,200-foot underwater mountain in the Pacific Ocean and begun to help map the entirety of the world’s ocean floor.

Based in the San Francisco Bay Area, the company develops autonomous uncrewed surface vehicles (USVs) that carry a wide range of sensors. Its data streams are processed on NVIDIA Jetson modules for AI at the edge and are being optimized in prototypes with the NVIDIA DeepStream software development kit for intelligent video analytics.

Saildrone is seeking to make ocean intelligence collection cost-effective, offering data-gathering systems for science, fisheries, weather forecasting, ocean mapping and maritime security.

It has three different USVs, and its Mission Portal control center service is used for monitoring customized missions and visualizing data in near real time. Also, some of Saildrone’s historical data is freely available to the public.

“We’ve sailed into three major hurricanes, and right through the eye of Hurricane Sam, and all the vehicles came out the other side — they are pretty robust platforms,” said Blythe Towal, vice president of software engineering at Saildrone, referring to a powerful cyclone that threatened Bermuda in 2021.

Saildrone, founded in 2012, has raised $190 million in funding. The startup is a member of NVIDIA Inception, a program that provides companies with technology support and AI platforms guidance.

Keeping an AI on Earth’s Waters

Saildrone is riding a wave of interest for use of its crewless data collection missions in environmental studies of oceans and lakes.

The University of Hawaii at Manoa has enlisted the help of three 23-foot Saildrone Explorer USVs to study the impact of ocean acidification on climate change. The six-month mission around the islands of Hawaii, Maui, Oahu and Kauai will be used to help evaluate the ocean’s health around the state.

Ocean acidification is a reduction in the ocean’s pH, and contributing factors include the burning of fossil fuels and farming. Acidification can harm coral, oysters, clams, sea urchins and calcareous plankton, which in turn threatens marine ecosystems.

Saildrone recently partnered with Seabed 2030 to completely map the world’s oceans. Seabed 2030 is a collaboration between the Nippon Foundation and the General Bathymetric Chart of the Oceans, or GEBCO, to map ocean floors worldwide by 2030.

“Saildrone’s vision is of a healthy ocean and a sustainable planet,” said Saildrone founder and CEO Richard Jenkins. “A complete map of the ocean floor is fundamental to achieving that vision.”

The scientific community worldwide is embracing NVIDIA AI for climate studies, including for hyper-local climate modeling, AI to improve sequestering carbon, renewable energy research and many other areas. Dedicating its own expertise, NVIDIA is developing the world’s most powerful AI supercomputer for predicting climate change, named Earth-2, which will be used to create a digital twin of Earth in Omniverse.

Energy-Efficient Data Processing 

Saildrone USVs enable researchers to collect more data using fewer resources than traditional boats and crews, conserving energy and keeping crews out of danger.

The USVs are built for harsh weather and long missions. One of its USVs recently completed a 370-day voyage monitoring carbon dioxide, sailing from Rhode Island across the North Atlantic to Cabo Verde, down to the equator off the west coast of Africa, and back to Florida.

Running mostly on solar and wind power requires energy-efficient computing to handle so much data processing.

“With solar power, being able to keep our compute load power efficiency lower than a typical computing platform running GPUs by implementing NVIDIA Jetson is important for enabling us to do these kinds of missions,” said Towal.

Oceanic Surveying Meets Edge AI

Saildrone relies on the NVIDIA JetPack SDK for access to a full development environment for hardware-accelerated edge AI on the Jetson platform. It runs machine learning on the module for image-based vessel detection to aid navigation.

Saildrone pilots set waypoints and optimize the routes using metocean data — which includes meteorological and oceanographic information — returned from the vehicle. All of the USVs are monitored around the clock, and operators can change course remotely via the cloud if needed.

Machine learning mostly runs locally on the Jetson module, although it can also run in the cloud over a satellite connection, because bandwidth is limited and it is costly to shuttle data from the robust suite of sensors producing high-resolution imagery.

The USVs have oceanographic sensors for measurement of wind, temperature, salinity and dissolved carbon. The company also enables research of ocean and lake floors with bathymetric sensors, including deep sonar mapping with single- or multi-beam for going deeper or wider. And its perceptual sensor suite includes radar, visual and underwater acoustic sensors.

DeepStream Goes Deep Sea

Saildrone taps into the NVIDIA DeepStream SDK for its vision AI applications and services. Developers can build seamless streaming pipelines for AI-based video, audio and image analytics using the kit.

Offering a 10x throughput improvement, DeepStream can be applied from edge to cloud to develop optimized intelligent video applications that handle multiple video, image and audio streams.

Saildrone will rely on DeepStream for image preprocessing and model inference, which enables machine learning at the edge, even at sea while powered by sun and wind.

Learn more about NVIDIA Jetson modules and the DeepStream SDK.

Enhance Amazon Lex with conversational FAQ features using LLMs

Amazon Lex is a service that allows you to quickly and easily build conversational bots (“chatbots”), virtual agents, and interactive voice response (IVR) systems for applications such as Amazon Connect.

Artificial intelligence (AI) and machine learning (ML) have been a focus for Amazon for over 20 years, and many of the capabilities that customers use with Amazon are driven by ML. Today, large language models (LLMs) are transforming the way developers and enterprises solve historically complex challenges related to natural language understanding (NLU). We recently announced Amazon Bedrock, which democratizes foundation model access so that developers can easily build and scale generative AI-based applications using familiar AWS tools and capabilities.

One of the challenges enterprises face is incorporating their business knowledge into LLMs to deliver accurate and relevant responses. When leveraged effectively, enterprise knowledge bases can be used to deliver tailored self-service and assisted-service experiences, by delivering information that helps customers solve problems independently or that augments an agent’s knowledge. Today, a bot developer can improve self-service experiences without LLMs in two ways. First, they can create intents, sample utterances, and responses that cover all anticipated user questions within an Amazon Lex bot. Second, they can integrate bots with search solutions, which index documents stored across a wide range of repositories and find the most relevant document to answer a customer’s question. These methods are effective, but they require developer resources, which makes getting started difficult.

One of the benefits offered by LLMs is the ability to create relevant and compelling conversational self-service experiences. They do so by leveraging enterprise knowledge bases and delivering more accurate and contextual responses. This blog post introduces a powerful solution for augmenting Amazon Lex with LLM-based FAQ features using Retrieval Augmented Generation (RAG). We will review how the RAG approach augments Amazon Lex FAQ responses using your company data sources. In addition, we will demonstrate Amazon Lex integration with LlamaIndex, an open-source data framework that gives the bot developer flexibility in knowledge sources and formats. As a bot developer gains confidence using LlamaIndex to explore LLM integration, they can scale the Amazon Lex capability further. They can also use enterprise search services such as Amazon Kendra, which is natively integrated with Amazon Lex.

In this solution, we showcase the practical application of an Amazon Lex chatbot with LLM-based RAG enhancement. We use the Zappos customer support use case as an example to demonstrate the effectiveness of this solution, which takes the user through an enhanced FAQ experience (with LLM), rather than directing them to fallback (default, without LLM).

Solution overview

RAG combines the strengths of traditional retrieval-based and generative AI based approaches to Q&A systems. This methodology harnesses the power of large language models, such as Amazon Titan or open-source models (for example, Falcon), to perform generative tasks in retrieval systems. It also takes into account the semantic context from stored documents more effectively and efficiently.

RAG starts with an initial retrieval step to retrieve relevant documents from a collection based on the user’s query. It then employs a language model to generate a response by considering both the retrieved documents and the original query. By integrating RAG into Amazon Lex, we can provide accurate and comprehensive answers to user queries, resulting in a more engaging and satisfying user experience.

The RAG approach requires document ingestion so that embeddings can be created to enable LLM-based search. The following diagram shows how the ingestion process creates the embeddings that are then used by the chatbot during fallback to answer the customer’s question.

With this solution architecture, you should choose the most suitable LLM for your use case. It also provides an inference endpoint choice between Amazon Bedrock (in limited preview) and models hosted on Amazon SageMaker JumpStart, offering additional LLM flexibility.

The document is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket. The S3 bucket has an event listener attached that invokes an AWS Lambda function on changes to the bucket. The Lambda function ingests the new document and places the embeddings in another S3 bucket. The embeddings are then used by the RAG implementation in the Amazon Lex bot during the fallback intent to answer the customer’s question. The next diagram shows the architecture of how an FAQ bot within Lex can be enhanced with LLMs and RAG.
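As a rough illustration of the ingestion trigger, the following sketch shows how a Lambda handler might read the uploaded object's location from an S3 event notification. It is not the code from the GitHub repository: the helper name `extract_s3_objects` and the returned payload are assumptions, and the actual embedding step is left as a placeholder comment.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for every S3 record in a notification event."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # ignore records that did not come from S3
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in the event (spaces become "+").
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Entry point invoked by the bucket's event listener."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        # Placeholder: a real function would download the document, build the
        # embeddings or index, and write the result to the second bucket.
        print(f"would index s3://{bucket}/{key}")
    return {"indexed": [f"s3://{b}/{k}" for b, k in objects]}
```

Keeping the event parsing in its own small function makes it easy to unit test the handler locally with a hand-written event before deploying it.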

Let’s explore how we can integrate RAG based on LlamaIndex into an Amazon Lex bot. We provide code examples and an AWS Cloud Development Kit (AWS CDK) import to assist you in setting up the integration. You can find the code examples in our GitHub repository. The following sections provide a step-by-step guide to help you set up the environment and deploy the necessary resources.

How RAG works with Amazon Lex

The flow of RAG involves an iterative process where the retriever component retrieves relevant passages, the question and passages help construct the prompt, and the generation component produces a response. This combination of retrieval and generation techniques allows the RAG model to take advantage of the strengths of both approaches, providing accurate and contextually appropriate answers to user questions. The workflow provides the following capabilities:

  • Retriever engine – The RAG model begins with a retriever component responsible for retrieving relevant documents from a large corpus. This component typically uses an information retrieval technique like TF-IDF or BM25 to rank and select documents that are likely to contain the answer to a given question. The retriever scans the document corpus and retrieves a set of relevant passages.
  • Prompt helper – After the retriever has identified the relevant passages, the RAG model moves to prompt creation. The prompt is a combination of the question and the retrieved passages, serving as additional context for the prompt, which is used as input to the generator component. To create the prompt, the model typically augments the question with the selected passages in a specific format.
  • Response generation – The prompt, consisting of the question and relevant passages, is fed into the generation component of the RAG model. The generation component is usually a language model capable of reasoning through the prompt to generate a coherent and relevant response.
  • Final response – Finally, the RAG model selects the highest-ranked answer as the output and presents it as the response to the original question. The selected answer can be further postprocessed or formatted as necessary before being returned to the user. In addition, the solution enables filtering of the generated response if the retrieval results yield a low confidence score, implying that the question is likely out of distribution (OOD).
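The four stages above can be sketched end to end in a few lines. This is a toy illustration, not the LlamaIndex-based implementation used in the solution: it uses a hand-rolled TF-IDF retriever, a hypothetical prompt template, a `generate` callable that stands in for the LLM endpoint, and an arbitrary default cutoff of 0.1.

```python
import math
import re
from collections import Counter

FALLBACK = "Sorry, I can't answer that from the FAQ."


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(u: dict, v: dict) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(question: str, passages: list[str], top_k: int = 2):
    """Retriever engine: rank passages by TF-IDF cosine similarity."""
    docs = [Counter(tokenize(p)) for p in passages]
    doc_freq = Counter(term for d in docs for term in d)
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in d.items()} for d in docs]
    # Terms that never occur in the corpus get zero weight in the query.
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(question)).items()}
    scored = sorted(((cosine(query, v), p) for v, p in zip(vectors, passages)), reverse=True)
    return scored[:top_k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Prompt helper: combine the question with the retrieved context."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question, passages, generate, cutoff=0.1, top_k=2):
    """Response generation plus the low-confidence filter on retrieval scores."""
    hits = retrieve(question, passages, top_k)
    if not hits or hits[0][0] < cutoff:
        return FALLBACK  # likely out of distribution
    kept = [p for score, p in hits if score >= cutoff]
    return generate(build_prompt(question, kept))
```

In a Lex bot, `answer` would be called from the fallback intent's Lambda function with `generate` wired to the hosted model. Swapping the TF-IDF retriever for an embedding-based one changes only `retrieve`; the prompt and cutoff logic stay the same.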

LlamaIndex: An open-source data framework for LLM-based applications

In this post, we demonstrate the RAG solution based on LlamaIndex. LlamaIndex is an open-source data framework specifically designed to facilitate LLM-based applications. It offers a robust and scalable solution for managing document collections in different formats. With LlamaIndex, bot developers can integrate LLM-based question answering (QA) capabilities into their applications without the complexity of managing solutions built for large-scale document collections. Furthermore, this approach proves to be cost-effective for smaller document repositories.

Prerequisites

You should have the following prerequisites:

Set up your development environment

The main third-party package requirements are llama_index and the SageMaker SDK. Follow the specified commands in our GitHub repository’s README to set up your environment properly.

Deploy the required resources

This step involves creating an Amazon Lex bot, S3 buckets, and a SageMaker endpoint. Additionally, you need to Dockerize the code in the Docker image directory and push the images to Amazon Elastic Container Registry (Amazon ECR) so that they can run in Lambda. Follow the specified commands in our GitHub repository’s README to deploy the services.

During this step, we demonstrate LLM hosting via SageMaker Deep Learning Containers. Adjust the settings according to your computation needs:

  • Model – To find a model that meets your requirements, you can explore resources like the Hugging Face model hub. It offers a variety of models such as Falcon 7B or Flan-T5-XXL. Additionally, you can find detailed information about various officially supported model architectures, helping you make an informed decision. For more information about different model types, refer to optimized architectures.
  • Model inference endpoint – Define the path of the model (for example, Falcon 7B), choose your instance type (for example, g5.4xlarge), and use quantization (for example, int-8 quantization). Note: This solution provides you the flexibility to choose another model inferencing endpoint. You can also use Amazon Bedrock, which provides access to other LLMs such as Amazon Titan.
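As a hedged illustration of how these settings might be captured in code, the sketch below collects them in one place and derives container environment variables from them. The `EndpointSettings` class and `container_env` helper are hypothetical. The environment variable names (`HF_MODEL_ID`, `SM_NUM_GPUS`, `HF_MODEL_QUANTIZE`) are those used by the Hugging Face text generation container on SageMaker, which is an assumption here; the repository's README is the authority on what this solution actually deploys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointSettings:
    """Hypothetical container for the choices described above."""

    model_id: str = "tiiuae/falcon-7b"    # Falcon 7B on the Hugging Face model hub
    instance_type: str = "ml.g5.4xlarge"  # SageMaker instance names carry an "ml." prefix
    num_gpus: int = 1                     # a g5.4xlarge has a single GPU
    quantize_int8: bool = True


def container_env(settings: EndpointSettings) -> dict[str, str]:
    """Build the environment passed to the model container (assumed names)."""
    env = {
        "HF_MODEL_ID": settings.model_id,
        "SM_NUM_GPUS": str(settings.num_gpus),
    }
    if settings.quantize_int8:
        # In that container, int-8 quantization is selected via bitsandbytes.
        env["HF_MODEL_QUANTIZE"] = "bitsandbytes"
    return env
```

Keeping the model, instance type, and quantization in a single object makes it straightforward to try a larger model or a different instance size without touching the deployment code.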

Set up your document index via LlamaIndex

To set up your document index, first upload your document data. We assume that you have the source of your FAQ content, such as a PDF or text file.

After the document data is uploaded, the LlamaIndex system will automatically initiate the process of creating the document index. This task is performed by a Lambda function, which generates the index and saves it to an S3 bucket.

To enable efficient retrieval of relevant information, configure the document retriever using the LlamaIndex Retriever Query Engine. This engine offers several customization options, such as the following:

  • Embedding models – You can choose your embedding model, such as Hugging Face embedding.
  • Confidence cutoff – Specify a confidence cutoff threshold to determine the quality of retrieval results. If the confidence score falls below this threshold, you can choose to provide out-of-scope responses, indicating that the query is beyond the scope of the indexed documents.

Test the integration

Create your bot definition with a fallback intent and use the Amazon Lex console to test your FAQ requests. For more details, please refer to the GitHub repository. The following screenshot shows an example conversation with the bot.

Tips to boost your bot efficiency

The following tips could potentially further improve the efficiency of your bot:

  • Index storage – Store your index in an S3 bucket or a service with vector database capabilities such as Amazon OpenSearch. By utilizing cloud-based storage solutions, you can enhance the accessibility and scalability of your index, leading to faster retrieval times and improved overall performance. Also, refer to this blog post for an Amazon Lex bot that utilizes an Amazon Kendra search solution.
  • Retrieval optimization – Experiment with different sizes of embedding models for the retriever. The choice of embedding model can significantly impact the input requirements of your LLM. Finding the optimal balance between model size and retrieval performance can result in improved efficiency and faster response times.
  • Prompt engineering – Experiment with different prompt formats, lengths, and styles to optimize the performance and quality of your bot’s answers.
  • LLM model selection – Select the most suitable LLM model for your specific use case. Consider factors such as model size, language capabilities, and compatibility with your application requirements. Choosing the right LLM model ensures optimal performance and efficient utilization of system resources.

Contact center conversations can span from self-service to a live human interaction. For use cases involving human-to-human interactions over Amazon Connect, you can use Wisdom to search and find content across multiple repositories, such as frequently asked questions (FAQs), wikis, articles, and step-by-step instructions for handling different customer issues.

Clean up

To avoid incurring future expenses, proceed with deleting all the resources that were deployed as part of this exercise. We have provided a script to shut down the SageMaker endpoint gracefully. Usage details are in the README. Additionally, to remove all the other resources you can run cdk destroy in the same directory as the other cdk commands to deprovision all the resources in your stack.

Summary

This post discussed the following steps to enhance Amazon Lex with LLM-based QA features using the RAG strategy and LlamaIndex:

  • Install the necessary dependencies, including LlamaIndex libraries
  • Set up model hosting via Amazon SageMaker or Amazon Bedrock (in limited preview)
  • Configure LlamaIndex by creating an index and populating it with relevant documents
  • Integrate RAG into Amazon Lex by modifying the configuration and configuring RAG to use LlamaIndex for document retrieval
  • Test the integration by engaging in conversations with the chatbot and observing its retrieval and generation of accurate responses

By following these steps, you can seamlessly incorporate powerful LLM-based QA capabilities and efficient document indexing into your Amazon Lex chatbot, resulting in more accurate, comprehensive, and contextually aware interactions with users. As a follow up, we also invite you to review our next blog post, which explores enhancing the Amazon Lex FAQ experience using URL ingestion and LLMs.


About the authors

Max Henkel-Wallace is a Software Development Engineer at AWS Lex. He enjoys leveraging technology to maximize customer success. Outside of work he is passionate about cooking, spending time with friends, and backpacking.

Song Feng is a Senior Applied Scientist at AWS AI Labs, specializing in Natural Language Processing and Artificial Intelligence. Her research explores various aspects of these fields including document-grounded dialogue modeling, reasoning for task-oriented dialogues, and interactive text generation using multimodal data.

Saket Saurabh is an engineer with the AWS Lex team. He works on improving the Lex developer experience to help developers build more human-like chatbots. Outside of work, he enjoys traveling, discovering diverse cuisines, and learning about different cultures.

Enhance Amazon Lex with LLMs and improve the FAQ experience using URL ingestion

In today’s digital world, most consumers would rather find answers to their customer service questions on their own rather than taking the time to reach out to businesses and/or service providers. This blog post explores an innovative solution to build a question and answer chatbot in Amazon Lex that uses existing FAQs from your website. This AI-powered tool can provide quick, accurate responses to real-world inquiries, allowing the customer to quickly and easily solve common problems independently.

Single URL ingestion

Many enterprises have a published set of answers for FAQs for their customers available on their website. In this case, we want to offer customers a chatbot that can answer their questions from our published FAQs. In the blog post titled Enhance Amazon Lex with conversational FAQ features using LLMs, we demonstrated how you can use a combination of Amazon Lex and LlamaIndex to build a chatbot powered by your existing knowledge sources, such as PDF or Word documents. To support a simple FAQ, based on a website of FAQs, we need to create an ingestion process that can crawl the website and create embeddings that can be used by LlamaIndex to answer customer questions. In this case, we will build on the bot created in the previous blog post, which queries those embeddings with a user’s utterance and returns the answer from the website FAQs.

The following diagram shows how the ingestion process and the Amazon Lex bot work together for our solution.

In the solution workflow, the website with FAQs is ingested via AWS Lambda. This Lambda function crawls the website and stores the resulting text in an Amazon Simple Storage Service (Amazon S3) bucket. The S3 bucket then triggers a Lambda function that uses LlamaIndex to create embeddings that are stored in Amazon S3. When a question from an end-user arrives, such as “What is your return policy?”, the Amazon Lex bot uses its Lambda function to query the embeddings using a RAG-based approach with LlamaIndex. For more information about this approach and the pre-requisites, refer to the blog post, Enhance Amazon Lex with conversational FAQ features using LLMs.

After the pre-requisites from the aforementioned blog are complete, the first step is to ingest the FAQs into a document repository that can be vectorized and indexed by LlamaIndex. The following code shows how to accomplish this:

import logging
import sys
import requests
import html2text
from llama_index.readers.schema.base import Document
from llama_index import GPTVectorStoreIndex
from typing import List

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))


class EZWebLoader:
    """Minimal web loader that fetches pages and reduces their HTML to text."""

    def __init__(self, default_header: dict = None):
        # Keep a reference to the html2text module; its html2text() function does the conversion
        self._html_to_text_parser = html2text
        if default_header is None:
            self._default_header = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}
        else:
            self._default_header = default_header

    def load_data(self, urls: List[str], headers: dict = None) -> List[Document]:
        if headers is None:
            headers = self._default_header

        documents = []
        for url in urls:
            # Fetch the raw HTML, then reduce it to text before wrapping it in a Document
            response = requests.get(url, headers=headers).text
            response = self._html_to_text_parser.html2text(response)
            documents.append(Document(response))
        return documents

url = "http://www.zappos.com/general-questions"
loader = EZWebLoader()
documents = loader.load_data([url])
index = GPTVectorStoreIndex.from_documents(documents)

In the preceding example, we take a predefined FAQ website URL from Zappos and ingest it using the EZWebLoader class. With this class, we have navigated to the URL and loaded all the questions that are in the page into an index. We can now ask a question like “Does Zappos have gift cards?” and get the answers directly from our FAQs on the website. The following screenshot shows the Amazon Lex bot test console answering that question from the FAQs.

We were able to achieve this because we had crawled the URL in the first step and created embeddings that LlamaIndex could use to search for the answer to our question. Our bot’s Lambda function shows how this search is run whenever the fallback intent is returned:

import time
import json
import os
import logging
import boto3
from llama_index import StorageContext, load_index_from_storage


logger = logging.getLogger()
logger.setLevel(logging.DEBUG)


def download_docstore():
    # Create an S3 client
    s3 = boto3.client('s3')

    # List all objects in the S3 bucket and download each one
    try:
        bucket_name = 'faq-bot-storage-001'
        s3_response = s3.list_objects_v2(Bucket=bucket_name)

        if 'Contents' in s3_response:
            for item in s3_response['Contents']:
                file_name = item['Key']
                logger.debug("Downloading to /tmp/" + file_name)
                s3.download_file(bucket_name, file_name, '/tmp/' + file_name)

            logger.debug('All files downloaded from S3 and written to local filesystem.')

    except Exception as e:
        logger.error(e)
        raise e

#download the doc store locally
download_docstore()

storage_context = StorageContext.from_defaults(persist_dir="/tmp/")
# load index
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()


def lambda_handler(event, context):
    """
    Route the incoming request based on intent.
    The JSON body of the request is provided in the event slot.
    """
    # By default, treat the user request as coming from the America/New_York time zone.
    os.environ['TZ'] = 'America/New_York'
    time.tzset()
    logger.debug("===== START LEX FULFILLMENT ====")
    logger.debug(event)
    slots = {}
    if "currentIntent" in event and "slots" in event["currentIntent"]:
        slots = event["currentIntent"]["slots"]
    intent = event["sessionState"]["intent"]

    dialogaction = {"type": "Delegate"}
    message = []
    if str.lower(intent["name"]) == "fallbackintent":
        #execute query from the input given by the user
        response = str.strip(query_engine.query(event["inputTranscript"]).response)
        dialogaction["type"] = "Close"
        message.append({'content': f'{response}', 'contentType': 'PlainText'})

    final_response = {
        "sessionState": {
            "dialogAction": dialogaction,
            "intent": intent
        },
        "messages": message
    }

    logger.debug(json.dumps(final_response, indent=1))
    logger.debug("===== END LEX FULFILLMENT ====")

    return final_response
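
The fallback branch of the preceding handler can be exercised locally without AWS or LlamaIndex. The following is a minimal sketch under stated assumptions: StubQueryEngine and build_fallback_response are hypothetical names introduced only for illustration, and the sample event contains only the fields the handler reads.

```python
import json


class StubQueryEngine:
    """Hypothetical stand-in for the LlamaIndex query engine used in the post."""

    class _Result:
        def __init__(self, response: str):
            self.response = response

    def query(self, text: str):
        # A real engine would retrieve context and call the LLM here.
        return self._Result(f"  Stub answer for: {text}  ")


def build_fallback_response(event: dict, query_engine) -> dict:
    """Build a response shaped like the handler's, answering only the fallback intent."""
    intent = event["sessionState"]["intent"]
    dialog_action = {"type": "Delegate"}
    messages = []
    if intent["name"].lower() == "fallbackintent":
        # Query with the raw user utterance and trim surrounding whitespace
        answer = query_engine.query(event["inputTranscript"]).response.strip()
        dialog_action["type"] = "Close"
        messages.append({"content": answer, "contentType": "PlainText"})
    return {
        "sessionState": {"dialogAction": dialog_action, "intent": intent},
        "messages": messages,
    }


# Hypothetical, trimmed-down event for local testing
sample_event = {
    "inputTranscript": "Does Zappos have gift cards?",
    "sessionState": {"intent": {"name": "FallbackIntent", "state": "InProgress"}},
}
result = build_fallback_response(sample_event, StubQueryEngine())
print(json.dumps(result, indent=1))
```

Keeping the response construction in a small function like this makes it possible to unit test the routing logic separately from the index download and model calls.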

This solution works well when a single webpage has all the answers. However, most FAQ sites are not built on a single page. For instance, in our Zappos example, if we ask the question “Do you have a price matching policy?”, then we get a less-than-satisfactory answer, as shown in the following screenshot.

In the preceding interaction, the price-matching policy answer isn’t helpful for our user. This answer is short because the FAQ referenced is a link to a specific page about the price matching policy and our web crawl was only for the single page. Achieving better answers will mean crawling these links as well. The next section shows how to get answers to questions that require two or more levels of page depth.

N-level crawling

When we crawl a web page for FAQ knowledge, the information we want can be contained in linked pages. For example, in our Zappos example, we ask the question “Do you have a price matching policy?” and the answer is “Yes please visit <link> to learn more.” If someone asks “What is your price matching policy?” then we want to give a complete answer with the policy. Achieving this means we have the need to traverse links to get the actual information for our end-user. During the ingestion process, we can use our web loader to find the anchor links to other HTML pages and then traverse them. The following code change to our web crawler allows us to find links in the pages we crawl. It also includes some additional logic to avoid circular crawling and allow a filter by a prefix.

import logging
import requests
import html2text
from llama_index.readers.schema.base import Document
from llama_index import GPTVectorStoreIndex
from typing import List
import re


def find_http_urls_in_parentheses(s: str, prefix: str = None):
    # html2text renders anchors as [text](url), so match URLs wrapped in parentheses
    pattern = r'\((https?://[^)]+)\)'
    urls = re.findall(pattern, s)

    matched = []
    if prefix is not None:
        for url in urls:
            if str(url).startswith(prefix):
                matched.append(url)
    else:
        matched = urls

    return list(set(matched)) # remove duplicates by converting to set, then convert back to list



class EZWebLoader:

    def __init__(self, default_header: dict = None):
        self._html_to_text_parser = html2text
        if default_header is None:
            self._default_header = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}
        else:
            self._default_header = default_header

    def load_data(self,
                  urls: List[str],
                  num_levels: int = 0,
                  level_prefix: str = None,
                  headers: dict = None) -> List[Document]:

        logging.info(f"Number of urls: {len(urls)}.")

        if headers is None:
            headers = self._default_header

        documents = []
        visited = {}
        for url in urls:
            q = [url]
            depth = num_levels
            for page in q:
                if page not in visited: #prevent cycles by checking to see if we already crawled a link
                    logging.info(f"Crawling {page}")
                    visited[page] = True #add entry to visited to prevent re-crawling pages
                    response = requests.get(page, headers=headers).text
                    response = self._html_to_text_parser.html2text(response) #reduce html to text
                    documents.append(Document(response))
                    if depth > 0:
                        #crawl linked pages
                        ingest_urls = find_http_urls_in_parentheses(response, level_prefix)
                        logging.info(f"Found {len(ingest_urls)} pages to crawl.")
                        q.extend(ingest_urls)
                        depth -= 1 #reduce the depth counter so we go only num_levels deep in our crawl
                else:
                    logging.info(f"Skipping {page} as it has already been crawled")
        logging.info(f"Number of documents: {len(documents)}.")
        return documents

url = "http://www.zappos.com/general-questions"
loader = EZWebLoader()
#crawl the site with 1 level depth and prefix of "/c/" for customer service root
documents = loader.load_data([url],
                             num_levels=1, level_prefix="https://www.zappos.com/c/")
index = GPTVectorStoreIndex.from_documents(documents)

In the preceding code, we introduce the ability to crawl N levels deep, and we give a prefix that allows us to restrict crawling to only things that begin with a certain URL pattern. In our Zappos example, the customer service pages all are rooted from zappos.com/c, so we include that as a prefix to limit our crawls to a smaller and more relevant subset. The code shows how we can ingest up to two levels deep. Our bot’s Lambda logic remains the same because nothing has changed except the crawler ingests more documents.
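
To see how the depth limit and the prefix filter interact, the crawl can be sketched against an in-memory site instead of live pages. This is an illustrative variant, not the loader above: the crawl function, the fetch callback, and the example.com pages are all hypothetical, and this version tracks the level of each page explicitly rather than using a single shared counter.

```python
import re
from collections import deque
from typing import Callable, Dict, List, Optional

# Matches URLs in Markdown-style links such as [text](https://...)
LINK_PATTERN = re.compile(r"\((https?://[^)]+)\)")


def find_links(text: str, prefix: Optional[str] = None) -> List[str]:
    """Return unique linked URLs, optionally keeping only those that start with prefix."""
    links = sorted(set(LINK_PATTERN.findall(text)))
    return [u for u in links if prefix is None or u.startswith(prefix)]


def crawl(start_url: str,
          fetch: Callable[[str], str],
          num_levels: int = 0,
          prefix: Optional[str] = None) -> Dict[str, str]:
    """Breadth-first crawl that follows links at most num_levels below start_url."""
    visited: Dict[str, str] = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        if url in visited:  # avoid circular crawling
            continue
        text = fetch(url)
        visited[url] = text
        if level < num_levels:
            for link in find_links(text, prefix):
                queue.append((link, level + 1))
    return visited


# Hypothetical in-memory "site": page text already reduced to Markdown-like text
FAKE_SITE = {
    "https://example.com/faq": "See [returns](https://example.com/c/returns) and [blog](https://example.com/blog).",
    "https://example.com/c/returns": "Returns are free. Back to [faq](https://example.com/faq). More: [shipping](https://example.com/c/shipping)",
    "https://example.com/c/shipping": "Shipping is fast.",
    "https://example.com/blog": "Unrelated.",
}

pages = crawl("https://example.com/faq", FAKE_SITE.__getitem__,
              num_levels=1, prefix="https://example.com/c/")
print(sorted(pages))
```

With num_levels=1 only the start page and the prefixed page it links to are fetched; raising num_levels to 2 also reaches the shipping page, while the blog page is never crawled because it does not match the prefix.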

We now have all the documents indexed and we can ask a more detailed question. In the following screenshot, our bot provides the correct answer to the question “Do you have a price matching policy?”

We now have a complete answer to our question about price matching. Instead of simply being told “Yes see our policy,” it gives us the details from the second-level crawl.

Clean up

To avoid incurring future expenses, proceed with deleting all the resources that were deployed as part of this exercise. We have provided a script to shut down the SageMaker endpoint gracefully. Usage details are in the README. Additionally, to remove all the other resources you can run cdk destroy in the same directory as the other cdk commands to deprovision all the resources in your stack.

Conclusion

The ability to ingest a set of FAQs into a chatbot enables your customers to find the answers to their questions with straightforward, natural language queries. By combining the built-in support in Amazon Lex for fallback handling with a RAG solution such as LlamaIndex, we can provide a quick path for our customers to get satisfying, curated, and approved answers to FAQs. By applying N-level crawling in our solution, we can support answers that span multiple FAQ links and provide deeper answers to our customers’ queries. By following these steps, you can seamlessly incorporate powerful LLM-based Q and A capabilities and efficient URL ingestion into your Amazon Lex chatbot. This results in more accurate, comprehensive, and contextually aware interactions with users.


About the authors

Max Henkel-Wallace is a Software Development Engineer at AWS Lex. He enjoys leveraging technology to maximize customer success. Outside of work he is passionate about cooking, spending time with friends, and backpacking.

Song Feng is a Senior Applied Scientist at AWS AI Labs, specializing in Natural Language Processing and Artificial Intelligence. Her research explores various aspects of these fields including document-grounded dialogue modeling, reasoning for task-oriented dialogues, and interactive text generation using multimodal data.

John Baker is a Principal SDE at AWS where he works on Natural Language Processing, Large Language Models and other ML/AI related projects. He has been with Amazon for 9+ years and has worked across AWS, Alexa and Amazon.com. In his spare time, John enjoys skiing and other outdoor activities throughout the Pacific Northwest.

SimPer: Simple self-supervised learning of periodic targets

Learning from periodic data (signals that repeat, such as a heart beat or the daily temperature changes on Earth’s surface) is crucial for many real-world applications, from monitoring weather systems to detecting vital signs. For example, in the environmental remote sensing domain, periodic learning is often needed to enable nowcasting of environmental changes, such as precipitation patterns or land surface temperature. In the health domain, learning from video measurement has been shown to extract (quasi-)periodic vital signs such as atrial fibrillation and sleep apnea episodes.

Approaches like RepNet highlight the importance of these types of tasks, and present a solution that recognizes repetitive activities within a single video. However, these are supervised approaches that require a significant amount of data to capture repetitive activities, all labeled to indicate the number of times an action was repeated. Labeling such data is often challenging and resource-intensive, requiring researchers to manually capture gold-standard temporal measurements that are synchronized with the modality of interest (e.g., video or satellite imagery).

Alternatively, self-supervised learning (SSL) methods (e.g., SimCLR and MoCo v2), which leverage a large amount of unlabeled data to learn representations that capture periodic or quasi-periodic temporal dynamics, have demonstrated success in solving classification tasks. However, they overlook the intrinsic periodicity (i.e., the ability to identify if a frame is part of a periodic process) in data and fail to learn robust representations that capture periodic or frequency attributes. This is because periodic learning exhibits characteristics that are distinct from prevailing learning tasks.

Feature similarity is different in the context of periodic representations as compared to static features (e.g., images). For example, videos that are offset by short time delays or are reversed should be similar to the original sample, whereas videos that have been upsampled or downsampled by a factor x should be different from the original sample by a factor of x.

To address these challenges, in “SimPer: Simple Self-Supervised Learning of Periodic Targets”, published at the eleventh International Conference on Learning Representations (ICLR 2023), we introduced a self-supervised contrastive framework for learning periodic information in data. Specifically, SimPer leverages the temporal properties of periodic targets using temporal self-contrastive learning, where positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. We propose periodic feature similarity that explicitly defines how to measure similarity in the context of periodic learning. Moreover, we design a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). Next, we demonstrate that SimPer effectively learns periodic feature representations compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts. Finally, we are excited to release the SimPer code repo to the research community.

The SimPer framework

SimPer introduces a temporal self-contrastive learning framework. Positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. For temporal video examples, periodicity-invariant changes are cropping, rotation or flipping, whereas periodicity-variant changes involve increasing or decreasing the speed of a video.

To explicitly define how to measure similarity in the context of periodic learning, SimPer proposes periodic feature similarity. This construction allows us to formulate training as a contrastive learning task. A model can be trained with data without any labels and then fine-tuned if necessary to map the learned features to specific frequency values.

Given an input sequence x, we know there’s an underlying associated periodic signal. We then transform x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo speed or frequency labels for the unlabeled input x.
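
The idea of speed-altered views with pseudo labels can be illustrated on a plain 1D sequence. The following is a minimal sketch, not the SimPer implementation: change_speed and make_views are hypothetical helper names, linear interpolation is an assumed resampling choice, and unlike a real video pipeline the output length is allowed to vary with the speed.

```python
from typing import List, Tuple


def change_speed(x: List[float], speed: float) -> List[float]:
    """Resample x at positions 0, speed, 2*speed, ... using linear interpolation."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    n_out = int((len(x) - 1) / speed) + 1
    out = []
    for i in range(n_out):
        pos = i * speed
        lo = int(pos)
        hi = min(lo + 1, len(x) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * x[lo] + frac * x[hi])
    return out


def make_views(x: List[float], speeds: List[float]) -> List[Tuple[float, List[float]]]:
    """Pair each speed-altered view with its speed, which acts as a pseudo label."""
    return [(s, change_speed(x, s)) for s in speeds]


signal = [0.0, 1.0, 2.0, 3.0, 4.0]
views = make_views(signal, [0.5, 1.0, 2.0])
for speed, view in views:
    print(speed, view)
```

Each view is a negative for the others because its underlying frequency differs, and the relative speeds provide ordering information even though the true frequency of the input is never observed.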

Conventional similarity measures such as cosine similarity emphasize strict proximity between two feature vectors, and are sensitive to index shifted features (which represent different time stamps), reversed features, and features with changed frequencies. In contrast, periodic feature similarity should be high for samples with small temporal shifts or reversed indexes, while capturing a continuous similarity change when the feature frequency varies. This can be achieved via a similarity metric in the frequency domain, such as the distance between two Fourier transforms.
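
The contrast between cosine similarity and a frequency-domain measure can be checked on toy signals. The sketch below is one possible instantiation chosen for illustration (cosine similarity between normalized power spectra, computed with a naive DFT); it is not claimed to be the exact measure used in the paper, and the function names are hypothetical.

```python
import cmath
import math
from typing import List


def power_spectrum(x: List[float]) -> List[float]:
    """Naive DFT power spectrum over non-negative frequencies, normalized to sum to 1."""
    n = len(x)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(coeff) ** 2)
    total = sum(power) or 1.0
    return [p / total for p in power]


def cosine(a: List[float], b: List[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    norm = math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b))
    return dot / norm if norm else 0.0


def periodic_similarity(a: List[float], b: List[float]) -> float:
    """Compare signals by their power spectra, which discard phase (shift) information."""
    return cosine(power_spectrum(a), power_spectrum(b))


n = 64
base = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
shifted = base[8:] + base[:8]   # circular time shift by half a period
reversed_ = base[::-1]          # time-reversed copy
faster = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]  # doubled frequency

print("cosine, shifted:     ", round(cosine(base, shifted), 3))
print("periodic, shifted:   ", round(periodic_similarity(base, shifted), 3))
print("periodic, reversed:  ", round(periodic_similarity(base, reversed_), 3))
print("periodic, faster:    ", round(periodic_similarity(base, faster), 3))
```

In this toy setup the shifted copy looks maximally dissimilar under cosine similarity, yet the spectrum-based measure treats the shifted and reversed copies as matching the original and separates the faster signal.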

To harness the intrinsic continuity of augmented samples in the frequency domain, SimPer designs a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). This makes it suitable for regression tasks, where the goal is to recover a continuous signal, such as a heart beat.
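
One way to picture a soft, regression-style contrastive objective is as a cross-entropy between two distributions: soft targets derived from how close the pseudo speed labels are, and a softmax over feature similarities. The following sketch makes that assumption explicit; the weighting by negative absolute label difference and the function names are illustrative choices, and the exact loss should be taken from the paper.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_contrastive_loss(feature_sim: List[List[float]],
                          speeds: List[float],
                          temperature: float = 1.0) -> float:
    """Average cross-entropy between label-derived soft targets and feature-similarity softmax."""
    n = len(speeds)
    loss = 0.0
    for i in range(n):
        # Assumed soft targets: views with closer speed labels receive more weight
        targets = softmax([-abs(speeds[i] - speeds[j]) for j in range(n)])
        preds = softmax([feature_sim[i][j] / temperature for j in range(n)])
        loss -= sum(t * math.log(p) for t, p in zip(targets, preds))
    return loss / n


speeds = [0.5, 1.0, 2.0]
# Feature similarities that follow the label distances exactly
aligned = [[-abs(a - b) for b in speeds] for a in speeds]
# Feature similarities that contradict the label distances
misaligned = [row[::-1] for row in aligned]

print("aligned loss:   ", round(soft_contrastive_loss(aligned, speeds), 4))
print("misaligned loss:", round(soft_contrastive_loss(misaligned, speeds), 4))
```

Unlike the classic InfoNCE setting, where each anchor has a single hard positive, every pair contributes here in proportion to its label similarity, so the loss is lowest when feature similarity varies smoothly with the frequency difference.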

SimPer constructs negative views of data through transformations in the frequency domain. The input sequence x has an underlying associated periodic signal. SimPer transforms x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo speed or frequency labels for unlabeled input x (periodicity-variant augmentations τ). SimPer takes transformations that do not change the identity of the input and defines these as periodicity-invariant augmentations σ, thus creating different positive views of the sample. Then, it sends these augmented views to the encoder f, which extracts corresponding features.

Results

To evaluate SimPer’s performance, we benchmarked it against state-of-the-art SSL schemes (e.g., SimCLR, MoCo v2, BYOL, CVRL) on a set of six diverse periodic learning datasets for common real-world tasks in human behavior analysis, environmental remote sensing, and healthcare. Specifically, below we present results on heart rate measurement and exercise repetition counting from video. The results show that SimPer outperforms the state-of-the-art SSL schemes across all six datasets, highlighting its superior performance in terms of data efficiency, robustness to spurious correlations, and generalization to unseen targets.

Here we show quantitative results on two representative datasets for models pre-trained using SimPer and various SSL methods and then fine-tuned on the labeled data. First, we pre-train SimPer using the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) dataset, a human photoplethysmography and heart rate prediction dataset, and compare its performance to state-of-the-art SSL methods. We observe that SimPer outperforms SimCLR, MoCo v2, BYOL, and CVRL methods. The results on the human action counting dataset, Countix, further confirm the benefits of SimPer over other methods as it notably outperforms the supervised baseline. For the feature evaluation results and performance on other datasets, please refer to the paper.

Results of SimCLR, MoCo v2, BYOL, CVRL and SimPer on the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) and Countix datasets. Heart rate and repetition count performance is reported as mean absolute error (MAE).

Conclusion and applications

We present SimPer, a self-supervised contrastive framework for learning periodic information in data. We demonstrate that by combining a temporal self-contrastive learning framework, periodicity-invariant and periodicity-variant augmentations, and continuous periodic feature similarity, SimPer provides an intuitive and flexible approach for learning strong feature representations for periodic signals. Moreover, SimPer can be applied to various fields, ranging from environmental remote sensing to healthcare.

Acknowledgements

We would like to thank Yuzhe Yang, Xin Liu, Ming-Zher Poh, Jiang Wu, Silviu Borac, and Dina Katabi for their contributions to this work.

Build an email spam detector using Amazon SageMaker

Spam emails, also known as junk mail, are sent to a large number of users at once and often contain scams, phishing content, or cryptic messages. Spam emails are sometimes sent manually by a human, but most often they are sent using a bot. Examples of spam emails include fake ads, chain emails, and impersonation attempts. There is a risk that a particularly well-disguised spam email may land in your inbox, which can be dangerous if clicked on. It’s important to take extra precautions to protect your device and sensitive information.

As technology improves, detecting spam emails becomes more challenging because spam keeps changing. Spam is quite different from other types of security threats. It may at first appear to be an annoying message rather than a threat, but it has an immediate effect. Spammers also often adopt new techniques. Organizations that provide email services want to minimize spam as much as possible to avoid any damage to their end customers.

In this post, we show how straightforward it is to build an email spam detector using Amazon SageMaker. The built-in BlazingText algorithm offers optimized implementations of Word2vec and text classification algorithms. Word2vec is useful for various natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation. Text classification is essential for applications like web searches, information retrieval, ranking, and document classification.

Solution overview

This post demonstrates how you can set up an email spam detector and filter spam emails using SageMaker. Let’s see how a spam detector typically works, as shown in the following diagram.

Emails are sent through a spam detector. An email is sent to the spam folder if the spam detector detects it as spam. Otherwise, it’s sent to the customer’s inbox.

We walk you through the following steps to set up our spam detector model:

  1. Download the sample dataset from the GitHub repo.
  2. Load the data in an Amazon SageMaker Studio notebook.
  3. Prepare the data for the model.
  4. Train, deploy, and test the model.

Prerequisites

Before diving into this use case, complete the following prerequisites:

  1. Set up an AWS account.
  2. Set up a SageMaker domain.
  3. Create an Amazon Simple Storage Service (Amazon S3) bucket. For instructions, see Create your first S3 bucket.

Download the dataset

Download the email_dataset.csv from GitHub and upload the file to the S3 bucket.

The BlazingText algorithm expects a single preprocessed text file with space-separated tokens. Each line in the file should contain a single sentence. If you need to train on multiple text files, concatenate them into one file and upload the file in the respective channel.

Load the data in SageMaker Studio

To perform the data load, complete the following steps:

  1. Download the spam_detector.ipynb file from GitHub and upload the file in SageMaker Studio.
  2. In your Studio notebook, open the spam_detector.ipynb notebook.
  3. If you are prompted to choose a Kernel, choose the Python 3 (Data Science 3.0) kernel and choose Select. If not, verify that the right kernel has been automatically selected.

  4. Import the required Python library and set the roles and the S3 buckets. Specify the S3 bucket and prefix where you uploaded email_dataset.csv.

  5. Run the data load step in the notebook.

  6. Check if the dataset is balanced or not based on the Category labels.

We can see our dataset is balanced.
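
A balance check of this kind amounts to counting how often each Category value occurs. The following is a small sketch using only the standard library; the sample labels are made up for illustration and class_balance is a hypothetical helper, not code from the notebook.

```python
from collections import Counter
from typing import Dict, Iterable


def class_balance(labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of rows that carry each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical Category values; the real ones come from email_dataset.csv
sample_labels = ["spam", "ham", "ham", "spam"]
shares = class_balance(sample_labels)
print(shares)
```

Shares that are close to each other indicate a balanced dataset; a heavily skewed split would usually call for resampling or class weighting before training.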

Prepare the data

The BlazingText algorithm expects the data in the following format:

__label__<label> "<features>"

Here’s an example:

__label__0 "This is HAM"
__label__1 "This is SPAM"

For more information, see Training and Validation Data Format for the BlazingText Algorithm.

You now run the data preparation step in the notebook.

  1. First, you need to convert the Category column to an integer. The following cell replaces the SPAM value with 1 and the HAM value with 0.

  2. The next cell adds the prefix __label__ to each Category value and tokenizes the Message column.

  3. The next step is to split the dataset into train and validation datasets and upload the files to the S3 bucket.
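
The preparation steps above can be sketched with the standard library alone. This is an illustrative outline, not the notebook's code: the regular-expression tokenizer, the 80/20 split, and the helper names are assumptions, and each line is written without surrounding quotation marks, so add them if your pipeline expects the quoted form shown earlier.

```python
import random
import re
from typing import List, Tuple

# Map the Category values to the integer labels described above
LABELS = {"ham": 0, "spam": 1}


def tokenize(message: str) -> List[str]:
    """Very simple tokenizer (an assumption): lowercase, keep letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", message.lower())


def to_blazingtext_line(category: str, message: str) -> str:
    """Format one row as '__label__<int> <space-separated tokens>'."""
    return f"__label__{LABELS[category.lower()]} " + " ".join(tokenize(message))


def train_validation_split(lines: List[str],
                           validation_fraction: float = 0.2,
                           seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically and hold out a fraction of the lines for validation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical rows standing in for the Category and Message columns
rows = [("SPAM", "Best summer deal here!"), ("HAM", "See you in the office on Friday.")] * 5
lines = [to_blazingtext_line(category, message) for category, message in rows]
train, validation = train_validation_split(lines)
print(lines[0])
print(len(train), len(validation))
```

The two resulting lists would then be written to text files, one example per line, and uploaded to the S3 prefixes used for the train and validation channels.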

Train the model

To train the model, complete the following steps in the notebook:

  1. Set up the BlazingText estimator and create an estimator instance passing the container image.

  2. Set the learning mode hyperparameter to supervised.

BlazingText has both unsupervised and supervised learning modes. Our use case is text classification, which is supervised learning.

  3. Create the train and validation data channels.

  4. Start training the model.

  5. Get the accuracy of the train and validation dataset.

Deploy the model

In this step, we deploy the trained model as an endpoint. Choose your preferred instance

Test the model

Let’s provide an example of three email messages that we want to get predictions for:

  • Click on below link, provide your details and win this award
  • Best summer deal here
  • See you in the office on Friday.

Tokenize the email message and specify the payload to use when calling the REST API.

Now we can predict the email classification for each email. Call the predict method of the text classifier, passing the tokenized sentence instances (payload) into the data argument.
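
The request and response handling around the predict call can be sketched without a live endpoint. The payload below follows the JSON shape used for BlazingText text classification inference (an instances list plus an optional configuration with k); the tokenizer, the helper names, and the sample response values are assumptions made for illustration.

```python
import json
import re
from typing import List, Tuple


def tokenize(message: str) -> List[str]:
    """Very simple tokenizer (an assumption); the notebook may use a different one."""
    return re.findall(r"[a-z0-9']+", message.lower())


def build_payload(messages: List[str], k: int = 1) -> dict:
    """Build a request body with one space-separated token string per email."""
    return {
        "instances": [" ".join(tokenize(m)) for m in messages],
        "configuration": {"k": k},  # number of top labels to return per instance
    }


def parse_predictions(response_body: str) -> List[Tuple[str, float]]:
    """Turn the top label and probability of each prediction into (class, probability)."""
    results = []
    for item in json.loads(response_body):
        label = item["label"][0].replace("__label__", "")
        results.append(("spam" if label == "1" else "ham", item["prob"][0]))
    return results


emails = [
    "Click on below link, provide your details and win this award",
    "Best summer deal here",
    "See you in the office on Friday.",
]
payload = build_payload(emails)
print(json.dumps(payload, indent=1))

# Hypothetical response body; real labels and probabilities come from the endpoint
sample_response = json.dumps([
    {"label": ["__label__1"], "prob": [0.98]},
    {"label": ["__label__1"], "prob": [0.91]},
    {"label": ["__label__0"], "prob": [0.99]},
])
print(parse_predictions(sample_response))
```

Tokenizing at inference time with the same function used during data preparation keeps the input consistent with what the model saw in training.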

Clean up

Finally, you can delete the endpoint to avoid any unexpected cost.

Also, delete the data file from the S3 bucket.

Conclusion

In this post, we walked you through the steps to create an email spam detector using the SageMaker BlazingText algorithm. With the BlazingText algorithm, you can scale to large datasets. BlazingText is used for textual analysis and text classification problems, and has both unsupervised and supervised learning modes. You can use the algorithm for use cases like customer sentiment analysis and text classification.

To learn more about the BlazingText algorithm, check out BlazingText algorithm.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart

Today, we are excited to announce that Llama 2 foundation models developed by Meta are available for customers through Amazon SageMaker JumpStart. The Llama 2 family of large language models (LLMs) is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Fine-tuned LLMs, called Llama-2-chat, are optimized for dialogue use cases. You can easily try out these models and use them with SageMaker JumpStart, which is a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML.

In this post, we walk through how to use Llama 2 models via SageMaker JumpStart.

What is Llama 2

Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. Llama 2 is intended for commercial and research use in English. It comes in a range of parameter sizes—7 billion, 13 billion, and 70 billion—as well as pre-trained and fine-tuned variations. According to Meta, the tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Llama 2 was pre-trained on 2 trillion tokens of data from publicly available sources. The tuned models are intended for assistant-like chat, whereas pre-trained models can be adapted for a variety of natural language generation tasks. Regardless of which version of the model a developer uses, the responsible use guide from Meta can assist in guiding additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.

What is SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a broad selection of open source foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment.

You can now discover and deploy Llama 2 with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping ensure data security. Llama 2 models are available today in Amazon SageMaker Studio, initially in the us-east-1 and us-west-2 Regions.

Discover models

You can access the foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

Once you’re in SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

From the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources. You can find two flagship Llama 2 models in the Foundation Models: Text Generation carousel. If you don’t see Llama 2 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Apps.

You can also find the other four model variants by choosing Explore all Text Generation Models or searching for llama in the search box.

You can choose the model card to view details about the model, such as its license, the data used to train it, and how to use it. You can also find two buttons, Deploy and Open Notebook, which help you use the model.

When you choose either button, a pop-up will show the end-user license agreement and acceptable use policy for you to acknowledge.

Upon acknowledging, you will proceed to the next step to use the model.

Deploy a model

When you choose Deploy and acknowledge the terms, model deployment will start. Alternatively, you can deploy through the example notebook that shows up by choosing Open Notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, we start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker with the following code:

from sagemaker.jumpstart.model import JumpStartModel
my_model = JumpStartModel(model_id = "meta-textgeneration-llama-2-70b-f")
predictor = my_model.deploy()

This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "inputs": [
        [
            {"role": "system", "content": "Always answer with Haiku"},
            {"role": "user", "content": "I am going to Paris, what should I see?"},
        ]
    ],
    "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}
}

Fine-tuned chat models (Llama-2-7b-chat, Llama-2-13b-chat, Llama-2-70b-chat) accept a history of chat between the user and the chat assistant, and generate the subsequent chat. The pre-trained models (Llama-2-7b, Llama-2-13b, Llama-2-70b) require a string prompt and perform text completion on the provided prompt. See the following code:

predictor.predict(payload, custom_attributes="accept_eula=true")

Note that by default, accept_eula is set to false. You need to set accept_eula=true to invoke the endpoint successfully. By doing so, you accept the user license agreement and acceptable use policy as mentioned earlier. You can also download the license agreement.
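The deployment and invocation steps above can be wrapped into one helper. The following is a minimal sketch, not the notebook's code: build_model_kwargs and deploy_and_ask are hypothetical names, and the instance type you pass is assumed to be one that is supported for the chosen model in your account and Region. Calling deploy_and_ask creates a billable endpoint.

```python
def build_model_kwargs(model_id, instance_type=None):
    """Collect JumpStartModel arguments, leaving out overrides that are not set."""
    kwargs = {"model_id": model_id}
    if instance_type is not None:
        # Non-default value; when omitted, the JumpStart default is used.
        kwargs["instance_type"] = instance_type
    return kwargs


def deploy_and_ask(payload, model_id="meta-textgeneration-llama-2-7b-f",
                   instance_type=None):
    """Deploy the model, send one request, and return (predictor, response).

    Hypothetical helper: it creates real AWS resources, so delete them
    once you are done.
    """
    # Imported lazily so build_model_kwargs works without the SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(**build_model_kwargs(model_id, instance_type))
    predictor = model.deploy()
    # Passing accept_eula=true means you accept the license agreement
    # and acceptable use policy described above.
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    return predictor, response
```

Keeping the predictor in the return value lets you reuse the endpoint for further requests and clean it up afterward.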

The custom_attributes used to pass the EULA are key/value pairs. The key and value are separated by = and pairs are separated by ;. If the user passes the same key more than once, the last value is kept and passed to the script handler (in this case, used for conditional logic). For example, if accept_eula=false; accept_eula=true is passed to the server, then accept_eula=true is kept and passed to the script handler.
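To make the "last value wins" rule concrete, here is a small sketch of a parser that follows the format described above. It only illustrates the rule; parse_custom_attributes is a hypothetical name and this is not the code that runs on the endpoint.

```python
def parse_custom_attributes(raw):
    """Parse 'key=value; key=value' pairs; a repeated key keeps its last value."""
    attributes = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing or doubled separator
        key, separator, value = pair.partition("=")
        if not separator:
            raise ValueError(f"expected key=value, got {pair!r}")
        # Later assignments overwrite earlier ones for the same key.
        attributes[key.strip()] = value.strip()
    return attributes


print(parse_custom_attributes("accept_eula=false; accept_eula=true"))
```

The example prints a dictionary with a single accept_eula entry whose value is the string "true".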

Inference parameters control the text generation process at the endpoint. The max_new_tokens parameter controls the size of the output generated by the model. Note that this is not the same as the number of words, because the vocabulary of the model is not the same as the English language vocabulary, and a token may not be a whole English word. Temperature controls the randomness in the output. Higher temperature results in more creative outputs that are also more prone to hallucination. All the inference parameters are optional.
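To build intuition for temperature and top_p, the following sketch applies the commonly used definitions (temperature-scaled softmax and nucleus filtering) to a toy set of logits. The logits are made up, the function names are hypothetical, and the endpoint's actual sampling code is assumed, not confirmed, to follow these definitions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
focused = softmax_with_temperature(toy_logits, 0.6)
diffuse = softmax_with_temperature(toy_logits, 1.5)
print(focused[0] > diffuse[0])  # the top token dominates more at low temperature
```

With a lower temperature the most likely token takes a larger share of the probability mass, which is why low values give more predictable text.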

The following table lists all the Llama models available in SageMaker JumpStart along with the model_ids, default instance types, and the maximum number of total tokens (sum of number of input tokens and number of generated tokens) supported for each of these models.

Model Name | Model ID | Max Total Tokens | Default Instance Type
Llama-2-7b | meta-textgeneration-llama-2-7b | 4096 | ml.g5.2xlarge
Llama-2-7b-chat | meta-textgeneration-llama-2-7b-f | 4096 | ml.g5.2xlarge
Llama-2-13b | meta-textgeneration-llama-2-13b | 4096 | ml.g5.12xlarge
Llama-2-13b-chat | meta-textgeneration-llama-2-13b-f | 4096 | ml.g5.12xlarge
Llama-2-70b | meta-textgeneration-llama-2-70b | 4096 | ml.g5.48xlarge
Llama-2-70b-chat | meta-textgeneration-llama-2-70b-f | 4096 | ml.g5.48xlarge

Note that SageMaker endpoints have a timeout limit of 60 seconds. Thus, even though the model may be able to generate 4096 tokens, the request will fail if text generation takes more than 60 seconds. For the 7B, 13B, and 70B models, we recommend setting max_new_tokens no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens less than 4K.
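The recommendations above can be turned into a small guard that caps max_new_tokens before a request is sent. This is a sketch under stated assumptions: the helper names are hypothetical, the number of input tokens is supplied by the caller (the post does not show a tokenizer), and the total budget is taken as the 4096 tokens listed in the table.

```python
MAX_TOTAL_TOKENS = 4096  # from the table above
RECOMMENDED_MAX_NEW_TOKENS = {"7b": 1500, "13b": 1000, "70b": 500}


def model_size(model_id):
    """Extract '7b', '13b', or '70b' from a JumpStart Llama 2 model ID."""
    for part in model_id.split("-"):
        if part in RECOMMENDED_MAX_NEW_TOKENS:
            return part
    raise ValueError(f"unrecognized model ID: {model_id!r}")


def capped_max_new_tokens(model_id, requested, input_tokens):
    """Cap the request by the recommended limit and the remaining token budget."""
    room = MAX_TOTAL_TOKENS - input_tokens
    if room <= 0:
        raise ValueError("the prompt already uses the whole token budget")
    return min(requested, RECOMMENDED_MAX_NEW_TOKENS[model_size(model_id)], room)


print(capped_max_new_tokens("meta-textgeneration-llama-2-70b-f", 2000, 100))
```

For the 70B chat model, a request for 2000 new tokens is reduced to the recommended 500.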

Inference and example prompts for Llama-2-70b

You can use Llama models for text completion for any piece of text. Through text generation, you can perform a variety of tasks, such as answering questions, language translation, sentiment analysis, and many more. The input payload to the endpoint looks like the following code:

payload = {
    "inputs": str,
    # "parameters" is optional
    "parameters": {"max_new_tokens": int, "top_p": float, "temperature": float}
}

The following are some sample prompts and the text generated by the model. All outputs are generated with inference parameters {"max_new_tokens":256, "top_p":0.9, "temperature":0.6}.

Input: Simply put, the theory of relativity states that  

 Output: 
1) the laws of physics are the same for all observers in uniform motion relative to one another, and 2) the speed of light is the same for all observers, regardless of their relative motion or of the motion of the source of the light.
In the late 1800s, scientists believed that light travelled through a medium they called the luminiferous ether. But in 1887, Albert Michelson and Edward Morley designed an experiment to measure the speed of light in the ether, and found that the speed of light was the same no matter how they oriented their equipment. This result seemed to contradict the idea of the ether, and Michelson and Morley’s experiment became one of the most famous failed experiments in history.
In 1905, Albert Einstein published a paper that used the results of the Michelson-Morley experiment to develop the theory of special relativity. In 1915, Einstein published a paper on general relativity, which is a theory of gravitation.
In physics, the speed of light is considered to be a constant, and its value is 299,792,458 metres

Input: I believe the meaning of life is 

 Output: 
to be happy, to find joy in life, to love and be loved. To have the courage to be yourself and to be kind to others. To be able to express yourself, to create and to be inspired. To be able to give and receive. To be able to learn and grow. To be able to find beauty in the world and to be able to appreciate it. To be able to find peace and contentment. To be able to find meaning and purpose. To be able to make a difference in the world. To be able to find joy in the simple things. To be able to find joy in the difficult things. To be able to find joy in the journey. To be able to find joy in the destination. To be able to find joy in the present moment. To be able to find joy in the future. To be able to find joy in the past. To be able to find joy in the present. To be able to find joy in the future. To be able to find joy in the past. To be able to find joy in the present. To be able to find joy in the future. To be able to find joy in the past. To be able to find joy in the present. To be able to find joy

Input: A brief message congratulating the team on the launch:

        Hi everyone,
        
        I just  

 Output: 
wanted to send a quick note to congratulate everyone on the launch of our new website. I know it’s been a long time coming and I’m sure we’re all relieved that it’s finally out there.
        
        Thanks to everyone for their hard work and dedication over the last few months. I know it wasn’t always easy, but I think we can all be proud of the final product.
        
        I’m looking forward to seeing how our users respond to the new site and I’m confident that it will be a success.
        
        Thanks again, and enjoy the rest of your day!
        
        Best,
        
        [Your Name]
        
        [Your Title]
        
        [Your Company]
        
        [Your Email]
        
        [Your Phone Number]
        
        [Your Website]
        
        [Your Social Media Handles]
        
        [Your Address]
        
        [Your City, State, Zip Code]
        
        [Your Country]
        
        [Your Postal Code]
        
        [Your Website]
        
        [

In the next example, we show how to use Llama models with few-shot in-context learning, where we make training samples available to the model in the prompt. Note that we only run inference on the deployed model, and model weights don’t change during this process.

Input: Translate English to French:
        
        sea otter => loutre de mer
        peppermint => menthe poivrée
        plush girafe => girafe peluche
        cheese => 

 Output: 
fromage
        chocolate => chocolat
        car => voiture
        hamster => hamster
        beer => bière
        bicycle => bicyclette
        banana => banane
        strawberry => fraise
        pizza => pizza
        cheeseburger => cheeseburger
        pineapple => ananas
        carrot => carotte
        watermelon => pastèque
        whale => baleine
        eggplant => aubergine
        cat => chat
        penguin => manchot
        lion => lion
        elephant => éléphant
        panda => panda
        tiger => tigre
        salmon => saumon
        squirrel => écureuil
        kangaroo => kangourou
        turtle => tortue
        octopus => poulpe
        frog => grenouille
        pig => cochon
        goldfish => poisson rouge
        tiger shark => requin tigre
        gorilla => gorille
        dol
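As the output above shows, the pre-trained model answers the query and then keeps inventing further pairs until it runs out of tokens. The following sketch assembles a few-shot prompt in the same shape and keeps only the first generated line. Both helper names are hypothetical, and the generated text is passed in as a plain string because the post does not show the endpoint's response format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Lay out an instruction, solved examples, and an open query, one per line."""
    lines = [instruction, ""]
    lines += [f"{source} => {target}" for source, target in examples]
    lines.append(f"{query} =>")
    return "\n".join(lines)


def first_answer(generated_text):
    """Return the first line of the completion, dropping any extra pairs after it."""
    stripped = generated_text.strip()
    return stripped.splitlines()[0].strip() if stripped else ""


prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(prompt)
print(first_answer("fromage\n        chocolate => chocolat"))
```

Lowering max_new_tokens for this kind of task is another way to avoid paying for the extra generated pairs.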

Inference and example prompts for Llama-2-70b-chat

With Llama-2-Chat models, which are optimized for dialogue use cases, the input to the chat model endpoints is the previous history between the chat assistant and the user. You can ask questions contextual to the conversation that has happened so far. You can also provide the system configuration, such as personas that define the chat assistant’s behavior. The input payload to the endpoint looks like the following code:

payload = {
    "inputs": [
        [
            # optional system message
            {"role": "system", "content": str},
            # repeated k times: one user message followed by one assistant message
            {"role": "user", "content": str}, {"role": "assistant", "content": str},
            # final user message
            {"role": "user", "content": str},
        ]
    ],
    # "parameters" is optional
    "parameters": {"max_new_tokens": int, "top_p": float, "temperature": float}
}
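The payload shape above can be produced and checked with a small helper. This is a minimal sketch with a hypothetical function name; it enforces only the structure described here (an optional system message, then alternating user and assistant turns that end with a user message).

```python
def build_chat_payload(turns, system=None, parameters=None):
    """Build a chat payload from (role, content) tuples.

    turns must alternate user/assistant, starting and ending with "user".
    """
    if not turns or len(turns) % 2 == 0:
        raise ValueError("expected an odd number of turns ending with a user message")
    dialog = []
    if system is not None:
        dialog.append({"role": "system", "content": system})
    for index, (role, content) in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if role != expected:
            raise ValueError(f"turn {index} should be {expected!r}, got {role!r}")
        dialog.append({"role": role, "content": content})
    payload = {"inputs": [dialog]}
    if parameters:
        payload["parameters"] = dict(parameters)  # copy so the caller's dict is untouched
    return payload


payload = build_chat_payload(
    [("user", "I am going to Paris, what should I see?")],
    system="Always answer with Haiku",
    parameters={"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6},
)
print(payload["inputs"][0][0]["role"])
```

Validating the turn order on the client side surfaces a malformed conversation before a request reaches the endpoint.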

The following are some sample prompts and the text generated by the model. All outputs are generated with the inference parameters {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}.

Input:
User: what is the recipe of mayonnaise?

Output:
Assistant:  Sure, I can help you with that! Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, vinegar or lemon juice, and seasonings. Here's a basic recipe for homemade mayonnaise:

Ingredients:

* 2 egg yolks
* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed
* 1 tablespoon (15 ml) vinegar or lemon juice
* 1/2 teaspoon salt
* 1/4 teaspoon black pepper

Instructions:

1. In a medium-sized bowl, whisk together the egg yolks, salt, and black pepper until well combined.
2. Slowly pour the oil into the bowl while continuously whisking the mixture. It's important to add the oil very slowly, otherwise the mixture can break and you'll end up with a separated, curdled mess.
3. Continue whisking until the mixture thickens and emulsifies, which should take about 5-7 minutes. You'll know it's ready when the mixture has doubled in volume and has a thick, creamy consistency.
4. Add the vinegar or lemon juice and continue whisking for another minute or two, until the flavors are fully incorporated.
5. Taste the mayonnaise and adjust the seasoning as needed. You can add more salt, pepper, or acid if desired.
6. Cover the bowl with plastic wrap and refrigerate the mayonnaise for at least 30 minutes before serving. This will allow the flavors to meld together and the mayonnaise to thicken further.

That's it! This recipe makes about 1 1/2 cups (360 ml) of mayonnaise, which should be enough for a small to medium-sized batch of sandwiches or salads. You can also customize the recipe by adding different flavorings, such as garlic, mustard, or herbs, to create a unique taste. Enjoy!

In the following example, the user has had a conversation with the assistant about tourist sites in Paris. Next, the user is inquiring about the first option recommended by the chat assistant.

Input:
User: I am going to Paris, what should I see?
Assistant: Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.
User: What is so great about #1?

Output:
Assistant:  The Eiffel Tower is a world-renowned iconic landmark that symbolizes Paris and France. It was built for the World's Fair in 1889 and was initially intended to be a temporary structure, but it has become a permanent part of the Parisian skyline. Here are some reasons why the Eiffel Tower is so great:

1. Engineering marvel: The Eiffel Tower was a groundbreaking feat of engineering at the time of its construction. It stands at 324 meters (1,063 feet) tall and was the tallest structure in the world for over 40 years.
2. Iconic design: The Eiffel Tower's distinctive shape and design make it instantly recognizable. It has been the subject of countless works of art, photographs, and films, and has become a symbol of Paris and France.
3. Panoramic views: The Eiffel Tower offers breathtaking views of the city from its observation decks. Visitors can enjoy a 360-degree view of Paris, including famous landmarks such as the Seine River, the Arc de Triomphe, and the Montmartre neighborhood.
4. Historical significance: The Eiffel Tower has played a significant role in Paris's history, having been the site of numerous important events, including the famous "Battle of the Eiffel Tower" during World War II.
5. Cultural impact: The Eiffel Tower has had a significant impact on popular culture, inspiring countless works of art, literature, and music. It has also been the subject of numerous documentaries and films, including the classic movie "An American in Paris."
6. Romantic setting: The Eiffel Tower is known for its romantic atmosphere, making it a popular destination for couples and newlyweds. It has been the site of countless marriage proposals and has been featured in numerous romantic films and literature.
7. Accessibility: The Eiffel Tower is easily accessible by public transportation, making it a convenient destination for visitors. It is also wheelchair-friendly, with elevators and ramps providing access to all levels.

Overall, the Eiffel Tower is a must-see destination for anyone visiting Paris, offering a unique and unforgettable experience that combines history, culture, engineering, and rom

In the following examples, we set the system’s configuration:

Input:
System: Always answer with Haiku
User: I am going to Paris, what should I see?

Output:
Assistant:  Eiffel Tower high
Art in the Louvre Museum
River Seine flows

Input:
System: Always answer with emojis
User: How to go from Beijing to NY?

Output:
Assistant:  Here are the emojis that represent the steps to go from Beijing to New York:

🇨🇳🛫🛬🇺🇸🗽🏙🚕💨🛩🛬🛫😴💤🛩🛬🛫😍

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so that your billing is stopped:

predictor.delete_model()
predictor.delete_endpoint()
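If you run inference from a script rather than a notebook, one way to make sure cleanup happens even when a request fails is to wrap the predictor in a context manager. This is a sketch of that pattern, not part of the example notebook; cleanup_on_exit is a hypothetical name, and it relies only on the two delete calls shown above.

```python
from contextlib import contextmanager


@contextmanager
def cleanup_on_exit(predictor):
    """Yield the predictor and always delete its model and endpoint afterward."""
    try:
        yield predictor
    finally:
        # Same calls and order as the cleanup code above.
        predictor.delete_model()
        predictor.delete_endpoint()
```

You would then write "with cleanup_on_exit(predictor) as p:" around your predict calls, and the resources are deleted when the block exits, whether or not an exception was raised.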

Conclusion

In this post, we showed you how to get started with Llama 2 models in SageMaker Studio. With this, you have access to six Llama 2 foundation models that contain billions of parameters. Because foundation models are pre-trained, they can also help lower training and infrastructure costs and enable customization for your use case. To get started with SageMaker JumpStart, visit the following resources:


About the authors

June Won is a product manager with SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last-mile delivery.

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Sundar Ranganathan is the Global Head of GenAI/Frameworks GTM Specialists at AWS. He focuses on developing GTM strategy for large language models, GenAI, and large-scale ML workloads across AWS services like Amazon EC2, EKS, EFA, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.
