Accelerating ML experimentation with enhanced security: AWS PrivateLink support for Amazon SageMaker with MLflow
With access to a wide range of generative AI foundation models (FMs) and the ability to build and train their own machine learning (ML) models in Amazon SageMaker, users want a seamless and secure way to experiment with and select the models that deliver the most value for their business. In the initial stages of an ML project, data scientists collaborate closely, sharing experimental results to address business challenges. However, keeping track of numerous experiments, their parameters, metrics, and results can be difficult, especially when working on complex projects simultaneously. MLflow, a popular open-source tool, helps data scientists organize, track, and analyze ML and generative AI experiments, making it easier to reproduce and compare results.
SageMaker is a comprehensive, fully managed ML service designed to provide data scientists and ML engineers with the tools they need to handle the entire ML workflow. Amazon SageMaker with MLflow is a capability in SageMaker that enables users to create, manage, analyze, and compare their ML experiments seamlessly. It simplifies the often complex and time-consuming tasks involved in setting up and managing an MLflow environment, allowing ML administrators to quickly establish secure and scalable MLflow environments on AWS. See Fully managed MLflow on Amazon SageMaker for more details.
Enhanced security: AWS VPC and AWS PrivateLink
When working with SageMaker, you can decide the level of internet access to provide to your users. For example, you can give users access permission to download popular packages and customize the development environment. However, this can also introduce potential risks of unauthorized access to your data. To mitigate these risks, you can further restrict which traffic can access the internet by launching your ML environment in an Amazon Virtual Private Cloud (Amazon VPC). With an Amazon VPC, you can control the network access and internet connectivity of your SageMaker environment, or even remove direct internet access to add another layer of security. See Connect to SageMaker through a VPC interface endpoint to understand the implications of running SageMaker within a VPC and the differences when using network isolation.
SageMaker with MLflow now supports AWS PrivateLink, which enables you to transfer critical data from your VPC to MLflow Tracking Servers through a VPC endpoint. This capability enhances the protection of sensitive information by making sure that data sent to the MLflow Tracking Servers is transferred within the AWS network, avoiding exposure to the public internet. This capability is available in all AWS Regions where SageMaker is currently available, excluding China Regions and GovCloud (US) Regions. To learn more, see Connect to an MLflow tracking server through an Interface VPC Endpoint.
In this blog post, we demonstrate a use case to set up a SageMaker environment in a private VPC (without internet access), while using MLflow capabilities to accelerate ML experimentation.
Solution overview
You can find the reference code for this sample in GitHub. The high-level steps are as follows:
- Deploy infrastructure with the AWS Cloud Development Kit (AWS CDK) including:
- A SageMaker environment in a private VPC without internet access.
- AWS CodeArtifact, which provides a private PyPI repository that SageMaker can use to download necessary packages.
- VPC endpoints, which enable the SageMaker environment to connect to other AWS services (Amazon Simple Storage Service (Amazon S3), AWS CodeArtifact, Amazon Elastic Container Registry (Amazon ECR), Amazon CloudWatch, SageMaker Managed MLflow, and so on) through AWS PrivateLink without exposing the environment to the public internet.
- Run ML experimentation with MLflow using the @remote decorator from the open-source SageMaker Python SDK.
The overall solution architecture is shown in the following figure.
For your reference, this blog post demonstrates a solution to create a VPC with no internet connection using an AWS CloudFormation template.
Prerequisites
You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, see Creating an AWS account.
Deploy infrastructure with AWS CDK
The first step is to create the infrastructure using this CDK stack. You can follow the deployment instructions from the README.
Let’s first have a closer look at the CDK stack itself.
It defines multiple VPC endpoints, including the MLflow endpoint as shown in the following sample:
vpc.add_interface_endpoint(
    "mlflow-experiments",
    service=ec2.InterfaceVpcEndpointAwsService.SAGEMAKER_EXPERIMENTS,
    private_dns_enabled=True,
    subnets=ec2.SubnetSelection(subnets=subnets),
    security_groups=[studio_security_group]
)
We also try to restrict the SageMaker execution IAM role so that you can use SageMaker MLflow only when you’re in the right VPC.
You can further restrict the VPC endpoint for MLflow by attaching a VPC endpoint policy.
Users outside the VPC can potentially connect to SageMaker MLflow through the MLflow VPC endpoint. You can add restrictions so that user access to SageMaker MLflow is only allowed from your VPC.
studio_execution_role.attach_inline_policy(
    iam.Policy(self, "mlflow-policy",
        statements=[
            iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                actions=["sagemaker-mlflow:*"],
                resources=["*"],
                conditions={"StringEquals": {"aws:SourceVpc": vpc.vpc_id}}
            )
        ]
    )
)
After successful deployment, you should be able to see the new VPC, which has no internet access, in the Amazon VPC console, as shown in the following screenshot.
A CodeArtifact domain and a CodeArtifact repository with external connection to PyPI should also be created, as shown in the following figure, so that SageMaker can use it to download necessary packages without internet access. You can verify the creation of the domain and the repository by going to the CodeArtifact console. Choose “Repositories” under “Artifacts” from the navigation pane and you will see the repository “pip”.
ML experimentation with MLflow
Setup
After the CDK stack creation, a new SageMaker domain with a user profile should also be created. Launch Amazon SageMaker Studio and create a JupyterLab Space. In the JupyterLab Space, choose an instance type of ml.t3.medium, and select an image with SageMaker Distribution 2.1.0.
To check that the SageMaker environment has no internet connection, open the JupyterLab space and check the internet connection by running the curl command in a terminal.
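As a minimal sketch of an equivalent check from a notebook cell (in a private VPC with no internet route, the request should fail or time out):

import urllib.request

try:
    urllib.request.urlopen("https://aws.amazon.com", timeout=5)
    print("Internet access is available")
except Exception as err:
    # Expected result in a private VPC with no internet route
    print(f"No internet access: {err}")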
SageMaker with MLflow now supports MLflow version 2.16.2 to accelerate generative AI and ML workflows from experimentation to production. An MLflow 2.16.2 tracking server is created along with the CDK stack.
You can find the MLflow tracking server Amazon Resource Name (ARN) either from the CDK output or from the SageMaker Studio UI by choosing the MLflow icon, as shown in the following figure. Choose the copy button next to mlflow-server to copy the MLflow tracking server ARN.
As an example dataset to train the model, download the reference dataset from the public UC Irvine ML repository to your local PC, and name it predictive_maintenance_raw_data_header.csv.
Upload the reference dataset from your local PC to your JupyterLab Space as shown in the following figure.
To test your private connectivity to the MLflow tracking server, you can download the sample notebook that has been uploaded automatically during the creation of the stack in a bucket within your AWS account. You can find the S3 bucket name in the CDK output, as shown in the following figure.
From the JupyterLab app terminal, run the following command:
aws s3 cp --recursive <YOUR-BUCKET-URI> ./
You can now open the private-mlflow.ipynb notebook.
In the first cell, fetch credentials for the CodeArtifact PyPI repository so that SageMaker can use pip from the private AWS CodeArtifact repository. The credentials will expire in 12 hours. Make sure to log on again when they expire.
%%bash
AWS_ACCOUNT=$(aws sts get-caller-identity --output text --query 'Account')
aws codeartifact login --tool pip --repository pip --domain code-artifact-domain --domain-owner ${AWS_ACCOUNT} --region ${AWS_DEFAULT_REGION}
Experimentation
After setup, start the experimentation. The scenario uses the XGBoost algorithm to train a binary classification model. Both the data processing job and the model training job use the @remote decorator so that the jobs run in the SageMaker-associated private subnets and security group of your private VPC.
In this case, the @remote decorator looks up the parameter values from the SageMaker configuration file (config.yaml). These parameters are used for data processing and training jobs. We define the SageMaker-associated private subnets and security group in the configuration file. For the full list of supported configurations for the @remote decorator, see Configuration file in the SageMaker Developer Guide.
Note that we specify the aws codeartifact login command in PreExecutionCommands to point SageMaker to the private CodeArtifact repository. This is needed to make sure that the dependencies can be installed at runtime. Alternatively, you can pass a reference to a container in your Amazon ECR through ImageUri, which contains all installed dependencies.
We specify the security group and subnets information in VpcConfig.
config_yaml = f"""
SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      TelemetryOptOut: true
      RemoteFunction:
        # role arn is not required if in SageMaker Notebook instance or SageMaker Studio
        # Uncomment the following line and replace with the right execution role if in a local IDE
        # RoleArn: <replace the role arn here>
        # ImageUri: <replace with your image if you want to avoid installing dependencies at run time>
        S3RootUri: s3://{bucket_prefix}
        InstanceType: ml.m5.xlarge
        Dependencies: ./requirements.txt
        IncludeLocalWorkDir: true
        PreExecutionCommands:
        - "aws codeartifact login --tool pip --repository pip --domain code-artifact-domain --domain-owner {account_id} --region {region}"
        CustomFileFilter:
          IgnoreNamePatterns:
          - "data/*"
          - "models/*"
          - "*.ipynb"
          - "__pycache__"
        VpcConfig:
          SecurityGroupIds:
          - {security_group_id}
          Subnets:
          - {private_subnet_id_1}
          - {private_subnet_id_2}
"""
Here’s how you can set up an MLflow experiment similar to the following.
from time import gmtime, strftime
# Mlflow (replace these values with your own, if needed)
project_prefix = project_prefix
tracking_server_arn = mlflow_arn
experiment_name = f"{project_prefix}-sm-private-experiment"
run_name=f"run-{strftime('%d-%H-%M-%S', gmtime())}"
Data preprocessing
During the data processing, we use the @remote decorator to link parameters in config.yaml to your preprocess function.
Note that MLflow tracking starts from the mlflow.start_run() API.
The mlflow.autolog() API can automatically log information such as metrics, parameters, and artifacts.
You can use the log_input() method to log a dataset to the MLflow artifact store.
@remote(keep_alive_period_in_seconds=3600, job_name_prefix=f"{project_prefix}-sm-private-preprocess")
def preprocess(df, df_source: str, experiment_name: str):

    mlflow.set_tracking_uri(tracking_server_arn)
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run(run_name="Preprocessing") as run:
        mlflow.autolog()

        columns = ['Type', 'Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]', 'Machine failure']
        cat_columns = ['Type']
        num_columns = ['Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']
        target_column = 'Machine failure'
        df = df[columns]

        mlflow.log_input(
            mlflow.data.from_pandas(df, df_source, targets=target_column),
            context="DataPreprocessing",
        )

        ...

        model_file_path = "/opt/ml/model/sklearn_model.joblib"
        os.makedirs(os.path.dirname(model_file_path), exist_ok=True)
        joblib.dump(featurizer_model, model_file_path)

    return X_train, y_train, X_val, y_val, X_test, y_test, featurizer_model
Run the preprocessing job, then go to the MLflow UI (shown in the following figure) to see the tracked preprocessing job with the input dataset.
X_train, y_train, X_val, y_val, X_test, y_test, featurizer_model = preprocess(
    df=df,
    df_source=input_data_path,
    experiment_name=experiment_name)
You can open the MLflow UI from SageMaker Studio, as shown in the following figure. Choose Experiments in the navigation pane and select your experiment.
From the MLflow UI, you can see the processing job that just ran.
You can also see security details in the SageMaker Studio console in the corresponding training job as shown in the following figure.
Model training
Similar to the data processing job, you can also use the @remote decorator with the training job.
Note that the log_metrics() method sends your defined metrics to the MLflow tracking server.
@remote(keep_alive_period_in_seconds=3600, job_name_prefix=f"{project_prefix}-sm-private-train")
def train(X_train, y_train, X_val, y_val,
          eta=0.1,
          max_depth=2,
          gamma=0.0,
          min_child_weight=1,
          verbosity=0,
          objective='binary:logistic',
          eval_metric='auc',
          num_boost_round=5):

    mlflow.set_tracking_uri(tracking_server_arn)
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run(run_name="Training") as run:
        mlflow.autolog()

        # Creating DMatrix(es)
        dtrain = xgboost.DMatrix(X_train, label=y_train)
        dval = xgboost.DMatrix(X_val, label=y_val)
        watchlist = [(dtrain, "train"), (dval, "validation")]

        print('')
        print(f'===Starting training with max_depth {max_depth}===')

        param_dist = {
            "max_depth": max_depth,
            "eta": eta,
            "gamma": gamma,
            "min_child_weight": min_child_weight,
            "verbosity": verbosity,
            "objective": objective,
            "eval_metric": eval_metric
        }

        xgb = xgboost.train(
            params=param_dist,
            dtrain=dtrain,
            evals=watchlist,
            num_boost_round=num_boost_round)

        predictions = xgb.predict(dval)

        print("Metrics for validation set")
        print('')
        print(pd.crosstab(index=y_val, columns=np.round(predictions),
                          rownames=['Actuals'], colnames=['Predictions'], margins=True))

        rounded_predict = np.round(predictions)

        val_accuracy = accuracy_score(y_val, rounded_predict)
        val_precision = precision_score(y_val, rounded_predict)
        val_recall = recall_score(y_val, rounded_predict)

        # Log additional metrics, next to the default ones logged automatically
        mlflow.log_metric("Accuracy Model A", val_accuracy * 100.0)
        mlflow.log_metric("Precision Model A", val_precision)
        mlflow.log_metric("Recall Model A", val_recall)

        from sklearn.metrics import roc_auc_score
        val_auc = roc_auc_score(y_val, predictions)
        mlflow.log_metric("Validation AUC A", val_auc)

        model_file_path = "/opt/ml/model/xgboost_model.bin"
        os.makedirs(os.path.dirname(model_file_path), exist_ok=True)
        xgb.save_model(model_file_path)

    return xgb
Define hyperparameters and run the training job.
eta = 0.3
max_depth = 10

booster = train(X_train, y_train, X_val, y_val,
                eta=eta,
                max_depth=max_depth)
In the MLflow UI, you can see the tracked metrics, as shown in the following figure. Under the Experiments tab, go to the Training run of your experiment; the metrics appear on the Overview tab.
You can also view the metrics as graphs. Under the Model metrics tab, you can see the model performance metrics that were logged as part of the training job.
With MLflow, you can log your dataset information alongside other key metrics, such as hyperparameters and model evaluation. Find more details in the blog post LLM experimentation with MLflow.
Clean up
To clean up, first delete all spaces and applications created within the SageMaker Studio domain. Then destroy the infrastructure created by running the following code.
cdk destroy
Conclusion
SageMaker with MLflow allows ML practitioners to create, manage, analyze, and compare ML experiments on AWS. To enhance security, SageMaker with MLflow now supports AWS PrivateLink. All MLflow Tracking Server versions, including 2.16.2, integrate seamlessly with this feature, enabling secure communication between your ML environments and AWS services without exposing data to the public internet.
For an extra layer of security, you can set up SageMaker Studio within your private VPC without Internet access and execute your ML experiments in this environment.
SageMaker with MLflow now supports MLflow 2.16.2. Setting up a fresh installation provides the best experience and full compatibility with the latest features.
About the Authors
Xiaoyu Xing is a Solutions Architect at AWS. She is driven by a profound passion for Artificial Intelligence (AI) and Machine Learning (ML). She strives to bridge the gap between these cutting-edge technologies and a broader audience, empowering individuals from diverse backgrounds to learn and leverage AI and ML with ease. She is helping customers to adopt AI and ML solutions on AWS in a secure and responsible way.
Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in Telecommunications Engineering and has experience in software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.
Tomer Shenhar is a Product Manager at AWS. He specializes in responsible AI, driven by a passion to develop ethically sound and transparent AI solutions.
Mistral-NeMo-Instruct-2407 and Mistral-NeMo-Base-2407 are now available on SageMaker JumpStart
Today, we are excited to announce that Mistral-NeMo-Base-2407 and Mistral-NeMo-Instruct-2407—twelve-billion-parameter large language models from Mistral AI that excel at text generation—are available for customers through Amazon SageMaker JumpStart. You can try these models with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference. In this post, we walk through how to discover, deploy, and use the Mistral-NeMo-Instruct-2407 and Mistral-NeMo-Base-2407 models for a variety of real-world use cases.
Mistral-NeMo-Instruct-2407 and Mistral-NeMo-Base-2407 overview
Mistral NeMo, a powerful 12B parameter model developed through collaboration between Mistral AI and NVIDIA and released under the Apache 2.0 license, is now available on SageMaker JumpStart. This model represents a significant advancement in multilingual AI capabilities and accessibility.
Key features and capabilities
Mistral NeMo features a 128k token context window, enabling processing of extensive long-form content. The model demonstrates strong performance in reasoning, world knowledge, and coding accuracy. Both pre-trained base and instruction-tuned checkpoints are available under the Apache 2.0 license, making it accessible for researchers and enterprises. The model’s quantization-aware training facilitates optimal FP8 inference performance without compromising quality.
Multilingual support
Mistral NeMo is designed for global applications, with strong performance across multiple languages including English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. This multilingual capability, combined with built-in function calling and an extensive context window, helps make advanced AI more accessible across diverse linguistic and cultural landscapes.
Tekken: Advanced tokenization
The model uses Tekken, an innovative tokenizer based on tiktoken. Trained on over 100 languages, Tekken offers improved compression efficiency for natural language text and source code.
SageMaker JumpStart overview
SageMaker JumpStart is a fully managed service that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is the Model Hub, which offers a vast catalog of pre-trained models, such as DBRX, for a variety of tasks.
You can now discover and deploy both Mistral NeMo models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and machine learning operations (MLOps) controls with Amazon SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your virtual private cloud (VPC) controls, helping to support data security.
Prerequisites
To try out both NeMo models in SageMaker JumpStart, you will need the following prerequisites:
- An AWS account that will contain all your AWS resources.
- An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, see Identity and Access Management for Amazon SageMaker.
- Access to Amazon SageMaker Studio, a SageMaker notebook instance, or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
- Access to accelerated instances (GPUs) for hosting the model.
- This model requires an ml.g6.12xlarge instance. SageMaker JumpStart provides a simplified way to access and deploy over 100 different open source and third-party foundation models. In order to launch an endpoint to host Mistral NeMo from SageMaker JumpStart, you may need to request a service quota increase to access an ml.g6.12xlarge instance for endpoint usage. You can request service quota increases through the console, AWS Command Line Interface (AWS CLI), or API to allow access to those additional resources.
Discover Mistral NeMo models in SageMaker JumpStart
You can access NeMo models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.
SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, see Amazon SageMaker Studio.
In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.
Then choose HuggingFace.
From the SageMaker JumpStart landing page, you can search for NeMo in the search box. The search results will list Mistral NeMo Instruct and Mistral NeMo Base.
You can choose the model card to view details about the model such as license, data used to train, and how to use the model. You will also find the Deploy button to deploy the model and create an endpoint.
Deploy the model in SageMaker JumpStart
Deployment starts when you choose the Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.
Deploy the model with the SageMaker Python SDK
To deploy using the SDK, we start by selecting the Mistral NeMo Base model, specified by the model_id with the value huggingface-llm-mistral-nemo-base-2407. You can deploy your choice of the selected models on SageMaker with the following code. Similarly, you can deploy NeMo Instruct using its own model ID.
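The following is a minimal sketch of that deployment (the instance type is left to the JumpStart default; adjust as needed):

from sagemaker.jumpstart.model import JumpStartModel

model_id = "huggingface-llm-mistral-nemo-base-2407"
model = JumpStartModel(model_id=model_id)

# accept_eula=True is required to accept the model's end-user license agreement
predictor = model.deploy(accept_eula=True)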
This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. The EULA value must be explicitly defined as True to accept the end-user license agreement (EULA). Also make sure that you have the account-level service limit for using ml.g6.12xlarge for endpoint usage set to one or more instances. You can follow the instructions in AWS service quotas to request a service quota increase. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:
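For example, a sketch of an inference call using the chat completions style payload described next (the parameter names such as max_tokens and temperature are assumptions consistent with that schema):

payload = {
    "messages": [
        {"role": "user", "content": "Summarize what a foundation model is in two sentences."}
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}

response = predictor.predict(payload)
print(response)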
An important thing to note here is that we’re using the djl-lmi v12 inference container, so we’re following the large model inference chat completions API schema when sending a payload to both Mistral-NeMo-Base-2407 and Mistral-NeMo-Instruct-2407.
Mistral-NeMo-Base-2407
You can interact with the Mistral-NeMo-Base-2407 model like other standard text generation models, where the model processes an input sequence and outputs predicted next words in the sequence. In this section, we provide some example prompts and sample output. Keep in mind that the base model is not instruction fine-tuned.
Text completion
Tasks involving predicting the next token or filling in missing tokens in a sequence:
The following is the output:
Mistral NeMo Instruct
The Mistral-NeMo-Instruct-2407 model is a quick demonstration that the base model can be fine-tuned to achieve compelling performance. You can follow the steps provided to deploy the model and use the model_id value of huggingface-llm-mistral-nemo-instruct-2407 instead.
The instruction-tuned NeMo model can be tested with the following tasks:
Code generation
Mistral NeMo Instruct demonstrates benchmarked strengths for coding tasks. Mistral states that their Tekken tokenizer for NeMo is approximately 30% more efficient at compressing source code. For example, see the following code:
The following is the output:
The model demonstrates strong performance on code generation tasks, with the completion_tokens offering insight into how the tokenizer’s code compression effectively optimizes the representation of programming languages using fewer tokens.
Advanced math and reasoning
The model also reports strengths in mathematical and reasoning accuracy. For example, see the following code:
The following is the output:
In this task, let’s test Mistral’s new Tekken tokenizer. Mistral states that the tokenizer is two times and three times more efficient at compressing Korean and Arabic, respectively.
Here, we use some text for translation:
We set our prompt to instruct the model on the translation to Korean and Arabic:
We can then set the payload:
The following is the output:
The translation results demonstrate how the number of completion_tokens used is significantly reduced, even for tasks that are typically token-intensive, such as translations involving languages like Korean and Arabic. This improvement is made possible by the optimizations provided by the Tekken tokenizer. Such a reduction is particularly valuable for token-heavy applications, including summarization, language generation, and multi-turn conversations. By enhancing token efficiency, the Tekken tokenizer allows for more tasks to be handled within the same resource constraints, making it an invaluable tool for optimizing workflows where token usage directly impacts performance and cost.
Clean up
After you’re done running the notebook, make sure to delete all resources that you created in the process to avoid additional billing. Use the following code:
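A minimal sketch, assuming the predictor object from the deployment step is still in scope:

# Delete the model and endpoint created for Mistral NeMo
predictor.delete_model()
predictor.delete_endpoint()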
Conclusion
In this post, we showed you how to get started with Mistral NeMo Base and Instruct in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.
For more Mistral resources on AWS, check out the Mistral-on-AWS GitHub repository.
About the authors
Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics.
Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.
Shane Rai is a Principal Generative AI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using the breadth of cloud-based AI/ML services provided by AWS, including model offerings from top tier foundation model providers.
Advancing AI trust with new responsible AI tools, capabilities, and resources
As generative AI continues to drive innovation across industries and our daily lives, the need for responsible AI has become increasingly important. At AWS, we believe the long-term success of AI depends on the ability to inspire trust among users, customers, and society. This belief is at the heart of our long-standing commitment to building and using AI responsibly. Responsible AI goes beyond mitigating risks and aligning to relevant standards and regulations. It’s about proactively building trust and unlocking AI’s potential to drive business value. A comprehensive approach to responsible AI empowers organizations to innovate boldly and achieve transformative business outcomes. New joint research conducted by Accenture and AWS underscores this, highlighting responsible AI as a key driver of business value — boosting product quality, operational efficiency, customer loyalty, brand perception, and more. Nearly half of the surveyed companies acknowledge responsible AI as pivotal in driving AI-related revenue growth. Why? Responsible AI builds trust, and trust accelerates adoption and innovation.
With trust as a cornerstone of AI adoption, we are excited to announce at AWS re:Invent 2024 new responsible AI tools, capabilities, and resources that enhance the safety, security, and transparency of our AI services and models and help support customers’ own responsible AI journeys.
Taking proactive steps to manage AI risks and foster trust and interoperability
AWS is the first major cloud service provider to announce ISO/IEC 42001 accredited certification for AI services, covering Amazon Bedrock, Amazon Q Business, Amazon Textract, and Amazon Transcribe. ISO/IEC 42001 is an international management system standard that outlines the requirements for organizations to manage AI systems responsibly throughout their lifecycle. Technical standards, such as ISO/IEC 42001, are significant because they provide a common framework for responsible AI development and deployment, fostering trust and interoperability in an increasingly global and AI-driven technological landscape. Achieving ISO/IEC 42001 certification means that an independent third party has validated that AWS is taking proactive steps to manage risks and opportunities associated with AI development, deployment, and operation. With this certification, we reinforce our commitments to providing AI services that help you innovate responsibly with AI.
Expanding safeguards in Amazon Bedrock Guardrails to improve transparency and safety
In April 2024, we announced the general availability of Amazon Bedrock Guardrails, which makes it easier to apply safety and responsible AI checks for your gen AI applications. Amazon Bedrock Guardrails delivers industry-leading safety protections by blocking up to 85% more harmful content on top of native protections provided by foundation models (FMs) and filtering over 75% of hallucinated responses from models using contextual grounding checks for Retrieval Augmented Generation (RAG) and summarization use cases. The ability to implement these safeguards was a big step forward in building trust in AI systems. Despite the advancements in FMs, models can still produce hallucinations—a challenge many of our customers face. For use cases where accuracy is critical, customers need the use of mathematically sound techniques and explainable reasoning to help generate accurate FM responses.
To address this need, we are adding new safeguards to Amazon Bedrock Guardrails to help prevent factual errors due to FM hallucinations and offer verifiable proofs. With the launch of the Automated Reasoning checks in Amazon Bedrock Guardrails (preview), AWS becomes the first and only major cloud provider to integrate automated reasoning in our generative AI offerings. Automated Reasoning checks help prevent factual errors from hallucinations using sound mathematical, logic-based algorithmic verification and reasoning processes to verify the information generated by a model, so outputs align with provided facts and aren’t based on hallucinated or inconsistent data. Used alongside other techniques such as prompt engineering, RAG, and contextual grounding checks, Automated Reasoning checks add a more rigorous and verifiable approach to enhancing the accuracy of LLM-generated outputs. Encoding your domain knowledge into structured policies helps your conversational AI applications provide reliable and trustworthy information to your users.
Click on the image below to see a demo of Automated Reasoning checks in Amazon Bedrock Guardrails.
As organizations increasingly use applications with multimodal data to drive business value, improve decision-making, and enhance customer experiences, the need for content filters extends beyond text. Amazon Bedrock Guardrails now supports multimodal toxicity detection (in preview) with support for image content, helping organizations to detect and filter undesirable and potentially harmful image content while retaining safe and relevant visuals. Multimodal toxicity detection helps remove the heavy lifting required to build your own safeguards for image data or invest time in manual evaluation that can be error-prone and tedious. Amazon Bedrock Guardrails helps you to responsibly create AI applications, helping build trust with your users.
Improving generative AI application responses and quality with new Amazon Bedrock evaluation capabilities
With more general-purpose FMs to choose from, organizations now have a wide range of options to power their generative AI applications. However, selecting the optimal model for a specific use case requires efficiently comparing models based on an organization’s preferred quality and responsible AI metrics. While evaluation is an important part of building trust and transparency, it demands substantial time, expertise, and resources for every new use case, making it challenging to choose the model that delivers the most accurate and safe customer experience. Amazon Bedrock Evaluations addresses this by helping you evaluate, compare, and select the best FMs for your use case. You can now use an LLM-as-a-judge (in preview) for model evaluations to perform tests and evaluate other models with human-like quality on your dataset. You can choose from LLMs hosted on Amazon Bedrock to be the judge, with a variety of quality and responsible AI metrics such as correctness, completeness, and harmfulness. You can also bring your own prompt dataset to customize the evaluation with your data, and compare results across evaluation jobs to make decisions faster. Previously, you had a choice between human-based model evaluation and automatic evaluation with exact string matching and other traditional natural language processing (NLP) metrics. These methods, though fast, didn’t provide a strong correlation with human evaluators. Now, with LLM-as-a-judge, you can get human-like evaluation quality at a much lower cost than full human-based evaluations while saving up to weeks of time. Many organizations still want the final assessment to be from expert human annotators. For this, Amazon Bedrock still offers full human-based evaluations with an option to bring your own workforce or have AWS manage your custom evaluation.
To equip FMs with up-to-date and proprietary information, organizations use RAG, a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. However, evaluating and optimizing RAG applications can be challenging due to the complexity of optimizing retrieval and generation components. To address this, we’ve introduced RAG evaluation support in Amazon Bedrock Knowledge Bases (in preview). This new evaluation capability now allows you to assess and optimize RAG applications conveniently and quickly, right where your data and LLMs already reside. Powered by LLM-as-a-judge technology, RAG evaluations offer a choice of several judge models and metrics, such as context relevance, context coverage, correctness, and faithfulness (hallucination detection). This seamless integration promotes regular assessments, fostering a culture of continuous improvement and transparency in AI application development. By saving both cost and time compared to human-based evaluations, these tools empower organizations to enhance their AI applications, building trust through consistent improvement.
The model and RAG evaluation capabilities both provide natural language explanations for each score in the output file and on the AWS Management Console. The scores are normalized from 0 to 1 for ease of interpretability. Rubrics are published in full with the judge prompts in the documentation so non-scientists can understand how scores are derived. To learn more about model and RAG evaluation capabilities, see News blog.
Introducing Amazon Nova, built with responsible AI at the core
Amazon Nova is a new generation of state-of-the-art FMs that deliver frontier intelligence and industry leading price-performance. Amazon Nova FMs incorporate built-in safeguards to detect and remove harmful content from data, rejecting inappropriate user inputs, and filtering model outputs. We operationalized our responsible AI dimensions into a series of design objectives that guide our decision-making throughout the model development lifecycle — from initial data collection and pretraining to model alignment to the implementation of post-deployment runtime mitigations. Amazon Nova Canvas and Amazon Nova Reel come with controls to support safety, security, and IP needs with responsible AI. This includes watermarking, content moderation, and C2PA support (available in Amazon Nova Canvas) to add metadata by default to generated images. Amazon’s safety measures to combat the spread of misinformation, child sexual abuse material (CSAM), and chemical, biological, radiological, or nuclear (CBRN) risks also extend to Amazon Nova models. For more information on how Amazon Nova was built responsibly, read the Amazon Science blog.
Enhancing transparency with new resources to advance responsible generative AI
At re:Invent 2024, we announced the availability of new AWS AI Service Cards for Amazon Nova Reel, Amazon Nova Canvas, Amazon Nova Micro, Lite, and Pro, Amazon Titan Image Generator, and Amazon Titan Text Embeddings to increase transparency of Amazon FMs. These cards provide comprehensive information on the intended use cases, limitations, responsible AI design choices, and best practices for deployment and performance optimization. A key component of Amazon’s responsible AI documentation, AI Service Cards offer customers and the broader AI community a centralized resource to understand the development process we undertake to build our services in a responsible way that addresses fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency. As generative AI continues to grow and evolve, transparency on how technology is developed, tested, and used will be a vital component to earn the trust of organizations and their customers alike. You can explore all 16 AI Service Cards on Responsible AI Tools and Resources.
We also updated the AWS Responsible Use of AI Guide. This document offers considerations for designing, developing, deploying, and operating AI systems responsibly, based on our extensive learnings and experience in AI. It was written with a set of diverse AI stakeholders and perspectives in mind—including, but not limited to, builders, decision-makers, and end-users. At AWS, we are committed to continuing to bring transparency resources like these to the broader community—and to iterate and gather feedback on the best ways forward.
Delivering breakthrough innovation with trust at the forefront
At AWS, we’re dedicated to fostering trust in AI, empowering organizations of all sizes to build and use AI effectively and responsibly. We are excited about the responsible AI innovations announced at re:Invent this week. From new safeguards and evaluation techniques in Amazon Bedrock to state-of-the-art Amazon Nova FMs to fostering trust and transparency with ISO/IEC 42001 certification and new AWS AI Service Cards, you have more tools, resources and built-in protections to help you innovate responsibly and unlock value with generative AI.
We encourage you to explore these new tools and resources:
- AWS achieves ISO/IEC 42001 AI Management System accredited certification
- Prevent factual errors from LLM hallucinations with mathematically sound Automated Reasoning checks (preview)
- Amazon Bedrock Guardrails supports multimodal toxicity detection with image support
- New RAG evaluation and LLM-as-a-judge capabilities in Amazon Bedrock
- Amazon Nova and our commitment to responsible AI
- Responsible AI at AWS website
- AWS AI Service Cards
- AWS Responsible Use of AI Guide
About the author
Dr. Baskar Sridharan is the Vice President for AI/ML and Data Services & Infrastructure, where he oversees the strategic direction and development of key services, including Bedrock, SageMaker, and essential data platforms like EMR, Athena, and Glue.
Deploy RAG applications on Amazon SageMaker JumpStart using FAISS
Generative AI has empowered customers with their own information in unprecedented ways, reshaping interactions across various industries by enabling intuitive and personalized experiences. This transformation is significantly enhanced by Retrieval Augmented Generation (RAG), which is a generative AI pattern where the large language model (LLM) being used references a knowledge corpus outside of its training data to generate a response. RAG has become a popular choice to improve performance of generative AI applications by taking advantage of additional information in the knowledge corpus to augment an LLM. Customers often prefer RAG for optimizing generative AI output over other techniques like fine-tuning due to cost benefits and quicker iteration.
In this post, we show how to build a RAG application on Amazon SageMaker JumpStart using Facebook AI Similarity Search (FAISS).
RAG applications on AWS
RAG models have proven useful for grounding language generation in external knowledge sources. By retrieving relevant information from a knowledge base or document collection, RAG models can produce responses that are more factual, coherent, and relevant to the user’s query. This can be particularly valuable in applications like question answering, dialogue systems, and content generation, where incorporating external knowledge is crucial for providing accurate and informative outputs.
Additionally, RAG has shown promise for improving understanding of internal company documents and reports. By retrieving relevant context from a corporate knowledge base, RAG models can assist with tasks like summarization, information extraction, and question answering on complex, domain-specific documents. This can help employees quickly find important information and insights buried within large volumes of internal materials.
A RAG workflow typically has four components: the input prompt, document retrieval, contextual generation, and output. A workflow begins with a user providing an input prompt, which is searched in a large knowledge corpus, and the most relevant documents are returned. These returned documents along with the original query are then fed into the LLM, which uses the additional conditional context to produce a more accurate output to users. RAG has become a popular technique to optimize generative AI applications because it uses external data that can be frequently modified to dynamically retrieve user output without the need to retrain a model, which is both costly and compute intensive.
The next component in this pattern that we have chosen is SageMaker JumpStart. It provides significant advantages for building and deploying generative AI applications, including access to a wide range of pre-trained models with prepackaged artifacts, ease of use through a user-friendly interface, and scalability with seamless integration to the broader AWS ecosystem. By using pre-trained models and optimized hardware, SageMaker JumpStart allows you to quickly deploy both LLMs and embeddings models without spending too much time on configurations for scalability.
Solution overview
To implement our RAG workflow on SageMaker JumpStart, we use a popular open source Python library known as LangChain. Using LangChain, the RAG components are simplified into independent blocks that you can bring together using a chain object that will encapsulate the entire workflow. Let’s review these different components and how we bring them together:
- LLM (inference) – We need an LLM that will do the actual inference and answer our end-user’s initial prompt. For our use case, we use Meta Llama 3 for this component. LangChain comes with a default wrapper class for SageMaker endpoints that allows you to simply pass in the endpoint name to define an LLM object in the library.
- Embeddings model – We need an embeddings model to convert our document corpus into textual embeddings. This is necessary for when we are doing a similarity search on the input text to see what documents share similarities and possess the knowledge to help augment our response. For this example, we use the BGE Hugging Face embeddings model available through SageMaker JumpStart.
- Vector store and retriever – To house the different embeddings we have generated, we use a vector store. In this case, we use FAISS, which allows for similarity search as well. Within our chain object, we define the vector store as the retriever. You can tune this depending on how many documents you want to retrieve. Other vector store options include Amazon OpenSearch Service as you scale your experiments.
The following architecture diagram illustrates how you can use a vector index such as FAISS as a knowledge base and embeddings store.
Standalone vector indexes like FAISS can significantly improve the search and retrieval of vector embeddings, but they lack capabilities that exist in any database. The following is an overview of the primary benefits to using a vector index for RAG workflows:
- Efficiency and speed – Vector indexes are highly optimized for fast, memory-efficient similarity search. Because vector databases are built on top of vector indexes, there are additional features that typically contribute additional latency. To build a highly efficient and low-latency RAG workflow, you can use a vector index (such as FAISS) deployed on a single machine with GPU acceleration.
- Simplified deployment and maintenance – Because vector indexes don’t require the effort of spinning up and maintaining a database instance, they’re a great option to quickly deploy a RAG workflow if continuous updates, high concurrency, or distributed storage aren’t a requirement.
- Control and customization – Vector indexes offer granular control over parameters, the index type, and performance trade-offs, letting you optimize for exact or approximate searches based on the RAG use case.
- Memory efficiency – You can tune a vector index to minimize memory usage, especially when using data compression techniques such as quantization. This is advantageous in scenarios where memory is limited and high scalability is required so that more data can be stored in memory on a single machine.
In short, a vector index like FAISS is advantageous when trying to maximize speed, control, and efficiency with minimal infrastructure components and stable data.
In the following sections, we walk through the following notebook, which implements FAISS as the vector store in the RAG solution. In this notebook, we use several years of Amazon’s Letter to Shareholders as a text corpus and perform Q&A on the letters. We use this notebook to demonstrate advanced RAG techniques with Meta Llama 3 8B on SageMaker JumpStart using the FAISS embedding store.
We explore the code using the simple LangChain vector store wrapper, RetrievalQA, and ParentDocumentRetriever. RetrievalQA is more advanced than a LangChain vector store wrapper and offers more customizations. ParentDocumentRetriever helps with advanced RAG options like invocation of parent documents for response generation, which enriches the LLM’s outputs with a layered and thorough context. We will see how the responses progressively get better as we move from simple to advanced RAG techniques.
Prerequisites
To run this notebook, you need access to an ml.t3.medium instance.
To deploy the endpoints for Meta Llama 3 8B model inference, you need the following:
- At least one ml.g5.12xlarge instance for Meta Llama 3 endpoint usage
- At least one ml.g5.2xlarge instance for embedding endpoint usage
Additionally, you may need to request a Service Quota increase.
Set up the notebook
Complete the following steps to create a SageMaker notebook instance (you can also use Amazon SageMaker Studio with JupyterLab):
- On the SageMaker console, choose Notebooks in the navigation pane.
- Choose Create notebook instance.
- For Notebook instance type, choose t3.medium.
- Under Additional configuration, for Volume size in GB, enter 50 GB.
This configuration might need to change depending on the RAG solution you are working with and the amount of data you will have on the file system itself.
- For IAM role, choose Create a new role.
- Create an AWS Identity and Access Management (IAM) role with SageMaker full access and any other service-related policies that are necessary for your operations.
- Expand the Git repositories section and for Git repository URL, enter https://github.com/aws-samples/sagemaker-genai-hosting-examples.git.
- Accept defaults for the rest of the configurations and choose Create notebook instance.
- Wait for the notebook to be InService and then choose the Open JupyterLab link to launch JupyterLab.
- Open genai-recipes/RAG-recipes/llama3-rag-langchain-smjs.ipynb to work through the notebook.
Deploy the model
Before you start building the end-to-end RAG workflow, it’s necessary to deploy the LLM and embeddings model of your choice. SageMaker JumpStart simplifies this process because the model artifacts, data, and container specifications are all pre-packaged for optimal inference. These are then exposed using SageMaker Python SDK high-level API calls, which let you specify the model ID for deployment to a SageMaker real-time endpoint:
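The following is a sketch of those calls; the model IDs and instance types are assumptions, so verify the exact identifiers in the SageMaker JumpStart catalog before deploying:

from sagemaker.jumpstart.model import JumpStartModel

# Deploy the Meta Llama 3 8B Instruct model (model ID is an assumption; verify it in JumpStart)
llm_model = JumpStartModel(model_id="meta-textgeneration-llama-3-8b-instruct")
llm_predictor = llm_model.deploy(accept_eula=True, instance_type="ml.g5.12xlarge")

# Deploy the BGE embeddings model (model ID is an assumption; verify it in JumpStart)
embedding_model = JumpStartModel(model_id="huggingface-sentencesimilarity-bge-large-en-v1-5")
embedding_predictor = embedding_model.deploy(instance_type="ml.g5.2xlarge")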
LangChain comes with built-in support for SageMaker JumpStart and endpoint-based models, so you can encapsulate the endpoints with these constructs so they can later be fit into the encompassing RAG chain:
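Below is a sketch using the LangChain community integrations; the request and response keys in the content handlers vary by model container, so treat them as assumptions to adjust for your endpoints:

import json

from langchain_community.embeddings import SagemakerEndpointEmbeddings
from langchain_community.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain_community.llms import SagemakerEndpoint
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler

class LlamaContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt, model_kwargs):
        return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

    def transform_output(self, output):
        response = json.loads(output.read().decode("utf-8"))
        # The response key varies by container; adjust if your endpoint returns a different structure
        return response["generated_text"]

class BgeContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, inputs, model_kwargs):
        return json.dumps({"text_inputs": inputs, **model_kwargs}).encode("utf-8")

    def transform_output(self, output):
        response = json.loads(output.read().decode("utf-8"))
        # Expected to be a list of embedding vectors, one per input text
        return response["embedding"]

llm = SagemakerEndpoint(
    endpoint_name=llm_predictor.endpoint_name,
    region_name="us-east-1",  # replace with your Region
    model_kwargs={"max_new_tokens": 512, "temperature": 0.1},
    content_handler=LlamaContentHandler(),
)

embeddings = SagemakerEndpointEmbeddings(
    endpoint_name=embedding_predictor.endpoint_name,
    region_name="us-east-1",  # replace with your Region
    content_handler=BgeContentHandler(),
)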
After you have set up the models, you can focus on the data preparation and setup of the FAISS vector store.
Data preparation and vector store setup
For this RAG use case, we take public documents of Amazon’s Letter to Shareholders as the text corpus and document source that we will be working with:
LangChain comes with built-in processing for PDF documents, and you can use this to load the data from the text corpus. You can also tune or iterate over parameters such as chunk size depending on the documents that you’re working with for your use case.
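As a sketch (the file names below are hypothetical placeholders for the downloaded letters, and the chunking parameters are starting points to tune):

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical local file names for the downloaded shareholder letters
pdf_files = ["AMZN-2021-Shareholder-Letter.pdf", "AMZN-2022-Shareholder-Letter.pdf"]

documents = []
for pdf_file in pdf_files:
    documents.extend(PyPDFLoader(pdf_file).load())

# Chunk size and overlap are reasonable defaults; tune them for your corpus
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)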
You can then combine the documents and embeddings models and point towards FAISS as your vector store. LangChain has widespread support for different LLMs such as SageMaker JumpStart, and also has built-in API calls for integrating with FAISS, which we use in this case:
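A minimal sketch of building the FAISS index from the chunked documents and the embeddings wrapper defined earlier:

from langchain_community.vectorstores import FAISS

# Embed the document chunks and build an in-memory FAISS index
vectorstore = FAISS.from_documents(docs, embeddings)

# Expose the index as a retriever; k controls how many chunks are fetched per query
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})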
You can then make sure the vector store is performing as expected by sending a few sample queries and reviewing the output that is returned:
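For example, with a hypothetical sample question:

query = "How has AWS evolved over the years?"
for doc in vectorstore.similarity_search(query, k=3):
    print(doc.metadata, doc.page_content[:200])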
LangChain inference
Now that you have set up the vector store and models, you can encapsulate this into a singular chain object. In this case, we use a RetrievalQA Chain tailored for RAG applications provided by LangChain. With this chain, you can customize the document fetching process and control parameters such as number of documents to retrieve. We define a prompt template and pass in our retriever as well as these tertiary parameters:
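The following sketch shows one way to assemble the chain; the prompt template wording is an assumption rather than the notebook's exact prompt:

from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know.

{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)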
You can then test some sample inference and trace the relevant source documents that helped answer the query:
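For example, with a hypothetical question:

result = qa_chain.invoke({"query": "What does Amazon say about its approach to innovation?"})
print(result["result"])

# Inspect which source chunks were retrieved to ground the answer
for doc in result["source_documents"]:
    print(doc.metadata.get("source"), doc.metadata.get("page"))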
Optionally, if you want to further augment or enhance your RAG applications for more advanced use cases with larger documents, you can also explore using options such as a parent document retriever chain. Depending on your use case, it’s crucial to identify the different RAG processes and architectures that can optimize your generative AI application.
Clean up
After you have built the RAG application with FAISS as a vector index, make sure to clean up the resources that were used. You can delete the LLM endpoint using the delete_endpoint Boto3 API call. In addition, make sure to stop your SageMaker notebook instance to not incur any further charges.
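A minimal sketch of the endpoint cleanup, assuming the predictor objects from the deployment step are still available:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Delete the LLM and embeddings endpoints created earlier
for endpoint_name in [llm_predictor.endpoint_name, embedding_predictor.endpoint_name]:
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)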
Conclusion
RAG can revolutionize customer interactions across industries by providing personalized and intuitive experiences. RAG’s four-component workflow—input prompt, document retrieval, contextual generation, and output—allows for dynamic, up-to-date responses without the need for costly model retraining. This approach has gained popularity due to its cost-effectiveness and ability to quickly iterate.
In this post, we saw how SageMaker JumpStart has simplified the process of building and deploying generative AI applications, offering pre-trained models, user-friendly interfaces, and seamless scalability within the AWS ecosystem. We also saw how using FAISS as a vector index can enable quick retrieval from a large corpus of information, while keeping costs and operational overhead low.
To learn more about RAG on SageMaker, see Retrieval Augmented Generation, or contact your AWS account team to discuss your use cases.
About the Authors
Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.
Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on SageMaker. In his spare time, he loves traveling and writing.
Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Harish Rao is a Senior Solutions Architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.
Ankith Ede is a Solutions Architect at Amazon Web Services based in New York City. He specializes in helping customers build cutting-edge generative AI, machine learning, and data analytics-based solutions for AWS startups. He is passionate about helping customers build scalable and secure cloud-based solutions.
Sid Rampally is a Customer Solutions Manager at AWS, driving generative AI acceleration for life sciences customers. He writes about topics relevant to his customers, focusing on data engineering and machine learning. In his spare time, Sid enjoys walking his dog in Central Park and playing hockey.
Speed up your cluster procurement time with Amazon SageMaker HyperPod training plans
Today, organizations are constantly seeking ways to use advanced large language models (LLMs) for their specific needs. These organizations are engaging in both pre-training and fine-tuning massive LLMs, with parameter counts in the billions. This process aims to enhance model efficacy for a wide array of applications across diverse sectors, including healthcare, financial services, and marketing. However, customizing these larger models requires access to the latest and accelerated compute resources.
In this post, we demonstrate how you can address this requirement by using Amazon SageMaker HyperPod training plans, which can bring down your training cluster procurement wait time. A training plan provides simple and predictable access to accelerated compute resources (supporting P4d, P5, P5e, P5en, and trn2 as of the time of writing), allowing you to use this compute capacity to run model training on either Amazon SageMaker training jobs or SageMaker HyperPod.
We guide you through a step-by-step implementation of how you can use the AWS Command Line Interface (AWS CLI) or the AWS Management Console to find, review, and create optimal training plans for your specific compute and timeline needs. We further guide you through using the training plan to submit SageMaker training jobs or create SageMaker HyperPod clusters.
You can check out the launch of this new feature in Meet your training timelines and budget with new Amazon SageMaker HyperPod flexible training plans.
Business challenges
As organizations strive to harness the power of LLMs for competitive advantage, they face a significant hurdle: securing sufficient and reliable compute capacity for model training. The scale of these models demands cutting-edge accelerated compute hardware. However, the high cost and limited availability of such resources create a bottleneck for many businesses. This scarcity not only impacts timelines, but also stretches budgets, potentially delaying critical AI initiatives. As a result, organizations are seeking solutions that can provide consistent, scalable, and cost-effective access to high-performance computing resources, enabling them to train and fine-tune LLMs without compromising on speed or quality.
Solution overview
SageMaker HyperPod training plans, a new SageMaker capability, address this challenge by offering you a simple-to-use console UI or AWS CLI experience to search, review, create, and manage training plans.
Capacity provisioned through SageMaker training plans can be used with either SageMaker training jobs or SageMaker HyperPod. If you want to focus on model development rather than infrastructure management and prefer ease of use with a managed experience, SageMaker training jobs are an excellent choice. For organizations requiring granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal solution. To better understand these services and choose the one most appropriate for your use case, refer to Generative AI foundation model training on Amazon SageMaker, which provides detailed information about both options.
The following diagram provides an overview of the main steps involved in requesting capacity using SageMaker training plans for SageMaker training jobs.

Figure 1: The main steps involved in procuring capacity via SageMaker HyperPod training plans. Note: This workflow uses SageMaker training jobs as the target for illustration; you can choose SageMaker HyperPod instead.
At a high level, the steps to create a training plan are as follows:
- Search the training plans that best match your capacity requirements, such as instance type, instance count, start time, and duration. SageMaker finds the optimal plans across one or more segments.
- After reviewing the available training plan offerings, you can reserve the plan that meets your requirements.
- Schedule your SageMaker training jobs by using a training plan with a training-job target resource. Note that we use training-job here for illustration purposes; you can also use hyperpod-cluster as your target resource.
- Describe and list your existing training plans. When the capacity is available, it will be allocated to the scheduled training job.
In the following sections, we shift our focus to the solution walkthrough associated with training plans.
Prerequisites
Complete the following prerequisite steps:
- If you’re using an AWS Identity and Access Management (IAM) user for this solution, make sure that your user has the AmazonSageMakerFullAccess policy attached to it. To learn more about how to attach a policy to an IAM user, see Adding IAM identity permissions (console).
- If you’re setting up the AWS CLI for the first time, follow the instructions at Getting started with the AWS CLI.
- If you choose to use the AWS CLI, make sure you are on the most up-to-date AWS CLI version.
Create a training plan
In this post, we discuss two ways to create a training plan: using the SageMaker console or the AWS CLI.
Create a SageMaker training plan using the SageMaker console
The SageMaker console user experience for creating a training plan is similar for both training jobs and SageMaker HyperPod. In this post, for demonstration purposes, we show how to create a training plan for a SageMaker HyperPod cluster.
- On the SageMaker console, choose Training plans in the navigation pane.
- Create a new training plan.
- For Target, select HyperPod cluster.
- Under Instance attributes, specify your instance type (ml.p5.48xlarge) and instance count (16).
- Under Date settings to search for an available plan, choose your preferred training date and duration (for example, 10 days).
- Choose Find training plan.

Figure 2: You can search for available training plan offerings via the SageMaker console! Choose your target, select your instance type and count, and specify duration.
SageMaker suggests a training plan that is split into two 5-day segments. This includes the total upfront price for the plan as well as the estimated data transfer cost based on the data location you provided.

Figure 3: SageMaker suggests a training plan based on your inputs. In this example, SageMaker suggests a training plan split across two 5-day segments. You will also see the total upfront price.
- Review and purchase your plan.

Figure 4: Once you’re happy with your selection, you can review and purchase your training plan!
After you create the training plan, it appears in your list of training plans. The plan initially enters a Pending state while awaiting payment. After the payment is processed, the plan transitions to the Scheduled state. At this point, you can begin queuing jobs or creating clusters using the plan. On the plan’s start date, it becomes Active and resources are allocated. Your training tasks can then start running (pending resource availability).
Pay for the training plan through the AWS Billing and Cost Management console so that it shows up on your SageMaker console; you will receive an invoice that must be settled before you can proceed.

Figure 5: You can list out your training plans on the SageMaker console. You can start using your plan once it transitions to the Active state.
Create a SageMaker training plan using the AWS CLI
Complete the following steps to create a training plan using the AWS CLI:
- Start by calling the SearchTrainingPlanOfferings API, passing your capacity requirements as input parameters, to search for all matching training plan offerings.
The following example searches for training plan offerings suitable for two ml.p5.48xlarge instances for 96 hours in the us-west-2 Region. In this example, we also filter for the time frame in which we want to use the training plan, and we filter for training plans that can be used for SageMaker HyperPod cluster workloads using the target-resources parameter:
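The following is a minimal AWS CLI sketch of that search. The parameter names follow the API fields named in this post, and the time window, Region, and other values are placeholders for your own requirements; treat it as illustrative rather than a definitive command.

```bash
aws sagemaker search-training-plan-offerings \
  --region us-west-2 \
  --instance-type ml.p5.48xlarge \
  --instance-count 2 \
  --start-time-after "2025-01-01T00:00:00Z" \
  --end-time-before "2025-02-01T00:00:00Z" \
  --duration-hours 96 \
  --target-resources hyperpod-cluster
```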
Each TrainingPlanOffering returned in the response is identified by a unique TrainingPlanOfferingId. The first offering in the list represents the best match for your requirements. In this case, the SageMaker SearchTrainingPlanOfferings API returns a single available TrainingPlanOffering that matches the specified capacity requirements:
Make sure that your SageMaker HyperPod training job subnets are in the same Availability Zone as your training plan.
- After you choose the training plan that best suits your schedule and requirements, you can reserve it by calling the CreateTrainingPlan API as follows:
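A sketch of the reservation call follows; it assumes you exported the chosen offering ID to the shell variable TRAINING_PLAN_OFFERING_ID (a placeholder introduced here) and names the plan p5-training-plan, matching the name used later in this post.

```bash
aws sagemaker create-training-plan \
  --training-plan-name p5-training-plan \
  --training-plan-offering-id "$TRAINING_PLAN_OFFERING_ID" \
  --region us-west-2
```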
You will see an output that looks like the following:
After you create the training plan, you will receive an invoice that you need to pay; you can find it on the AWS Billing and Cost Management console.
- You can list all the training plans created in your AWS account (and Region) by calling the ListTrainingPlans API:
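For example, a minimal sketch:

```bash
aws sagemaker list-training-plans --region us-west-2
```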
This will give you a summary of the training plans in your account. After you have your training plan (the newly created p5-training-plan), you can check its details using either the console or the DescribeTrainingPlan API as follows:
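A sketch of that call, which also captures the plan ARN for later steps; the assumption that the response exposes the ARN as TrainingPlanArn is ours, so verify the field name in your output.

```bash
aws sagemaker describe-training-plan \
  --training-plan-name p5-training-plan \
  --region us-west-2

# Assumed response field name; confirm it in your output before relying on this export
export TRAINING_PLAN_ARN=$(aws sagemaker describe-training-plan \
  --training-plan-name p5-training-plan \
  --region us-west-2 \
  --query 'TrainingPlanArn' --output text)
```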
Use a training plan with SageMaker HyperPod
When your training plan status transitions to Scheduled, you can use it for new instance groups in either a new or existing SageMaker HyperPod cluster. You can use the CreateCluster and UpdateCluster APIs to create a new SageMaker HyperPod cluster with your training plan or to update an existing cluster, respectively. You can also choose to directly use the SageMaker console.
For a given SageMaker HyperPod cluster, training plans are attached at the instance group level, with each instance group using its own plan. If desired, one SageMaker HyperPod cluster can have multiple training plans attached across different instance groups. You can also omit a training plan for an instance group and continue using On-Demand capacity for it as before; however, you can’t mix training plan capacity with On-Demand capacity within the same instance group. In addition, you can choose a partial cluster launch for each instance group, which means that even if all the requested capacity isn’t available, you can still spin up a cluster with the capacity that is already available to you.
A training plan’s active period is the time window during which the TrainingPlanOfferings within it are scheduled to start and stop. Each time a TrainingPlanOffering starts, instance groups will automatically scale up to the specified count, and the instance group TrainingPlanStatus will reflect as Active. When a TrainingPlanOffering is scheduled to stop, your cluster’s instance groups will automatically scale down to zero, and the instance group TrainingPlanStatus will reflect as Expired.
Use a training plan with SageMaker HyperPod on the console
You can choose to either create a new cluster and create an instance group, or edit an existing cluster and edit an existing instance group. In the configuration, choose the same instance type that was chosen for a training plan and specify the desired instance count. The Instance capacity option will appear only when you choose an instance type that is supported for training plans. Choose the dropdown menu to scroll through valid training plans. The available training plan selections are listed by name and are filtered for only those that match the chosen instance type, that have at least the specified instance count, that were created with hyperpod-cluster as the target resource, and that currently have a status of Scheduled or Active. Double-check these conditions if you don’t see an expected training plan name, and make sure that the expected training plan was created in the same account and in the same Region. The default selection is to use no training plan. Repeat the process for each instance group that should have a training plan.

Figure 6: You can create an instance group for a SageMaker HyperPod cluster with the instances in your training plan. Make sure to choose the right training plan listed under “Instance capacity”
Use a training plan with SageMaker HyperPod with the AWS CLI
Complete the following steps to use your training plan with the AWS CLI:
- Create a SageMaker HyperPod cluster from scratch. For instructions, refer to the Amazon SageMaker HyperPod workshop or the Amazon EKS Support in Amazon SageMaker HyperPod workshop.
The following cluster configuration file defines a SageMaker HyperPod SLURM cluster named ml-cluster. The steps for using training plans are the same regardless of whether you choose SLURM or Amazon Elastic Kubernetes Service (Amazon EKS) as the orchestrator. This cluster contains an instance group named controller-machine with 1 ml.m5.12xlarge instance as the head node of the SLURM cluster; the controller-machine instance group does not use a training plan. We also define a worker instance group named worker-group-1 that specifies 2 ml.p5.48xlarge instances, which will be sourced from your training plan. Note the line "TrainingPlanArn"; this is where you specify your training plan by its full Amazon Resource Name (ARN). If you followed the steps in the prior sections, this should be the value of the environment variable TRAINING_PLAN_ARN. The following cluster configuration also skips some configuration parameters, such as VPCConfig and InstanceStorageConfig. Refer to the workshop for a complete SageMaker HyperPod cluster configuration file.
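The following abbreviated cluster-config.json is a sketch based on the description above, not the complete file from the workshop; the S3 lifecycle script location and execution role ARN are placeholders you must replace, and the networking and storage sections are omitted.

```json
{
  "ClusterName": "ml-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.m5.12xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<your-bucket>/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "<your-sagemaker-execution-role-arn>",
      "ThreadsPerCore": 1
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.p5.48xlarge",
      "InstanceCount": 2,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<your-bucket>/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "<your-sagemaker-execution-role-arn>",
      "ThreadsPerCore": 1,
      "TrainingPlanArn": "<value of TRAINING_PLAN_ARN>"
    }
  ]
}
```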
You can then create the cluster using the following code:
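For example, assuming the file above is saved as cluster-config.json:

```bash
aws sagemaker create-cluster --cli-input-json file://cluster-config.json --region us-west-2
```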
These next steps assume that you already have a SageMaker HyperPod cluster created. This section is relevant if you’d like to add an instance group that uses your training plan reserved instances to your existing cluster.
- To update an existing cluster, you can define another file called update-cluster-config.json as follows. If you followed the instructions in the workshop to provision the cluster, you can use the provided create_config.sh script to get the values for your env_vars before sourcing them.
In this file, we define an additional worker group named worker-group-2 consisting of 2 ml.p5.48xlarge instances. Again, notice the line "TrainingPlanArn"; this is where you specify your training plan by its full ARN.
Make sure that you also update provisioning_parameters.json and upload the updated file to your S3 bucket for SageMaker to use while provisioning the new worker group:
- Because this file is uploaded to Amazon Simple Storage Service (Amazon S3) for SageMaker to use while provisioning your cluster, you need to first copy that file over from Amazon S3:
aws s3 cp s3://${BUCKET}/src/provisioning_parameters.json provisioning_parameters.json
- Assuming your existing cluster has a controller machine group and a worker group with an ml.g5.48xlarge instance, add the entries for the new worker group to your existing provisioning_parameters.json file:
This step adds the new worker group that you just created, which consists of your 2 ml.p5.48xlarge nodes from your training plan.
- Now you can re-upload the updated provisioning_parameters.json file to Amazon S3:
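A sketch of that upload, reusing the BUCKET variable from the earlier copy command:

```bash
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/provisioning_parameters.json
```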
- Now, with both cluster-config.json (now update-cluster-config.json) and provisioning_parameters.json updated, you can add the training plan nodes to the cluster:
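A minimal sketch of that update:

```bash
aws sagemaker update-cluster --cli-input-json file://update-cluster-config.json --region us-west-2
```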
Use a training plan with a SageMaker training job
SageMaker training jobs offer two primary methods for execution: an AWS CLI command and the Python SDK. The AWS CLI approach provides direct control and is ideal for scripting, allowing you to create training jobs with a single command. The Python SDK offers a more programmatic interface, enabling seamless integration with existing Python workflows and using the high-level features in SageMaker. In this section, we look at how you can use a training plan with both options.
Run a training job on a training plan using the AWS CLI
The following example demonstrates how to create a SageMaker training job and associate it with a provided training plan using the CapacityScheduleConfig attribute in the create-training-job AWS CLI command:
After creating the training job, you can verify that it was properly assigned to the training plan by calling the DescribeTrainingJob API:
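For example, the following sketch describes the job (the job name is a placeholder); inspect the response for the training plan association, because the exact response field name isn't reproduced here.

```bash
aws sagemaker describe-training-job --training-job-name <your-training-job-name>
```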
Run a training job on a training plan using the SageMaker Python SDK
The following example demonstrates how to create a SageMaker training job using the SageMaker Python SDK’s Training estimator and associate it with a provided training plan through the capacity_schedules attribute on the estimator object.
For more information on the SageMaker estimator, see Use a SageMaker estimator to run a training job.
Make sure you are using the latest version of the SageMaker Python SDK.
After creating the training job, you can verify that it was properly assigned to the training plan by calling the DescribeTrainingJob API:
Clean up
To clean up your resources to avoid incurring more charges, complete the following steps:
- Delete the SageMaker HyperPod cluster and associated resources such as storage, VPC, and IAM roles.
- Delete any S3 buckets created.
- Make sure that the training plan created is used and completes the fulfillment lifecycle.
Conclusion
SageMaker training plans represent a significant leap forward in addressing the compute capacity challenges faced by organizations working with LLMs. By providing quick access to high-performance GPU resources, training plans streamline the process of model training and fine-tuning. This solution not only reduces wait times for cluster provisioning, but also offers flexibility in choosing between SageMaker training jobs and SageMaker HyperPod, catering to diverse organizational needs. Ultimately, SageMaker training plans empower businesses to overcome resource constraints and accelerate their AI initiatives, leading to more efficient and effective usage of advanced language models across various industries.
To get started with a SageMaker training plan and explore its capabilities for your specific LLM training needs, refer to Reserve capacity with training plans and try out the step-by-step implementation guide provided in this post.
Special thanks to Fei Ge, Oscar Hsu, Takuma Yoshitani, and Yiting Li for their support in the launch of this post.
About the Authors
Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.
Kanwaljit Khurmi is an AI/ML Principal Solutions Architect at Amazon Web Services. He works with AWS product teams, engineering, and customers to provide guidance and technical assistance for improving the value of their hybrid ML solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.
Sean Smith is a Sr Specialist Solution Architect at AWS for HPC and generative AI. Prior to that, Sean worked as a Software Engineer on AWS Batch and CfnCluster, becoming the first engineer on the team that created AWS ParallelCluster.
Ty Bergstrom is a Software Engineer at Amazon Web Services. He works on the Hyperpod Clusters platform for Amazon SageMaker.
Amazon Bedrock Marketplace now includes NVIDIA models: Introducing NVIDIA Nemotron-4 NIM microservices
This post is co-written with Abhishek Sawarkar, Eliuth Triana, Jiahong Liu and Kshitiz Gupta from NVIDIA.
At AWS re:Invent 2024, we are excited to introduce Amazon Bedrock Marketplace. This is a new capability within Amazon Bedrock that serves as a centralized hub for discovering, testing, and implementing foundation models (FMs). It provides developers and organizations access to an extensive catalog of over 100 popular, emerging, and specialized FMs, complementing the existing selection of industry-leading models in Amazon Bedrock. Bedrock Marketplace enables model subscription and deployment through managed endpoints, all while maintaining the simplicity of the Amazon Bedrock unified APIs.
The NVIDIA Nemotron family, available as NVIDIA NIM microservices, offers a cutting-edge suite of language models now available through Amazon Bedrock Marketplace, marking a significant milestone in AI model accessibility and deployment.
In this post, we discuss the advantages and capabilities of the Bedrock Marketplace and Nemotron models, and how to get started.
About Amazon Bedrock Marketplace
Bedrock Marketplace plays a pivotal role in democratizing access to advanced AI capabilities through several key advantages:
- Comprehensive model selection – Bedrock Marketplace offers an exceptional range of models, from proprietary to publicly available options, allowing organizations to find the perfect fit for their specific use cases.
- Unified and secure experience – By providing a single access point for all models through the Amazon Bedrock APIs, Bedrock Marketplace significantly simplifies the integration process. Organizations can use these models securely, and for models that are compatible with the Amazon Bedrock Converse API, you can use the robust toolkit of Amazon Bedrock, including Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Flows.
- Scalable infrastructure – Bedrock Marketplace offers configurable scalability through managed endpoints, allowing organizations to select their desired number of instances, choose appropriate instance types, define custom auto scaling policies that dynamically adjust to workload demands, and optimize costs while maintaining performance.
About the NVIDIA Nemotron model family
At the forefront of the NVIDIA Nemotron model family is Nemotron-4, which NVIDIA describes as a powerful multilingual large language model (LLM) trained on an impressive 8 trillion text tokens and specifically optimized for English, multilingual, and coding tasks. Key capabilities include:
- Synthetic data generation – Able to create high-quality, domain-specific training data at scale
- Multilingual support – Trained on extensive text corpora, supporting multiple languages and tasks
- High-performance inference – Optimized for efficient deployment on GPU-accelerated infrastructure
- Versatile model sizes – Includes variants like the Nemotron-4 15B with 15 billion parameters
- Open license – Offers a uniquely permissive open model license that gives enterprises a scalable way to generate and own synthetic data that can help build powerful LLMs
The Nemotron models offer transformative potential for AI developers by addressing critical challenges in AI development:
- Data augmentation – Solve data scarcity problems by generating synthetic, high-quality training datasets
- Cost-efficiency – Reduce manual data annotation costs and time-consuming data collection processes
- Model training enhancement – Improve AI model performance through high-quality synthetic data generation
- Flexible integration – Support seamless integration with existing AWS services and workflows, enabling developers to build sophisticated AI solutions more rapidly
These capabilities make Nemotron models particularly well-suited for organizations looking to accelerate their AI initiatives while maintaining high standards of performance and security.
Getting started with Bedrock Marketplace and Nemotron
To get started with Amazon Bedrock Marketplace, open the Amazon Bedrock console. From there, you can explore the Bedrock Marketplace interface, which offers a comprehensive catalog of FMs from various providers. You can browse through the available options to discover different AI capabilities and specializations. This exploration will lead you to NVIDIA’s model offerings, including Nemotron-4.
We walk you through these steps in the following sections.
Open Amazon Bedrock Marketplace
Navigating to Amazon Bedrock Marketplace is straightforward:
- On the Amazon Bedrock console, choose Model catalog in the navigation pane.
- Under Filters, select Bedrock Marketplace.
Upon entering Bedrock Marketplace, you’ll find a well-organized interface with various categories and filters to help you find the right model for your needs. You can browse by providers and modality.
- Use the search function to quickly locate specific providers, and explore models cataloged in Bedrock Marketplace.
Deploy NVIDIA Nemotron models
After you’ve located NVIDIA’s model offerings in Bedrock Marketplace, you can narrow down to the Nemotron model. To subscribe to and deploy Nemotron-4, complete the following steps:
- Filter by Nemotron under Providers or search by model name.
- Choose from the available models, such as Nemotron-4 15B.
On the model details page, you can examine its specifications, capabilities, and pricing details. The Nemotron-4 model offers impressive multilingual and coding capabilities.
- Choose View subscription options to subscribe to the model.
- Review the available options and choose Subscribe.
- Choose Deploy and follow the prompts to configure your deployment options, including instance types and scaling policies.
The process is user-friendly, allowing you to quickly integrate these powerful AI capabilities into your projects using the Amazon Bedrock APIs.
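As an illustrative sketch only, and assuming the deployed Nemotron endpoint is compatible with the Amazon Bedrock Converse API, you could call the Marketplace deployment by passing its endpoint ARN (a placeholder below) as the model ID:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder ARN: copy the real value from your Bedrock Marketplace deployment details
endpoint_arn = "arn:aws:sagemaker:us-east-1:111122223333:endpoint/nemotron-4-15b-endpoint"

response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[{"role": "user", "content": [{"text": "Generate three synthetic customer support questions about billing."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```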
Conclusion
The launch of NVIDIA Nemotron models on Amazon Bedrock Marketplace marks a significant milestone in making advanced AI capabilities more accessible to developers and organizations. Nemotron-4 15B, with its impressive 15-billion-parameter architecture trained on 8 trillion text tokens, brings powerful multilingual and coding capabilities to Amazon Bedrock.
Through Bedrock Marketplace, organizations can use Nemotron’s advanced capabilities while benefiting from the scalable infrastructure of AWS and NVIDIA’s robust technologies. We encourage you to start exploring the capabilities of NVIDIA Nemotron models today through Amazon Bedrock Marketplace, and experience firsthand how this powerful language model can transform your AI applications.
About the authors
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within Cloud platforms & enhancing user experience on accelerated computing.
Eliuth Triana is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.
Jiahong Liu is a Solutions Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA-accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.
Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking, and wildlife watching.
Real value, real time: Production AI with Amazon SageMaker and Tecton
This post is cowritten with Isaac Cameron and Alex Gnibus from Tecton.
Businesses are under pressure to show return on investment (ROI) from AI use cases, whether predictive machine learning (ML) or generative AI. Only 54% of ML prototypes make it to production, and only 5% of generative AI use cases do.
ROI isn’t just about getting to production—it’s about model accuracy and performance. You need a scalable, reliable system with high accuracy and low latency for the real-time use cases that directly impact the bottom line every millisecond.
Fraud detection, for example, requires extremely low latency because decisions need to be made in the time it takes to swipe a credit card. With fraud on the rise, more organizations are pushing to implement successful fraud detection systems. US nationwide fraud losses topped $10 billion in 2023, a 14% increase from 2022, and global ecommerce fraud is predicted to exceed $343 billion by 2027.
But building and managing an accurate, reliable AI application that can make a dent in that $343 billion problem is overwhelmingly complex.
ML teams often start by manually stitching together different infrastructure components. It seems straightforward at first for batch data, but the engineering gets even more complicated when you need to go from batch data to incorporating real-time and streaming data sources, and from batch inference to real-time serving.
Engineers need to build and orchestrate the data pipelines, juggle the different processing needs for each data source, manage the compute infrastructure, build reliable serving infrastructure for inference, and more. Without the capabilities of Tecton, the architecture might look like the following diagram.
Accelerate your AI development and deployment with Amazon SageMaker and Tecton
All that manual complexity gets simplified with Tecton and Amazon SageMaker. Together, Tecton and SageMaker abstract away the engineering needed for production, real-time AI applications. This enables faster time to value, and engineering teams can focus on building new features and use cases instead of struggling to manage the existing infrastructure.
Using SageMaker, you can build, train and deploy ML models. Meanwhile, Tecton makes it straightforward to compute, manage, and retrieve features to power models in SageMaker, both for offline training and online serving. This streamlines the end-to-end feature lifecycle for production-scale use cases, resulting in a simpler architecture, as shown in the following diagram.
How does it work? With Tecton’s simple-to-use declarative framework, you define the transformations for your features in a few lines of code, and Tecton builds the pipelines needed to compute, manage, and serve the features. Tecton takes care of the full deployment into production and online serving.
It doesn’t matter if it’s batch, streaming, or real-time data or whether it’s offline or online serving. It’s one common framework for every data processing need in end-to-end feature production.
This framework creates a central hub for feature management and governance with enterprise feature store capabilities, making it straightforward to observe the data lineage for each feature pipeline, monitor data quality, and reuse features across multiple models and teams.
The following diagram shows the Tecton declarative framework.
The next section examines a fraud detection example to show how Tecton and SageMaker accelerate both training and real-time serving for a production AI system.
Streamline feature development and model training
First, you need to develop the features and train the model. Tecton’s declarative framework makes it simple to define features and generate accurate training data for SageMaker models:
- Experiment and iterate on features in SageMaker notebooks – You can use Tecton’s software development kit (SDK) to interact with Tecton directly through SageMaker notebook instances, enabling flexible experimentation and iteration without leaving the SageMaker environment.
- Orchestrate with Tecton-managed EMR clusters – After features are deployed, Tecton automatically creates the scheduling, provisioning, and orchestration needed for pipelines that can run on Amazon EMR compute engines. You can view and create EMR clusters directly through the SageMaker notebook.
- Generate accurate training data for SageMaker models – For model training, data scientists can use Tecton’s SDK within their SageMaker notebooks to retrieve historical features. The same code is used to backfill the offline store and continually update the online store, reducing training/serving skew.
Next, the features need to be served online for the final model to consume in production.
Serve features with robust, real-time online inference
Tecton’s declarative framework extends to online serving. Tecton’s real-time infrastructure is designed to meet the demands of large-scale production applications and can reliably serve 100,000 requests per second.
For critical ML apps, it’s hard to meet demanding service level agreements (SLAs) in a scalable and cost-efficient manner. Real-time use cases such as fraud detection typically have a p99 latency budget between 100 and 200 milliseconds. That means 99% of requests need to complete in under 200 milliseconds for the end-to-end process, from feature retrieval to model scoring and post-processing.
Feature serving only gets a fraction of that end-to-end latency budget, which means you need your solution to be especially quick. Tecton accommodates these latency requirements by integrating with both disk-based and in-memory data stores, supporting in-memory caching, and serving features for inference through a low-latency REST API, which integrates with SageMaker endpoints.
Now we can complete our fraud detection use case. In a fraud detection system, when someone makes a transaction (such as buying something online), your app might follow these steps:
- It checks with other services to get more information (for example, “Is this merchant known to be risky?”) from third-party APIs
- It pulls important historical data about the user and their behavior (for example, “How often does this person usually spend this much?” or “Have they made purchases from this location before?”), requesting the ML features from Tecton
- It will likely use streaming features to compare the current transaction with recent spending activity over the last few hours or minutes
- It sends all this information to the model hosted on Amazon SageMaker that predicts whether the transaction looks fraudulent.
This process is shown in the following diagram.
Expand to generative AI use cases with your existing AWS and Tecton architecture
After you’ve developed ML features using the Tecton and AWS architecture, you can extend your ML work to generative AI use cases.
For instance, in the fraud detection example, you might want to add an LLM-powered customer support chat that helps a user answer questions about their account. To generate a useful response, the chat would need to reference different data sources, including the unstructured documents in your knowledge base (such as policy documentation about what causes an account suspension) and structured data such as transaction history and real-time account activity.
If you’re using a Retrieval Augmented Generation (RAG) system to provide context to your LLM, you can use your existing ML feature pipelines as context. With Tecton, you can either enrich your prompts with contextual data or provide features as tools to your LLM—all using the same declarative framework.
To choose and customize the model that will best suit your use case, Amazon Bedrock provides a range of pre-trained foundation models (FMs) for inference, or you can use SageMaker for more extensive model building and training.
The following graphic shows how Amazon Bedrock is incorporated to support generative AI capabilities in the fraud detection system architecture.
Build valuable AI apps faster with AWS and Tecton
In this post, we walked through how SageMaker and Tecton enable AI teams to train and deploy a high-performing, real-time AI application—without the complex data engineering work. Tecton combines production ML capabilities with the convenience of doing everything from within SageMaker, whether that’s at the development stage for training models or doing real-time inference in production.
To get started, refer to Getting Started with Amazon SageMaker & Tecton’s Feature Platform, a more detailed guide on how to use Tecton with Amazon SageMaker. And if you can’t wait to try it yourself, check out the Tecton interactive demo and observe a fraud detection use case in action.
You can also find Tecton at AWS re:Invent. Reach out to set up a meeting with experts onsite about your AI engineering needs.
About the Authors
Isaac Cameron is Lead Solutions Architect at Tecton, guiding customers in designing and deploying real-time machine learning applications. Having previously built a custom ML platform from scratch at a major U.S. airline, he brings firsthand experience of the challenges and complexities involved—making him a strong advocate for leveraging modern, managed ML/AI infrastructure.
Alex Gnibus is a technical evangelist at Tecton, making technical concepts accessible and actionable for engineering teams. Through her work educating practitioners, Alex has developed deep expertise in identifying and addressing the practical challenges teams face when productionizing AI systems.
Arnab Sinha is a Senior Solutions Architect at AWS, specializing in designing scalable solutions that drive business outcomes in AI, machine learning, big data, digital transformation, and application modernization. With expertise across industries like energy, healthcare, retail and manufacturing, Arnab holds all AWS Certifications, including the ML Specialty, and has led technology and engineering teams before joining AWS.
Use Amazon Bedrock tooling with Amazon SageMaker JumpStart models
Today, we’re excited to announce a new capability that allows you to deploy over 100 open-weight and proprietary models from Amazon SageMaker JumpStart and register them with Amazon Bedrock, allowing you to seamlessly access them through the powerful Amazon Bedrock APIs. You can now use Amazon Bedrock features such as Amazon Bedrock Knowledge Bases and Amazon Bedrock Guardrails with models deployed through SageMaker JumpStart.
SageMaker JumpStart helps you get started with machine learning (ML) by providing fully customizable solutions and one-click deployment and fine-tuning of more than 400 popular open-weight and proprietary generative AI models. Amazon Bedrock is a fully managed service that provides a single API to access and use various high-performing foundation models (FMs). It also offers a broad set of capabilities to build generative AI applications. The Amazon Bedrock Converse API is a runtime API that provides a consistent interface that works with different models. It allows you to use advanced features in Amazon Bedrock such as the playground, guardrails, and tool use (function calling).
SageMaker JumpStart has long been the go-to service for developers and data scientists seeking to deploy state-of-the-art generative AI models. Through this integration, you can now combine the flexibility of hosting models on SageMaker JumpStart with the fully managed experience of Amazon Bedrock, including advanced security controls, scalable infrastructure, and comprehensive monitoring capabilities.
In this post, we show you how to deploy FMs through SageMaker JumpStart, register them with Amazon Bedrock, and invoke them using Amazon Bedrock APIs.
Solution overview
The Converse API standardizes interaction with Amazon Bedrock FMs, enabling developers to write code one time and use it across various models without needing to adjust for model-specific differences. It supports multi-turn conversations through conversational history as part of the API request, and developers can perform tasks that require access to external APIs through the usage of tools (function calling). Additionally, the Converse API allows you to block inappropriate inputs or generated content by including a guardrail in your API calls. To review the complete list of supported models and model features, refer to Supported models and model features.
This new feature extends the capabilities of the Converse API into a single interface that developers can use to call FMs deployed in SageMaker JumpStart. This allows developers to use the same API to invoke models from Amazon Bedrock and SageMaker JumpStart, streamlining the process to integrate models into their generative AI applications. Now you can build on top of an even larger library of world-class open source and proprietary models through a single API. To view the full list of Bedrock Ready models available from SageMaker JumpStart, refer to the Bedrock Marketplace documentation. You can also use Amazon Bedrock Marketplace to discover and deploy these models to SageMaker endpoints.
In this post, we walk through the following steps:
- Deploy the Gemma 2 9B Instruct model using SageMaker JumpStart.
- Register the model with Amazon Bedrock.
- Test the model with sample prompts using the Amazon Bedrock playground.
- Use the Amazon Bedrock RetrieveAndGenerate API to query the Amazon Bedrock knowledge base.
- Set up Amazon Bedrock Guardrails to help block harmful content and personally identifiable information (PII) data.
- Invoke models with Converse APIs to show an end-to-end Retrieval Augmented Generation (RAG) pipeline.
Prerequisites
You can access and deploy pretrained models from SageMaker JumpStart in the Amazon SageMaker Studio UI. To access SageMaker Studio on the AWS Management Console, you need to set up an Amazon SageMaker domain. SageMaker uses domains to organize user profiles, applications, and their associated resources. To create a domain and set up a user profile, refer to Guide to getting set up with Amazon SageMaker.
You also need an AWS Identity and Access Management (IAM) role with appropriate permissions. To get started with this example, you can use the AmazonSageMakerFullAccess, AmazonBedrockFullAccess, AmazonOpenSearchAccess managed policies to provide the required permissions to SageMaker JumpStart and Amazon Bedrock. For a more scoped down set of permissions, refer to the following:
After applying the relevant permissions, setting up a SageMaker domain, and creating user profiles, you are ready to deploy your SageMaker JumpStart model and register it with Amazon Bedrock.
Deploy a model with SageMaker JumpStart and register it with Amazon Bedrock
This section provides a walkthrough of deploying a model using SageMaker JumpStart and registering it with Amazon Bedrock. In this walkthrough, you will deploy and register the Gemma 2 9B Instruct model offered through Hugging Face in SageMaker JumpStart. Complete the following steps:
- On the SageMaker console, choose Studio in the navigation pane.
- Choose the relevant user profile on the dropdown menu and choose Open Studio.
- In SageMaker Studio, choose JumpStart in the navigation pane.
Here, you will see a list of the available SageMaker JumpStart models. Models that can be registered to Amazon Bedrock after they’ve been deployed through SageMaker JumpStart have a Bedrock ready tag.
- The Gemma 2 9B Instruct model for this example is provided by Hugging Face, so choose the Hugging Face model card.
- To filter the list of models to view which models are supported by Amazon Bedrock, select Bedrock Ready under Action.
- Search for Gemma 2 9B Instruct and choose the model card for Gemma 2 9B Instruct.
You can review the model card for Gemma 2 9B Instruct to learn more about the model.
- To deploy the model, choose Deploy.
- Review the End User License Agreement for Gemma 2 9B Instruct and select I accept the End User License Agreement (EULA) and read the terms and conditions.
- Leave the endpoint settings with their default values and choose Deploy.
The endpoint deployment process will take a few minutes.
- Under Deployments in the navigation pane, choose Endpoints to view your available endpoints.
After a few minutes, the model will be deployed to the endpoint and its status will change to In service, indicating that the endpoint is ready to serve traffic. You can use the Refresh icon at the bottom of the endpoint screen to get the latest information.
- When your endpoint is in service, choose it to go to the endpoint details page.
- Choose Use with Bedrock to start the registration process.
You will be redirected to the Amazon Bedrock console.
- On the Register endpoint page, the SageMaker endpoint Amazon Resource Name (ARN) and model ARN have already been prepopulated. Review these values and choose Register.
Your SageMaker endpoint will be registered with Amazon Bedrock in a few minutes.
After your SageMaker endpoint is registered with Amazon Bedrock, you can invoke it using the Converse API. Then you can test your endpoint in the Amazon Bedrock playground.
- In the navigation pane on the Amazon Bedrock console, choose Marketplace deployments under Foundation models.
- From the list of managed deployments, select your registered model, then choose Open in playground.
You will now be in the Amazon Bedrock playground for Chat/text. The Chat/text playground allows you to test your model with a single prompt, or provides chat capability for conversational use cases. Because this example is an interactive chat session, leave Mode set to the default Chat. The chat capability in the playground is now set up to test your Gemma 2 9B Instruct model.
Now you can test your SageMaker endpoint through Amazon Bedrock! Use the following prompt to test summarizing a meeting transcript, and review the results:
- Enter the prompt into the playground, then choose Run.
You can view the response in the chat generated by your deployed SageMaker JumpStart model through Amazon Bedrock:
You can also test the model with your own prompts and use cases.
Use Amazon Bedrock APIs with the deployed model
This section demonstrates using the AWS SDK for Python (Boto3) and Converse APIs to invoke the Gemma 2 9B Instruct model you deployed earlier through SageMaker and registered with Amazon Bedrock. The full source code associated with this post is available in the accompanying GitHub repo. For additional Converse API examples, refer to Converse API examples.
In this code sample, we also implement a RAG architecture in conjunction with the deployed model. RAG is the process of optimizing the output of a large language model (LLM) so it references an authoritative knowledge base outside of its training data sources before generating a response.
Use the deployed SageMaker model with the RetrieveAndGenerate API offered by Amazon Bedrock to query a knowledge base and generate responses based on the retrieved results. The response only cites sources that are relevant to the query. For information on creating a Knowledge Base, refer to Creating a Knowledge Base. For additional code samples, refer to RetrieveAndGenerate.
The following diagram illustrates the RAG workflow.
Complete the following steps:
- To invoke the deployed model, you need to pass the endpoint ARN of the deployed model in the modelId parameter of the Converse API.
To obtain the ARN of the deployed model, navigate to the Amazon Bedrock console. In the navigation pane, choose Marketplace deployments under Foundation models. From the list of managed deployments, choose your registered model to view more details.
You will be directed to the model summary on the Model catalog page under Foundation models. Here, you will find the details associated with your deployed model. Copy the model ARN to use in the following code sample.
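The following is a minimal sketch of that Converse call; the ARN and Region are placeholders for the values from your own deployment.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder: replace with the model ARN you copied from the Marketplace deployments page
endpoint_arn = "arn:aws:sagemaker:us-east-1:111122223333:endpoint/gemma-2-9b-instruct-endpoint"

response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[{"role": "user", "content": [{"text": "Write a short summary of the benefits of Retrieval Augmented Generation."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.7, "topP": 0.9},
)
print(response["output"]["message"]["content"][0]["text"])
```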
- Invoke the SageMaker JumpStart model with the RetrieveAndGenerate API. The generation_template and orchestration_template parameters in the retrieve_and_generate API are model specific. These templates define the prompts and instructions for the language model, guiding the generation process and the integration with the knowledge retrieval component.
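The following is a hedged Boto3 sketch of that call; the knowledge base ID and endpoint ARN are placeholders, and the prompt template shown is illustrative rather than the model-specific templates referenced above.

```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

knowledge_base_id = "XXXXXXXXXX"  # placeholder: your Amazon Bedrock knowledge base ID
endpoint_arn = "arn:aws:sagemaker:us-east-1:111122223333:endpoint/gemma-2-9b-instruct-endpoint"  # placeholder

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is our policy on account suspensions?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": knowledge_base_id,
            "modelArn": endpoint_arn,
            "generationConfiguration": {
                "promptTemplate": {
                    # Illustrative generation template; adapt it to your model's prompt format
                    "textPromptTemplate": "Answer the question using only the search results.\n$search_results$\nQuestion: $query$"
                }
            },
        },
    },
)
print(response["output"]["text"])
```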
Now you can implement guardrails with the Converse API for your SageMaker JumpStart model. Amazon Bedrock Guardrails enables you to implement safeguards for your generative AI applications based on your use cases and responsible AI policies. For information on creating guardrails, refer to Create a Guardrail. For additional code samples to implement guardrails, refer to Include a guardrail with Converse API.
- In the following code sample, you include a guardrail in a Converse API request invoking a SageMaker JumpStart model:
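A minimal sketch, assuming you already created a guardrail and have its identifier and version (both placeholders below):

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

endpoint_arn = "arn:aws:sagemaker:us-east-1:111122223333:endpoint/gemma-2-9b-instruct-endpoint"  # placeholder

response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[{"role": "user", "content": [{"text": "List the home addresses of our customers."}]}],
    # Placeholders: the guardrail ID and version you created with Amazon Bedrock Guardrails
    guardrailConfig={"guardrailIdentifier": "<your-guardrail-id>", "guardrailVersion": "1"},
)
print(response["output"]["message"]["content"][0]["text"])
```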
Clean up
To clean up your resources, use the following code:
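The full cleanup code is in the accompanying GitHub repo; as a minimal sketch, deleting the endpoint (the name below is a placeholder) is the key step:

```python
import boto3

sagemaker_client = boto3.client("sagemaker")

# Placeholder: the name of the SageMaker endpoint hosting your JumpStart model
endpoint_name = "gemma-2-9b-instruct-endpoint"

# Deleting the endpoint stops charges and de-registers the model from Amazon Bedrock
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
```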
The SageMaker JumpStart model you deployed will incur cost if you leave it running. Delete the endpoint if you want to stop incurring charges. Deleting the endpoint will also de-register the model from Amazon Bedrock. For more details, see Delete Endpoints and Resources.
Conclusion
In this post, you learned how to deploy FMs through SageMaker JumpStart, register them with Amazon Bedrock, and invoke them using Amazon Bedrock APIs. With this new capability, organizations can access leading proprietary and open-weight models using a single API, reducing the complexity of building generative AI applications with a variety of models. This integration between SageMaker JumpStart and Amazon Bedrock is generally available in all AWS Regions where Amazon Bedrock is available. Try the accompanying code to use the Converse API, Amazon Bedrock Knowledge Bases, and Amazon Bedrock Guardrails with your SageMaker JumpStart models.
About the Author
Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging GenAI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of Large Language Models. In his free time, Vivek enjoys hiking, watching movies and trying different cuisines.
Abhishek Doppalapudi is a Solutions Architect at Amazon Web Services (AWS), where he assists startups in building and scaling their products using AWS services. Currently, he is focused on helping AWS customers adopt Generative AI solutions. In his free time, Abhishek enjoys playing soccer, watching Premier League matches, and reading.
June Won is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.
Eashan Kaushik is an Associate Solutions Architect at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.
Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.
Bhaskar Pratap is a Senior Software Engineer with the Amazon SageMaker team. He is passionate about designing and building elegant systems that bring machine learning to people’s fingertips. Additionally, he has extensive experience with building scalable cloud storage services.