Extend large language models powered by Amazon SageMaker AI using Model Context Protocol

Organizations implementing agents and agent-based systems often experience challenges such as implementing multiple tools, function calling, and orchestrating the workflows of the tool calling. An agent uses a function call to invoke an external tool (like an API or database) to perform specific actions or retrieve information it doesn’t possess internally. These tools are integrated as an API call inside the agent itself, leading to challenges in scaling and tool reuse across an enterprise. Customers looking to deploy agents at scale need a consistent way to integrate these tools, whether internal or external, regardless of the orchestration framework they are using or the function of the tool.

Model Context Protocol (MCP) aims to standardize how these channels, agents, tools, and customer data can be used by agents, as shown in the following figure. For customers, this translates directly into a more seamless, consistent, and efficient experience compared to dealing with fragmented systems or agents. By making tool integration simpler and standardized, customers building agents can now focus on which tools to use and how to use them, rather than spending cycles building custom integration code. We will deep dive into the MCP architecture later in this post.

For MCP implementation, you need a scalable infrastructure to host these servers and an infrastructure to host the large language model (LLM), which will perform actions with the tools implemented by the MCP server. Amazon SageMaker AI provides the ability to host LLMs without worrying about scaling or managing the undifferentiated heavy lifting. You can deploy your model or LLM to SageMaker AI hosting services and get an endpoint that can be used for real-time inference. Moreover, you can host MCP servers on the compute environment of your choice from AWS, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and AWS Lambda, according to your preferred level of managed service—whether you want to have complete control of the machine running the server, or you prefer not to worry about maintaining and managing these servers.

In this post, we discuss the following topics:

Understanding the MCP architecture, why you should use the MCP compared to implementing microservices or APIs, and two popular ways of implementing MCP using LangGraph adapters:
- FastMCP for prototyping and simple use cases
- FastAPI for complex routing and authentication
Recommended architecture for scalable deployment of MCP
Using SageMaker AI with FastMCP for rapid prototyping
Implementing a loan underwriter MCP workflow with LangGraph and SageMaker AI with FastAPI for custom routing

Understanding MCP

Let’s deep dive into the MCP architecture. Developed by Anthropic as an open protocol, the MCP provides a standardized way to connect AI models to virtually any data source or tool. Using a client-server architecture (as illustrated in the following screenshot), MCP helps developers expose their data through lightweight MCP servers while building AI applications as MCP clients that connect to these servers.

The MCP uses a client-server architecture containing the following components:

Host – A program or AI tool that requires access to data through the MCP protocol, such as Anthropic’s Claude Desktop, an integrated development environment (IDE), or other AI applications
Client – Protocol clients that maintain one-to-one connections with servers
Server – Lightweight programs that expose capabilities through standardized MCP or act as tools
Data sources – Local data sources such as databases and file systems, or external systems available over the internet through APIs (web APIs) that MCP servers can connect to

Based on these components, we can define the protocol as the communication backbone connecting the MCP client and server within the architecture, which includes the set of rules and standards defining how clients and servers should interact, what messages they exchange (using JSON-RPC 2.0), and the roles of different components.

Now let’s understand the MCP workflow and how it interacts with an LLM to deliver you a response by using an example of a travel agent. You ask the agent to “Book a 5-day trip to Europe in January and we like warm weather.” The host application (acting as an MCP client) identifies the need for external data and connects through the protocol to specialized MCP servers for flights, hotels, and weather information. These servers return the relevant data through the MCP, which the host then integrates with the original prompt, providing enriched context to the LLM to generate a comprehensive, augmented response for the user. The following diagram illustrates this workflow.

When to use MCP instead of implementing microservices or APIs

MCP marks a significant advancement compared to traditional monolithic APIs and intricate microservices architectures. Traditional APIs often bundle the functionalities together, leading to challenges where scaling requires upgrading the entire system, updates carry high risks of system-wide failures, and managing different versions for various applications becomes overly complex. Although microservices offer more modularity, they typically demand separate, often complex, integrations for each service and intricate management overhead.

MCP overcomes these limitations by establishing a standardized client-server architecture specifically designed for efficient and secure integration. It provides a real-time, two-way communication interface enabling AI systems to seamlessly connect with diverse external tools, API services, and data sources using a “write once, use anywhere” philosophy. Using transports like standard input/output (stdio) or streamable HTTP under the unifying JSON-RPC 2.0 standard, MCP delivers key advantages such as superior fault isolation, dynamic service discovery, consistent security controls, and plug-and-play scalability, making it exceptionally well-suited for AI applications that require reliable, modular access to multiple resources.

FastMCP vs. FastAPI

In this post, we discuss two different approaches for implementing MCP servers: FastAPI with SageMaker, and FastMCP with LangGraph. Both are fully compatible with the MCP architecture and can be used interchangeably, depending on your needs. Let’s understand the difference between both.

FastMCP is used for rapid prototyping, educational demos, and scenarios where development speed is a priority. It’s a lightweight, opinionated wrapper built specifically for quickly standing up MCP-compliant endpoints. It abstracts away much of the boilerplate—such as input/output schemas and request handling—so you can focus entirely on your model logic.

For use cases where you need to customize request routing, add authentication, or integrate with observability tools like Langfuse or Prometheus, FastAPI gives you the flexibility to do so. FastAPI is a full-featured web framework that gives you finer-grained control over the server behavior. It’s well-suited for more complex workflows, advanced request validation, detailed logging, middleware, and other production-ready features.

You can safely use either approach in your MCP servers—the choice depends on whether you prioritize simplicity and speed (FastMCP) or flexibility and extensibility (FastAPI). Both approaches conform to the same interface expected by agents in the LangGraph pipeline, so your orchestration logic remains unchanged.

Solution overview

In this section, we walk through a reference architecture for scalable deployment of MCP servers and MCP clients, using SageMaker AI as the hosting environment for the foundation models (FMs) and LLMs. Although this architecture uses SageMaker AI as its reasoning core, it can be quickly adapted to support Amazon Bedrock models as well. The following diagram illustrates the solution architecture.

The architecture decouples the client from the server by using streamable HTTP as the transport layer. By doing this, clients and servers can scale independently, making it a great fit for serverless orchestration powered by Lambda, AWS Fargate for Amazon ECS, or Fargate for Amazon EKS. An additional benefit of decoupling is that you can better control authorization of applications and user by controlling AWS Identity and Access Management (IAM) permissions of client and servers separately, and propagating user access to the backend. If you’re running client and server with a monolithic architecture on the same compute, we suggest instead using stdio as the transport layer to reduce networking overhead.

Use SageMaker AI with FastMCP for rapid prototyping

With the architecture defined, let’s analyze the application flow as shown in the following figure.

In terms of usage patterns, MCP shares a logic similar to tool calling, with an initial addition to discover the available tools:

The client connects to the MCP server and obtains a list of available tools.
The client invokes the LLM using a prompt engineered with the list of tools available on the MCP server (message of type “user”).
The LLM reasons with respect to which ones it needs to call and how many times, and replies (“assistant” type message).
The client asks the MCP server to execute the tool calling and provides the result to the LLM (“user” type message).
This loop iterates until a final answer is reached and can be given back to the user.
The client disconnects from the MCP server.

Let’s start with the MCP server definition. To create an MCP server, we use the official Model Context Protocol Python SDK. For example, let’s create a simple server with just one tool. The tool will simulate searching for the most popular song played at a radio station, and return it in a Python dictionary. Make sure to add proper docstring and input/output typing, so that the both the server and client can discover and consume the resource correctly.

from mcp.server.fastmcp import FastMCP

# instantiate an MCP server client
mcp = FastMCP("Radio Station Server")

# DEFINE TOOLS
@mcp.tool()
def top_song(sign: str) -> dict:
"""Get the most popular song played on a radio station"""
# In this example, we simulate the return
# but you should replace this with your business logic
return {
"song": "In the end",
"author": "Linkin Park"
}

@mcp.tool()
def ...

if __name__ == "__main__":
# Start the MCP server using stdio/SSE transport
  mcp.run(transport="sse")

As we discussed earlier, MCP servers can be run on AWS compute services—Amazon EC2, Amazon EC2, Amazon EKS, or Lambda—and can then be used to safely access other resources in the AWS Cloud, for example databases in virtual private clouds (VPCs) or an enterprise API, as well as external resources. For example, a simple way to deploy an MCP server is to use Lambda support for Docker images to install the MCP dependency on the Lambda function or Fargate.

With the server set up, let’s turn our focus to the MCP client. Communication starts with the MCP client connecting to the MCP Server using streamable HTTP:

from mcp import ClientSession
from mcp.client.sse import sse_client

async def connect_to_sse_server(self, server_url: str):
"""Connect to an MCP server running with SSE transport"""
  # Store the context managers so they stay alive
  self._streams_context = sse_client(url=server_url)
  streams = await self._streams_context.__aenter__()

  self._session_context = ClientSession(*streams)
  self.session: ClientSession = await self._session_context.__aenter__()

  # Initialize
  await self.session.initialize()

  # List available tools to verify connection
  print("Initialized SSE client...")
  print("Listing tools...")
  response = await self.session.list_tools()
  tools = response.tools
  print("nConnected to server with tools:", [tool.name for tool in tools])

When connecting to the MCP server, a good practice is to ask the server for a list of available tools with the list_tools() API. With the tool list and their description, we can then define a system prompt for tool calling:

system_message = (
     "You are a helpful assistant with access to these tools:nn"
      f"{tools_description}n"
      "Choose the appropriate tool based on the user's question. "
      "If no tool is needed, reply directly.nn"
      "IMPORTANT: When you need to use a tool, you must ONLY respond with "
      "the exact JSON object format below, nothing else:n"
      "{n"
      '    "tool": "tool-name",n'
      '    "arguments": {n'
      '        "argument-name": "value"n'
      "    }n"
      "}nn"
      "After receiving a tool's response:n"
      "1. Transform the raw data into a natural, conversational responsen"
      "2. Keep responses concise but informativen"
      "3. Focus on the most relevant informationn"
      "4. Use appropriate context from the user's questionn"
      "5. Avoid simply repeating the raw datann"
      "Please use only the tools that are explicitly defined above."
)

Tools are usually defined using a JSON schema similar to the following example. This tool is called top_song and its function is to get the most popular song played on a radio station:

{
   "name": "top_song",
   "description": "Get the most popular song played on a radio station.",
   "parameters": {
     "type": "object",
     "properties": {
        "sign": {
           "type": "string",
           "description": "The call sign for the radio station for which you want the most popular song. Example calls signs are WZPZ and WKRP."
           }
         },
     "required": ["sign"]
     }
}

With the system prompt configured, you can run the chat loop as much as needed, alternating between invoking the hosted LLM and calling the tools powered by the MCP server. You can use packages such as SageMaker Boto3, the Amazon SageMaker Python SDK, or another third-party library, such as LiteLLM or similar.

messages = [
     {"role": "system", "content": system_message},
     {"role": "user", "content": "What is the most played song on WZPZ?"}
]

result = sagemaker_client.invoke_endpoint(...)
tool_name, tool_args = parse_tools_from_llm_response(result)
# Identify if there is a tool call in the message received from the LLM
result = await self.session.call_tool(tool_name, tool_args)
# Parse the output from the tool called, then invoke the endpoint again
result = sagemaker_client.invoke_endpoint(...)

A model hosted on SageMaker doesn’t support function calling natively in its API. This means that you will need to parse the content of the response using a regular expression or similar methods:

import re, json

def parse_tools_from_llm_response(message: str)->dict:
    match = re.search(r'(?s){(?:[^{}]|(?:{[^{}]*}))*}', content)
    content = json.loads(match.group(0))
    tool_name = content["tool"]
    tool_arguments = content["arguments"]
    return tool_name, tool_arguments

After no more tool requests are available in the LLM response, you can consider the content as the final answer and return it to the user. Finally, you close the stream to finalize interactions with the MCP server.

Implement a loan underwriter MCP workflow with LangGraph and SageMaker AI with FastAPI for custom routing

To demonstrate the power of MCP with SageMaker AI, let’s explore a loan underwriting system that processes applications through three specialized personas:

Loan officer – Summarizes the application
Credit analyst – Evaluates creditworthiness
Risk manager – Makes final approval or denial decisions

We will walk you through these personas through the following architecture for a loan processing workflow using MCP. The code for this solution is available in the following GitHub repo.

In the architecture, the MCP client and server are running on EC2 instances and the LLM is hosted on SageMaker endpoints. The workflow consists of the following steps:

The user enters a prompt with loan input details such as name, age, income, and credit score.
The request is routed to the loan MCP server by the MCP client.
The loan parser sends output as input to the credit analyzer MCP server.
The credit analyzer sends output as input to the risk manager MCP server.
The final prompt is processed by the LLM and sent back to the MCP client to provide the output to the user.

You can use LangGraph’s built-in human-in-the-loop feature when the credit analyzer sends the output to the risk manager and when the risk manager sends the output. For this post, we have not implemented this workflow.

Each persona is powered by an agent with LLMs hosted by SageMaker AI, and its logic is deployed by using a dedicated MCP server. Our MCP server implementation in the example uses the Awesome MCP FastAPI, but you can also build a standard MCP server implementation according to the original Anthropic package and specification. The dedicated MCP server in this example is running on a local Docker container, but it can be quickly deployed to the AWS Cloud using services like Fargate. To run the servers locally, use the following code:

uvicorn servers.loan_parser.main:app --port 8002
uvicorn servers.credit_analyzer.main:app --port 8003
uvicorn servers.risk_assessor.main:app --port 8004

When the servers are running, you can start creating the agents and the workflow. You will need to deploy the LLM endpoint by running the following command:

Python deploy_sm_endpoint.py

This example uses LangGraph, a common open source framework for agentic workflows, designed to support seamless integration of language models into complex workflows and applications. Workflows are represented as graphs made of nodes—actions, tools, or model queries—and edges with the flow of information between them. LangGraph provides a structured yet dynamic way to execute tasks, making it simple to write AI applications involving natural language understanding, automation, and decision-making.

In our example, the first agent we create is the loan officer:

graph = StateGraph(State)
graph.add_node("LoanParser", call_mcp_server(PARSER_URL))

The goal of the loan officer (or LoanParser) is to perform the tasks defined in its MCP server. To call the MCP server, we can use the httpx library:

import httpx
from langchain_core.runnables import RunnableLambda

def call_mcp_server(url):
    async def fn(state: State) -> State:
      print(f"[DEBUG] Calling {url} with payload:", state["output"])
      async with httpx.AsyncClient() as client:
        response = await client.post(url, json=state["output"])
        response.raise_for_status()
        return {"output": response.json()}
    return RunnableLambda(fn).with_config({"run_name": f"CallMCP::{url.split(':')[2]}"})

With that done, we can run the workflow using the scripts/run_pipeline.py file. We configured the repository to be traceable by using LangSmith. If you have correctly configured the environment variables, you will see a trace similar to this one in your LangSmith UI.

Configuring LangSmith UI for experiment tracing is optional. You can skip this step.

After running python3 scripts/run_pipeline.py, you should see the following in your terminal or log.

We use the following input:

loan_input = {
  "output": {
     "name": "Jane Doe",
     "age": 35,
     "income": 2000000,
     "loan_amount": 4500000,
     "credit_score": 820,
     "existing_liabilities": 15000,
     "purpose": "Home Renovation"
     }
}

We get the following output:

[DEBUG] Calling http://localhost:8002/process with payload: {'name': 'Jane Doe', 'age': 35, 'income': 2000000, 'loan_amount': 4500000, 'credit_score': 820, 'existing_liabilities': 15000, 'purpose': 'Home Renovation'}

[DEBUG] Calling http://localhost:8003/process with payload: {'summary': 'Jane Doe, 35 years old, applying for a loan of $4,500,000 to renovate her home. She has an income of $2,000,000, a credit score of 820, and existing liabilities of $150,000.', 'fields': {'name': 'Jane Doe', 'age': 35, 'income': 2000000.0, 'loan_amount': 4500000.0, 'credit_score': 820, 'existing_liabilities': 15000.0, 'purpose': 'Home Renovation'}}

[DEBUG] Calling http://localhost:8004/process with payload: {'credit_assessment': 'High', 'score': 'High', 'fields': {'name': 'Jane Doe', 'age': 35, 'income': 2000000.0, 'loan_amount': 4500000.0, 'credit_score': 820, 'existing_liabilities': 15000.0, 'purpose': 'Home Renovation'}}

Final result: {'decision': 'Approved', 'reasoning': 'Decision: Approved'}

Tracing with the LangSmith UI

LangSmith traces contain the full information of all the inputs and outputs of each step of the application, giving users full visibility into their agent. This is an optional step and in case you have configured LangSmith for tracing the MCP loan processing application. You can go the LangSmith login page and log in to the LangSmith UI. Then you can choose Tracing Project and run LoanUnderwriter. You should see a detailed flow of each MCP server, such as loan parser, credit analyzer, and risk assessor input and outputs by the LLM, as shown in the following screenshot.

Conclusion

The MCP proposed by Anthropic offers a standardized way of connecting FMs to data sources, and now you can use this capability with SageMaker AI. In this post, we presented an example of combining the power of SageMaker AI and MCP to build an application that offers a new perspective on loan underwriting through specialized roles and automated workflows.

Organizations can now streamline their AI integration processes by minimizing custom integrations and maintenance bottlenecks. As AI continues to evolve, the ability to securely connect models to your organization’s critical systems will become increasingly valuable. Whether you’re looking to transform loan processing, streamline operations, or gain deeper business insights, the SageMaker AI and MCP integration provides a flexible foundation for your next AI innovation.

The following are some examples of what you can build by connecting your SageMaker AI models to MCP servers:

A multi-agent loan processing system that coordinates between different roles and data sources
A developer productivity assistant that integrates with enterprise systems and tools
A machine learning workflow orchestrator that manages complex, multi-step processes while maintaining context across operations

If you’re looking for ways to optimize your SageMaker AI deployment, learn more about how to unlock cost savings with the new scale down to zero feature in SageMaker Inference, as well as how to unlock cost-effective AI inference using Amazon Bedrock serverless capabilities with a SageMaker trained model. For application development, refer to Build agentic AI solutions with DeepSeek-R1, CrewAI, and Amazon SageMaker AI

About the Authors

Mona Mona currently works as a Sr World Wide Gen AI Specialist Solutions Architect at Amazon focusing on Gen AI Solutions. She was a Lead Generative AI specialist in Google Public Sector at Google before joining Amazon. She is a published author of two books – Natural Language Processing with AWS AI Services and Google Cloud Certified Professional Machine Learning Study Guide. She has authored 19 blogs on AI/ML and cloud technology and a co-author on a research paper on CORD19 Neural Search which won an award for Best Research Paper at the prestigious AAAI (Association for the Advancement of Artificial Intelligence) conference.

Davide Gallitelli is a Senior Worldwide Specialist Solutions Architect for Generative AI at AWS, where he empowers global enterprises to harness the transformative power of AI. Based in Europe but with a worldwide scope, Davide partners with organizations across industries to architect custom AI agents that solve complex business challenges using AWS ML stack. He is particularly passionate about democratizing AI technologies and enabling teams to build practical, scalable solutions that drive organizational transformation.

Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.

Giuseppe Zappia is a Principal Solutions Architect at AWS, with over 20 years of experience in full stack software development, distributed systems design, and cloud architecture. In his spare time, he enjoys playing video games, programming, watching sports, and building things.

Vedere AI