NVIDIA Works With NTT DOCOMO to Launch World’s First GPU-Accelerated 5G Network


As generative AI sweeps across corporate boardrooms around the world, global telecommunications companies are exploring how to cost-effectively deliver many of these new AI applications to the edge over 5G and upcoming 6G networks.

Telcos plan to deploy over 17 million 5G microcells and towers worldwide by 2025. Building, managing and optimizing this new infrastructure while maintaining quality-of-service delivery and maximizing the customer experience is the industry’s next big challenge.

Today, NTT DOCOMO announced it is deploying a GPU-accelerated wireless solution in its network in Japan. This makes it the first-ever telco in the world to deploy a GPU-accelerated commercial 5G network.

DOCOMO’s move aims to address the multibillion-dollar challenge of improving performance, total cost of ownership and energy efficiency while unlocking the flexibility, scalability and supply chain diversity promised by Open RAN.

The 5G Open RAN solution uses a high-performance 5G virtual radio access network (vRAN) from Fujitsu built on the NVIDIA Aerial vRAN stack and NVIDIA Converged Accelerators. This combination enables telcos to create a fully software- and cloud-defined network that can dynamically allocate resources using industry-standard equipment.

“Open RAN offers the opportunity to build next-generation 5G networks with unprecedented flexibility and scalability thanks to multivendor connections,” said Sadayuki Abeta, global head of Open RAN solutions at NTT DOCOMO. “We look forward to continuing working with NVIDIA on infrastructure solutions that meet those needs.”

The 5G Open RAN solution is the first 5G vRAN for telco commercial deployment using the NVIDIA Aerial platform, with a hardened, carrier-grade vRAN stack. The platform brings together the NVIDIA Aerial vRAN stack for 5G, AI frameworks, accelerated compute infrastructure and long-term software support and maintenance.

Cost and Energy-Efficiency Benefits of GPU Acceleration

Working with offerings from Fujitsu and Wind River, the new 5G solution uses the NVIDIA Aerial platform to lower costs and reduce power consumption. Compared with its existing 5G network deployments, DOCOMO says the solution reduces total costs by up to 30%, improves network design utilization by up to 50%, and cuts power consumption at base stations by up to 50%.

“Delivering a 5G Open RAN network that meets stringent performance requirements of operators is a significant accomplishment,” said Masaki Taniguchi, senior vice president and head of the Mobile System Business Unit at Fujitsu Limited. “Using our Fujitsu vCU/vDU, in combination with NVIDIA Aerial platform, will help network operators to efficiently build high-performance 5G networks for consumers and businesses alike.”

NVIDIA’s contribution to the DOCOMO rollout is part of a growing portfolio of 5G solutions driving transformation in the telecommunications industry. Anchored by the NVIDIA Aerial vRAN stack and NVIDIA Converged Accelerators — combined with NVIDIA BlueField data processing units (DPUs) and a suite of AI frameworks — NVIDIA provides a high-performance, software-defined, cloud-native, AI-enabled 5G platform for on-premises deployments and telco operators’ RAN.

Fujitsu, NVIDIA and Wind River have been working under OREX, DOCOMO’s 5G Open RAN service brand launched in February 2021, to develop the Open RAN 5G vRAN. OREX has been deployed in Japan based on Fujitsu’s virtualized DU (vDU) and virtualized CU (vCU) and leverages commercial off-the-shelf servers, the Wind River cloud platform, Fujitsu’s 5G vRAN software, and the NVIDIA Aerial vRAN stack and NVIDIA Converged Accelerators.

“Wind River is delighted to work with NTT DOCOMO, Fujitsu and NVIDIA towards a vision of improved efficiency of RAN deployments and operations,” said Paul Miller, chief technology officer at Wind River. “Wind River Studio provides a cloud-native, distributed cloud, automation, and analytics solution based on open source, so that operators can deploy and manage their 5G edge networks globally at high scale with faster innovation. This solution is proven in highly scaled production deployments today.”

OREX: Building Out From Japan and Beyond

DOCOMO and its partners in OREX are promoting a multivendor, Open RAN-compliant 5G vRAN to the global operator community. The commercial deployment in Japan is a first step in the vision of OREX, where members can commercially validate their solutions and then promote them to other operators globally.

NVIDIA is working with DOCOMO and other partners to support operators around the world to deploy high-performance, energy-efficient, software-defined, commercial 5G vRAN.

Read More

NVIDIA CEO and Founder Jensen Huang Returns to Denny’s Where NVIDIA Launched a Trillion-Dollar Vision


Talk about a Grand Slam.

Denny’s CEO Kelli Valade was joined Tuesday by NVIDIA CEO Jensen Huang to unveil a plaque at the Silicon Valley Denny’s where NVIDIA’s founders hatched their idea for a chip that would enable realistic 3D graphics on personal computers.

“This is a place where we fuel ideas. Your story is so inspiring it will continue to inspire people at Denny’s,” Valade said as she presented the plaque to Huang.

“Denny’s has taught me so many lessons,” Huang said.

Both CEOs got their start working in diners. Valade got her first job as a waitress at a diner when she was 16. Huang got his first job at Denny’s in Portland when he was 15.

“I was a dishwasher, I was a busboy, I waited tables,” Huang said. “No one can carry more coffee cups than I can.”

To fuel even more great ideas, Valade announced the Denny’s Trillion-Dollar Incubator Contest — offering $25,000 in seed money for the next $1 trillion idea.

The contest is open to anyone with a creative and innovative idea that could impact the world. The only catch? The idea must originate — like NVIDIA — in a Denny’s booth.

NVIDIA, the leading accelerated computing and AI company, got its start at the 24-hour diner chain known for favorites such as its signature Grand Slam combo.

In 1993, three friends — Huang, Chris Malachowsky and Curtis Priem — met at Denny’s to discuss creating a chip that would enable realistic 3D graphics on personal computers.

The Denny’s just off a busy thoroughfare in the heart of Silicon Valley was the perfect place to start a business, said Huang, who lived nearby at the time with his wife and kids.

“It had all the coffee you could drink and no one could chase you out,” Huang told Valade.

Tuesday’s event took place in a corner of the bustling restaurant — one of the most popular Denny’s locations in Northern California — as families, retirees, and workers coming off the night shift piled in for plates piled high with eggs and pancakes, sausage and bacon.

Huang was among them, starting the day with a meeting where he and his table polished off a Lumberjack Slam, Moons Over My Hammy, and a Super Bird sandwich — washed down with plenty of coffee. 

Huang, whose family immigrated to the United States from Taiwan when he was a child, told Valade he had his first hamburger and his first milkshake at Denny’s.

“We make the best pancakes here,” Huang said.

“I love how you still say ‘we,’” Valade said.

“Made fresh every day,” Huang added with a grin.

Valade and Huang said Denny’s can be a great launching pad, not just for great ideas, but for great careers.

“Start your first job in the restaurant business,” Huang said. “It teaches you humility, it teaches you hard work, it teaches you hospitality.”

Valade agreed wholeheartedly. After talking shop with Huang, she checked in with diners like Alfred, who was tucking into a stack of pancakes amid the morning rush.

For people across Silicon Valley, it’s the place to be. “I come here every day,” the retired roofer said.

Full contest details for the Denny’s Trillion-Dollar Incubator Contest can be found at www.dennys.com/trilliondollarincubator. Contestants can submit their ideas online or at a Denny’s restaurant by 8:59 a.m. on Nov. 21. The winner will be announced early next year.

 

IMAGE CREDIT: Don Feria/AP Images for Denny’s

 

Read More

Build and deploy ML inference applications from scratch using Amazon SageMaker


As machine learning (ML) goes mainstream and gains wider adoption, ML-powered inference applications are becoming increasingly common to solve a range of complex business problems. The solution to these complex business problems often requires using multiple ML models and steps. This post shows you how to build and host an ML application with custom containers on Amazon SageMaker.

Amazon SageMaker offers built-in algorithms and prebuilt SageMaker Docker images for model deployment. But if these don’t fit your needs, you can bring your own containers (BYOC) for hosting on Amazon SageMaker.

There are several use cases where you might need to bring your own container for hosting on Amazon SageMaker:

  1. Custom ML frameworks or libraries: If you plan to use an ML framework or libraries that aren’t supported by Amazon SageMaker built-in algorithms or prebuilt containers, then you’ll need to create a custom container.
  2. Specialized models: For certain domains or industries, you may require specific model architectures or tailored preprocessing steps that aren’t available in built-in Amazon SageMaker offerings.
  3. Proprietary algorithms: If you’ve developed your own proprietary algorithms in-house, then you’ll need a custom container to deploy them on Amazon SageMaker.
  4. Complex inference pipelines: If your ML inference workflow involves custom business logic — a series of complex steps that need to be executed in a particular order — then BYOC can help you manage and orchestrate these steps more efficiently.

Solution overview

In this solution, we show how to host an ML serial inference application on Amazon SageMaker with real-time endpoints, using two custom inference containers with the latest scikit-learn and XGBoost packages.

The first container uses a scikit-learn model to transform raw data into featurized columns. It applies StandardScaler for numerical columns and OneHotEncoder to categorical ones.

The second container hosts a pretrained XGBoost model (i.e., the predictor). The predictor accepts the featurized input and outputs predictions.

Lastly, we deploy the featurizer and predictor in a serial-inference pipeline to an Amazon SageMaker real-time endpoint.

Here are a few considerations for why you may want separate containers within your inference application:

  • Decoupling – Various steps of the pipeline have a clearly defined purpose and need to be run on separate containers due to the underlying dependencies involved. This also helps keep the pipeline well structured.
  • Frameworks – Various steps of the pipeline use specific fit-for-purpose frameworks (such as scikit or Spark ML) and therefore need to be run on separate containers.
  • Resource isolation – Various steps of the pipeline have varying resource consumption requirements and therefore need to be run on separate containers for more flexibility and control.
  • Maintenance and upgrades – From an operational standpoint, this promotes functional isolation and you can continue to upgrade or modify individual steps much more easily, without affecting other models.

Additionally, building the individual containers locally helps in the iterative process of development and testing with your favorite tools and integrated development environments (IDEs). Once the containers are ready, you can deploy them to the AWS Cloud for inference using Amazon SageMaker endpoints.

The full implementation, including code snippets, is available in this GitHub repository.

Prerequisites

Because we test these custom containers locally first, you’ll need Docker Desktop installed on your local machine. You should be familiar with building Docker containers.

You’ll also need an AWS account with access to Amazon SageMaker, Amazon ECR and Amazon S3 to test this application end-to-end.

Ensure you have the latest version of Boto3 and the Amazon SageMaker Python packages installed:

pip install --upgrade boto3 sagemaker scikit-learn

Solution walkthrough

Build custom featurizer container

To build the first container, the featurizer container, we train a scikit-learn model to process raw features in the abalone dataset. The preprocessing script uses SimpleImputer for handling missing values, StandardScaler for normalizing numerical columns, and OneHotEncoder for transforming categorical columns. After fitting the transformer, we save the model in joblib format. We then compress and upload this saved model artifact to an Amazon Simple Storage Service (Amazon S3) bucket.

Here’s a sample code snippet that demonstrates this. Refer to featurizer.ipynb for full implementation:

```python
numeric_features = list(feature_columns_names)
numeric_features.remove("sex")
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

categorical_features = ["sex"]
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Call fit on ColumnTransformer to fit all transformers to X, y
preprocessor = preprocess.fit(df_train_val)

# Save the processor model to disk
joblib.dump(preprocess, os.path.join(model_dir, "preprocess.joblib"))
```

Next, to create a custom inference container for the featurizer model, we build a Docker image with the nginx, gunicorn and Flask packages, along with other dependencies required by the featurizer model.

Nginx, gunicorn and the Flask app will serve as the model serving stack on Amazon SageMaker real-time endpoints.

When bringing custom containers for hosting on Amazon SageMaker, we need to ensure that the inference script performs the following tasks after being launched inside the container:

  1. Model loading: The inference script (preprocessing.py) should refer to the /opt/ml/model directory to load the model inside the container. Model artifacts in Amazon S3 are downloaded and mounted onto the container at the path /opt/ml/model.
  2. Environment variables: To pass custom environment variables to the container, specify them during model creation or endpoint creation (a short sketch follows this list).
  3. API requirements: The inference script must implement both /ping and /invocations routes as a Flask application. The /ping API is used for health checks, while the /invocations API handles inference requests.
  4. Logging: Output logs in the inference script must be written to standard output (stdout) and standard error (stderr) streams. These logs are then streamed to Amazon CloudWatch by Amazon SageMaker.
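
As a hedged illustration of point 2, custom environment variables can be attached to the model object when it’s created with the SageMaker Python SDK. The image URI, artifact path, and variable names below are illustrative placeholders, not values from this solution:

```python
from sagemaker.model import Model

# Sketch only: the image URI, model artifact path, and variable names are
# illustrative placeholders, not values used in this post.
featurizer_model = Model(
    image_uri="<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/<FEATURIZER_IMAGE_NAME>:latest",
    model_data="s3://<BUCKET>/models/featurizer/model.tar.gz",
    role=role,  # assumes an existing SageMaker execution role variable
    env={
        "MODEL_SERVER_TIMEOUT": "60",  # example variable read inside the container
        "LOG_LEVEL": "INFO",
    },
)
```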

Here’s a snippet from preprocessing.py that shows the implementation of /ping and /invocations.

Refer to preprocessing.py under the featurizer folder for full implementation.

```python
def load_model():
    # Construct the path to the featurizer model file
    ft_model_path = os.path.join(MODEL_PATH, "preprocess.joblib")
    featurizer = None

    try:
        # Open the model file and load the featurizer using joblib
        with open(ft_model_path, "rb") as f:
            featurizer = joblib.load(f)
            print("Featurizer model loaded", flush=True)
    except FileNotFoundError:
        print(f"Error: Featurizer model file not found at {ft_model_path}", flush=True)
    except Exception as e:
        print(f"Error loading featurizer model: {e}", flush=True)

    # Return the loaded featurizer model, or None if there was an error
    return featurizer

def transform_fn(request_body, request_content_type):
    """
    Transform the request body into a usable numpy array for the model.

    This function takes the request body and content type as input, and
    returns a transformed numpy array that can be used as input for the
    prediction model.

    Parameters:
        request_body (str): The request body containing the input data.
        request_content_type (str): The content type of the request body.

    Returns:
        data (np.ndarray): Transformed input data as a numpy array.
    """
    # Define the column names for the input data
    feature_columns_names = [
        "sex",
        "length",
        "diameter",
        "height",
        "whole_weight",
        "shucked_weight",
        "viscera_weight",
        "shell_weight",
    ]
    label_column = "rings"

    # Check if the request content type is supported (text/csv)
    if request_content_type == "text/csv":
        # Load the featurizer model
        featurizer = load_model()

        # Check if the featurizer is a ColumnTransformer
        if isinstance(
            featurizer, sklearn.compose._column_transformer.ColumnTransformer
        ):
            print(f"Featurizer model loaded", flush=True)

        # Read the input data from the request body as a CSV file
        df = pd.read_csv(StringIO(request_body), header=None)

        # Assign column names based on the number of columns in the input data
        if len(df.columns) == len(feature_columns_names) + 1:
            # This is a labelled example, includes the ring label
            df.columns = feature_columns_names + [label_column]
        elif len(df.columns) == len(feature_columns_names):
            # This is an unlabelled example.
            df.columns = feature_columns_names

        # Transform the input data using the featurizer
        data = featurizer.transform(df)

        # Return the transformed data as a numpy array
        return data
    else:
        # Raise an error if the content type is unsupported
        raise ValueError("Unsupported content type: {}".format(request_content_type))


@app.route("/ping", methods=["GET"])
def ping():
    # Check if the model can be loaded, set the status accordingly
    featurizer = load_model()
    status = 200 if featurizer is not None else 500

    # Return the response with the determined status code
    return flask.Response(response="\n", status=status, mimetype="application/json")


@app.route("/invocations", methods=["POST"])
def invocations():
    # Convert from JSON to dict
    print(f"Featurizer: received content type: {flask.request.content_type}")
    if flask.request.content_type == "text/csv":
        # Decode input data and transform
        input = flask.request.data.decode("utf-8")
        transformed_data = transform_fn(input, flask.request.content_type)

        # Format transformed_data into a csv string
        csv_buffer = io.StringIO()
        csv_writer = csv.writer(csv_buffer)

        for row in transformed_data:
            csv_writer.writerow(row)

        csv_buffer.seek(0)

        # Return the transformed data as a CSV string in the response
        return flask.Response(response=csv_buffer.getvalue(), status=200, mimetype="text/csv")
    else:
        print(f"Received: {flask.request.content_type}", flush=True)
        return flask.Response(
            response="Transformer: This predictor only supports CSV data",
            status=415,
            mimetype="text/plain",
        )
```

Build Docker image with featurizer and model serving stack

Let’s now build a Dockerfile using a custom base image and install required dependencies.

For this, we use python:3.9-slim-buster as the base image. You can change this to any other base image relevant to your use case.

We then copy the nginx configuration, the Gunicorn WSGI entry-point file, and the inference script into the container. We also create a Python script called serve that launches the nginx and gunicorn processes in the background and sets the inference script (i.e., the preprocessing.py Flask application) as the entry point for the container.

Here’s a snippet of the Dockerfile for hosting the featurizer model. For full implementation refer to Dockerfile under featurizer folder.

```docker
FROM python:3.9-slim-buster
…

# Copy requirements.txt to /opt/program folder
COPY requirements.txt /opt/program/requirements.txt

# Install packages listed in requirements.txt
RUN pip3 install --no-cache-dir -r /opt/program/requirements.txt

# Copy contents of code/ dir to /opt/program
COPY code/ /opt/program/

# Set working dir to /opt/program which has the serve and inference.py scripts
WORKDIR /opt/program

# Expose port 8080 for serving
EXPOSE 8080

ENTRYPOINT ["python"]

# serve is a python script under code/ directory that launches nginx and gunicorn processes
CMD [ "serve" ]
```
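
The serve script referenced by the CMD instruction isn’t reproduced in this post; a minimal sketch of what such a launcher typically looks like follows, where the paths, worker count, and the wsgi:app target are illustrative assumptions rather than the repository’s exact contents:

```python
#!/usr/bin/env python
# Minimal sketch of a 'serve' launcher: it starts nginx and gunicorn, then exits
# if either process dies. Paths, worker count, and the wsgi:app target are
# illustrative assumptions; the repository's version may differ.
import os
import subprocess
import sys

def start_server():
    workers = int(os.environ.get("MODEL_SERVER_WORKERS", 2))

    nginx = subprocess.Popen(["nginx", "-c", "/opt/program/nginx.conf"])
    gunicorn = subprocess.Popen([
        "gunicorn",
        "--timeout", "60",
        "--bind", "unix:/tmp/gunicorn.sock",
        "--workers", str(workers),
        "wsgi:app",
    ])

    # Exit when either child process terminates, taking the other down with it
    pids = {nginx.pid, gunicorn.pid}
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    for proc in (nginx, gunicorn):
        try:
            proc.terminate()
        except OSError:
            pass
    sys.exit(0)

if __name__ == "__main__":
    start_server()
```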

Test custom inference image with featurizer locally

Now, build and test the custom inference container with featurizer locally, using Amazon SageMaker local mode. Local mode is perfect for testing your processing, training, and inference scripts without launching any jobs on Amazon SageMaker. After confirming the results of your local tests, you can easily adapt the training and inference scripts for deployment on Amazon SageMaker with minimal changes.
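
If you prefer to exercise the container through the SageMaker Python SDK rather than the raw Docker commands shown next, local mode can be used roughly as in the following sketch; the image name, artifact path, and role ARN are placeholders:

```python
from sagemaker.local import LocalSession
from sagemaker.model import Model

# Sketch only: image name, artifact path, and role ARN are placeholders.
local_session = LocalSession()

featurizer_model = Model(
    image_uri="<FEATURIZER_IMAGE_NAME>:latest",
    model_data="file://./models/model.tar.gz",  # local mode expects a tarred artifact
    role="<SAGEMAKER_EXECUTION_ROLE_ARN>",
    sagemaker_session=local_session,
)

# instance_type="local" runs the container on this machine through Docker
local_predictor = featurizer_model.deploy(
    initial_instance_count=1,
    instance_type="local",
)
```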

To test the featurizer custom image locally, first build the image using the previously defined Dockerfile. Then, launch a container by mounting the directory containing the featurizer model (preprocess.joblib) to the /opt/ml/model directory inside the container. Additionally, map port 8080 from container to the host.

Once launched, you can send inference requests to http://localhost:8080/invocations.

To build and launch the container, open a terminal and run the following commands.

Note that you should replace <IMAGE_NAME> in the following code with the image name of your container.

The following command also assumes that the trained scikit-learn model (preprocess.joblib) is present under a directory called models.

```shell
docker build -t <IMAGE_NAME> .
```

```shell
docker run --rm -v $(pwd)/models:/opt/ml/model -p 8080:8080 <IMAGE_NAME>
```

After the container is up and running, we can test both the /ping and /invocations routes using curl commands.

Run the following commands from a terminal:

```shell
# test /ping route on local endpoint
curl http://localhost:8080/ping

# send raw csv string to /invocations. Endpoint should return transformed data
curl --data-raw 'I,0.365,0.295,0.095,0.25,0.1075,0.0545,0.08,9.0' -H 'Content-Type: text/csv' -v http://localhost:8080/invocations
```

When raw (untransformed) data is sent to http://localhost:8080/invocations, the endpoint responds with transformed data.

You should see a response similar to the following:

```shell
* Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> POST /invocations HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.87.0
> Accept: */*
> Content-Type: text/csv
> Content-Length: 47
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.14.2
< Date: Sun, 09 Apr 2023 20:47:48 GMT
< Content-Type: text/csv; charset=utf-8
< Content-Length: 150
< Connection: keep-alive
<
-1.3317586042173168, -1.1425409076053987, -1.0579488602777858, -1.177706547272754, -1.130662184748842,
* Connection #0 to host localhost left intact
```

We now terminate the running container, and then tag and push the local custom image to a private Amazon Elastic Container Registry (Amazon ECR) repository.

The following commands log in to Amazon ECR, tag the local image with the full Amazon ECR image path, and then push the image to Amazon ECR. Ensure you replace the region and account variables to match your environment.

```shell
# log in to Amazon ECR with your credentials
aws ecr get-login-password --region "${region}" | \
    docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"

# tag and push the image to the private Amazon ECR repository
docker tag ${image} ${fullname}
docker push ${fullname}

```

For more information, refer to the AWS Command Line Interface (AWS CLI) commands to create a repository and push an image to Amazon ECR.

Optional step

Optionally, you can perform a live test by deploying the featurizer model to a real-time endpoint with the custom Docker image in Amazon ECR. Refer to the featurizer.ipynb notebook for the full implementation of building, testing, and pushing the custom image to Amazon ECR.

Amazon SageMaker initializes the inference endpoint and copies the model artifacts to the /opt/ml/model directory inside the container. See How SageMaker Loads your Model artifacts.

Build custom XGBoost predictor container

To build the XGBoost inference container, we follow steps similar to those we used for the featurizer container:

  1. Download pre-trained XGBoost model from Amazon S3.
  2. Create the inference.py script that loads the pretrained XGBoost model, converts the transformed input data received from the featurizer into XGBoost DMatrix format, runs predict on the booster, and returns the predictions in JSON format.
  3. The scripts and configuration files that form the model serving stack (i.e., nginx.conf, wsgi.py, and serve) remain the same and need no modification.
  4. We use ubuntu:18.04 as the base image for the Dockerfile. This isn’t a prerequisite; we use the Ubuntu base image to demonstrate that containers can be built with any base image.
  5. The steps for building the custom Docker image, testing the image locally, and pushing the tested image to Amazon ECR remain the same as before.

Because the steps are similar to those shown previously, for brevity we only show the code that changed in the following sections.

First, the inference.py script. Here’s a snippet that shows the implementation of /ping and /invocations. Refer to inference.py under the predictor folder for the full implementation of this file.

```python
@app.route("/ping", methods=["GET"])
def ping():
    """
    Check the health of the model server by verifying if the model is loaded.

    Returns a 200 status code if the model is loaded successfully, or a 500
    status code if there is an error.

    Returns:
        flask.Response: A response object containing the status code and mimetype.
    """
    status = 200 if model is not None else 500
    return flask.Response(response="\n", status=status, mimetype="application/json")

@app.route("/invocations", methods=["POST"])
def invocations():
    """
    Handle prediction requests by preprocessing the input data, making predictions,
    and returning the predictions as a JSON object.

    This function checks if the request content type is supported (text/csv; charset=utf-8),
    and if so, decodes the input data, preprocesses it, makes predictions, and returns
    the predictions as a JSON object. If the content type is not supported, a 415 status
    code is returned.

    Returns:
        flask.Response: A response object containing the predictions, status code, and mimetype.
    """
    print(f"Predictor: received content type: {flask.request.content_type}")
    if flask.request.content_type == "text/csv; charset=utf-8":
        input = flask.request.data.decode("utf-8")
        transformed_data = preprocess(input, flask.request.content_type)
        predictions = predict(transformed_data)

        # Return the predictions as a JSON object
        return json.dumps({"result": predictions})
    else:
        print(f"Received: {flask.request.content_type}", flush=True)
        return flask.Response(
            response=f"XGBPredictor: This predictor only supports CSV data; Received: {flask.request.content_type}",
            status=415,
            mimetype="text/plain",
        )

```
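
The /invocations route above calls preprocess and predict helpers that aren’t shown in the snippet; the full versions are in inference.py under the predictor folder. A rough sketch of what they might look like, assuming the pretrained booster is loaded at startup into the module-level model variable referenced by /ping, is:

```python
from io import StringIO

import numpy as np
import pandas as pd
import xgboost as xgb

def preprocess(request_body, request_content_type):
    # Sketch: parse the featurized CSV rows emitted by the featurizer container
    # into a float array; the real script may handle columns differently.
    df = pd.read_csv(StringIO(request_body), header=None)
    return df.to_numpy(dtype=np.float32)

def predict(transformed_data):
    # Sketch: wrap the featurized array in a DMatrix and score it with the booster.
    dmatrix = xgb.DMatrix(transformed_data)
    predictions = model.predict(dmatrix)  # `model` is the pretrained XGBoost booster
    return predictions.tolist()           # lists serialize cleanly to JSON
```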

Here’s a snippet of the Dockerfile for hosting the predictor model. For full implementation refer to Dockerfile under predictor folder.

```docker
FROM ubuntu:18.04

…

# install required dependencies including flask, gunicorn, xgboost etc.,
RUN pip3 --no-cache-dir install  flask  gunicorn  gevent  numpy  pandas  xgboost

# Copy contents of code/ dir to /opt/program
COPY code /opt/program

# Set working dir to /opt/program which has the serve and inference.py scripts
WORKDIR /opt/program

# Expose port 8080 for serving
EXPOSE 8080

ENTRYPOINT ["python"]

# serve is a python script under code/ directory that launches nginx and gunicorn processes
CMD ["serve"]
```

We then build, test, and push this custom predictor image to a private repository in Amazon ECR. Refer to the predictor.ipynb notebook for the full implementation of building, testing, and pushing the custom image to Amazon ECR.

Deploy serial inference pipeline

After we have tested both the featurizer and predictor images and have pushed them to Amazon ECR, we now upload our model artifacts to an Amazon S3 bucket.

Then, we create two model objects: one for the featurizer (i.e., preprocess.joblib) and the other for the predictor (i.e., xgboost-model), specifying the custom image URIs we built earlier.

Here’s a snippet that shows that. Refer to serial-inference-pipeline.ipynb for full implementation.

```python
suffix = f"{str(uuid4())[:5]}-{datetime.now().strftime('%d%b%Y')}"

# Featurizer Model (SKLearn Model)
image_name = "<FEATURIZER_IMAGE_NAME>"
featurizer_ecr_repo_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{image_name}:latest"

featurizer_model_name = f"<FEATURIZER_MODEL_NAME>-{suffix}"
print(f"Creating Featurizer model: {featurizer_model_name}")
sklearn_model = Model(
    image_uri=featurizer_ecr_repo_uri,
    name=featurizer_model_name,
    model_data=featurizer_model_data,
    role=role,
)

# Full name of the ECR repository
predictor_image_name = "<PREDICTOR_IMAGE_NAME>"
predictor_ecr_repo_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{predictor_image_name}:latest"

# Predictor Model (XGBoost Model)
predictor_model_name = f"<PREDICTOR_MODEL_NAME>-{suffix}"
print(f"Creating Predictor model: {predictor_model_name}")
xgboost_model = Model(
    image_uri=predictor_ecr_repo_uri,
    name=predictor_model_name,
    model_data=predictor_model_data,
    role=role,
)
```

Now, to deploy these containers serially, we first create a PipelineModel object and pass the featurizer model and the predictor model, in that order, in a Python list.

Then, we call the .deploy() method on the PipelineModel specifying the instance type and instance count.

```python
from sagemaker.pipeline import PipelineModel

pipeline_model_name = f"Abalone-pipeline-{suffix}"

pipeline_model = PipelineModel(
    name=pipeline_model_name,
    role=role,
    models=[sklearn_model, xgboost_model],
    sagemaker_session=sm_session,
)

print(f"Deploying pipeline model {pipeline_model_name}...")
predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```

At this stage, Amazon SageMaker deploys the serial inference pipeline to a real-time endpoint. We wait for the endpoint to be InService.

We can now test the endpoint by sending some inference requests to this live endpoint.
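
As a quick sketch of such a request, you can call the endpoint with the SageMaker runtime client; the raw abalone record below reuses the unlabeled version of the CSV row from the local curl test, and the response is the JSON prediction returned by the XGBoost container:

```python
import boto3

# Sketch: invoke the deployed serial inference pipeline with a raw CSV record.
sm_runtime = boto3.client("sagemaker-runtime")

# Unlabeled abalone record (same features as the local curl test, without the label)
csv_record = "I,0.365,0.295,0.095,0.25,0.1075,0.0545,0.08"

response = sm_runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,  # endpoint created by pipeline_model.deploy()
    ContentType="text/csv",
    Body=csv_record,
)

print(response["Body"].read().decode("utf-8"))  # e.g. {"result": [...]}
```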

Refer to serial-inference-pipeline.ipynb for full implementation.

Clean up

After you are done testing, please follow the instructions in the cleanup section of the notebook to delete the resources provisioned in this post to avoid unnecessary charges. Refer to Amazon SageMaker Pricing for details on the cost of the inference instances.

```python
# Delete endpoint, model
try:
    print(f"Deleting model: {pipeline_model_name}")
    predictor.delete_model()
except Exception as e:
    print(f"Error deleting model: {pipeline_model_name}n{e}")
    pass

try:
    print(f"Deleting endpoint: {endpoint_name}")
    predictor.delete_endpoint()
except Exception as e:
    print(f"Error deleting EP: {endpoint_name}n{e}")
    pass

```

Conclusion

In this post, we showed how to build and deploy a serial ML inference application using custom inference containers on Amazon SageMaker real-time endpoints.

This solution demonstrates how customers can bring their own custom containers for hosting on Amazon SageMaker in a cost-efficient manner. With the BYOC option, customers can quickly build and adapt their ML applications to be deployed on Amazon SageMaker.

We encourage you to try this solution with a dataset relevant to your business Key Performance Indicators (KPIs). You can refer to the entire solution in this GitHub repository.



About the Author

Praveen Chamarthi is a Senior AI/ML Specialist with Amazon Web Services. He is passionate about AI/ML and all things AWS. He helps customers across the Americas to scale, innovate, and operate ML workloads efficiently on AWS. In his spare time, Praveen loves to read and enjoys sci-fi movies.

Read More

Google Research embarks on effort to map a mouse brain


The human brain is perhaps the most computationally complex machine in existence, consisting of networks of billions of cells. Researchers currently don’t understand the full picture of how glitches in its network machinery contribute to mental illnesses and other diseases, such as dementia. However, the emerging connectomics field, which aims to precisely map the connections between every cell in the brain, could help solve that problem. While maps have only been created for simpler organisms, technological advances for mapping even larger brains can enable us to understand how the human brain works, and how to treat brain diseases.

Today, we’re excited to announce that the Connectomics team at Google Research and our collaborators are launching a $33 million project to expand the frontiers of connectomics over the next five years. Supported by the Brain Research Through Advancing Innovative Neurotechnologies (BRAIN) Initiative at the National Institutes of Health (NIH) and led by researchers at Harvard University, we’ll be working alongside a multidisciplinary team of experts from the Allen Institute, MIT, Cambridge University, Princeton University and Johns Hopkins University, with advisers from HHMI’s Janelia Research Campus. Our project goal is to tackle an immense challenge in neuroscience: mapping a tiny fraction (2-3%) of the mouse brain. We will specifically target the hippocampal region, which is responsible for encoding memories, attention and spatial navigation. This project is one of 11 funded by the NIH’s $150 million BRAIN Initiative Connectivity Across Scales (BRAIN CONNECTS) program. Google Research is contributing computational and analytical resources to this effort, and will not receive any funding from the NIH. Our project asks a critical question: Can we scale and speed up our technologies enough to map the whole connectome of a mouse brain?

The modern era of connectomics

This effort to map the connectome of a small part of the mouse brain builds on a decade of innovation in the field, including many advances initiated by the Connectomics team at Google Research. We hope to accomplish something similar to the early days of the Human Genome Project, when scientists worked for years to sequence a small portion of the human genome as they refined technologies that would enable them to complete the rest of the genome.

In 2021, we and collaborators at Harvard successfully mapped one cubic millimeter of the human brain, which we released as the H01 dataset, a resource for studying the human brain and scaling connectomics technologies. But mapping the entire human brain connectome would require gathering and analyzing as much as a zettabyte of data (one billion terabytes), which is beyond the current capabilities of existing technologies.

Analyzing a mouse connectome is the next best thing. It is small enough to be technically feasible and could potentially deliver insights relevant to our own minds; neuroscientists already use mice to study human brain function and dysfunction. By working together to map 10–15 cubic mm of the mouse brain, we hope to develop new approaches that will allow us to map the entire remainder of the mouse brain, and the human brain thereafter.

Neuroscientists have been working for decades to map increasingly larger and more complicated connectomes.

One of biology’s largest datasets

In this connectomics project, we will map the connectome of the hippocampal formation of the mouse brain, which converts short-term memories into long-term memories and helps the mouse navigate in space. The mouse hippocampal formation is the largest area of any brain we’ve attempted to understand in this way. Through mapping this region of the mouse brain, we will create one of the largest datasets in biology, combining about 25,000 terabytes, or 25 petabytes of brain data. For reference, there are about 250 billion stars in our Milky Way Galaxy. If each of those stars was a single byte, it would take 100,000 Milky Way Galaxies to match the 25 petabytes of data that the project will collect when mapping a small region of the mouse brain.

To illustrate the hippocampal project’s scale, we calculated the number of Pixel phones (shown as stacks of Pixels below) needed to store the image data from the completed connectome projects that mapped the roundworm and fruit fly brains, as well as for the mouse hippocampal region and entire mouse brain projects, which are just getting started.

Then, we compared the heights of each Pixel stack to familiar objects and landmarks. It would take a stack of 100 Pixels, as tall as a four-year-old girl, to store the image data for the fruit fly brain, the largest completed project thus far. In contrast, the mouse hippocampal connectome effort will require storage equivalent to more than 48,800 Pixels, reaching as high as the Empire State Building. The animation below shows how the mouse hippocampal project will surpass the scale of previous connectome projects.

We are partnering with several collaborators to build a connectome (a map of the connections between brain cells) for the hippocampal region of a mouse brain. This project will create the largest connectomic dataset ever, surpassing the scale of previous projects that mapped the smaller roundworm and fruit fly brains. We hope this effort will lead to the development of new approaches that will allow us to later map an entire mouse brain. This animation shows how the field of connectomics is scaling up by calculating the number of Pixel phones needed to store the data from various projects. It would take just two Pixels, the height of an olive, to store the roundworm connectome data, while it would take a stack of Pixels the size of Mount Everest to store the data from an entire mouse connectome.

Understanding the connectome of the mouse hippocampal formation could help illuminate the way our own brains work. For instance, we may find common features between this circuitry in the mouse brain and human brains that explain how we know where we are, how our brains associate memories with specific locations, and what goes wrong in people who can’t properly form new spatial memories.

Opening the petabyte pipeline

Over the last decade, our team has worked to develop tools for managing massive connectomic datasets, and extracting scientific value from them. But a mouse brain has 1,000 times more neurons than the brain of the Drosophila fruit fly, an organism for which we helped build a connectome for a large part of the brain. Starting the mouse brain connectome will challenge us to improve existing technologies to enable us to map more data faster than ever before.

We’ll continue to refine our flood-filling networks, which use deep learning to trace, or “segment”, each neuron’s path through three-dimensional brain volumes made from electron microscope data. We’ll also extend the capabilities of our self-supervised learning technology, SegCLR, which allows us to automatically extract key insights from segmented volumes, such as identifying cell type (e.g., pyramidal neuron, basket neuron, etc.) and parts of each neuron (e.g., axon, dendrite, etc.).

A flood filling network traces a neuron through three-dimensional brain space.

We will also continue to enhance the scalability and performance of our core connectomics infrastructure, such as TensorStore for storage and Neuroglancer for visualization, in order to enable all of our computational pipelines and human analysis workflows to operate at these new scales of data. We’re eager to get to work to discover what peering into a mouse’s mind might tell us about our own.

Acknowledgements

The mouse connectomics project described in this blog post will be supported in part by the NIH BRAIN Initiative under award number 1UM1NS132250. Google Research is contributing computational and analytical resources to the mouse connectome project, and will not receive funding from the NIH. Many people were involved in the development of the technologies that make this project possible. We thank our long-term academic collaborators in the Lichtman Lab (Harvard University), HHMI Janelia, and the Denk Lab (Max Planck Institute for Biological Intelligence), and acknowledge core contributions from the Connectomics Team at Google. We also thank John Guilyard for creating the illustrative animation in this post, and Elise Kleeman, and Erika Check Hayden for their support. Thanks to Lizzie Dorfman, Michael Brenner, Jay Yagnik and Jeff Dean for their support, coordination and leadership.

Read More

AI Power Players: GeForce and NVIDIA RTX GPUs Supercharge Creativity, Gaming, Development, Productivity and More


Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep-diving on new GeForce RTX 40 Series GPU features, technologies and resources and how they dramatically accelerate content creation.

Improving gaming, creating and everyday productivity, NVIDIA RTX graphics cards feature specialized Tensor Cores that deliver cutting-edge performance and transformative capabilities for AI.

An arsenal of 100 million AI-ready, RTX-powered Windows 11 PCs and workstations is primed to excel in AI workflows and help launch a new wave of training models and apps.

This week In the NVIDIA Studio, learn more about these AI power-players and read about Victor de Martrin, a self-taught 3D artist who shares the creative process behind his viral video Ascension, which features an assist from AI and uses a new visual effect process involving point clouds.

AI on the Prize

AI can elevate both work and play.

Content creators with NVIDIA Studio hardware can access over 100 AI-enabled and RTX-accelerated creative apps, including the Adobe Creative Cloud suite, Blender, Blackmagic Design’s DaVinci Resolve, OBS Studio and more.

This includes the NVIDIA Studio suite of AI tools and exclusive next-generation technology such as NVIDIA DLSS 3.5, featuring Ray Reconstruction and OptiX. AI image generation on a GeForce RTX 4090-powered PC can be completed at lightning speed — the same task would take 8x longer with an Apple M2 Ultra.

3D modelers can benefit immensely from AI, with dramatic improvements in real-time viewport rendering and upscaling with DLSS in NVIDIA Omniverse, Adobe Substance, Adobe Painter, Blender, D5 Render, Unreal Engine and more.

D5 Render full ray-tracing preview with Ray Reconstruction.

Gamers are primed for AI boosts with DLSS in over 300 games, including Alan Wake 2, coming soon, and Cyberpunk 2077: Phantom Liberty, available for download today.

Modders can breathe new life into classic games with NVIDIA RTX Remix’s revolutionary AI upscaling and texture enhancements. Game studios deploying the NVIDIA Avatar Cloud Engine (ACE) can build intelligent game characters, interactive avatars and digital humans in apps at scale — saving time and money.

Everyday tasks also get a boost. AI can draft emails, summarize content and upscale video to stunning 4K on YouTube and Netflix with RTX Video Super Resolution. Frequent video chat users can harness the power of NVIDIA Broadcast for AI-enhanced voice and video features. And livestreamers can take advantage of AI-powered features like virtual backgrounds, noise removal and eye contact, which moves the eyes of the speaker to simulate eye contact with the camera.

NVIDIA hardware powers AI natively and in the cloud — including for the world’s leading cloud AIs like ChatGPT, Midjourney, Microsoft 365 Copilot, Adobe Firefly and more.

Get further with AI — faster on RTX — and learn more to get started.

AI on Windows

NVIDIA AI, the world’s leading AI development platform, is supported by 100 million AI-ready RTX-powered Windows 11 PCs and workstations.

State-of-the-art technologies behind the Windows platform and NVIDIA’s AI hardware and software enable GPU-accelerated workflows for training and deploying AI models, exclusive tools, containers, software development kits and new open-source models optimized for RTX.

100 million RTX-powered Windows 11 PCs and workstations are already AI-ready.

With these integrations, Windows and NVIDIA are making it easier for developers to create the next generation of AI-powered Windows apps.

AI on Viral Content

3D content creator Martrin saw content creation as a means to gain financial, creative and geographical freedom — as well as a sense of gratification that he wasn’t getting in his day job in advertising.

“To gain recognition as an artist nowadays, unless you have connections, you’ve got to grind it out on social media,” admitted Martrin. As he began to showcase artwork to wider audiences, Martrin unexpectedly enjoyed the process of creating art aimed at garnering social media attention.

Best known for his viral video series Satisfying Marble Music — in which he replicates melodies by simulating a marble bouncing on xylophone notes — Martrin also takes a keen interest in the latest trends and advancements in the world of 3D to consistently meet his clients’ evolving demands.

In his recent project Ascension, he experimented with a point-cloud modeling technique for the first time, aiming to create a transition between reality and 3D. 

Martrin used several 3D apps to refine video footage and animate himself in 3D. The initial stages involved modeling and animating a portrait of himself, which he used Daz Studio and Blender to accomplish. His PC, equipped with two NVIDIA GeForce RTX 4090 GPUs, easily handled this task.

Animations come to life in Blender.

He then used the GPU-accelerated Marvelous Designer cloth simulation engine to craft and animate his clothing.

Realistic clothes generated in Marvelous Designer.

Martrin stressed the importance of GPU acceleration in his creative workflow. “When did I use it? The countless times I checked my live viewer and during the creation of every render,” said Martrin.

From there, Martrin tested point-cloud modeling techniques. He first 3D-scanned his room. Then, instead of using the room as a conventional 3D mesh, he deployed it as a set of data points in space to create a unique visual style.

Background elements added in Cinema 4D.

He then used Cinema 4D and OctaneRender with RTX-accelerated ray tracing to render the animation.

Martrin completed the project by applying GPU-accelerated compositing and special effects in Adobe After Effects, achieving faster rendering with NVIDIA CUDA technology.

“No matter how tough it is at the start, keep grinding and challenging — odds are, success will eventually follow,” said Martrin.

3D content creator Victor de Martrin.

Check out Martrin’s viral content on TikTok.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Read More

Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction

3D-aware image synthesis encompasses a variety of tasks, such as scene generation and novel view synthesis from images. Despite numerous task-specific methods, developing a comprehensive model remains challenging. In this paper, we present SSDNeRF, a unified approach that employs an expressive diffusion model to learn a generalizable prior of neural radiance fields (NeRF) from multi-view images of diverse objects. Previous studies have used two-stage approaches that rely on pretrained NeRFs as real data to train diffusion models. In contrast, we propose a new single-stage training paradigm… (Apple Machine Learning Research)

Innovation for Inclusion: Hack.The.Bias with Amazon SageMaker


This post was co-authored with Daniele Chiappalupi, participant of the AWS student Hackathon team at ETH Zürich.

Everyone can easily get started with machine learning (ML) using Amazon SageMaker JumpStart. In this post, we show you how a university Hackathon team used SageMaker JumpStart to quickly build an application that helps users identify and remove biases.

“Amazon SageMaker was instrumental in our project. It made it easy to deploy and manage a pre-trained instance of Flan, offering us a solid foundation for our application. Its auto scaling feature proved crucial during high-traffic periods, ensuring that our app remained responsive and users received a steady and fast bias analysis. Further, by allowing us to offload the heavy task of querying the Flan model to a managed service, we were able to keep our application lightweight and swift, enhancing user experience across various devices. SageMaker’s features empowered us to maximize our time at the hackathon, allowing us to focus on optimizing our prompts and app rather than managing the model’s performance and infrastructure.”

– Daniele Chiappalupi, participant of the AWS student Hackathon team at ETH Zürich. 

Solution overview

The theme of the Hackathon was to contribute to the UN Sustainable Development Goals with AI technology. As shown in the following figure, the application built at the Hackathon contributes to three of the Sustainable Development Goals (quality education, gender equality, and reduced inequalities) by helping users identify and remove biases from their text in order to promote fair and inclusive language.

As shown in the following screenshot, after you provide the text, the application generates a new version that is free from racial, ethnic, and gender biases. Additionally, it highlights the specific parts of your input text related to each category of bias.

In the architecture shown in the following diagram, users input text in the React-based web app, which triggers Amazon API Gateway, which in turn invokes an AWS Lambda function with the user’s text. The Lambda function calls the Flan model endpoint in SageMaker JumpStart, which returns the unbiased text result via the same route back to the front-end application.

Application development process

The process of developing this application was iterative and centered on two main areas: user interface and ML model integration.

We chose React for the front-end development due to its flexibility, scalability, and powerful tools for creating interactive user interfaces. Given the nature of our application—processing user input and presenting refined results—React’s component-based architecture proved ideal. With React, we could efficiently build a single-page application that allowed users to submit text and see de-biased results without the need for constant page refreshes.

The text entered by the user needed to be processed by a powerful language model to scrutinize for biases. We chose Flan for its robustness, efficiency, and scalability properties. To utilize Flan, we used SageMaker JumpStart, as shown in the following screenshot. Amazon SageMaker made it easy to deploy and manage a pre-trained instance of Flan, allowing us to focus on optimizing our prompts and queries rather than managing the model’s performance and infrastructure.
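
For reference, deploying a FLAN-T5 model from SageMaker JumpStart programmatically can look roughly like the following sketch; the model ID and instance type are illustrative assumptions and may differ from what the team used:

```python
from sagemaker.jumpstart.model import JumpStartModel

# Sketch only: model_id and instance_type are illustrative assumptions.
flan_model = JumpStartModel(model_id="huggingface-text2text-flan-t5-xl")
flan_predictor = flan_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)

# The endpoint accepts a JSON payload like the one shown later in this post
response = flan_predictor.predict({"text_inputs": "Rewrite this sentence without bias: ..."})
print(response)
```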

Connecting the Flan model to our front-end application required a robust and secure integration, which was achieved using Lambda and API Gateway. With Lambda, we created a serverless function that communicates directly with our SageMaker model. We then used API Gateway to create a secure, scalable, and readily accessible endpoint for our React app to invoke the Lambda function. When a user submitted text, the app triggered a series of API calls to the gateway—first to identify if any bias was present, then, if necessary, additional queries to identify, locate, and neutralize the bias. All these requests were routed through the Lambda function and then to our SageMaker model.

Our final task in the development process was the selection of prompts to query the language model. Here, the CrowS-Pairs dataset played an instrumental role because it provided us with real examples of biased text, which we utilized to fine-tune our requests. We selected the prompts by an iterative process, with the objective of maximizing accuracy in bias detection within this dataset.

Wrapping up the process, we observed a seamless operational flow in the finished application. The process begins with a user submitting text for analysis, which is then sent via a POST request to our secure API Gateway endpoint. This triggers the Lambda function, which communicates with the SageMaker endpoint. Consequently, the Flan model receives a series of queries. The first checks for the presence of any biases in the text. If biases are detected, additional queries are deployed to locate, identify, and neutralize these biased elements. The results are then returned through the same path—first to the Lambda function, then through the API Gateway, and ultimately back to the user. If any bias was present in the original text, the user receives a comprehensive analysis indicating the types of biases detected, whether racial, ethnic, or gender. Specific sections of the text where these biases were found are highlighted, giving users a clear view of the changes made. Alongside this analysis, a new, de-biased version of their text is presented, effectively transforming potentially biased input into a more inclusive narrative.

In the following sections, we detail the steps to implement this solution.

Set up the React environment

We began by setting up our development environment for React. For bootstrapping a new React application with minimal configuration, we used create-react-app:

npx create-react-app my-app

Build the user interface

Using React, we designed a simple interface for users to input text, with a submission button, a reset button, and overlaying displays for presenting the processed results when they’re available.

Initiate the Flan model on SageMaker

We used SageMaker to create a pre-trained instance of the Flan language model with an endpoint for real-time inference. The model can be invoked with any JSON-structured payload like the following:

payload = {
      text_inputs: "text_inputs",
      max_length: <max_length>,
      num_return_sequences: <num_return_sequences>,
      top_k: <top_k>,
      top_p: <top_p>,
      do_sample: <do_sample>,
      num_beams: <num_beams>,
      seed: <seed>,
    };

Create a Lambda function

We developed a Lambda function that interacted directly with our SageMaker endpoint. The function was designed to receive a request with the user’s text, forward it to the SageMaker endpoint, and return the refined results, as shown in the following code (ENDPOINT_NAME was set up as the SageMaker instance endpoint):

import os
import io
import boto3
import json
import csv

# grab environment variables
ENDPOINT_NAME = os.environ['ENDPOINT_NAME']
runtime= boto3.client('runtime.sagemaker')

def lambda_handler(event, context):
    data = json.loads(json.dumps(event))
    payload = json.dumps(data['data']).encode('utf-8')

    query_response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json', 
        Body=payload)

    response_dict = json.loads(query_response['Body'].read())

    return response_dict['generated_texts']

Set up API Gateway

We configured a new REST API in API Gateway and linked it to our Lambda function. This connection allowed our React application to make HTTP requests to the API Gateway, which subsequently triggered the Lambda function.

Integrate the React app with the API

We updated the React application to make a POST request to the API Gateway when the submit button was clicked, with the body of the request being the user’s text. The JavaScript code we used to perform the API call is as follows (REACT_APP_AWS_ENDPOINT corresponds to the API Gateway endpoint bound to the Lambda call):

import axios from 'axios';

const makeAWSApiCall = (
    textInputs,
    maxLength,
    numReturnSequences,
    topK,
    topP,
    doSample,
    numBeams
  ) => {
    const axiosRequestUrl =
      `${process.env.REACT_APP_AWS_ENDPOINT}`;
    const requestData = {
      text_inputs: textInputs,
      max_length: maxLength,
      num_return_sequences: numReturnSequences,
      top_k: topK,
      top_p: topP,
      do_sample: doSample,
      num_beams: numBeams,
      seed: 8,
    };

    return axios.post(axiosRequestUrl, { data: requestData });
  };

Optimize prompt selection

To improve the accuracy of bias detection, we tested different prompts against the CrowS-Pairs dataset. Through this iterative process, we chose the prompts that gave us the highest accuracy.
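The following sketch shows one way such an evaluation can be scripted. The CSV column names, prompt templates, and the query_endpoint helper (from the earlier sketch) are illustrative placeholders, and treating the less stereotypical sentence in each pair as the unbiased class is a simplification:

import csv

CANDIDATE_PROMPTS = [
    "Does the following sentence contain bias? Answer yes or no.\n{sentence}",
    "Is the following sentence stereotypical or offensive? Answer yes or no.\n{sentence}",
]

def load_crows_pairs(path):
    """Yield (more_stereotypical, less_stereotypical) sentence pairs from the CrowS-Pairs CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row["sent_more"], row["sent_less"]  # assumed column names

def prompt_accuracy(prompt_template, pairs):
    """Score a prompt by how often the model flags the stereotypical sentence and not its counterpart."""
    correct = total = 0
    for biased, neutral in pairs:
        for sentence, expected in ((biased, "yes"), (neutral, "no")):
            answer = query_endpoint(prompt_template.format(sentence=sentence))  # helper from the earlier sketch
            correct += int(answer.strip().lower().startswith(expected))
            total += 1
    return correct / total

pairs = list(load_crows_pairs("crows_pairs_anonymized.csv"))
best_prompt = max(CANDIDATE_PROMPTS, key=lambda p: prompt_accuracy(p, pairs))
print(best_prompt)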

Deploy and test the React app on Vercel

After building the application, we deployed it on Vercel to make it publicly accessible. We conducted extensive tests to ensure the application functioned as expected, from the user interface to the responses from the language model.

These steps laid the groundwork for creating our application for analyzing and de-biasing text. Despite the inherent complexity of the process, the use of tools like SageMaker, Lambda, and API Gateway streamlined the development, allowing us to focus on the core goal of the project—identifying and eliminating biases in text.

Conclusion

SageMaker JumpStart offers a convenient way to explore the features and capabilities of SageMaker. It provides curated one-step solutions, example notebooks, and deployable pre-trained models. These resources allow you to quickly learn and understand SageMaker. Additionally, you have the option to fine-tune the models and deploy them according to your specific needs. Access to JumpStart is available through Amazon SageMaker Studio or programmatically using the SageMaker APIs.

In this post, you learned how a student Hackathon team developed a solution in a short time using SageMaker JumpStart, which shows the potential of AWS and SageMaker JumpStart in enabling rapid development and deployment of sophisticated AI solutions, even by small teams or individuals.

To learn more about using SageMaker JumpStart, refer to Instruction fine-tuning for FLAN T5 XL with Amazon SageMaker Jumpstart and Zero-shot prompting for the Flan-T5 foundation model in Amazon SageMaker JumpStart.

ETH Analytics Club hosted ‘ETH Datathon,’ an AI/ML hackathon that drew more than 150 participants from ETH Zurich, University of Zurich, and EPFL. The event featured workshops led by industry leaders, a 24-hour coding challenge, and valuable networking opportunities with fellow students and industry professionals. Many thanks to the ETH Hackathon team: Daniele Chiappalupi, Athina Nisioti, and Francesco Ignazio Re, as well as the rest of the AWS organizing team: Alice Morano, Demir Catovic, Iana Peix, Jan Oliver Seidenfuss, Lars Nettemann, and Markus Winterholer.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the authors

Jun Zhang is a Solutions Architect based in Zurich. He helps Swiss customers architect cloud-based solutions to achieve their business potential. He has a passion for sustainability and strives to solve current sustainability challenges with technology. He is also a huge tennis fan and enjoys playing board games a lot.

Mohan Gowda leads the Machine Learning team at AWS Switzerland. He works primarily with automotive customers to develop innovative AI/ML solutions and platforms for next-generation vehicles. Before joining AWS, Mohan worked with a global management consulting firm with a focus on strategy and analytics. His passion lies in connected vehicles and autonomous driving.

Matthias Egli is the Head of Education in Switzerland. He is an enthusiastic Team Lead with a broad experience in business development, sales, and marketing.

Kemeng Zhang is an ML Engineer based in Zurich. She helps global customers design, develop, and scale ML-based applications to empower their digital capabilities to increase business revenue and reduce cost. She is also very passionate about creating human-centric applications by leveraging knowledge from behavioral science. She likes playing water sports and walking dogs.

Daniele Chiappalupi is a recent graduate from ETH Zürich. He enjoys every aspect of software engineering, from design to implementation, and from deployment to maintenance. He has a deep passion for AI and eagerly anticipates exploring, utilizing, and contributing to the latest advancements in the field. In his free time, he loves going snowboarding during colder months and playing pick-up basketball when the weather warms up.

Read More

Improve throughput performance of Llama 2 models using Amazon SageMaker

We’re at an exciting inflection point in the widespread adoption of machine learning (ML), and we believe most customer experiences and applications will be reinvented with generative AI. Generative AI can create new content and ideas, including conversations, stories, images, videos, and music. Like most AI, generative AI is powered by ML models: very large models that are trained on vast amounts of data and commonly referred to as foundation models (FMs). FMs are based on transformers. Transformers are slow and memory-hungry when generating long text sequences due to the sheer size of the models. Large language models (LLMs) used to generate text sequences need immense amounts of computing power and have difficulty accessing the available high bandwidth memory (HBM) and compute capacity. This is because a large portion of the available memory bandwidth is consumed by loading the model’s parameters and by the auto-regressive decoding process. As a result, even with massive amounts of compute power, LLMs are limited by memory I/O and computation limits, preventing them from taking full advantage of the available hardware resources.

Overall, generative inference of LLMs has three main challenges (according to Pope et al. 2022):

  • A large memory footprint due to massive model parameters and transient state during decoding. The parameters often exceed the memory of a single accelerator chip. Attention key-value caches also require substantial memory.
  • Low parallelizability increases latency, especially with the large memory footprint, requiring substantial data transfers to load parameters and caches into compute cores each step. This results in high total memory bandwidth needs to meet latency targets.
  • Quadratic scaling of attention mechanism compute relative to sequence length compounds the latency and computational challenges.

Batching is one of the techniques to address these challenges. Batching refers to the process of sending multiple input sequences together to an LLM, thereby optimizing the performance of LLM inference. This approach helps improve throughput because model parameters don’t need to be loaded for every input sequence. The parameters can be loaded one time and used to process multiple input sequences. Batching efficiently utilizes the accelerator’s HBM bandwidth, resulting in higher compute utilization, improved throughput, and cost-effective inference.

This post examines techniques to maximize the throughput using batching techniques for parallelized generative inference in LLMs. We discuss different batching methods to reduce memory footprint, increase parallelizability, and mitigate the quadratic scaling of attention to boost throughput. The goal is to fully use hardware like HBM and accelerators to overcome bottlenecks in memory, I/O, and computation. Then we highlight how Amazon SageMaker large model inference (LMI) deep learning containers (DLCs) can help with these techniques. Finally, we present a comparative analysis of throughput improvements with each batching strategy on SageMaker using LMI DLCs to improve throughput for models like Llama v2. You can find an accompanying example notebook in the SageMaker examples GitHub repository.

Inferencing for large language models (LLMs)

Autoregressive decoding is the process by which language models like GPT generate text output one token at a time. It involves recursively feeding generated tokens back into the model as part of the input sequence in order to predict subsequent tokens. The steps are as follows (a minimal code sketch follows the list):

  1. The model receives the previous tokens in the sequence as input. For the first step, this is the starting prompt provided by the user.
  2. The model predicts a distribution over the vocabulary for the next token.
  3. The token with the highest predicted probability is selected and appended to the output sequence. Steps 2 and 3 are part of the decoding strategy. As of this writing, the most prominent decoding methods are greedy search, beam search, contrastive search, and sampling.
  4. This new token is added to the input sequence for the next decoding step.
  5. The model iterates through these steps, generating one new token per step, until an end-of-sequence marker is produced or the desired output length is reached.
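The following minimal sketch implements this loop with greedy decoding using the Hugging Face Transformers library; the model choice (GPT-2) and generation length are illustrative assumptions:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal language model works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Step 1: start from the user-provided prompt.
input_ids = tokenizer("Batching improves throughput because", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(32):  # generate at most 32 new tokens
        logits = model(input_ids).logits                            # step 2: distribution over the vocabulary
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # step 3: greedy selection
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # step 4: append to the input sequence
        if next_token.item() == tokenizer.eos_token_id:             # step 5: stop at end-of-sequence
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))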

Model serving for LLMs

Model serving for LLMs refers to the process of receiving input requests for text generation, making inferences, and returning the results to the requesting applications. The following are key concepts involved in model serving:

  • Clients generate multiple inference requests, with each request consisting of a sequence of tokens or an input prompt
  • Requests are received by the inference server (for example, DJLServing, TorchServe, Triton, or Hugging Face TGI)
  • The inference server batches the inference requests and schedules the batch to the execution engine, which includes model partitioning libraries (such as Transformers-NeuronX, DeepSpeed, Accelerate, or FasterTransformer), to run the forward pass (predicting the output token sequence) on the generative language model
  • The execution engine generates response tokens and sends the response back to the inference server
  • The inference server replies to the clients with the generated results

Request-level scheduling poses challenges when the inference server interacts with the execution engine at the request level: each request uses its own Python process, which requires a separate copy of the model and is therefore memory restrictive. For example, as shown in the following figure, you can only load a single copy of a model of size 80 GB on a machine learning (ML) instance with 96 GB of total accelerator device memory. You would need to load an additional copy of the entire model to serve additional requests concurrently, which is neither memory efficient nor cost efficient.

Now that we understand challenges posed by request-level scheduling, let’s look at different batching techniques that can help optimize throughput.

Batching techniques

In this section, we explain different batching techniques and show how to implement them using a SageMaker LMI container.

There are two main types of batching for inference requests:

  • Client-side (static) – Typically, when a client sends a request to a server, the server processes each request sequentially by default, which is not optimal for throughput. To improve throughput, the client batches multiple inference requests into a single payload, and the server implements preprocessing logic to break the batch down into individual requests and run inference on each one separately (see the sketch after this list). With this option, the client needs to change its code to do the batching, and the solution is tightly coupled to the batch size.
  • Server-side (dynamic) – Another technique is to rely on the inference server to perform the batching. As independent inference requests arrive, the inference server can dynamically group them into larger batches on the server side. The inference server can manage the batching to meet a specified latency target, maximizing throughput while staying within the desired latency range. The inference server handles this automatically, so no client-side code changes are needed. Server-side batching includes different techniques to further optimize throughput for generative language models based on auto-regressive decoding. These techniques include dynamic batching, continuous batching, and PagedAttention (vLLM) batching.
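As a minimal sketch of the client-side option, several prompts can be packed into a single payload and sent in one call; the endpoint name and payload schema below are assumptions and depend on the model server’s handler:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Pack several prompts into one payload; the server-side handler is expected to
# split the batch and run inference for each prompt separately.
prompts = [
    "Summarize: batching amortizes parameter loading across requests.",
    "Summarize: the KV cache grows with sequence length.",
    "Summarize: prefill is more expensive than decode.",
]

response = runtime.invoke_endpoint(
    EndpointName="llama2-7b-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({"inputs": prompts, "parameters": {"max_new_tokens": 64}}),  # assumed schema
)
print(json.loads(response["Body"].read()))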

Dynamic batching

Dynamic batching refers to combining the input requests and sending them together as a batch for inference. Dynamic batching is a generic server-side batching technique that works for all tasks, including computer vision (CV), natural language processing (NLP), and more.

In an LMI container, you can configure the batching of requests based on the following settings in serving.properties:

  • batch_size – Refers to the size of the batch
  • max_batch_delay – Refers to the maximum delay for batch aggregation

If either of these thresholds is met (the maximum batch size is reached or the waiting period completes), a new batch is prepared and pushed to the model for inferencing. The following diagram shows the dynamic batching of requests with different input sequence lengths being processed together by the model.

You can implement dynamic batching on SageMaker by configuring the LMI container’s serving.properties as follows:

#Dynamic Batching
engine=Python
option.entryPoint=djl_python.huggingface
batch_size=64 #example
max_batch_delay=1000 #example
option.tensor_parallel_degree=2 #example

Although dynamic batching can provide up to a four-times increase in throughput compared to no batching, we observe that GPU utilization is not optimal in this case because the system can’t accept another batch until all requests have completed processing.

Continuous batching

Continuous batching is an optimization specific to text generation. It improves throughput without sacrificing time-to-first-byte latency. Continuous batching (also known as iterative or rolling batching) addresses the challenge of idle GPU time and builds on the dynamic batching approach by continuously pushing new requests into the batch. The following diagram shows continuous batching of requests. When requests 2 and 3 finish processing, another set of requests is scheduled.

The following interactive diagram dives deeper into how continuous batching works.

(Courtesy: https://github.com/InternLM/lmdeploy)

A powerful technique for making LLM text generation efficient is caching some of the attention matrices. This means the first pass over a prompt is different from the subsequent forward passes. In the first pass, the entire attention matrix must be computed, whereas the follow-up passes only require computing the attention for the new token. The first pass is commonly called prefill, and the follow-up passes are called decode. Because prefill is much more expensive than decode, we don’t want to run it all the time, while a currently running query is likely in the decode phase. To use continuous batching as explained previously, prefill must be run at some point to create the attention cache a new request needs before it can join the decode group.

This technique may allow up to a 20-times increase in throughput compared to no batching by effectively utilizing the idle GPUs.

You can fine-tune the following parameters in serving.properties of the LMI container for using continuous batching:

  • engine – The runtime engine of the code. Values include Python, DeepSpeed, FasterTransformer, and MPI. Use MPI to enable continuous batching.
  • rolling_batch – Enables iteration-level batching using one of the supported strategies. Values include auto, scheduler, and lmi-dist. We use lmi-dist for turning on continuous batching for Llama 2.
  • max_rolling_batch_size – Limits the number of concurrent requests in the continuous batch. Defaults to 32.
  • max_rolling_batch_prefill_tokens – Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU out of memory. It’s only supported when rolling_batch=lmi-dist. Our recommendation is to set the value based on the number of concurrent requests multiplied by the memory required to store the input and output tokens per request.

The following is sample code for serving.properties for configuring continuous batching:

#Continuous Batching
engine=MPI
option.entryPoint=djl_python.huggingface
option.rolling_batch=auto
option.max_rolling_batch_size=64 #example
option.paged_attention=false
option.max_rolling_batch_prefill_tokens=16080 #example
option.tensor_parallel_degree=2 #example

PagedAttention batching

In the autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate next tokens. These cached key and value tensors are often referred to as the KV cache or attention cache. As per the paper vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention, the KV cache takes up to 1.7 GB for a single sequence in Llama 13B. It is also dynamic. Its size depends on the sequence length, which is highly variable and unpredictable. As a result, efficiently managing the KV cache presents a significant challenge. The paper found that existing systems waste 60–80% of memory due to fragmentation and over-reservation.

PagedAttention is a new optimization algorithm developed by UC Berkeley that improves the continuous batching process by allowing the attention cache (KV cache) to be non-contiguous by allocating memory in fixed-size pages or blocks. This is inspired by virtual memory and paging concepts used by operating systems.

As per the vLLM paper, the attention cache of each sequence of tokens is partitioned into blocks and mapped to physical blocks through a block table. During the computation of attention, a PagedAttention kernel can use the block table to efficiently fetch the blocks from physical memory. This results in a significant reduction of memory waste and allows for larger batch size, increased GPU utilization, and higher throughput. The following figure illustrates partitioning the attention cache into non-contiguous pages.

The following diagram shows an inference example with PagedAttention. The key steps are:

  1. The inference request is received with an input prompt.
  2. In the prefill phase, attention is computed and key-values are stored in non-contiguous physical memory and mapped to logical key-value blocks. This mapping is stored in a block table.
  3. The input prompt is run through the model (a forward pass) to generate the first response token. During the response token generation, the attention cache from the prefill phase is used.
  4. During subsequent token generation, if the current physical block is full, additional memory is allocated in a non-contiguous fashion, allowing just-in-time allocation.

PagedAttention helps in near-optimal memory usage and reduction of memory waste. This allows for more requests to be batched together, resulting in a significant increase in throughput of inferencing.
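To make the block-table idea concrete, the following toy sketch (not vLLM code) tracks the KV-cache blocks of a single sequence, allocating a new, possibly non-contiguous physical block only when the current one fills up; the block size and pool size are arbitrary illustrative values:

BLOCK_SIZE = 4                              # tokens per KV-cache block (illustrative)
free_physical_blocks = list(range(16))      # pool of available physical block IDs
block_table = []                            # logical block index -> physical block ID
tokens_in_last_block = 0

def append_token(token_index):
    """Record the KV-cache slot for one new token, allocating a block just in time."""
    global tokens_in_last_block
    if not block_table or tokens_in_last_block == BLOCK_SIZE:
        # Allocate the next free physical block; it need not be adjacent to the previous one.
        block_table.append(free_physical_blocks.pop(0))
        tokens_in_last_block = 0
    tokens_in_last_block += 1
    return block_table[-1], tokens_in_last_block - 1  # (physical block, slot within block)

for t in range(10):
    print(f"token {t} -> physical block and slot {append_token(t)}")
# With several sequences sharing the pool, their physical blocks interleave, which is
# what reduces the fragmentation and over-reservation described above.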

The following code is a sample serving.properties for configuring PagedAttention batching in an LMI container on SageMaker:

#Paged Attention Batching
engine=MPI
option.entryPoint=djl_python.huggingface
option.rolling_batch=auto
option.max_rolling_batch_size=64 #example
option.paged_attention=true
option.max_rolling_batch_prefill_tokens=16080 #example
option.tensor_parallel_degree=2 #example
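As a rough sketch, such a serving.properties file can be packaged and deployed with the SageMaker Python SDK along the following lines; the container framework name and version, S3 location, instance type, and endpoint name are illustrative assumptions (refer to the example notebook in the SageMaker examples GitHub repository for exact values):

import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Package serving.properties (and any model code) as mymodel.tar.gz and upload it to S3.
code_artifact = session.upload_data("mymodel.tar.gz", session.default_bucket(), "llama2-lmi")

# Retrieve an LMI (DJL) container image; framework name and version are assumptions.
image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=session.boto_region_name, version="0.23.0"
)

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.24xlarge",
    endpoint_name="llama2-7b-lmi-batching",  # hypothetical endpoint name
)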

When to use which batching technique

The following figure summarizes the server-side batching techniques along with the sample serving.properties in LMI on SageMaker.

The following summarizes the different batching techniques and their use cases.

  • PagedAttention batching – How it works: always merges new requests at the token level, along with paged KV-cache blocks, and runs batch inference. When it works best: this is the recommended approach for the supported decoder-only models; it’s suitable for throughput-optimized workloads and applicable only to text-generation models.
  • Continuous batching – How it works: always merges new requests at the token level and runs batch inference. When it works best: concurrent requests arriving at different times with the same decoding strategy; it’s suitable for throughput-optimized workloads and applicable only to text-generation models.
  • Dynamic batching – How it works: merges new requests at the request level; can delay for a few milliseconds to form a batch. When it works best: concurrent requests arriving at different times with the same decoding strategy; it’s suitable for response time-sensitive workloads needing higher throughput and applicable to CV, NLP, and other types of models.
  • Client-side batching – How it works: the client is responsible for batching multiple inference requests in the same payload before sending it to the inference server. When it works best: offline inference use cases that don’t have latency constraints, to maximize throughput.
  • No batching – How it works: when a request arrives, the inference runs immediately. When it works best: infrequent inference requests or requests with different decoding strategies; suitable for workloads with strict response-time latency needs.

Throughput comparison of different batching techniques for a large generative model on SageMaker

We performed performance benchmarking on a Llama v2 7B model on SageMaker using an LMI container and the different batching techniques discussed in this post with concurrent incoming requests of 50 and a total number of requests of 5,000.

We used three different input prompts of variable lengths for the performance test. In continuous and PagedAttention batching, the output tokens lengths were set to 64, 128, and 256 for the three input prompts, respectively. For dynamic batching, we used a consistent output token length of 128 tokens. We deployed SageMaker endpoints for the test with an instance type of ml.g5.24xlarge. The following table contains the results of the performance benchmarking tests.

Model Batching Strategy Requests per Second on ml.g5.24xlarge
LLaMA2-7b Dynamic Batching 3.24
LLaMA2-7b Continuous Batching 6.92
LLaMA2-7b PagedAttention Batching 7.41

We see an increase of approximately 2.3 times in throughput by using PagedAttention batching in comparison to dynamic batching for the Llama2-7B model on SageMaker using an LMI container.

Conclusion

In this post, we explained different batching techniques for LLM inference and how they help increase throughput. We showed how memory optimization techniques can increase hardware efficiency by using continuous and PagedAttention batching, providing higher throughput than dynamic batching. We saw an increase of approximately 2.3 times in throughput by using PagedAttention batching in comparison to dynamic batching for a Llama2-7B model on SageMaker using an LMI container. You can find the notebook used for testing the different batching techniques on GitHub.


About the authors

Gagan Singh is a Senior Technical Account Manager at AWS, where he partners with digital native startups to pave their path to heightened business success. Specializing in propelling Machine Learning initiatives, he leverages Amazon SageMaker, with a particular emphasis on Deep Learning and Generative AI solutions. In his free time, Gagan finds solace in trekking on the trails of the Himalayas and immersing himself in diverse music genres.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Venugopal Pai is a Solutions Architect at AWS. He lives in Bengaluru, India, and helps digital native customers scale and optimize their applications on AWS.

Read More