Generate actionable insights for predictive maintenance management with Amazon Monitron and Amazon Kinesis

Reliability managers and technicians in industrial environments such as manufacturing production lines, warehouses, and industrial plants are keen to improve equipment health and uptime to maximize product output and quality. Machine and process failures are often addressed reactively after incidents happen, or through costly preventive maintenance, where you run the risk of over-maintaining equipment or missing issues that arise between periodic maintenance cycles. Predictive condition-based maintenance is a proactive strategy that improves on both: it combines continuous monitoring, predictive analytics, and just-in-time action, so maintenance and reliability teams service equipment only when necessary, based on its actual condition.

Generating actionable insights from condition-based monitoring across large industrial asset fleets comes with common challenges. These include, but are not limited to, building and maintaining a complex infrastructure of sensors collecting data from the field, obtaining a reliable high-level summary of the asset fleet, efficiently managing failure alerts, identifying possible root causes of anomalies, and effectively visualizing the state of industrial assets at scale.

Amazon Monitron is an end-to-end condition monitoring solution that enables you to start monitoring equipment health with the aid of machine learning (ML) in minutes, so you can implement predictive maintenance and reduce unplanned downtime. It includes sensor devices to capture vibration and temperature data, a gateway device to securely transfer data to the AWS Cloud, the Amazon Monitron service that analyzes the data for anomalies with ML, and a companion mobile app to track potential failures in your machinery. Your field engineers and operators can directly use the app to diagnose and plan maintenance for industrial assets.

From the operational technology (OT) team standpoint, using the Amazon Monitron data also opens up new ways to improve how they operate large industrial asset fleets thanks to AI. OT teams can reinforce the predictive maintenance practice of their organization by building a consolidated view across multiple hierarchies (assets, sites, and plants). They can combine actual measurements and ML inference results with unacknowledged alarms, sensor or gateway connectivity status, or asset state transitions to build a high-level summary for the scope (asset, site, or project) they are focused on.

With the recently launched Amazon Monitron Kinesis data export v2 feature, your OT team can stream incoming measurement data and inference results from Amazon Monitron via Amazon Kinesis to Amazon Simple Storage Service (Amazon S3) to build an Internet of Things (IoT) data lake. By leveraging the latest data export schema, you can obtain sensor connectivity status, gateway connectivity status, measurement classification results, closure reason codes, and details of asset state transition events.

Use cases overview

The enriched data stream Amazon Monitron now exposes enables you to implement several key use cases, such as automated work order creation, enriching an operational single pane of glass, or automating failure reporting. Let’s dive into these use cases.

You can use the Amazon Monitron Kinesis data export v2 to create work orders in Enterprise Asset Management (EAM) systems such as Infor EAM, SAP Asset Management, or IBM Maximo. For example, in the video Avoiding mechanical issues with predictive maintenance & Amazon Monitron, you can discover how Amazon fulfillment centers avoid mechanical issues on conveyor belts with Amazon Monitron sensors integrated with third-party software, such as the EAM system used at Amazon, as well as with the chat rooms technicians use. This shows how you can naturally integrate Amazon Monitron insights into your existing workflows. Stay tuned in the coming months for the next installment of this series, with an actual implementation of this integration.

You can also use the data stream to ingest Amazon Monitron insights back into a shop floor system such as a Supervisory Control and Data Acquisition (SCADA) or a Historian. Shop floor operators are more efficient when all the insights about their assets and processes are provided in a single pane of glass. In this concept, Amazon Monitron doesn’t become yet another tool technicians have to monitor, but another data source with insights provided in the single view they are already used to. Later this year, we will also describe an architecture you can use to perform this task and send Amazon Monitron feedback to major third-party SCADA systems and Historians.

Last but not least, the new data stream from Amazon Monitron includes the asset state transitions and closure codes provided by users when acknowledging alarms (which trigger the transition to a new state). With this data, you can automatically build visualizations that provide real-time reporting of the failures and the actions taken while operating your assets.

Your team can then build a broader data analytics dashboard to support your industrial fleet management practice by combining this asset state data with Amazon Monitron measurement data and other IoT data across large industrial asset fleets, using key AWS services that we describe in this post. We explain how to build an IoT data lake, the workflow to produce and consume the data, and a summary dashboard to visualize Amazon Monitron sensor data and inference results. We use an Amazon Monitron dataset coming from about 780 sensors installed in an industrial warehouse, which has been running for more than 1 year. For the detailed Amazon Monitron installation guide, refer to Getting started with Amazon Monitron.

Solution overview

Amazon Monitron provides ML inference of asset health status after 21 days of the ML model training period for each asset. In this solution, the measurement data and ML inference from these sensors are exported to Amazon S3 via Amazon Kinesis Data Streams by using the latest Amazon Monitron data export feature. As soon as Amazon Monitron IoT data is available in Amazon S3, a database and table are created in Amazon Athena by using an AWS Glue crawler. You can query Amazon Monitron data via AWS Glue tables with Athena, and visualize the measurement data and ML inference with Amazon Managed Grafana. With Amazon Managed Grafana, you can create, explore, and share observability dashboards with your team, and spend less time managing your Grafana infrastructure. In this post, you connect Amazon Managed Grafana to Athena, and learn how to build a data analytics dashboard with Amazon Monitron data to help you plan industrial asset operations at scale.

The following screenshot is an example of what you can achieve at the end of this post. This dashboard is divided into three sections:

  • Plant View – Analytical information from all sensors across plants; for example, the overall counts of various states of sensors (Healthy, Warning, or Alarm), number of unacknowledged and acknowledged alarms, gateway connectivity, and average time for maintenance
  • Site View – Site-level statistics, such as asset status statistics at each site, total number of days that an alarm remains unacknowledged, top/bottom performing assets at each site, and more
  • Asset View – Summary information for the Amazon Monitron project at the asset level, such as the alarm type for an unacknowledged alarm (ISO or ML), the timeline for an alarm, and more

These panels are examples that can help with strategic operational planning, but they are not exhaustive. You can use a similar workflow to customize the dashboard according to your targeted KPIs.



Architecture overview

The solution you will build in this post combines Amazon Monitron, Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon S3, AWS Glue, Athena, and Amazon Managed Grafana.

The following diagram illustrates the solution architecture. Amazon Monitron sensors measure and detect anomalies from equipment. Both measurement data and ML inference outputs are exported at a frequency of once per hour to a Kinesis data stream, and they are delivered to Amazon S3 via Kinesis Data Firehose with a 1-minute buffer. The exported Amazon Monitron data is in JSON format. An AWS Glue crawler analyzes the Amazon Monitron data in Amazon S3 at a chosen frequency of once per hour, builds a metadata schema, and creates tables in Athena. Finally, Amazon Managed Grafana uses Athena to query the Amazon S3 data, allowing dashboards to be built to visualize both measurement data and device health status.

To build this solution, you complete the following high-level steps:

  1. Enable a Kinesis Data Stream export from Amazon Monitron and create a data stream.
  2. Configure Kinesis Data Firehose to deliver data from the data stream to an S3 bucket.
  3. Build the AWS Glue crawler to create a table of Amazon S3 data in Athena.
  4. Create a dashboard of Amazon Monitron devices with Amazon Managed Grafana.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Additionally, make sure that all the resources you deploy are in the same Region.

Enable a Kinesis data stream export from Amazon Monitron and create a data stream

To configure your data stream export, complete the following steps:

  1. On the Amazon Monitron console, from your project’s main page, choose Start live data export.
  2. Under Select Amazon Kinesis data stream, choose Create a new data stream.
  3. Under Data stream configuration, enter your data stream name.
  4. For Data stream capacity, choose On-demand.
  5. Choose Create data stream.

Note that any live data export enabled after April 4th, 2023 will stream data following the Kinesis Data Streams v2 schema. If you have an existing data export that was enabled before this date, the schema will follow the v1 format.

You can now see live data export information on the Amazon Monitron console with your specified Kinesis data stream.
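If you prefer to provision the stream programmatically instead of through the console, the following sketch shows one way to create an equivalent on-demand stream with boto3. The stream name is an assumption; use whichever name you select when enabling the export.

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream name; match it to the name you choose for the Amazon Monitron export
stream_name = "monitron-live-data-v2"

# On-demand capacity mode removes the need to plan shard capacity up front
kinesis.create_stream(
    StreamName=stream_name,
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Wait until the stream is ACTIVE before enabling the live data export
kinesis.get_waiter("stream_exists").wait(StreamName=stream_name)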

Configure Kinesis Data Firehose to deliver data to an S3 bucket

To configure your Firehose delivery stream, complete the following steps:

  1. On the Kinesis console, choose Delivery streams in the navigation pane.
  2. Choose Create delivery stream.
  3. For Source, select Amazon Kinesis Data Streams.
  4. For Destination, select Amazon S3.
  5. Under Source settings, for Kinesis data stream, enter the ARN of your Kinesis data stream.
  6. Under Delivery stream name, enter a name for your delivery stream.
  7. Under Destination settings, choose an S3 bucket or enter a bucket URI. You can either use an existing S3 bucket to store Amazon Monitron data, or you can create a new S3 bucket.
  8. Enable dynamic partitioning using inline parsing for JSON:
    • Choose Enabled for Dynamic partitioning.
    • Choose Enabled for Inline parsing for JSON.
    • Under Dynamic partitioning keys, add the following partition keys:
Key name    JQ expression
project     .projectName | "project=\(.)"
site        .eventPayload.siteName | "site=\(.)"
asset       .eventPayload.assetName | "asset=\(.)"
position    .eventPayload.positionName | "position=\(.)"
time        .timestamp | sub(" [0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3}$"; "") | "time=\(.)"
  9. Choose Apply dynamic partitioning keys and confirm the generated S3 bucket prefix is:
!{partitionKeyFromQuery:project}/!{partitionKeyFromQuery:site}/!{partitionKeyFromQuery:asset}/!{partitionKeyFromQuery:position}/!{partitionKeyFromQuery:time}/.
  10. Enter an S3 bucket error output prefix. Any JSON payload that doesn’t contain the keys described earlier will be delivered in this prefix. For instance, the gatewayConnected and gatewayDisconnected events are not linked to a given asset or position, so they won’t contain the assetName and positionName fields. Specifying this optional prefix allows you to monitor this location and process these events accordingly.
  11. Choose Create delivery stream.

You can inspect the Amazon Monitron data in the S3 bucket. Note that Amazon Monitron exports live data once per hour, so wait for up to an hour before inspecting the data.

This Kinesis Data Firehose setup enables dynamic partitioning, and the S3 objects delivered will use the following key format:

/project={projectName}/site={siteDisplayName}/asset={assetDisplayName}/position={sensorPositionDisplayName}/time={yyyy-mm-dd 00:00:00}/{filename}.
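Because the partition values are embedded in the object keys, downstream consumers can recover the project, site, asset, position, and hour directly from the key without reading the object itself. The following sketch, which assumes a hypothetical bucket name, lists a few delivered objects and parses their keys.

import boto3

s3 = boto3.client("s3")
bucket = "your-monitron-export-bucket"  # hypothetical bucket name

resp = s3.list_objects_v2(Bucket=bucket, Prefix="project=", MaxKeys=10)
for obj in resp.get("Contents", []):
    # Example key:
    # project=Plant1/site=SiteA/asset=Pump-7/position=Motor/time=2023-05-01 00:00:00/file.json
    parts = dict(
        segment.split("=", 1)
        for segment in obj["Key"].split("/")
        if "=" in segment
    )
    print(parts.get("project"), parts.get("site"), parts.get("asset"),
          parts.get("position"), parts.get("time"))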

Build the AWS Glue crawler to create a table of Amazon S3 data in Athena

After the live data has been exported to Amazon S3, we use an AWS Glue crawler to generate the metadata tables. In this post, we use AWS Glue crawlers to automatically infer database and table schema from Amazon Monitron data exported in Amazon S3, and store the associated metadata in the AWS Glue Data Catalog. Athena then uses the table metadata from the Data Catalog to find, read, and process the data in Amazon S3. Complete the following steps to create your database and table schema:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. Enter a name for the crawler (for example, XXX_xxxx_monitron).
  4. Choose Next.
  5. For Is your data already mapped to Glue tables, choose Not yet.
  6. For Data Source, choose S3.
  7. For Location of S3 data, choose In this Account, and enter the path of your S3 bucket directory you set up in the previous section (s3://YourBucketName).
  8. For Repeat crawls of S3 data stores, select Crawl all sub-folders.
  9. Finally, choose Next.
  10. Select Create new IAM role and enter a name for the role.
  11. Choose Next.
  12. Select Add Database, and enter a name for the database. This creates the Athena database where your metadata tables are located after the crawler is complete.
  13. For Crawler Schedule, select a preferred time-based scheduler (for example, hourly) to refresh the Amazon Monitron data in the database, and choose Next.
  14. Review the crawler details and choose Create.
  15. On the Crawlers page of the AWS Glue console, select the crawler you created and choose Run crawler.

You may need to wait a few minutes, depending on the size of the data. When it’s complete, the crawler’s status shows as Ready. To see the metadata tables, navigate to your database on the Databases page and choose Tables in the navigation pane.
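If you want to trigger the crawler outside of its schedule, for example after backfilling historical data, you can start it and poll its state with boto3. The following sketch assumes the crawler name you chose earlier.

import time
import boto3

glue = boto3.client("glue")
crawler_name = "XXX_xxxx_monitron"  # the name you gave the crawler in the previous steps

glue.start_crawler(Name=crawler_name)

# Poll until the crawler returns to the READY state
while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
    time.sleep(30)
print("Crawler run complete")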

You can also view data by choosing Table data on the console.

You’re redirected to the Athena console to view the top 10 records of the Amazon Monitron data in Amazon S3.
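You can run the same kind of preview query outside the console with the Athena API. The following sketch assumes placeholder database, table, and query result bucket names; replace them with your own.

import time
import boto3

athena = boto3.client("athena")

query = 'SELECT * FROM "your_monitron_database"."your_monitron_table" LIMIT 10'

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "your_monitron_database"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/monitron/"},
)
query_id = execution["QueryExecutionId"]

# Wait for the query to finish
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])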

Create a dashboard of Amazon Monitron devices with Amazon Managed Grafana

In this section, we build a customized dashboard with Amazon Managed Grafana to visualize the Amazon Monitron data in Amazon S3, so that the OT team can get streamlined access to assets in alarm across their entire Amazon Monitron sensor fleet. This enables the OT team to plan next-step actions based on the possible root cause of the anomalies.

To create a Grafana workspace, complete the following steps:

  1. Ensure that your user role is admin or editor.
  2. On the Amazon Managed Grafana console, choose Create workspace.
  3. For Workspace name, enter a name for the workspace.
  4. Choose Next.
  5. For Authentication access, select AWS IAM Identity Center (successor to AWS Single Sign-On). You can use the same AWS IAM Identity Center user that you used to set up your Amazon Monitron project.
  6. Choose Next.
  7. For this first workspace, confirm that Service managed is selected for Permission type. This selection enables Amazon Managed Grafana to automatically provision the permissions you need for the AWS data sources that you use for this workspace.
  8. Choose Current account.
  9. Choose Next.
  10. Confirm the workspace details, and choose Create workspace. The workspace details page appears. Initially, the status is CREATING.
  11. Wait until the status is ACTIVE to proceed to the next step.

To configure your Athena data source, complete the following steps:

  1. On the Amazon Managed Grafana console, choose the workspace you want to work on.
  2. On the Data sources tab, select Amazon Athena, and choose Actions, Enable service-managed policy.
  3. Choose Configure in Grafana in the Amazon Athena row.
  4. Sign in to the Grafana workspace console using IAM Identity Center if necessary. The user should have the Athena access policy attached to the user or role to have access to the Athena data source. See AWS managed policy: AmazonGrafanaAthenaAccess for more info.
  5. On the Grafana workspace console, in the navigation pane, choose the lower AWS icon (there are two) and then choose Athena on the Data sources menu.
  6. Select the default Region that you want the Athena data source to query from, select the accounts that you want, then choose Add data source.
  7. Follow the steps to configure Athena details.

If your workgroup in Athena doesn’t have an output location configured already, you need to specify an S3 bucket and folder to use for query results. After setting up the data source, you can view or edit it in the Configuration pane.

In the following subsections, we demonstrate several panels in the Amazon Monitron dashboard built in Amazon Managed Grafana to gain operational insights. The Athena data source provides a standard SQL query editor that we’ll use to analyze the Amazon Monitron data to generate desired analytics.

First, if there are many sensors in the Amazon Monitron project and they are in different states (healthy, warning, alarm, and needs maintenance), the OT team wants to see at a glance how many sensor positions are in each state. You can obtain this information as a pie chart widget in Grafana via the following Athena query:

SELECT *
FROM (
    SELECT latest_status,
           COUNT(assetdisplayname) OVER (PARTITION BY latest_status) AS asset_health_count
    FROM (
        SELECT timestamp,
               sitedisplayname,
               assetdisplayname,
               assetState.newState AS latest_status,
               RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    ) tt
    WHERE tt.rnk = 1
)
GROUP BY latest_status, asset_health_count;

The following screenshot shows a panel with the latest distribution of Amazon Monitron sensor status.

To format your SQL query for Amazon Monitron data, refer to Understanding the data export schema.

Next, your OT team may want to plan predictive maintenance based on assets that are in alarm status, and therefore they want to quickly know the total number of acknowledged vs. unacknowledged alarms. You can show this summary information about alarm states as simple stats panels in Grafana:

SELECT COUNT(*)
FROM (
    SELECT timestamp,
           sitedisplayname,
           assetdisplayname,
           assetState.newState AS latest_status,
           RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk
    FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
) tt
WHERE tt.rnk = 1 AND tt.latest_status = 'Alarm';

The following panel shows acknowledged and unacknowledged alarms.

The OT team can also query the amount of time the sensors remain in alarm status, so that they can decide their maintenance priority:

SELECT c.assetdisplayname,
       b.sensorpositiondisplayname,
       b.number_of_days_in_alarm_state
FROM (
    SELECT a.assetdisplayname,
           a.sensorpositiondisplayname,
           COUNT(*)/24 + 1 AS number_of_days_in_alarm_state
    FROM (
        SELECT *
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
        WHERE (assetState.newState = 'ALARM' AND assetState.newState = assetState.previousState)
        ORDER BY timestamp DESC
    ) a
    GROUP BY a.assetdisplayname, a.sensorpositiondisplayname
) b
INNER JOIN (
    SELECT *
    FROM (
        SELECT timestamp,
               sitedisplayname,
               assetdisplayname,
               assetState.newState AS latest_status,
               RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    ) tt
    WHERE tt.rnk = 1 AND tt.latest_status = 'ALARM'
) c
ON b.assetdisplayname = c.assetdisplayname;

The output of this analysis can be visualized as a bar chart in Grafana, where the time each asset has spent in alarm state is easy to see, as shown in the following screenshot.

To analyze top/bottom asset performance based on the total amount of time the assets are in an alarm or need maintenance state, use the following query:

SELECT s.sitedisplayname,
       s.assetdisplayname,
       COUNT(s.timestamp)/24 AS trouble_time
FROM (
    SELECT timestamp,
           sitedisplayname,
           assetdisplayname,
           sensorpositiondisplayname,
           assetState.newState
    FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    WHERE assetState.newState = 'ALARM' OR assetState.newState = 'NEEDS_MAINTENANCE'
) AS s
GROUP BY s.assetdisplayname, s.sitedisplayname
ORDER BY trouble_time, s.assetdisplayname ASC
LIMIT 5;

The following bar gauge is used to visualize the preceding query output, with the top performing assets showing 0 days of alarm states, and the bottom performing assets showing accumulated alarming states over the past year.

To help the OT team understand the possible root cause of an anomaly, the alarm types can be displayed for these assets still in alarm state with the following query:

SELECT a.assetdisplayname,
       a.sensorpositiondisplayname,
       a.latest_status,
       CASE
           WHEN a.temperatureML != 'HEALTHY' THEN 'TEMP'
           WHEN a.vibrationISO != 'HEALTHY' THEN 'VIBRATION_ISO'
           ELSE 'VIBRATION_ML'
       END AS alarm_type
FROM (
    SELECT sitedisplayname,
           assetdisplayname,
           sensorpositiondisplayname,
           models.temperatureML.persistentClassificationOutput AS temperatureML,
           models.vibrationISO.persistentClassificationOutput AS vibrationISO,
           models.vibrationML.persistentClassificationOutput AS vibrationML,
           assetState.newState AS latest_status
    FROM (
        SELECT *,
               RANK() OVER (PARTITION BY assetdisplayname, sensorpositiondisplayname ORDER BY timestamp DESC) AS rnk
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    ) tt
    WHERE tt.rnk = 1 AND assetState.newState = 'ALARM'
) a
WHERE (a.temperatureML != 'HEALTHY' OR a.vibrationISO != 'HEALTHY' OR a.vibrationML != 'HEALTHY');

You can visualize this analysis as a table in Grafana. In this Amazon Monitron project, two alarms were triggered by ML models for vibration measurement.

The Amazon Managed Grafana dashboard is shown here for illustration purposes. You can adapt the dashboard design according to your own business needs.

Failure reports

When a user acknowledges an alarm in the Amazon Monitron app, the associated assets transition to a new state. The user also has the opportunity to provide some details about this alarm:

  • Failure cause – This can be one of the following: ADMINISTRATION, DESIGN, FABRICATION, MAINTENANCE, OPERATION, OTHER, QUALITY, WEAR, or UNDETERMINED
  • Failure mode – This can be one of the following: NO_ISSUE, BLOCKAGE, CAVITATION, CORROSION, DEPOSIT, IMBALANCE, LUBRICATION, MISALIGNMENT, OTHER, RESONANCE, ROTATING_LOOSENESS, STRUCTURAL_LOOSENESS, TRANSMITTED_FAULT, or UNDETERMINED
  • Action taken – This can be ADJUST, CLEAN, LUBRICATE, MODIFY, OVERHAUL, REPLACE, NO_ACTION, or OTHER

The event payload associated with the asset state transition contains all this information, the previous state of the asset, and the new state of the asset. Stay tuned for an update of this post with more details on how you can use this information in an additional Grafana panel to build Pareto charts of the most common failures and actions taken across your assets.
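As a preview of that analysis, the following sketch shows one way to aggregate closure details into Pareto-style counts with pandas. It assumes you have already queried the asset state transition events into a DataFrame; the failure_mode and action_taken column names and the sample values are illustrative only.

import pandas as pd

# Hypothetical sample of closure details pulled from asset state transition events
closures = pd.DataFrame(
    {
        "failure_mode": ["IMBALANCE", "LUBRICATION", "IMBALANCE", "MISALIGNMENT", "LUBRICATION"],
        "action_taken": ["ADJUST", "LUBRICATE", "REPLACE", "ADJUST", "LUBRICATE"],
    }
)

# Pareto-style view: most common failure modes with their cumulative share
pareto = closures["failure_mode"].value_counts().to_frame("count")
pareto["cumulative_pct"] = 100 * pareto["count"].cumsum() / pareto["count"].sum()
print(pareto)

# Same idea for the actions taken
print(closures["action_taken"].value_counts())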

Conclusion

Enterprise customers of Amazon Monitron are looking for a solution to build an IoT data lake with Amazon Monitron’s live data, so they can manage multiple Amazon Monitron projects and assets, and generate analytics reports across them. This post provided a detailed walkthrough of a solution to build this IoT data lake with the latest Amazon Monitron Kinesis data export v2 feature. It also showed how to use other AWS services, such as AWS Glue and Athena, to query the data and generate analytics outputs, and how to visualize those outputs with a frequently refreshed Amazon Managed Grafana dashboard.

As a next step, you can expand this solution by sending ML inference results to other EAM systems that you might use for work order management. This will allow your operation team to integrate Amazon Monitron with other enterprise applications, and improve their operation efficiency. You can also start building more in-depth insights into your failure modes and actions taken by processing the asset state transitions and the closure codes that are now part of the Kinesis data stream payload.


About the authors

Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services. She has extensive experience in IoT architecture and Applied Data Science, and is part of both the Machine Learning and IoT Technical Field Community. She works with customers, ranging from start-ups to enterprises, to develop AWSome IoT machine learning (ML) solutions, at the Edge and in the Cloud. She enjoys leveraging latest IoT and big data technology to scale up her ML solution, reduce latency, and accelerate industry adoption.

Bishr Tabbaa is a solutions architect at Amazon Web Services. Bishr specializes in helping customers with machine learning, security, and observability applications. Outside of work, he enjoys playing tennis, cooking, and spending time with family.

Shalika Pargal is a Product Manager at Amazon Web Services. Shalika focuses on building AI products and services for Industrial customers. She brings significant experience at the intersection of Product, Industrial and Business Development. She recently shared Monitron’s success story at re:Invent 2022.

Garry Galinsky is a Principal Solutions Architect supporting Amazon on AWS. He has been involved with Monitron since its debut and has helped integrate and deploy the solution into Amazon’s worldwide fulfillment network. He recently shared Amazon’s Monitron success story at re:Invent 2022.

Michaël Hoarau is an AI/ML Specialist Solutions Architect at AWS who alternates between data scientist and machine learning architect, depending on the moment. He is passionate about bringing the AI/ML power to the shop floors of his industrial customers and has worked on a wide range of ML use cases, ranging from anomaly detection to predictive product quality or manufacturing optimization. He published a book on time series analysis in 2022 and regularly writes about this topic on LinkedIn and Medium. When not helping customers develop the next best machine learning experiences, he enjoys observing the stars, traveling, or playing the piano.

Deploy large models at high performance using FasterTransformer on Amazon SageMaker

Sparked by the release of large AI models like AlexaTM, GPT, OpenChatKit, BLOOM, GPT-J, GPT-NeoX, FLAN-T5, OPT, Stable Diffusion, and ControlNet, the popularity of generative AI has seen a recent boom. Businesses are beginning to evaluate new cutting-edge applications of the technology in text, image, audio, and video generation that have the potential to revolutionize the services they provide and the ways they interact with customers. However, as the size and complexity of the deep learning models that power generative AI continue to grow, deployment can be a challenging task. Advanced techniques such as model parallelism and quantization become necessary to achieve latency and throughput requirements. Without expertise in using these techniques, many customers struggle to get started with hosting large models for generative AI applications.

This post can help! We begin by discussing different types of model optimizations that can be used to boost performance before you deploy your model. Then, we highlight how Amazon SageMaker large model inference deep learning containers (LMI DLCs) can help with optimization and deployment. Finally, we include code examples using LMI DLCs and FasterTransformer model parallelism to deploy models like flan-t5-xxl and flan-ul2. You can find an accompanying example notebook in the SageMaker examples repository.

Large model deployment pipeline

Major steps in any model inference workflow include loading a model into memory and handling inference requests on this in-memory model through a model server. Large models complicate this process because loading a 350 GB model such as BLOOM-176B can take tens of minutes, which materially impacts endpoint startup time. Furthermore, because these models can’t fit within the memory of a single accelerator, the model must be organized and partitioned such that it can be spread across the memory of multiple accelerators; then, model servers must handle processes and communication across multiple accelerators.

Beyond model loading, partitioning, and serving, compression techniques are increasingly necessary to achieve performance goals (such as subsecond latency) for customers working with large models. Quantization and compression can reduce model size and serving cost by reducing the precision of weights or reducing the number of parameters via pruning or distillation. Compilation can optimize the computation graph and fuse operators to reduce memory and compute requirements of a model.

Achieving low latency for large language models (LLMs) requires improvements in all the steps in the inference workflow: compilation, model loading, compression (runtime quantization), partitioning (tensor or pipeline parallelism), and model serving. At a high level, partitioning (with kernel optimization) brings down inference latency by up to 66% (for example, BLOOM-176B from 30 seconds to 10 seconds), compilation by 20%, and compression by 50% (FP32 to FP16). An example pipeline for large model hosting with runtime partitioning is illustrated in the following diagram.

Overview of large model inference optimization techniques

With the large model deployment pipeline in mind, we now explore the optimizations. Optimizations can be critical to achieve latency and throughput goals. However, you need to be thoughtful about which optimizations you use and to what degree, because the accuracy of your model can be affected.

The following diagram is a high-level overview of different inference optimization techniques. Optimization approaches can be at the hardware or software level. We focus only on software optimization techniques in this post.

Optimized kernels and compilation

Today, optimized kernels are the greatest source of performance improvement for LMI (for example, DeepSpeed’s kernels reduced BLOOM-176B latency by a factor of three). Fused kernel operators are model specific, and different model parallel libraries have different approaches. DeepSpeed created an injection policy for each model family; it has handwritten PyTorch modules and CUDA kernels that can speed up parts of the model. Meanwhile, FasterTransformer rewrites the model in pure C++ and CUDA to speed up the model as a whole. PyTorch 2.0 offers an open entry point (via torch.compile) to allow easy compilation for different platforms. To bring cost- and performance-optimized LLM hosting to SageMaker, we offer SageMaker LMI containers that provide the best open-source compilation stack on a per-model basis, such as T5 with FasterTransformer and GPT-J with DeepSpeed.
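As a minimal illustration of the torch.compile entry point mentioned above (independent of the SageMaker LMI containers), the following sketch compiles a small function so PyTorch 2.0 can optimize and fuse its operations.

import torch

def fused_block(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Matrix multiply followed by an activation: a candidate for operator fusion
    return torch.nn.functional.gelu(x @ w)

compiled_block = torch.compile(fused_block)  # requires PyTorch 2.0+

x = torch.randn(32, 1024)
w = torch.randn(1024, 1024)
out = compiled_block(x, w)  # the first call triggers compilation; later calls reuse it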

Compilation or integration to optimized runtime

ML compilers, such as Amazon SageMaker Neo, apply techniques such as operator fusion, memory planning, graph optimizations, and automatic integration to optimized inference libraries. Because inference includes only a forward propagation, intermediate tensors between layers are discarded instead of stored for reuse in back-propagation. The graph optimization techniques improve the inference throughput and have a small impact on model memory footprints. Relative to other optimization techniques, compilation for inference provides a limited benefit for reducing a model’s memory requirements. Several runtime libraries for GPU are available today, such as FasterTransformer, TensorRT, and ONNX Runtime.

Model compression

Model compression is a collection of approaches that researchers and practitioners can use to reduce the size of their model, realize faster speed, and reduce hosting cost. Model compression techniques primarily include knowledge distillation, pruning, and quantization. Most compression technologies are challenging for LLMs due to requiring additional training cycles to improve the accuracy of compressed models.

Quantization

Quantization is the process of mapping values from a larger or continuous set of numbers to a smaller set of numbers (for example, INT8 {-128:127}, uINT8 {0:255}). Using a smaller set of numbers reduces memory use and complexity of computations, but the decreased precision can degrade the accuracy of the model. The level of quantization can be adjusted to fit size constraints and accuracy needs. For example, a model quantized to FP8 will be about half the size of a model in FP16 but at the expense of reduced accuracy.

Quantization has shown great and consistent success for inference tasks by reducing the size of the model up to 75%, offering 2–4 times throughput improvements and cost savings.

The success of quantization is because it’s broadly applicable across a range of models and use cases with approximately 1% accuracy/score loss, if a proper technique is used. It doesn’t require changing model architecture. Typically, it starts with an existing floating-point model and quantizes it to obtain a fixed-point quantized model. Quantizing from FP32 to INT8 reduces the model size by 75%, but the accuracy/score loss impact is often less than a point.
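To make the size math concrete, the following sketch compares the parameter memory of FP32 and FP16 copies of the same module. It uses a toy stand-in for a transformer feed-forward block rather than an actual LLM; INT8 quantization would shrink the weights further, to roughly a quarter of the FP32 footprint.

import torch
from torch import nn

# Toy module standing in for a transformer feed-forward block (illustration only)
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

def param_megabytes(module: nn.Module) -> float:
    return sum(p.numel() * p.element_size() for p in module.parameters()) / 1e6

print(f"FP32 parameters: {param_megabytes(model):.1f} MB")
print(f"FP16 parameters: {param_megabytes(model.half()):.1f} MB")  # roughly half of FP32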

Distillation

With distillation, a larger teacher model transfers knowledge to a smaller student model. The model size can be reduced until the student model can fit on an edge device or smaller cloud-based hardware, but accuracy decreases as the model is reduced. There is no industry standard for distillation, and many techniques are experimental. Distillation requires more work by the customer in tuning and trial and error to shrink the model without affecting accuracy. For more information, refer to Knowledge distillation in deep learning and its applications.

Pruning

Pruning is a model compression technique that reduces the number of operations by removing parameters. To minimize the impact to model accuracy, parameters are first ranked by importance. Parameters that are less important are set to zero or connections to the neuron are removed. This decreases the number of operations with minimal impact to model accuracy. For example, when using a pre-trained model for a narrow use case, parts of the larger model that are less relevant to your application could be pruned away to reduce size without significantly degrading performance for your task.

Model partitioning

A model that can’t fit on a single accelerator’s memory must be split into multiple partitions. At a high level, there are two fundamental approaches to partitioning the model (model parallelism): tensor parallelism and pipeline parallelism.

Tensor parallelism is also called intra-layer model parallelism. In this approach, each one of the layers is partitioned across the workers (accelerators). On the positive side, we can handle models with very large layers, because the layers are split across workers. Therefore, we no longer need to fit at least a single layer on a worker, as was the case for pipeline parallelism. However, this leads to an all-to-all communication pattern between the workers after each one of the layers, so there’s a heavy burden on the GPU/accelerator interconnect.

Pipeline parallelism partitions the model into layers. Each worker may end up with having one or more layers. This approach uses point-to-point communication and therefore introduces lower communication overhead compared to tensor parallelism. However, this approach won’t be useful if a layer can’t fit into a single worker’s or accelerator’s memory. This approach is also prone to pipeline idleness and may reduce the scaling efficiency.
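To illustrate the tensor parallel idea on a single layer, the following sketch splits a weight matrix column-wise across two simulated workers, computes partial outputs, and gathers them. In a real deployment, the shards live on different accelerators and the gather is a collective communication across the interconnect.

import torch

torch.manual_seed(0)
x = torch.randn(8, 512)     # a batch of activations
w = torch.randn(512, 2048)  # the full weight matrix of one layer

# Tensor parallelism: each "worker" holds only one column shard of the weight matrix
w_shards = torch.chunk(w, chunks=2, dim=1)
partial_outputs = [x @ shard for shard in w_shards]

# Gathering the partial outputs reproduces the single-device result
y = torch.cat(partial_outputs, dim=1)
assert torch.allclose(y, x @ w, atol=1e-4)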

Open-source frameworks like DeepSpeed, Hugging Face Accelerate, and FasterTransformer allow per-model optimization to shard the model. Especially for DeepSpeed, the partitioning algorithm is tightly coupled with fused kernel operators. SageMaker LMI containers come with pre-integrated model partitioning frameworks like FasterTransformer, DeepSpeed, Hugging Face Accelerate, and Transformers-NeuronX. Currently, DeepSpeed, FasterTransformer, and Hugging Face Accelerate shard the model at model loading time. Runtime model partitioning can take more than 10 minutes (OPT-66B) and consume extensive CPU, GPU, and accelerator memory. Ahead-of-time (AOT) partitioning can help reduce model loading times. With AOT, models are partitioned before deployment, and partitions are kept ready for downstream optimization and subsequent ingestion by model parallel frameworks. When model parallel frameworks are fed already partitioned models, runtime partitioning doesn’t happen. This improves model loading time and reduces CPU, GPU, and accelerator memory consumption. DeepSpeed and FasterTransformer have support for pre-partitioning and saving models.

Prompt engineering

Prompt engineering refers to efforts to extract accurate, consistent, and fair outputs from large models, such as text-to-image synthesizers or large language models. LLMs are trained on large-scale bodies of text, so they encode a great deal of factual information about the world. A prompt consists of text and optionally an image given to a pre-trained model for a prediction task. A prompt text may consist of additional components like context, task (instruction, question, and so on), image or text, and training samples. Prompt engineering also provides a way for LLMs to do few-shot generalization, in which a machine learning model trained on a set of generic tasks learns a new or related task from just a handful of examples. For more information, refer to EMNLP: Prompt engineering is the new feature engineering. Refer to the following GitHub repo for more information about getting the most out of your large models using prompt engineering on SageMaker.
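As a small illustration of few-shot prompting, the following sketch assembles a prompt from a task instruction, two training samples, and a new input. The task and examples are made up for illustration only.

# Few-shot prompt construction (illustrative task and examples)
instruction = "Classify the sentiment of each review as positive or negative."
samples = [
    ("The conveyor upgrade went smoothly and the team was great.", "positive"),
    ("The part failed again after only two days.", "negative"),
]
new_review = "Setup was quick and the documentation was clear."

prompt = instruction + "\n\n"
for review, label in samples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {new_review}\nSentiment:"
print(prompt)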

Model downloading and loading

Large language models incur long download times (for example, 40 minutes to download BLOOM-176B). In 2022, SageMaker Hosting added support for larger Amazon Elastic Block Store (Amazon EBS) volumes up to 500 GB, a longer download timeout of up to 60 minutes, and a longer container startup time of 60 minutes. You can enable this configuration to deploy LLMs on SageMaker. SageMaker LMI containers include model download optimization by using the s5cmd library to speed up model download and container startup times, and eventually speed up auto scaling on SageMaker.

Diving deep into SageMaker LMI containers

SageMaker maintains large model inference containers with popular open-source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on AWS infrastructure. With these containers, you can use corresponding open-source libraries such as DeepSpeed, Accelerate, FasterTransformer, and Transformers-NeuronX to partition model parameters using model parallelism techniques to use the memory of multiple GPUs or accelerators for inference. Transformers-NeuronX is a model parallel library introduced by the AWS Neuron team for AWS Inferentia and AWS Trainium to support LLMs. It supports tensor parallelism across Neuron cores.

The LMI container uses DJLServing as the pre-built integrated model server; pre-built integrated model partitioning frameworks like DeepSpeed, Accelerate, FasterTransformer, and Transformers-NeuronX; support for PyTorch; and comes with pre-installed cuDNN, cuBLAS, NCCL CUDA Toolkit for GPUs, MKL for CPU, and the Neuron SDK and runtime for running models on AWS Inferentia and Trainium.

Pre-integrated model partitioning frameworks in SageMaker LMI containers

SageMaker LMI comes with pre-integrated model partitioning frameworks to suit your performance and model support requirements.

Most of the model parallel frameworks support both pipeline and tensor parallelism. Pipeline parallelism is simpler to implement than tensor parallelism. However, due to its sequential operating nature, it’s slower than tensor parallelism. Pipeline parallelism and tensor parallelism can be combined.

Transformers-NeuronX is a model parallel library introduced by the Neuron team to support LLMs on AWS Inferentia and Trainium. It supports tensor parallelism across Neuron cores. The following comparison summarizes the different model partitioning frameworks and can help you select the right framework for deploying your models on SageMaker.

  • Hugging Face Accelerate – Model parallelism: pipeline parallelism. Supported models: all Hugging Face models. Model loading speed: medium. Best performance on: all other non-optimized models. Hardware support: CPU/GPU.
  • DeepSpeed – Model parallelism: pipeline and tensor parallelism. Supported models: all GPT family, Stable Diffusion, and the T5 family. Model loading speed: fast. Best performance on: GPT family. Hardware support: GPU.
  • FasterTransformer – Model parallelism: pipeline and tensor parallelism. Supported models: GPT2/OPT/BLOOM/T5. Model loading speed: fast. Best performance on: T5 and BLOOM. Hardware support: GPU.
  • Transformers-NeuronX (Inf2/Trn1) – Model parallelism: tensor parallelism. Supported models: GPT2/OPT/GPTJ/GPT-NeoX. Best performance on: all supported models. Hardware support: Inf2/Trn1.

The frameworks also differ in their support for loading Hugging Face checkpoints, runtime and ahead-of-time partitioning, model partitioning on CPU memory, streaming tokens, fast model loading, and SageMaker multi-model endpoints.

Large model deployment pipeline on SageMaker

SageMaker LMI containers offer a low-code/no-code mechanism to set up your large model deployment pipeline with the following capabilities:

  • Faster model download time using s5cmd
  • Pre-built optimized model parallel frameworks including Transformers-NeuronX, DeepSpeed, Hugging Face Accelerate, and FasterTransformer
  • Pre-built foundation software stack including PyTorch, NCCL, and MPI
  • Low-code/no-code deployment of large models by configuring serving.properties
  • SageMaker-compatible containers

The following diagram gives an overview of a SageMaker LMI deployment pipeline you can use to deploy your models.

Deploy a FLAN-T5-XXL model on SageMaker using the newly released LMI container version

FasterTransformer is a library implementing an accelerated engine for the inference of transformer-based neural networks, with a special emphasis on large models, spanning many GPUs and nodes in a distributed manner. FasterTransformer contains the implementation of the highly optimized version of the transformer block that contains the encoder and decoder parts. With this block, you can run the inference of both the full encoder-decoder architectures like T5, as well as encoder-only models such as BERT, or decoder-only models such as GPT. It’s written in C++/CUDA and relies on the highly optimized cuBLAS, cuBLASLt, and cuSPARSELt libraries. This allows you to build the fastest transformer inference pipeline on GPU.

The FasterTransformer model parallel library is now available in a SageMaker LMI container, adding support for popular models such as flan-t5-xxl and flan-ul2. FasterTransformer is an open-source library from NVIDIA that provides an accelerated engine for efficiently running transformer-based neural network inference. It has been designed to handle large models that require multiple GPUs or accelerators and nodes in a distributed manner. The library includes an optimized version of the transformer block, which comprises both the encoder and decoder parts, enabling you to run the inference of full encoder-decoder architectures like T5, as well as encoder-only models like BERT and decoder-only models like GPT.

Runtime architecture of hosting a model using an LMI container’s FasterTransformer engine on SageMaker

The FasterTransformer engine in an LMI container supports loading model weights from an Amazon Simple Storage Service (Amazon S3) path or the Hugging Face Hub. After fetching the model, it converts the Hugging Face model checkpoint into FasterTransformer-supported partitioned model artifacts based on input parameters such as the tensor parallel degree, and loads the partitioned artifacts across GPU devices. It loads faster by using multi-process loading in Python, supports AOT compilation, and uses the CPU to partition the model.

SageMaker LMI containers improve performance when downloading models from Amazon S3 by using s5cmd. They provide the FasterTransformer engine, which offers a layer of abstraction for developers: it loads the model in Hugging Face checkpoint or PyTorch bin format and uses the FasterTransformer library to convert it into a FasterTransformer-compatible format. These steps happen during container startup and load the model into memory before inference requests arrive. The FasterTransformer engine then provides high-performance C++ and CUDA implementations for the models to run inference. This helps improve the container startup time and reduce inference latency. The following diagram illustrates the runtime architecture of serving models using FasterTransformer on SageMaker. For more information about DJLServing’s runtime architecture, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

Use SageMaker LMI container images

To use a SageMaker LMI container to host a FLAN-T5 model, you have a no-code option or a bring-your-own-script option. We showcase the bring-your-own-script option in this post. The first step in the process is to use the right LMI container image. An example notebook is available in the GitHub repo.

Use the following code to use the SageMaker LMI container image after replacing the Region with the specific Region you’re running the notebook in:

inference_image_uri = image_uris.retrieve(
    framework="djl-fastertransformer", region=sess.boto_session.region_name, version="0.21.0"
)

Download the model weights

An LMI container allows us to download the model weights from the Hugging Face Hub at run time when spinning up the instance for deployment. However, that takes longer because it’s dependent on the network and on the provider. The faster option is to download the model weights into Amazon S3 and then use the LMI container to download them to the container from Amazon S3. This is also a preferred method when we need to scale up our instances. In this post, we showcase how to download the weights to Amazon S3 and then use them when configuring the container. See the following code:

from huggingface_hub import snapshot_download

model_name = "google/flan-t5-xxl"
# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]
# - Leverage the snapshot library to download the model since the model is stored in repository using LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)

# define a variable to contain the s3url of the location that has the model
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"

model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)

Create the model configuration and inference script

First, we create a file called serving.properties that configures the container. This tells the DJL model server to use the FasterTransformer engine to load and shard the model weights. Second, we point to the S3 URI where the model weights were uploaded. The LMI container will download the model artifacts from Amazon S3 using s5cmd. The file contains the following code:

engine=FasterTransformer
option.tensor_parallel_degree=4
option.s3url={{s3url}}

For the no-code option, the key changes are to specify the entry_point as the built-in handler. We specify the value as djl_python.fastertransformer. For more details, refer to the GitHub repo. You can use this code to modify for your own use case as needed. A complete example that illustrates the no-code option can be found in the following notebook. The serving.properties file will now look like the following code:

engine=FasterTransformer
option.entryPoint=djl_python.fastertransformer
option.s3url={{s3url}}
option.tensor_parallel_degree=4

Next, we create our model.py file, which defines the code needed to load and then serve the model. The only mandatory method is handle(inputs). We continue to use the functional programming paradigm to build the other helpful methods like load_model(), pipeline_generate(), and more. In our code, we read in the tensor_parallel_degree property value (the default value is 1). This sets the number of devices over which the tensor parallel modules are distributed. We also get the model weights downloaded under the /tmp location on the container and referenced by the model_dir property. To load the model, we use the FasterTransformer init method as shown in the following code. Note that we load the full precision weights in FP32. You can also quantize the model at runtime by setting dtype = "fp16" in the following code and setting tensor_parallel_degree = 2 in serving.properties. However, note that the FP16 version of this model may not provide similar performance in terms of output quality as compared to the FP32 version. In addition, refer to an existing issue related to the impact on model quality on FasterTransformer for the T5 model for certain NLP tasks.

import fastertransformer as ft
from djl_python import Input, Output
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    T5Tokenizer,
    T5ForConditionalGeneration,
)
import os
import logging
import math
import torch


def load_model(properties):
    model_name = "google/flan-t5-xxl"
    tensor_parallel_degree = properties["tensor_parallel_degree"]
    pipeline_parallel_degree = 1
    model_location = properties["model_dir"]
    if "model_id" in properties:
        model_location = properties["model_id"]
    logging.info(f"Loading model in {model_location}")

    tokenizer = T5Tokenizer.from_pretrained(model_location)
    dtype = "fp32"
    model = ft.init_inference(
        model_location, tensor_parallel_degree, pipeline_parallel_degree, dtype
    )
    return model, tokenizer


model = None
tokenizer = None


def handle(inputs: Input):
    """
    inputs: Contains the configurations from serving.properties
    """
    global model, tokenizer

    if not model:
        model, tokenizer = load_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_json()

    input_sentences = data["inputs"]
    params = data["parameters"]

    outputs = model.pipeline_generate(input_sentences, **params)
    result = {"outputs": outputs}

    return Output().add_as_json(result)

Create a SageMaker endpoint for inference

In this section, we go through the steps to create a SageMaker model and endpoint for inference.

Create a SageMaker model

We now create a SageMaker model. We use the Amazon Elastic Container Registry (Amazon ECR) image URI we retrieved earlier and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure tensor_parallel_degree to 4 in serving.properties, which means the model is partitioned across 4 GPUs. See the following code:

from sagemaker.utils import name_from_base
model_name = name_from_base(f"flan-xxl-fastertransformer")
print(model_name)
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri, 
        "ModelDataUrl": s3_code_artifact
    },
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")

Create a SageMaker endpoint for inference

You can use any instance with multiple GPUs for testing. In this demo, we use a g5.12xlarge instance. In the following code, note how we set ModelDataDownloadTimeoutInSeconds and ContainerStartupHealthCheckTimeoutInSeconds. We don’t set the VolumeSizeInGB parameter because this instance comes with SSD storage. The VolumeSizeInGB parameter is applicable to GPU instances that support EBS volume attachment.

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB": 200,
            "ModelDataDownloadTimeoutInSeconds": 600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 600,
        },
    ],
)

Lastly, we create a SageMaker endpoint:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

Starting the endpoint might take a few minutes. You can try a few more times if you run into the InsufficientInstanceCapacity error, or you can raise a request to AWS to increase the limit in your account.

Invoke the model

This is a generative model, so we pass in text as a prompt, and the model completes the sentence and returns the results.

You can pass a batch of prompts as input to the model. This is done by setting inputs to the list of prompts. The model then returns a result for each prompt. The text generation can be configured using appropriate parameters.

# -- we set the prompt in the parameter name which matches what we try and extract in model.py
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({
        "batch_size": 1,
        "inputs" : "Amazon.com is an awesome site",
        "parameters" : {},
    }),
    ContentType="application/json",
)
response_model["Body"].read().decode("utf8")
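To send a batch of prompts as described earlier, pass a list in the inputs field. The following sketch reuses the same endpoint and client; the prompts and parameter values are illustrative.

# Batch invocation: the model returns a result for each prompt in the list
response_batch = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": [
                "Summarize: Amazon Monitron detects abnormal machine vibration and temperature.",
                "Translate to German: The warehouse is open.",
            ],
            "parameters": {"max_seq_len": 100, "temperature": 0.5},
        }
    ),
    ContentType="application/json",
)
print(response_batch["Body"].read().decode("utf8"))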

Model parameters at inference time

The following code lists the set of default parameters that is used by the model. You can set these arguments to a specific value of your choice while invoking the endpoint.

default_args = dict(
            inputs_embeds=None,
            beam_width=1,
            max_seq_len=200,
            top_k=1,
            top_p=0.0,
            beam_search_diversity_rate=0.0,
            temperature=1.0,
            len_penalty=0.0,
            repetition_penalty=1.0,
            presence_penalty=None,
            min_length=0,
            random_seed=0,
            is_return_output_log_probs=False,
            is_return_cum_log_probs=False,
            is_return_cross_attentions=False,
            bad_words_list=None,
            stop_words_list=None
        )

The following code has a sample invocation to the endpoint we deployed. We use the max_seq_len parameter to control the number of tokens that are generated and temperature to control the randomness of the generated text.

smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": [
                "Title: ”University has a new facility coming up“\nGiven the above title of an imaginary article, imagine the article.n"
            ],
            "parameters": {"max_seq_len": 200, "temperature": 0.7},
            "padding": True,
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

Clean up

When you’re done testing the model, delete the endpoint to save costs if the endpoint is no longer required:

# - Delete the end point
sm_client.delete_endpoint(EndpointName=endpoint_name)

Performance tuning

If you intend to use this post and accompanying notebook with a different model, you may want to explore some of the tunable parameters that SageMaker, DeepSpeed, and the DJL offer. Iteratively experimenting with these parameters can have a material impact on the latency, throughput, and cost of your hosted large model. To learn more about tuning parameters such as number of workers, degree of tensor parallelism, job queue size, and others, refer to DJLServing configurations and Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

Benchmarking results on hosting FLAN-T5 model on SageMaker

The following summarizes our benchmarking results for the configuration we tested:

Model: flan-t5-xxl
Model partitioning and optimization engine: FasterTransformer
Quantization: FP32
Batch size: 4
Tensor parallel degree: 4
Number of workers: 1
Inference latency P50 (ms): 327.39
Inference latency P90 (ms): 331.01
Inference latency P99 (ms): 612.73
Data quality: Normal

For our benchmark, we used four different types of tasks combined into a single batch and benchmarked the Flan-T5-XXL model. FasterTransformer uses a tensor parallel degree of 4 (the model is partitioned across four accelerator devices on the same host). In our benchmark, FasterTransformer was the most performant in terms of latency and throughput compared to other frameworks for hosting this model. The p99 inference latency was 612 milliseconds.
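
If you want to collect a similar measurement against your own endpoint, a minimal client-side sketch such as the following can report latency percentiles. The endpoint name and payload are placeholders, and client-side numbers include network overhead, so they will differ from server-side metrics.

import json
import time

import boto3
import numpy as np

smr_client = boto3.client("sagemaker-runtime")
payload = json.dumps({"inputs": "Amazon.com is an awesome site", "parameters": {}})

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    smr_client.invoke_endpoint(
        EndpointName="your-endpoint-name",  # placeholder for your endpoint
        Body=payload,
        ContentType="application/json",
    )
    latencies_ms.append((time.perf_counter() - start) * 1000)

# Report the same percentiles as the summary above
for p in (50, 90, 99):
    print(f"p{p}: {np.percentile(latencies_ms, p):.2f} ms")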

Conclusion

In this post, we gave an overview of large model hosting challenges and how SageMaker LMI containers help you address them with low-code/no-code capabilities. We showcased how to host large models with high performance on SageMaker using FasterTransformer and the SageMaker LMI container, demonstrated by deploying a FLAN-T5-XXL model on SageMaker. We also covered the options available to tune the performance of your models using different model optimization approaches, and how SageMaker LMI containers offer low-code/no-code options for hosting and optimizing large models.


About the authors

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Rohith Nallamaddi is a Software Development Engineer at AWS. He works on optimizing deep learning workloads on GPUs, building high performance ML inference and serving solutions. Prior to this, he worked on building microservices based on AWS for Amazon F3 business. Outside of work he enjoys playing and watching sports.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads deep learning model optimization for applications such as large model inference.

Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Pinak Panigrahi works with customers to build machine learning driven solutions to solve strategic business problems on AWS. When not occupied with machine learning, he can be found taking a hike, reading a book or catching up with sports.

Qing Lan is a Software Development Engineer at AWS. He has worked on several challenging products at Amazon, including high-performance ML inference solutions and a high-performance logging system. Qing’s team successfully launched the first billion-parameter model in Amazon Advertising with very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.

Read More

Authoring custom transformations in Amazon SageMaker Data Wrangler using NLTK and SciPy

Authoring custom transformations in Amazon SageMaker Data Wrangler using NLTK and SciPy

“Instead of focusing on the code, companies should focus on developing systematic engineering practices for improving data in ways that are reliable, efficient, and systematic. In other words, companies need to move from a model-centric approach to a data-centric approach.” – Andrew Ng

A data-centric AI approach involves building AI systems with quality data involving data preparation and feature engineering. This can be a tedious task involving data collection, discovery, profiling, cleansing, structuring, transforming, enriching, validating, and securely storing the data.

Amazon SageMaker Data Wrangler is a service in Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data using little to no coding. You can integrate a Data Wrangler data preparation flow into your machine learning (ML) workflows to simplify data preprocessing and feature engineering, taking data preparation to production faster without the need to author PySpark code, install Apache Spark, or spin up clusters.

For scenarios where you need to add your own custom scripts for data transformation, you can write your transformation logic in Pandas, PySpark, or PySpark SQL using the Data Wrangler custom transform capability. Data Wrangler now also supports the NLTK and SciPy libraries for authoring custom transformations to prepare text data for ML and perform constraint optimization.

In this post, we discuss how you can write your custom transformation in NLTK to prepare text data for ML. We also share some example custom code transforms using other common frameworks such as NLTK, NumPy, SciPy, and scikit-learn, as well as AWS AI services. For the purpose of this exercise, we use the Titanic dataset, a popular dataset in the ML community, which has now been added as a sample dataset within Data Wrangler.

Solution overview

Data Wrangler provides over 40 built-in connectors for importing data. After data is imported, you can build your data analysis and transformations using over 300 built-in transformations. You can then generate industrialized pipelines to push the features to Amazon Simple Storage Service (Amazon S3) or Amazon SageMaker Feature Store. The following diagram shows the end-to-end high-level architecture.

Prerequisites

Data Wrangler is a SageMaker feature available within Amazon SageMaker Studio. You can follow the Studio onboarding process to spin up the Studio environment and notebooks. Although you can choose from a few authentication methods, the simplest way to create a Studio domain is to follow the Quick start instructions. The Quick start uses the same default settings as the standard Studio setup. You can also choose to onboard using AWS IAM Identity Center (successor to AWS Single Sign-On) for authentication (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

Import the Titanic dataset

Start your Studio environment and create a new Data Wrangler flow. You can either import your own dataset or use a sample dataset (Titanic) as shown in the following screenshot. Data Wrangler allows you to import datasets from different data sources. For our use case, we import the sample dataset from an S3 bucket.

Once imported, you will see two nodes (the source node and the data type node) in the data flow. Data Wrangler automatically identifies the data type for all the columns in the dataset.

Custom transformations with NLTK

For data preparation and feature engineering with Data Wrangler, you can use over 300 built-in transformations or build your own custom transformations. Custom transforms can be written as separate steps within Data Wrangler. They become part of the .flow file within Data Wrangler. The custom transform feature supports Python, PySpark, and SQL as different steps in code snippets. After notebook files (.ipynb) are generated from the .flow file or the .flow file is used as recipes, the custom transform code snippets persist without requiring any changes. This design of Data Wrangler allows custom transforms to become part of a SageMaker Processing job for processing massive datasets with custom transformations.

The Titanic dataset has a couple of features (name and home.dest) that contain text information. We use NLTK to split the name column, extract the last name, and print the frequency of last names. NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength natural language processing (NLP) libraries.

To add a new transform, complete the following steps:

  1. Choose the plus sign and choose Add Transform.
  2. Choose Add Step and choose Custom Transform.

You can create a custom transform using Pandas, PySpark, Python user-defined functions, or PySpark SQL.

  1. Choose Python (Pandas) and add the following code to extract the last name from the name column:
    import nltk
    nltk.download('punkt')
    tokens = [nltk.word_tokenize(name) for name in df['Name']]
    
    # Extract the last names of the passengers
    df['last_name'] = [token[0] for token in tokens]

  2. Choose Preview to review the results.

The following screenshot shows the last_name column extracted.

  1. Add another custom transform step to identify the frequency distribution of the last names, using the following code:
    import nltk
    fd = nltk.FreqDist(df["last_name"])
    print(fd.most_common(10))

  2. Choose Preview to review the results of the frequency.
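
Beyond printing the most common values, you can also turn this distribution into a feature. The following is a minimal sketch (the last_name_count column name is an assumption) that maps each passenger's last name to its frequency in the dataset:

import nltk

fd = nltk.FreqDist(df["last_name"])
# Map each passenger's last name to how often it appears in the dataset
df["last_name_count"] = df["last_name"].map(lambda name: fd[name])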

Custom transformations with AWS AI services

AWS pre-trained AI services provide ready-made intelligence for your applications and workflows. AWS AI services easily integrate with your applications to address many common use cases. You can now use the capabilities of AWS AI services as a custom transform step in Data Wrangler.

Amazon Comprehend uses NLP to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document.

We use Amazon Comprehend to extract the entities from the name column. Complete the following steps:

  1. Add a custom transform step.
  2. Choose Python (Pandas).
  3. Enter the following code to extract the entities:
    import boto3
    comprehend = boto3.client("comprehend")
    
    response = comprehend.detect_entities(LanguageCode='en', Text=df['name'].iloc[0])
    
    for entity in response['Entities']:
        print(entity['Type'] + ":" + entity['Text'])

  4. Choose Preview and visualize the results.
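
The preceding snippet inspects only the first row. If you want to extract entities for every name in the column, a minimal sketch such as the following can apply the same call row by row and store the detected entity types in a new column. The entity_types column name is an assumption, and calling the API once per row incurs Amazon Comprehend charges and takes longer on large datasets.

import boto3

comprehend = boto3.client("comprehend")

def entity_types(text):
    # Return the entity types Amazon Comprehend detects in a single string
    response = comprehend.detect_entities(LanguageCode="en", Text=text)
    return ", ".join(entity["Type"] for entity in response["Entities"])

df["entity_types"] = df["name"].astype(str).apply(entity_types)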

We have now added three custom transforms in Data Wrangler.

  1. Choose Data Flow to visualize the end-to-end data flow.

Custom transformations with NumPy and SciPy

NumPy is an open-source library for Python offering comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more. SciPy is an open-source Python library used for scientific computing and technical computing, containing modules for optimization, linear algebra, integration, interpolation, special functions, fast Fourier transform (FFT), signal and image processing, solvers, and more.

Data Wrangler custom transforms allow you to combine Python, PySpark, and SQL as different steps. In the following Data Wrangler flow, different functions from Python packages, NumPy, and SciPy are applied on the Titanic dataset as multiple steps.

NumPy transformations

The fare column of the Titanic dataset contains the boarding fares of different passengers. The histogram of the fare column shows a uniform distribution, except for the last bin. By applying NumPy transformations like log or square root, we can change the distribution (as shown by the square root transformation).

See the following code:

import pandas as pd
import numpy as np
df["fare_log"] = np.log(df["fare_interpolate"])
df["fare_sqrt"] = np.sqrt(df["fare_interpolate"])
df["fare_cbrt"] = np.cbrt(df["fare_interpolate"])

SciPy transformations

SciPy functions like z-score are applied as part of the custom transform to standardize fare distribution with mean and standard deviation.

See the following code:

df["fare_zscore"] = zscore(df["fare_interpolate"])
from scipy.stats import zscore
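
For reference, the same standardization can be written directly in Pandas. The following minimal sketch (the fare_zscore_manual column name is an assumption) uses ddof=0 to match SciPy's default population standard deviation:

fare = df["fare_interpolate"]
# Standardize manually: subtract the mean and divide by the population standard deviation
df["fare_zscore_manual"] = (fare - fare.mean()) / fare.std(ddof=0)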

Constraint optimization with NumPy and SciPy

Data Wrangler custom transforms can handle advanced transformations like constraint optimization applying SciPy optimize functions and combining SciPy with NumPy. In the following example, fare as a function of age doesn’t show any observable trend. However, constraint optimization can transform fare as a function of age. The constraint condition in this case is that the new total fare remains the same as the old total fare. Data Wrangler custom transforms allow you to run the SciPy optimize function to determine the optimal coefficient that can transform fare as a function of age under constraint conditions.

The optimization definition, objective definition, and multiple constraints can be specified as different functions while formulating constraint optimization in a Data Wrangler custom transform using SciPy and NumPy. Custom transforms can also use the different solver methods available in the SciPy optimize package. A new transformed variable can be generated by multiplying the optimal coefficient with the original column and adding it to the existing columns of Data Wrangler. See the following code:

import numpy as np
import scipy.optimize as opt
import pandas as pd

df2 = pd.DataFrame({"Y": df["fare_interpolate"], "X1": df["age_interpolate"]})

# optimization definition
def main(df2):
    x0 = [0.1]
    res = opt.minimize(fun=obj, x0=x0, args=(df2,), method="SLSQP", bounds=[(0, 50)], constraints=cons)
    return res

# objective function
def obj(x0, df2):
    sumSquares = np.sum(df2["Y"] - x0 * df2["X1"])
    return sumSquares

# constraints: keep the new total fare equal to the old total fare
def constraint1(x0):
    sum_cons1 = np.sum(df2["Y"] - x0 * df2["X1"]) - 0
    return sum_cons1

con1 = {'type': 'eq', 'fun': constraint1}
cons = [con1]

print(main(df2))

df["new_fare_age_optimized"] = main(df2).x * df2["X1"]

The Data Wrangler custom transform feature has the UI capability to show the results of SciPy optimize functions like value of optimal coefficient (or multiple coefficients).

Custom transformations with scikit-learn

scikit-learn is a Python module for machine learning built on top of SciPy. It’s an open-source ML library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes. One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, preprocessing with a discretizer can introduce nonlinearity to linear models.

In the following code, we use KBinsDiscretizer to discretize the age column into 10 bins:

# Table is available as variable `df`
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
# discretization transform the raw data
df = df.dropna()
kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
ages = np.array(df["age"]).reshape(-1, 1)
df["age"] = kbins.fit_transform(ages)
print(kbins.bin_edges_)

You can see the bin edges printed in the following screenshot.

One-hot encoding

Values in the Embarked column are categorical. Therefore, we have to represent these strings as numerical values in order to perform classification with our model. We could also do this using a one-hot encoding transform.

There are three values for Embarked: S, C, and Q. We represent these with numbers. See the following code:

# Table is available as variable `df`
from sklearn.preprocessing import LabelEncoder

le_embarked = LabelEncoder()
le_embarked.fit(df["embarked"])

encoded_embarked_training = le_embarked.transform(df["embarked"])
df["embarked"] = encoded_embarked_training

Clean up

When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees.

Data Wrangler automatically saves your data flow every 60 seconds. To avoid losing work, save your data flow before shutting Data Wrangler down.

  1. To save your data flow in Studio, choose File, then choose Save Data Wrangler Flow.
  2. To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
  3. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  4. Choose Shut down all to confirm.

Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.

Conclusion

In this post, we demonstrated how you can use custom transformations in Data Wrangler. We used the libraries and frameworks within the Data Wrangler container to extend the built-in data transformation capabilities. The examples in this post cover a subset of the frameworks that can be used. The transformations in the Data Wrangler flow can now be scaled into a pipeline for DataOps.

To learn more about using data flows with Data Wrangler, refer to Create and Use a Data Wrangler Flow and Amazon SageMaker Pricing. To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler. To learn more about Autopilot and AutoML on SageMaker, visit Automate model development with Amazon SageMaker Autopilot.


About the authors

Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience in end-to-end designs and solutions for machine learning; business analytics within financial, operational, and marketing analytics; healthcare; supply chain; and IoT. Outside work, Sovik enjoys traveling and watching movies.

Abigail is a Software Development Engineer at Amazon SageMaker. She is passionate about helping customers prepare their data in Data Wrangler and building distributed machine learning systems. In her free time, Abigail enjoys traveling, hiking, skiing, and baking.

Read More

Overcome the machine learning cold start challenge in fraud detection using Amazon Fraud Detector

Overcome the machine learning cold start challenge in fraud detection using Amazon Fraud Detector

As more businesses increase their online presence to serve their customers better, new fraud patterns are constantly emerging. In today’s ever-evolving digital landscape, where fraudsters are becoming more sophisticated in their tactics, detecting and preventing such fraudulent activities has become paramount for companies and financial institutions.

Traditional rule-based fraud detection systems are capped in their ability to quickly iterate as they rely on predefined rules and thresholds to flag potentially fraudulent activity. These systems can generate a large number of false positives, significantly increasing the volume of manual investigations performed by the fraud team. Furthermore, humans are also error-prone and have limited capacity to process large amounts of data, making manual efforts to detect fraud time-consuming, which can result in missed fraudulent transactions, increased losses, and reputational damage.

Machine learning (ML) plays a crucial role in detecting fraud because it can quickly and accurately analyze large volumes of data to identify anomalous patterns and possible fraud trends. ML fraud model performance relies heavily on the quality of data it is trained on, and, specifically for the supervised models, accurate labeled data is crucial. In ML, a lack of significant historical data to train a model is called the cold start problem.

In the world of fraud detection, the following are some traditional cold start scenarios:

  • Building an accurate fraud model while lacking a history of transactions or fraud cases
  • Being able to accurately distinguish legitimate activity from fraud for new customers and accounts
  • Risk-decisioning payments to an address or beneficiary never seen before by the fraud system

There are multiple ways to solve for these scenarios. For example, you can use generic models, known as one-size-fits-all models, which are typically trained on top of fraud data sharing platforms like fraud consortiums. The challenge with this approach is that no business is equal, and fraud attack vectors change constantly.

Another option is to use an unsupervised anomaly detection model to monitor and surface unusual behavior among customer events. The challenge with this approach is that not all fraud events are anomalies, and not all anomalies are indeed fraud. Therefore, you can expect higher false positive rates.

In this post, we show how you can quickly bootstrap a real-time fraud prevention ML model with as little as 100 events using the new Amazon Fraud Detector Cold Start feature, thereby dramatically lowering the barrier to entry to custom ML models for many organizations that simply don’t have the time or ability to collect and accurately label large datasets. Moreover, we discuss how, by using Amazon Fraud Detector stored events, you can review results and correctly label the events to retrain your models, thereby improving the effectiveness of fraud prevention measures over time.

Solution overview

Amazon Fraud Detector is a fully managed fraud detection service that automates detecting potentially fraudulent activities online. You can use Amazon Fraud Detector to build customized fraud detection models using your own historical dataset, add decision logic using the built-in rules engine, and orchestrate risk decision workflows with a click of a button.

Previously, you had to provide over 10,000 labeled events with at least 400 examples of fraud to train a model. With the release of the Cold Start feature, you can quickly train a model with a minimum of 100 events and at least 50 classified as fraud. Compared with initial data requirements, this is a reduction of 99% in historical data and an 87% reduction in label requirements.

The new Cold Start feature provides intelligent methods for enriching, extending, and risk modeling small sets of data. Moreover, Amazon Fraud Detector performs label assignments and sampling for unlabeled events.

Experiments performed with public datasets show that, by lowering the limits to 50 fraud and only 100 events, you can build fraud ML models that consistently outperform unsupervised and semi-supervised models.

Cold Start model performance

The ability of an ML model to generalize and make accurate predictions on unseen data is impacted by the quality and diversity of the training dataset. For Cold Start models, this is no different. You should have processes in place as more data is collected to correctly label these events and retrain the models, ultimately leading to an optimal model performance.

With a lower data requirement, the instability of reported performance increases due to the increased variance of the model and the limited test data size. To help you build the right expectation of model performance, besides model AUC, Amazon Fraud Detector also reports uncertainty range metrics. The following describes how to interpret the AUC together with its uncertainty interval:

  • AUC uncertainty interval > 0.3: with AUC < 0.6, the model performance is very low and might vary greatly (expect low fraud detection performance); with AUC 0.6–0.8, the model performance is low and might vary greatly (expect limited fraud detection performance); with AUC >= 0.8, the model performance might vary greatly.
  • AUC uncertainty interval 0.1–0.3: with AUC < 0.6, the model performance is very low and might vary significantly (expect low fraud detection performance); with AUC 0.6–0.8, the model performance is low and might vary significantly (expect limited fraud detection performance); with AUC >= 0.8, the model performance might vary significantly.
  • AUC uncertainty interval < 0.1: with AUC < 0.6, the model performance is very low (expect low fraud detection performance); with AUC 0.6–0.8, the model performance is low (expect limited fraud detection performance); with AUC >= 0.8, no warning is shown.

Train a Cold Start model

Training a Cold Start fraud model is identical to training any other Amazon Fraud Detector model; what differs is the dataset size. You can find sample datasets for Cold Start training in our GitHub repo. To train an Amazon Fraud Detector custom model, you can follow our hands-on tutorial. You can either use the Amazon Fraud Detector console tutorial or the SDK tutorial to build, train, and deploy a fraud detection model.

After your model is trained, you can review performance metrics and then deploy it by changing its status to Active. To learn more about model scores and performance metrics, see Model scores and Model performance metrics. At this point, you can now add your model to your detector, add business rules to interpret the risk scores that the model outputs, and make real-time predictions using the GetEventPrediction API.

Fraud ML model continuous improvement and feedback loop

With the Amazon Fraud Detector Cold Start feature, you can quickly bootstrap a fraud detector endpoint and start protecting your businesses immediately. However, new fraud patterns are constantly emerging, so it’s critical to retrain Cold Start models with newer data to improve the accuracy and effectiveness of the predictions over time.

To help you iterate on your models, Amazon Fraud Detector automatically stores all events sent to the service for inference. You can change or validate the event ingestion flag is on at the event type level, as shown in the following screenshot.

With the stored events feature, you can use the Amazon Fraud Detector SDK to programmatically access an event, review the event metadata and the prediction explanation, and make an informed risk decision. Moreover, you can label the event for future model retraining and continuous model improvement. The following diagram shows an example of this workflow.

In the following code snippets, we demonstrate the process to label a stored event:

  • To do a real-time fraud prediction on an event, call the GetEventPrediction API:
import boto3

def get_event_prediction():
    fraudDetector = boto3.client('frauddetector')
    
    prediction = fraudDetector.get_event_prediction(
        detectorId='your_detector_name',
        detectorVersionId='1',
        eventId='my-event-id-1234',
        eventTypeName='your_event_type',
        entities=[
            {
                'entityType': 'user',
                'entityId': 'A12345'
            },
        ],
        eventTimestamp= '2023-03-23T21:42:03.658Z',
        eventVariables={
            'email': 'test@anymockcompany.com',
            'ip': '123.123.123.123',
            'card_bin': '400022',
            'billing_zip': '50401'
        }
    )
    return(prediction)

API Response:
{
  "modelScores": [
    {
      "modelVersion": {
        "modelId": "your_model_name",
        "modelType": "TRANSACTION_FRAUD_INSIGHTS",
        "modelVersionNumber": "1.0"
      },
      "scores": {
        "your_model_insightscore": 932
      }
    }
  ],
  "ruleResults": [
    {
      "ruleId": "high_risk_score",
      "outcomes": [
        "high_risk_send_for_manual_review"
      ]
    }
  ]
}

As seen in the response, based on the decision engine rule matched, the event should be sent for manual review by the fraud team. By gathering the prediction explanation metadata, you can gain insights into how each event variable impacted the model’s fraud prediction score.

  • To collect these insights, we use the get_event_prediction_metadata API:
import boto3

def get_event_prediction_metadata(event, context):
    fraudDetector = boto3.client('frauddetector')
    
    prediction = fraudDetector.get_event_prediction_metadata(
        eventId = 'my-event-id-1234',
        eventTypeName = 'your_event_type',
        predictionTimestamp = '2023-03-23T21:44:39.318Z',
        detectorId = 'your_detector_name',
        detectorVersionId = '1'
    )
    return(prediction)

API Response:

{
  "modelScores": [
    {
      "modelVersion": {
        "modelId": "your_model_name",
        "modelType": "TRANSACTION_FRAUD_INSIGHTS",
        "modelVersionNumber": "1.0"
      },
      "scores": {
        "your_model_insightscore": 932
      }
    }
  ],
  "ruleResults": [
    {
      "ruleId": "high_risk_score",
      "outcomes": [
        "high_risk_send_for_manual_review"
      ]
    }
  ]
}


{
  "eventId": "my-event-id-1234",
  …
  <REDACTED>
  …
  "eventVariables": [
    {
      "name": "ip",
      "value": "123.123.123.123"
    },
    {
      "name": "billing_zip",
      "value": "50401"
    },
    {
      "name": "email",
      "value": "test@anymockcompany.com"
    },
    {
      "name": "card_bin",
      "value": "400022"
    }
  ],
…
 <REDACTED>
…
   "evaluations": [
        {
          "evaluationScore": "932.0",
          "predictionExplanations": {
            "variableImpactExplanations": [
              {
                "eventVariableName": "billing_zip",
                "relativeImpact": "1",
                "logOddsImpact": 1.018196990713477135
              },
              {
                "eventVariableName": "ip",
                "relativeImpact": "0",
                "logOddsImpact": -0.23122438788414001
              },
              {
                "eventVariableName": "email",
                "relativeImpact": "0",
                "logOddsImpact": 0.004304269328713417
              },
              {
                "eventVariableName": "card_bin",
                "relativeImpact": "0",
                "logOddsImpact": -0.011150157079100609
              } 
           ],
}

With these insights, the fraud analyst can make an informed risk decision about the event in question and update the event label.

  • To update the event label, call the update_event_label API:
import boto3

def update_event_label(event, context):
    fraudDetector = boto3.client('frauddetector')
    
    prediction = fraudDetector.update_event_label(
        eventId = "my-event-id-1234",
        eventTypeName = "your_event_type",
        assignedLabel='1', # Fraud
        labelTimestamp='2023-03-25T11:20:03.658Z'
    )
    
    return(prediction)

API Response

{
  "ResponseMetadata": {
    "RequestId": "3e28caa0-2a06-4b8d-9a10-9081811bf22d",
    "HTTPStatusCode": 200,
    …
     <REDACTED>
    …

    "RetryAttempts": 0
  }
}

As a final step, you can verify that the event label was correctly updated.

  • To verify the event label, call the get_event API:
import boto3

def get_event():
    fraudDetector = boto3.client('frauddetector')
    
    event = fraudDetector.get_event(
        eventId='my-event-id-1234',
        eventTypeName='your_event_type'
    )
    
    return(event)

API Response

{
  "event": {
    "eventId": "my-event-id-1234",
    "eventTimestamp": "2023-03-23T21:42:03.658Z",
    "eventVariables": {
      "billing_zip": "50401",
      "card_bin": "400022",
      "email": "test@anymockcompany.com",
      "ip": "123.123.123.123"
    },
    "currentLabel": "1",
    "labelTimestamp": "2023-03-25T11:20:03.658Z",
    "entities": [
      {
        "entityType": "user",
        "entityId": "A12345"
      }
    ]
  }
}

Clean up

To avoid incurring future charges, delete the resources created for the solution.

Conclusion

This post demonstrated how you can quickly bootstrap a real-time fraud prevention system with as few as 100 events using the new Amazon Fraud Detector Cold Start feature. We discussed how you can use stored events to review results, correctly label the events, and retrain your models, improving the effectiveness of fraud prevention measures over time.

Fully managed AWS services such as Amazon Fraud Detector help reduce the time businesses spend analyzing user behavior to identify fraud in their platforms and focus more on driving business value. To learn more about how Amazon Fraud Detector can help your business, visit Amazon Fraud Detector.


About the Authors

Marcel Pividal is a Global Sr. AI Services Solutions Architect in the World-Wide Specialist Organization. Marcel has more than 20 years of experience solving business problems through technology for FinTechs, payment providers, pharma, and government agencies. His current areas of focus are risk management, fraud prevention, and identity verification.

Julia Xu is a Research Scientist with Amazon Fraud Detector. She is passionate about solving customer challenges using machine learning techniques. In her free time, she enjoys hiking, painting, and exploring new coffee shops.

Guilherme Ricci is a Senior Solution Architect at AWS, helping Startups to modernize and optimize the costs of their applications. With over 10 years of experience with companies in the financial sector, he is currently working together with the team of AI/ML specialists.

Read More

Connect Amazon EMR and RStudio on Amazon SageMaker

Connect Amazon EMR and RStudio on Amazon SageMaker

RStudio on Amazon SageMaker is the industry’s first fully managed RStudio Workbench integrated development environment (IDE) in the cloud. You can quickly launch the familiar RStudio IDE and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale.

In conjunction with tools like RStudio on SageMaker, users are analyzing, transforming, and preparing large amounts of data as part of the data science and ML workflow. Data scientists and data engineers use Apache Spark, Hive, and Presto running on Amazon EMR for large-scale data processing. Using RStudio on SageMaker and Amazon EMR together, you can continue to use the RStudio IDE for analysis and development, while using Amazon EMR managed clusters for larger data processing.

In this post, we demonstrate how you can connect your RStudio on SageMaker domain with an EMR cluster.

Solution overview

We use an Apache Livy connection to submit a sparklyr job from RStudio on SageMaker to an EMR cluster. This is demonstrated in the following diagram.

Scope of Solution
All code demonstrated in the post is available in our GitHub repository. We implement the following solution architecture.

Prerequisites

Prior to deploying any resources, make sure you have all the requirements for setting up and using RStudio on SageMaker and Amazon EMR.

We’ll also build a custom RStudio on SageMaker image, so make sure Docker is running and you have all required permissions. For more information, refer to Use a custom image to bring your own development environment to RStudio on Amazon SageMaker.

Create resources with AWS CloudFormation

We use an AWS CloudFormation stack to generate the required infrastructure.

If you already have an RStudio domain and an existing EMR cluster, you can skip this step and start building your custom RStudio on SageMaker image. Substitute the information of your EMR cluster and RStudio domain in place of the EMR cluster and RStudio domain created in this section.

Launching this stack creates the following resources:

  • Two private subnets
  • EMR Spark cluster
  • AWS Glue database and tables
  • SageMaker domain with RStudio
  • SageMaker RStudio user profile
  • IAM service role for the SageMaker RStudio domain
  • IAM service role for the SageMaker RStudio user profile

Complete the following steps to create your resources:

Choose Launch Stack to create the stack.

  1. On the Create stack page, choose Next.
  2. On the Specify stack details page, provide a name for your stack and leave the remaining options as default, then choose Next.
  3. On the Configure stack options page, leave the options as default and choose Next.
  4. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  5. Choose Create stack.

The template generates five stacks.

To see the EMR Spark cluster that was created, navigate to the Amazon EMR console. You will see a cluster created for you called sagemaker. This is the cluster we connect to through RStudio on SageMaker.

Build the custom RStudio on SageMaker image

We have created a custom image that will install all the dependencies of sparklyr, and will establish a connection to the EMR cluster we created.

If you’re using your own EMR cluster and RStudio domain, modify the scripts accordingly.

Make sure Docker is running. Start by navigating into the project repository and running the build script:

cd sagemaker-rstudio-emr/sparklyr-image
./build-r-image.sh

With the image built, we now register it to our RStudio on SageMaker domain.

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the rstudio-domain domain.
  3. On the Environment tab, choose Attach image.

    Now we attach the sparklyr image that we created earlier to the domain.
  4. For Choose image source, select Existing image.
  5. Select the sparklyr image we built.
  6. For Image properties, leave the options as default.
  7. For Image type, select RStudio image.
  8. Choose Submit.

    Validate the image has been added to the domain. It may take a few minutes for the image to attach fully.
  9. When it’s available, log in to the RStudio on SageMaker console using the rstudio-user profile that was created.
  10. From here, create a session with the sparklyr image that we created earlier.

    First, we have to connect to our EMR cluster.
  11. In the connections pane, choose New Connection.
  12. Select the EMR cluster connect code snippet and choose Connect to Amazon EMR Cluster.

    After the connect code has run, you will see a Spark connection through Livy, but no tables.
  13. Change the database to credit_card:
    tbl_change_db(sc, "credit_card")
  14. Choose Refresh Connection Data.
    You can now see the tables.
  15. Now navigate to the rstudio-sparklyr-code-walkthrough.md file.

This has a set of Spark transformations we can use on our credit card dataset to prepare it for modeling. The following code is an excerpt:

Let’s count() how many transactions are in the transactions table. But first we need to reference the tables with the tbl() function.

users_tbl &amp;lt;- tbl(sc, "users")
cards_tbl &amp;lt;- tbl(sc, "cards")
transactions_tbl &amp;lt;- tbl(sc, "transactions")

Let’s run a count of the number of rows for each table.

count(users_tbl)
count(cards_tbl)
count(transactions_tbl)

Now let’s register our tables as Spark Data Frames and pull them into the cluster-wide in memory cache for better performance. We will also filter the header that gets placed in the first row for each table.

users_tbl &lt;- tbl(sc, 'users') %&gt;%
  filter(gender != 'Gender')
sdf_register(users_tbl, "users_spark")
tbl_cache(sc, 'users_spark')
users_sdf &lt;- tbl(sc, 'users_spark')

cards_tbl &lt;- tbl(sc, 'cards') %&gt;%
  filter(expire_date != 'Expires')
sdf_register(cards_tbl, "cards_spark")
tbl_cache(sc, 'cards_spark')
cards_sdf &lt;- tbl(sc, 'cards_spark')

transactions_tbl &lt;- tbl(sc, 'transactions') %&gt;%
  filter(amount != 'Amount')
sdf_register(transactions_tbl, "transactions_spark")
tbl_cache(sc, 'transactions_spark')
transactions_sdf &lt;- tbl(sc, 'transactions_spark')

To see the full list of commands, refer to the rstudio-sparklyr-code-walkthrough.md file.

Clean up

To clean up and avoid incurring recurring costs, delete the root CloudFormation stack. Also delete all Amazon Elastic File System (Amazon EFS) mounts and any Amazon Simple Storage Service (Amazon S3) buckets and objects that were created.

Conclusion

The integration of RStudio on SageMaker with Amazon EMR provides a powerful solution for data analysis and modeling tasks in the cloud. By connecting RStudio on SageMaker and establishing a Livy connection to Spark on EMR, you can take advantage of the computing resources of both platforms for efficient processing of large datasets. RStudio, one of the most widely used IDEs for data analysis, allows you to take advantage of the fully managed infrastructure, access control, networking, and security capabilities of SageMaker. Meanwhile, the Livy connection to Spark on Amazon EMR provides a way to perform distributed processing and scaling of data processing tasks.

If you’re interested in learning more about using these tools together, this post serves as a starting point. For more information, refer to RStudio on Amazon SageMaker. If you have any suggestions or feature improvements, please create a pull request on our GitHub repo or leave a comment on this post!


About the Authors

Ryan Garner is a Data Scientist with AWS Professional Services. He is passionate about helping AWS customers use R to solve their Data Science and Machine Learning problems.


Raj Pathak is a Senior Solutions Architect and Technologist specializing in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM), and Machine Learning infrastructure and operations projects (MLOps).

Saiteja Pudi is a Solutions Architect at AWS, based in Dallas, TX. He has been with AWS for more than 3 years, helping customers derive the true potential of AWS by being their trusted advisor. He comes from an application development background, and is interested in Data Science and Machine Learning.

Read More

Announcing New Tools for Building with Generative AI on AWS

Announcing New Tools for Building with Generative AI on AWS

The seeds of a machine learning (ML) paradigm shift have existed for decades, but with the ready availability of scalable compute capacity, a massive proliferation of data, and the rapid advancement of ML technologies, customers across industries are transforming their businesses. Just recently, generative AI applications like ChatGPT have captured widespread attention and imagination. We are truly at an exciting inflection point in the widespread adoption of ML, and we believe most customer experiences and applications will be reinvented with generative AI.

AI and ML have been a focus for Amazon for over 20 years, and many of the capabilities customers use with Amazon are driven by ML. Our e-commerce recommendations engine is driven by ML; the paths that optimize robotic picking routes in our fulfillment centers are driven by ML; and our supply chain, forecasting, and capacity planning are informed by ML. Prime Air (our drones) and the computer vision technology in Amazon Go (our physical retail experience that lets consumers select items off a shelf and leave the store without having to formally check out) use deep learning. Alexa, powered by more than 30 different ML systems, helps customers billions of times each week to manage smart homes, shop, get information and entertainment, and more. We have thousands of engineers at Amazon committed to ML, and it’s a big part of our heritage, current ethos, and future.

At AWS, we have played a key role in democratizing ML and making it accessible to anyone who wants to use it, including more than 100,000 customers of all sizes and industries. AWS has the broadest and deepest portfolio of AI and ML services at all three layers of the stack. We’ve invested and innovated to offer the most performant, scalable infrastructure for cost-effective ML training and inference; developed Amazon SageMaker, which is the easiest way for all developers to build, train, and deploy models; and launched a wide range of services that allow customers to add AI capabilities like image recognition, forecasting, and intelligent search to applications with a simple API call. This is why customers like Intuit, Thomson Reuters, AstraZeneca, Ferrari, Bundesliga, 3M, and BMW, as well as thousands of startups and government agencies around the world, are transforming themselves, their industries, and their missions with ML. We take the same democratizing approach to generative AI: we work to take these technologies out of the realm of research and experiments and extend their availability far beyond a handful of startups and large, well-funded tech companies. That’s why today I’m excited to announce several new innovations that will make it easy and practical for our customers to use generative AI in their businesses.

Building with Generative AI on AWS

Building with Generative AI on AWS

Generative AI and foundation models

Generative AI is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. Like all AI, generative AI is powered by ML models—very large models that are pre-trained on vast amounts of data and commonly referred to as Foundation Models (FMs). Recent advancements in ML (specifically the invention of the transformer-based neural network architecture) have led to the rise of models that contain billions of parameters or variables. To give a sense for the change in scale, the largest pre-trained model in 2019 was 330M parameters. Now, the largest models are more than 500B parameters—a 1,600x increase in size in just a few years. Today’s FMs, such as the large language models (LLMs) GPT3.5 or BLOOM, and the text-to-image model Stable Diffusion from Stability AI, can perform a wide range of tasks that span multiple domains, like writing blog posts, generating images, solving math problems, engaging in dialog, and answering questions based on a document. The size and general-purpose nature of FMs make them different from traditional ML models, which typically perform specific tasks, like analyzing text for sentiment, classifying images, and forecasting trends.

FMs can perform so many more tasks because they contain such a large number of parameters that make them capable of learning complex concepts. And through their pre-training exposure to internet-scale data in all its various forms and myriad of patterns, FMs learn to apply their knowledge within a wide range of contexts. While the capabilities and resulting possibilities of a pre-trained FM are amazing, customers get really excited because these generally capable models can also be customized to perform domain-specific functions that are differentiating to their businesses, using only a small fraction of the data and compute required to train a model from scratch. The customized FMs can create a unique customer experience, embodying the company’s voice, style, and services across a wide variety of consumer industries, like banking, travel, and healthcare. For instance, a financial firm that needs to auto-generate a daily activity report for internal circulation using all the relevant transactions can customize the model with proprietary data, which will include past reports, so that the FM learns how these reports should read and what data was used to generate them.

The potential of FMs is incredibly exciting. But, we are still in the very early days. While ChatGPT has been the first broad generative AI experience to catch customers’ attention, most folks studying generative AI have quickly come to realize that several companies have been working on FMs for years, and there are several different FMs available—each with unique strengths and characteristics. As we’ve seen over the years with fast-moving technologies, and in the evolution of ML, things change rapidly. We expect new architectures to arise in the future, and this diversity of FMs will set off a wave of innovation. We’re already seeing new application experiences never seen before. AWS customers have asked us how they can quickly take advantage of what is out there today (and what is likely coming tomorrow) and quickly begin using FMs and generative AI within their businesses and organizations to drive new levels of productivity and transform their offerings.

Announcing Amazon Bedrock and Amazon Titan models, the easiest way to build and scale generative AI applications with FMs

Customers have told us there are a few big things standing in their way today. First, they need a straightforward way to find and access high-performing FMs that give outstanding results and are best-suited for their purposes. Second, customers want integration into applications to be seamless, without having to manage huge clusters of infrastructure or incur large costs. Finally, customers want it to be easy to take the base FM, and build differentiated apps using their own data (a little data or a lot). Since the data customers want to use for customization is incredibly valuable IP, they need it to stay completely protected, secure, and private during that process, and they want control over how their data is shared and used.

We took all of that feedback from customers, and today we are excited to announce Amazon Bedrock, a new service that makes FMs from AI21 Labs, Anthropic, Stability AI, and Amazon accessible via an API. Bedrock is the easiest way for customers to build and scale generative AI-based applications using FMs, democratizing access for all builders. Bedrock will offer the ability to access a range of powerful FMs for text and images—including Amazon’s Titan FMs, which consist of two new LLMs we’re also announcing today—through a scalable, reliable, and secure AWS managed service. With Bedrock’s serverless experience, customers can easily find the right model for what they’re trying to get done, get started quickly, privately customize FMs with their own data, and easily integrate and deploy them into their applications using the AWS tools and capabilities they are familiar with (including integrations with Amazon SageMaker ML features like Experiments to test different models and Pipelines to manage their FMs at scale) without having to manage any infrastructure.

Bedrock customers can choose from some of the most cutting-edge FMs available today. This includes the Jurassic-2 family of multilingual LLMs from AI21 Labs, which follow natural language instructions to generate text in Spanish, French, German, Portuguese, Italian, and Dutch. Claude, Anthropic’s LLM, can perform a wide variety of conversational and text processing tasks and is based on Anthropic’s extensive research into training honest and responsible AI systems. Bedrock also makes it easy to access Stability AI’s suite of text-to-image foundation models, including Stable Diffusion (the most popular of its kind), which is capable of generating unique, realistic, high-quality images, art, logos, and designs.

One of the most important capabilities of Bedrock is how easy it is to customize a model. Customers simply point Bedrock at a few labeled examples in Amazon S3, and the service can fine-tune the model for a particular task without having to annotate large volumes of data (as few as 20 examples is enough). Imagine a content marketing manager who works at a leading fashion retailer and needs to develop fresh, targeted ad and campaign copy for an upcoming new line of handbags. To do this, they provide Bedrock a few labeled examples of their best performing taglines from past campaigns, along with the associated product descriptions, and Bedrock will automatically start generating effective social media, display ad, and web copy for the new handbags. None of the customer’s data is used to train the underlying models, and since all data is encrypted and does not leave a customer’s Virtual Private Cloud (VPC), customers can trust that their data will remain private and confidential.

Bedrock is now in limited preview, and customers like Coda are excited about how fast their development teams have gotten up and running. Shishir Mehrotra, Co-founder and CEO of Coda, says, “As a longtime happy AWS customer, we’re excited about how Amazon Bedrock can bring quality, scalability, and performance to Coda AI. Since all our data is already on AWS, we are able to quickly incorporate generative AI using Bedrock, with all the security and privacy we need to protect our data built-in. With over tens of thousands of teams running on Coda, including large teams like Uber, the New York Times, and Square, reliability and scalability are really important.”

We have been previewing Amazon’s new Titan FMs with a few customers before we make them available more broadly in the coming months. We’ll initially have two Titan models. The first is a generative LLM for tasks such as summarization, text generation (for example, creating a blog post), classification, open-ended Q&A, and information extraction. The second is an embeddings LLM that translates text inputs (words, phrases or possibly large units of text) into numerical representations (known as embeddings) that contain the semantic meaning of the text. While this LLM will not generate text, it is useful for applications like personalization and search because by comparing embeddings the model will produce more relevant and contextual responses than word matching. In fact, Amazon.com’s product search capability uses a similar embeddings model among others to help customers find the products they’re looking for. To continue supporting best practices in the responsible use of AI, Titan FMs are built to detect and remove harmful content in the data that customers provide for customization, reject inappropriate content in the user input, and filter the models’ outputs that contain inappropriate content (such as hate speech, profanity, and violence).

Bedrock makes the power of FMs accessible to companies of all sizes so that they can accelerate the use of ML across their organizations and build their own generative AI applications because it will be easy for all developers. We think Bedrock will be a massive step forward in democratizing FMs, and our partners like Accenture, Deloitte, Infosys, and Slalom are building practices to help enterprises go faster with generative AI. Independent Software Vendors (ISVs) like C3 AI and Pega are excited to leverage Bedrock for easy access to its great selection of FMs with all of the security, privacy, and reliability they expect from AWS.

Announcing the general availability of Amazon EC2 Trn1n instances powered by AWS Trainium and Amazon EC2 Inf2 instances powered by AWS Inferentia2, the most cost-effective cloud infrastructure for generative AI

Whatever customers are trying to do with FMs—running them, building them, customizing them—they need the most performant, cost-effective infrastructure that is purpose-built for ML. Over the last five years, AWS has been investing in our own silicon to push the envelope on performance and price performance for demanding workloads like ML training and Inference, and our AWS Trainium and AWS Inferentia chips offer the lowest cost for training models and running inference in the cloud. This ability to maximize performance and control costs by choosing the optimal ML infrastructure is why leading AI startups, like AI21 Labs, Anthropic, Cohere, Grammarly, Hugging Face, Runway, and Stability AI run on AWS.

Trn1 instances, powered by Trainium, can deliver up to 50% savings on training costs over any other EC2 instance, and are optimized to distribute training across multiple servers connected with 800 Gbps of second-generation Elastic Fabric Adapter (EFA) networking. Customers can deploy Trn1 instances in UltraClusters that can scale up to 30,000 Trainium chips (more than 6 exaflops of compute) located in the same AWS Availability Zone with petabit scale networking. Many AWS customers, including Helixon, Money Forward, and the Amazon Search team, use Trn1 instances to help reduce the time required to train the largest-scale deep learning models from months to weeks or even days while lowering their costs. 800 Gbps is a lot of bandwidth, but we have continued to innovate to deliver more, and today we are announcing the general availability of new, network-optimized Trn1n instances, which offer 1600 Gbps of network bandwidth and are designed to deliver 20% higher performance over Trn1 for large, network-intensive models.

Today, most of the time and money spent on FMs goes into training them. This is because many customers are only just starting to deploy FMs into production. However, in the future, when FMs are deployed at scale, most costs will be associated with running the models and doing inference. While you typically train a model periodically, a production application can be constantly generating predictions, known as inferences, potentially generating millions per hour. And these predictions need to happen in real-time, which requires very low-latency and high-throughput networking. Alexa is a great example with millions of requests coming in every minute, which accounts for 40% of all compute costs.

Because we knew that most of the future ML costs would come from running inferences, we prioritized inference-optimized silicon when we started investing in new chips a few years ago. In 2018, we announced Inferentia, the first purpose-built chip for inference. Every year, Inferentia helps Amazon run trillions of inferences and has saved companies like Amazon over a hundred million dollars in capital expense already. The results are impressive, and we see many opportunities to keep innovating as workloads will only increase in size and complexity as more customers integrate generative AI into their applications.

That’s why we’re announcing today the general availability of Inf2 instances powered by AWS Inferentia2, which are optimized specifically for large-scale generative AI applications with models containing hundreds of billions of parameters. Inf2 instances deliver up to 4x higher throughput and up to 10x lower latency compared to the prior generation Inferentia-based instances. They also have ultra-high-speed connectivity between accelerators to support large-scale distributed inference. These capabilities drive up to 40% better inference price performance than other comparable Amazon EC2 instances and the lowest cost for inference in the cloud. Customers like Runway are seeing up to 2x higher throughput with Inf2 than comparable Amazon EC2 instances for some of their models. This high-performance, low-cost inference will enable Runway to introduce more features, deploy more complex models, and ultimately deliver a better experience for the millions of creators using Runway.

Announcing the general availability of Amazon CodeWhisperer, free for individual developers

We know that building with the right FMs and running generative AI applications at scale on the most performant cloud infrastructure will be transformative for customers. The new wave of experiences will also be transformative for users. With generative AI built in, users will be able to have more natural and seamless interactions with applications and systems. Think of how we can unlock our mobile phones just by looking at them, without needing to know anything about the powerful ML models that make this feature possible.

One area where we foresee the use of generative AI growing rapidly is in coding. Software developers today spend a significant amount of their time writing code that is pretty straightforward and undifferentiated. They also spend a lot of time trying to keep up with a complex and ever-changing tool and technology landscape. All of this leaves developers less time to develop new, innovative capabilities and services. Developers try to overcome this by copying and modifying code snippets from the web, which can result in inadvertently copying code that doesn’t work, contains security vulnerabilities, or doesn’t track usage of open source software. And, ultimately, searching and copying still takes time away from the good stuff.

Generative AI can take this heavy lifting out of the equation by “writing” much of the undifferentiated code, allowing developers to build faster while freeing them up to focus on the more creative aspects of coding. This is why, last year, we announced the preview of Amazon CodeWhisperer, an AI coding companion that uses an FM under the hood to radically improve developer productivity by generating code suggestions in real time based on developers’ comments in natural language and prior code in their Integrated Development Environment (IDE). Developers can simply tell CodeWhisperer to do a task, such as “parse a CSV string of songs” and ask it to return a structured list based on values such as artist, title, and highest chart rank. CodeWhisperer provides a productivity boost by generating an entire function that parses the string and returns the list as specified. Developer response to the preview has been overwhelmingly positive, and we continue to believe that helping developers code could end up being one of the most powerful uses of generative AI we’ll see in the coming years. During the preview, we ran a productivity challenge, and participants who used CodeWhisperer completed tasks 57% faster, on average, and were 27% more likely to complete them successfully than those who didn’t use CodeWhisperer. This is a giant leap forward in developer productivity, and we believe this is only the beginning.
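
To make the example concrete, here is a minimal sketch of the kind of function such a comment might produce. The CSV column names and the sort order are illustrative assumptions, not output captured from CodeWhisperer.

```python
import csv
from io import StringIO

# Parse a CSV string of songs and return a structured list with artist,
# title, and highest chart rank (column names here are assumed).
def parse_songs(csv_string):
    reader = csv.DictReader(StringIO(csv_string))
    songs = [
        {
            "artist": row["artist"],
            "title": row["title"],
            "highest_chart_rank": int(row["highest_chart_rank"]),
        }
        for row in reader
    ]
    # Sort by chart rank so the best-performing songs come first
    return sorted(songs, key=lambda song: song["highest_chart_rank"])
```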

Today, we’re excited to announce the general availability of Amazon CodeWhisperer for Python, Java, JavaScript, TypeScript, and C#—plus ten new languages, including Go, Kotlin, Rust, PHP, and SQL. CodeWhisperer can be accessed from IDEs such as VS Code, IntelliJ IDEA, AWS Cloud9, and many more via the AWS Toolkit IDE extensions. CodeWhisperer is also available in the AWS Lambda console. In addition to learning from the billions of lines of publicly available code, CodeWhisperer has been trained on Amazon code. We believe CodeWhisperer is now the most accurate, fastest, and most secure way to generate code for AWS services, including Amazon EC2, AWS Lambda, and Amazon S3.

Developers aren’t truly going to be more productive if code suggested by their generative AI tool contains hidden security vulnerabilities or fails to handle open source responsibly. CodeWhisperer is the only AI coding companion with built-in security scanning (powered by automated reasoning) for finding and suggesting remediations for hard-to-detect vulnerabilities, such as those in the Open Worldwide Application Security Project (OWASP) top ten, those that don’t meet crypto library best practices, and others. To help developers code responsibly, CodeWhisperer filters out code suggestions that might be considered biased or unfair, and CodeWhisperer is the only coding companion that can filter and flag code suggestions that resemble open source code that customers may want to reference or license for use.

We know generative AI is going to change the game for developers, and we want it to be useful to as many as possible. This is why CodeWhisperer is free for all individual users with no qualifications or time limits for generating code! Anyone can sign up for CodeWhisperer with just an email account and become more productive within minutes. You don’t even have to have an AWS account. For business users, we’re offering a CodeWhisperer Professional Tier that includes administration features like single sign-on (SSO) with AWS Identity and Access Management (IAM) integration, as well as higher limits on security scanning.

Building powerful applications like CodeWhisperer is transformative for developers and all our customers. We have a lot more coming, and we are excited about what you will build with generative AI on AWS. Our mission is to make it possible for developers of all skill levels and for organizations of all sizes to innovate using generative AI. This is just the beginning of what we believe will be the next wave of ML powering new possibilities for you.

Resources

Check out the following resources to learn more about generative AI on AWS and these announcements:

About the author

Swami Sivasubramanian is Vice President of Data and Machine Learning at AWS. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.

Read More

How Accenture is using Amazon CodeWhisperer to improve developer productivity

Amazon CodeWhisperer is an AI coding companion that helps improve developer productivity by generating code recommendations based on their comments in natural language and code in the integrated development environment (IDE). CodeWhisperer accelerates completion of coding tasks by reducing context-switches between the IDE and documentation or developer forums. With real-time code recommendations from CodeWhisperer, you can stay focused in the IDE and finish your coding tasks faster.

CodeWhisperer is powered by a Large Language Model (LLM) that is trained on billions of lines of code, and as a result, has learned how to write code in 15 programming languages. You can simply write a comment that outlines a specific task in plain English, such as “upload a file to S3.” Based on this, CodeWhisperer automatically determines which cloud services and public libraries are best suited for the specified task, builds the specific code on the fly, and recommends the generated code snippets directly in the IDE. Moreover, CodeWhisperer seamlessly integrates with your Visual Studio Code and JetBrains IDEs so that you can stay focused and never leave the IDE. At the time of this writing, CodeWhisperer supports Java, Python, JavaScript, TypeScript, C#, Go, Ruby, Rust, Scala, Kotlin, PHP, C, C++, Shell, and SQL.
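
As an illustration of that flow, the snippet below sketches the kind of code a comment like “upload a file to S3” might lead to. The function name and parameters are assumptions for the example; actual suggestions depend on the surrounding code in your IDE.

```python
import boto3

# Upload a local file to an S3 bucket (the object key defaults to the file name)
def upload_file_to_s3(file_name, bucket, object_name=None):
    s3_client = boto3.client("s3")
    if object_name is None:
        object_name = file_name
    s3_client.upload_file(file_name, bucket, object_name)
```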

In this post, we illustrate how Accenture uses CodeWhisperer in practice to improve developer productivity.

“Accenture is using Amazon CodeWhisperer to accelerate coding as part of our software engineering best practices initiative in our Velocity platform,” says Balakrishnan Viswanathan, Senior Manager, Tech Architecture at Accenture. “The Velocity team was looking for ways to improve developer productivity. After searching for multiple options, we came across Amazon CodeWhisperer to reduce our development efforts by 30% and we are now focusing more on improving security, quality, and performance.”

Benefits of CodeWhisperer

The Accenture Velocity team has been using CodeWhisperer to accelerate their artificial intelligence (AI) and machine learning (ML) projects. The following summary highlights the benefits:

  • The team is spending less time creating boilerplate and repetitive code patterns, and more time on what matters: building great software
  • CodeWhisperer empowers developers to responsibly use AI to create syntactically correct and secure applications
  • The team can generate entire functions and logical code blocks without having to search for and customize code snippets from the web
  • They can accelerate onboarding for novice developers or developers working with an unfamiliar codebase
  • They can detect security threats early in the development process by shifting the security scanning left to the developer’s IDE

In the following sections, we discuss some of the ways that the Accenture Velocity team has been using CodeWhisperer in more detail.

Onboarding developers on new projects

CodeWhisperer helps developers unfamiliar with AWS to ramp up faster on projects that use AWS services. New developers in Accenture were able to write code for AWS services such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. In a short amount of time, they were able to be productive and contribute to the project. CodeWhisperer assisted developers by providing code blocks or line-by-line suggestions. It is also context-aware. Changing the instructions (comments) to be more specific results in CodeWhisperer generating more relevant code.
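
As a hypothetical illustration of that context awareness, the sketch below shows the kind of difference a more specific comment can make. The table and attribute names are made up for the example, not taken from an Accenture project.

```python
import boto3

# A vague comment such as "write to DynamoDB" tends to yield a generic call.
# A specific comment like the one below gives the assistant enough context to
# suggest code that matches the table and attributes you have in mind
# (all names here are hypothetical).

# Store a user record in the 'Users' DynamoDB table with id, name, and email
def store_user(user_id, name, email):
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("Users")
    table.put_item(Item={"id": user_id, "name": name, "email": email})
```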

Writing boilerplate code

Developers were able to use CodeWhisperer to complete prerequisites. They were able to create a preprocessing data class just by typing “class to create preprocessing script for ML data.” Writing the preprocessing script took only a couple of minutes, and CodeWhisperer was able to generate entire code blocks.
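
The following sketch shows roughly what such a generated preprocessing class could look like. The column handling and the choice of pandas and scikit-learn are assumptions for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Class to create a preprocessing script for ML data
class DataPreprocessor:
    def __init__(self, numeric_columns):
        self.numeric_columns = numeric_columns
        self.scaler = StandardScaler()

    def load(self, path):
        # Read the raw dataset from a CSV file
        return pd.read_csv(path)

    def clean(self, df):
        # Drop duplicate rows and fill missing numeric values with the median
        df = df.drop_duplicates()
        df[self.numeric_columns] = df[self.numeric_columns].fillna(
            df[self.numeric_columns].median()
        )
        return df

    def scale(self, df):
        # Standardize numeric features to zero mean and unit variance
        df[self.numeric_columns] = self.scaler.fit_transform(df[self.numeric_columns])
        return df
```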

Helping developers code in unfamiliar languages

A Java user new to the team was able to easily start writing Python code with the help of CodeWhisperer without worrying about the syntax.

Detecting security vulnerabilities in the code

Developers were able to detect security issues by choosing Run security scan in their IDE. Detailed insights on the security issues found are provided directly in the IDE. This helps developers detect and fix issues early.

“As a developer, using CodeWhisperer enables you to write code more quickly,” says Nino Leenus, AI Engineering Consultant at Accenture. “In addition, CodeWhisperer will help you code more accurately by eliminating typos and other typical errors with the aid of artificial intelligence. For a developer, writing the same code multiple times is tedious. By recommending the subsequent code pieces that you may need, AI code completion technologies reduce such repetitious coding.”

Conclusion

This post introduces CodeWhisperer, an AI coding companion by Amazon. The tool uses ML models trained on large datasets to provide suggestions and autocompletion for code, as well as generate entire functions and classes based on natural language descriptions. This post also highlights some of the benefits seen by Accenture when using CodeWhisperer, such as increased productivity and the ability to reduce the time and effort required for common coding tasks. You can activate CodeWhisperer in your favorite IDE today. CodeWhisperer automatically generates suggestions based on your existing code and comments. Visit Amazon CodeWhisperer to get started.


About the Authors

Balakrishnan Viswanathan is an AI/ML Solution Architect at Accenture. Collaborating with AABG, he devises and executes cutting-edge cloud-based strategies to tackle various AI/ML related challenges. Bala is passionate about both cooking and Photoshop.

Shikhar Kwatra is an AI/ML specialist solutions architect at Amazon Web Services, working with a leading Global System Integrator. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partner in building strategic industry solutions on AWS. Shikhar enjoys playing guitar, composing music, and practicing mindfulness in his spare time.

Ankur Desai is a Principal Product Manager within the AWS AI Services team.

Nino Leenus is an AI Consultant at Accenture. She has expertise in developing end-to-end machine learning solutions and deploying them in the cloud. She is curious about the latest tools and technologies in the MLOps field. She loves traveling and trekking.

Read More

Deploy a predictive maintenance solution for airport baggage handling systems with Amazon Lookout for Equipment

This is a guest post co-written with Moulham Zahabi from Matarat.

Probably everyone has checked their baggage when flying, and waited anxiously for their bags to appear at the carousel. Successful and timely delivery of your bags depends on a massive infrastructure called the baggage handling system (BHS). This infrastructure is one of the key functions of successful airport operations. Successfully handling baggage and cargo for departing and arriving flights is critical to ensure customer satisfaction and deliver airport operational excellence. This function is heavily dependent on the continuous operation of the BHS and the effectiveness of maintenance operations. As the lifeline of an airport, a BHS is a linear asset that can exceed 34,000 meters in length (for a single airport), handling over 70 million bags annually, making it one of the most complex automated systems and a vital component of airport operations.

Unplanned downtime of a baggage handling system, whether it be a conveyor belt, carousel, or sorter unit, can disrupt airport operations. Such disruption is bound to create an unpleasant passenger experience and possibly impose penalties on airport service providers.

The prevalent challenge with maintaining a baggage handling system is how to operate an integrated system of over 7,000 assets and over a million setpoints continuously. These systems also handle millions of bags in different shapes and sizes. It’s safe to assume that baggage handling systems are prone to error. Because the elements function in a closed loop, if one element breaks down, it affects the entire line. Traditional maintenance activities rely on a sizable workforce distributed across key locations along the BHS, dispatched by operators in the event of an operational fault. Maintenance teams also rely heavily on supplier recommendations to schedule downtime for preventive maintenance. Determining whether preventive maintenance activities are properly implemented, or monitoring the performance of this type of asset, can be unreliable and doesn’t reduce the risk of unanticipated downtime.

Spare parts management is an additional challenge as lead times are increasing due to global supply chain disruptions, yet inventory replenishment decisions are based on historical trends. In addition, these trends don’t incorporate the volatile dynamic environment of operating BHS assets as they age. To address these challenges, a seismic shift needs to happen in maintenance strategies—moving from a reactive to proactive mindset. This shift requires operators to utilize the latest technology to streamline maintenance activities, optimize operations, and minimize operating expenses.

In this post, we describe how AWS Partner Airis Solutions used Amazon Lookout for Equipment, AWS Internet of Things (IoT) services, and CloudRail sensor technologies to provide a state-of-the-art solution to address these challenges.

Baggage handling system overview

The following diagram and table illustrate the measurements taken across a typical carousel in King Khalid International Airport in Riyadh.

Carousel diagram overview

Data is collected at the different locations illustrated in the diagram.

The sensors, the business value they provide, the datasets they produce, and their locations on the carousel are as follows:

  • IO-Link speed sensor. Business value: homogeneous carousel speed. Datasets: PDV1 (1 per min). Location: C.
  • Vibration sensor with integrated temperature sensor. Business value: detection of loose screws, shaft misalignment, bearing damage, and motor winding damage. Datasets: fatigue (v-RMS, m/s), impact (a-Peak, m/s²), friction (a-RMS, m/s²), temperature (°C), crest. Location: A and B.
  • Distance PEC sensor. Business value: baggage throughput. Datasets: distance (cm). Location: D.

The following images show the environment and monitoring equipment for the various measurements.

Vibration sensor mounted on one of the conveyor motors

Proximity probe measuring carousel speed

Line of sight of the baggage throughput counter (using a distance sensor)

Thermal image of one of the conveyor motors

Solution overview

The predictive maintenance system (PdMS) for baggage handling systems is a reference architecture that helps airport maintenance operators use data to reduce unplanned downtime. It contains building blocks to accelerate the development and deployment of connected sensors and services. The PdMS includes AWS services to securely manage the lifecycle of edge compute devices and BHS assets, cloud data ingestion, storage, machine learning (ML) inference models, and business logic to power proactive equipment maintenance in the cloud.

This architecture was built from lessons learned while working with airport operations over several years. The proposed solution was developed with the support of Northbay Solutions, an AWS Premier Partner, and can be deployed at airports of all sizes, scaling to thousands of connected devices, within 90 days.

The following architecture diagram exposes the underlying components used to build the predictive maintenance solution:

Solution architecture overview

We use the following services to assemble our architecture:

  • CloudRail.DMC is a software as a service (SaaS) solution by the German IoT expert CloudRail GmbH that manages fleets of globally distributed edge gateways. With this service, industrial sensors, smart meters, and OPC UA servers can be connected to an AWS data lake with just a few clicks.
  • AWS IoT Core lets you connect billions of IoT devices and route trillions of messages to AWS services without managing infrastructure. It securely transmits messages to and from all of your IoT devices and applications with low latency and high throughput. We use AWS IoT Core to connect to the CloudRail sensors and forward their measurements to the AWS Cloud.
  • AWS IoT Analytics is a fully managed service that makes it easy to run and operationalize sophisticated analytics on massive volumes of IoT data without having to worry about the cost and complexity typically required to build an IoT analytics platform. It’s an easy way to run analytics on IoT data to gain accurate insights.
  • Amazon Lookout for Equipment analyzes data from equipment sensors to create an ML model automatically for your equipment based on asset-specific data—no data science skills necessary. Lookout for Equipment analyzes incoming sensor data in real time and accurately identifies early warning signals that could lead to unexpected downtime.
  • Amazon QuickSight allows everyone in the organization to understand the data by asking questions in natural language, visualizing information through interactive dashboards, and automatically looking for patterns and outliers powered by ML.

As illustrated in the following diagram, this architecture enables sensor data to flow to operational insights.

Sensor data flow

Data points are collected using IO-Link sensors: IO-Link is a standardized interface that enables seamless communication from the control level of an industrial asset (in our case, the baggage handling system) down to the sensor level. This protocol is used to feed sensor data into a CloudRail edge gateway, which loads it into AWS IoT Core. The latter then provides equipment data to ML models to identify operational and equipment issues, which can be used to determine the optimal timing for asset maintenance or replacement without incurring unnecessary costs.
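
For illustration only, the snippet below publishes a single measurement to an MQTT topic on AWS IoT Core with boto3. In the actual solution the CloudRail gateway handles this step; the topic name, payload fields, and region are hypothetical.

```python
import json
import boto3

# Publish one vibration/temperature reading to AWS IoT Core (illustrative;
# in production the CloudRail edge gateway forwards the IO-Link data).
iot_data = boto3.client("iot-data", region_name="eu-west-1")

measurement = {
    "asset_id": "carousel-07-motor-A",
    "timestamp": "2023-05-10T08:15:00Z",
    "temperature_c": 33.4,
    "fatigue_v_rms": 4.2,
    "impact_a_peak": 1.8,
}

iot_data.publish(
    topic="bhs/carousel-07/motor-A/measurements",  # hypothetical topic
    qos=1,
    payload=json.dumps(measurement),
)
```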

Data collection

Retrofitting existing assets and their control systems to the cloud remains challenging for equipment operators. Adding secondary sensors provides a fast and secure way to acquire the necessary data without interfering with existing systems. Therefore, it’s easier, faster, and non-invasive compared to connecting directly to a machine’s PLCs. Additionally, retrofitted sensors can be selected to precisely measure the data points required for specific failure modes.

With CloudRail, every industrial IO-Link sensor can be connected to AWS services like AWS IoT Core, AWS IoT SiteWise, or AWS IoT Greengrass within a few seconds through a cloud-based device management portal (CloudRail.DMC). This enables IoT experts to work from centralized locations and onboard physical systems that are globally distributed. The solution solves the challenges of data connectivity for predictive maintenance systems through an easy plug-and-play mechanism.

The gateway acts as the Industrial Demilitarized Zone (IDMZ) between the equipment (OT) and the cloud service (IT). Through an integrated fleet management application, CloudRail ensures that the latest security patches are rolled out automatically to thousands of installations.

The following image shows an IO-Link sensor and the CloudRail edge gateway (in orange):

IO Link sensor and edge gateway

Training an anomaly detection model

Organizations from most industrial segments see modern maintenance strategies moving away from the run-to-failure, reactive approaches and progressing towards more predictive methods. However, moving to a condition-based or predictive maintenance approach requires data collected from sensors installed throughout facilities. Using historical data captured by these sensors in conjunction with analytics helps identify precursors to equipment failures, which allows maintenance personnel to act accordingly before breakdown.

Predictive maintenance systems rely on the capability to identify when failures could occur. Equipment OEMs usually provide datasheets for their equipment and recommend monitoring certain operational metrics based on near-perfect conditions. However, these conditions are rarely realistic because of the natural wear of the asset, the environmental conditions it operates in, its past maintenance history, or just the way you need to operate it to achieve your business outcomes. For instance, two identical motors (make, model, production date) were installed in the same carousel for this proof of concept. These motors operated at different temperature ranges due to different weather exposure (one part of the conveyor belt on the inside and the other outside of the airport terminal).

Motor 1 operated at temperatures ranging from 32–35°C. Vibration velocity RMS can change due to motor fatigue (for example, alignment errors or imbalance problems). As shown in the following figure, this motor shows fatigue levels between 2–6, with some peaks at 9.

Motor 1

Motor 2 operated in a cooler environment, where the temperature ranged between 20–25°C. In this context, motor 2 shows fatigue levels between 4–8, with some peaks at 10:

Motor 2

Most ML approaches expect very specific domain knowledge and information (often difficult to obtain) that must be extracted from the way you operate and maintain each asset (for example, failure degradation patterns). This work needs to be performed each time you want to monitor a new asset, or if the asset conditions change significantly (such as when you replace a part). This means that a great model delivered at the prototyping phase will likely see a performance hit when rolled out on other assets, drastically reducing the accuracy of the system and in the end, losing the end-users’ confidence. This may also cause many false positives, and you would need the skills necessary to find your valid signals in all the noise.

Lookout for Equipment only analyzes your time series data to learn the normal relationships between your signals. Then, when these relationships start to deviate from the normal operating conditions (captured at training time), the service will flag the anomaly. We found that strictly using historical data for each asset lets you focus on technologies that can learn the operating conditions that will be unique to a given asset in the very environment it’s operating in. This lets you deliver predictions supporting root cause analysis and predictive maintenance practices at a granular, per-asset level and at a macro level (by assembling the appropriate dashboard to let you get an overview of multiple assets at once). This is the approach we took and the reason we decided to use Lookout for Equipment.

Training strategy: Addressing the cold start challenge

The BHS we targeted wasn’t instrumented at first. We installed CloudRail sensors to start collecting new measurements from our system, but this meant we only had a limited historical depth to train our ML model. We addressed the cold start challenge in this case by recognizing that we are building a continuously improving system. After the sensors were installed, we collected an hour of data and duplicated this information to start using Lookout for Equipment as soon as possible and test our overall pipeline.
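
As an illustration of this bootstrapping trick, the sketch below repeats the first hour of measurements to build a longer synthetic training window; the file names and the 24-copy window are assumptions for the example.

```python
import pandas as pd

# Duplicate the first hour of sensor data to create a longer bootstrap
# dataset so the end-to-end pipeline can be exercised early on.
hour_df = pd.read_csv("first_hour_measurements.csv", parse_dates=["timestamp"])

copies = []
for i in range(24):  # repeat the hour 24 times to simulate a full day
    shifted = hour_df.copy()
    shifted["timestamp"] = shifted["timestamp"] + pd.Timedelta(hours=i)
    copies.append(shifted)

pd.concat(copies, ignore_index=True).to_csv("bootstrap_training_data.csv", index=False)
```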

As expected, the first results were quite unstable because the ML model was exposed to a very small period of operations. This meant that any new behavior not seen during the first hour would be flagged. When looking at the top-ranking sensors, the temperature on one of the motors seemed to be the main suspect (T2_MUC_ES_MTRL_TMP in orange in the following figure). Because the initial data capture was very narrow (1 hour), the main change over the course of the day came from the temperature values (which is consistent with the environmental conditions at that time).

Anomaly alerts

When matching this with the environmental conditions around this specific conveyor belt, we confirmed that the outside temperature increased severely, which, in turn, increased the temperature measured by this sensor. In this case, after the new data (accounting for the outside temperature increase) is incorporated into the training dataset, it will be part of the normal behavior as captured by Lookout for Equipment and similar behavior in the future will be less likely to raise any events.

After 5 days, the model was retrained and the false positive rates immediately fell drastically:

Retrained model

Although this cold start problem was an initial obstacle to obtaining actionable insights, we used it as an opportunity to build a retraining mechanism that the end-user can trigger easily. A month into the experimentation, we trained a new model by duplicating a month’s worth of sensor data into 3 months. This continued to reduce the false positive rates as the model was exposed to a broader set of conditions. A similar false positive rate drop happened after this retraining: the conditions modeled by the system were closer to what users experience in real life. After 3 months, we finally had a dataset that we could use without this duplication trick.

From now on, we will launch a retraining every 3 months and, as soon as possible, will use up to 1 year of data to account for the environmental condition seasonality. When deploying this system on other assets, we will be able to reuse this automated process and use the initial training to validate our sensor data pipeline.

After training, we deployed the model and started sending live data to Lookout for Equipment. Lookout for Equipment lets you configure a scheduler that wakes up regularly (for example, every hour) to send fresh data to the trained model and collect the results.
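
A minimal sketch of such a scheduler configuration with boto3 is shown below; the model name, bucket, prefixes, and role ARN are hypothetical placeholders.

```python
import boto3

# Configure an hourly inference scheduler for a trained Lookout for Equipment
# model (all resource names below are hypothetical).
lookout = boto3.client("lookoutequipment")

lookout.create_inference_scheduler(
    ModelName="bhs-carousel-07-model",
    InferenceSchedulerName="bhs-carousel-07-hourly",
    DataUploadFrequency="PT1H",          # wake up every hour
    DataDelayOffsetInMinutes=5,          # give new data time to land in S3
    DataInputConfiguration={
        "S3InputConfiguration": {
            "Bucket": "bhs-sensor-data",
            "Prefix": "inference/input/",
        }
    },
    DataOutputConfiguration={
        "S3OutputConfiguration": {
            "Bucket": "bhs-sensor-data",
            "Prefix": "inference/output/",
        }
    },
    RoleArn="arn:aws:iam::123456789012:role/LookoutEquipmentAccess",
)
```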

Now that we know how to train, improve, and deploy a model, let’s look at the operational dashboards implemented for the end-users.

Data visualization and insights

End-users need a way to extract more value from their operational data to improve their asset utilization. With QuickSight, we connected the dashboard to the raw measurement data provided by our IoT system, allowing users to compare and contrast key pieces of equipment on a given BHS.

In the following dashboard, users can check the key sensors used to monitor the condition of the BHS and obtain period-over-period metrics changes.

BHS condition

In the preceding plot, users can visualize any unexpected imbalance in the measurements for each motor (left and right plots for temperature, fatigue, vibration, friction, and impact). At the bottom, key performance indicators are summarized, with forecasts and period-over-period trends called out.

End-users can access information for the following purposes:

  • View historical data in intervals of 2 hours up to 24 hours.
  • Extract raw data via CSV format for external integration.
  • Visualize asset performance over a set period of time.
  • Produce insights for operational planning and improve asset utilization.
  • Perform correlation analysis. In the following plots, the user can visualize several measurements (such as motor fatigue vs. temperature, or baggage throughput vs. carousel speed) and use this dashboard to better inform the next best maintenance action. A minimal sketch of this kind of analysis follows the figures below.

Motors fatigue vs. temperature

Baggage throughput
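
For reference, the correlation analysis called out above can also be reproduced offline with a few lines of pandas; the column and file names here are assumed for the example.

```python
import pandas as pd

# Compare motor fatigue against temperature and baggage throughput against
# carousel speed using simple Pearson correlations (column names assumed).
df = pd.read_csv("carousel_measurements.csv", parse_dates=["timestamp"])

print("fatigue vs. temperature:", df["fatigue_v_rms"].corr(df["temperature_c"]))
print("throughput vs. speed:", df["baggage_throughput"].corr(df["carousel_speed"]))
```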

Eliminating noise from the data

After a few weeks, we noticed that Lookout for Equipment was emitting some events thought to be false positives.

Speed of carousel belt

When analyzing these events, we discovered irregular drops in the speed of the carousel motor.

Speed of the carousel motor

We met with the maintenance team and they informed us these stops were either emergency stops or planned downtime maintenance activities. With this information, we labeled the emergency stops as anomalies and fed them to Lookout for Equipment, while the planned downtimes were considered normal behavior for this carousel.

Understanding such scenarios, where abnormal data can be influenced by controlled external actions, is critical to improving the accuracy of the anomaly detection model over time.
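
As a sketch of how these labels can be fed back into the service, the snippet below points a new model version at labeled emergency-stop windows stored in S3 (one start/end timestamp pair per labeled window). The dataset, bucket, time range, and role ARN are hypothetical.

```python
from datetime import datetime
import boto3

# Train a new model version that takes the labeled emergency-stop windows
# into account (all names below are hypothetical).
lookout = boto3.client("lookoutequipment")

lookout.create_model(
    ModelName="bhs-carousel-07-model-v2",
    DatasetName="bhs-carousel-07-dataset",
    TrainingDataStartTime=datetime(2023, 3, 1),
    TrainingDataEndTime=datetime(2023, 5, 31, 23, 59, 59),
    LabelsInputConfiguration={
        "S3InputConfiguration": {
            "Bucket": "bhs-sensor-data",
            "Prefix": "labels/emergency-stops/",
        }
    },
    RoleArn="arn:aws:iam::123456789012:role/LookoutEquipmentAccess",
)
```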

Smoke testing

A few hours after retraining the model, with virtually no anomalies being reported, our team physically stressed the assets, and the system immediately detected it. This is a common request from users because they need to familiarize themselves with the system and how it reacts.

Smoke testing - Anomaly alerts

We built our dashboard to allow end-users to visualize historical anomalies over an unlimited period. Using a business intelligence service lets them organize their data at will, and we have found that bar charts over a 24-hour period or pie charts are the best way to get a good view of the condition of the BHS. In addition to the dashboards that users can view whenever they need, we set up automated alerts sent to a designated email address and via text message.

Extracting deeper insights from anomaly detection models

In the future, we intend to extract deeper insights from the anomaly detection models trained with Lookout for Equipment. We will continue to use QuickSight to build an expanded set of widgets. For instance, we have found that the data visualization widgets exposed in the GitHub samples for Lookout for Equipment allow us to extract even more insights from the raw outputs of our models.

Widget examples

Results

Reactive maintenance in baggage handling systems translates to the following:

  • Lower passenger satisfaction due to lengthy wait times or damaged baggage
  • Lower asset availability due to unplanned failures and inventory shortages of critical spare parts
  • Higher operating expenses due to rising inventory levels in addition to higher maintenance costs

Evolving your maintenance strategy to incorporate reliable, predictive analytics into the cycle of decision-making aims to improve asset operation and help avoid forced shutdowns.

From reactive to proactive

The monitoring equipment was installed locally in 1 day and configured completely remotely by IoT experts. The cloud architecture described in the solution overview was then successfully deployed within 90 days. A fast implementation time demonstrates the benefits to the end-user, quickly leading to a shift in maintenance strategy from human-based, reactive (fixing breakdowns) to machine-based, data-driven, proactive (preventing downtime).

Conclusion

The cooperation between Airis, CloudRail, Northbay Solutions, and AWS led to a new achievement at King Khalid International Airport (see the press release for more details). As part of its digital transformation strategy, Riyadh Airport plans further deployments to cover other electro-mechanical systems such as passenger boarding bridges and HVAC systems.

If you have comments about this post, please submit them in the comments section. If you have questions about this solution or its implementation, please start a new thread on re:Post, where AWS experts and the broader community can support you.


About the authors

Moulham Zahabi is an aviation specialist with over 11 years of experience in designing and managing aviation projects, and managing critical airport assets in the GCC region. He is also one of the co-founders of Airis-Solutions.ai, which aims to lead the aviation industry’s digital transformation through innovative AI/ML solutions for airports and logistical centers. Today, Moulham is heading the Asset Management Directorate in the Saudi Civil Aviation Holding Company (Matarat).

Fauzan Khan is a Senior Solutions Architect working with public sector customers, providing guidance to design, deploy, and manage their AWS workloads and architectures. Fauzan is passionate about helping customers adopt innovative cloud technologies in the area of HPC and AI/ML to address business challenges. Outside of work, Fauzan enjoys spending time in nature.

Michaël Hoarau is an AI/ML Specialist Solutions Architect at AWS who alternates between data scientist and machine learning architect, depending on the moment. He is passionate about bringing the AI/ML power to the shop floors of his industrial customers and has worked on a wide range of ML use cases, ranging from anomaly detection to predictive product quality or manufacturing optimization. He published a book on time series analysis in 2022 and regularly writes about this topic on LinkedIn and Medium. When not helping customers develop the next best machine learning experiences, he enjoys observing the stars, traveling, or playing the piano.

Read More