How The Barcode Registry detects counterfeit products using object detection and Amazon SageMaker

This is a guest post authored by Andrew Masek, Software Engineer at The Barcode Registry and Erik Quisling, CEO of The Barcode Registry.

Product counterfeiting is the single largest criminal enterprise in the world. Growing over 10,000% in the last two decades, sales of counterfeit goods now total $1.7 trillion per year worldwide, more than the trade in illegal drugs and human trafficking. Although traditional methods of counterfeit prevention like unique barcodes and product verification can be very effective, new machine learning (ML) technologies such as object detection seem very promising. With object detection, you can now snap a picture of a product and know almost instantly whether that product is likely to be legitimate or fraudulent.

The Barcode Registry (in conjunction with its partner Buyabarcode.com) is a full-service solution that helps customers prevent product fraud and counterfeiting. It does this by selling unique GS1-registered barcodes, verifying product ownership, and registering users’ products and barcodes in a comprehensive database. Their latest offering, which we discuss in this post, uses Amazon SageMaker to create object detection models to help instantly recognize counterfeit products.

Overview of solution

To use these object detection models, you first need to collect data to train them. Companies upload annotated pictures of their products to The Barcode Registry website. After this data is uploaded to Amazon Simple Storage Service (Amazon S3) and processed by AWS Lambda functions, you can use it to train a SageMaker object detection model. The model is hosted on a SageMaker endpoint, which the website connects to the end user.

There are three key steps The Barcode Registry uses to create a custom object detection model with SageMaker:

  1. Create a training script for SageMaker to run.
  2. Build a Docker container from the training script and upload it to Amazon ECR.
  3. Use the SageMaker console to train a model with the custom algorithm.

Product data

As a prerequisite, to train an object detection model you need an AWS account and training images: at least 100 high-quality (high-resolution, captured in multiple lighting conditions) pictures of your object. As with any ML model, high-quality data is paramount. To train an object detection model, we need images containing the relevant products as well as bounding boxes describing where the products are in the images, as shown in the following example.

To train an effective model, pictures of each of a brand’s products with different backgrounds and lighting conditions are needed—approximately 30–100 unique annotated images for each product.

After the images are uploaded to the web server, they’re uploaded to Amazon S3 using the AWS SDK for PHP. A Lambda event is triggered each time an image is uploaded. The function removes the Exif metadata from the images, which can sometimes cause them to appear rotated when they’re opened by the ML libraries later used to train the model. The associated bounding box data is stored in JSON files and uploaded to Amazon S3 to accompany the images.
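
The following is a minimal sketch of such an Exif-stripping Lambda function, assuming Pillow is packaged with the function or provided through a layer; the bucket handling and the processed/ output prefix are illustrative rather than The Barcode Registry's actual implementation.

import io
import boto3
from PIL import Image  # Pillow must be included in the deployment package or a layer

s3 = boto3.client('s3')

def handler(event, context):
    # Triggered by an S3 upload event
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    # Download the uploaded image
    obj = s3.get_object(Bucket=bucket, Key=key)
    image = Image.open(io.BytesIO(obj['Body'].read())).convert('RGB')

    # Copy only the pixel data into a new image, dropping Exif and other metadata
    clean = Image.new('RGB', image.size)
    clean.putdata(list(image.getdata()))

    # Write the cleaned image back to Amazon S3 under a separate prefix
    # so the upload trigger doesn't fire again on the processed object
    buffer = io.BytesIO()
    clean.save(buffer, format='JPEG')
    s3.put_object(Bucket=bucket, Key=f'processed/{key}', Body=buffer.getvalue())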

SageMaker for object detection models

SageMaker is a managed ML service that includes a variety of tools for building, training, and hosting models in the cloud. In particular, The Barcode Registry uses SageMaker for its object detection service because of SageMaker's reliable and scalable ML model training and hosting services. This means that many brands can have their own object detection models trained and hosted, and even if usage spikes unpredictably, there won't be any downtime.

The Barcode Registry uses custom Docker containers uploaded to Amazon Elastic Container Registry (Amazon ECR) in order to have more fine-grained control of the object detection algorithm employed for training and inference, as well as support for Multi Model Server (MMS). MMS is very important for the counterfeit detection use case because it allows multiple brands' models to be cost-effectively hosted on the same server. Alternatively, you can use the built-in object detection algorithm to quickly deploy standard models developed by AWS.

Train a custom object detection model with SageMaker

First, you need to add your object detection algorithm. In this case, upload a Docker container featuring scripts to train a Yolov5 object detection model to Amazon ECR:

  1. On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  2. Choose Create notebook instance.
  3. Enter a name for the notebook instance and under Permissions and encryption choose an AWS Identity and Access Management (IAM) role with the necessary permissions.
  4. Open the Git repositories menu.
  5. Select Clone a public Git repository to this notebook instance only and paste the following Git repository URL: https://github.com/portoaj/SageMakerObjectDetection
  6. Click Create notebook instance and wait about five minutes for the instance’s status to update from Pending to InService in the Notebook instance menu.
  7. Once the notebook is InService, select it and click Actions and Open Jupyter to launch the notebook instance in a new tab.
  8. Select the SageMakerObjectDetection directory and then click on sagemakerobjectdetection.ipynb to launch the Jupyter notebook.
  9. Select the conda_python3 kernel and click Set Kernel.
  10. Select the code cell and set the aws_account_id variable to your AWS Account ID.
  11. Click Run to begin the process of building a Docker container and uploading it to Amazon ECR. This process may take about 20 minutes to complete.
  12. Once the Docker container has been uploaded, return to the Notebook instances menu, select your instance, and click Actions and Stop to shut your notebook instance down.

After the algorithm is built and pushed to Amazon ECR, you can use it to train a model via the SageMaker console.

  1. On the SageMaker console, under Training in the navigation pane, choose Training jobs.
  2. Choose Create training job.
  3. Enter a name for the job and choose the AWS Identity and Access Management (IAM) role with the necessary permissions.
  4. For Algorithm source, select Your own algorithm container in ECR.
  5. For Container, enter the registry path.
  6. Under Resource configuration, choose a single ml.p2.xlarge instance, which should be sufficient for training a Yolov5 model.
  7. Specify Amazon S3 locations for your input data and output path, and configure any other settings, such as a VPC via Amazon Virtual Private Cloud (Amazon VPC) or Managed Spot Training.
  8. Choose Create training job.

You can track the model’s training progress on the SageMaker console.

Automated model training

The following diagram illustrates the automated model training workflow:

To make SageMaker start training the object detection model as soon as a user finishes uploading their data, the web server uses Amazon API Gateway to notify a Lambda function that the brand has finished and to begin a training job.
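
A Lambda function along the lines of the following sketch could start such a training job with the AWS SDK for Python (Boto3); the container image URI, role, S3 paths, and the brand_id payload field are placeholders rather than the production values.

import os
import boto3

sagemaker = boto3.client('sagemaker')

def handler(event, context):
    brand = event['brand_id']  # hypothetical field passed through API Gateway
    sagemaker.create_training_job(
        TrainingJobName=f'yolov5-{brand}',
        AlgorithmSpecification={
            'TrainingImage': os.environ['TRAINING_IMAGE_URI'],  # custom container in Amazon ECR
            'TrainingInputMode': 'File'
        },
        RoleArn=os.environ['SAGEMAKER_ROLE_ARN'],
        InputDataConfig=[{
            'ChannelName': 'training',
            'DataSource': {'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': f's3://training-data-bucket/{brand}/',  # placeholder bucket
                'S3DataDistributionType': 'FullyReplicated'
            }}
        }],
        OutputDataConfig={'S3OutputPath': 's3://model-output-bucket/'},  # placeholder bucket
        ResourceConfig={'InstanceType': 'ml.p2.xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 50},
        StoppingCondition={'MaxRuntimeInSeconds': 86400}
    )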

When a brand’s model is successfully trained, Amazon EventBridge calls a Lambda function that moves the trained model into the live endpoint’s S3 bucket, where it’s finally ready for inference. A newer alternative to using Amazon EventBridge to move models through the MLOps lifecycle that you should consider is SageMaker Pipelines.
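
As a rough sketch, the EventBridge-triggered function might look as follows; the event fields come from the SageMaker training job state change event, and the destination bucket name and key layout are placeholders.

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    detail = event['detail']
    if detail.get('TrainingJobStatus') != 'Completed':
        return

    # Location of the trained model artifact, e.g. s3://model-output-bucket/job-name/output/model.tar.gz
    artifact = detail['ModelArtifacts']['S3ModelArtifacts']
    source_bucket, source_key = artifact.replace('s3://', '').split('/', 1)

    # Copy the artifact into the bucket backing the live multi-model endpoint
    s3.copy_object(
        Bucket='live-endpoint-models-bucket',          # placeholder bucket name
        Key=source_key.split('/')[0] + '.tar.gz',      # one model archive per brand
        CopySource={'Bucket': source_bucket, 'Key': source_key}
    )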

Host the model for inference

The following diagram illustrates the inference workflow:

To use the trained models, SageMaker requires an inference model to be hosted by an endpoint. The endpoint is the server or array of servers that are used to actually host the inference model. Similar to the training container that we created, a Docker container for inference is hosted in Amazon ECR. The inference model uses that Docker container and takes the input image the user took with their phone, runs it through the trained object detection model, and outputs the result.

Again, The Barcode Registry uses custom Docker containers for the inference model to enable the use of Multi Model Server, but if only one model is needed, it can easily be hosted through the built-in object detection algorithm.
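
For completeness, here is a minimal sketch of how the website's backend could call such an endpoint with the SageMaker runtime API; on a multi-model endpoint, the TargetModel parameter selects the brand-specific model. The endpoint and model names are placeholders.

import boto3

runtime = boto3.client('sagemaker-runtime')

with open('product-photo.jpg', 'rb') as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName='counterfeit-detection-endpoint',  # placeholder endpoint name
    ContentType='application/x-image',
    Body=payload,
    TargetModel='brand-123.tar.gz'  # selects the brand's model on the multi-model endpoint
)
detections = response['Body'].read()
# detections contains the bounding boxes and confidence scores returned by the container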

Conclusion

The Barcode Registry (in conjunction with its partner Buyabarcode.com) uses AWS for its entire object detection pipeline. The web server reliably stores data in Amazon S3 and uses API Gateway and Lambda functions to connect the web server to the cloud. SageMaker readily trains and hosts ML models, which means a user can take a picture of a product on their phone and see if the product is a counterfeit. This post shows how to create and host an object detection model using SageMaker, as well as how to automate the process.

In testing, the model was able to achieve over 90% accuracy with a training set of 62 images and a testing set of 32 images, which is impressive for a model trained without any human intervention. To get started training object detection models yourself, check out the official documentation or learn how to deploy an object detection model to the edge using AWS IoT Greengrass.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Andrew Masek, Software Engineer at The Barcode Registry.

Erik Quisling, CEO of The Barcode Registry.


Build a cold start time series forecasting engine using AutoGluon

Whether you’re allocating resources more efficiently for web traffic, forecasting patient demand for staffing needs, or anticipating sales of a company’s products, forecasting is an essential tool across many businesses. One particular use case, known as cold start forecasting, builds forecasts for a time series that has little or no existing historical data, such as a new product that just entered the market in the retail industry. Traditional time series forecasting methods such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ES) rely heavily on historical time series of each individual product, and therefore aren’t effective for cold start forecasting.

In this post, we demonstrate how to build a cold start forecasting engine using AutoGluon AutoML for time series forecasting, an open-source Python package to automate machine learning (ML) on image, text, tabular, and time series data. AutoGluon provides an end-to-end automated machine learning (AutoML) pipeline for beginners to experienced ML developers, making it the most accurate and easy-to-use fully automated solution. We use the free Amazon SageMaker Studio Lab service for this demonstration.

Introduction to AutoGluon time series

AutoGluon is a leading open-source library for AutoML for text, image, and tabular data, allowing you to produce highly accurate models from raw data with just one line of code. Recently, the team has been working to extend these capabilities to time series data, and has developed an automated forecasting module that is publicly available on GitHub. The autogluon.forecasting module automatically processes raw time series data into the appropriate format, and then trains and tunes various state-of-the-art deep learning models to produce accurate forecasts. In this post, we demonstrate how to use autogluon.forecasting and apply it to cold start forecasting tasks.

Solution overview

Because AutoGluon is an open-source Python package, you can implement this solution locally on your laptop or on Amazon SageMaker Studio Lab. We walk through the following steps:

  1. Set up AutoGluon for Amazon SageMaker Studio Lab.
  2. Prepare the dataset.
  3. Define training parameters using AutoGluon.
  4. Train a cold start forecasting engine for time series forecasting.
  5. Visualize cold start forecasting predictions.

The key assumption of cold start forecasting is that items with similar characteristics should have similar time series trajectories, which is what allows cold start forecasting to make predictions on items without historical data, as illustrated in the following figure.

In our walkthrough, we use a synthetic dataset based on electricity consumption, which consists of the hourly time series for 370 items, each with an item_id from 0–369. Within this synthetic dataset, each item_id is also associated with a static feature (a feature that doesn’t change over time). We train a DeepAR model using AutoGluon to learn the typical behavior of similar items, and transfer such behavior to make predictions on new items (item_id 370–373) that don’t have historical time series data. Although we’re demonstrating the cold start forecasting approach with only one static feature, in practice, having informative and high-quality static features is the key for a good cold start forecast.

The following diagram provides a high-level overview of our solution. The open-source code is available on the GitHub repo.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Log in to your Amazon SageMaker Studio Lab account and set up the environment using the terminal:

cd sagemaker-studiolab-notebooks/ 
git clone https://github.com/whosivan/amazon-sagemaker-studio-lab-cold-start-forecasting-using-autogluon
conda env create -f autogluon.yml
conda activate autogluon
git clone https://github.com/yx1215/autogluon.git
cd autogluon/
git checkout --track origin/add_forecasting_predictor

These instructions should also work from your laptop if you don’t have access to Amazon SageMaker Studio Lab (we recommend installing Anaconda on your laptop first).

When you have the virtual environment fully set up, launch the notebook AutoGluon-cold-start-demo.ipynb and select the custom environment .conda-autogluon:Python kernel.

Prepare the target time series and item meta dataset

Download the following datasets to your notebook instance if they’re not included, and save them under the directory data/. You can find these datasets on our GitHub repo:

  • Test.csv.gz
  • coldStartTargetData.csv
  • itemMetaData.csv

Run the following snippet to load the target time series dataset into the kernel:

import pandas as pd  # the repo's util module is imported earlier in the notebook

zipLocalFilePath = "data/test.csv.gz"
localFilePath = "data/test.csv"
util.extract_gz(zipLocalFilePath, localFilePath)

tdf = pd.read_csv(localFilePath, dtype=object)
tdf['target_value'] = tdf['target_value'].astype('float')
tdf.head()

AutoGluon time series requires static features to be represented in numerical format. This can be achieved through applying LabelEncoder() on our static feature type, where we encode A=0, B=1, C=2, D=3 (see the following code). By default, AutoGluon infers the static feature to be either ordinal or categorical. You can also overwrite this by converting the static feature column to be the object/string data type for categorical features, or integer/float data type for ordinal features.

from sklearn.preprocessing import LabelEncoder

localItemMetaDataFilePath = "data/itemMetaData.csv"
imdf = pd.read_csv(localItemMetaDataFilePath, dtype=object)

# Encode the static feature numerically (A=0, B=1, C=2, D=3)
labelencoder = LabelEncoder()
imdf['type'] = labelencoder.fit_transform(imdf['type'])

# Split the item metadata into items with history and the cold start items
imdf_without_coldstart_item = imdf[imdf.item_id.isin(tdf.item_id.tolist())]
imdf_without_coldstart_item['type'] = imdf_without_coldstart_item['type'].astype(str)
imdf_without_coldstart_item.to_csv('data/itemMetaDatawithoutColdstart.csv', index=False)

imdf_with_coldstart_item = imdf[~imdf.item_id.isin(tdf.item_id.tolist())]
imdf_with_coldstart_item.to_csv('data/itemMetaDataOnlyColdstart.csv', index=False)

Set up and start AutoGluon model training

We need to specify save_path = ‘autogluon-coldstart-demo’ as the model artifact folder name (see the following code). We also set our eval_metric to mean absolute percentage error (‘MAPE’ for short) and define prediction_length as 24 hours. If not specified, AutoGluon by default produces probabilistic forecasts and scores them via the weighted quantile loss. We only look at the DeepAR model in our demo, because we know the DeepAR algorithm allows cold start forecasting by design. We set one of the DeepAR hyperparameters arbitrarily and pass it to the ForecastingPredictor().fit() call, which makes AutoGluon look only into the specified model. For a full list of tunable hyperparameters, refer to the gluonts.model.deepar package.

save_path = 'autogluon-coldstart-demo'
eval_metric = 'MAPE'
deepar_params = {
    "scaling": True
}

ag_predictor = ForecastingPredictor(path=save_path, eval_metric=eval_metric).fit(
    tdf,
    static_features=imdf_without_coldstart_item,
    prediction_length=24,  # how far into the future we wish to forecast
    index_column="item_id",
    target_column="target_value",
    time_column="timestamp",
    quantiles=[0.1, 0.5, 0.9],
    hyperparameters={"DeepAR": deepar_params},
)

The training takes 30–45 minutes. You can get the model summary by calling the following function:

ag_predictor.fit_summary()

Forecast on the cold start item

Now we're ready to generate forecasts for the cold start items. We recommend having at least five rows for each item_id. Therefore, for any item_id that has fewer than five observations, we fill in with NaNs. In our demo, both item_id 370 and 372 have zero observations, a pure cold start problem, whereas the other two have five target values.
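
A minimal sketch of this padding step follows, assuming the cold start frame uses the same item_id/timestamp/target_value columns as the training data; the fallback start timestamp for items with no observations is arbitrary.

import numpy as np
import pandas as pd

def pad_item(df, item_id, min_rows=5, freq='H'):
    # Add NaN target rows so that item_id has at least min_rows observations
    item = df[df.item_id == item_id]
    missing = min_rows - len(item)
    if missing <= 0:
        return df
    last = pd.to_datetime(item.timestamp).max() if not item.empty else pd.Timestamp('2014-12-31 23:00')
    pad = pd.DataFrame({
        'item_id': item_id,
        'timestamp': pd.date_range(last + pd.Timedelta(hours=1), periods=missing, freq=freq).astype(str),
        'target_value': np.nan,
    })
    return pd.concat([df, pad], ignore_index=True)

cstdf = pad_item(cstdf, '370')  # repeat for any other item_id with fewer than five rows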

Load in the cold start target time series dataset with the following code:

localColdStartDataFilePath = "data/coldStartTargetData.csv"
cstdf = pd.read_csv(localColdStartDataFilePath, dtype = object)
cstdf.head(20)

We feed the cold start target time series into our AutoGluon model, along with the item meta dataset for the cold start item_id:

cold_start_prediction = ag_predictor.predict(cstdf, static_features=imdf_with_coldstart_item)

Visualize the predictions

We can create a plotting function to generate a visualization of the cold start forecasts, as shown in the following graph.
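
A plotting helper could look like the following sketch; it assumes the prediction object can be indexed by item_id and yields a DataFrame with '0.1', '0.5', and '0.9' quantile columns, which may differ depending on your AutoGluon version.

import matplotlib.pyplot as plt

def plot_cold_start_forecast(prediction, item_id):
    # prediction[item_id] is assumed to be a DataFrame indexed by timestamp,
    # with one column per requested quantile
    forecast = prediction[item_id]
    plt.figure(figsize=(10, 4))
    plt.plot(forecast.index, forecast['0.5'], label='median forecast')
    plt.fill_between(forecast.index, forecast['0.1'], forecast['0.9'],
                     alpha=0.3, label='10%-90% interval')
    plt.title(f'Cold start forecast for item {item_id}')
    plt.xlabel('timestamp')
    plt.ylabel('target_value')
    plt.legend()
    plt.show()

plot_cold_start_forecast(cold_start_prediction, '370')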

Clean up

To optimize resource usage, consider stopping the runtime on Amazon SageMaker Studio Lab after you have fully explored the notebook.

Conclusion

In this post, we showed how to build a cold start forecasting engine using AutoGluon AutoML for time series data on Amazon SageMaker Studio Lab. For those wondering about the difference between Amazon Forecast and AutoGluon (time series): Amazon Forecast is a fully managed and supported service that uses machine learning (ML) to generate highly accurate forecasts without requiring any prior ML experience, whereas AutoGluon is a community-supported open-source project that incorporates the latest research contributions. We walked through an end-to-end example to demonstrate what AutoGluon for time series is capable of, and provided a dataset and use case.

AutoGluon for time series data is an open-source Python package, and we hope that this post, together with our code example, gives you a straightforward solution to tackle challenging cold start forecasting problems. You can access the entire example on our GitHub repo. Try it out, and let us know what you think!


About the Authors

Ivan Cui is a Data Scientist with AWS Professional Services, where he helps customers build and deploy solutions using machine learning on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, and healthcare. In his free time, he enjoys reading, spending time with his family, and maximizing his stock portfolio.

Jonas Mueller is a Senior Applied Scientist in the AI Research and Education group at AWS, where he develops new algorithms to improve deep learning and develop automated machine learning. Before joining AWS to democratize ML, he completed his PhD at the MIT Computer Science and Artificial Intelligence Lab. In his free time, he enjoys exploring mountains and the outdoors.

Wenming Ye is a Research Product Manager at AWS AI. He is passionate about helping researchers and enterprise customers rapidly scale their innovations through open-source and state-of-the-art machine learning technology. Wenming has diverse R&D experience from Microsoft Research, the SQL engineering team, and successful startups.


Enable the visually impaired to hear documents using Amazon Textract and Amazon Polly

At the 2021 AWS re:Invent conference in Las Vegas, we demoed Read For Me at the AWS Builders Fair—a website that helps the visually impaired hear documents.


Adaptive technology and accessibility features are often expensive, if they’re available at all. Audio books help the visually impaired read. Audio description makes movies accessible. But what do you do when the content isn’t already digitized?

This post focuses on the AWS AI services Amazon Textract and Amazon Polly, which empower those with impaired vision. Read For Me was co-developed by Jack Marchetti, who is visually impaired.

Solution overview

Through an event-driven, serverless architecture and a combination of multiple AI services, we can create natural-sounding audio files in multiple languages from a picture of a document, or any image with text. For example, a letter from the IRS, a holiday card from family, or even the opening titles to a film.

The following reference architecture, published in the AWS Architecture Center, shows the workflow of a user taking a picture with their phone and playing an MP3 of the content found within that document.

The workflow includes the following steps:

  1. Static content (HTML, CSS, JavaScript) is hosted on AWS Amplify.
  2. Temporary access is granted for anonymous users to backend services via an Amazon Cognito identity pool.
  3. The image files are stored in Amazon Simple Storage Service (Amazon S3).
  4. A user makes a POST request through Amazon API Gateway to the audio service, which proxies to an Express AWS Step Functions workflow.
  5. The Step Functions workflow includes the following steps:
    1. Amazon Textract extracts text from the image.
    2. Amazon Comprehend detects the language of the text.
    3. If the target language differs from the detected language, Amazon Translate translates to the target language.
    4. Amazon Polly creates an audio file as output using the text.
  6. The AWS Step Functions workflow creates an audio file as output and stores it in Amazon S3 in MP3 format.
  7. A pre-signed URL with the location of the audio file stored in Amazon S3 is sent back to the user’s browser through API Gateway. The user’s mobile device plays the audio file using the pre-signed URL.

In the following sections, we discuss the reasons for why we chose the specific services, architecture pattern, and service features for this solution.

AWS AI services

Several AI services are wired together to power Read For Me:

  • Amazon Textract identifies the text in the uploaded picture.
  • Amazon Comprehend determines the language.
  • If the user chooses a different spoken language than the language in the picture, we translate it using Amazon Translate.
  • Amazon Polly creates the MP3 file. We take advantage of the Amazon Polly neural engine, which creates a more natural, lifelike audio recording.

One of the main benefits of using these AI services is the ease of adoption with little or no core machine learning experience required. The services expose APIs that clients can invoke using SDKs made available in multiple programming languages, such as Python and Java.

With Read For Me, we wrote the underlying AWS Lambda functions in Python.

AWS SDK for Python (Boto3)

The AWS SDK for Python (Boto3) makes interacting with AWS services simple. For example, the following lines of Python code return the text found in the image or document you provide:

import boto3
client = boto3.client('textract')
response = client.detect_document_text(
Document={
'S3Object': {
'Bucket': 'bucket-name',
'Name': 's3-key'
}
})
#do something with the response
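
Similarly, creating the MP3 with the Amazon Polly neural engine is a single call; the voice and text below are illustrative placeholders, not the exact values Read For Me uses.

import boto3

polly = boto3.client('polly')

response = polly.synthesize_speech(
    Engine='neural',        # the neural engine produces more natural, lifelike audio
    OutputFormat='mp3',
    VoiceId='Joanna',       # any neural voice supported for the target language
    Text='Hello from Read For Me!'
)

with open('output.mp3', 'wb') as f:
    f.write(response['AudioStream'].read())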

All Python code is run within individual Lambda functions. There are no servers to provision and no infrastructure to maintain.

Architecture patterns

In this section, we discuss the different architecture patterns used in the solution.

Serverless

We implemented a serverless architecture for two main reasons: speed to build and cost. With no underlying hardware to maintain or infrastructure to deploy, we focused entirely on the business logic code and nothing else. This allowed us to get a functioning prototype up and running in a matter of days. If users aren’t actively uploading pictures and listening to recordings, nothing is running, and therefore nothing is incurring costs outside of storage. An S3 lifecycle management rule deletes uploaded images and MP3 files after 1 day, so storage costs are low.
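
Such a lifecycle rule can be defined once with Boto3 (or in the console); the bucket name here is a placeholder.

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='read-for-me-uploads',  # placeholder bucket name
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-uploads-after-one-day',
            'Filter': {'Prefix': ''},      # apply to every object in the bucket
            'Status': 'Enabled',
            'Expiration': {'Days': 1},     # delete uploaded images and MP3 files after 1 day
        }]
    }
)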

Synchronous workflow

When you’re building serverless workflows, it’s important to understand when a synchronous call makes more sense from the architecture and user experience than an asynchronous process. With Read For Me, we initially went down the asynchronous path and planned on using WebSockets to bi-directionally communicate with the front end. Our workflow would include a step to find the connection ID associated with the Step Functions workflow and upon completion, alert the front end. For more information about this process, refer to From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets.

We ultimately chose not to do this and used Express Step Functions workflows, which are synchronous. Users understand that processing an image won't be instant, but also know it won't take 30 seconds or a minute. We were in a space where a few seconds was satisfactory to the end user and didn't need the benefit of WebSockets. This simplified the workflow overall.

Express Step Functions workflow

The ability to break out your code into smaller, isolated functions allows for fine-grained control, easier maintenance, and the ability to scale more accurately. For instance, if we determined that the Lambda function that triggered Amazon Polly to create the audio file was running slower than the function that determined the language, we could vertically scale that function, adding more memory, without having to do so for the others. Similarly, you limit the blast radius of what your Lambda function can do or access when you limit its scope and reach.

One of the benefits of orchestrating your workflow with Step Functions is the ability to introduce decision flow logic without having to write any code.

Our Step Functions workflow isn’t complex. It’s linear until the translation step. If we don’t need to call a translation Lambda function, that’s less cost to us, and a faster experience for the user. We can use the visual designer on the Step Functions console to find the specific key in the input payload and, if it’s present, call one function over the other using JSONPath. For example, our payload includes a key called translate:

{
    "extracted_text": "hello world",
    "target_language": "es",
    "source_language": "en",
    "translate": true
}

Within the Step Functions visual designer, we find the translate key, and set up rules to match.

Headless architecture

Amplify hosts the front-end code. The front end is written in React and the source code is checked into AWS CodeCommit. Amplify solves a few problems for users trying to deploy and manage static websites. If you were doing this manually (using an S3 bucket set up for static website hosting and fronting that with Amazon CloudFront), you’d have to expire the cache yourself each time you did deployments. You’d also have to write up your own CI/CD pipeline. Amplify handles this for you.

This allows for a headless architecture, where front-end code is decoupled from the backend and each layer can be managed and scaled independently of the other.

Analyze ID

In the preceding section, we discussed the architecture patterns for processing the uploaded picture and creating an MP3 file from it. Having a document read back to you is a great first step, but what if you only want to know something specific without having the whole thing read back to you? For instance, you need to fill out a form online and provide your state ID or passport number, or perhaps its expiration date. You then have to take a picture of your ID and, while having it read back to you, wait for that specific part. Alternatively, you could use Analyze ID.

Analyze ID is a feature of Amazon Textract that enables you to query documents. Read For Me contains a drop-down menu where you can specifically ask for the expiration date, date of issue, or document number. You can use the same workflow to create an MP3 file that provides an answer to your specific question.
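
The following sketch shows what a call to the Amazon Textract AnalyzeID API might look like with Boto3; the bucket, key, and the exact response fields you pick out are assumptions for illustration.

import boto3

textract = boto3.client('textract')

response = textract.analyze_id(
    DocumentPages=[{
        'S3Object': {'Bucket': 'bucket-name', 'Name': 'drivers-license.jpg'}
    }]
)

# Pull out a single field, such as the expiration date, to feed into the audio workflow
for document in response['IdentityDocuments']:
    for field in document['IdentityDocumentFields']:
        if field['Type']['Text'] == 'EXPIRATION_DATE':
            print(field['ValueDetection']['Text'])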

You can demo the Analyze ID feature at readforme.io/analyze.

Additional Polly Features

  • Read For Me offers multiple neural voices utilizing different languages and dialects. Note that there are several other voices you can choose from, which we did not implement. When a new voice is available, an update to the front-end code and a Lambda function is all it takes to take advantage of it.
  • The Amazon Polly service also offers other options that we have yet to include in Read For Me, such as adjusting the speed of the voices and speech marks.

Conclusion

In this post, we discussed how to use numerous AWS services, including AI and serverless, to aid the visually impaired. You can learn more about the Read For Me project and use it by visiting readforme.io. You can also find Amazon Textract examples on the GitHub repo. To learn more about Analyze ID, check out Announcing support for extracting data from identity documents using Amazon Textract.

The source code for this project will be open-sourced and added to AWS’s public GitHub soon.


About the Authors

Jack Marchetti is a Senior Solutions architect at AWS. With a background in software engineering, Jack is primarily focused on helping customers implement serverless, event-driven architectures. He built his first distributed, cloud-based application in 2013 after attending the second AWS re:Invent conference and has been hooked ever since. Prior to AWS Jack spent the bulk of his career in the ad agency space building experiences for some of the largest brands in the world. Jack is legally blind and resides in Chicago with his wife Erin and cat Minou. He also is a screenwriter, and director with a primary focus on Christmas movies and horror. View Jack’s filmography at his IMDb page.

Alak Eswaradass is a Solutions Architect at AWS based in Chicago, Illinois. She is passionate about helping customers design cloud architectures utilizing AWS services to solve business challenges. She has a Master’s degree in computer science engineering. Before joining AWS, she worked for different healthcare organizations, and she has in-depth experience architecting complex systems, technology innovation, and research. She hangs out with her daughters and explores the outdoors in her free time.

Swagat Kulkarni is a Senior Solutions Architect at AWS and an AI/ML enthusiast. He is passionate about solving real-world problems for customers with cloud native services and machine learning. Outside of work, Swagat enjoys travel, reading and meditating.


Bundesliga Match Fact Set Piece Threat: Evaluating team performance in set pieces on AWS

The importance of set pieces in football (or soccer in the US) has been on the rise in recent years: now more than one quarter of all goals are scored via set pieces. Free kicks and corners generally create the most promising situations, and some professional teams have even hired specific coaches for those parts of the game.

In this post, we share how the Bundesliga Match Fact Set Piece Threat helps evaluate performance in set pieces. As teams look to capitalize more and more on these dead ball situations, Set Piece Threat will help the viewer understand how well teams are leveraging these situations. In addition, this post explains how AWS services can be used to compute these statistics in real time.

Bundesliga's Union Berlin is a great example of the relevance of set pieces. The team managed to rise from Bundesliga 2 to qualification for a European competition in just 2 years. They finished third in Bundesliga 2 during the 18/19 season, earning themselves a slot in the relegation playoffs to the Bundesliga. In that season, they scored 28 goals from open play, ranking just ninth in the league. However, they ranked second for goals scored through set pieces (16 goals).

Tellingly, in the first relegation playoff match against VfB Stuttgart, Union secured a 2:2 draw, scoring a header after a corner. And in the return match, Stuttgart was disallowed a free kick goal due to a passive offside, allowing Union to enter the Bundesliga with a 0:0 draw.

The relevance of set pieces for Union's success doesn't end there. Union finished their first two Bundesliga seasons in a strong eleventh and seventh place, ranking third and first in the number of set piece goals (scoring 15 goals from set pieces in both seasons). For comparison, FC Bayern München—the league champion—only managed to score 10 goals from set pieces in both seasons. The success that Union Berlin has had with their set pieces allowed them to secure seventh place in the 20/21 Bundesliga season, which meant qualification for the UEFA Europa Conference League, going from Bundesliga 2 to Europe just 2 years after having earned promotion. Unsurprisingly, in the deciding match, they scored one of their two goals after a corner. At the time of this writing, Union Berlin ranks fourth in the Bundesliga (matchday 20) and first in corner performance, a statistic we explain later.

Union Berlin’s path to Europe clearly demonstrates the influential role of offensive and defensive performance during set pieces. Until now however, it was difficult for fans and broadcasters to properly quantify this performance, unless they wanted to dissect massive tables on analytics websites. Bundesliga and AWS have worked together to illustrate the threat that a team produces and the threat that is produced by set pieces against the team, and came up with the new Bundesliga Match Fact: Set Piece Threat.

How does Set Piece Threat work?

To determine the threat a team poses with their set pieces, we take into account different facets of their set piece performance. It’s important to note that we only consider corners and free kicks as set pieces, and compute the threat for each category independently.

Facet 1: Outcome of a set piece: Goals, shots, or nothing

First, we consider the outcome of a set piece. That is, we observe if it results in a goal. However, the outcome is generally influenced by fine margins, such as a great save by the goalkeeper or a shot brushing the post instead of going in, so we also categorize the quality of a shot that results from the set piece. Shots are grouped into the following categories.

Category | Explanation
Goal | A successful shot that led to a goal
Outstanding | Shots that almost led to a goal, such as a shot at the post
Decent | Other noteworthy goal scenes
Average | The rest of the chances that would be included in a chances ratio, with relevant threat of a goal
None | No real goal threat; should not be considered a real chance, such as a header that barely touched the ball or a blocked shot
No shot | No shots taken at all

The above video shows examples of shot outcome categories in the following order: outstanding, decent, average, none.

Facet 2: Potential of a shot

Second, our algorithm considers the potential of a shot. This incorporates how likely it should have resulted in a goal, taking the actual performance of the shot-taker out of the equation. In other words, we quantify the goal potential of the situation in which the shot was taken. This is captured by the expected goal (xGoals) value of the shot. We remove not only the occurrence of luck or lack thereof, but also the quality of the strike or header.

Facet 3: Quantity of set pieces

Next, we consider the aspect of pure quantity of set pieces that a team gets. Our definition of Set Piece Threat measures the threat on a per-set-piece basis. Instead of summing up all outcomes and xGoal values of a team over the course of a season, the values are aggregated such that they represent the average threat per set piece. That way, the corner threat, for example, represents the team's danger for each corner and doesn't consider a team more dangerous simply because they have more corners than other teams (and therefore potentially more shots or goals).

Facet 4: Development over time

The last aspect to consider is the development of a team’s threat over time. Consider for example a team that scored three goals from corners in the first three matchdays but fails to deliver any considerable threat over the next 15 matchdays. This team should not be considered to pose a significant threat from corners on matchday 19, despite it already having scored three times, which may still be a good return. We account for this (positive or negative) development of a team’s set piece quality by assigning a discount to each set piece, depending on how long ago it occurred. In other words, a free kick that was taken 10 matchdays ago has less influence on the computed threat than one that was taken during the last or even current game.

Score: Per set piece aggregation

All four facets we’ve described are aggregated into two values for each team, one for corners and one for free kicks, which describe the danger that a corresponding set piece by that team would currently pose. The value is defined as the weighted average of the scores of each set piece, where the score of a set piece is defined as (0.7 * shot-outcome + 0.3 * xG-value) if the set piece resulted in a shot and 0 otherwise. The shot-outcome is 1 if the team scored and lower for other outcomes, such as a shot that went wide, depending on its quality. The weight for each set piece is determined by how long ago it was taken, as described earlier. Overall, the values are defined between 0–1, where 1 is the perfect score.

Set piece threat

Next, the values for each team are compared to the league average. The exact formula is score(team)/avg_score(league) - 1. This value is what we call the Set Piece Threat value. A team has a threat value of 0 if it’s exactly as good as the league average. A value of -1 (or -100%) describes a team that poses no threat at all, and a value of +1 (+100%) describes a team that is twice as dangerous as the league average. With those values, we compute a ranking that orders the teams from 1–18 according to their offensive threat of corners and free kicks, respectively.

We use the same data and similar calculations to also compute a defensive threat that measures the defensive performance of a team with regard to how they defend set pieces. Now, instead of computing a score per own set piece, the algorithm computes a score per opponent set piece. Just like for the offensive threat, the score is compared to the league average, but the value is reversed: -score(team)/avg_score(league) + 1. This way, a threat of +1 (+100%) is achieved if a team allows opponents no shots at all, whereas a team with a defensive threat of -1 (-100%) is twice as susceptible to opponents' set pieces as the league average. Again, a team with a threat of 0 is as good as the league average.
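
Putting the pieces together, the scoring and normalization described above can be expressed in a few lines of Python; the recency weights in this sketch are purely illustrative, because the exact decay used by the Match Fact is not published.

def set_piece_score(shot_outcome, xg, had_shot):
    # Score of a single set piece: 0.7 * shot outcome + 0.3 * xGoals value, 0 if no shot resulted
    return 0.7 * shot_outcome + 0.3 * xg if had_shot else 0.0

def team_score(set_pieces, decay=0.95):
    # Weighted average over a team's set pieces; older set pieces get a smaller weight
    weights = [decay ** sp['matchdays_ago'] for sp in set_pieces]
    scores = [set_piece_score(sp['shot_outcome'], sp['xg'], sp['had_shot']) for sp in set_pieces]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def offensive_threat(team_avg, league_avg):
    return team_avg / league_avg - 1          # 0 = league average, +1 = twice as dangerous

def defensive_threat(conceded_avg, league_avg_conceded):
    return 1 - conceded_avg / league_avg_conceded   # +1 = opponents get no shots at all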

Set Piece Threat findings

An important aspect of Set Piece Threat is that we focus on an estimation of threat instead of goals scored and conceded via set pieces. If we take SC Freiburg and Union Berlin at matchday 21 as an example, over the course of this season Freiburg has scored seven goals via corners, compared to four from Union Berlin. Our threat ranking still rates both teams as roughly equal. In fact, we predict a corner by Freiburg (Rank 3) to even be 7% less threatening than a corner by Union Berlin (Rank 1). The main reason for this is that Union Berlin created a similar number of great chances out of their corners, but failed to convert these chances into goals. Freiburg on the other hand was vastly more efficient with their chances. Such a discrepancy between chance quality and actual goals can happen in a high-variance sport like football.

The following graph shows Union Berlin’s set piece offensive corner ranking (blue) and score (red) from matchdays 6–21. At matchday 12, Union scored a goal from a corner and additionally had a great chance from a second corner that didn’t result in a goal but was perceived as a high threat by our algorithm. In addition, Union had a shot on target in five of seven corner kicks on matchday 12. Union immediately jumped in the ranking from twelfth to fifth place as a result of this, and the score value for Union increased as well as the league average. As Union saw more and more high threat chances in the later matchdays from corners, they step by step claimed first place of the corner threat ranking. The score is always relative to the current league average, meaning that Union’s threat at matchday 21 is 50% higher from corners than the average threat coming from all teams in the league.

Implementation and architecture

Bundesliga Match Facts are independently running AWS Fargate containers inside Amazon Elastic Container Service (Amazon ECS). Previous Bundesliga Match Facts consume raw event and positional data to calculate advanced statistics. This changes with the release of Set Piece Threat, which analyzes data produced by an existing Bundesliga Match Fact (xGoals) to calculate its rankings. Therefore, we created an architecture to exchange messages between different Bundesliga Match Facts during live matches in real time.

To guarantee the latest data is reflected in the set piece threat calculations, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK). This message broker service allows different Bundesliga Match Facts to send and receive the newest events and updates in real time. By consuming a match and Bundesliga Match Fact-specific topic from Kafka, we can receive the most up-to-date data from all systems involved while retaining the ability to replay and reprocess messages sent earlier.

The following diagram illustrates the solution architecture:

We introduced Amazon MSK to this project to generally replace all internal message passing for the Bundesliga Match Facts platform. It handles the injection of positional and event data, which can aggregate to over 3.6 million data points per match. With Amazon MSK, we can use the underlying persistent storage of messages, which allows us to replay games from any point in time. However, for Set Piece Threat, the focus lies on the specific use case of passing events produced by Bundesliga Match Facts to other Bundesliga Match Facts that are running in parallel.

To facilitate this, we distinguish between two types of Kafka topics: global and match-specific. First, each Bundesliga Match Fact has its own global topic, which handles all messages created by that Bundesliga Match Fact. In addition, there is a match-specific topic for each Bundesliga Match Fact and each match, which handles all messages created by that Bundesliga Match Fact for a specific match. When multiple live matches run in parallel, each message is first produced and sent to the Bundesliga Match Fact-specific global topic.

A dispatcher AWS Lambda function is subscribed to every Bundesliga Match Fact-specific global topic and has two tasks:

  1. Write the incoming data to a database provisioned through Amazon Relational Database Service (Amazon RDS).
  2. Redistribute the messages that can be consumed by other Bundesliga Match Facts to a Bundesliga Match Fact-specific topic.
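
A dispatcher handler along these lines is sketched below, assuming kafka-python for producing and an MSK event source mapping delivering the consumed records; authentication, error handling, the topic naming scheme, and the actual database writes are assumptions or omitted.

import base64
import json
from kafka import KafkaProducer  # kafka-python, packaged with the function

producer = KafkaProducer(
    bootstrap_servers=['broker-1:9092', 'broker-2:9092'],  # placeholder MSK brokers
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def handler(event, context):
    # The MSK event source mapping delivers batches of records per topic-partition
    for records in event['records'].values():
        for record in records:
            message = json.loads(base64.b64decode(record['value']))

            # Task 1: persist the Match Fact output to the Amazon RDS database (omitted here)
            # save_to_rds(message)

            # Task 2: redistribute the message to the match-specific topic
            match_topic = f"{message['match_fact']}-{message['match_id']}"  # hypothetical naming scheme
            producer.send(match_topic, message)
    producer.flush()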

The left side of the architecture diagram shows the different Bundesliga Match Facts running independently from each other for every match and producing messages to the global topic. The new Set Piece Threat Bundesliga Match Fact now can consume the latest xGoal values for each shot for a specific match (right side of the diagram) to immediately compute the threat produced by the set piece that resulted in one or more shots.

Summary

We’re excited about the launch of Set Piece Threat and the patterns commentators and fans will uncover using this brand-new insight. As teams look to capitalize more and more on these dead ball situations, Set Piece Threat will help the viewer understand which team is doing this successfully and which team still has some ground to cover, which adds additional suspense before each of these set piece situations. The new Bundesliga Match Fact is available to Bundesliga’s broadcasters to uncover new perspectives and stories of a match, and team rankings can be viewed at any time in the Bundesliga app.

We’re excited to learn what patterns you will uncover. Share your insights with us: @AWScloud on Twitter, with the hashtag #BundesligaMatchFacts.


About the Authors

Simon Rolfes played 288 Bundesliga games as a central midfielder, scored 41 goals and won 26 caps for Germany. Currently Rolfes serves as Sporting Director at Bayer 04 Leverkusen where he oversees and develops the pro player roster, the scouting department and the club’s youth development. Simon also writes weekly columns on Bundesliga.com about the latest Bundesliga Match Facts powered by AWS

Luuk Figdor is a Senior Sports Technology Specialist in the AWS Professional Services team. He works with players, clubs, leagues, and media companies such as Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.

Jan Bauer is a Cloud Application Architect at AWS Professional Services. His interests are serverless computing, machine learning, and everything that involves cloud computing. He works with clients across industries to help them be successful on their cloud journey.

Pascal Kühner is a Cloud Application Developer in the AWS Professional Services Team. He works with customers across industries to help them achieve their business outcomes via application development, DevOps, and infrastructure. He loves ball sports and in his spare time likes to play basketball and football.

Uwe Dick is a Data Scientist at Sportec Solutions AG. He works to enable Bundesliga clubs and media to optimize their performance using advanced stats and data—before, after, and during matches. In his spare time, he settles for less and just tries to last the full 90 minutes for his recreational football team.

Javier Poveda-Panter is a Data Scientist for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high quality user and fan experiences through machine learning and data science. He follows his passion for a broad range of sports, music and AI in his spare time.


Bundesliga Match Fact Skill: Quantifying football player qualities using machine learning on AWS

In football, as in many sports, discussions about individual players have always been part of the fun. “Who is the best scorer?” or “Who is the king of defenders?” are questions perennially debated by fans, and social media amplifies this debate. Just consider that Erling Haaland, Robert Lewandowski, and Thomas Müller alone have a combined 50 million followers on Instagram. Many fans are aware of the incredible statistics star players like Lewandowski and Haaland create, but stories like this are just the tip of the iceberg.

Consider that almost 600 players are under contract in the Bundesliga, and each team has their own champions—players that get introduced to bring a specific skill to bear in a match. Look for example at Michael Gregoritsch of FC Augsburg. As of this writing (matchday 21), he has scored five goals in the 21/22 season, not something that would make anybody mention him in a conversation about the great goal scorers. But let’s look closer: if you accumulate the expected goal (xGoals) values of all scoring chances Gregoritsch had this season, the figure you get is 1.7. This means he over-performed on his shots on goal by +194%, scoring 3.2 more goals than expected. In comparison, Lewandowski over-performed by only 1.6 goals (+7%). What a feat! Clearly Gregoritsch brings a special skill to Augsburg.

So how do we shed light on all the hidden stories about individual Bundesliga players, their skills, and impact on match outcomes? Enter the new Bundesliga Match Fact powered by AWS called Skill. Skill has been developed through in-depth analysis by the DFL and AWS to identify players with skills in four specific categories: initiator, finisher, ball winner, and sprinter. This post provides a deep dive into these four skills and discusses how they are implemented on AWS infrastructure.

Another interesting point is that until now, Bundesliga Match Facts have been developed independently of one another. Skill is the first Bundesliga Match Fact that combines the output of multiple Bundesliga Match Facts in real time using a streaming architecture built on Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Initiator

An initiator is a player who performs a high number of valuable first and second assists. To identify and quantify the value of those assists, we introduced the new metric xAssist. It’s calculated by tracking the last and second-last pass before a shot at goal, and assigning the respective xGoals value to those actions. A good initiator creates opportunities under challenging circumstances by successfully completing passes with a rate of high difficulty. To evaluate how hard it is to complete a given pass, we use our existing xPass model. In this metric, we purposely exclude crosses and free kicks to focus on players who generate scoring chances with their precise assists from open play.
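
Conceptually, the xAssist and xSecondAssist bookkeeping could look like the following sketch, where each shot event is assumed to carry its xGoals value and references to the last and second-last open-play passes that preceded it; the field names are hypothetical.

from collections import defaultdict

xassist = defaultdict(float)
xsecond_assist = defaultdict(float)

def credit_shot(shot):
    # shot is a hypothetical dict: {'xg': 0.31, 'last_pass': {...}, 'second_last_pass': {...}}
    last_pass = shot.get('last_pass')
    second_last_pass = shot.get('second_last_pass')

    # Crosses and free kicks are excluded on purpose (open play only)
    if last_pass and not last_pass['is_cross'] and not last_pass['is_free_kick']:
        xassist[last_pass['player']] += shot['xg']
    if second_last_pass and not second_last_pass['is_cross'] and not second_last_pass['is_free_kick']:
        xsecond_assist[second_last_pass['player']] += shot['xg']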

The skill score is calculated with the following formula:

Let’s look at the current Rank 1 initiator, Thomas Müller, as an example. He has collected an xAssist value of 9.23 as of this writing (matchday 21), meaning that his passes for the next players who shot at the goal have generated a total xGoal value of 9.23. The xAssist per 90 minutes ratio is 0.46. This can be calculated from his total playing time of the current season, which is remarkable—over 1,804 minutes of playing time. As a second assist, he generated a total value of 3.80, which translates in 0.19 second assists per 90 minutes. In total, 38 of his 58 first assists were difficult passes. And as a second assist, 11 of his 28 passes were also difficult passes. With these statistics, Thomas Müller has catapulted himself into first place in the initiator ranking. For comparison, the following table presents the values of the current top three.

Player | xAssist | xAssist per 90 | xSecondAssist | xSecondAssist per 90 | Difficult passes assisted (first) | Difficult passes assisted (second) | Final score
Thomas Müller – Rank 1 | 9.23 | 0.46 | 3.80 | 0.18 | 38 | 11 | 0.948
Serge Gnabry – Rank 2 | 3.94 | 0.25 | 2.54 | 0.16 | 15 | 11 | 0.516
Florian Wirtz – Rank 3 | 6.41 | 0.37 | 2.45 | 0.14 | 21 | 1 | 0.510

Finisher

A finisher is a player who is exceptionally good at scoring goals. He has a high shot efficiency and scores many goals relative to his playing time. The skill is based on actual goals scored and their difference to expected goals (xGoals). This allows us to evaluate whether chances are being well exploited. Let's assume that two strikers have the same number of goals. Are they equally strong? Or does one of them score from easy circumstances while the other one finishes in challenging situations? With shot efficiency, this can be answered: if the goals scored exceed the number of xGoals, a player is over-performing and is a more efficient shooter than average. Through the magnitude of this difference, we can quantify the extent to which a shooter's efficiency beats the average.

The skill score is calculated with the following formula:

For the finisher, we focus more on goals. The following table gives a closer look at the current top three.

Player | Goals | Goals per 90 | Shot efficiency | Final score
Robert Lewandowski – Rank 1 | 24 | 1.14 | 1.55 | 0.813
Erling Haaland – Rank 2 | 16 | 1.18 | 5.32 | 0.811
Patrik Schick – Rank 3 | 18 | 1.10 | 4.27 | 0.802

Robert Lewandowski has scored 24 goals this season, which puts him in first place. Although Haaland has a higher shot efficiency, it's still not enough for him to be ranked first, because we give a higher weighting to goals scored. This indicates that Lewandowski profits highly from both the quality and quantity of received assists, even though he scores exceptionally well. Patrik Schick has scored two more goals than Haaland, but has a lower goals per 90 minutes rate and a lower shot efficiency.

Sprinter

The sprinter has the physical ability to reach high top speeds, and does so more often than others. For this purpose, we evaluate average top speeds across all games of a player's current season and include the frequency of sprints per 90 minutes, among other metrics. A sprint is counted if a player runs at a minimum pace of 4.0 m/s for more than two seconds, and reaches a peak velocity of at least 6.3 m/s during this time. The duration of the sprint is characterized by the time between the first and last time the 6.3 m/s threshold is reached, and needs to be at least 1 second long to be acknowledged. A new sprint can only be considered to have occurred after the pace has fallen below the 4.0 m/s threshold again.
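
The sprint definition above translates naturally into a small routine over a player's speed trace; the sketch below assumes a list of (timestamp, speed) pairs in seconds and m/s, which is an illustrative data layout rather than the actual tracking format.

def count_sprints(samples, run_threshold=4.0, sprint_threshold=6.3):
    # samples: list of (timestamp_seconds, speed_m_per_s), ordered in time
    sprints = 0
    run = []  # current stretch where the player stays at or above 4.0 m/s
    for t, v in samples + [(float('inf'), 0.0)]:  # sentinel to flush the final run
        if v >= run_threshold:
            run.append((t, v))
            continue
        if run:
            fast = [ts for ts, speed in run if speed >= sprint_threshold]
            run_duration = run[-1][0] - run[0][0]
            # Count a sprint if the run lasts more than 2 s, the player reached 6.3 m/s,
            # and the time between the first and last 6.3 m/s reading is at least 1 s
            if run_duration > 2.0 and fast and (fast[-1] - fast[0]) >= 1.0:
                sprints += 1
            run = []  # a new sprint requires the pace to drop below 4.0 m/s first
    return sprints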

The skill score is calculated with the following formula:

The formula allows us to evaluate the many ways we can look at sprints by players, and go further than just looking at the top speeds these players produce. For example, Jeremiah St. Juste has the current season record of 36.65 km/h. However, if we look at the frequency of his sprints, we find he only sprints nine times on average per match! Alphonso Davies on the other hand might not be as fast as St. Juste (top speed 36.08 km/h), but performs a staggering 31 sprints per match! He sprints much more frequently, opening up space for his team on the pitch.

Ball winner

A player with this ability causes ball losses for the opposing team, both in total and relative to his playing time. He wins a high number of ground and aerial duels, and he steals or intercepts the ball often, establishing safe ball control himself and creating a possibility for his team to counterattack.

The skill score is calculated with the following formula:

As of this writing, the first-place ball winner is Danilo Soares. He has contested a total of 235 defensive duels and won 75 of them, defeating opponents in a face-off, for a win rate of about 32%. He has also intercepted 51 balls this season in his playing position as a defensive back, an average of 2.4 interceptions per 90 minutes.

Skill example

The Skill Bundesliga Match Fact enables us to unveil abilities and strengths of Bundesliga players. The Skill rankings put players in the spotlight that might have gone unnoticed before in rankings of conventional statistics like goals. For example, take a player like Michael Gregoritsch. Gregoritsch is a striker for FC Augsburg who placed sixth in the finisher ranking as of matchday 21. He has scored five goals so far, which wouldn’t put him at the top of any goal scoring ranking. However, he managed to do this in only 663 minutes played! One of these goals was the late equalizer in the 97th minute that helped Augsburg to avoid the away loss in Berlin.

Through the Skill Bundesliga Match Fact, we can also recognize various qualities of each player. One example of this is the Dortmund star Erling Haaland, who has also earned the badge of sprinter and finisher, and is currently placed sixth amongst Bundesliga sprinters.

All of these metrics are based on player movement data, goal-related data, ball action-related data, and pass-related data. We process this information in data pipelines and extract the necessary relevant statistics per skill, allowing us to calculate the development of all metrics in real time. Many of the aforementioned statistics are normalized by time on the pitch, allowing for the consideration of players who have little playing time but perform amazingly well when they play. The combinations and weights of the metrics are combined into a single score. The result is a ranking for all players on the four player skills. Players ranking in the top 10 receive a skill badge to help fans quickly identify the exceptional qualities they bring to their squads.

Implementation and architecture

Bundesliga Match Facts developed up to this point have been independent from one another, relying only on the ingestion of positional and event data and on their own calculations. This changes with the new Bundesliga Match Fact Skill, which calculates skill rankings based on data produced by existing Match Facts, such as xGoals or xPass. The outcome of a single event, perhaps an incredible goal with a low chance of going in, can have a significant impact on the finisher skill ranking. Therefore, we built an architecture that always provides the most up-to-date skill rankings whenever there is an update to the underlying data. To achieve real-time updates to the skills, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK) as a data streaming and messaging solution. This way, different Bundesliga Match Facts can communicate the latest events and updates in real time.

The underlying architecture for Skill consists of four main parts:

  • An Amazon Aurora Serverless cluster stores all outputs of existing match facts. This includes, for example, data for each pass (such as xPass, player, intended receiver) or shot (xGoal, player, goal) that has happened since the introduction of Bundesliga Match Facts.
  • A central AWS Lambda function writes the Bundesliga Match Fact outputs into the Aurora database and notifies other components that there has been an update.
  • A Lambda function for each individual skill computes the skill ranking. These functions run whenever new data is available for the calculation of the specific skill.
  • An Amazon MSK Kafka cluster serves as a central point of communication between all these components.

The following diagram illustrates this workflow. Each Bundesliga Match Fact immediately sends an event message to Kafka whenever there is an update to an event (such as an updated xGoals value for a shot event). The central dispatcher Lambda function is automatically triggered whenever a Bundesliga Match Fact sends such a message, and writes this data to the database. It then sends another message containing the new data back to Kafka, which serves as a trigger for the individual skill calculation functions. These functions use data from this trigger event, as well as the underlying Aurora cluster, to calculate and publish the newest skill rankings. For a more in-depth look at the use of Amazon MSK within this project, refer to the Set Piece Threat blog post.
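To sketch what the central dispatcher could look like (topic names, helper modules, and payload shapes below are assumptions, not the production code), a Lambda function triggered by Amazon MSK might process updates as follows.

    # Simplified dispatcher sketch: consume Match Fact updates from the MSK trigger,
    # persist them to Aurora, and re-publish a notification for the skill functions.
    # The helper imports and topic name are hypothetical.
    import base64
    import json

    from app.db import write_match_fact_output     # hypothetical Aurora writer
    from app.kafka import publish                   # hypothetical Kafka producer

    SKILL_UPDATE_TOPIC = "skill-updates"             # assumed topic name

    def handler(event, context):
        # Lambda delivers MSK records grouped by topic-partition, base64-encoded.
        for records in event["records"].values():
            for record in records:
                update = json.loads(base64.b64decode(record["value"]))
                write_match_fact_output(update)      # keep the Aurora cluster current
                publish(SKILL_UPDATE_TOPIC, update)  # trigger the per-skill functions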

Summary

In this post, we demonstrated how the new Bundesliga Match Fact Skill makes it possible to objectively compare Bundesliga players on four core player dimensions, building on and combining formerly independent Bundesliga Match Facts in real time. This allows commentators and fans alike to uncover previously unnoticed player abilities and shed light on the roles that various Bundesliga players fulfill.

The new Bundesliga Match Fact is the result of an in-depth analysis by the Bundesliga’s football experts and AWS data scientists to distill and categorize football player qualities based on objective performance data. Player skill badges are shown in the lineup and on player detail pages in the Bundesliga app. In the broadcast, player skills are provided to commentators through the data story finder and visually shown to fans at player substitution and when a player moves up into the respective top 10 ranking.

We hope that you enjoy this brand-new Bundesliga Match Fact and that it provides you with new insights into the game. To learn more about the partnership between AWS and Bundesliga, visit Bundesliga on AWS!


About the Authors

Simon Rolfes played 288 Bundesliga games as a central midfielder, scored 41 goals and won 26 caps for Germany. Currently Rolfes serves as Sporting Director at Bayer 04 Leverkusen where he oversees and develops the pro player roster, the scouting department and the club’s youth development. Simon also writes weekly columns on Bundesliga.com about the latest Bundesliga Match Facts powered by AWS

Luuk Figdor is a Senior Sports Technology Specialist in the AWS Professional Services team. He works with players, clubs, leagues, and media companies such as the Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.

Pascal Kühner is a Cloud Application Developer in the AWS Professional Services Team. He works with customers across industries to help them achieve their business outcomes via application development, DevOps, and infrastructure. He is very passionate about sports and enjoys playing basketball and football in his spare time.

Tareq Haschemi is a consultant within AWS Professional Services. His skills and areas of expertise include application development, data science, machine learning, and big data. Based in Hamburg, he supports customers in developing data-driven applications within the cloud. Prior to joining AWS, he was also a consultant in various industries such as aviation and telecommunications. He is passionate about enabling customers on their data/AI journey to the cloud.

Jakub Michalczyk is a Data Scientist at Sportec Solutions AG. Several years ago, he chose math studies over playing football, as he came to the conclusion, he was not good enough at the latter. Now he combines both these passions in his professional career by applying machine learning methods to gain a better insight into this beautiful game. In his spare time, he still enjoys playing seven-a-side football, watching crime movies, and listening to film music.

Javier Poveda-Panter is a Data Scientist for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high-quality user and fan experiences through machine learning and data science. He follows his passion for a broad range of sports, music, and AI in his spare time.

Read More

Announcing the AWS DeepRacer League 2022

Unleash the power of machine learning (ML) through hands-on learning and compete for prizes and glory. The AWS DeepRacer League is the world’s first global autonomous racing competition driven by reinforcement learning, bringing together students, professionals, and enthusiasts from almost every continent.

I’m Tomasz Ptak, a senior software engineer at Duco, an AWS Machine Learning Hero, an AWS DeepRacer competitor (named Breadcentric), a hobbyist baker, and a leader of the AWS Machine Learning Community on Slack, where we learn, race, and help each other start and grow our adventures in the cloud. It’s my pleasure to unveil the exciting details of the upcoming 2022 AWS DeepRacer League season.

What is AWS DeepRacer?

AWS DeepRacer is a 1/18th scale autonomous race car—but also much more. It’s a complete program that has helped over 175,000 individuals from over 700 businesses, educational institutions, and organizations begin their educational journey into machine learning through fun and rivalry.

AWS DeepRacer League 2022

Over 100,000 developers took part in the 2021 season of the AWS DeepRacer League. They got hands-on with reinforcement learning by participating in a workshop or training a model to achieve the fastest lap on the eight monthly global leaderboards. The monthly winners from these races joined us for the fully virtual 2021 Championship during the month of November, with the head-to-head live finale taking place on November 23, 2021.

Due to restrictions on global travel in 2021, the entire AWS DeepRacer League was hosted within the Virtual Circuit. Although it made for entertaining livestream events that brought racers together each month from around the globe, it’s my opinion that nothing compares to the excitement of bringing the community together to race in person in the same room. That’s why I’m so thrilled that in 2022, the AWS DeepRacer League Summit Circuit will be back with 17 stops across five continents. We will learn the exact number of races after the AWS Summits dates are announced.

New in 2022, the AWS Summit Circuit structure is fully revamped, adding new regional playoffs this fall on the AWS Twitch channel. The top two competitors at each AWS Summit event will advance to the regional playoff round, where they will compete with other developers from their region for a chance to win a trip to Las Vegas to compete in the Championship Cup at AWS re:Invent 2022. In total, 15 participants from the AWS Summit Circuit regional playoffs will advance to the AWS DeepRacer League Championships. More on that later in this post.

That’s not all: anyone can participate in the AWS DeepRacer League Virtual Circuit races, the first of which starts today, March 1, 2022. The Virtual Circuit consists of 8 months of racing in two divisions: Open and Pro. All racers start in the Open Division (except for the top three champions from 2021), competing in time trial races. At the end of each month, the top 10% of racers in the Open Division advance to the Pro Division and win a welcome kit including swag, stickers, and a “Pro Driver’s License” certificate. Once in the Pro Division, racers take part in car-to-bot racing to qualify for the Pro Finale, an entertaining live virtual race where you can tune in to watch, hear from the pros as they compete, and get exclusive offers and prizes. The top three each month advance to the AWS DeepRacer League Championships. Each Pro Finale race is recapped in an AWS DeepRacer community blog post and streamed live on Twitch.

The 10 best racers each month win an AWS DeepRacer car or cash equivalent in countries where the car is unavailable for shipping. Also, every month participants receive new digital rewards such as a new car shell revealed on the AWS DeepRacer console.

In March, Open Division racers will take on the Rogue Circuit, named in honor of the 2021 DeepRacer Championship Cup winner Sairam Naragoni (a.k.a. JPMC-Rogue-Hyderabad). The Rogue Circuit is a moderately short loop (48.06 m) featuring a classic kidney-style shape reminiscent of the AWS re:Invent 2019 track. Meanwhile, the Pros will attempt to navigate the more challenging Rogue Raceway this month, based on a brand-new real-world track near Sairam’s hometown of Hyderabad, India.

Also new for 2022 is AWS DeepRacer Student, a league dedicated to students. Over the last two seasons, participation by students from institutions such as Canberra Grammar School in Australia, NYCU in Taiwan, and the Hong Kong Institute of Vocational Education (IVE) has catapulted students to the top of the global league. This year, in addition to being able to race in the Open and Pro divisions, students get a league of their own with AWS DeepRacer Student. Students can access 10 hours of free training and compete each month for a chance to win prizes, scholarships, and even place as one of three wildcards in the global championships.

The 15 AWS Summit Circuit, 24 Virtual Circuit, and 3 student racers will be joined by the 2021 AWS DeepRacer League Champion Sairam Naragoni (racing as JPMC-Rogue-Hyderabad), 2021 AWS DeepRacer re:Invent Live Stream Champion Eric Morrison, and six enterprise wildcards to form a group of 50 finalists.

For the first time since 2019, the qualification will include a trip to re:Invent in Las Vegas, where racers can compete for the title of 2022 AWS DeepRacer League Champion, which comes with bragging rights, prizes, and glory.

So join the competition virtually now in either the Student League or the Open Division, join the AWS DeepRacer community, and plan a trip to a nearby AWS Summit to start your adventure. You can also learn more about how to create AWS DeepRacer events for your organization on the AWS DeepRacer enterprise events page. I hope to see you on the track and in the community.


About the Author

Tomasz Ptak is a senior software engineer at Duco, an AWS Machine Learning Hero, an AWS DeepRacer competitor (named Breadcentric), a hobbyist baker, and a leader of the AWS Machine Learning Community on Slack, where we learn, race, and help each other start and grow our adventures in the cloud.

Read More

Train 175+ billion parameter NLP models with model parallel additions and Hugging Face on Amazon SageMaker

The last few years have seen rapid development in the field of natural language processing (NLP). While hardware has improved, such as with the latest generation of accelerators from NVIDIA and Amazon, advanced machine learning (ML) practitioners still regularly run into issues scaling their large language models across multiple GPUs.

In this blog post, we briefly summarize the rise of large- and small-scale NLP models, primarily through the abstraction provided by Hugging Face and the modular backend of Amazon SageMaker. In particular, we highlight the launch of four additional features within the SageMaker model parallel library that unlock 175-billion-parameter NLP model pretraining and fine-tuning for customers.

We used this library on the SageMaker training platform and achieved a throughput of 32 samples per second on 120 ml.p4d.24xlarge instances with a 175-billion-parameter model. We anticipate that if we scaled up to 240 instances, the full model would take 25 days to train.

For more information about model parallelism, see the paper Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training.

You can also see the GPT2 notebook we used to generate these performance numbers on our GitHub repository.

To learn more about how to use the new features within SageMaker model parallel, refer to Extended Features of the SageMaker Model Parallel Library for PyTorch, and Use with the SageMaker Python SDK.

NLP on Amazon SageMaker – Hugging Face and model parallelism

If you’re new to Hugging Face and NLP, the biggest highlight you need to know is that applications using natural language processing (NLP) are starting to achieve human-level performance. This is largely driven by a learning mechanism called attention, which gave rise to a deep learning model called the transformer that is much more scalable than previous sequential deep learning methods. The now-famous BERT model was developed to capitalize on the transformer, and developed several useful NLP tactics along the way. Transformers, and the suite of models both within and outside of NLP that BERT inspired, are the primary engine behind your Google search results, your Google Translate results, and a host of new startups.

SageMaker and Hugging Face partnered to make this easier for customers than ever before. We’ve launched Hugging Face deep learning containers (DLCs) for you to train and host pre-trained models directly from Hugging Face’s repository of over 26,000 models. We’ve launched the SageMaker Training Compiler to speed up the runtime of your Hugging Face training loops by up to 50%. We’ve also integrated the Hugging Face flagship Transformers SDK with our distributed training libraries to make scaling out your NLP models easier than ever before.
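For example, a Hugging Face model can be fine-tuned on SageMaker through the Hugging Face estimator in the SageMaker Python SDK. The versions, script name, and hyperparameters below are placeholders for illustration, not an official recipe.

    from sagemaker.huggingface import HuggingFace

    # Placeholder versions and script; pick a supported Hugging Face DLC combination.
    huggingface_estimator = HuggingFace(
        entry_point="train.py",
        role="<your-sagemaker-execution-role>",
        instance_type="ml.p3.2xlarge",
        instance_count=1,
        transformers_version="4.17",
        pytorch_version="1.10",
        py_version="py38",
        hyperparameters={"model_name_or_path": "bert-base-uncased", "epochs": 3},
    )
    huggingface_estimator.fit({"train": "s3://<your-bucket>/train"})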

For more information about Hugging Face Transformer models on Amazon SageMaker, see Support for Hugging Face Transformer Models.

New features for large-scale NLP model training with the SageMaker model parallel library 

At AWS re:Invent 2020, SageMaker launched distributed libraries that provide the best performance on the cloud for training computer vision models like Mask-RCNN and NLP models like T5-3B. This is possible through enhanced communication primitives that are 20-40% faster than NCCL on AWS, and model distribution techniques that enable extremely large language models to scale across tens to hundreds to thousands of GPUs.

The SageMaker model parallel library (SMP) has always given you the ability to take your predefined NLP model in PyTorch, be that through Hugging Face or elsewhere, and partition that model onto multiple GPUs in your cluster. Said another way, SMP breaks up your model into smaller chunks so you don’t experience out of memory (OOM) errors. We’re pleased to add additional memory-saving techniques that are critical for large scale models, namely:

  • Tensor parallelism
  • Optimizer state sharding
  • Activation checkpointing
  • Activation offloading

You can combine these four features to utilize memory more efficiently and train the next generation of extreme-scale NLP models.

Distributed training and tensor parallelism

To understand tensor parallelism, it’s helpful to know that there are many kinds of distributed training, or parallelism. You’re probably already familiar with the most common type, data parallelism. The core of data parallelism works like this: you add an extra node to your cluster, such as going from one to two EC2 ML instances in your SageMaker estimator. Then you use a data parallel framework like Horovod, PyTorch Distributed Data Parallel, or the SageMaker data parallel library. This creates replicas of your model, one per accelerator, and handles sharding the data out to each node, along with bringing all the results together during the backpropagation step of your neural network. Think distributed gradient descent. Data parallelism is also popular within servers; you’re sharding data across all the GPUs, and occasionally CPUs, on all of your nodes. The following diagram illustrates data parallelism.
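In code, a minimal way to turn on data parallelism in a SageMaker estimator is through its distribution argument. This is a sketch; the instance types, versions, and paths below are placeholders.

    from sagemaker.pytorch import PyTorch

    # Sketch: enable the SageMaker data parallel library; values are placeholders.
    estimator = PyTorch(
        entry_point="train.py",
        role="<your-sagemaker-execution-role>",
        instance_type="ml.p4d.24xlarge",
        instance_count=2,      # adding instances scales out data parallelism
        framework_version="1.10",
        py_version="py38",
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit("s3://<your-bucket>/training-data")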

Model parallelism is slightly different. Instead of making copies of the same model, we split your model into pieces. Then we manage running it, so your data still flows through your neural network in exactly the same way mathematically, but different pieces of your model sit on different GPUs. If you’re using an ml.p3.8xlarge, you’ve got four NVIDIA V100s, so you’d probably want to shard your model into four pieces, one piece per GPU. If you jump up to two ml.p4d.24xlarge instances, that’s 16 A100s total in your cluster, so you might break your model into 16 pieces. This is also sometimes called pipeline parallelism, because the set of layers in the network is partitioned across GPUs and run in a pipelined manner to maximize GPU utilization. The following diagram illustrates model parallelism.

To make model parallelism happen at scale, we need a third type of distribution: tensor parallelism. Tensor parallelism takes the same concept one step further: we break apart the largest layers of your neural network and place parts of the layers themselves on different devices. This becomes relevant when you’re working with 175 billion parameters or more and trying to fit even a few records into RAM, along with parts of your model, to train that transformer. The following diagram illustrates tensor parallelism.

To enable tensor parallelism, set it within the smp options you pass to your estimator.
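The following sketch shows what those options can look like when passed to a SageMaker PyTorch estimator; the degrees, versions, and script name are illustrative, not prescriptive.

    from sagemaker.pytorch import PyTorch

    # Illustrative smp configuration; adjust the degrees to your model and cluster.
    smp_options = {
        "enabled": True,
        "parameters": {
            "pipeline_parallel_degree": 1,   # number of model partitions (pipeline parallelism)
            "tensor_parallel_degree": 4,     # split individual layers across 4 GPUs
            "ddp": True,                     # use distributed data parallel underneath
        },
    }

    mpi_options = {
        "enabled": True,
        "processes_per_host": 8,             # one process per GPU on an ml.p4d.24xlarge
    }

    smp_estimator = PyTorch(
        entry_point="train_gpt_simple.py",   # assumed training script name
        role="<your-sagemaker-execution-role>",
        instance_type="ml.p4d.24xlarge",
        instance_count=1,
        framework_version="1.10",
        py_version="py38",
        distribution={
            "smdistributed": {"modelparallel": smp_options},
            "mpi": mpi_options,
        },
    )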

In the preceding code, pipeline_parallel_degree describes how many segments your model should be sharded into, based on the pipeline parallelism we discussed earlier. Another word for this is partitions.

To enable tensor parallelism, set tensor_parallel_degree to your desired level. Make sure you pick a number equal to or smaller than the number of GPUs per instance, so no greater than 8 for ml.p4d.24xlarge machines. For additional script changes, refer to Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism.

The ddp parameter refers to distributed data parallel. You typically enable this if you’re using data parallelism or tensor parallelism, because the model parallelism library relies on DDP for these features.

Optimizer state sharding, activation offloading and checkpoints

If you have an extremely large model, you also need an extremely large optimizer state. Prepping your optimizer for SMP is straightforward: define it as usual in your training script and wrap it with the smp.DistributedOptimizer() object.

Make sure you enable this at the estimator by setting shard_optimizer_state to True in the smp_options you use to configure SMP:
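The following sketch shows both sides of this; the degrees, model, and optimizer choice are illustrative.

    # Estimator side (sketch): add the flag to the smp parameters shown earlier.
    smp_options = {
        "enabled": True,
        "parameters": {
            "pipeline_parallel_degree": 1,
            "tensor_parallel_degree": 4,
            "ddp": True,
            "shard_optimizer_state": True,   # shard optimizer state across data-parallel ranks
        },
    }

    # Training script side (sketch): wrap the model and optimizer with SMP.
    import smdistributed.modelparallel.torch as smp
    import torch
    import torch.nn as nn

    smp.init()                                            # reads the config passed by the estimator
    model = smp.DistributedModel(nn.Linear(1024, 1024))   # wrap your real model here
    optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters(), lr=2e-5))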

Similar to tensor and pipeline parallelism, SMP profiles your model and your world size (the total number of GPUs across all of your training nodes) to find the best placement strategies.

In deep learning, the intermediate layer outputs are also called activations, and they need to be stored during the forward pass because they’re used for gradient computation in the backward pass. In a large model, storing all these activations simultaneously in memory can create significant memory bottlenecks. To address this bottleneck, you can use activation checkpointing, the third new feature in the SageMaker model parallelism library. Activation checkpointing, or gradient checkpointing, is a technique that reduces memory usage by clearing the activations of certain layers and recomputing them during the backward pass. This effectively trades extra computation time for reduced memory usage.
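As a generic PyTorch illustration of the idea (not the SMP-specific API), a checkpointed block discards its intermediate activations in the forward pass and recomputes them in the backward pass.

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedBlock(nn.Module):
        def __init__(self, dim=1024):
            super().__init__()
            self.block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, x):
            # Activations inside self.block are not stored; they are recomputed on backward.
            return checkpoint(self.block, x)

    x = torch.randn(4, 1024, requires_grad=True)
    loss = CheckpointedBlock()(x).sum()
    loss.backward()   # triggers recomputation of the checkpointed activations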

Lastly, activation offloading builds directly on activation checkpointing. It’s a strategy that keeps only a few tensor activations in GPU memory during model training. Specifically, we move the checkpointed activations to CPU memory during the forward pass and load them back to the GPU for the backward pass of a specific micro-batch.

Micro-batches and placement strategies

Other topics that sometimes cause confusion for customers are micro-batches and placement strategies. Both are hyperparameters you can supply to the SageMaker model parallel library. Specifically, micro-batches are relevant when implementing models that rely on pipeline parallelism, such as those of 30 billion parameters or more.

Micro-batches are subsets of minibatches. When your model is in its training loop, you define a certain number of records to pick up and pass forward and backward through the layers; this is called a minibatch, or sometimes just a batch. A full pass through your dataset is called an epoch. To run forward and backward passes with pipeline parallelism, the SageMaker model parallel library splits each minibatch into smaller subsets called micro-batches, which are run one at a time to maximize GPU utilization. In our GPT-2 example, we added a default of one micro-batch directly to the training script.

As you scale up your training configuration, we strongly recommend that you adjust your batch size and micro-batch size accordingly. This is the only way to ensure good performance: you must consider batch size and micro-batch size as a function of your overall world size when relying on pipeline parallelism.
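As a sketch (the values here are illustrative, and you should check the parameter names against the SMP documentation for your library version), the pipeline degree, micro-batch count, and batch size are typically scaled together.

    # Illustrative scaling of batch size and micro-batches with pipeline parallelism.
    smp_options = {
        "enabled": True,
        "parameters": {
            "pipeline_parallel_degree": 4,
            "microbatches": 8,        # each minibatch is split into 8 micro-batches
            "ddp": True,
        },
    }

    hyperparameters = {
        "train_batch_size": 64,       # keep divisible by the number of micro-batches
    }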

Placement strategies tell SageMaker where to physically place your model partitions. If you’re using both model parallelism and data parallelism, setting placement_strategy to “cluster” places model replicas on device IDs (GPUs) that are physically close to each other. However, if you want to be more prescriptive about your parallelism strategy, you can break it down into a single string with different combinations of three letters: D for data parallelism, P for pipeline parallelism, and T for tensor parallelism. We generally recommend keeping the default placement of “cluster”, because this is most appropriate for large-scale model training. The “cluster” placement corresponds to “DPT”.

For more information about placement strategies, see Placement Strategy with Tensor Parallelism.

Example use case

Let’s imagine you have one ml.p3.16xlarge in your training job. That gives you 8 NVIDIA V100s per node. Remember, every time you add an extra instance, you incur additional bandwidth overhead, so it’s always better to have more GPUs on a single node. In this case, you’re better off with one ml.p3.16xlarge than, for example, two ml.p3.8xlarge instances. Even though the number of GPUs is the same, the extra bandwidth overhead of the extra node slows down your throughput.

The following diagram illustrates four-way model parallelism combined with two-way data parallelism. This means you actually have two replicas of your model (think data parallel), each of which is partitioned across four GPUs (model parallel).

If any of those model partitions are too large to fit onto a single GPU, you can add an extra type of distribution, tensor parallelism, to split them across multiple devices.

Conclusion

In this blog post, we discussed the SageMaker distributed training libraries, focusing especially on model parallelism. We shared performance benchmarks from our latest test: 32 samples per second across 120 ml.p4d.24xlarge instances with a 175-billion-parameter model on Amazon SageMaker. We anticipate that if we increased this to 240 p4 instances, we could train a 175-billion-parameter model in 25 days.

We also discussed the newest features that enable large-scale training, namely tensor parallelism, optimizer state sharding, activation checkpointing, and activation offloading. We shared some tips and tricks for enabling these through training on Amazon SageMaker.

Try it out yourself using the same notebook that generated our numbers, which is available on GitHub. You can also request more GPUs for your AWS account by requesting a service limit increase.


About the Authors

Emily Webber joined AWS just after SageMaker launched, and has been trying to tell the world about it ever since! Outside of building new ML experiences for customers, Emily enjoys meditating and studying Tibetan Buddhism.

Aditya Bindal is a Senior Product Manager for AWS Deep Learning. He works on products that make it easier for customers to train deep learning models on AWS. In his spare time, he enjoys spending time with his daughter, playing tennis, reading historical fiction, and traveling.

Luis Quintela is the Software Developer Manager for the AWS SageMaker model parallel library. In his spare time, he can be found riding his Harley in the SF Bay Area.

Read More