Part 3: How NatWest Group built auditable, reproducible, and explainable ML models with Amazon SageMaker

This is the third post of a four-part series detailing how NatWest Group, a major financial services institution, partnered with AWS Professional Services to build a new machine learning operations (MLOps) platform. This post is intended for data scientists, MLOps engineers, and data engineers who are interested in building ML pipeline templates with Amazon SageMaker. We explain how NatWest Group used SageMaker to create standardized end-to-end MLOps processes. This solution reduced the time-to-value for ML solutions from 12 months to less than 3 months, and reduced costs while maintaining NatWest Group’s high security and auditability requirements.

NatWest Group chose to collaborate with AWS Professional Services given their expertise in building secure and scalable technology platforms. The joint team worked to produce a sustainable long-term solution that supports NatWest’s purpose of helping families, people, and businesses thrive by offering the best financial products and services. We aim to do this while using AWS managed architectures to minimize compute and reduce carbon emissions.

SageMaker project templates

One of the exciting aspects of using ML to answer business challenges is that there are many ways to build a solution. These include not only the ML code itself, but also connecting to data, preprocessing, postprocessing, performing data quality checks, and monitoring, to name a few. However, this flexibility can create its own challenges for enterprises aiming to scale the deployment of ML-based solutions. It can lead to a lack of consistency, standardization, and reusability that slows the creation of new business propositions. This is where the concept of a template for your ML code comes in. It allows you to define a standard way to develop an ML pipeline that you can reuse across multiple projects, teams, and use cases, while still allowing the flexibility needed for key components. This makes the solutions we build more consistent, robust, and easier to test, and it makes development go much faster.

Through a recent collaboration between NatWest Group and AWS, we developed a set of SageMaker project templates that embody all the standard processes and steps that form part of a robust ML pipeline. These are used within the data science platform built on infrastructure templates that were described in Part 2 of this series.

The following figure shows the architecture of our training pipeline template.

The following screenshot shows what is displayed in Amazon SageMaker Studio when this pipeline is successfully run.

All of this helps improve our ML development practices in multiple ways, which we discuss in the following sections.

Reusable

ML projects at NatWest often require cross-disciplinary teams of data scientists and engineers to collaborate on the same project. This means that shareability and reproducibility of code, models, and pipelines are two important features teams should be able to rely on.

To meet this requirement, the team set up Amazon SageMaker Pipelines templates for training and inference as the two starting points for new ML projects. The data science team deploys these templates into new development accounts as SageMaker projects with a single click through AWS Service Catalog. These AWS Service Catalog products are managed and developed by a central platform team so that best practice updates are propagated across all business units, while consuming teams can still contribute their own developments to the common codebase.

To achieve easy replication of templates, NatWest utilized AWS Systems Manager Parameter Store. The parameters were referenced in the generic templates, where we followed a naming convention in which each project’s set of parameters uses the same prefix. This means templates can be started up quickly and still access all necessary platform resources, without users needing to spend time configuring them. This works because AWS Service Catalog replaces any parameters in the use case accounts with the correct value.
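To illustrate the convention, the following minimal sketch (with a hypothetical project prefix and parameter names) shows how a template could resolve all of its platform configuration from Parameter Store in one call:

```python
import boto3

ssm = boto3.client("ssm")

# Hypothetical naming convention: every parameter for a project shares one prefix,
# for example /mlops/credit-risk/vpc-id, /mlops/credit-risk/kms-key-arn, ...
PROJECT_PREFIX = "/mlops/credit-risk/"


def load_project_parameters(prefix: str) -> dict:
    """Return {name-without-prefix: value} for every parameter under the prefix."""
    params = {}
    paginator = ssm.get_paginator("get_parameters_by_path")
    for page in paginator.paginate(Path=prefix, Recursive=True, WithDecryption=True):
        for parameter in page["Parameters"]:
            params[parameter["Name"][len(prefix):]] = parameter["Value"]
    return params


config = load_project_parameters(PROJECT_PREFIX)
# The template can now reference config["vpc-id"], config["kms-key-arn"], and so on
# without the user having to type any account-specific values.
```

Because every project follows the same prefix pattern, the same lookup logic can ship unchanged in every template.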

Not only do these templates enable a much quicker startup of new projects, but they also provide more standardized model development practices in teams aligned to different business areas. This means that model developers across the organization can more easily understand code written by other teams, facilitating collaboration, knowledge sharing, and easier auditing.

Efficient

The template approach encouraged model developers to write modular and decoupled pipelines, allowing them to take advantage of the flexibility of SageMaker managed instances. Decoupled steps in an ML pipeline mean that smaller instances can be used where possible to save costs, so larger instances are only used when needed (for example, for larger data preprocessing workloads).

SageMaker Pipelines also offers a step-caching capability. If a pipeline run fails, steps that already completed successfully aren't rerun; the pipeline reuses their cached outputs and resumes from the failed step. This approach means that the organization is only charged for the higher-cost instances for a minimal amount of time, which also aligns ML workloads with NatWest's climate goal to reduce carbon emissions.
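The sketch below shows how step caching can be enabled with the SageMaker Python SDK; the processor, script name, and expiry period are illustrative rather than NatWest's actual configuration:

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import CacheConfig, ProcessingStep

# Reuse a step's outputs for up to 30 days when its inputs and arguments are unchanged
cache_config = CacheConfig(enable_caching=True, expire_after="P30D")

# A modest instance is enough for this illustrative preprocessing step; heavier steps
# in the same pipeline can request larger instance types independently.
preprocessor = SKLearnProcessor(
    framework_version="1.0-1",
    role="<execution-role-arn>",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

preprocess_step = ProcessingStep(
    name="Preprocess",
    processor=preprocessor,
    code="preprocess.py",       # hypothetical preprocessing script
    cache_config=cache_config,  # cached outputs are reused when a later run resumes after a failure
)
```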

Finally, we used SageMaker integration with Amazon EventBridge and Amazon Simple Notification Service (Amazon SNS) for custom email notifications on key pipeline updates to ensure that data scientists and engineers are kept up to date with the status of their development.
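As a rough sketch of this integration (the rule name, topic ARN, and matched statuses are assumptions for illustration), an EventBridge rule can forward pipeline status changes to an SNS topic that emails the team:

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical rule and topic names; the SNS topic is assumed to already exist,
# to have an email subscription, and to allow events.amazonaws.com to publish.
RULE_NAME = "ml-pipeline-status-notifications"
TOPIC_ARN = "arn:aws:sns:eu-west-2:123456789012:ml-pipeline-alerts"

# Match pipeline executions that succeed, fail, or are stopped
event_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Building Pipeline Execution Status Change"],
    "detail": {"currentPipelineExecutionStatus": ["Succeeded", "Failed", "Stopped"]},
}

events.put_rule(Name=RULE_NAME, EventPattern=json.dumps(event_pattern))
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "notify-team", "Arn": TOPIC_ARN}],
)
```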

Auditable

The template ML pipelines contain steps that generate artifacts, including source code, serialized model objects, scalers used to transform data during preprocessing, and transformed data. We use the Amazon Simple Storage Service (Amazon S3) versioning feature to securely keep multiple variants of objects like these in the same bucket. This means objects that are deleted or overwritten can be restored, and every version remains available, so we can keep track of all the objects generated in each run of the pipeline. When an object is deleted, instead of removing it permanently, Amazon S3 inserts a delete marker, which becomes the current object version. When an object is overwritten, Amazon S3 creates a new object version in the bucket, and the previous version can be restored if necessary. The versioning feature provides a transparent view for any governance process, so the artifacts generated are always auditable.
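For illustration, the bucket-level behavior boils down to the following calls (the bucket name and object key are placeholders; in practice versioning is switched on by the infrastructure templates rather than ad hoc code):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-ml-artifacts-bucket"  # hypothetical artifact bucket

# Turn on versioning so overwritten or deleted artifacts can always be recovered
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Inspect every version of a serialized model object produced by pipeline runs
versions = s3.list_object_versions(Bucket=BUCKET, Prefix="models/model.tar.gz")
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"], "latest" if v["IsLatest"] else "")
```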

To properly audit and trust our ML models, we need to keep a record of the experiments we ran during development. SageMaker allows metrics to be collected automatically by a training job and displayed in SageMaker Experiments and the model registry. These capabilities are also integrated with the SageMaker Studio UI, providing a visual interface to browse active and past experiments, visually compare trials on key performance metrics, and identify the best performing models.
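As a hedged example of this automatic collection (the image URI, role, and log regexes are placeholders), metric capture is declared on the estimator, and values matched in the training logs then surface in SageMaker Experiments and Studio:

```python
from sagemaker.estimator import Estimator

# Illustrative only: the regexes must match whatever the training script writes to its logs.
estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    metric_definitions=[
        {"Name": "validation:auc", "Regex": "validation-auc=([0-9\\.]+)"},
        {"Name": "train:logloss", "Regex": "train-logloss=([0-9\\.]+)"},
    ],
)
# When a training job (for example, a pipeline TrainingStep) runs with this estimator,
# the matched metric values are captured automatically and can be compared across runs.
```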

Secure

Due to the unique requirements of NatWest Group and the financial services industry, with an emphasis on security in a banking context, the ML templates we created required additional features compared to standard SageMaker pipeline templates. For example, container images used for processing and training jobs need to download approved packages from AWS CodeArtifact because they can’t do so from the internet directly. The templates need to work with a multi-account deployment model, which allows secure testing in a production-like environment.
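A minimal sketch of how a job inside the VPC might resolve the private package index at run time follows (domain, account ID, and repository names are hypothetical); in the templates this is typically handled by the container build or pip configuration rather than inline code:

```python
import boto3

codeartifact = boto3.client("codeartifact")

# Hypothetical domain and repository owned by the central platform account
token = codeartifact.get_authorization_token(
    domain="example-platform-domain",
    domainOwner="123456789012",
)["authorizationToken"]

endpoint = codeartifact.get_repository_endpoint(
    domain="example-platform-domain",
    domainOwner="123456789012",
    repository="approved-python-packages",
    format="pypi",
)["repositoryEndpoint"]

# pip can then be pointed at the private repository instead of the public internet,
# for example: pip install --index-url <index_url> <package>
index_url = endpoint.replace("https://", f"https://aws:{token}@") + "simple/"
print(index_url)
```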

SageMaker Model Monitor and SageMaker Clarify

A core part of NatWest’s vision is to help our customers thrive, which we can only do if we’re making decisions in a fair, equitable, and transparent manner. For ML models, this leads directly to the requirement for model explainability, bias reporting, and performance monitoring. This means that the business can track and understand how and why models make specific decisions.

Previously, the lack of standardization across teams meant that no consistent model monitoring capability was in place, leading teams to create many bespoke solutions. Two vital components of the new SageMaker project templates were bias detection on the training data and monitoring of the trained model. Amazon SageMaker Model Monitor and Amazon SageMaker Clarify suit these purposes, and both operate in a scalable and repeatable manner.

Model Monitor continuously monitors the quality of SageMaker ML models in production against predefined baseline metrics; its data quality checks build on the open-source Deequ library. Clarify offers bias checks on the training data. After the model is trained, Clarify offers model bias and explainability checks using various metrics, such as Shapley values. The following screenshot shows an explainability report viewed in SageMaker Studio and produced by Clarify.

The following screenshot shows a similarly produced example model bias report.
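For readers who want to reproduce something similar, the following hedged sketch shows how pre-training bias and SHAP-based explainability checks can be configured with the SageMaker Python SDK; every bucket, column, and model name here is a placeholder rather than NatWest's configuration:

```python
from sagemaker.clarify import (
    BiasConfig,
    DataConfig,
    ModelConfig,
    SageMakerClarifyProcessor,
    SHAPConfig,
)

# All names, paths, and column labels below are placeholders for illustration.
clarify_processor = SageMakerClarifyProcessor(
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

data_config = DataConfig(
    s3_data_input_path="s3://example-bucket/train/train.csv",
    s3_output_path="s3://example-bucket/clarify-output/",
    label="target",
    dataset_type="text/csv",
)

# Pre-training bias: compare outcomes across a hypothetical sensitive facet
bias_config = BiasConfig(
    label_values_or_threshold=[1],
    facet_name="customer_segment",
)

model_config = ModelConfig(
    model_name="example-model",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

# Explainability via SHAP values, as surfaced in the Studio report
shap_config = SHAPConfig(num_samples=100, agg_method="mean_abs")

clarify_processor.run_pre_training_bias(data_config=data_config, data_bias_config=bias_config)
clarify_processor.run_explainability(
    data_config=data_config, model_config=model_config, explainability_config=shap_config
)
```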

Conclusion

NatWest’s ML templates align our MLOps solution to our need for auditable, reusable, and explainable MLOps assets across the organization. With this blueprint for instantiating ML pipelines in a standardized manner using SageMaker features, data science and engineering teams at NatWest feel empowered to operate and collaborate with best practices, with the control and visibility needed to manage their ML workloads effectively.

In addition to the quick-start ML templates and account infrastructure, several existing NatWest ML use cases were developed using these capabilities. The development of these use cases informed the requirements and direction of the overall collaboration. This meant that the project templates we set up are highly relevant for the organization’s business needs.

This is the third post of a four-part series on the strategic collaboration between NatWest Group and AWS Professional Services. Check out the rest of the series for the following topics:

  • Part 1 explains how NatWest Group partnered with AWS Professional Services to build a scalable, secure, and sustainable MLOps platform
  • Part 2 describes how NatWest Group used AWS Service Catalog and SageMaker to deploy their compliant and self-service MLOps platform
  • Part 4 details how NatWest data science teams migrate their existing models to SageMaker architectures

About the Authors

Ariadna Blanca Romero is a Data Scientist/ML Engineer at NatWest with a background in computational material science. She is passionate about creating tools for automation that take advantage of the reproducibility of results from models and optimize the process to ensure prompt decision-making to deliver the best customer experience. Ariadna volunteers as a STEM virtual mentor for a non-profit organization supporting the professional development of Latin American students and young professionals, with a particular focus on empowering women. She has a passion for practicing yoga, cooking, and playing her electric guitar.

Lucy Thomas is a data engineer at NatWest Group, working in the Data Science and Innovation team on ML solutions used by data science teams across NatWest. Her background is in physics, with a master’s degree from Lancaster University. In her spare time, Lucy enjoys bouldering, board games, and spending time with family and friends.

Sarah Bourial is a Data Scientist at AWS Professional Services. She is passionate about MLOps and supports global customers in the financial services industry in building optimized AI/ML solutions to accelerate their cloud journey. Prior to AWS, Sarah worked as a Data Science consultant for various customers, always looking to achieve impactful business outcomes. Outside work, you will find her wandering around art galleries in exotic destinations.

Pauline Ting is Data Scientist in the AWS Professional Services team. She supports financial, sports and media customers in achieving and accelerating their business outcome by developing AI/ML solutions. In her spare time, Pauline enjoys traveling, surfing, and trying new dessert places.

Part 2: How NatWest Group built a secure, compliant, self-service MLOps platform using AWS Service Catalog and Amazon SageMaker

This is the second post of a four-part series detailing how NatWest Group, a major financial services institution, partnered with AWS Professional Services to build a new machine learning operations (MLOps) platform. In this post, we share how the NatWest Group utilized AWS to enable the self-service deployment of their standardized, secure, and compliant MLOps platform using AWS Service Catalog and Amazon SageMaker. This has led to a reduction in the amount of time it takes to provision new environments from days to just a few hours.

We believe that decision-makers can benefit from this content. CTOs, CDAOs, senior data scientists, and senior cloud engineers can follow this pattern to provide innovative solutions for their data science and engineering teams.

Technology at NatWest Group

NatWest Group is a relationship bank for a digital world that provides financial services to more than 19 million customers across the UK. The Group has a diverse technology portfolio, where solutions to business challenges are often delivered using bespoke designs and with lengthy timelines.

Recently, NatWest Group adopted a cloud-first strategy, which has enabled the company to use managed services to provision on-demand compute and storage resources. This move has led to an improvement in overall stability, scalability, and performance of business solutions, while reducing cost and accelerating delivery cadence. Additionally, moving to the cloud allows NatWest Group to simplify its technology stack by enforcing a set of consistent, repeatable, and preapproved solution designs to meet regulatory requirements and operate in a controlled manner.

Challenges

The pilot stages of adopting a cloud-first approach involved several experimentation and evaluation phases utilizing a wide variety of analytics services on AWS. The first iterations of NatWest Group’s cloud platform for data science workloads faced challenges in provisioning consistent, secure, and compliant cloud environments. The process of creating new environments took anywhere from a few days to weeks or even months. A reliance on central platform teams to build, provision, secure, deploy, and manage infrastructure and data sources made it difficult to onboard new teams to work in the cloud.

Due to the disparity in infrastructure configuration across AWS accounts, teams who decided to migrate their workloads to the cloud had to go through an elaborate compliance process. Each infrastructure component had to be analyzed separately, which increased security audit timelines.

Getting started with development in AWS involved reading a set of documentation guides written by platform teams. Initial environment setup steps included managing public and private keys for authentication, configuring connections to remote services using the AWS Command Line Interface (AWS CLI) or SDK from local development environments, and running custom scripts for linking local IDEs to cloud services. Technical challenges often made it difficult to onboard new team members. After the development environments were configured, the route to release software in production was similarly complex and lengthy.

As described in Part 1 of this series, the joint project team collected large amounts of feedback on user experience and requirements from teams across NatWest Group prior to building the new data science and MLOps platform. A common theme in this feedback was the need for automation and standardization as a precursor to quick and efficient project delivery on AWS. The new platform uses AWS managed services to optimize cost, cut down on platform configuration efforts, and reduce the carbon footprint from running unnecessarily large compute jobs. Standardization is embedded in the heart of the platform, with preapproved, fully configured, secure, compliant, and reusable infrastructure components that are sharable among data and analytics teams.

Why SageMaker Studio?

The team chose Amazon SageMaker Studio as the main tool for building and deploying ML pipelines. Studio provides a single web-based interface that gives users complete access, control, and visibility into each step required to build, train, and deploy models. The maturity of the Studio IDE (integrated development environment) for model development, metadata tracking, artifact management, and deployment was among the features that appealed strongly to the NatWest Group team.

Data scientists at NatWest Group work with SageMaker notebooks inside Studio during the initial stages of model development to perform data analysis, data wrangling, and feature engineering. After users are happy with the results of this initial work, the code is easily converted into composable functions for data transformation, model training, inference, logging, and unit tests so that it’s in a production-ready state.

Later stages of the model development lifecycle involve the use of Amazon SageMaker Pipelines, which can be visually inspected and monitored in Studio. Pipelines are visualized in a DAG (Directed Acyclic Graph) that color-codes steps based on their state while the pipeline runs. In addition, a summary of Amazon CloudWatch Logs is displayed next to the DAG to facilitate the debugging of failed steps. Data scientists are provided with a code template consisting of all the foundational steps in a SageMaker pipeline. This provides a standardized framework (consistent across all users of the platform to ease collaboration and knowledge sharing) into which developers can add the bespoke logic and application code that is particular to the business challenge they’re solving.

Developers run the pipelines within the Studio IDE to ensure their code changes integrate correctly with other pipeline steps. After code changes have been reviewed and approved, these pipelines are built and run automatically based on a main Git repository branch trigger. During model training, model evaluation metrics are stored and tracked in SageMaker Experiments, which can be used for hyperparameter tuning. After a model is trained, the model artifact is stored in the SageMaker model registry, along with metadata related to model containers, data used during training, model features, and model code. The model registry plays a key role in the model deployment process because it packages all model information and enables the automation of model promotion to production environments.

MLOps engineers deploy managed SageMaker batch transform jobs, which scale to meet workload demands. Both offline batch inference jobs and online models served via an endpoint use SageMaker’s managed inference functionality. This benefits both platform and business application teams because platform engineers no longer spend time configuring infrastructure components for model inference, and business application teams don’t write additional boilerplate code to set up and interact with compute instances.

Why AWS Service Catalog?

The team chose AWS Service Catalog to build a catalog of secure, compliant, and preapproved infrastructure templates. The infrastructure components in an AWS Service Catalog product are preconfigured to meet NatWest Group’s security requirements. Role access management, resource policies, networking configuration, and central control policies are configured for each resource packaged in an AWS Service Catalog product. The products are versioned and shared with application teams by following a standard process that enables data science and engineering teams to self-serve and deploy infrastructure immediately after obtaining access to their AWS accounts.

Platform development teams can easily evolve AWS Service Catalog products over time to enable the implementation of new features based on business requirements. Iterative changes to products are made with the help of AWS Service Catalog product versioning. When a new product version is released, the platform team merges code changes to the main Git branch and increments the version of the AWS Service Catalog product. There is a degree of autonomy and flexibility in updating the infrastructure because business application accounts can use earlier versions of products before they migrate to the latest version.

Solution overview

The following high-level architecture diagram shows how a typical business application use case is deployed on AWS. The following sections go into more detail regarding the account architecture, how infrastructure is deployed, user access management, and how different AWS services are used to build ML solutions.

As shown in the architecture diagram, accounts follow a hub and spoke model. A shared platform account serves as a hub account, where resources required by business application team (spoke) accounts are hosted by the platform team. These resources include the following:

  • A library of secure, standardized infrastructure products used for self-service infrastructure deployments, hosted by AWS Service Catalog
  • Docker images, stored in Amazon Elastic Container Registry (Amazon ECR), which are used during the run of SageMaker pipeline steps and model inference
  • AWS CodeArtifact repositories, which host preapproved Python packages

These resources are automatically shared with spoke accounts via the AWS Service Catalog portfolio sharing and import feature, and AWS Identity and Access Management (IAM) trust policies in the case of both Amazon ECR and CodeArtifact.

Each business application team is provisioned three AWS accounts in the NatWest Group infrastructure environment: development, pre-production, and production. The environment names refer to the account’s intended role in the data science development lifecycle. The development account is used to perform data analysis and wrangling, write model and model pipeline code, train models, and trigger model deployments to pre-production and production environments via SageMaker Studio. The pre-production account mirrors the setup of the production account and is used to test model deployments and batch transform jobs before they are released into production. The production account hosts models and runs production inferencing workloads.

User management

NatWest Group has strict governance processes to enforce user role separation. Five separate IAM roles have been created, one for each user persona.

The platform team uses the following roles:

  • Platform support engineer – This role contains permissions for business-as-usual tasks and a read-only view of the rest of the environment for monitoring and debugging the platform.
  • Platform fix engineer – This role has been created with elevated permissions. It’s used if there are issues with the platform that require manual intervention. This role is only assumed in an approved, time-limited manner.

The business application development teams have three distinct roles:

  • Technical lead – This role is assigned to the application team lead, often a senior data scientist. This user has permission to deploy and manage AWS Service Catalog products, trigger releases into production, and review the status of the environment, such as AWS CodePipeline statuses and logs. This role doesn’t have permission to approve a model in the SageMaker model registry.
  • Developer – This role is assigned to all team members that work with SageMaker Studio, which includes engineers, data scientists, and often the team lead. This role has permissions to open Studio, write code, and run and deploy SageMaker pipelines. Like the technical lead, this role doesn’t have permission to approve a model in the model registry.
  • Model approver – This role has limited permissions relating to viewing, approving, and rejecting models in the model registry. The reason for this separation is to prevent any users that can build and train models from approving and releasing their own models into escalated environments.

Separate Studio user profiles are created for developers and model approvers. The solution uses a combination of IAM policy statements and SageMaker user profile tags so that users are only permitted to open a user profile that matches their user type. This makes sure that the user is assigned the correct SageMaker execution IAM role (and therefore permissions) when they open the Studio IDE.
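The following sketch illustrates the general pattern; the tag keys, profile name, and policy fragment are assumptions for illustration rather than NatWest's actual configuration. A Studio user profile is tagged with its user type, and an IAM condition only allows opening profiles whose tag matches the caller's:

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical profile tagged with its owner and user type.
sm.create_user_profile(
    DomainId="d-exampledomain",
    UserProfileName="jane-doe-developer",
    Tags=[
        {"Key": "studiouserid", "Value": "jane-doe"},
        {"Key": "user-type", "Value": "developer"},
    ],
    UserSettings={"ExecutionRole": "arn:aws:iam::123456789012:role/developer-execution-role"},
)

# Illustrative IAM policy statement (expressed here as a Python dict) attached to the
# developer role: Studio can only be opened for profiles whose tag matches the caller's.
presigned_url_statement = {
    "Effect": "Allow",
    "Action": "sagemaker:CreatePresignedDomainUrl",
    "Resource": "*",
    "Condition": {
        "StringEquals": {
            "sagemaker:ResourceTag/studiouserid": "${aws:PrincipalTag/studiouserid}"
        }
    },
}
```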

Self-service deployments with AWS Service Catalog

End-users utilize AWS Service Catalog to deploy data science infrastructure products, such as the following:

  • A Studio environment
  • Studio user profiles
  • Model deployment pipelines
  • Training pipelines
  • Inference pipelines
  • A system for monitoring and alerting

End-users deploy these products directly through the AWS Service Catalog UI, meaning there is less reliance on central platform teams to provision environments. This has vastly reduced the time it takes for users to gain access to new cloud environments, from multiple days down to just a few hours, which ultimately has led to a significant improvement in time-to-value. The use of a common set of AWS Service Catalog products supports consistency within projects across the enterprise and lowers the barrier for collaboration and reuse.
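End-users normally deploy these products through the AWS Service Catalog console, but the same action can be scripted; here is a hedged sketch with placeholder product IDs and parameters:

```python
import boto3

sc = boto3.client("servicecatalog")

# Product and artifact IDs are placeholders; in practice users pick these
# from the portfolio shared by the central platform account.
response = sc.provision_product(
    ProductId="prod-examplestudio123",
    ProvisioningArtifactId="pa-exampleversion1",
    ProvisionedProductName="team-a-studio-environment",
    ProvisioningParameters=[
        {"Key": "TeamName", "Value": "team-a"},
        {"Key": "Environment", "Value": "development"},
    ],
)
print(response["RecordDetail"]["Status"])
```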

Because all data science infrastructure is now deployed via a centrally developed catalog of infrastructure products, care has been taken to build each of these products with security in mind. Services have been configured to communicate within Amazon Virtual Private Cloud (Amazon VPC) so traffic doesn’t traverse the public internet. Data is encrypted in transit and at rest using AWS Key Management Service (AWS KMS) keys. IAM roles have also been set up to follow the principle of least privilege.

Finally, with AWS Service Catalog, it’s easy for the platform team to continually release new products and services as they become available or required by business application teams. These can take the form of new infrastructure products, for example providing the ability for end-users to deploy their own Amazon EMR clusters, or updates to existing infrastructure products. Because AWS Service Catalog supports product versioning and utilizes AWS CloudFormation behind the scenes, in-place upgrades can be used when new versions of existing products are released. This allows the platform teams to focus on building and improving products, rather than developing complex upgrade processes.

Integration with NatWest’s existing IaC software

AWS Service Catalog is used for self-service data science infrastructure deployments. Additionally, NatWest’s standard infrastructure as code (IaC) tool, Terraform, is used to build infrastructure in the AWS accounts. Terraform is used by platform teams during the initial account setup process to deploy prerequisite infrastructure resources such as VPCs, security groups, AWS Systems Manager parameters, KMS keys, and standard security controls. Infrastructure in the hub account, such as the AWS Service Catalog portfolios and the resources used to build Docker images, are also defined using Terraform. However, the AWS Service Catalog products themselves are built using standard CloudFormation templates.

Improving developer productivity and code quality with SageMaker projects

SageMaker projects provide developers and data scientists access to quick-start projects without leaving SageMaker Studio. These quick-start projects allow you to deploy multiple infrastructure resources at the same time in just a few clicks. These include a Git repository containing a standardized project template for the selected model type, Amazon Simple Storage Service (Amazon S3) buckets for storing data, serialized models and artifacts, and model training and inference CodePipeline pipelines.

The introduction of standardized codebase architectures and tooling now makes it easy for data scientists and engineers to move between projects and ensures code quality remains high. For example, software engineering best practices such as linting and formatting checks (run both as automated checks and pre-commit hooks), unit tests, and coverage reports are now automated as part of training pipelines, providing standardization across all projects. This has improved the maintainability of ML projects and will make it easier to move these projects into production.

Automating model deployments

The model training process is orchestrated using SageMaker Pipelines. After models have been trained, they’re stored in the SageMaker model registry. Users assigned the model approver role can open the model registry and find information relating to the training process, such as when the model was trained, hyperparameter values, and evaluation metrics. This information helps the user decide whether to approve or reject a model. Rejecting a model prevents the model from being deployed into an escalated environment, whereas approving a model triggers a model promotion pipeline via CodePipeline that automatically copies the model to the pre-production AWS account, ready for inference workload testing. After the team has confirmed that the model works correctly in pre-production, a manual step in the same pipeline is approved and the model is automatically copied over to the production account, ready for production inferencing workloads.
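For illustration, the approval itself is a single API call, and a promotion pipeline can be triggered from the resulting model package state-change event; the ARN, description, and event pattern below are placeholders rather than NatWest's implementation:

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder ARN: the latest pending version in the model package group
MODEL_PACKAGE_ARN = (
    "arn:aws:sagemaker:eu-west-2:123456789012:model-package/example-model-group/3"
)

# The model approver flips the status; setting "Rejected" instead blocks promotion.
sm.update_model_package(
    ModelPackageArn=MODEL_PACKAGE_ARN,
    ModelApprovalStatus="Approved",
    ApprovalDescription="Metrics reviewed against the agreed acceptance criteria",
)

# A promotion pipeline can listen for the resulting state-change event, for example
# with an EventBridge pattern like the following (illustrative only):
promotion_trigger_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {"ModelApprovalStatus": ["Approved"]},
}
```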

Outcomes

One of the main aims of this collaborative project between NatWest and AWS was to reduce the time it takes to provision and deploy data science cloud environments and ML models into production. This has been achieved—NatWest can now provision new, scalable, and secure AWS environments in a matter of hours, compared to days or even weeks. Data scientists and engineers are now empowered to deploy and manage data science infrastructure by themselves using AWS Service Catalog, reducing reliance on centralized platform teams. Additionally, the use of SageMaker projects enables users to begin coding and training models within minutes, while also providing standardized project structures and tooling.

Because AWS Service Catalog serves as the central method to deploy data science infrastructure, the platform can easily be expanded and upgraded in the future. New AWS services can be offered to end-users quickly when the need arises, and existing AWS Service Catalog products can be upgraded in place to take advantage of new features.

Finally, the move towards managed services on AWS means compute resources are provisioned and shut down on demand. This has provided cost savings and flexibility, while also aligning with NatWest’s ambition to be net-zero by 2050 due to an estimated 75% reduction in CO2 emissions.

Conclusion

The adoption of a cloud-first strategy at NatWest Group led to the creation of a robust AWS solution that can support a large number of business application teams across the organization. Managing infrastructure with AWS Service Catalog has improved the cloud onboarding process significantly by using secure, compliant, and preapproved building blocks of infrastructure that can be easily expanded. Managed SageMaker infrastructure components have improved the model development process and accelerated the delivery of ML projects.

To learn more about the process of building production-ready ML models at NatWest Group, take a look at the rest of this four-part series on the strategic collaboration between NatWest Group and AWS Professional Services:

  • Part 1 explains how NatWest Group partnered with AWS Professional Services to build a scalable, secure, and sustainable MLOps platform
  • Part 3 provides an overview of how NatWest Group uses SageMaker services to build auditable, reproducible, and explainable ML models
  • Part 4 details how NatWest data science teams migrate their existing models to SageMaker architectures

About the Authors

Junaid Baba is a DevOps Consultant at AWS Professional Services. He leverages his experience in Kubernetes, distributed computing, and AI/MLOps to accelerate cloud adoption for UK financial services industry customers. Junaid has been with AWS since June 2018. Prior to that, Junaid worked with a number of financial start-ups driving DevOps practices. Outside of work he has interests in trekking, modern art, and still photography.

Yordanka Ivanova is a Data Engineer at NatWest Group. She has experience in building and delivering data solutions for companies in the financial services industry. Prior to joining NatWest, Yordanka worked as a technical consultant where she gained experience in leveraging a wide variety of cloud services and open-source technologies to deliver business outcomes across multiple cloud platforms. In her spare time, Yordanka enjoys working out, traveling and playing guitar.

Michael England is a software engineer in the Data Science and Innovation team at NatWest Group. He is passionate about developing solutions for running large-scale Machine Learning workloads in the cloud. Prior to joining NatWest Group, Michael worked in and led software engineering teams developing critical applications in the financial services and travel industries. In his spare time, he enjoys playing guitar, travelling and exploring the countryside on his bike.

Part 1: How NatWest Group built a scalable, secure, and sustainable MLOps platform

This is the first post of a four-part series detailing how NatWest Group, a major financial services institution, partnered with AWS to build a scalable, secure, and sustainable machine learning operations (MLOps) platform.

This initial post provides an overview of how the AWS and NatWest Group joint team implemented Amazon SageMaker Studio as the standard for the majority of their data science environment in just 9 months. The intended audience includes decision-makers who would like to standardize their ML workflows, such as CDAO, CDO, CTO, Head of Innovation, and lead data scientists. Subsequent posts will detail the technical implementation of this solution.

MLOps

For NatWest Group, MLOps focuses on realizing the value from any data science activity through the adoption of DevOps and engineering best practices to build solutions and products that have an ML component at their core. It provides the standards, tools, and framework that supports the work of data science teams to take their ideas from the whiteboard to production in a timely, secure, traceable, and repeatable manner.

Strategic collaboration between NatWest Group and AWS

NatWest Group is the largest business and commercial bank in the UK, with a leading retail business. Acting as a relationship bank in a digital world, the Group champions potential by helping 19 million people, families, and businesses in communities throughout the UK and Ireland to thrive.

As the Group attempted to scale their use of advanced analytics in the enterprise, they discovered that the time taken to develop and release ML models and solutions into production was too long. They partnered with AWS to design, build, and launch a modern, secure, scalable, and sustainable self-service platform for developing and productionizing ML-based services to support their business and customers. AWS Professional Services worked with them to accelerate the implementation of AWS best practices for Amazon SageMaker services.

The aims of the collaboration between NatWest Group and AWS were to provide the following:

  • A federated, self-service, and DevOps-driven approach for infrastructure and application code, with a clear route-to-live that will lead to deployment times being measured in minutes (the current average is 60 minutes) and not weeks.
  • A secure, controlled, and templated environment to accelerate innovation with ML models and insights using industry best practices and bank-wide shared artifacts.
  • The ability to access and share data more easily and consistently across the enterprise.
  • A modern toolset based upon a managed architecture running on demand that would minimize compute requirements, drive down costs, and drive sustainable ML development and operations. This should have the flexibility to accommodate new AWS products and services to meet ongoing use case and compliance requirements.
  • Adoption, engagement, and training support for data science and engineering teams across the enterprise.

To meet the security requirements of the bank, public internet access is disabled and all data is encrypted with custom keys. As described in Part 2 of this series, a secure instance of SageMaker Studio is deployed to the development account in 60 minutes. When the account setup is complete, a new use case template is requested by the data scientists via SageMaker projects in SageMaker Studio. This process deploys the necessary infrastructure that ensures MLOps capabilities in the development account (with minimal support required from operational teams) such as CI/CD pipelines, unit testing, model testing, and monitoring.

This was to be provided in a common layer of capability, displayed in the following figure.

The process

The joint AWS-NatWest Group team used an agile five-step process to discover, design, build, test, and deploy the new platform over 9 months:

  • Discovery – Several information-gathering sessions were conducted to identify the current pain points within their ML lifecycle. These included challenges around data discovery, infrastructure setup and configuration, model building, governance, route-to-live, and the operating model. Working backward, AWS and NatWest Group understood the core requirements, priorities, and dependencies that helped create a common vision, success criteria, and delivery plan for the MLOps platform.
  • Design – Based on the output from the Discovery phase, the team iterated towards the final design for the MLOps platform by combining best practices and advice from AWS and existing experience of using cloud services within NatWest Group. This was done with a particular focus on ensuring compliance with the security and governance requirements typical within the financial services domain.
  • Build – The team collaboratively built the Terraform and AWS CloudFormation templates for the platform infrastructure. Feedback was continually gathered from end-users (data scientists, ML and data engineers, platform support team, security and governance, and senior stakeholders) to ensure deliverables matched original goals.
  • Test – A crucial aspect of the delivery was to demonstrate the platform’s viability on real business analytics and ML use cases. NatWest identified three projects that covered a range of business challenges and data science complexity that would allow the new platform to be tested across dimensions including scalability, flexibility, and accessibility. AWS and NatWest Group data scientists and engineers co-created the baseline environment templates and SageMaker pipelines based upon these use cases.
  • Launch – Once the capability was proven, the team launched the new platform into the organization, providing bespoke training plans and careful adoption and engagement support to the federated business teams to onboard their own use cases and users.

The scalable ML framework

In a business with millions of customers sitting across multiple lines of business, ML workflows require the integration of data owned and managed by siloed teams using different tools to unlock business value. Because NatWest Group is committed to the protection of its customers’ data, the infrastructure used for ML model development is also subject to high security standards, which adds further complexity and impacts the time-to-value for new ML models. In a scalable ML framework, toolset modernization and standardization are necessary to reduce the efforts required for combining different tools and to simplify the route-to-live process for new ML models.

Prior to engaging with AWS, support for data science activity was controlled by a central platform team that collected requirements, and provisioned and maintained infrastructure for data teams across the organization. NatWest has ambitions to rapidly scale the use of ML in federated teams across the organization, and needed a scalable ML framework that enables developers of new models and pipelines to self-serve the deployment of a modern, pre-approved, standardized, and secure infrastructure. This reduces the dependency on the centralized platform teams and allows for a faster time-to-value for ML model development.

The framework in its simplest terms allows a data consumer (data scientist or ML engineer) to browse and discover the pre-approved data they require to train their model, gain access to that data in a quick and simple manner, use that data to prove their model viability, and release that proven model to production for others to utilize, thereby unlocking business value.

The following figure denotes this flow and the benefits that the framework provides. These include reduced compute costs (due to the managed and on-demand infrastructure), reduced operational overhead (due to the self-service templates), reduced time to spin up and spin down secure compliant ML environments (with AWS Service Catalog products), and a simplified route-to-live.

At a lower level, the scalable ML framework comprises the following:

  • Self-service infrastructure deployment – Reduces dependency on centralized teams.
  • A central Python package management system – Makes pre-approved Python packages available for model development.
  • CI/CD pipelines for model development and promotion – Decreases time-to-live by including CI/CD pipelines as part of infrastructure as code (IaC) templates.
  • Model testing capabilities – Unit testing, model testing, integration testing, and end-to-end testing functionalities are automatically available for new models.
  • Model decoupling and orchestration – Decouples model steps according to computational resource requirements and orchestration of the different steps via Amazon SageMaker Pipelines. This avoids unnecessary compute and makes deployments more robust.
  • Code standardization – Code quality is standardized by validating Python Enhancement Proposal 8 (PEP 8) compliance as part of the CI/CD pipelines.
  • Quick-start generic ML templates – AWS Service Catalog templates that instantiate your ML modelling environments (development, pre-production, production) and associated pipelines with the click of a button. This is done via Amazon SageMaker Projects deployment.
  • Data and model quality monitoring – Automatically monitors drift in your data and model quality via Amazon SageMaker Model Monitor. This helps ensure your models perform to operational requirements and within risk appetite.
  • Bias monitoring – Automatically checks for data imbalances and if changes in the world have introduced bias to your model. This helps the model owners ensure that they make fair and equitable decisions.

To prove SageMaker’s capabilities across a range of data processing and ML architectures, three use cases were selected from different divisions within the banking group. Use case data was obfuscated and made available in a local Amazon Simple Storage Service (Amazon S3) data bucket in the use case development account for the capabilities proving phase. When the model migration was complete, the data was made available via the NatWest cloud-hosted data lake, where it’s read by the production models. The predictions generated by the production models are then written back to the data lake. Future use cases will incorporate Cloud Data Explorer, an accelerator that allows data engineers and scientists to easily search and browse NatWest’s pre-approved data catalog, which accelerates the data discovery process.

As defined by AWS best practices, three accounts (dev, test, prod) are provisioned for each use case. To meet security requirements, public internet access is disabled and all data is encrypted with custom keys. As described in this blog post, a secure instance of SageMaker Studio is then deployed to the development account in a matter of minutes. When the account setup is complete, a new use case template is requested by the data scientists via SageMaker Projects in Studio. This process deploys the necessary infrastructure that ensures MLOps capabilities in the development account (with minimal support required from operational teams) such as CI/CD pipelines, unit testing, model testing, and monitoring.

Each use case was then developed (or refactored in the case of an existing application codebase) to run in a SageMaker architecture as illustrated in this blog post, utilizing SageMaker capabilities such as experiment tracking, model explainability, bias detection, and data quality monitoring. These capabilities were added to each use case pipeline as described in this blog post.

Cloud-first: The solution for sustainable ML model development and deployment

Training ML models using large datasets requires a lot of computational resources. However, you can optimize the amount of energy used by running training workflows in AWS. Studies suggest that AWS typically produces a nearly 80% lower carbon footprint and is five times more energy efficient than the median-surveyed European enterprise data centers. In addition, adopting an on-demand managed architecture for ML workflows, such as what they developed during the partnership, allows NatWest Group to only provision the necessary resources for the work to be done.

For example, in a use case with a big dataset where only 10% of the data columns contain useful information for model training, the on-demand architecture orchestrated by SageMaker Pipelines allows the data preprocessing step to be split into two phases: first, read and filter; second, feature engineering. That way, a larger compute instance is used for reading and filtering the data, because that requires more computational resources, whereas a smaller instance is used for the feature engineering, which only needs to process the 10% of columns that are useful. Finally, the inclusion of SageMaker services such as Model Monitor and Pipelines allows for continuous monitoring of data quality, avoiding inference with models whose data or model quality has drifted. This further saves energy and compute resources by avoiding compute jobs that don’t bring business value.
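A hedged sketch of that two-phase split using the SageMaker Python SDK follows; the instance types, script names, and pipeline name are illustrative only:

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

ROLE = "<execution-role-arn>"  # placeholder

# Phase 1: read and filter the full dataset on a larger instance
read_filter_processor = SKLearnProcessor(
    framework_version="1.0-1", role=ROLE,
    instance_type="ml.m5.4xlarge", instance_count=1,
)
read_filter_step = ProcessingStep(
    name="ReadAndFilter",
    processor=read_filter_processor,
    code="read_and_filter.py",  # hypothetical script that keeps only the useful columns
)

# Phase 2: feature engineering on the reduced data, on a smaller instance
feature_processor = SKLearnProcessor(
    framework_version="1.0-1", role=ROLE,
    instance_type="ml.m5.xlarge", instance_count=1,
)
feature_step = ProcessingStep(
    name="FeatureEngineering",
    processor=feature_processor,
    code="feature_engineering.py",  # hypothetical script
    depends_on=[read_filter_step],
)

pipeline = Pipeline(name="two-phase-preprocessing", steps=[read_filter_step, feature_step])
```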

During the partnership, multiple sustainability optimization techniques were introduced to NatWest’s ML model development and deployment strategy, from the selection of efficient on-demand managed architectures to the compression of data in efficient file formats. Initial calculations suggest considerable carbon emissions reductions when compared to other cloud architectures, supporting NatWest Group’s ambition to be net zero by 2050.

Outcomes

During the 9 months of delivery, NatWest and AWS worked as a team to build and scale the MLOps capabilities across the bank. The overall achievements of the partnership include:

  • Scaling MLOps capability across NatWest, with over 300 data scientists and data engineers being trained to work in the developed platform
  • Implementation of a scalable, secure, cost-efficient, and sustainable infrastructure deployment using AWS Service Catalog to create a SageMaker on-demand managed infrastructure
  • Standardization of the ML model development and deployment process across multiple teams
  • Reduction of technical debt of existing models and the creation of reusable artifacts to speed up future model development
  • Reduction of idea-to-value time for data and analytics use cases from 40 to 16 weeks
  • Reduction of ML use case environment creation time from 35–40 days to 1–2 days, including multiple validations of the use case required by the bank

Conclusion

This post provided an overview of the AWS and NatWest Group partnership that resulted in the implementation of a scalable ML framework and successfully reduced the time-to-live of ML models at NatWest Group.

In a joint effort, AWS and NatWest implemented standards to ensure scalability, security, and sustainability of ML workflows across the organization. Subsequent posts in this series will provide more details on the implemented solution:

  • Part 2 describes how NatWest Group used AWS Service Catalog and SageMaker to deploy their secure, compliant, and self-service MLOps platform. It is intended to provide details for platform developers such as DevOps, platform engineers, security, and IT teams.
  • Part 3 provides an overview of how NatWest Group uses SageMaker services to build auditable, reproducible, and explainable ML models. It is intended to provide details for model developers, such as data scientists and ML or data engineers.
  • Part 4 details how NatWest data science teams migrate their existing models to SageMaker architectures. It is intended to provide details on how to successfully migrate existing models, aimed at model developers such as data scientists and ML or data engineers.

AWS Professional Services is ready to help your team develop scalable and production-ready ML in AWS. For more information, see AWS Professional Services or reach out through your account manager to get in touch.


About the Authors

Maira Ladeira Tanke is a Data Scientist at AWS Professional Services. As a technical lead, she helps customers accelerate their achievement of business value through emerging technology and innovative solutions. Maira has been with AWS since January 2020. Prior to that, she worked as a data scientist in multiple industries focusing on achieving business value from data. In her free time, Maira enjoys traveling and spending time with her family someplace warm.

Andy McMahon is a ML Engineering Lead in the Data Science and Innovation team in NatWest Group. He helps develop and implement best practices and capabilities for taking machine learning products to production across the bank. Andy has several years’ experience leading data science and ML teams and speaks and writes extensively on MLOps (this includes the book “Machine Learning Engineering with Python”, 2021, Packt). In his spare time, he loves running around after his young son, reading, playing badminton and golf, and planning the next family adventure with his wife.

Greig Cowan is a data science and engineering leader in the financial services domain who is passionate about building data-driven solutions and transforming the enterprise through MLOps best practices. Over the past four years, his team have delivered data and machine learning-led products across NatWest Group and have defined how people, process, data, and technology should work together to optimize time-to-business-value. He leads the NatWest Group data science community, builds links with academic and commercial partners, and provides leadership and guidance to the technical teams building the bank’s data and analytics infrastructure. Prior to joining NatWest Group, he spent many years as a scientific researcher in high-energy particle physics as a leading member of the LHCb collaboration at the CERN Large Hadron Collider. In his spare time, Greig enjoys long-distance running, playing chess, and looking after his two amazing children.

Brian Cavanagh is a Senior Global Engagement Manager at AWS Professional Services with a passion for implementing emerging technology solutions (AI, ML, and data) to accelerate the release of value for our customers. Brian has been with AWS since December 2019. Prior to working at AWS, Brian delivered innovative solutions for major investment banks to drive efficiency gains and optimize their resources. Outside of work, Brian is a keen cyclist and you will usually find him competing on Zwift.

Sokratis Kartakis is a Senior Customer Delivery Architect on AI/ML at AWS Professional Services. Sokratis’s focus is to guide enterprise and global customers on their AI cloud transformation and MLOps journey targeting business outcomes and leading the delivery across EMEA. Prior to AWS, Sokratis spent over 15 years inventing, designing, leading, and implementing innovative solutions and strategies in AI/ML and IoT, helping companies save billions of dollars in energy, retail, health, finance and banking, supply chain, motorsports, and more. Outside of work, Sokratis enjoys spending time with his family and riding motorbikes.

Accelerate data preparation with data quality and insights in Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler is a new capability of Amazon SageMaker that helps data scientists and data engineers quickly and easily prepare data for machine learning (ML) applications using a visual interface. It contains over 300 built-in data transformations so you can quickly normalize, transform, and combine features without having to write any code.

Today, we’re excited to announce the new Data Quality and Insights Report feature within Data Wrangler. This report automatically verifies data quality and detects abnormalities in your data. Data scientists and data engineers can use this tool to efficiently and quickly apply domain knowledge to process datasets for ML model training.

The report includes the following sections:

  • Summary statistics – This section provides insights into the number of rows, features, % missing, % valid, duplicate rows, and a breakdown of the type of feature (e.g. numeric vs. text).
  • Data Quality Warnings – This section provides insights into warnings that the insights report has detected including items such as presence of small minority class, high target cardinality, rare target label, imbalanced class distribution, skewed target, heavy tailed target, outliers in target, regression frequent label, and invalid values.
  • Target Column Insights – This section provides statistics on the target column including % valid, % missing, % outliers, univariate statistics such as min/median/max, and also presents examples of observations with outlier or invalid target values.
  • Quick Model – The data insights report automatically trains a model on a small sample of your data set to provide a directional check on feature engineering progress and provides associated model statistics in the report.
  • Feature Importance – This section provides a ranking of features by feature importance which are automatically calculated when preparing the data insights and data quality report.
  • Anomalous and duplicate rows – The data quality and insights report detects anomalous samples using the Isolation forest algorithm and also surfaces duplicate rows that may be present in the data set.
  • Feature details – This section provides summary statistics for each feature in the data set as well as the corresponding distribution of the target variable.

In this post, we use the insights and recommendations of the Data Quality and Insights Report to process data by applying the suggested transformation steps without writing any code.

Get insights with the Data Quality and Insights Report

Our goal is to predict the rent cost of houses using the Brazilian houses to rent dataset. This dataset has features describing the living unit size, number of rooms, floor number, and more. To import the data set, first upload it to an Amazon Simple Storage Service (S3) bucket, and follow the steps here. After we load the dataset into Data Wrangler, we start by producing the report.

  1. Choose the plus sign and choose Get data insights.
  2. To predict rental cost, we first choose Create analysis within the Data Wrangler data flow UI.
  3. For Target column, choose the column rent amount (R$).
  4. For Problem type, select Regression.
  5. Choose Start to generate the report.

If you don’t have a target column, you can still produce the report without it.

The Summary section from the generated report provides an overview of the data and displays general statistics such as the number of features, feature types, number of rows in the data, and missing and valid values.

The High Priority Warnings section lists the warnings that most need your attention. These are abnormalities in the data that could affect the quality of our ML model and should be addressed.

For your convenience, high-priority warnings are displayed at the top of the report, while other relevant warnings appear later. In our case, we found two major issues:

  • The data contains many (5.65%) duplicate samples
  • A feature is suspected of target leakage

You can find further information about these issues in the text that accompanies the warnings as well as suggestions on how to resolve them using Data Wrangler.

Investigate the Data Quality and Insights Report warnings

To investigate the first warning, we go to the Duplicate Rows section, where the report lists the most common duplicates. The first row in the following table shows that the data contains 22 samples of the same house in Porto Alegre.

From our domain knowledge, we know that there aren’t 22 identical apartments, so we conclude that this is an artifact caused by faulty data collection. Therefore, we follow the recommendation and remove the duplicate samples using the suggested transform.

We also note the duplicate rows in the dataset because they can affect the ability to detect target leakage. The report warns us of the following:

“Duplicate samples resulting from faulty data collection, could derail machine learning processes that rely on splitting to independent training and validation folds. For example quick model scores, prediction power estimation….”

So we recheck for target leakage after resolving the duplicate rows issue.

Next, we observe a medium-priority warning regarding outlier values in the Target Column section (to avoid clutter, medium-priority warnings aren’t highlighted at the top of the report).

Data Wrangler detected that 0.1% of the target column values are outliers. The outlier values are gathered into a separate bin in the histogram, and the most distant outliers are listed in a table.

For most houses, rent is below 15,000, but a few houses have much higher rent. In our case, we’re not interested in predicting these extreme values, so we follow the recommendation to remove these outliers using the Handle outliers transform. We can get there by choosing the plus sign for the data flow, then choosing Add transform. Then we add the Handle outliers transform and provide the necessary configuration.

We follow the recommendations to drop duplicate rows using the Manage rows transform and drop outliers in the target column using the Handle outliers transform.
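
For readers who want to see the equivalent logic outside Data Wrangler, the following minimal pandas sketch reproduces these two cleanup steps. It assumes the dataset has been loaded locally into a DataFrame (the file name is a placeholder), and it uses a simple fixed cutoff for outliers rather than Data Wrangler’s managed Handle outliers transform.

import pandas as pd

# Illustrative sketch only; Data Wrangler applies these steps as managed transforms
df = pd.read_csv("houses_to_rent.csv")  # placeholder path to a local copy of the dataset

# Drop exact duplicate rows (the report flagged 5.65% duplicate samples)
df = df.drop_duplicates()

# Drop extreme outliers in the target column using a simple fixed cutoff
target_column = "rent amount (R$)"
df = df[df[target_column] < 15000]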

Address additional warnings on transformed data

After we address the initial issues found in the data, we run the Data Quality and Insights Report on the transformed data.

We can verify that the problems of duplicate rows and outliers in the target column were resolved.

However, the feature fire insurance (R$) is still suspected of target leakage. Furthermore, we see that the feature Total (R$) has very high prediction power. From our domain expertise, we know that fire insurance (R$) is tightly coupled with rent cost, and isn’t available when making the prediction.

Similarly, Total (R$) is the sum of a few columns, which includes our target rent column. Therefore, Total (R$) isn’t available in prediction time. We follow the recommendation to drop these columns using the Manage columns transform.

To provide an estimate of the expected prediction quality of a model trained on the data, Data Wrangler trains an XGBoost model on a training fold and calculates prediction quality scores on a validation fold. You can find the results in the Quick Model section of the report. The resulting R2 score is almost perfect—0.997. This is another indication of target leakage—we know that rent prices can’t be predicted that well.

In the Feature Details section, the report raises a Disguised missing value warning for the hoa (R$) column (HOA stands for Home Owners Association):

“The frequency of the value 0 in the feature hoa (R$) is 22.2% which is uncommon for numeric features. This could point to bugs in data collection or processing. In some cases the frequent value is a default value or a placeholder to indicate a missing value….”

We know that the hoa (R$) column records the home owners association tax, and that zeros represent missing values. It’s better to indicate that we don’t know if these homes belong to an association. Therefore, we follow the recommendation to replace the zeros with NaNs using the Convert regex to missing transform under Search and edit.

We follow the recommendation to drop the fire insurance (R$) and Total (R$) columns using the Manage columns transform and replacing zeros with NaNs for hoa (R$).
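
The equivalent logic in pandas is again only a few lines. The following sketch assumes the same locally loaded DataFrame and uses the column names from the Brazilian houses to rent dataset as they appear in this post; the file name is a placeholder.

import numpy as np
import pandas as pd

df = pd.read_csv("houses_to_rent_cleaned.csv")  # placeholder path to the deduplicated data

# Drop the columns suspected of target leakage
df = df.drop(columns=["fire insurance (R$)", "Total (R$)"])

# Treat the frequent zeros in hoa (R$) as missing values rather than true zeros
df["hoa (R$)"] = df["hoa (R$)"].replace(0, np.nan)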

Create the final report

To verify that all issues are resolved, we generate the report again. First, we note that there are no high-priority warnings.

Looking again at Quick model, we see that due to removing the columns with target leakage, the validation R2 score dropped from the “too good to be true” value of 0.997 to 0.653, which is more reasonable.

Additionally, fixing hoa (R$) by replacing zeros with NaNs improved the prediction power of that column from 0.35 to 0.52.

Finally, because ML algorithms usually require numeric inputs, we encode the categorical and binary columns (see the Summary section) as ordinal integers using the Ordinal encode transform under Encode Categorical. In general, categorical variables can be encoded as one-hot or ordinal variables.
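
As an illustration of the difference between the two encodings (outside Data Wrangler), the following sketch uses pandas on a hypothetical furniture column with example values similar to those in this dataset.

import pandas as pd

df = pd.DataFrame({"furniture": ["furnished", "not furnished", "furnished"]})  # example values

# Ordinal encoding: each category becomes an integer code in a single column
df["furniture_ordinal"] = df["furniture"].astype("category").cat.codes

# One-hot encoding: one binary indicator column per category
one_hot = pd.get_dummies(df["furniture"], prefix="furniture")
print(df.join(one_hot))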

Now the data is ready for model training. We can export the processed data to an Amazon Simple Storage Service (Amazon S3) bucket or create a processing job.

Additionally, you can download the reports you generate by choosing the download icon.

Conclusion

In this post, we saw how the Data Quality and Insights Report can help you employ domain expertise and business logic to process data for ML training. Data scientists of all skill levels can find value in this report. Experienced users can use it to speed up their work, because the report automatically produces various statistics and performs many checks that would otherwise be done manually.

We encourage you to replicate the example in this post in your Data Wrangler data flow to see how you can apply this to your domain and datasets. For more information on using Data Wrangler, refer to Get Started with Data Wrangler.


About the Authors

Stephanie Yuan is an SDE in the SageMaker Data Wrangler team based out of Seattle, WA. Outside of work, she enjoys getting out into the nature that the PNW has to offer.

Joseph Cho is an engineer who’s built a number of web applications from bottom up using modern programming techniques. He’s driven key features into production and has a history of practical problem solving. In his spare time Joseph enjoys playing chess and going to concerts.

Peter Chung is a Solutions Architect for AWS, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions in both the public and private sectors. He holds all AWS certifications as well as two GCP certifications.

Yotam Elor is a Senior Applied Scientist at Amazon SageMaker. His research interests are in machine learning, particularly for tabular data.

Read More

Host Hugging Face transformer models using Amazon SageMaker Serverless Inference

The last few years have seen rapid growth in the field of natural language processing (NLP) using transformer deep learning architectures. With its Transformers open-source library and machine learning (ML) platform, Hugging Face makes transfer learning and the latest transformer models accessible to the global AI community. This can reduce the time needed for data scientists and ML engineers in companies around the world to take advantage of every new scientific advancement. Amazon SageMaker and Hugging Face have been collaborating to simplify and accelerate adoption of transformer models with Hugging Face DLCs, integration with SageMaker Training Compiler, and SageMaker distributed libraries.

SageMaker provides different options for ML practitioners to deploy trained transformer models for generating inferences:

  • Real-time inference endpoints, which are suitable for workloads that need to be processed with low latency requirements in the order of milliseconds.
  • Batch transform, which is ideal for offline predictions on large batches of data.
  • Asynchronous inference, which is ideal for workloads requiring large payload size (up to 1 GB) and long inference processing times (up to 15 minutes). Asynchronous inference enables you to save on costs by auto scaling the instance count to zero when there are no requests to process.
  • Serverless Inference, which is a new purpose-built inference option that makes it easy for you to deploy and scale ML models.

In this post, we explore how to use SageMaker Serverless Inference to deploy Hugging Face transformer models and discuss the inference performance and cost-effectiveness in different scenarios.

Overview of solution

AWS recently announced the General Availability of SageMaker Serverless Inference, a purpose-built serverless inference option that makes it easy for you to deploy and scale ML models. With Serverless Inference, you don’t need to provision capacity and forecast usage patterns upfront. As a result, it saves you time and eliminates the need to choose instance types and manage scaling policies. Serverless Inference endpoints automatically start, scale, and shut down capacity, and you only pay for the duration of running the inference code and the amount of data processed, not for idle time. Serverless inference is ideal for applications with less predictable or intermittent traffic patterns. Visit Serverless Inference to learn more.

Since its preview launch at AWS re:Invent 2021, we have made the following improvements:

  • General Availability across all commercial Regions where SageMaker is generally available (except AWS China Regions).
  • Increased maximum concurrent invocations per endpoint limit to 200 (from 50 during preview), allowing you to onboard high traffic workloads.
  • Added support for the SageMaker Python SDK, abstracting the complexity from ML model deployment process.
  • Added model registry support. You can register your models in the model registry and deploy directly to serverless endpoints. This allows you to integrate your SageMaker Serverless Inference endpoints with your MLOps workflows.

Let’s walk through how to deploy Hugging Face models on SageMaker Serverless Inference.

Deploy a Hugging Face model using SageMaker Serverless Inference

The Hugging Face framework is supported by SageMaker, and you can directly use the SageMaker Python SDK to deploy the model into the Serverless Inference endpoint by simply adding a few lines in the configuration. We use the SageMaker Python SDK in our example scripts. If you have models from different frameworks that need more customization, you can use the AWS SDK for Python (Boto3) to deploy models on a Serverless Inference endpoint. For more details, refer to Deploying ML models using SageMaker Serverless Inference (Preview).

First, you need to create a HuggingFaceModel. You can select the model you want to deploy from the Hugging Face Hub; for example, distilbert-base-uncased-finetuned-sst-2-english. This model is a fine-tuned checkpoint of DistilBERT-base-uncased, fine-tuned on SST-2. See the following code:

import sagemaker
from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

# Specify Model Image_uri
image_uri='763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.9.1-transformers4.12.3-cpu-py38-ubuntu20.04'

# Hub Model configuration. <https://huggingface.co/models>
hub = {
    'HF_MODEL_ID':'cardiffnlp/twitter-roberta-base-sentiment',
    'HF_TASK':'text-classification'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub,                      # configuration for loading model from Hub
   role=sagemaker.get_execution_role(),                    
                                 # iam role with permissions
   transformers_version="4.17",  # transformers version used
   pytorch_version="1.10",        # pytorch version used
   py_version='py38',            # python version used
   image_uri=image_uri,          # image uri
)

After you create the HuggingFaceModel, you need to define the ServerlessInferenceConfig, which contains the configuration of the Serverless Inference endpoint, namely the memory size and the maximum number of concurrent invocations for the endpoint. After that, you can deploy the Serverless Inference endpoint with the deploy method:

# Specify MemorySizeInMB and MaxConcurrency
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096, max_concurrency=10,
)

# deploy the serverless endpoint
predictor = huggingface_model.deploy(
    serverless_inference_config=serverless_config
)
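
Once the endpoint is in service, you can send requests through the returned predictor and clean up when you’re done. The following is a minimal sketch; the input format follows the Hugging Face inference toolkit convention for text classification, and the sample sentence is only an example.

# Invoke the serverless endpoint with a text-classification request
result = predictor.predict({
    "inputs": "I am very happy with SageMaker Serverless Inference."
})
print(result)  # for example, a list containing a predicted label and score

# Delete the endpoint when it's no longer needed
predictor.delete_endpoint()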

This post covers the sample code snippet for deploying the Hugging Face model on a Serverless Inference endpoint. For a step-by-step Python code sample with detailed instructions, refer to the following notebook.

Inference performance

SageMaker Serverless Inference greatly simplifies hosting for serverless transformer or deep learning models. However, its runtime can vary depending on the configuration setup. You can choose a Serverless Inference endpoint memory size from 1024 MB (1 GB) up to 6144 MB (6 GB). In general, memory size should be at least as large as the model size. However, it’s a good practice to refer to memory utilization when deciding the endpoint memory size, in addition to the model size itself.

Serverless Inference integrates with AWS Lambda to offer high availability, built-in fault tolerance, and automatic scaling. Although Serverless Inference auto-assigns compute resources proportional to the model size, a larger memory size implies the container would have access to more vCPUs and have more compute power. However, as of this writing, Serverless Inference only supports CPUs and does not support GPUs.

On the other hand, with Serverless Inference, you can expect cold starts when the endpoint has no traffic for a while and suddenly receives inference requests, because it would take some time to spin up the compute resources. Cold starts may also happen during scaling, such as if new concurrent requests exceed the previous concurrent requests. The duration for a cold start varies from under 100 milliseconds to a few seconds, depending on the model size, how long it takes to download the model, as well as the startup time of the container. For more information about monitoring, refer to Monitor Amazon SageMaker with Amazon CloudWatch.

In the following table, we summarize the results of experiments performed to collect latency information of Serverless Inference endpoints over various Hugging Face models and different endpoint memory sizes. As input, a 128 sequence length payload was used. The model latency increased when the model size became large; the latency number is related to the model specifically. The overall model latency could be improved in various ways, including reducing model size, a better inference script, loading the model more efficiently, and quantization.

| Model ID | Task | Model Size | Memory Size | Model Latency (avg) | Model Latency (p99) | Overhead Latency (avg) | Overhead Latency (p99) |
|---|---|---|---|---|---|---|---|
| distilbert-base-uncased-finetuned-sst-2-english | Text classification | 255 MB | 4096 MB | 220 ms | 243 ms | 17 ms | 43 ms |
| xlm-roberta-large-finetuned-conll03-english | Token classification | 2.09 GB | 5120 MB | 1494 ms | 1608 ms | 18 ms | 46 ms |
| deepset/roberta-base-squad2 | Question answering | 473 MB | 5120 MB | 451 ms | 468 ms | 18 ms | 31 ms |
| google/pegasus-xsum | Summarization | 2.12 GB | 6144 MB | 22501 ms | 32516 ms | 27 ms | 97 ms |
| sshleifer/distilbart-cnn-12-6 | Summarization | 1.14 GB | 6144 MB | 12683 ms | 18669 ms | 25 ms | 53 ms |

Regarding price performance, we can use the DistilBERT model as an example and compare the serverless inference option to an ml.t2.medium instance (2 vCPUs, 4 GB Memory, $42 per month) on a real-time endpoint. With the real-time endpoint, DistilBERT has a p99 latency of 240 milliseconds and total request latency of 251 milliseconds. Whereas in the preceding table, Serverless Inference has a model latency p99 of 243 milliseconds and an overhead latency p99 of 43 milliseconds. With Serverless Inference, you only pay for the compute capacity used to process inference requests, billed by the millisecond, and the amount of data processed. Therefore, we can estimate the cost of running a DistilBERT model (distilbert-base-uncased-finetuned-sst-2-english, with price estimates for the us-east-1 Region) as follows:

  • Total request time – 243ms + 43ms = 286ms
  • Compute cost (4096 MB) – $0.000080 USD per second
  • 1 request cost – 1 * 0.286 * $0.000080 = $0.00002288
  • 1 thousand requests cost – 1,000 * 0.286 * $0.000080 = $0.02288
  • 1 million requests cost – 1,000,000 * 0.286 * $0.000080 = $22.88

The following figure is the cost comparison for different Hugging Face models on SageMaker Serverless Inference vs. real-time inference (using a ml.t2.medium instance).

Figure 1: Cost comparison for different Hugging Face models on SageMaker Serverless Inference vs. real-time inference

As you can see, Serverless Inference is a cost-effective option when the traffic is intermittent or low. If you have a strict inference latency requirement, consider using real-time inference with higher compute power.

Conclusion

Hugging Face and AWS announced a partnership earlier in 2022 that makes it even easier to train Hugging Face models on SageMaker. This functionality is available through the development of Hugging Face AWS DLCs. These containers include the Hugging Face Transformers, Tokenizers, and Datasets libraries, which allow us to use these resources for training and inference jobs. For a list of the available DLC images, see Available Deep Learning Containers Images. They are maintained and regularly updated with security patches. We can find many examples of how to train Hugging Face models with these DLCs in the following GitHub repo.

In this post, we introduced how you can use the newly announced SageMaker Serverless Inference to deploy Hugging Face models. We provided a detailed code snippet using the SageMaker Python SDK to deploy Hugging Face models with SageMaker Serverless Inference. We then dived deep into inference latency over various Hugging Face models as well as price performance. You are welcome to try this new service and we’re excited to receive more feedback!


About the Authors

James Yi is a Senior AI/ML Partner Solutions Architect in the Emerging Technologies team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy and scale AI/ML applications to derive their business values. Outside of work, he enjoys playing soccer, traveling and spending time with his family.

Rishabh Ray Chaudhury is a Senior Product Manager with Amazon SageMaker, focusing on Machine Learning inference. He is passionate about innovating and building new experiences for Machine Learning customers on AWS to help scale their workloads. In his spare time, he enjoys traveling and cooking. You can find him on LinkedIn.

Philipp Schmid is a Machine Learning Engineer and Tech Lead at Hugging Face, where he leads the collaboration with the Amazon SageMaker team. He is passionate about democratizing, optimizing, and productionizing cutting-edge NLP models and improving the ease of use for Deep Learning.

Yanyan Zhang is a Data Scientist in the Energy Delivery team with AWS Professional Services. She is passionate about helping customers solve real problems with AI/ML knowledge. Outside of work, she loves traveling, working out and exploring new things.

Read More

How Nordic Aviation Capital uses Amazon Rekognition to streamline operations and save up to EUR200,000 annually

Nordic Aviation Capital (NAC) is the industry’s leading regional aircraft lessor, serving almost 70 airlines in approximately 45 countries worldwide.

In 2021, NAC turned to AWS to help it use artificial intelligence (AI) to further improve its leasing operations and reduce its reliance on manual labor.

With Amazon Rekognition Custom Labels, NAC built an AI solution that could automatically scan aircraft maintenance records and identify, based on their visual layouts, the specific documents requiring further review. This reduced their reliance on external contractors to do this work, improving speed and saving an estimated EUR200,000 annually in costs.

In this post, we share how NAC uses Amazon Rekognition to streamline their operations.

“Amazon Rekognition Custom Labels has given us superpowers when it comes to improving our aircraft maintenance reviews. We’re both impressed and excited by the opportunities this opens up for our team and the value it can help us create for our customers.”

-Mads Krog-Jensen, Senior Vice President of IT, NAC

Automating the document review process with AI

A key part of NAC’s leasing process is the validation of each of its leased aircraft’s maintenance history to determine the safety and operability of its component parts.

This process requires NAC’s maintenance technicians to validate a variety of key forms, with the collection of documents containing the maintenance history of each of the aircraft’s key parts, known as the maintenance package.

These maintenance packages are extensive and unstructured, often amounting to as many as 10,000 pages, and containing various types and formats of documents that can vary widely based on the age and maintenance history of the aircraft.

The task of finding these specific forms was long and menial, generally performed by external contractors, who could take as long as a week to review each maintenance package and identify any essential forms requiring further review. This created a key process bottleneck that added additional cost and time to NAC’s leasing process.

To streamline this process, NAC set out to develop an AI-driven document review workflow that could automate this manual process by scanning entire maintenance packages to accurately identify and return only those documents that required further review by NAC specialists.

Building a custom computer vision solution with Amazon Rekognition

To solve this, NAC’s Director of Software Engineering, Martin Høst Normark, turned to Rekognition Custom Labels, a fully-managed computer vision service that helps developers quickly and easily train and deploy custom computer vision models tailored to any use case.

Rekognition Custom Labels accelerates the development of custom computer vision models by building on the capabilities of Amazon Rekognition and simplifying the key steps of the computer vision development process, such as image labeling, data inspection, and algorithm selection and deployment. Rekognition Custom Labels allows you to build custom computer vision models for image classification and object detection tasks. You can navigate through the image labeling process from within the Rekognition Custom Labels console or use Amazon SageMaker Ground Truth to label images at scale. Rekognition Custom Labels automatically inspects the data, selects the right model framework and algorithm, optimizes the hyperparameters, and trains the model. When you’re satisfied with the model accuracy, you can host the trained model with just one click.

NAC chose Amazon Rekognition because it significantly reduced the undifferentiated heavy lifting of training and deploying a custom computer vision model. For example, instead of requiring thousands of labeled training images to get started, as is the case with most custom computer vision models, NAC was able to get started with just a few hundred examples of the types of documents it needed to identify. These images, together with an equal number of negative examples chosen at random, were then loaded into an Amazon Simple Storage Service (Amazon S3) bucket to be used for model training. This also enabled NAC to use the automatic labeling feature in Rekognition Custom Labels, which could infer the labels of the two types of documents based solely on their S3 folder names.

From there, NAC was able to start training its model in just a few clicks, at which point Rekognition Custom Labels took care of loading and inspecting the training data, selecting the correct machine learning algorithm, training and testing the model, and reporting its performance metrics.
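
After the model is trained and the model version is started, analyzing a document page is a single API call. The following Boto3 sketch shows the general shape of that call; the project version ARN, bucket, and image name are placeholders rather than NAC’s actual resources.

import boto3

rekognition = boto3.client("rekognition")

# Placeholder values for illustration only
response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:eu-west-1:111122223333:project/example-project/version/example-version/1234567890123",
    Image={"S3Object": {"Bucket": "example-maintenance-packages", "Name": "package-001/page-0042.png"}},
    MinConfidence=75,
)

for label in response["CustomLabels"]:
    print(label["Name"], round(label["Confidence"], 1))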

In order for the solution to deliver real business value, NAC identified a minimum performance baseline of 75% recall for its computer vision model, meaning that the solution had to be able to capture at least 75% of all relevant documents in any given maintenance package to warrant being used in production.

Using Rekognition Custom Labels and training on only those initial images, NAC was able to produce an initial model within its first week of development that delivered a recall of 98%, beating its performance baseline by 23 percentage points.

NAC then spent an additional week inspecting the types of pages causing classification errors, and added some additional examples of those challenging examples to its S3 bucket to retrain its model. This step further optimized performance above 99% recall and far exceeded its production performance requirements.

Improving operational efficiency and increasing innovation with AWS

With Rekognition Custom Labels, NAC was able to build, in just two weeks, a production-ready custom computer vision solution that could accurately identify and return relevant documents at higher than 99% accuracy, reducing to a matter of minutes a process that previously took manual reviewers about a week to complete.

This success has enabled NAC to move this solution to production, removing key process bottlenecks in its aircraft maintenance review processes to improve efficiency, reduce reliance on external contractors, and continue to deliver on its 30-year history of technical and commercial innovation in the regional aircraft industry.

Conclusion

Rekognition Custom Labels can help you develop custom computer vision models with ease by simplifying key steps such as image labeling, data inspection, and algorithm selection and deployment.

Learn more about how you can build custom computer vision models tailored to your specific use case by visiting Getting Started with Amazon Rekognition Custom Labels or reviewing the Amazon Rekognition Custom Labels Guide.


About the Author

Daniel Burke is the European lead for AI and ML in the Private Equity group at AWS. Daniel works directly with Private Equity funds and their portfolio companies, helping them accelerate their AI and ML adoption, improve innovation, and increase enterprise value.

Read More

Secure AWS CodeArtifact access for isolated Amazon SageMaker notebook instances

AWS CodeArtifact allows developers to connect internal code repositories to upstream code repositories like PyPI, Maven, or npm. AWS CodeArtifact is a powerful addition to CI/CD workflows on AWS, but it is similarly effective for codebases developed in Jupyter notebooks. This is a common development paradigm for machine learning (ML) developers who build and train ML models regularly.

In this post, we demonstrate how to securely connect to AWS CodeArtifact from an Internet-disabled SageMaker notebook instance. This post is for network and security architects who support decentralized data science teams on AWS.

In another post, we discussed how to create an Internet-disabled notebook in a private subnet of an Amazon VPC while maintaining connectivity to AWS services via AWS PrivateLink endpoints. The examples in this post will connect an Internet-disabled notebook instance to AWS CodeArtifact and download open-source code packages without needing to traverse the public internet.

Solution overview

The following diagram describes the solution we will implement. We create a SageMaker notebook instance in a private subnet of a VPC. We also create an AWS CodeArtifact domain and a repository. Access to the repository will be controlled by CodeArtifact repository policies and PrivateLink access policies.

The architecture allows our Internet-disabled SageMaker notebook instance to access CodeArtifact repositories without traversing the public internet. Because the network traffic doesn’t traverse the public internet, we improve the security posture of the notebook instance by ensuring that only users with the expected network access can access the notebook instance. Furthermore, this paradigm allows security administrators to restrict library consumption to only “approved” distributions of code packages. By combining network security with secure package management, security engineers can transparently manage open-source libraries for data scientists without impeding their ability to work.

Prerequisites

For this post, we need an Internet-disabled SageMaker notebook instance and a VPC with a private subnet. Visit this link to create a private subnet in a VPC, and see the previous post in this series to get started with these prerequisites.

We also need a CodeArtifact domain in the AWS Region where you created your Internet-disabled SageMaker notebook instance. The domain is a way to organize repositories logically in the CodeArtifact service. Name the domain, select AWS managed key for encryption, and create the domain. This link discusses how to create an AWS CodeArtifact Domain.

Configure AWS CodeArtifact

To configure AWS CodeArtifact, we create a repository in the domain. Before proceeding, be sure to select the same region where the notebook instance has been deployed. Perform the following steps to configure AWS CodeArtifact:

  1. On the AWS CodeArtifact console, choose Create repository.
  2. Give the repository a name and description. Select the public upstream repository you want to use. We use the pypi-store in this post.
  3. Choose Next.

  4. Choose the AWS account you are working in and the AWS CodeArtifact domain you use for this account.
  5. Choose Next.

  6. Review the repository information summary and choose Create repository.
    1. Note: In the review screen section marked “Package flow” there is a flowchart describing the flow of dependencies from external connections into the domain managed by AWS CodeArtifact.
    2. This flowchart describes what happens when we create our repository. We are actually creating two repositories. The first is the “pypi-store” repository that connects to the externally hosted pypi repository. This repository was created by AWS CodeArtifact and is used to stage connections to the upstream repository. The second repository, “isolatedsm”, connects to the “pypi-store” repository. This transitive repository lets us combine external connections and stage third-party libraries before using them.
    3. This form of transitive repository management allows us to enforce least privileged access on the libraries we use for data science workloads.

Alternatively, we can perform these steps with the AWS CLI using the following command:

aws codeartifact create-repository --domain <domain_name> --domain-owner <account_number> --repository <repo_name> --description <repo_description> --region <region_name>

The result at the end of this process is two repositories in our domain.

CodeArtifact Repository Policy

For brevity, we will grant a relatively open policy for our IsolatedSM repository allowing any of our developers to access the CodeArtifact repository. This policy should be modified for production use cases. Later in the post, we will discuss how to implement least-privilege access at the role level using an IAM policy attached to the notebook instance role. For now, navigate to the repository in the AWS Management Console and expand the Details section of the repository configuration page. Choose Apply a repository policy under Repository policy.

On the next screen, paste the following policy document in the text field marked Edit repository policy and then choose Save:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "codeartifact:AssociateExternalConnection",
                "codeartifact:CopyPackageVersions",
                "codeartifact:DeletePackageVersions",
                "codeartifact:DeleteRepository",
                "codeartifact:DeleteRepositoryPermissionsPolicy",
                "codeartifact:DescribePackageVersion",
                "codeartifact:DescribeRepository",
                "codeartifact:DisassociateExternalConnection",
                "codeartifact:DisposePackageVersions",
                "codeartifact:GetPackageVersionReadme",
                "codeartifact:GetRepositoryEndpoint",
                "codeartifact:ListPackageVersionAssets",
                "codeartifact:ListPackageVersionDependencies",
                "codeartifact:ListPackageVersions",
                "codeartifact:ListPackages",
                "codeartifact:PublishPackageVersion",
                "codeartifact:PutPackageMetadata",
                "codeartifact:PutRepositoryPermissionsPolicy",
                "codeartifact:ReadFromRepository",
                "codeartifact:UpdatePackageVersionsStatus",
                "codeartifact:UpdateRepository"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Principal": {
                "AWS": "arn:aws:iam::329542461890:role/GeneralIsolatedNotebook"
            }
        }
    ]
}

This policy provides full access to the repository for the role attached to the isolated notebook instance. This is a sample policy that enables developer access to CodeArtifact. For more information on policy definitions for CodeArtifact repositories (especially if you need more restrictive role-based access), see the CodeArtifact User Guide. To configure the same repository policy using the AWS CLI, save the preceding policy document as policy.json and run the following command:

aws codeartifact put-repository-permissions-policy --domain <domain_name> --domain-owner <account_number> --repository <repo_name> --policy-document file:///PATH/TO/policy.json --region <region_name>

Configure access to AWS CodeArtifact

Connecting to a CodeArtifact repository requires logging into the repository as a user. By navigating into a repository and choosing View connection instructions, we can select the appropriate connection instructions for our package manager of choice. We will use pip in this post; from the drop-down, select pip and copy the connection instructions.

The AWS CLI command to use should be similar to the following:

aws codeartifact login --tool pip --repository <repo_name> --domain <domain_name> --domain-owner <account_number> --region <region_name>

This command calls the CodeArtifact API to retrieve an authorization token for the role that requested access to this repository. The command can be run in a Jupyter notebook to authenticate access to the CodeArtifact repository, and it configures the package manager to install libraries from that upstream repository. We can test this in our Internet-disabled SageMaker notebook instance.

When running the login command in our isolated notebook instance, nothing happens for some time. After a while (approximately 300 seconds), Jupyter outputs a connection timeout error. This is because our notebook instance lives in an isolated network subnet. This is expected behavior; it confirms that our notebook instance is Internet-disabled. We need to provision network access between this subnet and our CodeArtifact repository.

Create a PrivateLink connection between the notebook and CodeArtifact

AWS PrivateLink is a networking service that creates VPC endpoints in your VPC for other AWS services like Amazon Elastic Compute Cloud (Amazon EC2), Amazon S3, and Amazon Simple Notification Service (Amazon SNS). Private endpoints facilitate API requests to other AWS services through your VPC instead of through the public internet. This is the crucial component that lets our solution privately and securely access the CodeArtifact repository we’ve created.

Before we create our PrivateLink endpoints, we must create a security group to associate with the endpoints. Before proceeding, make sure you are in the same region as the Internet-disabled SageMaker notebook instance.

  1. On the Amazon VPC console, choose Security Groups.
  2. Choose Create security group.
  3. Give the group an appropriate name and description.
  4. Select the VPC that you will deploy the PrivateLink endpoints to. This should be the same VPC that hosts the isolated SageMaker notebook.
  5. Under Inbound Rules, choose Add Rule and then permit All Traffic from the security group hosting the isolated SageMaker notebook.
  6. Outbound Rules should remain the default. Create the security group.

You can replicate these steps in the CLI with the following:

> aws ec2 create-security-group --group-name endpoint-sec-group --description "group for endpoint security" --vpc-id vpc_id --region region
> {
    "GroupId": endpoint-sec-group-id
}
> aws ec2 authorize-security-group-ingress --group-id endpoint-sec-group-id --protocol all --port -1 --source-group isolated-SM-sec-group-id --region region

For this next step, we recommend that customers use the AWS CLI for simplicity. First, save the following policy document as policy.json in your local file system.

{
  "Statement": [
    {
      "Action": "codeartifact:*",
      "Effect": "Allow",
      "Resource": "*",
      "Principal": "*"
    },
    {
      "Effect": "Allow",
      "Action": "sts:GetServiceBearerToken",
      "Resource": "*",
      "Principal": "*"
    },
    {
      "Action": [
        "codeartifact:CreateDomain",
        "codeartifact:CreateRepository",
        "codeartifact:DeleteDomain",
        "codeartifact:DeleteDomainPermissionsPolicy",
        "codeartifact:DeletePackageVersions",
        "codeartifact:DeleteRepository",
        "codeartifact:DeleteRepositoryPermissionsPolicy",
        "codeartifact:PutDomainPermissionsPolicy",
        "codeartifact:PutRepositoryPermissionsPolicy",
        "codeartifact:UpdateRepository"
      ],
      "Effect": "Deny",
      "Resource": "*",
      "Principal": "*"
    }
  ]
}

Then, run the following commands using the AWS CLI to create the PrivateLink endpoints for CodeArtifact. This first command creates the VPC Endpoint for CodeArtifact repository API commands.

> aws ec2 create-vpc-endpoint \
--vpc-endpoint-type Interface \
--vpc-id vpc-id \
--service-name com.amazonaws.region.codeartifact.repositories \
--subnet-ids [list-of-subnet-ids] \
--security-group-ids endpoint-sec-group-id \
--private-dns-enabled \
--policy-document file:///PATH/TO/policy.json \
--region region

This second command creates the VPC endpoint for non-repository CodeArtifact API calls. Note that in this command, we do not enable private DNS for the endpoint. Record the endpoint ID from the output, because we use it to enable private DNS in a subsequent CLI command.

> aws ec2 create-vpc-endpoint \
--vpc-endpoint-type Interface \
--vpc-id vpc-id \
--service-name com.amazonaws.region.codeartifact.api \
--subnet-ids [list-of-subnet-ids] \
--security-group-ids endpoint-sec-group-id \
--no-private-dns-enabled \
--policy-document file:///PATH/TO/policy.json \
--region region
{
    "VpcEndpoint": {
        "VpcEndpointId": "vpc-endpoint-id",
        ...
    }
}

Once this VPC endpoint has been created, enable private DNS for the endpoint by running the following final command:

> aws ec2 modify-vpc-endpoint \
--vpc-endpoint-id vpc-endpoint-id \
--private-dns-enabled \
--region region

The policy document attached to these endpoints permits common CodeArtifact operations performed by developers over this PrivateLink endpoint. This is acceptable for our use case because CodeArtifact lets us define access policies on the repositories themselves. We only block CodeArtifact administrative commands from being sent over this endpoint, because we do not want developers to perform administrative actions on the repository.

The following screenshot shows the API endpoint and the security group it belongs to.

The inbound rules on this security group should list one inbound rule, allowing All traffic from the isolated SageMaker notebook’s security group.

Network test

Once the security groups have been configured and the PrivateLink endpoints have been created, open a Jupyter notebook on the isolated SageMaker notebook instance. In a cell, run the connection instructions for the CodeArtifact repository we created earlier. Instead of a long pause with an eventual timeout error, we now get an AccessDenied exception.

Recall that CodeArtifact connection instructions can be found in the AWS Management Console by navigating to the CodeArtifact repository and selecting View connection instructions. For this post, select the connection instructions for pip.

At this point, our isolated SageMaker notebook instance can connect to the CodeArtifact service via PrivateLink. We now need to give our notebook instance’s role the relevant permissions required to interact with the service from a Jupyter notebook.

Modify notebook permissions

Once our CodeArtifact repository is configured, we need to modify the permissions on our isolated notebook instance role to allow our notebook to read from the artifact repository. In the AWS Management Console, navigate to the IAM service and, under Policies, choose Create Policy. Choose the JSON tab and paste the following JSON document in the text window:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "codeartifact:GetAuthorizationToken",
            "Resource": "arn:aws:codeartifact:<region>:<account_no>:domain/<domain_name>"
        },
        {
            "Effect": "Allow",
            "Action": "sts:GetServiceBearerToken",
            "Resource": "*"
        }
    ]
}

To create this policy document in the CLI, save this JSON as policy.json and run the following command:

aws iam create-policy --policy-name <policy_name> --policy-document file:///PATH/TO/policy.json

When attached to a role, this policy document permits the retrieval of an authorization token from the CodeArtifact service. Attach this policy to our notebook instance role by navigating to the IAM service in the AWS Management Console. Choose the notebook instance role you are using and attach this policy directly to the instance role. This can be done with the AWS CLI by running the following command:

aws iam attach-role-policy --role-name <role_name> --policy-arn <policy_arn>

Equipped with a role that allows authentication to CodeArtifact, we can now continue testing.

Permissions Test

In the AWS Management Console, navigate to the SageMaker service and open a Jupyter notebook from the Internet-disabled notebook instance. In the notebook cell, attempt to log into the CodeArtifact repository using the same command from the network test (found in the Network Test section).

Instead of an access denied exception, the output should show a successful authentication to the repository with an expiration on the token. Continue testing by using pip to download, install, and uninstall packages. These commands are authorized based on the policy attached to the CodeArtifact repository. If you want to restrict access to the repository based on the user, for example, restricting the ability to uninstall a package, modify the CodeArtifact repository policy.
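
As a concrete example, the following notebook cells (run on the isolated instance) log in and then exercise the repository with pip; the domain, account, Region, and package names are placeholders for your own values.

# Authenticate pip against the CodeArtifact repository (placeholders for your values)
!aws codeartifact login --tool pip --repository <repo_name> --domain <domain_name> --domain-owner <account_number> --region <region_name>

# pip now resolves packages through CodeArtifact instead of the public internet
!pip install requests
!pip uninstall -y requests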

We can confirm that the packages are installed by navigating to the repository in the AWS Management Console and searching for the installed package.

Clean up

When you destroy the VPC endpoints, the notebook instance loses access to the CodeArtifact repository. This reintroduces the timeout error from earlier in this post, which is expected behavior. Additionally, you may also delete the CodeArtifact repository; CodeArtifact charges based on the number of GBs of data stored per month, so deleting unused repositories avoids ongoing storage costs.

Conclusion

By combining VPC endpoints with SageMaker notebooks, we can extend the availability of other AWS services to Internet-disabled private notebook instances. This allows us to improve the security posture of our development environment, without sacrificing developer productivity.


About the Author

Dan Ferguson is a Solutions Architect at Amazon Web Services, focusing primarily on Private Equity & Growth Equity investments into late-stage startups.

Read More

Specify and extract information from documents using the new Queries feature in Amazon Textract

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. Amazon Textract now offers the flexibility to specify the data you need to extract from documents using the new Queries feature within the Analyze Document API. You don’t need to know the structure of the data in the document (table, form, implied field, nested data) or worry about variations across document versions and formats.

In this post, we discuss the following topics:

  • Success stories from AWS customers and benefits of the new Queries feature
  • How the Analyze Document Queries API helps extract information from documents
  • A walkthrough of the Amazon Textract console
  • Code examples to utilize the Analyze Document Queries API
  • How to process the response with the Amazon Textract parser library

Benefits of the new Queries feature

Traditional OCR solutions struggle to extract data accurately from most semi-structured and unstructured documents because of significant variations in how the data is laid out across multiple versions and formats of these documents. You need to implement custom postprocessing code or manually review the extracted information from these documents. With the Queries feature, you can specify the information you need in the form of natural language questions (for example, “What is the customer name”) and receive the exact information (“John Doe”) as part of the API response. The feature uses a combination of visual, spatial, and language models to extract the information you seek with high accuracy. The Queries feature is pre-trained on a large variety of semi-structured and unstructured documents. Some examples include paystubs, bank statements, W-2s, loan application forms, mortgage notes, and vaccine and insurance cards.

"Amazon Textract enables us to automate the document processing needs of our customers. With the Queries feature, we will be able to extract data from a variety of documents with even greater flexibility and accuracy," said Robert Jansen, Chief Executive Officer at TekStream Solutions. "We see this as a big productivity win for our business customers, who will be able to use the Queries capability as part of our IDP solution to quickly get key information out of their documents."

"Amazon Textract enables us to extract text as well as structured elements like Forms and Tables from images with high accuracy. Amazon Textract Queries has helped us drastically improve the quality of information extraction from several business-critical documents such as safety data sheets or material specifications," said Thorsten Warnecke, Principal | Head of PC Analytics, Camelot Management Consultants. "The natural language query system offers great flexibility and accuracy which has reduced our post-processing load and enabled us to add new documents to our data extraction tools quicker."

How the Analyze Document Queries API helps extract information from documents

Companies have increased their adoption of digital platforms, especially in light of the COVID-19 pandemic. Most organizations now offer a digital way to acquire their services and products utilizing smartphones and other mobile devices, which offers flexibility to users but also adds to the scale at which digital documents need to be reviewed, processed, and analyzed. In some workloads where, for example, mortgage documents, vaccination cards, paystubs, insurance cards, and other documents must be digitally analyzed, the complexity of data extraction can become exponentially aggravated because these documents lack a standard format or have significant variations in data format across different versions of the document.

Even powerful OCR solutions struggle to extract data accurately from these documents, and you may have to implement custom postprocessing for these documents. This includes mapping possible variations of form keys to customer-native field names or including custom machine learning to identify specific information in an unstructured document.

The new Analyze Document Queries API in Amazon Textract can take natural language written questions such as “What is the interest rate?” and perform powerful AI and ML analysis on the document to figure out the desired information and extract it from the document without any postprocessing. The Queries feature doesn’t require any custom model training or setting up of templates or configurations. You can quickly get started by uploading your documents and specifying questions on those documents via the Amazon Textract console, the AWS Command Line Interface (AWS CLI), or AWS SDK.

In subsequent sections of this post, we go through detailed examples of how to use this new functionality on common workload use cases and how to use the Analyze Document Queries API to add agility to the process of digitalizing your workload.

Use the Queries feature on the Amazon Textract console

Before we get started with the API and code samples, let’s review the Amazon Textract console. The following image shows an example of a vaccination card on the Queries tab for the Analyze Document API on the Amazon Textract console. After you upload the document to the Amazon Textract console, choose Queries in the Configure Document section. You can then add queries in the form of natural language questions. After you add all your queries, choose Apply Configuration. The answers to the questions are located on the Queries tab.

Code examples

In this section, we explain how to invoke the Analyze Document API with the Queries parameter to get answers to natural language questions about the document. The input document is either in a byte array format or located in an Amazon Simple Storage Service (Amazon S3) bucket. You pass image bytes to an Amazon Textract API operation by using the Bytes property. For example, you can use the Bytes property to pass a document loaded from a local file system. Image bytes passed by using the Bytes property must be base64 encoded. Your code might not need to encode document file bytes if you’re using an AWS SDK to call Amazon Textract API operations. Alternatively, you can pass images stored in an S3 bucket to an Amazon Textract API operation by using the S3Object property. Documents stored in an S3 bucket don’t need to be base64 encoded.
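
For example, the following snippet sketches the same Analyze Document call with a document referenced in Amazon S3 rather than passed as bytes; the bucket and object names are placeholders.

import boto3

textract = boto3.client("textract")

# Reference a document stored in S3 instead of passing base64-encoded bytes
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "example-docs-bucket", "Name": "paystubs/paystub.jpg"}},
    FeatureTypes=["QUERIES"],
    QueriesConfig={"Queries": [{"Text": "What is the year to date gross pay", "Alias": "PAYSTUB_YTD_GROSS"}]},
)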

You can use the Queries feature to get answers from different types of documents like paystubs, vaccination cards, mortgage documents, bank statements, W-2 forms, 1099 forms, and others. In the following sections, we go over some of these documents and show how the Queries feature works.

Paystub

In this example, we walk through the steps to analyze a paystub using the Queries feature, as shown in the following example image.

We use the following sample Python code:

import boto3
import json

#create a Textract Client
textract = boto3.client('textract')

image_filename = "paystub.jpg"

response = None
with open(image_filename, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call Textract AnalyzeDocument by passing a document from local disk
response = textract.analyze_document(
    Document={'Bytes': imageBytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [{
            "Text": "What is the year to date gross pay",
            "Alias": "PAYSTUB_YTD_GROSS"
        },
        {
            "Text": "What is the current gross pay?",
            "Alias": "PAYSTUB_CURRENT_GROSS"
        }]
    })

The following code is a sample AWS CLI command:

aws textract analyze-document --document '{"S3Object":{"Bucket":"your-s3-bucket","Name":"paystub.jpg"}}' --feature-types '["QUERIES"]' --queries-config '{"Queries":[{"Text":"What is the year to date gross pay", "Alias": "PAYSTUB_YTD_GROSS"}]}'

Let’s analyze the response we get for the two queries we passed to the Analyze Document API in the preceding example. The following response has been trimmed to only show the relevant parts:

{
         "BlockType":"QUERY",
         "Id":"cbbba2fa-45be-452b-895b-adda98053153", #id of first QUERY
         "Relationships":[
            {
               "Type":"ANSWER",
               "Ids":[
                  "f2db310c-eaa6-481d-8d18-db0785c33d38" #id of first QUERY_RESULT
               ]
            }
         ],
         "Query":{
            "Text":"What is the year to date gross pay", #First Query
            "Alias":"PAYSTUB_YTD_GROSS"
         }
      },
      {
         "BlockType":"QUERY_RESULT",
         "Confidence":87.0,
         "Text":"23,526.80", #Answer to the first Query
         "Geometry":{...},
         "Id":"f2db310c-eaa6-481d-8d18-db0785c33d38" #id of first QUERY_RESULT
      },
      {
         "BlockType":"QUERY",
         "Id":"4e2a17f0-154f-4847-954c-7c2bf2670c52", #id of second QUERY
         "Relationships":[
            {
               "Type":"ANSWER",
               "Ids":[
                  "350ab92c-4128-4aab-a78a-f1c6f6718959"#id of second QUERY_RESULT
               ]
            }
         ],
         "Query":{
            "Text":"What is the current gross pay?", #Second Query
            "Alias":"PAYSTUB_CURRENT_GROSS"
         }
      },
      {
         "BlockType":"QUERY_RESULT",
         "Confidence":95.0,
         "Text":"$ 452.43", #Answer to the Second Query
         "Geometry":{...},
         "Id":"350ab92c-4128-4aab-a78a-f1c6f6718959" #id of second QUERY_RESULT
      }

The response has a BlockType of QUERY that shows the question that was asked and a Relationships section that has the ID for the block that has the answer. The answer is in the BlockType of QUERY_RESULT. The alias that is passed in as an input to the Analyze Document API is returned as part of the response and can be used to label the answer.
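
If you want to work with the raw response directly instead of the parser library shown next, a minimal sketch of pairing each QUERY block with its QUERY_RESULT block (based on the block structure above) looks like this:

# Index all blocks by ID, then follow each QUERY's ANSWER relationship
blocks_by_id = {block["Id"]: block for block in response["Blocks"]}

for block in response["Blocks"]:
    if block["BlockType"] == "QUERY":
        answer_ids = [
            rel_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "ANSWER"
            for rel_id in rel["Ids"]
        ]
        answers = [blocks_by_id[i]["Text"] for i in answer_ids]
        print(block["Query"].get("Alias"), "->", answers)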

We use the Amazon Textract Response Parser to extract just the questions, the alias, and the corresponding answers to those questions:

# Requires the amazon-textract-response-parser and tabulate packages
import trp.trp2 as t2

d = t2.TDocumentSchema().load(response)
page = d.pages[0]

# get_query_answers returns a list of [query, alias, answer]
query_answers = d.get_query_answers(page=page)
for x in query_answers:
    print(f"{image_filename},{x[1]},{x[2]}")

from tabulate import tabulate
print(tabulate(query_answers, tablefmt="github"))

The preceding code returns the following results:

|------------------------------------|-----------------------|-----------|
| What is the current gross pay?     | PAYSTUB_CURRENT_GROSS | $ 452.43  |
| What is the year to date gross pay | PAYSTUB_YTD_GROSS     | 23,526.80 |

More questions and the full code can be found in the notebook on the GitHub repo.

Mortgage note

The Analyze Document Queries API also works well with mortgage notes like the following.

The process to call the API and process results is the same as the previous example. You can find the full code example on the GitHub repo.

The following code shows the example responses obtained using the API:

|------------------------------------------------------------|----------------------------------|---------------|
| When is this document dated?                               | MORTGAGE_NOTE_DOCUMENT_DATE      | March 4, 2022 |
| What is the note date?                                     | MORTGAGE_NOTE_DATE               | March 4, 2022 |
| When is the Maturity date the borrower has to pay in full? | MORTGAGE_NOTE_MATURITY_DATE      | April, 2032   |
| What is the note city and state?                           | MORTGAGE_NOTE_CITY_STATE         | Anytown, ZZ   |
| what is the yearly interest rate?                          | MORTGAGE_NOTE_YEARLY_INTEREST    | 4.150%        |
| Who is the lender?                                         | MORTGAGE_NOTE_LENDER             | AnyCompany    |
| When does payments begin?                                  | MORTGAGE_NOTE_BEGIN_PAYMENTS     | April, 2022   |
| What is the beginning date of payment?                     | MORTGAGE_NOTE_BEGIN_DATE_PAYMENT | April, 2022   |
| What is the initial monthly payments?                      | MORTGAGE_NOTE_MONTHLY_PAYMENTS   | $ 2500        |
| What is the interest rate?                                 | MORTGAGE_NOTE_INTEREST_RATE      | 4.150%        |
| What is the principal amount borrower has to pay?          | MORTGAGE_NOTE_PRINCIPAL_PAYMENT  | $ 500,000     |

Vaccination card

The Amazon Textract Queries feature also works very well to extract information from vaccination cards or cards that resemble it, like in the following example.

The process to call the API and parse the results is the same as in the paystub example. After we process the response, we get the following information:

| Query                                                      | Alias                                | Answer       |
|------------------------------------------------------------|--------------------------------------|--------------|
| What is the patients first name                            | PATIENT_FIRST_NAME                   | Major        |
| What is the patients last name                             | PATIENT_LAST_NAME                    | Mary         |
| Which clinic site was the 1st dose COVID-19 administrated? | VACCINATION_FIRST_DOSE_CLINIC_SITE   | XYZ          |
| Who is the manufacturer for 1st dose of COVID-19?          | VACCINATION_FIRST_DOSE_MANUFACTURER  | Pfizer       |
| What is the date for the 2nd dose covid-19?                | VACCINATION_SECOND_DOSE_DATE         | 2/8/2021     |
| What is the patient number                                 | PATIENT_NUMBER                       | 012345abcd67 |
| Who is the manufacturer for 2nd dose of COVID-19?          | VACCINATION_SECOND_DOSE_MANUFACTURER | Pfizer       |
| Which clinic site was the 2nd dose covid-19 administrated? | VACCINATION_SECOND_DOSE_CLINIC_SITE  | CVS          |
| What is the lot number for 2nd dose covid-19?              | VACCINATION_SECOND_DOSE_LOT_NUMBER   | BB5678       |
| What is the date for the 1st dose covid-19?                | VACCINATION_FIRST_DOSE_DATE          | 1/18/21      |
| What is the lot number for 1st dose covid-19?              | VACCINATION_FIRST_DOSE_LOT_NUMBER    | AA1234       |
| What is the MI?                                            | MIDDLE_INITIAL                       | M            |

The full code can be found in the notebook on the GitHub repo.

Insurance card

The Queries feature also works well with insurance cards like the following.

The process to call the API and process the results is the same as shown earlier. The full code example is available in the notebook on the GitHub repo.

The following are the example responses obtained using the API:

| Query                               | Alias                             | Answer        |
|-------------------------------------|-----------------------------------|---------------|
| What is the insured name?           | INSURANCE_CARD_NAME               | Jacob Michael |
| What is the level of benefits?      | INSURANCE_CARD_LEVEL_BENEFITS     | SILVER        |
| What is medical insurance provider? | INSURANCE_CARD_PROVIDER           | Anthem        |
| What is the OOP max?                | INSURANCE_CARD_OOP_MAX            | $6000/$12000  |
| What is the effective date?         | INSURANCE_CARD_EFFECTIVE_DATE     | 11/02/2021    |
| What is the office visit copay?     | INSURANCE_CARD_OFFICE_VISIT_COPAY | $55/0%        |
| What is the specialist visit copay? | INSURANCE_CARD_SPEC_VISIT_COPAY   | $65/0%        |
| What is the member id?              | INSURANCE_CARD_MEMBER_ID          | XZ 9147589652 |
| What is the plan type?              | INSURANCE_CARD_PLAN_TYPE          | Pathway X-EPO |
| What is the coinsurance amount?     | INSURANCE_CARD_COINSURANCE        | 30%           |

Best practices for crafting queries

When crafting your queries, consider the following best practices:

  • In general, ask a natural language question that starts with “What is,” “Where is,” or “Who is.” The exception is when you’re trying to extract standard key-value pairs, in which case you can pass the key name as a query.
  • Avoid ill-formed or grammatically incorrect questions, because these could result in unexpected answers. For example, an ill-formed query is “When?” whereas a well-formed query is “When was the first vaccine dose administered?”
  • Where possible, use words from the document to construct the query. Although the Queries feature tries to do acronym and synonym matching for some common industry terms such as “SSN,” “tax ID,” and “Social Security number,” using language directly from the document improves results. For example, if the document says “job progress,” try to avoid using variations like “project progress,” “program progress,” or “job status.”
  • Construct a query that contains words from both the row header and column header. For example, in the preceding vaccination card example, in order to know the date of the second vaccination, you can frame the query as “What date was the 2nd dose administered?”
  • Long answers increase response latency and can lead to timeouts. Try to ask questions whose answers are fewer than 100 words.
  • Passing only the key name as the question works when trying to extract standard key-value pairs from a form. We recommend framing full questions for all other extraction use cases.
  • Be as specific as possible (see the sketch after this list). For example:
    • When the document contains multiple sections (such as “Borrower” and “Co-Borrower”) and both sections have a field called “SSN,” ask “What is the SSN for Borrower?” and “What is the SSN for Co-Borrower?”
    • When the document has multiple date-related fields, be specific in the query language and ask “What is the date the document was signed on?” or “What is the date of birth of the applicant?” Avoid asking ambiguous questions like “What is the date?”
  • If you know the layout of the document beforehand, give location hints to improve accuracy of results. For example, ask “What is the date at the top?” or “What is the date on the left?” or “What is the date at the bottom?”
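
To make these guidelines concrete, the following sketch shows how section-specific questions and distinct aliases keep otherwise ambiguous fields apart. The question text and alias names here are illustrative, not taken from the notebook:

# Section-specific questions disambiguate fields that appear more than once in the document
queries_config = {
    "Queries": [
        {"Text": "What is the SSN for Borrower?", "Alias": "BORROWER_SSN"},
        {"Text": "What is the SSN for Co-Borrower?", "Alias": "CO_BORROWER_SSN"},
        # Specific date questions instead of an ambiguous "What is the date?"
        {"Text": "What is the date the document was signed on?", "Alias": "DOCUMENT_SIGNED_DATE"},
        {"Text": "What is the date of birth of the applicant?", "Alias": "APPLICANT_DATE_OF_BIRTH"},
    ]
}

This QueriesConfig can then be passed to the Analyze Document API in the same way as the earlier examples.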

For more information about the Queries feature, refer to the Amazon Textract documentation.

Conclusion

In this post, we provided an overview of the new Queries feature of Amazon Textract, which lets you quickly and easily retrieve information from documents such as paystubs, mortgage notes, insurance cards, and vaccination cards by asking natural language questions. We also described how to parse the response JSON.

For more information, see Analyzing Documents, or check out the Amazon Textract console and try out this feature.


About the Authors

Uday Narayanan is a Sr. Solutions Architect at AWS. He enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are data analytics, big data systems, and machine learning. In his spare time, he enjoys playing sports, binge-watching TV shows, and traveling.

Rafael Caixeta is a Sr. Solutions Architect at AWS based in California. He has over 10 years of experience developing architectures for the cloud. His core areas are serverless, containers, and machine learning. In his spare time, he enjoys reading fiction books and traveling the world.

Navneeth Nair is a Senior Product Manager, Technical with the Amazon Textract team. He is focused on building machine learning-based services for AWS customers.

Martin Schade is a Senior ML Product SA with the Amazon Textract team. He has over 20 years of experience with internet-related technologies, engineering, and architecting solutions. He joined AWS in 2014, first guiding some of the largest AWS customers on the most efficient and scalable use of AWS services, and later focusing on AI/ML, particularly computer vision. Currently, he’s obsessed with extracting information from documents.
