How we’re using AI to help address the climate crisis

Communities around the world are facing the effects of climate change — from devastating floods and wildfires to challenges around food security. As global leaders meet in Egypt for COP27, a key area of focus will be on how we can work together to adapt to climate change and implement sustainable solutions. At Google, we’re investing in technologies that can help communities prepare for and respond to climate-related disasters and threats.

Tools to alert people and governments about immediate risks

Natural disasters are increasing in frequency and intensity due to climate change. As part of our Crisis Response efforts, we’re working to bring trusted information to people in critical moments to keep them safe and informed. To do so, we rely on the research and development of our AI-powered technologies and longstanding partnerships with frontline emergency workers and organizations. Here’s a look at some of our crisis response efforts and new ways we’re expanding these tools.

  • Floods: Catastrophic damage from flooding affects more than 250 million people every year. In 2018, we launched our flood forecasting initiative that uses machine learning models to provide people with detailed alerts. In 2021, we sent 115 million flood alert notifications to 23 million people over Search and Maps, helping save countless lives. Today, we’re expanding our flood forecasts to river basins in 18 additional countries across Africa, Latin America and Southeast Asia. We’re also announcing the global launch of the new FloodHub, a platform that displays flood forecasts and shows when and where floods may occur to help people directly at risk and provide critical information to aid organizations and governments. This expansion in geographic coverage is possible thanks to our recent breakthroughs in AI-based flood forecasting models, and we’re committed to expanding to more countries.

The new Google FloodHub at g.co/floodhub shows forecasts for riverine floods. Forecasts are now available in 18 additional countries: Brazil, Colombia, Sri Lanka, Burkina Faso, Cameroon, Chad, Democratic Republic of Congo, Ivory Coast, Ghana, Guinea, Malawi, Nigeria, Sierra Leone, Angola, South Sudan, Namibia, Liberia, South Africa.

  • Wildfires: Wildfires affect hundreds of thousands of people each year, and are increasing in frequency and size. I experienced firsthand the need for accurate information when wildfires occur, and this inspired our crisis response work. We detect wildfire boundaries using new AI models based on satellite imagery and show their real-time location in Search and Maps. Since July, we’ve covered more than 30 big wildfire events in the U.S. and Canada, helping inform people and firefighting teams with over 7 million views in Search and Maps. Wildfire detection is now available in the U.S., Canada, Mexico and Australia.

The location of the Pukatawagan fire in Manitoba, Canada.

  • Hurricanes: Access to authoritative forecasts and safety information about hurricanes can be life-saving. In the days before a hurricane in North America or a typhoon in Japan, detailed forecasts from authoritative sources appear on SOS Alerts in Search and Maps to show a storm’s predicted trajectory. We’re also using machine learning to analyze satellite imagery after disasters and identify which areas need help. When Hurricane Ian hit Florida in September, this technology was deployed in partnership with Google.org grantee GiveDirectly to quickly allocate aid to those most affected.

Managing current and future climate impacts

Climate change poses a threat to our world’s natural resources and food security. We’re working with governments, organizations and communities to provide information and technologies to help adapt to these changes.

  • Keeping cities greener and healthier: Extreme temperatures and poor air quality are increasingly common in cities and can impact public health. To mitigate this, our Project Green Light uses AI to optimize traffic lights at intersections around the world with the aim of helping minimize congestion and related pollution. Project Air View also brings detailed air quality maps to scientists, policymakers and communities. And we’re working to expand our Environmental Insights Explorer’s Tree Canopy Insights tool to hundreds of cities by the end of this year so they can use trees to lower street-level temperatures and improve quality of life.
  • Meeting the world’s growing demand for food: Mineral — a project from X, Alphabet’s moonshot factory — is working to build a more sustainable and productive food system. The team is joining diverse data sets in radically new ways — from soil and weather data to drone and satellite images — and using AI to reveal insights never before possible about what’s happening with crops. As part of our Startups For Sustainable Development program, we’re also supporting startups addressing food security. These include startups like OKO, which provides crop insurance to keep farmers in business in case of adverse weather events and has reached tens of thousands of farmers in Mali and Uganda.
  • Helping farmers protect their crops: Pest infestations can threaten entire crops and impact the livelihoods of millions. In collaboration with InstaDeep and the Food and Agriculture Organization of the United Nations, our team at the Google AI Center in Ghana is using AI to better detect locust outbreaks so that it’s possible to implement control measures. In India, Google.org Fellows worked with Wadhwani AI to create an AI-powered app that helps identify and treat infestations of pests, resulting in a 20% reduction in pesticide sprays and a 26% increase in profit margins for farmers. Google Cloud is also working with agricultural technology companies to use machine learning and cloud services to improve crop yields.
  • Analyzing a changing planet: Using Google Cloud and Google Earth Engine, organizations and businesses can better assess and manage climate risks. For example, the U.S. Forest Service uses these tools to analyze land-cover changes to better respond to new wildfire threats and monitor the impacts of invasive insects, diseases and droughts. Similarly, the Bank of Montreal is integrating climate data — like precipitation trends — into its business strategy and risk management for clients.

AI already plays a critical role in addressing many urgent, climate-related challenges. It is important that we continue to invest in research and raise awareness about why we are doing this work. Google Arts and Culture has collaborated with artists on the Culture meets Climate collection so everyone can explore more perspectives on climate change. And at COP27 we hope to generate more awareness and engage in productive discussions about how to use AI, innovations, and shared data to help global communities address the changing climate.

Read More

Study urges caution when comparing neural networks to the brain

Neural networks, a type of computing system loosely modeled on the organization of the human brain, form the basis of many artificial intelligence systems for applications such as speech recognition, computer vision, and medical image analysis.

In the field of neuroscience, researchers often use neural networks to try to model the same kind of tasks that the brain performs, in hopes that the models could suggest new hypotheses regarding how the brain itself performs those tasks. However, a group of researchers at MIT is urging that more caution should be taken when interpreting these models.

In an analysis of more than 11,000 neural networks that were trained to simulate the function of grid cells — key components of the brain’s navigation system — the researchers found that neural networks only produced grid-cell-like activity when they were given very specific constraints that are not found in biological systems.

“What this suggests is that in order to obtain a result with grid cells, the researchers training the models needed to bake in those results with specific, biologically implausible implementation choices,” says Rylan Schaeffer, a former senior research associate at MIT.

Without those constraints, the MIT team found that very few neural networks generated grid-cell-like activity, suggesting that these models do not necessarily generate useful predictions of how the brain works.

Schaeffer, who is now a graduate student in computer science at Stanford University, is the lead author of the new study, which will be presented at the 2022 Conference on Neural Information Processing Systems this month. Ila Fiete, a professor of brain and cognitive sciences and a member of MIT’s McGovern Institute for Brain Research, is the senior author of the paper. Mikail Khona, an MIT graduate student in physics, is also an author.

Modeling grid cells

Neural networks, which researchers have been using for decades to perform a variety of computational tasks, consist of thousands or millions of processing units connected to each other. Each node has connections of varying strengths to other nodes in the network. As the network analyzes huge amounts of data, the strengths of those connections change as the network learns to perform the desired task.

In this study, the researchers focused on neural networks that have been developed to mimic the function of the brain’s grid cells, which are found in the entorhinal cortex of the mammalian brain. Together with place cells, found in the hippocampus, grid cells form a brain circuit that helps animals know where they are and how to navigate to a different location.

Place cells have been shown to fire whenever an animal is in a specific location, and each place cell may respond to more than one location. Grid cells, on the other hand, work very differently. As an animal moves through a space such as a room, grid cells fire only when the animal is at one of the vertices of a triangular lattice. Different groups of grid cells create lattices of slightly different dimensions, which overlap each other. This allows grid cells to encode a large number of unique positions using a relatively small number of cells.
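
This combinatorial efficiency is easy to see in a toy one-dimensional analogy (not from the study; the periods and positions below are arbitrary): modules that each report position modulo a different period jointly distinguish far more positions than any single module can.

import numpy as np

# Toy 1D analogy: each "module" reports position modulo its own period.
# With coprime periods, the joint code distinguishes positions up to the
# product of the periods, far more than any single module could.
periods = np.array([3, 4, 5])
positions = np.arange(60)              # 3 * 4 * 5 = 60 candidate positions

codes = positions[:, None] % periods   # phase of each module at each position
print(len(np.unique(codes, axis=0)))   # 60: every position gets a distinct code

The same argument carries over to two-dimensional lattices with realistic spacings.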

This type of location encoding also makes it possible to predict an animal’s next location based on a given starting point and a velocity. In several recent studies, researchers have trained neural networks to perform this same task, which is known as path integration.

To train neural networks to perform this task, researchers feed in a starting point and a velocity that varies over time. The model essentially mimics the activity of an animal roaming through a space, and calculates updated positions as it moves. As the model performs the task, the activity patterns of different units within the network can be measured. Each unit’s activity can be represented as a firing pattern, similar to the firing patterns of neurons in the brain.
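
As a rough sketch of what such training data looks like (this is an illustration, not the studies' actual code; the arena, step count, and noise level are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def simulate_trajectory(n_steps, dt=0.1, speed=0.2):
    """Random 2D trajectory: returns (velocities, positions) for path integration."""
    start = rng.uniform(0.0, 1.0, size=2)                 # random starting point
    headings = np.cumsum(rng.normal(0.0, 0.5, n_steps))   # smoothly drifting heading
    velocities = speed * np.stack([np.cos(headings), np.sin(headings)], axis=1)
    positions = start + np.cumsum(velocities * dt, axis=0)  # integrate velocity over time
    return velocities, positions

# The network receives the start point and velocity sequence as input and is trained
# to output the position sequence; the activity of its hidden units during this task
# is what gets compared to grid-cell firing patterns.
velocities, positions = simulate_trajectory(n_steps=100)
print(velocities.shape, positions.shape)   # (100, 2) (100, 2)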

In several previous studies, researchers have reported that their models produced units with activity patterns that closely mimic the firing patterns of grid cells. These studies concluded that grid-cell-like representations would naturally emerge in any neural network trained to perform the path integration task.

However, the MIT researchers found very different results. In an analysis of more than 11,000 neural networks that they trained on path integration, they found that while nearly 90 percent of them learned the task successfully, only about 10 percent of those networks generated activity patterns that could be classified as grid-cell-like. That figure includes networks in which even a single unit achieved a high grid score.

The earlier studies were more likely to generate grid-cell-like activity only because of the constraints that researchers built into those models, according to the MIT team.

“Earlier studies have presented this story that if you train networks to path integrate, you’re going to get grid cells. What we found is that instead, you have to make this long sequence of choices of parameters, which we know are inconsistent with the biology, and then in a small sliver of those parameters, you will get the desired result,” Schaeffer says.

More biological models

One of the constraints found in earlier studies is that the researchers required the model to convert velocity into a unique position, reported by one network unit that corresponds to a place cell. For this to happen, the researchers also required that each place cell correspond to only one location, which is not how biological place cells work: Studies have shown that place cells in the hippocampus can respond to up to 20 different locations, not just one.

When the MIT team adjusted the models so that place cells were more like biological place cells, the models were still able to perform the path integration task, but they no longer produced grid-cell-like activity. Grid-cell-like activity also disappeared when the researchers instructed the models to generate different types of location output, such as location on a grid with X and Y axes, or location as a distance and angle relative to a home point.

“If the only thing that you ask this network to do is path integrate, and you impose a set of very specific, not physiological requirements on the readout unit, then it’s possible to obtain grid cells,” Fiete says. “But if you relax any of these aspects of this readout unit, that strongly degrades the ability of the network to produce grid cells. In fact, usually they don’t, even though they still solve the path integration task.”

Therefore, if the researchers hadn’t already known of the existence of grid cells, and guided the model to produce them, it would be very unlikely for them to appear as a natural consequence of the model training.

The researchers say that their findings suggest that more caution is warranted when interpreting neural network models of the brain.

“When you use deep learning models, they can be a powerful tool, but one has to be very circumspect in interpreting them and in determining whether they are truly making de novo predictions, or even shedding light on what it is that the brain is optimizing,” Fiete says.

Kenneth Harris, a professor of quantitative neuroscience at University College London, says he hopes the new study will encourage neuroscientists to be more careful when stating what can be shown by analogies between neural networks and the brain.

“Neural networks can be a useful source of predictions. If you want to learn how the brain solves a computation, you can train a network to perform it, then test the hypothesis that the brain works the same way. Whether the hypothesis is confirmed or not, you will learn something,” says Harris, who was not involved in the study. “This paper shows that ‘postdiction’ is less powerful: Neural networks have many parameters, so getting them to replicate an existing result is not as surprising.”

When using these models to make predictions about how the brain works, it’s important to take into account realistic, known biological constraints when building the models, the MIT researchers say. They are now working on models of grid cells that they hope will generate more accurate predictions of how grid cells in the brain work.

“Deep learning models will give us insight about the brain, but only after you inject a lot of biological knowledge into the model,” Khona says. “If you use the correct constraints, then the models can give you a brain-like solution.”

The research was funded by the Office of Naval Research, the National Science Foundation, the Simons Foundation through the Simons Collaboration on the Global Brain, and the Howard Hughes Medical Institute through the Faculty Scholars Program. Mikail Khona was supported by the MathWorks Science Fellowship.

Read More

MAEEG: Masked Auto-encoder for EEG Representation Learning

This paper was accepted at the Workshop on Learning from Time Series for Health at NeurIPS 2022.
Decoding information from bio-signals such as EEG using machine learning has been a challenge due to small datasets and the difficulty of obtaining labels. We propose a reconstruction-based self-supervised learning model, the masked auto-encoder for EEG (MAEEG), for learning EEG representations by reconstructing masked EEG features using a transformer architecture. We found that MAEEG can learn representations that significantly improve sleep stage classification (∼ 5% accuracy…Apple Machine Learning Research
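
The paper's model is not reproduced here, but the masked-reconstruction objective it builds on can be sketched roughly as follows (array shapes, mask lengths, and the placeholder reconstruction are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Toy EEG feature array of shape (channels, time_steps), a stand-in for real features.
eeg = rng.normal(size=(8, 256))

# Hide a few contiguous time segments, as masked auto-encoders typically do.
mask = np.zeros(eeg.shape[1], dtype=bool)
for start in rng.choice(eeg.shape[1] - 32, size=3, replace=False):
    mask[start:start + 32] = True

masked_input = eeg.copy()
masked_input[:, mask] = 0.0            # the encoder never sees the masked segments

# A real model (e.g., a transformer encoder) maps masked_input to a reconstruction;
# a zero "reconstruction" here only illustrates where the loss is computed.
reconstruction = np.zeros_like(eeg)
loss = np.mean((reconstruction[:, mask] - eeg[:, mask]) ** 2)   # loss on masked steps only
print(loss)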

Automated exploratory data analysis and model operationalization framework with a human in the loop

Identifying, collecting, and transforming data is the foundation for machine learning (ML). According to a Forbes survey, there is widespread consensus among ML practitioners that data preparation accounts for approximately 80% of the time spent in developing a viable ML model.

In addition, many of our customers face several challenges during the model operationalization phase when trying to accelerate the journey from model conceptualization to productionization. Quite often, models are built and deployed using poor-quality, under-representative data samples, which leads to more iterations and more manual effort in data inspection, making the process more time consuming and cumbersome.

Because your models are only as good as your training data, expert data scientists and practitioners spend an enormous amount of time understanding the data and generating valuable insights prior to building the models. If we view building an ML model as an analogy to cooking a meal, high-quality data plays the same role for an advanced ML system that high-quality ingredients play for a successful meal. Therefore, before rushing into building the models, make sure you’re spending enough time getting high-quality data and extracting relevant insights.

The tools and technologies to assist with data preprocessing have been growing over the years. Now we have low-code and no-code tools like Amazon SageMaker Data Wrangler, AWS Glue DataBrew, and Amazon SageMaker Canvas to assist with data feature engineering.

However, a lot of these processes are still done manually by a data engineer or analyst who analyzes the data using these tools. If their knowledge of the tools is limited, the insights generated prior to building the models won’t cover all the steps that could be performed, and you can’t make an informed decision post-analysis of those insights prior to building the ML models. For instance, the models can turn out to be biased due to a lack of detailed insights from AWS Glue or Canvas, and you end up spending a lot of time and resources building the model training pipeline, only to eventually receive an unsatisfactory prediction.

In this post, we introduce a novel intelligent framework for data and model operationalization that provides automated data transformations and optimal model deployment. This solution can accelerate accurate and timely inspection of data and model quality checks, and facilitate the productivity of data and ML teams across your organization.

Overview of solution

Our solution demonstrates an automated end-to-end approach to performing exploratory data analysis (EDA) with a human in the loop to determine the model quality thresholds and approve the optimal, qualified data to be pushed into Amazon SageMaker Pipelines, which then pushes the final data into Amazon SageMaker Feature Store, thereby speeding up the overall workflow.

Furthermore, the approach includes deploying the best candidate model and creating the model endpoint on the transformed dataset that is automatically processed as new data arrives in the framework.

The following diagram illustrates the initial setup for the data preprocessing step prior to automating the workflow.

This step comprises the data flow initiation to process the raw data stored in an Amazon Simple Storage Service (Amazon S3) bucket. A sequence of steps in the Data Wrangler UI is created to perform feature engineering on the data (also referred to as a recipe). The data flow recipe consists of preprocessing steps along with a bias report, multicollinearity report, and model quality analysis.

Then, an Amazon SageMaker Processing job is run to save the flow to Amazon S3 and store the transformed features in Feature Store for later reuse.

After the flow has been created, which includes the recipe of instructions to be run on the data pertaining to the use case, the goal is to automate the process of creating the flow on any new incoming data and initiate the process of extracting model quality insights using Data Wrangler. Then, the information regarding the transformations performed on the new data is passed to an authorized user to inspect the data quality, and the pipeline waits for approval to run the model building and deployment step automatically.

The following architecture showcases the end-to-end automation of data transformation followed by human in the loop approval to facilitate the steps of model training and deployment.

The steps consist of an end-to-end orchestration for automated data transformation and optimal model deployment (with a human in the loop) using the following sequence of steps:

  1. A new object is uploaded into the S3 bucket (in our case, our training data).
  2. An AWS Lambda function is triggered when the object is uploaded in Amazon S3, which invokes AWS Step Functions and notifies the authorized user via a registered email (a simplified sketch of this trigger function follows the list). The following steps occur within the Step Functions orchestration:
  3. The Data Wrangler Flow Creation Lambda function fetches the Data Wrangler flow and processes the new data to be ingested into the Data Wrangler flow. It creates a new flow, which, when imported into the Data Wrangler UI, includes all the transformations, along with a model quality report and bias report. The function saves this latest flow in a new destination bucket.
  4. The User Callback Approval Lambda function sends a trigger notification via Amazon Simple Notification Service (Amazon SNS) to the registered persona via email to review the analyzed flow created on new unseen data information. In the email, the user has the option to accept or reject the data quality outcome and feature engineering flow.
  5. The next step is based on the approver’s decision:
    1. If the human in the loop approved the changes, the Lambda function initiates the SageMaker pipeline in the next state.
    2. If the human in the loop rejected the changes, the Lambda function doesn’t initiate the pipeline, and allows the user to look into the steps within the flow to perform additional feature engineering.
  6. The SageMaker Pipeline Execution Lambda function runs the SageMaker pipeline to create a SageMaker Processing job, which stores the feature engineered data in Feature Store. Another pipeline is created in parallel to save the transformed data to Amazon S3 as a CSV file.
  7. The AutoML Model Job Creation and Deployment Lambda function initiates an Amazon SageMaker Autopilot job to build and deploy the best candidate model and create a model endpoint, which authorized users can invoke for inference.
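
As a rough sketch of step 2 (not the repository's actual code), the trigger function parses the S3 event and starts the state machine; the STATE_MACHINE_ARN environment variable and the payload fields are assumptions:

import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    # Pull the bucket and key of the newly uploaded object from the S3 event notification.
    record = event["Records"][0]["s3"]
    payload = {
        "bucket": record["bucket"]["name"],
        "key": record["object"]["key"],
    }
    # STATE_MACHINE_ARN is assumed to be configured on the function as an environment variable.
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps(payload),
    )
    return {"executionArn": response["executionArn"]}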

A Data Wrangler flow is available in our code repository that includes a sequence of steps to run on the dataset. We use Data Wrangler within our Amazon SageMaker Studio IDE, which can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface.

Dataset

To demonstrate the orchestrated workflow, we use an example dataset regarding diabetic patient readmission. This data contains historical representations of patient and hospital outcomes, wherein the goal involves building an ML model to predict hospital readmission. The model has to predict whether high-risk diabetic patients are likely to be readmitted to the hospital within 30 days of a previous encounter, after 30 days, or not at all. Because this use case deals with multiple outcomes, this is a multi-class classification ML problem. You can try out the approach with this example and experiment with additional data transformations following similar steps with your own datasets.

The sample dataset we use in this post is a sampled version of the Diabetes 130-US hospitals for years 1999-2008 Data Set (Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.). It contains historical data including over 15 features with patient and hospital outcomes. The dataset contains approximately 69,500 rows. The following table summarizes the data schema.

Column Name Data Type Data Description
race STRING Caucasian, Asian, African American, or Hispanic.
time_in_hospital INT Number of days between admission and discharge (length of stay).
number_outpatient INT Number of outpatient visits of the patient in a given year before the encounter.
number_inpatient INT Number of inpatient visits of the patient in a given year before the encounter.
number_emergency INT Number of emergency visits of the patient in a given year before the encounter.
number_diagnoses INT Number of diagnoses entered in the system.
num_procedures INT Number of procedures (other than lab tests) performed during the encounter.
num_medications INT Number of distinct generic medicines administrated during the encounter.
num_lab_procedures INT Number of lab tests performed during the encounter.
max_glu_serum STRING The range of the result, or whether the test wasn’t taken. Values include >200, >300, normal, and none (if not measured).
gender STRING Values include Male, Female and Unknown/Invalid.
diabetes_med INT Indicates if any diabetes medication was prescribed.
change STRING Indicates if there was a change in diabetes medications (either dosage or generic name). Values are change or no change.
age INT Age of patient at the time of encounter.
a1c_result STRING Indicates the range of the result of blood sugar levels. Values include >8, >7, normal, and none.
readmitted STRING Days to inpatient readmission. Values include <30 if patient was readmitted in less than 30 days, >30 if patient was readmitted after 30 days of encounter, and no for no record of readmission.

Prerequisites

This walkthrough includes the following prerequisites:

Upload the historical dataset to Amazon S3

The first step is to download the sample dataset and upload it into an S3 bucket. In our case, our training data (diabetic-readmission.csv) is uploaded.
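
With Boto3, this upload might look like the following sketch (the bucket name and key prefix are placeholders for your own values):

import boto3

s3 = boto3.client("s3")

# Bucket and key are placeholders; use the bucket you created for this walkthrough.
s3.upload_file(
    Filename="diabetic-readmission.csv",
    Bucket="your-data-bucket",
    Key="healthcare/diabetic-readmission.csv",
)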

Data Wrangler initial flow

Prior to automating the Step Functions workflow, we need to perform a sequence of data transformations to create a data flow.

If you want to create the Data Wrangler steps manually, refer to the readme in the GitHub repo.

To import the flow to automate the Data Wrangler steps, complete the following steps:

  1. Download the flow from the GitHub repo and save it in your system.
  2. Open Studio and import the Data Wrangler flow. You need to update the location from which it imports the latest dataset. In your case, this is the bucket you defined with the respective prefix.
  3. Choose the plus sign next to Source and choose Edit dataset.
  4. Point to the S3 location of the dataset you downloaded.
  5. Inspect all the steps in the transformation and make sure they align with the sequence steps.

Save data flow to Feature Store

To save the data flow to Feature Store, complete the following steps:

  1. Choose the plus sign next to Steps and choose Export to.
  2. Choose SageMaker Feature Store (via Jupyter Notebook).
    SageMaker generates a Jupyter notebook for you and opens it in a new tab in Studio. This notebook contains everything you need to run the transformations over our historical dataset and ingest the resulting features into Feature Store. The notebook uses Feature Store to create a feature group, runs your Data Wrangler flow on the entire dataset using a SageMaker processing job, and ingests the processed data into Feature Store.
  3. Choose the kernel Python 3 (Data Science) on the newly opened notebook tab.
  4. Read through and explore the Jupyter notebook.
  5. In the Create Feature Group section of the generated notebook, update the following fields for the event time and record identifier with the column names we created in the previous Data Wrangler step:
    record_identifier_name = "Record_id" 
    event_time_feature_name = "EventTime"

  6. Choose Run and then choose Run All Cells.
  7. Enter flow_name = "HealthCareUncleanWrangler".
  8. Run the following cells to create your feature group name.
    After running a few more cells in the code, the feature group is successfully created.
  9. Now that the feature group is created, you use a processing job to process your data at scale and ingest the transformed data into this feature group.
    If you keep the default bucket location, the flow is saved in a SageMaker bucket located in the Region where you launched your SageMaker domain. With Feature_store_offline_S3_uri, Feature Store writes the data in the OfflineStore of a FeatureGroup to an Amazon S3 location owned by you. Wait for the processing job to finish. If it finishes successfully, your feature group is populated with the transformed feature values, and the raw parameters used by the processing job are printed. It takes 10–15 minutes for the processing job to create and run the Data Wrangler flow on the entire dataset and save the output flow in the respective bucket within the SageMaker session.
  10. Next, run the FeatureStoreAutomation.ipynb notebook by importing it in Studio from GitHub and running all the cells. Follow the instructions in the notebook.
  11. Copy the following variables from the Data Wrangler generated output from the previous step and add them to the cell in the notebook:
    feature_group_name = "<FEATURE GROUP NAME>"
    output_name = "<OUTPUT NAME>"
    flow_uri='<FLOW URI>'

  12. Run the rest of the code following the instructions in the notebook to create a SageMaker pipeline to automate the storing of features to Feature Store in the feature group that you created.
  13. Next, similar to the previous step in the Data Wrangler export option, choose the plus sign and choose Export to.
  14. Choose SageMaker Pipelines (via Jupyter Notebook).
  15. Run all the cells to create a CSV output to be stored in Amazon S3. That pipeline name is invoked in a Lambda function later to automate the pipeline on a new flow.
  16. Within the code, whenever you see the following instance count, change instance_count to 1:
    # Processing Job Instance count and instance type.
    instance_count = 2

  17. Otherwise, your account may hit the service quota limit for running ml.m5.4xlarge instances for the processing jobs run within the notebook. You have to request a service quota increase if you want to run the job with more instances.
  18. As you walk through the pipeline code, navigate to Create SageMaker Pipeline, where you define the pipeline steps.
  19. In the Output Amazon S3 settings cell, change the location of the Amazon S3 output path to the following code (commenting out the output prefix):
    #S3_output_prefix = f"export-{flow_export_name}/output" 
    S3_output_path = f"s3://{bucket}/WrangledOutput"

  20. Locate the following code:
    from sagemaker.workflow.parameters import ParameterString
    from sagemaker.workflow.functions import Join
    
    parameters = []
    for name, val in parameter_overrides.items():
        parameters.append(
            ParameterString(
                name=name,
                default_value=json.dumps({name: val}),
            )
        )

  21. Replace it with the following:
    from sagemaker.workflow.steps import ProcessingStep
    
    data_wrangler_step = ProcessingStep(
        name="DataWranglerProcessingStep",
        processor=processor,
        inputs=[flow_input] + data_sources, 
        outputs=[processing_job_output],
        job_arguments=[f"--output-config '{json.dumps(output_config)}'"],
    )
    

  22. Remove the following cell:
    from sagemaker.workflow.steps import ProcessingStep
    
    data_wrangler_step = ProcessingStep(
        name="DataWranglerProcessingStep",
        processor=processor,
        inputs=[flow_input] + data_sources, 
        outputs=[processing_job_output],
        job_arguments=[f"--output-config '{json.dumps(output_config)}'"] 
            + [f"--refit-trained-params '{json.dumps(refit_trained_params)}'"]
            + [Join(on="", values=["--parameter-override '", param, "'"]) for param in parameters],
    )
    

  23. Continue running the next steps until you reach the Define a Pipeline of Parameters section with the following code. Append the last line input_flow to the code segment:
    from sagemaker.workflow.parameters import (
        ParameterInteger,
        ParameterString,
    )
    # Define Pipeline Parameters
    instance_type = ParameterString(name="InstanceType", default_value="ml.m5.4xlarge")
    instance_count = ParameterInteger(name="InstanceCount", default_value=1)
    input_flow = ParameterString(name='InputFlow', default_value='s3://placeholder-bucket/placeholder.flow')

  24. Also, add the input_flow as an additional parameter to the next cell:
    pipeline = Pipeline(
        name=pipeline_name,
        parameters=[instance_type, instance_count, input_flow],
        steps=pipeline_steps,
        sagemaker_session=sess
    )

  25. In the section Submit the pipeline to SageMaker and start execution, locate the following cell:
    pipeline.upsert(role_arn=iam_role)
    execution = pipeline.start(
        parameters={
            key: json.dumps({key: val})
            for key, val in parameter_overrides.items()
        }
    )

  26. Replace it with the following code:
    pipeline.upsert(role_arn=iam_role)
    execution = pipeline.start()

  27. Copy the name of the pipeline you just saved.
    This will be your S3_Pipeline_Name value, which is added as an environment variable stored in the Data Wrangler Flow Creation Lambda function.
  28. Replace S3_Pipeline_Name with the name of the pipeline that you just created after running the preceding notebook.
    Now, when a new object is uploaded in Amazon S3, a SageMaker pipeline runs the processing job that creates the Data Wrangler flow on the entire dataset and stores the transformed dataset in Amazon S3 as a CSV file. This object is used in the next step (the Step Functions workflow) for model training and endpoint deployment. We have now created and stored a transformed dataset in Amazon S3 by running the preceding notebook, and we also created a feature group in Feature Store for storing the respective transformed features for later reuse.
  29. Update both pipeline names in the Data Wrangler Flow Creation Lambda function (created with the AWS CDK) for the Amazon S3 pipeline and Feature Store pipeline.

Step Functions orchestration workflow

Now that we have created the processing jobs, we need to run them on any incoming data that arrives in Amazon S3. We initiate the data transformation automatically, notify the authorized user of the new flow created, and wait for the approver to approve the changes based on data and model quality insights. Then, the Step Functions callback action is triggered to initiate the SageMaker pipeline and start model training and deployment of the optimal model endpoint in the environment.

The Step Functions workflow includes a series of Lambda functions to run the overall orchestration. The Step Functions state machine, S3 bucket, Amazon API Gateway resources, and Lambda function codes are stored in the GitHub repo.

The following figure illustrates our Step Functions workflow.


Run the AWS CDK code located in GitHub to automatically set up the stack containing the components needed to run the automated EDA and model operationalization framework. After setting up the AWS CDK environment, run the following command in the terminal:

cdk deploy --parameters EmailID=enter_email_id --parameters DataBucketName=enter_unique_s3bucket_name

Create a healthcare folder in the bucket you named via your AWS CDK script. Then upload flow-healthcarediabetesunclean.csv to the folder and let the automation happen!

In the following sections, we walk through each step in the Step Functions workflow in more detail.

Data Wrangler Flow Creation

As new data is uploaded into the S3 bucket, a Lambda function is invoked to trigger the Step Functions workflow. The Data Wrangler Flow Creation Lambda function fetches the Data Wrangler flow. It runs the processing job to create a new Data Wrangler flow (which includes data transformations, model quality report, bias report, and so on) on the ingested dataset and pushes the new flow to the designated S3 bucket.

This Lambda function passes the information to the User Callback Approval Lambda function and sends the trigger notification via Amazon SNS to the registered email with the location of the designated bucket where the flow has been saved.
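
A simplified sketch of such a function is shown below; the environment variable names, flow key layout, and notification message are illustrative assumptions, and the repository's implementation additionally launches the processing job that runs the new flow:

import json
import os

import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

def lambda_handler(event, context):
    # Load the template .flow file (a JSON document describing the Data Wrangler recipe).
    template = s3.get_object(Bucket=os.environ["FLOW_BUCKET"], Key=os.environ["FLOW_KEY"])
    flow = json.loads(template["Body"].read())

    # Re-point the flow's S3 source node at the newly uploaded dataset here; the exact
    # JSON path depends on the flow file version, so it is omitted in this sketch.
    new_data_uri = f"s3://{event['bucket']}/{event['key']}"

    # Save the updated flow and tell the approver where to review it.
    new_key = f"flows/{os.path.basename(event['key'])}.flow"
    s3.put_object(Bucket=os.environ["DESTINATION_BUCKET"], Key=new_key, Body=json.dumps(flow))
    sns.publish(
        TopicArn=os.environ["SNS_TOPIC_ARN"],
        Message=f"New Data Wrangler flow for {new_data_uri} saved to "
                f"s3://{os.environ['DESTINATION_BUCKET']}/{new_key}. Please review.",
    )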

User Callback Approval

The User Callback Approval step initiates the Lambda function that receives the updated flow information and sends a notification to the authorized user with the approval/rejection link to approve or reject the new flow. The user can review the analyzed flow created on the unseen data by downloading the flow from the S3 bucket and uploading it in the Data Wrangler UI.

After the user reviews the flow, they can go back to the email to approve the changes.

Manual Approval Choice

This step waits for the authorized user to approve or reject the flow.

If the answer received is yes (the user approved the flow), the SageMaker Pipeline Execution Lambda function initiates the SageMaker pipeline for storing the transformed features in Feature Store. Another SageMaker pipeline is initiated in parallel to save the transformed features CSV to Amazon S3, which is used by the next state (the AutoML Model Job Creation & Model Deployment Lambda function) for model training and deployment.

If the answer received is no (the user rejected the flow), the Lambda function doesn’t initiate the pipeline to run the flow. The user can look into the steps within the flow to perform additional feature engineering. Later, the user can rerun the entire sequence after adding additional data transformation steps in the flow.
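
The approval link typically resolves to a small Lambda function that returns the decision to Step Functions using the task token. The following sketch assumes the token and decision arrive as API Gateway query string parameters (an assumption, not the repository's exact contract):

import json

import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    # The approval link embeds the Step Functions task token and the reviewer's decision.
    params = event["queryStringParameters"]
    token = params["taskToken"]

    if params["decision"] == "approve":
        sfn.send_task_success(taskToken=token, output=json.dumps({"approved": True}))
    else:
        sfn.send_task_failure(taskToken=token, error="FlowRejected",
                              cause="Flow rejected by the reviewer")
    return {"statusCode": 200, "body": "Decision recorded."}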

SageMaker Pipeline Execution

This step initiates a Lambda function that runs the SageMaker pipeline to store the feature engineered data in Feature Store. Another pipeline in parallel saves the transformed data to Amazon S3.

You can monitor the two pipelines in Studio by navigating to the Pipelines page.


You can choose the graph to inspect the input, output, logs, and information.


Similarly, you can inspect the information of the other pipeline, which saves the transformed features CSV to Amazon S3.


AutoML Model Job Creation & Model Deployment

This step initiates a Lambda function that starts an Autopilot job to ingest the CSV from the previous Lambda function, and build and deploy the best candidate model. This step creates a model endpoint that can be invoked by authorized users. When the AutoML job is complete, you can navigate to Studio, choose Experiments and trials, and view the information associated with your job.


As all of these steps run, the SageMaker dashboard reflects the processing job, batch transform job, training job, and hyperparameter tuning job created in the process, as well as the endpoint that can be invoked when the overall process is complete.


Clean up

To avoid ongoing charges, make sure to delete the SageMaker endpoint and stop all the notebooks running in Studio, including the Data Wrangler instances. Also, delete the output data in Amazon S3 you created while running the orchestration workflow via Step Functions. You have to delete the data in the S3 buckets before you can delete the buckets.

Conclusion

In this post, we demonstrated an end-to-end approach to performing automated data transformation with a human in the loop to determine model quality thresholds and approve the optimal, qualified data to be pushed to a SageMaker pipeline, which pushes the final data into Feature Store, thereby speeding up the overall workflow. Furthermore, the approach includes deploying the best candidate model and creating the model endpoint on the final feature engineered data that is automatically processed when new data arrives.

References

For further information about Data Wrangler, Feature Store, SageMaker pipelines, Autopilot, and Step Functions, we recommend the following resources:


About the Author(s)

Shikhar Kwatra is an AI/ML Specialist Solutions Architect at Amazon Web Services, working with a leading Global System Integrator. He has earned the title of one of the Youngest Indian Master Inventors with over 400 patents in the AI/ML and IoT domains. He has over 8 years of industry experience from startups to large-scale enterprises, from IoT Research Engineer, Data Scientist, to Data & AI Architect. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for organizations and supports GSI partners in building strategic industry solutions on AWS.

Sachin Thakkar is a Senior Solutions Architect at Amazon Web Services, working with a leading Global System Integrator (GSI). He brings over 22 years of experience as an IT Architect and as Technology Consultant for large institutions. His focus area is on data and analytics. Sachin provides architectural guidance and supports GSI partners in building strategic industry solutions on AWS.

Read More

Move Amazon SageMaker Autopilot ML models from experimentation to production using Amazon SageMaker Pipelines

Amazon SageMaker Autopilot automatically builds, trains, and tunes the best custom machine learning (ML) models based on your data. It’s an automated machine learning (AutoML) solution that eliminates the heavy lifting of handwriting ML models, which requires ML expertise. Data scientists need to only provide a tabular dataset and select the target column to predict, and Autopilot automatically infers the problem type, performs data preprocessing and feature engineering, selects the algorithms and training mode, and explores different configurations to find the best ML model. Then you can directly deploy the model to an Amazon SageMaker endpoint or iterate on the recommended solutions to further improve the model quality.

Although Autopilot eliminates the heavy lifting of building ML models, MLOps engineers still have to create, automate, and manage end-to-end ML workflows. Amazon SageMaker Pipelines helps you automate the different steps of the ML lifecycle, including data preprocessing, training, tuning and evaluating ML models, and deploying them.

In this post, we show how to create an end-to-end ML workflow to train and evaluate an Autopilot generated ML model using Pipelines and register it in the SageMaker model registry. The ML model with the best performance can be deployed to a SageMaker endpoint.

Dataset overview

We use the publicly available hospital readmission dataset for diabetic patients to predict readmission of diabetic patients within 30 days after discharge. It is a sampled version of the “Diabetes 130-US hospitals for years 1999-2008 Data Set”. This is a multi-class classification problem because the readmission options are either < 30 if the patient is readmitted within 30 days, > 30 if the patient is readmitted after 30 days, or no for no record of readmission.

The dataset contains 50,000 rows and 15 columns. This includes demographic information about patients along with their hospital visit records and readmitted as the target column. The following table summarizes the column details.

Column Name Description
Race_Caucasian Values: 0 for no, 1 for yes
Race_African_American Values: 0 for no, 1 for yes
Race_Hispanic Values: 0 for no, 1 for yes
Race_Asian Values: 0 for no, 1 for yes
Race_Other Values: 0 for no, 1 for yes
Age 0–100 age range
Time in Hospital Number of days between admission and discharge
Number of lab procedures Number of lab tests performed during the encounter
Number of medications Number of distinct generic names administered during the encounter
Number of emergency visits Number of emergency visits of the patient in the year preceding the encounter
Number of inpatient visits Number of inpatient visits of the patient in the year preceding the encounter
Number of diagnoses Number of diagnoses entered to the system
Change of medications Indicates if there was a change in diabetic medications (either dosage or generic name); values: 0 and 1
Diabetic medications Indicates if there was any diabetic medication prescribed; values: 0 for no changes in prescription and 1 for change in prescription
Readmitted Days to inpatient readmission; values: <30 if the patient was readmitted in less than 30 days, >30 if the patient was readmitted in more than 30 days, and no for no record of readmission

Solution overview

We use Pipelines in Amazon SageMaker Studio to orchestrate different pipeline steps required to train an Autopilot model. An Autopilot experiment is created and run using the AWS SDKs as described in this post. Autopilot training jobs start their own dedicated SageMaker backend processes, and dedicated SageMaker API calls are required to start new training jobs, monitor training job statuses, and invoke trained Autopilot models.

The following are the steps required for this end-to-end Autopilot training process:

  1. Create an Autopilot training job.
  2. Monitor the training job status.
  3. Evaluate performance of the trained model on a test dataset.
  4. Register the model in the model registry.

Overview of the SageMaker pipeline steps

When the registered model meets the expected performance requirements after a manual review, you can deploy the model to a SageMaker endpoint using a standalone deployment script.

The following architecture diagram illustrates the pipeline steps necessary to package the workflow into a reproducible, automated, and scalable Autopilot training pipeline. Each step is responsible for a specific task in the workflow:

  1. An AWS Lambda function starts the Autopilot training job.
  2. A Callback step continuously monitors that job status.
  3. When the training job is complete, we use a SageMaker processing job to evaluate the model’s performance.
  4. Finally, we use another Lambda function to register the ML model and the performance metrics to the SageMaker model registry.

The data files are read from the Amazon Simple Storage Service (Amazon S3) bucket and the pipeline steps are called sequentially.

Architecture diagram of the SageMaker pipeline


In the following sections, we review the code and discuss the components of each step. To deploy the solution, reference the GitHub repo, which provides step-by-step instructions for implementing an Autopilot MLOps workflow using Pipelines.

Prerequisites

For this walkthrough, complete the following prerequisite steps:

  1. Set up an AWS account.
  2. Create a Studio environment.
  3. Create two AWS Identity and Access Management (IAM) roles: LambdaExecutionRole and SageMakerExecutionRole, with permissions as outlined in the SageMaker notebook. The managed policies should be scoped down further for improved security. For instructions, refer to Creating a role to delegate permissions to an IAM user.
  4. On the Studio console, upload the code from the GitHub repo.
  5. Open the SageMaker notebook autopilot_pipelines_demo_notebook.ipynb and run the cells under Get dataset to download the data and upload it to your S3 bucket.
    1. Download the data and unzip it to a folder named data:
      !mkdir data
      !wget https://static.us-east-1.prod.workshops.aws/public/d56bf7ad-9738-4edf-9be0-f03cd22d8cf2/static/resources/hcls/diabetic.zip -nc -O data/data.zip
      !unzip -o data/data.zip -d data
      

    2. Split the data into train-val and test files and upload them to your S3 bucket. The train-val file is automatically split into training and validation datasets by Autopilot. The test file is split into two separate files: one file without the target column and another file with only the target column.
      data = pd.read_csv(DATASET_PATH)
      train_val_data = data.sample(frac=0.8)
      test_data = data.drop(train_val_data.index)
      train_val_data.to_csv(train_val_dataset_s3_path.default_value, index=False, header=True)
      test_data.drop(target_attribute_name.default_value, axis=1).to_csv(
          x_test_s3_path.default_value, index=False, header=False
      )
      test_data[target_attribute_name.default_value].to_csv(
          y_test_s3_path.default_value, index=False, header=True
      )
      

When the dataset is ready to use, we can now set up Pipelines to establish a repeatable process to build and train custom ML models using Autopilot. We use Boto3 and the SageMaker SDK to launch, track, and evaluate the AutoML jobs in an automated fashion.

Define the pipeline steps

In this section, we walk you through setting up the four steps in the pipeline.

Start the Autopilot job

This pipeline step uses a Lambda step, which runs a serverless Lambda function. We use a Lambda step because the API call to Autopilot is lightweight. Lambda functions are serverless and well suited for this task. For more information about Lambda steps, refer to Use a SageMaker Pipeline Lambda step for lightweight model deployments. The Lambda function in the start_autopilot_job.py script creates an Autopilot job.

We use the Boto3 Autopilot API call create_auto_ml_job to specify the Autopilot job configuration, with the following parameters:

  • AutoMLJobName – The Autopilot job name.
  • InputDataConfig – The training data, data location in Amazon S3, and S3 data type with valid values such as S3Prefix, ManifestFile, and AugmentedManifestFile.
  • OutputDataConfig – The S3 output path where artifacts from the AutoML job are stored.
  • ProblemType – The problem type (MulticlassClassification for our use case).
  • AutoMLJobObjective – F1macro is the objective metric for our use case.
  • AutoMLJobConfig – The training mode is specified here. We use the newly released ensemble training mode powered by AutoGluon.

See the following code:

def lambda_handler(event, context):
    sagemaker_client.create_auto_ml_job(
        AutoMLJobName=event["AutopilotJobName"],
        InputDataConfig=[
            {
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": event["TrainValDatasetS3Path"],
                    }
                },
                "TargetAttributeName": event["TargetAttributeName"],
            }
        ],
        OutputDataConfig={"S3OutputPath": event["TrainingOutputS3Path"]},
        ProblemType=event["ProblemType"],
        AutoMLJobObjective={"MetricName": event["AutopilotObjectiveMetricName"]},
        AutoMLJobConfig={
            "CompletionCriteria": {
                "MaxCandidates": event["MaxCandidates"],
                "MaxRuntimePerTrainingJobInSeconds": event["MaxRuntimePerTrainingJobInSeconds"],
                "MaxAutoMLJobRuntimeInSeconds": event["MaxAutoMLJobRuntimeInSeconds"],
            },
            "Mode": event["AutopilotMode"],
        },
        RoleArn=event["AutopilotExecutionRoleArn"],
    )

Check Autopilot job status

A Callback step helps us keep track of the status of the Autopilot training job.

The step repeatedly checks the training job status by using a separate Lambda function in check_autopilot_job_status.py until the job is complete.

The Callback step places a token in an Amazon Simple Queue Service (Amazon SQS) queue that triggers a Lambda function to check the training job status:

  • If the job is still running, the Lambda function raises an exception and the message is placed back into the SQS queue
  • If the job is complete, the Lambda function sends a success message back to the Callback step and the pipeline continues with the next step

We use a combination of a Callback step and a Lambda function; an alternative is to use a SageMaker processing job instead.
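
A sketch of the status-checking function might look like the following; the SQS message field names are assumptions, and only the describe_auto_ml_job call and the task-token callbacks are essential to the pattern:

import json

import boto3

sm = boto3.client("sagemaker")
sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    # Each SQS record carries the Step Functions task token and the Autopilot job name.
    for record in event["Records"]:
        body = json.loads(record["body"])
        token = body["TaskToken"]
        job_name = body["AutopilotJobName"]

        status = sm.describe_auto_ml_job(AutoMLJobName=job_name)["AutoMLJobStatus"]
        if status in ("InProgress", "Stopping"):
            # Raising returns the message to the queue so the check is retried later.
            raise RuntimeError(f"Autopilot job {job_name} is still {status}")
        if status == "Completed":
            sfn.send_task_success(taskToken=token, output=json.dumps({"AutoMLJobStatus": status}))
        else:
            sfn.send_task_failure(taskToken=token, error="AutoMLJobFailed", cause=status)

Raising instead of swallowing the exception is what keeps the message on the queue and makes the polling loop work.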

Evaluate the best Autopilot model

The SageMaker processing step launches a SageMaker batch transform job to evaluate the trained Autopilot model against an evaluation dataset (the test set that was saved to the S3 bucket) and generates the performance metrics evaluation report and model explainability metrics. The evaluation script takes the Autopilot job name as an input argument and launches the batch transform job.

When the batch transform job is complete, we get output predictions for the test set. The output predictions are compared to the actual (ground truth) labels using Scikit-learn metrics functions. We evaluate our results based on the F1 score, precision, and recall. The performance metrics are saved to a JSON file, which is referenced when registering the model in the subsequent step.
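
A condensed sketch of the metric computation inside such an evaluation script might look like this (file names, the target column, and the report layout are illustrative):

import json

import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# File names and the target column are illustrative; in the processing job they map
# to the job's input/output channels.
y_true = pd.read_csv("y_test.csv")["readmitted"]
y_pred = pd.read_csv("predictions.csv", header=None)[0]

report = {
    "multiclass_classification_metrics": {
        "weighted_f1": {"value": f1_score(y_true, y_pred, average="weighted")},
        "weighted_precision": {"value": precision_score(y_true, y_pred, average="weighted")},
        "weighted_recall": {"value": recall_score(y_true, y_pred, average="weighted")},
    }
}

with open("evaluation_report.json", "w") as f:
    json.dump(report, f)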

Register the Autopilot model

We use another Lambda step, in which the Lambda function in register_autopilot_job.py registers the Autopilot model to the SageMaker model registry using the evaluation report obtained in the previous SageMaker processing step. A Lambda step is used here for cost efficiency and latency.
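
At its core, registration is a single create_model_package call. The following sketch uses placeholder values for the container image, model artifact, and metrics location:

import boto3

sm = boto3.client("sagemaker")

# Image URI, model artifact, and metrics location are placeholders for values produced
# earlier in the pipeline (the best candidate's container and the evaluation report).
sm.create_model_package(
    ModelPackageGroupName="autopilot-demo-package",
    ModelApprovalStatus="PendingManualApproval",
    InferenceSpecification={
        "Containers": [
            {
                "Image": "<best-candidate-inference-image-uri>",
                "ModelDataUrl": "s3://<bucket>/<best-candidate>/model.tar.gz",
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
    ModelMetrics={
        "ModelQuality": {
            "Statistics": {
                "ContentType": "application/json",
                "S3Uri": "s3://<bucket>/evaluation_report.json",
            }
        }
    },
)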

At this point, we have successfully registered our new Autopilot model to the SageMaker model registry. You can view the new model in Studio by choosing Model registry on the SageMaker resources menu and opening autopilot-demo-package. Choose any version of a training job to view the objective metrics under Model quality.

You can use the explainability report on the Explainability tab to understand your model’s predictions.

To view the experiments run for each model created, navigate to the Experiments and trials page. Choose (right-click) one of the listed experiments and choose Describe AutoML job to view the model leaderboard.

To view the pipeline steps on the Experiments and trials page, choose (right-click) the experiment and choose Open pipeline details.

Create and run the pipeline

After we define the pipeline steps, we combine them into a SageMaker pipeline. The steps are run sequentially. The pipeline runs all of the steps for an AutoML job, using Autopilot for training, model evaluation, and model registration. See the following code:

pipeline = Pipeline(
    name="autopilot-demo-pipeline",
    parameters=[
        autopilot_job_name,
        target_attribute_name,
        train_val_dataset_s3_path,
        x_test_s3_path,
        y_test_s3_path,
        max_autopilot_candidates,
        max_autopilot_job_runtime,
        max_autopilot_training_job_runtime,
        instance_count,
        instance_type,
        model_approval_status,
    ],
    steps=[
        step_start_autopilot_job,
        step_check_autopilot_job_status_callback,
        step_autopilot_model_evaluation,
        step_register_autopilot_model,
    ],
    sagemaker_session=sagemaker_session,
)

Deploy the model

After we have manually reviewed the ML model’s performance, we can deploy our newly created model to a SageMaker endpoint. For this, we can run the cell in the notebook that creates the model endpoint using the model configuration saved in the SageMaker model registry.

Note that this script is shared for demonstration purposes, but it’s recommended to follow a more robust CI/CD pipeline for production deployment. For more information, refer to Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines.
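
For reference, a minimal deployment sketch with the SageMaker Python SDK might look like the following; the model package ARN, role, instance type, and endpoint name are placeholders:

from sagemaker import Session
from sagemaker.model import ModelPackage

session = Session()
role = "<SageMakerExecutionRole ARN>"   # the execution role created in the prerequisites

# The model package ARN comes from the approved entry in the SageMaker model registry;
# instance type and endpoint name are illustrative.
model = ModelPackage(
    role=role,
    model_package_arn="<approved-model-package-arn>",
    sagemaker_session=session,
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="autopilot-demo-endpoint",
)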

Conclusion

This post described an easy-to-use ML pipeline approach to automatically train tabular ML models (AutoML) using Autopilot, Pipelines, and Studio. AutoML improves ML practitioners’ efficiency, accelerating the path from ML experimentation to production without the need for extensive ML expertise. We outlined the respective pipeline steps needed for ML model creation, evaluation, and registration.

Get started by accessing the code on the GitHub repo to train and deploy your own custom AutoML models.

For more information on Pipelines and Autopilot, refer to Amazon SageMaker Pipelines and Automate model development with Amazon SageMaker Autopilot, respectively.


About the Authors

Pierre de Malliard is a Full-Stack Data Scientist for AWS and is passionate about helping customers improve their business outcomes with machine learning. He has been building AI/ML solutions across the healthcare sector. He holds multiple AWS certifications. In his free time, Pierre enjoys backcountry skiing and spearfishing.

Paavani Dua is an Applied Scientist in the AWS AI organization. At the Amazon ML Solutions Lab, she works with customers to solve their business problems using ML solutions. Outside of work, she enjoys hiking, reading, and baking.

Marcelo Aberle is an ML Engineer in the AWS AI organization. He is leading MLOps efforts at the Amazon ML Solutions Lab, helping customers design and implement scalable ML systems. His mission is to guide customers on their enterprise ML journey and accelerate their ML path to production. He is an admirer of California nature and enjoys hiking and cycling around San Francisco.

Startups across AWS Accelerators use AI and ML to solve mission-critical customer challenges

Relentless advancement in technology is improving the decision-making capacity of humans and enterprises alike. Digitization of the physical world has accelerated the three dimensions of data: velocity, variety, and volume. This has made information more widely available than ever, enabling advances in problem-solving. Now that the cloud has democratized access to these capabilities, technologies like artificial intelligence (AI) and machine learning (ML) can increase the speed and accuracy of decision-making by humans and machines.

Nowhere is this speed and accuracy of decisions more important than in the public sector, where organizations across defense, healthcare, aerospace, and sustainability are solving challenges that impact citizens around the world. Many public sector customers see the benefits of using AI/ML to address these challenges, but can be overwhelmed by the range of solutions. AWS launched AWS Accelerators to find and develop startups with technologies that address public sector customers’ unique challenges. Read on to learn more about AI/ML use cases from startups in the AWS Accelerators that are making an impact for public sector customers.

Healthcare

Pieces: Healthcare providers want to spend more time caring for patients and less time on paperwork. Pieces, an AWS Healthcare Accelerator startup, uses AWS to make it easier to input, manage, store, organize, and gain insight from Electronic Health Record (EHR) data to address social determinants of health and improve patient care. With AI, natural language processing (NLP), and clinically reviewed algorithms, Pieces can provide projected hospital discharge dates, anticipated clinical and non-clinical barriers to discharge, and risk of readmission. Pieces services also deliver insights to healthcare providers in plain language and clarify patients’ clinical issues to help care teams work more efficiently. According to Pieces, the software delivers a 95% positive prediction rate in identifying barriers to patient discharge and, at one hospital, has reduced patient hospital stays by an average of 2 days.

Pieces uses Amazon Elastic Compute Cloud (Amazon EC2), Amazon Relational Database Service (Amazon RDS), and Amazon Managed Streaming for Apache Kafka (Amazon MSK) for collecting and processing streamed clinical data. Pieces uses Amazon Elastic Kubernetes Service (Amazon EKS), Amazon OpenSearch Service, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to run multiple ML models on data in production at scale.

PEP Health: Patient experience is a key priority, but gathering patient feedback can be a challenge. PEP Health, a startup in the AWS Healthcare Accelerator’s UK cohort, uses NLP technology to analyze millions of online, publicly posted patient comments, generating scores that highlight areas for celebration or concern, and identifying the reasons for improving or declining patient satisfaction. This data can be used to improve experiences, drive better outcomes, and democratize the patient voice.

PEP Health uses AWS Lambda, AWS Fargate, and Amazon EC2 to ingest information in real time from hundreds of thousands of webpages. With proprietary NLP models built and run on Amazon SageMaker, PEP Health identifies and scores themes relevant to the quality of care. These results feed PEP Health’s Patient Experience Platform and ML algorithms built and powered by Lambda, Fargate, Amazon EC2, Amazon RDS, SageMaker, and Amazon Cognito, which enable relationship analysis and uncover patterns between people, places, and things that may otherwise seem disconnected.

“Through the accelerator, PEP Health was able to scale its operations significantly with the introduction of AWS Lambda to collect more comments faster and more affordably. Additionally, we’ve been able to use Amazon SageMaker to derive further insights for customers.”

– Mark Lomax, PEP Health CEO.

Defense and space

Lunar Outpost: Lunar Outpost was part of the AWS Space Accelerator’s inaugural cohort in 2021. The company is taking part in missions to the Moon and is developing Mobile Autonomous Platform (MAP) rovers that will be capable of surviving and navigating the extreme environments of other planetary bodies. To successfully navigate in conditions that can’t be found on Earth, Lunar Outpost makes extensive use of robotic simulations to validate AI navigation algorithms.

Lunar Outpost uses AWS RoboMaker, Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Simple Storage Service (Amazon S3), Amazon Virtual Private Cloud (Amazon VPC), Lambda, AWS CodeBuild, and Amazon QuickSight to test rovers by deploying lunar simulations. As Lunar Outpost develops navigation technologies for the lunar surface, simulation instances are spun up. These simulations will be used during lunar missions to assist human operators and decrease risk. Data streamed back from the lunar surface will be imported into their simulation, giving a real-time view of the rover’s activities. Simulation of digital MAP rovers allows for trial runs of navigation trajectories without moving the physical rover, dramatically reducing the risks of moving rovers in space.

Adarga: Adarga, part of the first AWS Defense Accelerator cohort, is delivering an AI-driven intelligence platform to rapidly understand risks and opportunities for theater entry preparation and deployment. Adarga uses AI to find insights buried within large volumes of unstructured data, such as news, presentations, reports, videos, and more.

Adarga uses Amazon EC2, OpenSearch Service, Amazon Aurora, Amazon DocumentDB (with MongoDB compatibility), Amazon Translate, and SageMaker. Adarga ingests information in real time, translates foreign language documents, and transcribes audio and video files into text. In addition to SageMaker, Adarga uses proprietary NLP models to extract and classify details, like people, places, and things, deploying disambiguation techniques to contextualize the information. These details are mapped into a dynamic intelligence picture for customers. Adarga’s ML algorithms, together with AWS AI/ML services, enable relationship analysis, uncovering patterns that may otherwise seem disconnected.

“We are proud to be part of this pioneering initiative as we continue to work closely with AWS and a wider ecosystem of tech players to deliver game-changing capabilities to defence, enabled by hyperscale cloud.”

– Robert Bassett-Cross, CEO, Adarga

Sustainable cities

SmartHelio: Within the commercial solar farm industry, it is critical to determine the health of installed solar infrastructure. SmartHelio combines physics and SageMaker to construct models that determine the current health of solar assets, predict which assets will fail, and proactively determine which assets to service first.

SmartHelio’s solution, built on AWS, analyzes incredibly complex photovoltaic physics and power systems. A data lake on Amazon S3 stores billions of data points streamed on a real-time basis from Supervisory Control and Data Acquisition (SCADA) servers on solar farms, Internet of Things (IoT) devices, or third-party Content Management Systems (CMS) platforms. SmartHelio uses SageMaker to run deep learning models to recognize patterns, quantify solar farm health, and predict farm losses on a real-time basis, delivering intelligent insights instantly to its customers.

After being selected for the first AWS Sustainable Cities Accelerator cohort, SmartHelio secured several pilots with new customers. In CEO Govinda Upadhyay’s words, “the AWS Accelerator gave us global exposure to markets, mentors, potential customers, and investors.”

Automotus: Automotus uses computer vision technology to give drivers the ability to see in real time whether curb space is available, significantly reducing time spent searching for parking. Automotus helps cities and airports manage and monetize their curbs using a fleet of computer vision sensors powered by AWS IoT Greengrass. Automotus’s sensors upload training data to Amazon S3, where a workflow powered by Lambda indexes sample data to create complex datasets for training new models and improving existing ones.

Automotus uses SageMaker to automate and containerize its computer vision model training process, the outputs of which are deployed back to the edge using a simple, automated process. Equipped with these trained models, Automotus sensors send metadata to the cloud using AWS IoT Core, uncovering granular insights about curb activity and enabling fully automated billing and enforcement at the curb. With one customer, Automotus increased enforcement efficiency and revenue by more than 500%, resulting in a 24% increase in parking turnover and a 20% reduction in traffic.

What’s next for AI/ML and startups

Customers have embraced AI/ML to solve a wide spectrum of challenges, which is a testament to the advancement of the technology and the increased confidence customers have in using data to improve decision-making. AWS Accelerators aim to continue driving the adoption of AI/ML solutions by helping customers brainstorm and share critical problem statements, and by finding and connecting startups with these customers.

Interested in advancing solutions for public good through your startup? Or have a challenge in need of a disruptive solution? Connect with the AWS Worldwide Public Sector Venture Capital and Startups team today to learn more about AWS Accelerators and other resources available to drive decision-making innovations.


About the authors

Swami Sivasubramanian is Vice President of Data and Machine Learning at AWS. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.

Manpreet Mattu is the Global Head of Venture Capital and Startups Business Development for the Worldwide Public Sector at Amazon Web Services (AWS). He has 15 years of experience in venture investments and acquisitions in leading-edge technology and non-tech segments. Beyond tech, Manpreet’s interests span history, philosophy, and economics. He is also an endurance runner.
