Identifying Disfluencies in Natural Speech

People don’t write in the same way that they speak. Written language is controlled and deliberate, whereas transcripts of spontaneous speech (like interviews) are hard to read because speech is disorganized and less fluent. One aspect that makes speech transcripts particularly difficult to read is disfluency, which includes self-corrections, repetitions, and filled pauses (e.g., words like “umm” and “you know”). The following is an example of a spoken sentence with disfluencies from the LDC CALLHOME corpus:

But that’s it’s not, it’s not, it’s, uh, it’s a word play on what you just said.

It takes some time to understand this sentence — the listener must filter out the extraneous words and resolve all of the nots. Removing the disfluencies makes the sentence much easier to read and understand:

But it’s a word play on what you just said.

While people generally don’t even notice disfluencies in day-to-day conversation, early foundational work in computational linguistics demonstrated how common they are. In 1994, using the Switchboard corpus, Elizabeth Shriberg demonstrated that there is a 50% probability for a sentence of 10–13 words to include a disfluency and that the probability increases with sentence length.

The proportion of sentences from the Switchboard dataset with at least one disfluency plotted against sentence length measured in non-disfluent (i.e., efficient) tokens in the sentence. The longer a sentence gets, the more likely it is to contain a disfluency.

In “Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection”, we present research findings on how to “clean up” transcripts of spoken text. We create more readable transcripts and captions of human speech by finding and removing disfluencies in people’s speech. Using labeled data, we created machine learning (ML) algorithms that identify disfluencies in human speech. Once those are identified we can remove the extra words to make transcripts more readable. This also improves the performance of natural language processing (NLP) algorithms that work on transcripts of human speech. Our work puts special priority on ensuring that these models are able to run on mobile devices so that we can protect user privacy and preserve performance in scenarios with low connectivity.

Base Model Overview
At the core of our base model is a pre-trained BERTBASE encoder with 108.9 million parameters. We use the standard per-token classifier configuration, with a binary classification head being fed by the sequence encodings for each token.

Illustration of how tokens in text become numerical embeddings, which then lead to output labels.




We refined the BERT encoder by continuing the pretraining on comments from the Pushshift Reddit dataset from 2019. Reddit comments are not speech data, but they are more informal and conversational than the wiki and book data used in the original pretraining. This trains the encoder to better understand informal language, but may run the risk of internalizing some of the biases inherent in the data. For our particular use case, however, the model only captures the syntax or overall form of the text, not its content, which avoids potential issues related to semantic-level biases in the data.

We fine-tune our model for disfluency classification on hand-labeled corpora, such as the Switchboard corpus mentioned above. Hyperparameters (batch size, learning rate, number of training epochs, etc.) were optimized using Vizier.

We also produce a range of “small” models for use on mobile devices using a knowledge distillation technique known as “self training”. Our best small model is based on the Small-vocab BERT variant with 3.1 million parameters. This smaller model achieves comparable results to our baseline at 1% the size (in MiB). You can read more about how we achieved this model miniaturization in our 2021 Interspeech paper.

Some of the latest use cases for automatic speech transcription include automated live captioning, such as that produced by the Android “Live Captions” feature, which automatically transcribes spoken language in audio being played on the device. For disfluency removal to improve the readability of captions in this setting, it must happen quickly and in a stable manner. That is, the model should not change its past predictions as it sees new words in the transcript.

We call this live token-by-token processing streaming. Accurate streaming is difficult because of temporal dependencies; most disfluencies are only recognizable later. For example, a repetition does not actually become a repetition until the second time the word or phrase is said.

To investigate whether our disfluency detection model is effective in streaming applications, we split the utterances in our training set into prefix segments, where only the first N tokens of the utterance were provided at training time, for all values of N up to the full length of the utterance. We evaluated the model simulating a stream of spoken text by feeding prefixes to the models and measuring the performance with several metrics that capture model accuracy, stability, and latency including streaming F1, time to detection (TTD), edit overhead (EO), and average wait time (AWT). We experimented with look-ahead windows of either one or two tokens, allowing the model to “peek” ahead at additional tokens for which the model is not required to produce a prediction. In essence, we’re asking the model to “wait” for one or two more tokens of evidence before making a decision.
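The prefix setup described above can be illustrated with a short sketch. This is not the code used in the paper; the function name `streaming_prefixes` and the way the look-ahead window is represented (as extra visible tokens beyond the ones that must be labeled) are assumptions made for illustration.

```python
from typing import Iterator, List, Tuple


def streaming_prefixes(
    tokens: List[str], lookahead: int = 0
) -> Iterator[Tuple[List[str], int]]:
    """Yield (visible_tokens, n_to_label) pairs that simulate a token stream.

    At step n the model must label the first n tokens. With a look-ahead
    window it may also "peek" at up to `lookahead` further tokens for which
    no prediction is required yet.
    """
    for n in range(1, len(tokens) + 1):
        visible_end = min(n + lookahead, len(tokens))
        yield tokens[:visible_end], n


# Hypothetical utterance, split on whitespace for simplicity.
utterance = "it's not it's a word play".split()
for visible, n_to_label in streaming_prefixes(utterance, lookahead=1):
    print(n_to_label, visible)
```

With `lookahead=0` every prefix contains exactly the tokens to be labeled; a larger value trades latency for extra evidence, which is the trade-off the metrics above are meant to capture.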

While adding this fixed look-ahead did improve the stability and streaming F1 scores in many contexts, we found that in some cases the label was already clear even without looking ahead to the next token and the model did not necessarily benefit from waiting. Other times, waiting for just one extra token was sufficient. We hypothesized that the model itself could learn when it should wait for more context. Our solution was a modified model architecture that includes a “wait” classification head that decides when the model has seen enough evidence to trust the disfluency classification head.

Diagram showing how the model labels input tokens as they arrive. The BERT embedding layers feed into two separate classification heads, which are combined for the output.




We constructed a training loss function that is a weighted sum of three factors:

  1. The traditional cross-entropy loss for the disfluency classification head
  2. A cross-entropy term that only considers up to the first token with a “wait” classification
  3. A latency penalty that discourages the model from waiting too long to make a prediction
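To make the structure of this loss concrete, here is a minimal sketch of a weighted sum of the three terms for a single utterance. The exact formulation is in the paper; everything specific here is an assumption for illustration, including the function names, the 0.5 wait threshold, the default weights, and the choice to measure latency as the fraction of tokens the model defers.

```python
import math
from typing import Sequence


def binary_cross_entropy(p: float, label: int, eps: float = 1e-7) -> float:
    """Cross-entropy of one predicted probability against a 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def streaming_loss(
    disfluency_probs: Sequence[float],
    wait_probs: Sequence[float],
    labels: Sequence[int],
    weights: Sequence[float] = (1.0, 1.0, 0.1),  # assumed weights
    wait_threshold: float = 0.5,  # assumed decision threshold
) -> float:
    per_token = [binary_cross_entropy(p, y) for p, y in zip(disfluency_probs, labels)]

    # Term 1: standard cross-entropy over every token.
    full_ce = sum(per_token) / len(per_token)

    # Term 2: cross-entropy restricted to tokens before the first "wait".
    first_wait = next(
        (i for i, w in enumerate(wait_probs) if w >= wait_threshold),
        len(per_token),
    )
    committed = per_token[:first_wait]
    truncated_ce = sum(committed) / len(committed) if committed else 0.0

    # Term 3: latency penalty, here the fraction of tokens that are deferred.
    latency = (len(per_token) - first_wait) / len(per_token)

    w_full, w_truncated, w_latency = weights
    return w_full * full_ce + w_truncated * truncated_ce + w_latency * latency
```

Raising the latency weight pushes the model to commit earlier; lowering it lets the model wait for more evidence at the cost of a longer average wait time.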

We evaluated this streaming model as well as the standard baseline with no look-ahead and with both 1- and 2-token look-ahead values:

Graph of the streaming F1 score versus the average wait time in tokens. Three data points indicate F1 scores above 0.82 across multiple wait times. The proposed streaming model achieves near top performance with much shorter wait times than the fixed look ahead models.

The streaming model achieved a better streaming F1 score than both a standard baseline with no look ahead and a model with a look ahead of 1. It performed nearly as well as the variant with fixed look ahead of 2, but with much less waiting. On average the model waited for only 0.21 tokens of context.

Our best outcomes so far have been with English transcripts. This is mostly due to resourcing issues: while there are a number of relatively large labeled conversational datasets that include disfluencies in English, other languages often have very few such datasets available. So, in order to make disfluency detection models available outside English, we need a way to build models that does not require finding and labeling hundreds of thousands of utterances in each target language. A promising solution is to leverage multi-language versions of BERT to transfer what a model has learned about English disfluencies to other languages in order to achieve similar performance with much less data. This is an area of active research, but we do have some promising results to outline here.

As a first effort to validate this approach, we added labels to about 10,000 lines of dialogue from the German CALLHOME dataset. We then started with the Geotrend English and German Bilingual BERT model (extracted from Multilingual BERT) and fine-tuned it with approximately 77,000 disfluency-labeled English Switchboard examples and 1.3 million examples of self-labeled transcripts from the Fisher Corpus. Then, we did further fine-tuning with about 7,500 in-house–labeled examples from the German CALLHOME dataset.

Diagram illustrating the flow of labeled data and self-trained output in our best multilingual training setup. By training on both English and German data we are able to improve performance via transfer learning.

Our results indicate that fine-tuning on a large English corpus can produce acceptable precision using zero-shot transfer to similar languages like German, but at least a modest amount of German labels were needed to improve recall from less than 60% to greater than 80%. Two-stage fine-tuning of an English-German bilingual model produced the highest precision and overall F1 score.

| Approach | Precision | Recall | F1 |
| --- | --- | --- | --- |
| German BERTBASE model fine-tuned on 7,300 human-labeled German CALLHOME examples | 89.1% | 81.3% | 85.0 |
| Same as above but with additional 7,500 self-labeled German CALLHOME examples | 91.5% | 83.3% | 87.2 |
| English/German Bilingual BERTBASE model fine-tuned on English Switchboard+Fisher, evaluated on German CALLHOME (zero-shot language transfer) | 87.2% | 59.1% | 70.4 |
| Same as above but subsequently fine-tuned with 14,800 German CALLHOME (human- and self-labeled) examples | 95.5% | 82.6% | 88.6 |
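As a quick sanity check on how the F1 column relates to the other two, the sketch below recomputes F1 as the harmonic mean of precision and recall from the reported figures. The short row labels are ours, and because the published precision and recall are already rounded, the recomputed values can differ from the table in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) as reported in the results above.
reported = {
    "German only, human-labeled": (89.1, 81.3),
    "German only, plus self-labeled": (91.5, 83.3),
    "Bilingual, zero-shot": (87.2, 59.1),
    "Bilingual, two-stage fine-tuning": (95.5, 82.6),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}")
```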

Cleaning up disfluencies from transcripts can improve not just their readability for people, but also the performance of other models that consume transcripts. We demonstrate effective methods for identifying disfluencies and expand our disfluency model to resource-constrained environments, new languages, and more interactive use cases.

Thank you to Vicky Zayats, Johann Rocholl, Angelica Chen, Noah Murad, Dirk Padfield, and Preeti Mohan for writing the code, running the experiments, and composing the papers discussed here. We also thank our technical product manager Aaron Schneider, Bobby Tran from the Cerebra Data Ops team, and Chetan Gupta from Speech Data Ops for their support obtaining additional data labels.


Secure Amazon SageMaker Studio presigned URLs Part 1: Foundational infrastructure

You can access Amazon SageMaker Studio notebooks from the Amazon SageMaker console via AWS Identity and Access Management (IAM) authenticated federation from your identity provider (IdP), such as Okta. When a Studio user opens the notebook link, Studio validates the federated user’s IAM policy to authorize access, and generates and resolves the presigned URL for the user. Because the SageMaker console runs on an internet domain, this generated presigned URL is visible in the browser session. This presents an undesired threat vector for exfiltration and gaining access to customer data when proper access controls are not enforced.

Studio supports a few methods for enforcing access controls against presigned URL data exfiltration:

  • Client IP validation using the IAM policy condition aws:sourceIp
  • Client VPC validation using the IAM condition aws:sourceVpc
  • Client VPC endpoint validation using the IAM policy condition aws:sourceVpce

When you access Studio notebooks from the SageMaker console, the only available option is to use client IP validation with the IAM policy condition aws:sourceIp. However, many enterprises route browser traffic through products such as Zscaler to ensure scale and compliance for workforce internet access. These traffic routing products generate their own source IP, whose IP range is not controlled by the enterprise customer. This makes it impossible for these enterprise customers to use the aws:sourceIp condition.

To use client VPC endpoint validation using the IAM policy condition aws:sourceVpce, the creation of a presigned URL needs to originate in the same customer VPC where Studio is deployed, and resolution of the presigned URL needs to happen via a Studio VPC endpoint on the customer VPC. This resolution of the presigned URL during access time for corporate network users can be accomplished using DNS forwarding rules (both in Zscaler and corporate DNS) and then into the customer VPC endpoint using an AWS Route 53 inbound resolver.

In this part, we discuss the overarching architecture for securing Studio presigned URLs and demonstrate how to set up the foundational infrastructure to create and launch a Studio presigned URL through your VPC endpoint over a private network without traversing the internet. This serves as the foundational layer for preventing data exfiltration by external bad actors who gain access to a Studio presigned URL, and for preventing unauthorized or spoofed corporate user access within a corporate environment.

Solution overview

The following diagram illustrates the overarching solution architecture.

The process includes the following steps:

  1. A corporate user authenticates via their IdP, connects to their corporate portal, and opens the Studio link from the corporate portal.
  2. The corporate portal application makes a private API call using an API Gateway VPC endpoint to create a presigned URL.
  3. The API Gateway VPC endpoint “create presigned URL” call is forwarded to the Route 53 inbound resolver on the customer VPC as configured in the corporate DNS.
  4. The VPC DNS resolver resolves it to the API Gateway VPC endpoint IP. Optionally, it looks up a private hosted zone record if it exists.
  5. The API Gateway VPC endpoint routes the request via the Amazon private network to the “create presigned URL API” running in the API Gateway service account.
  6. API Gateway invokes the create-pre-signedURL private API and proxies the request to the create-pre-signedURL Lambda function.
  7. The create-pre-signedURL Lambda call is invoked via the Lambda VPC endpoint.
  8. The create-pre-signedURL function runs in the service account, retrieves the authenticated user context (user ID, Region, and so on), looks up a mapping table to identify the SageMaker domain and user profile identifier, makes a SageMaker CreatePresignedDomainUrl API call, and generates a presigned URL. The Lambda service role has the source VPC endpoint conditions defined for the SageMaker API and Studio.
  9. The generated presigned URL is resolved over the Studio VPC endpoint.
  10. Studio validates that the presigned URL is being accessed via the customer’s VPC endpoint defined in the policy, and returns the result.
  11. The Studio notebook is returned to the user’s browser session over the corporate network without traversing the internet.
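The core of step 8 can be sketched in a few lines of Python. This is an illustration rather than the Lambda function deployed by this solution: the in-memory `USER_MAPPING` table, the example domain ID, and the function name are hypothetical. The client is passed in so the logic can be exercised without AWS access; in a Lambda function it would typically be `boto3.client("sagemaker")`, whose `create_presigned_domain_url` call returns the URL under the `AuthorizedUrl` key.

```python
from typing import Any, Dict, Tuple

# Hypothetical mapping table: authenticated user ID -> (domain ID, user profile name).
USER_MAPPING: Dict[str, Tuple[str, str]] = {
    "employee-123": ("d-exampledomain", "employee-123"),
}


def create_studio_presigned_url(
    sagemaker_client: Any, user_id: str, expires_in_seconds: int = 60
) -> str:
    """Look up the caller's Studio profile and request a presigned URL for it."""
    domain_id, user_profile_name = USER_MAPPING[user_id]
    response = sagemaker_client.create_presigned_domain_url(
        DomainId=domain_id,
        UserProfileName=user_profile_name,
        ExpiresInSeconds=expires_in_seconds,
    )
    return response["AuthorizedUrl"]


# Example wiring inside a Lambda function (requires boto3 and AWS credentials):
#   import boto3
#   url = create_studio_presigned_url(boto3.client("sagemaker"), "employee-123")
```

Keeping the expiry short limits how long a leaked URL remains usable, while the VPC endpoint conditions on the Lambda service role restrict where the URL can be created and resolved.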

The following sections walk you through how to implement this architecture to resolve Studio presigned URLs from a corporate network using VPC endpoints. We demonstrate a complete implementation by showing the following steps:

  1. Set up the foundational architecture.
  2. Configure the corporate app server to access a SageMaker presigned URL via a VPC endpoint.
  3. Set up and launch Studio from the corporate network.

Set up the foundational architecture

In the post Access an Amazon SageMaker Studio notebook from a corporate network, we demonstrated how to resolve a presigned URL domain name for a Studio notebook from a corporate network without traversing the internet. You can follow the instructions in that post to set up the foundational architecture, and then return to this post and proceed to the next step.

Configure the corporate app server to access a SageMaker presigned URL via a VPC endpoint

To enable accessing Studio from your internet browser, we set up an on-premises app server on Windows Server on the on-premises VPC public subnet. However, the DNS queries for accessing Studio are routed through the corporate (private) network. Complete the following steps to configure routing Studio traffic through the corporate network:

  1. Connect to your on-premises Windows app server.

  2. Choose Get Password then browse and upload your private key to decrypt your password.
  3. Use an RDP client and connect to the Windows Server using your credentials.
    Resolving Studio DNS from the Windows Server command prompt results in using public DNS servers, as shown in the following screenshot.
    Now we update Windows Server to use the on-premises DNS server that we set up earlier.
  4. Navigate to Control Panel, Network and Internet, and choose Network Connections.
  5. Right-click Ethernet and choose the Properties tab.
  6. Update Windows Server to use the on-premises DNS server.
  7. Now you update your preferred DNS server with your DNS server IP.
  8. Navigate to VPC and Route Tables and choose your STUDIO-ONPREM-PUBLIC-RT route table.
  9. Add a route to with the target as the peering connection that we created during the foundational architecture setup.

Set up and launch Studio from your corporate network

To set up and launch Studio, complete the following steps:

  1. Download Chrome and launch the browser on this Windows instance.
    You may need to turn off Internet Explorer Enhanced Security Configuration to allow file downloads and then enable file downloads.
  2. In your local device Chrome browser, navigate to the SageMaker console and open the Chrome developer tools Network tab.
  3. Launch the Studio app and observe the Network tab for the authtoken parameter value, which includes the generated presigned URL along with the remote server address that the URL is routed to for resolution. In this example, the remote address is one of the public DNS server addresses used to resolve the SageMaker DNS domain name.
  4. Repeat these steps from the Amazon Elastic Compute Cloud (Amazon EC2) Windows instance that you configured as part of the foundational architecture.

We can observe that the remote address is not the public DNS IP; instead, it’s the Studio VPC endpoint.


In this post, we demonstrated how to resolve a Studio presigned URL from a corporate network using Amazon private VPC endpoints without exposing the presigned URL resolution to the internet. This further secures your enterprise security posture for accessing Studio from a corporate network for building highly secure machine learning workloads on SageMaker. In part 2 of this series, we further extend this solution to demonstrate how to build a private API for accessing Studio with aws:sourceVPCE IAM policy validation and token authentication. Try out this solution and leave your feedback in the comments!

About the Authors

Ram Vittal is a machine learning solutions architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and Big Data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys tennis and photography.

Neelam Koshiya is an enterprise solution architect at AWS. Her current focus is to help enterprise customers with their cloud adoption journey for strategic business outcomes. In her spare time, she enjoys reading and being outdoors.


Secure Amazon SageMaker Studio presigned URLs Part 2: Private API with JWT authentication

In part 1 of this series, we demonstrated how to resolve an Amazon SageMaker Studio presigned URL from a corporate network using Amazon private VPC endpoints without traversing the internet. In this post, we build on that solution to demonstrate how to build a private API with Amazon API Gateway as a proxy interface to generate and access Amazon SageMaker presigned URLs. Furthermore, we add a guardrail to ensure presigned URLs are only generated and accessed for the authenticated end-user within the corporate network.

Solution overview

The following diagram illustrates the architecture of the solution.

The process includes the following steps:

  1. In the Amazon Cognito user pool, first set up a user with the name matching their Studio user profile and register Studio as the app client in the user pool.
  2. The user federates from their corporate identity provider (IdP) and authenticates with the Amazon Cognito user pool for accessing Studio.
  3. Amazon Cognito returns a token to the user authorizing access to the Studio application.
  4. The user invokes the createStudioPresignedUrl API on API Gateway along with a token in the header.
  5. API Gateway invokes a custom AWS Lambda authorizer and validates the token.
  6. When the token is valid, Amazon Cognito returns an access grant policy with the Studio user profile ID to API Gateway.
  7. API Gateway invokes the createStudioPresignedUrl Lambda function for creating the Studio presigned URL.
  8. The createStudioPresignedUrl function creates a presigned URL using the SageMaker API VPC endpoint and returns it to the caller.
  9. The user accesses the presigned URL from their corporate network, where it resolves over the Studio VPC endpoint.
  10. The function’s AWS Identity and Access Management (IAM) policy makes sure that the presigned URL creation and access are performed via VPC endpoints.

The following sections walk you through solution deployment, configuration, and validation for the API Gateway private API for creating and resolving a Studio presigned URL from a corporate network using VPC endpoints.

  1. Deploy the solution
  2. Configure the Amazon Cognito user
  3. Authenticating the private API for the presigned URL using a JSON Web Token
  4. Configure the corporate DNS server for accessing the private API
  5. Test the API Gateway private API for a presigned URL from the corporate network
  6. Pre-Signed URL Lambda Auth Policy
  7. Cleanup

Deploy the solution

You can deploy the solution through either the AWS Management Console or the AWS Serverless Application Model (AWS SAM).

To deploy the solution via the console, launch the following AWS CloudFormation template in your account by choosing Launch Stack. It takes approximately 10 minutes for the CloudFormation stack to complete.

To deploy the solution using AWS SAM, you can find the latest code in the aws-security GitHub repository, where you can also contribute to the sample code. The following commands show how to deploy the solution using the AWS SAM CLI. If not currently installed, install the AWS SAM CLI.

  1. Clone the repository at
  2. After you clone the repo, navigate to the source and run the following code:
    sam deploy --guided

Configure the Amazon Cognito user

To configure your Amazon Cognito user, complete the following steps:

  1. Create an Amazon Cognito user with the same name as a SageMaker user profile:
    aws cognito-idp admin-create-user --user-pool-id <user_pool_id> --username <sagemaker_username>

  2. Set the user password:
    aws cognito-idp admin-set-user-password --user-pool-id <user_pool_id> --username <sagemaker_username> --password <password> --permanent

  3. Get an access token:
    aws cognito-idp initiate-auth --auth-flow USER_PASSWORD_AUTH --client-id <cognito_app_client_id> --auth-parameters USERNAME=<sagemaker_username>,PASSWORD=<password>

Authenticating the private API for the presigned URL using a JSON Web Token

When you deployed a private API for creating a SageMaker presigned URL, you added a guardrail that prevents anyone outside the corporate network and VPC endpoint from accessing the presigned URL. However, without another control on the private API within the corporate network, any internal user would be able to pass unauthenticated parameters for the SageMaker user profile and access any SageMaker app.

To mitigate this issue, we propose passing a JSON Web Token (JWT) for the authenticated caller to the API Gateway and validating that token with a JWT authorizer. There are multiple options for implementing an authorizer for the private API Gateway, using either a custom Lambda authorizer or Amazon Cognito.

With a custom Lambda authorizer, you can embed a SageMaker user profile name in the returned policy. This prevents any users within the corporate network from being able to send any SageMaker user profile name for creating a presigned URL that they’re not authorized to create. We use Amazon Cognito to generate our tokens and a custom Lambda authorizer to validate and return the appropriate policy. (For more information, refer to Building fine-grained authorization using Amazon Cognito, API Gateway, and IAM). The Lambda authorizer uses the Amazon Cognito user name as the user profile name.
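The following sketch shows the general shape of a token-based Lambda authorizer that embeds the user profile name in its response. It is not the authorizer deployed by this solution. Token verification is deliberately left as an injected `verify_token` callable (in practice it would check the JWT signature and claims against the Amazon Cognito user pool), and the `userProfileName` context key and the use of the `username` claim are assumptions for illustration.

```python
from typing import Any, Callable, Dict


def build_authorizer_response(
    principal_id: str, effect: str, method_arn: str, user_profile_name: str = ""
) -> Dict[str, Any]:
    """Return an API Gateway Lambda authorizer response with an IAM policy."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": method_arn,
                }
            ],
        },
        # Context values are passed through to the backend integration.
        "context": {"userProfileName": user_profile_name},
    }


def make_authorizer(verify_token: Callable[[str], Dict[str, Any]]):
    """Build a handler around a token verifier (hypothetical, supplied by you)."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
        raw = event.get("authorizationToken", "")
        token = raw[len("Bearer "):] if raw.startswith("Bearer ") else raw
        try:
            claims = verify_token(token)
            username = claims["username"]  # assumed claim name
        except Exception:
            # Any verification failure results in an explicit Deny.
            return build_authorizer_response("unauthorized", "Deny", event["methodArn"])
        return build_authorizer_response(username, "Allow", event["methodArn"], username)

    return handler
```

Because the profile name comes from the verified token rather than from a request parameter, a caller cannot ask for a presigned URL on behalf of a different user profile.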

If you’re unable to use Amazon Cognito, you can develop a custom application to authenticate and pass end-user tokens to the Lambda authorizer. For more information, refer to Use API Gateway Lambda authorizers.

Configure the corporate DNS server for accessing the private API

To configure your corporate DNS server, complete the following steps:

  1. On the Amazon Elastic Compute Cloud (Amazon EC2) console, choose your on-premises DNSA EC2 instance and connect via Systems Manager Session Manager.
  2. Add a zone record in the /etc/named.conf file for resolving to the API Gateway’s DNS name via your Amazon Route 53 inbound resolver, as shown in the following code:
    zone "zxgua515ef.execute-api.<region>" {
      type forward;
      forward only;
      forwarders { <inbound_resolver_ip_1>; <inbound_resolver_ip_2>; };
    };

  3. Restart the named service using the following command:
    sudo service named restart

Validate requesting a presigned URL from the API Gateway private API for authorized users

In a real-world scenario, you would implement a front-end interface that passes the appropriate Authorization headers for authenticated and authorized resources, using either a custom solution or AWS Amplify. For brevity, the following steps use Postman to quickly validate that the solution we deployed restricts requesting the presigned URL for an internal user unless they are authorized to do so.

To validate the solution with Postman, complete the following steps:

  1. Install Postman on the WINAPP EC2 instance.
  2. Open Postman and add the access token to your Authorization header:
    Authorization: Bearer <access token>

  3. Modify the API Gateway URL to access it from your internal EC2 instance:
    1. Add the VPC endpoint into your API Gateway URL:

    2. Add the Host header with a value of your API Gateway URL:

    3. First, change the EMPLOYEE_ID to your Amazon Cognito user and SageMaker user profile name. Make sure you receive an authorized presigned URL.
    4. Then change the EMPLOYEE_ID to a user that is not yours and make sure you receive an access failure.
  4. On the Amazon EC2 console, choose your on-premises WINAPP instance and connect via your RDP client.
  5. Open a Chrome browser and navigate to your authorized presigned URL to launch Studio.

Studio launches over the VPC endpoint, with the remote address shown as the Studio VPC endpoint IP.

If the presigned URL is accessed outside of the corporate network, the resolution fails because the IAM policy condition for the presigned URL enforces creation and access from a VPC endpoint.
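If you prefer a script to Postman, the same request can be assembled with the Python standard library. The sketch below only builds the request object; the URL and host values are placeholders that you would replace with your own VPC endpoint URL and API Gateway host name, and the function name is hypothetical.

```python
import urllib.request


def build_presigned_url_request(
    vpc_endpoint_url: str, api_gateway_host: str, access_token: str
) -> urllib.request.Request:
    """Build a request that targets the VPC endpoint but names the API in Host."""
    return urllib.request.Request(
        vpc_endpoint_url,
        headers={
            "Host": api_gateway_host,
            "Authorization": f"Bearer {access_token}",
        },
    )


# Placeholder values for illustration only.
request = build_presigned_url_request(
    "https://vpc-endpoint.example.invalid/stage/EMPLOYEE_ID",
    "api-gateway.example.invalid",
    "<access token>",
)
# To send it from inside the corporate network:
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode())
```

Sending this with your own user name in the path should return a presigned URL, while substituting another user's name should be rejected by the authorizer.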

Pre-Signed URL Lambda Auth Policy

The solution created the following auth policy for the Lambda function that generates the presigned URL for accessing SageMaker Studio.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Condition": {
                    "IpAddress": {
                        "aws:VpcSourceIp": "<vpc_cidr>"
                    }
                },
                "Action": "sagemaker:CreatePresignedDomainUrl",
                "Resource": "arn:aws:sagemaker:<region>:<account-id>:user-profile/*/*",
                "Effect": "Allow"
            },
            {
                "Condition": {
                    "IpAddress": {
                        "aws:SourceIp": "<corporate_network_cidr>"
                    }
                },
                "Action": "sagemaker:CreatePresignedDomainUrl",
                "Resource": "arn:aws:sagemaker:<region>:<account-id>:user-profile/*/*",
                "Effect": "Allow"
            },
            {
                "Condition": {
                    "StringEquals": {
                        "aws:sourceVpce": ["<sagemaker_api_vpc_endpoint_id>"]
                    }
                },
                "Action": "sagemaker:CreatePresignedDomainUrl",
                "Resource": "arn:aws:sagemaker:<region>:<account-id>:user-profile/*/*",
                "Effect": "Allow"
            }
        ]
    }

The preceding policy enforces that the Studio presigned URL is both generated and accessed via one of these three entry points:

  1. aws:VpcSourceIp as your AWS VPC CIDR
  2. aws:SourceIp as your corporate network CIDR
  3. aws:sourceVpce as your SageMaker API VPC endpoints


Cleanup

To avoid incurring ongoing charges, delete the CloudFormation stacks you created. Alternatively, if you deployed the solution using AWS SAM, authenticate to the AWS account where the solution was deployed and run sam delete.


In this post, we demonstrated how to access Studio using a private API Gateway from a corporate network using Amazon private VPC endpoints, preventing access to presigned URLs outside the corporate network, and securing the API Gateway with a JWT authorizer using Amazon Cognito and custom Lambda authorizers.

Try out this solution, experiment with integrating it into your corporate portal, and leave your feedback in the comments!

About the Authors

Ram Vittal is a machine learning solutions architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and Big Data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys tennis, photography, and action movies.

Jonathan Nguyen is a Shared Delivery Team Senior Security Consultant at AWS. His background is in AWS Security with a focus on Threat Detection and Incident Response. Today, he helps enterprise customers develop a comprehensive AWS Security strategy, deploy security solutions at scale, and train customers on AWS Security best practices.

Chris Childers is a Cloud Infrastructure Architect in Professional Services at AWS. He works with AWS customers to design and automate their cloud infrastructure and improve their adoption of DevOps culture and processes.


Three Wheeling: Startup Faction Develops Affordable Tri-Wheel AVs on NVIDIA DRIVE

Some things are easy as A, B, C. But when it comes to autonomous vehicles, the key may be in one, two, three.

Faction, a Bay Area-based startup and NVIDIA Inception member, is preparing to debut its business-to-business autonomous delivery service, with three-wheel production electric vehicles purpose-built for driverless operation, streamlining time to market.

In addition, the company has built its autonomous driving system on NVIDIA DRIVE AGX for robust, automotive-grade AI compute.

The demand for last-mile enterprise delivery has significantly increased over the past decade, with few signs of slowing down. The number of business-to-business parcels grew from 7 billion to 11 billion from 2019 to 2021, according to ABI Research. The firm expects this number to continue rising, to reach 75 billion in 2030.

However, with a rising labor shortage that has hit the trucking industry especially hard, it’s difficult for driver supply to meet this demand.

Faction aims to narrow this gap with affordable, production autonomous vehicles ready to hit the road this year.

Smaller Vehicles, Bigger Brains

Faction’s flagship vehicle, the D1, is built on EV maker Arcimoto’s low-cost vehicle platform. The vehicle is designed to be completely driverless, combining autonomous driving and teleoperation to navigate delivery routes.

The D1 delivery vehicle can reach speeds up to 75 miles per hour, with over 100 miles of battery range, and tote 500 pounds of cargo.

Inside the vehicle, NVIDIA DRIVE AGX delivers high-performance and energy-efficient AI compute for autonomous driving.

The centralized platform runs the redundant and diverse deep neural networks that power the vehicle’s AI capabilities, while leaving enough compute headroom to continuously add new features. It’s also automotive grade, achieving systematic safety standards such as ISO 26262 ASIL-D.

“Our goal is to deploy cost-efficient autonomous vehicles in the near term,” said Faction CEO Ain McKendrick. “We chose NVIDIA DRIVE because it’s an automotive-grade platform that meets our needs today.”

Making the Inception Connection

As a member of NVIDIA Inception, Faction taps into the latest AI technologies and expertise to create vehicles that are always at the cutting edge.

Inception supports all stages of a startup’s life cycle. NVIDIA works closely with members to provide the best technical tools, latest resources and opportunities to connect with investors.

McKendrick added that Inception has helped Faction take full advantage of the latest software tools for faster iteration and streamlined development.

Expanding Services

In addition to last-mile delivery, Faction is targeting its vehicles for the micro-mobility market.

The startup plans to launch single-rider vehicles next year that can be requested via an app. The vehicle will drive autonomously to the customer, who will then take control and manually drive it to their destination.

Faction’s single-rider shared mobility vehicle based on the ElectraMeccanica SOLO EV.

The goal is to meet single-rider demand with a cost-efficient and sustainable shared mobility offering.

By keeping its delivery and mobility vehicles in a compact package without sacrificing compute, Faction proves that three truly is a magic number.

The post Three Wheeling: Startup Faction Develops Affordable Tri-Wheel AVs on NVIDIA DRIVE appeared first on NVIDIA Blog.


Minerva: Solving Quantitative Reasoning Problems with Language Models

Language models have demonstrated remarkable performance on a variety of natural language tasks — indeed, a general lesson from many works, including BERT, GPT-3, Gopher, and PaLM, has been that neural networks trained on diverse data at large scale in an unsupervised way can perform well on a variety of tasks.

Quantitative reasoning is one area in which language models still fall far short of human-level performance. Solving mathematical and scientific questions requires a combination of skills, including correctly parsing a question with natural language and mathematical notation, recalling relevant formulas and constants, and generating step-by-step solutions involving numerical calculations and symbolic manipulation. Due to these challenges, it is often believed that solving quantitative reasoning problems using machine learning will require significant advancements in model architecture and training techniques, granting models access to external tools such as Python interpreters, or possibly a more profound paradigm shift.

In “Solving Quantitative Reasoning Problems With Language Models” (to be released soon on the arXiv), we present Minerva, a language model capable of solving mathematical and scientific questions using step-by-step reasoning. We show that by focusing on collecting training data that is relevant for quantitative reasoning problems, training models at scale, and employing best-in-class inference techniques, we achieve significant performance gains on a variety of difficult quantitative reasoning tasks. Minerva solves such problems by generating solutions that include numerical calculations and symbolic manipulation without relying on external tools such as a calculator. The model parses and answers mathematical questions using a mix of natural language and mathematical notation. Minerva combines several techniques, including few-shot prompting, chain of thought or scratchpad prompting, and majority voting, to achieve state-of-the-art performance on STEM reasoning tasks. You can explore Minerva’s output with our interactive sample explorer!

Solving a multi-step problem: A question from the MATH dataset and Minerva’s solution. The model writes down a line equation, simplifies it, substitutes a variable, and solves for y.

A Model Built for Multi-step Quantitative Reasoning
To promote quantitative reasoning, Minerva builds on the Pathways Language Model (PaLM), with further training on a 118GB dataset of scientific papers from the arXiv preprint server and web pages that contain mathematical expressions using LaTeX, MathJax, or other mathematical typesetting formats. Standard text cleaning procedures often remove symbols and formatting that are essential to the semantic meaning of mathematical expressions. By maintaining this information in the training data, the model learns to converse using standard mathematical notation.

Example questions from the Joint Entrance Examination Main Math 2020 exam, taken each year by almost 2M Indian high-school students intending to study engineering and similar fields (left), and the National Math Exam in Poland (May 2022), taken by approximately 270K high-school students every year (right).
A dataset for quantitative reasoning: Careful data processing preserves mathematical information, allowing the model to learn mathematics at a higher level.
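
To make the point about text cleaning concrete, here is a minimal sketch. It is not Minerva's actual data pipeline; the function names, the `$...$` delimiter convention, and the character whitelist are all assumptions chosen for illustration. It contrasts an aggressive cleaner with one that leaves inline math spans untouched.

```python
import re

# Assumption for this sketch: inline math is delimited by single dollar signs.
MATH_SPAN = re.compile(r"(\$[^$]*\$)")


def naive_clean(text: str) -> str:
    """Hypothetical aggressive cleaner: keep letters, digits, spaces and . , ? only."""
    return re.sub(r"[^A-Za-z0-9 .,?]", "", text)


def math_preserving_clean(text: str) -> str:
    """Apply naive_clean outside $...$ spans and copy math spans through unchanged."""
    parts = MATH_SPAN.split(text)
    # With one capturing group, re.split puts the matched math spans at odd indices.
    cleaned = [part if i % 2 == 1 else naive_clean(part) for i, part in enumerate(parts)]
    return "".join(cleaned)


sample = r"Solve $x^2 = \frac{1}{4}$ for x > 0!"
print(naive_clean(sample))            # the exponent and fraction are no longer recoverable
print(math_preserving_clean(sample))  # the LaTeX expression survives intact
```

The naive version destroys exactly the symbols that carry the mathematical meaning, which is the failure mode the paragraph above describes.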

Minerva also incorporates recent prompting and evaluation techniques to better solve mathematical questions. These include chain of thought or scratchpad prompting — where Minerva is prompted with several step-by-step solutions to existing questions before being presented with a new question — and majority voting. Like most language models, Minerva assigns probabilities to different possible outputs. When answering a question, rather than taking the single solution Minerva scores as most likely, multiple solutions are generated by sampling stochastically from all possible outputs. These solutions are different (e.g., the steps are not identical), but often arrive at the same final answer. Minerva uses majority voting on these sampled solutions, taking the most common result as the conclusive final answer.

Majority voting: Minerva generates multiple solutions to each question and chooses the most common answer as the solution, improving performance significantly.
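
The voting step can be sketched in a few lines. This is an illustration under stated assumptions, not Minerva's implementation: the `Final answer:` marker and both helper names are hypothetical, and the sampled solutions are hard-coded strings standing in for model output.

```python
from collections import Counter
from typing import Iterable, Optional


def extract_final_answer(solution: str) -> Optional[str]:
    """Hypothetical parser: take whatever follows the last 'Final answer:' marker."""
    marker = "Final answer:"
    if marker not in solution:
        return None
    return solution.rsplit(marker, 1)[1].strip()


def majority_vote(solutions: Iterable[str]) -> Optional[str]:
    """Return the most common final answer among sampled solutions, or None if none parse."""
    answers = [extract_final_answer(s) for s in solutions]
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Stand-ins for stochastically sampled solutions to the same question.
samples = [
    "Subtract 3 from both sides, then divide by 2. Final answer: 4",
    "Guess and check with x = 5. Final answer: 5",
    "Rewrite as 2x = 8, so x = 4. Final answer: 4",
    "The reasoning trails off without an answer",
]
print(majority_vote(samples))
```

Note that the intermediate steps of the two winning samples differ; only the final answers are compared.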

Evaluation on STEM Benchmarks
To test Minerva’s quantitative reasoning abilities we evaluated the model on STEM benchmarks ranging in difficulty from grade school level problems to graduate level coursework.

  • MATH: High school math competition level problems
  • MMLU-STEM: A subset of the Massive Multitask Language Understanding benchmark focused on STEM, covering topics such as engineering, chemistry, math, and physics at high school and college level.
  • GSM8k: Grade school level math problems involving basic arithmetic operations that should all be solvable by a talented middle school student.

We also evaluated Minerva on OCWCourses, a collection of college and graduate level problems covering a variety of STEM topics such as solid state chemistry, astronomy, differential equations, and special relativity that we collected from MIT OpenCourseWare.

In all cases, Minerva obtains state-of-the-art results, sometimes by a wide margin.

Evaluation results on MATH and MMLU-STEM, which include high school and college level questions covering a range of STEM topics.
Model                         MATH     MMLU-STEM     OCWCourses     GSM8k
Minerva                       50.3%    75%           30.8%          78.5%
Published state of the art    6.9%     55%           -              74.4%
Minerva 540B significantly improves state-of-the-art performance on STEM evaluation datasets.

What Minerva Gets Wrong
Minerva still makes its fair share of mistakes. To better identify areas where the model can be improved, we analyzed a sample of questions the model gets wrong, and found that most mistakes are easily interpretable. About half are calculation mistakes, and the other half are reasoning errors, where the solution steps do not follow a logical chain of thought.

It is also possible for the model to arrive at a correct final answer but with faulty reasoning. We call such cases “false positives”, as they erroneously count toward a model’s overall performance score. In our analysis, we find that the rate of false positives is relatively low (Minerva 62B produces less than 8% false positives on MATH).

Below are a couple of example mistakes the model makes.

Calculation mistake: The model incorrectly cancels the square root on both sides of the equation.
Reasoning mistake: The model computes the number of free throws at the fourth practice, but then uses this number as the final answer for the first practice.

Our approach to quantitative reasoning is not grounded in formal mathematics. Minerva parses questions and generates answers using a mix of natural language and LaTeX mathematical expressions, with no explicit underlying mathematical structure. This approach has an important limitation, in that the model’s answers cannot be automatically verified. Even when the final answer is known and can be verified, the model can arrive at a correct final answer using incorrect reasoning steps, which cannot be automatically detected. This limitation is not present in formal methods for theorem proving (e.g., see Coq, Isabelle, HOL, Lean, Metamath, and Mizar). On the other hand, an advantage of the informal approach is that it can be applied to a highly diverse set of problems which may not lend themselves to formalization.

Future Directions
While machine learning models have become impressive tools in many scientific disciplines, they are often narrowly scoped to solve specific tasks. We hope that general models capable of solving quantitative reasoning problems will help push the frontiers of science and education. Models capable of quantitative reasoning have many potential applications, including serving as useful aids for researchers, and enabling new learning opportunities for students. We present Minerva as a small step in this direction. To see more samples from Minerva, such as the one below, please visit the interactive sample explorer!

Solving a problem using calculus and trigonometry: A question from the MATH dataset asking for the speed of a particle in circular motion. Minerva finds a correct step-by-step solution. In the process, Minerva computes a time derivative and applies a trigonometric identity.

Minerva was a collaborative effort that spanned multiple teams in Google Research. We would like to thank our coauthors Aitor Lewkowycz, Ambrose Slone, Anders Andreassen, Behnam Neyshabur, Cem Anil, David Dohan, Henryk Michalewski, Imanol Schlag, Theo Gutman-Solo, Vedant Misra, Vinay Ramasesh, and Yuhuai Wu, as well as our collaborators Erik Zelikman and Yasaman Razeghi. Minerva builds upon the work of many others at Google, and we would like to thank the PaLM team, the T5X team, the Flaxformer team, and the JAX team for their efforts. We thank Tom Small for designing the animation in this post. We would also like to especially thank Vedant Misra for developing the Minerva sample explorer.


Mahima Pushkarna is making data easier to understand

Five years ago, information designer Mahima Pushkarna joined Google to make data easier to understand. As a senior interaction designer on the People + AI Research (PAIR) team, she designed Data Cards to help everyone better understand the contexts of the data they are using. The Data Cards Playbook puts Google’s AI Principles into practice by providing opportunities for feedback, relevant explanations and appeal.

Recently, Mahima’s paper on Data Cards (co-written with Googlers Andrew Zaldivar and Oddur Kjartansson) was accepted to the ACM Conference on Fairness, Accountability and Transparency (ACM FAccT). Let’s catch up with her and find out more about what brought her to Google.

How did your background lead you to the work you’re doing now?

I’ve always been fascinated by conjuring up solutions to things. The kind of questions that I’ve found meaningful are those that are never truly solved, or never have one correct answer. (The kind of questions that exasperate us!) Those have been the problems I am always drawn towards.

Early in my career, I realized the power in visualizing data, but spreadsheets were intimidating. I wondered how design could make communicating complexity easier. So I found myself in grad school in Boston studying information design and data visualization. I focused on how people experience data and how our relationships to each other and our contexts are mediated.

I joined Google Brain as the first visual designer in a full-time capacity, though I had no background in artificial intelligence or machine learning — this was the deep end of the pool. This opened up the space to explore human-AI interaction, and make AI more accessible to a broader class of developers. At PAIR, my work focuses on making information experiences more meaningful for developers, researchers and others who build AI technologies.

What’s it like to have a unique background as a designer on a technical AI research team?

When you’re an engineer and immersed in building technology, it’s easy to assume everyone has a similar experience to your own — especially when you’re surrounded by peers who share your expertise. The actual user experience is very personal and varies drastically across users and contexts. That particular clarity is what designers bring to the table.

I’ve been able to engage my engineering and research colleagues with simple, people-centered questions right in the very beginning. How are people using an AI tool? What are they learning from it? Who else might be involved in the conversation? Do they have the proficiency we assume they have?

How did you begin designing Data Cards?

This project started when I was working on another visualization toolkit, Facets, to communicate the skews and imbalances within datasets to help machine learning practitioners make informed decisions. At the time, transparency was a moving target. Andrew, Tulsee Doshi and I started to proactively think about fairness in data, and saw a huge gap in the documentation of human decisions that dot a dataset’s lifecycle.

This “invisible” information shapes how we use data and the outcomes of models trained on them. For example, a model trained on a dataset that captures age in just two or three buckets will have very different outcomes compared to a dataset with ten buckets. The goal of Data Cards is to make both visible and invisible information about datasets available and simple to understand, so people from a variety of backgrounds can knowledgeably make decisions.

As we cover in our FAccT paper, Andrew, Oddur and I arrived at two insights. The first is that identifying what we don’t know about data is just as important as articulating what we do know. In capturing these nuances, it is possible to narrow those knowledge gaps before even collecting data. The second thing that surprised us was the sheer number of people involved in a dataset’s life cycle, and how fragile knowledge is. Context is easily lost in translation both between and within teams, across documents, emails, people and time.

Data Cards stand on the shoulders of giants, like Data Sheets (Gebru, et al.) and Model Cards (Mitchell et al.). We’ve been immensely lucky to have had the support of many original authors on these seminal papers that have paved our path to FAccT.

How do you hope the paper is used across the tech industry?

Imagine a world in which finding verifiable information about the motivations of a dataset’s creators or performance of a model is as easy as learning about the ethical beliefs of a celebrity or the rating of a movie. Our vision for Data Cards is that they become a cultural mainstay — invisible, but their absence would be missed by ML practitioners.

In this paper, we introduce frameworks that other teams can use in their work. Alongside that, we’ve open-sourced the Data Cards Playbook, so we’re trying to lower the barrier to access in every way possible.


The Gaming Evolution Will Be Televised: GFN Thursday Levels Up the Living Room Experience on New Samsung TVs and More

Turn the TV on. GeForce NOW is leveling up gaming in the living room.

The Samsung Gaming Hub launched today, delivering GeForce NOW natively on 2022 Samsung Smart TVs.

Plus, the SHIELD Software Experience Upgrade 9.1 is now rolling out to all NVIDIA SHIELD TVs, delivering new gaming features that improve GeForce NOW.

Great living room gaming pairs perfectly with a great gaming controller. GeForce NOW members can claim a new reward for 20% off all SteelSeries gaming controllers, available through the end of August.

Gear up for the final game release in June with six games available to stream today, with titles from Motorsport Games joining the GeForce NOW library. And 13 additions are coming in July. The announcement arrives just in time to grab games at discounted prices during the Steam Summer Sale through Thursday, July 7.

To cap it all off, the GeForce NOW v2.0.42 update improves streaming performance with new optimizations that adjust streaming resolutions to best fit network conditions.

What’s All the Hubbub? 

Today’s launch of the Samsung Gaming Hub brings the best of gaming from leading game streaming services like GeForce NOW to 2022 Samsung Smart TVs.

The Samsung Gaming Hub is a new game-streaming discovery platform that bridges hardware and software to provide a better player experience. Gamers can instantly play the biggest games from GeForce NOW and other top gaming partners with no downloads, storage limits or console required.

The best Samsung Smart TVs combine the latest game streaming technology with intelligent technology for picture quality and sound to create a console-like performance, eliminating the hassle of downloads and worries about precious storage space or latency.

The Samsung Gaming Hub is available now on supported TVs in the US, UK, Brazil, Canada, France, Italy, Germany, and Spain. Members can also stream their PC games on the GeForce NOW app on Samsung TVs in other supported GeForce NOW regions.

GeForce NOW RTX 3080 members also have the advantages of ultra-low latency powered by GeForce NOW SuperPODs with faster game rendering, more efficient encoding and higher streaming frame rates. They also benefit from maximized eight-hour gaming sessions and dedicated RTX 3080 servers.

Gamers can even pair their favorite controllers to the Samsung Gaming Hub for a seamless experience.

The Best Keeps Getting Better

SHIELD TV continues to upgrade the best cloud gaming experience in the living room, adding to its existing GeForce NOW support for 4K HDR, 7.1 surround sound, a wide range of controllers, streaming to Twitch, and in-game voice chat with USB headsets. The latest SHIELD update, Software Experience Upgrade 9.1, takes gaming in the living room to new heights.

GeForce NOW streaming on NVIDIA SHIELD
Just switch on your SHIELD and play hit titles you own from the GeForce NOW library.

SHIELD now automatically switches TVs with automatic low-latency mode to “game mode” when playing games or video conferencing — and then reverts to the previous setting when playing movies or streaming TV shows. This latency-saving feature replaces the cumbersome process of finding the TV remote, switching the mode setting, and changing it back when gaming sessions are complete.

Another new feature is night listening mode, which enables users to stream games or watch movies at night, without disturbing others. SHIELD will automatically adjust sound levels for loud explosions, quiet dialogue and everything in between to deliver a consistent listening experience regardless of volume settings.

The update also includes microphone notifications that help identify the hot mic when multiple devices are connected.

Whether gaming from the cloud with GeForce NOW or playing an Android game locally, the latest SHIELD update helps members get the most responsive gaming experience in the living room.

Get Rewarded With SteelSeries

Members can take control of their gaming with 20% off SteelSeries gaming controllers.

Steel Series Stratus+ on GeForce NOW
Put the “joy” in joystick with a new controller from SteelSeries.

SteelSeries wireless gaming controllers bring the PC gaming experience to any platform with easy pairing, extreme durability and a battery life of up to 50 hours of playtime. They’re also a part of the full lineup of GeForce NOW Recommended products. The discount is valid for the Nimbus +, the Stratus Duo and even the newest Stratus+ models. Redemption is valid through Wednesday, August 31 for select North American and European regions.

It’s easy to get membership rewards for streaming games on the cloud. Log in to your NVIDIA account and select “GEFORCE NOW” from the header, then scroll down to “REWARDS” and click the “UPDATE REWARDS SETTINGS” button. Check the box in the dialogue window that shows up to start receiving special offers and in-game spoils.

Start July Off With a Bang

This GFN Thursday closes out the month with six new games streaming this week, including games from Motorsport Games. It also kicks off July with the list of 13 titles on the way.

Kart Kraft on GeForce NOW
Buckle up. Fast-paced competitive racing titles are on the way.

GeForce NOW welcomes video game publisher Motorsport Games to the cloud. From NASCAR 21: Ignition, the officially licensed video game of the world’s most popular stock-car racing series, to the thrilling and realistic physics of KartKraft, more gamers than ever can experience racing entertainment streaming to low-powered PCs, Macs and mobile devices.

Catch the games ready to play today:

And coming this July:

  • Matchpoint-Tennis Championships (New release on Steam, July 7)
  • Sword and Fairy Inn 2 (New release on Steam, July 8)
  • Loopmancer (New release on Steam, July 13)
  • Stones Keeper: King Aurelius (New release on Steam, July 14)
  • Endling – Extinction is Forever (New release on Steam and Epic Games Store, July 19)
  • Grimstar: Welcome to the Savage Planet (New release on Steam, July 19)
  • Sweet Transit (New release on Steam, July 28)
  • Panzer Arena: Prologue (New release on Steam, July 20)
  • Hell Pie (New release on Steam, July 21)
  • Turbo Sloths (New release on Steam, July 27)
  • Arma Reforger (Steam)
  • Dungeon Defenders: Going Rogue (Steam)
  • rFactor 2 (Steam)

Blow Off Some Steam With a Summer Sale

Speaking of games, it’s the best time to build your collection with the Steam Summer Sale, running through Thursday, July 7.

Life is Strange on GeForce NOW
Speed on over to Steam. Tons of great games, like “Life is Strange: True Colors,” are on sale now.

Get PC games streaming from the GeForce NOW library during Valve’s special event to stream across low-powered PCs, Macs and mobile devices on the cloud. Once purchased, they’re yours forever, and the cloud saves all your progress.

Check out the “Steam Summer Sale” row in the GeForce NOW app to find deals on your next adventure. Race to grab titles like NASCAR 21: Ignition and KartKraft from Motorsport Games and check if any of the GeForce NOW games on your wishlist are on sale. With over 1,300 games streaming on the cloud, there’s a good chance they are.

Extra Games From June

On top of the 25 games announced in June, another seven joined over the month:

Finally, tune into this question we’ve got for you this week. Let us know your answer on Twitter or in the comments below.

The post The Gaming Evolution Will Be Televised: GFN Thursday Levels Up the Living Room Experience on New Samsung TVs and More appeared first on NVIDIA Blog.


FIGS: Attaining XGBoost-level performance with the interpretability and speed of CART

FIGS (Fast Interpretable Greedy-tree Sums): A method for building interpretable models by simultaneously growing an ensemble of decision trees in competition with one another.

Recent machine-learning advances have led to increasingly complex predictive models, often at the cost of interpretability. We often need interpretability, particularly in high-stakes applications such as clinical decision-making; interpretable models help with identifying errors, leveraging domain knowledge, and making speedy predictions.

In this blog post we’ll cover FIGS, a new method for fitting an interpretable model that takes the form of a sum of trees. Real-world experiments and theoretical results show that FIGS can effectively adapt to a wide range of structure in data, achieving state-of-the-art performance in several settings, all without sacrificing interpretability.
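
To illustrate what “a sum of trees” means at prediction time, here is a minimal sketch. It shows only the form of the model, not the FIGS fitting procedure, and the `Node` class, the two hand-built trees, and their thresholds and leaf values are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """A tiny decision-tree node: internal if `feature` is set, otherwise a leaf."""
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0  # contribution returned when this node is a leaf


def tree_predict(node: Node, x: Sequence[float]) -> float:
    """Walk one tree from the root to a leaf and return that leaf's contribution."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def sum_of_trees_predict(trees: Sequence[Node], x: Sequence[float]) -> float:
    """The model's output is the sum of the contributions from every tree."""
    return sum(tree_predict(tree, x) for tree in trees)


# Two hand-built single-split trees (illustrative values only).
tree_a = Node(feature=0, threshold=0.5, left=Node(value=0.125), right=Node(value=0.5))
tree_b = Node(feature=1, threshold=2.0, left=Node(value=0.0), right=Node(value=0.25))

print(sum_of_trees_predict([tree_a, tree_b], [0.9, 3.0]))  # 0.5 + 0.25
print(sum_of_trees_predict([tree_a, tree_b], [0.1, 1.0]))  # 0.125 + 0.0
```

Because each tree contributes an additive term, a reader can inspect every tree on its own and see how much it adds to the final prediction.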
