Increase your content reach with automated document-to-speech conversion using Amazon AI services

Reading the printed word opens up a world of information, imagination, and creativity. However, scanned books and documents may be difficult for people with vision impairment and learning disabilities to consume. In addition, some people prefer to listen to text-based content versus reading it. A document-to-speech solution extends the reach of digital content by giving text content a voice. It has uses across different industry sectors, such as:

  • Entertainment – You can create your own audiobooks.
  • Education – Students can convert their lecture notes to speech and access them anywhere.
  • Patient care – Dosage instructions and precautions are typically in small fonts and hard to read. With this solution, you could take a picture, convert it to speech, and listen to the instructions to avoid potential harm.

The document-to-speech solution automatically converts scanned books or documents captured on a mobile phone or handheld device to speech. This solution extends the capabilities of Amazon Polly. We extract text from scanned documents using Amazon Textract, and then convert the text to speech using Amazon Polly. Solution benefits include mobility and freedom for the user, plus enhanced learning capabilities for early readers.

The idea originated from one of blog author Harry Pan’s favorite parent-child activities: reading books. “My son enjoys storybooks, but is too young to read on his own. I love reading to him, but sometimes I need to work or tend to household chores. This sparked an idea to build a document-to-speech solution that could read to him when I was busy.”

Overview of solution

The solution is an event-driven serverless architecture that uses Amazon AI services to convert scanned documents to speech. Amazon Textract and Amazon Polly belong to the topmost layer of the AWS machine learning (ML) stack. These services allow developers to easily add intelligence to any application without prior ML knowledge.

Amazon Textract is an ML service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Amazon Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data without any manual effort.

Amazon Polly is a text-to-speech service that turns text into lifelike speech, allowing you to create applications that talk and to build entirely new categories of speech-enabled products. Amazon Polly uses advanced deep learning technologies to synthesize speech that sounds like a human voice.
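
To make the division of labor between the two services concrete, the following is a minimal, synchronous Python (boto3) sketch rather than the solution's actual code. It assumes an image small enough for the synchronous Amazon Textract API and text short enough for real-time Amazon Polly synthesis; the bucket and file names are placeholders.

import boto3

textract = boto3.client("textract")
polly = boto3.client("polly")

# Extract text from a scanned page stored in Amazon S3 (placeholder bucket and key).
response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "my-scans-bucket", "Name": "page-1.png"}}
)
text = " ".join(
    block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"
)

# Convert the extracted text to lifelike speech and save the MP3 locally.
audio = polly.synthesize_speech(
    Text=text, OutputFormat="mp3", VoiceId="Joanna", Engine="neural"
)
with open("page-1.mp3", "wb") as f:
    f.write(audio["AudioStream"].read())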

There are significant advantages of using Amazon AI services:

  • They require little effort; you can integrate these APIs into any application
  • They offer highly scalable and cost-effective solutions
  • Your organization can shift its focus from development of custom models to business outcomes

The solution also uses Amazon API Gateway to quickly stand up APIs that the web UI can invoke to perform operations like uploading documents and converting scanned documents to speech. API Gateway provides a scalable way to create, publish, and maintain secure APIs. In this solution, we use API Gateway WebSocket support to establish a persistent connection between the web UI and the backend, so the backend can keep sending progress updates to the user in real time.
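
As an illustration of that pattern, a backend Lambda function can push a progress message to a connected client through the API Gateway Management API. The following is a hedged sketch; the endpoint URL and connection ID are placeholders, not the solution's actual values.

import json

import boto3

# The WebSocket API's connection-management endpoint (placeholder URL).
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.us-east-1.amazonaws.com/prod",
)

def send_progress(connection_id, step, percent):
    # Push a small JSON progress update to a single connected web client.
    apigw.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps({"step": step, "percent": percent}).encode("utf-8"),
    )

send_progress("example-connection-id", "extracting text", 40)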

We use AWS Lambda functions to trigger Amazon Textract and Amazon Polly asynchronous jobs. Lambda is a highly available and scalable compute service that lets you run code without provisioning or managing servers.
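
For orientation, the following boto3 sketch shows the asynchronous pattern these functions follow: start a job, return immediately, and let the service publish a completion notification to an SNS topic. The bucket, key, topic ARNs, and role ARN are placeholders rather than the solution's actual resources.

import boto3

textract = boto3.client("textract")
polly = boto3.client("polly")

# convert-images-to-text: start an asynchronous Amazon Textract job on the uploaded document.
textract_job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "doc-bucket", "Name": "uploads/scan.pdf"}},
    NotificationChannel={
        "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:textract-done",
        "RoleArn": "arn:aws:iam::123456789012:role/textract-sns-publish",
    },
)

# convert-text-to-audio: start an asynchronous Amazon Polly task on the extracted text.
polly_task = polly.start_speech_synthesis_task(
    Text="Extracted document text goes here.",
    OutputFormat="mp3",
    VoiceId="Joanna",
    OutputS3BucketName="doc-bucket",
    SnsTopicArn="arn:aws:sns:us-east-1:123456789012:polly-done",
)

print(textract_job["JobId"], polly_task["SynthesisTask"]["TaskId"])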

We use an AWS Step Functions state machine to orchestrate two parallel Lambda functions – one to moderate text and the other to store text in Amazon Simple Storage Service (Amazon S3). Step Functions is a serverless orchestration service to define application workflows as a series of event-driven steps.
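
The following AWS CDK (Python) fragment is a minimal sketch of how such a parallel step can be defined. It assumes it runs inside a CDK stack class (hence self) and that Lambda function constructs named retrieve_fn, moderate_fn, store_fn, and convert_fn already exist; the names are illustrative, not the repository's actual identifiers.

from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as tasks

retrieve = tasks.LambdaInvoke(self, "RetrieveText", lambda_function=retrieve_fn)
moderate = tasks.LambdaInvoke(self, "ModerateText", lambda_function=moderate_fn)
store = tasks.LambdaInvoke(self, "StoreText", lambda_function=store_fn)
convert = tasks.LambdaInvoke(self, "ConvertTextToAudio", lambda_function=convert_fn)

# Retrieve the text, then moderate and store it in parallel, then synthesize speech.
definition = retrieve.next(
    sfn.Parallel(self, "ModerateAndStore").branch(moderate).branch(store)
).next(convert)

sfn.StateMachine(self, "DocumentToSpeechStateMachine", definition=definition)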

Architecture and code

As described in the previous section, we use two key AI services, Amazon Textract and Amazon Polly, to build a document-to-speech conversion solution. One additional service that we haven’t touched upon is AWS Amplify. Amplify allows front-end developers to quickly build extensible, full stack web and mobile apps. With Amplify, you can easily configure a backend, connect an application to it within minutes, and scale effortlessly. We use Amplify to host a web UI that allows users to upload their scanned documents.

You can also use your own UI without Amplify. As we dive deep into this solution, we show how you can use any client application to connect to the backend to convert documents to speech – as long as it supports REST and WebSocket APIs. The web UI here is simply to demonstrate key features of this solution. As of this writing, the solution supports JPEG, PNG, and PDF input formats, and the English language.

The following diagram illustrates the solution architecture.

We walk through this architecture by following the path of a single user request:

  1. The user visits the web UI hosted on Amplify. The UI code is the index.html file in the client folder of the code repository.
  2. The user chooses a JPG, PDF, or PNG file to upload using the web UI.
  3. The user initiates the Convert & Play Audio process from the web UI, which uploads the input file to an S3 bucket, through a REST API hosted on API Gateway.
  4. When the upload is complete, the document-to-speech conversion starts as a background process:
    1. During the conversion, the web client keeps a persistent WebSocket connection with the API Gateway. This allows the backend processes (Lambda functions) to continuously send progress updates to the web client.
    2. The request goes through the API Gateway and triggers the Lambda function convert-images-to-text. This function calls Amazon Textract asynchronously to convert the document to text.
    3. When the image-to-text conversion is complete, Amazon Textract sends a notification to Amazon Simple Notification Service (Amazon SNS).
    4. The notification triggers the Lambda function on-textract-ready, which kicks off a Step Functions state machine.
    5. The state machine orchestrates the following steps:
      1. It runs the Lambda function retrieve-text to obtain the converted text from Amazon Textract.
      2. It then runs Lambda functions moderate-text and store-text in parallel. moderate-text stops further processing when undesirable words are detected, and store-text stores a copy of the converted text to an S3 bucket.
      3. After the parallel steps are complete, the state machine runs the Lambda function convert-text-to-audio, which invokes Amazon Polly asynchronously with the converted text, for speech conversion. The state machine finishes after this step.
    6. Similar to Amazon Textract, Amazon Polly sends a notification to Amazon SNS when the job is done. The notification triggers the Lambda function on-polly-ready, which sends a final message to the web UI along with the Amazon S3 location of the converted audio file.
  5. The web UI downloads the final converted audio file from Amazon S3 via a REST API, and then plays it for the user.
  6. The application uses an Amazon DynamoDB table to track job information such as Amazon Textract job ID, Amazon Polly job ID, and more.
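
To illustrate step 6, the following is a hedged boto3 sketch of how a Lambda function might record and update job state in DynamoDB. The table name, keys, and attribute names are illustrative, not the solution's actual schema.

import boto3

table = boto3.resource("dynamodb").Table("document-to-speech-jobs")

# Written when the Amazon Textract job is started.
table.put_item(
    Item={
        "job_id": "textract-job-id",
        "connection_id": "websocket-connection-id",
        "document_key": "uploads/scan.pdf",
        "status": "EXTRACTING_TEXT",
    }
)

# Updated later, for example when Amazon Polly finishes and the audio location is known.
table.update_item(
    Key={"job_id": "textract-job-id"},
    UpdateExpression="SET #s = :s, audio_key = :a",
    ExpressionAttributeNames={"#s": "status"},  # status is a DynamoDB reserved word
    ExpressionAttributeValues={":s": "AUDIO_READY", ":a": "audio/scan.mp3"},
)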

The code is hosted on GitHub and is deployed using AWS Cloud Development Kit (AWS CDK), an open-source software development framework to define cloud application resources using familiar programming languages. AWS CDK provisions resources in a repeatable manner through AWS CloudFormation.

Prerequisites

The only prerequisite to deploy this solution is an AWS account.

Deploy the solution

The following steps detail how to deploy the application:

  1. Sign in to your AWS account.
  2. On the AWS Cloud9 console, open an existing environment, or choose Create environment to create a new one.
  3. In your AWS Cloud9 IDE, on the Window menu, choose New Terminal to open a terminal.

All the following steps are done in the same terminal.

  4. Clone the Git repository and enter the project directory:
git clone --depth 1 https://github.com/aws-samples/scanned-documents-to-speech.git
cd scanned-documents-to-speech
  5. Create a Python virtual environment:
python3 -m venv .venv
  6. After the init process is complete and the virtual environment is created, activate it:
source .venv/bin/activate
  7. With the virtual environment activated, install the required dependencies:
pip install -r requirements.txt
  8. Synthesize the CloudFormation templates from the AWS CDK code:
cdk synth
  9. Deploy the AWS CDK application and capture the AWS CDK outputs needed later:
cdk deploy --all --outputs-file cdk-outputs.json

You must confirm the changes to be deployed for each stack. You can check the stack creation progress on the AWS CloudFormation console.

  10. To deploy the front end and open the web client, run the following command and follow its output:
./extract-cdk-outputs.py cdk-outputs.json

Key things to note:

  • The extract-cdk-outputs.py script prints the URL of the web UI. It also prints the S3 bucket name, file API endpoint, and conversion API endpoint, which must be set in the web UI before uploading a document.
  • You can set the list of undesirable words in a variable in the moderate-text Lambda function; a minimal sketch of such a check follows this list.
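
The sketch below shows the kind of check moderate-text could perform; the word list and function name here are hypothetical, and the actual variable lives in that function's source.

# Hypothetical word list; the real list is configured in the moderate-text function.
UNDESIRABLE_WORDS = {"badword1", "badword2"}

def contains_undesirable_words(text):
    # Return True if any configured undesirable word appears in the extracted text.
    words = {word.strip(".,!?;:").lower() for word in text.split()}
    return bool(words & UNDESIRABLE_WORDS)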

Use the application

The following steps demonstrate how to use the application via the web UI.

  1. Following the last step of the deployment, fill in the fields for S3 Bucket Name, File Endpoint, and Conversion Endpoint in the web UI.
  2. Choose Choose File to upload an input file.
  3. Choose Convert & Play Audio.

The web UI shows the progress of the ongoing conversion.

The web UI plays the audio automatically when the conversion is complete.

Clean up

Run the following command to delete all resources and avoid incurring future charges:

cdk destroy --all

Conclusion

In this post, we demonstrated a solution to quickly deploy a document-to-speech conversion application by using two powerful AI services: Amazon Textract and Amazon Polly. We showed how the solution works and provided a detailed walkthrough of the code and deployment steps. This solution is meant to be a reference architecture or quick start that you can further enhance. Notably, you could add support for more human languages, add a queue for buffering incoming requests, and authenticate users.

As discussed in this post, we see multiple use cases for this solution across different industry verticals. Give it a try and let us know how this solved your use case by leaving feedback in the comments section. You can access the resources for the solution in the document to speech GitHub repository.

About the Authors

Harry Pan is an ISV Solutions Architect at Amazon Web Services based in the San Francisco Bay Area, where he helps software companies achieve their business goals by building well-architected IT systems. He loves spending his spare time with his family, as well as playing tennis, coding in Haskell, and traveling.

Chaitra Mathur is a Principal Solutions Architect at AWS. She guides partners and customers in building highly scalable, reliable, secure, and cost-effective solutions on AWS. In her spare time, she enjoys reading, yoga, and spending time with her daughters.

Mown Away: Startup Rolls Out Autonomous Lawnmower With Cutting Edge Tech

Jack Morrison and Isaac Roberts, co-founders of Replica Labs, were restless two years after their 3D vision startup was acquired, seeking another adventure. Then, in 2018, when Morrison was mowing his lawn, it struck him: autonomous lawn mowers.

The two, along with Davis Foster, co-founded Scythe Robotics. The company, based in Boulder, Colo., has a 40-person team working with robotics and computer vision to deliver what it believes to be the first commercial electric self-driving mower service.

Scythe’s machine, dubbed M.52, collects a dizzying array of data from eight cameras and more than a dozen other sensors, processed by NVIDIA Jetson AGX Xavier edge AI computing modules.

The company plans to rent its machines to customers much as in a software-as-a-service model, but based on acreage of cut grass, reducing upfront costs.

“I thought, if I didn’t enjoy mowing the lawn, what about the folks who are doing it every day? Wasn’t there something better they could do with their time?” said Morrison. “It turned out there’s a strong resounding ‘yes’ from the industry.”

The startup, a member of the NVIDIA Inception program, says it already has thousands of reservations for its on-demand robots. Meanwhile, it has a handful of pilots, including one with Clean Scapes, a large Austin-based commercial landscaping company.

Scythe’s electric machines are coming as regulatory and corporate governance concerns highlight the need for cleaner landscaping technologies.

What M.52 Can Do

Scythe’s M.52 machine is about as state of the art as it gets. Its cameras support vision on all sides, and its other sensors include ultrasonics, accelerometers, gyroscopes, magnetometers, GPS and wheel encoders.

To begin a job, the M.52 only needs to be manually driven around the perimeter of an area once. Scythe’s robot mower relies on its cameras, GPS and wheel encoders to help plot out maps of its environment with simultaneous localization and mapping, or SLAM.

After that, the operator can direct the M.52 to an area and specify a direction and stripe pattern, and it completes the job unsupervised. If it encounters an obstacle that shouldn’t be there — like a bicycle on the grass — it can send alerts for an assist.

“Jetson AGX Xavier is a big enabler of all of this, as it can be used in powerful machines, brings a lot of compute, really low power, and hooks into the whole ecosystem of autonomous machine sensors,” said Morrison.

Scythe’s robo-mowers pack enough battery for eight hours of use on a charge, which can come from a standard level 2 EV charger. And the company says fewer moving parts than combustion engine mowers means less maintenance and a longer service life.

Also, the machine can go 12 miles per hour and includes a platform for operators to stand on for a ride. It’s designed for job sites where mowers may need to travel some distance to get to the area of work.

Riding the RaaS Wave

The U.S. market for landscaping services is expected to reach more than $115 billion this year, up from about $70 billion in 2012, according to IBISWorld research.

Scythe is among an emerging class of startups offering robots as a service, or RaaS, in which customers pay according to usage.

It’s also among companies operating robotics services whose systems share similarities with AVs, Morrison said.

“An advantage of Xavier is using these automotive-grade camera standards that allow the imagery to come across with really low latency,” he said. “We are able to turn things around from photon to motion quite quickly.”

Earth-Friendly Machines

Trends toward lower carbon footprints in businesses are a catalyst for Scythe, said Morrison. LEED certification for greener buildings counts the equipment used in maintenance — such as landscaping — driving interest in electric equipment.

Legislation from California, which aims to prohibit sales of gas-driven tools by 2024, factors in as well. A gas-powered commercial lawn mower driven for one hour emitted as much of certain pollutants as driving a passenger vehicle 300 miles, according to a fact sheet released by the California Air Resources Board in 2017.

“There’s a lot of excitement around what we are doing because it’s becoming a necessity, and those in landscaping businesses know that they won’t be able to secure the equipment they are used to,” said Morrison.

Learn more about Scythe Robotics in this GTC presentation.

Meet the Omnivore: 3D Artist Creates Towering Work With NVIDIA Omniverse

Editor’s note: This post is a part of our Meet the Omnivore series, which features individual creators and developers who use NVIDIA Omniverse to accelerate their 3D workflows and create virtual worlds.

Edward McEvenue

Edward McEvenue grew up making claymations in LEGO towns. Now, he’s creating photorealistic animations in virtual cities, drawing on more than a decade of experience in the motion graphics industry.

The Toronto-based 3D artist learned about NVIDIA Omniverse, a 3D design collaboration and world simulation platform, a few months ago on social media. Within weeks he was using it to build virtual cities.

McEvenue used the Omniverse Create app to make a 3D representation of New York City, composed of 32 million polygons:

And for the following scene, he animated a high-rise building with fracture-and-reveal visual effects:

Light reflects off the building’s glass windows photorealistically, and vehicles whiz by in a physically accurate way — thanks to ray tracing and path tracing enabled by RTX technology, as well as Omniverse’s physics-based simulation capabilities.

McEvenue’s artistic process incorporates many third-party applications, including Autodesk 3ds Max for modeling and animation; Epic Games Unreal Engine for scene layout; tyFlow for visual effects and particle simulations; Redshift and V-Ray for rendering; and Adobe After Effects for post-production compositing.

All of these tools can be connected seamlessly in Omniverse, which is built on Pixar’s Universal Scene Description, an easily extensible, open-source 3D scene description and file format.

Real-Time Rendering and Multi-App Workflows

EDSTUDIOS, McEvenue’s freelance company, creates 3D animations and motion graphics for films, advertisements and other visual projects.

He says Omniverse helps shave a week off his average delivery time for projects, since it eliminates the need to send animations to a separate render farm, and allows him to bring in assets from his favorite design applications.

“It’s a huge relief to finally feel like rendering isn’t a bottleneck for creativity anymore,” McEvenue said. “With Omniverse’s real-time rendering capabilities, you get instant feedback, which is incredibly freeing for the artistic process as it allows creators to focus time and energy into making their designs rather than facing long waits for the beautiful images to display.”

A virtual representation of the ancient sculpture ‘Laocoön and His Sons’ — made by McEvenue with Autodesk 3ds Max, Unreal Engine 5 and Omniverse Create.

McEvenue says he can output more visuals on one computer running Omniverse faster than he previously could with nine. He uses an NVIDIA RTX 3080 Ti GPU and NVIDIA Studio Drivers to accelerate his workflow.

In addition to freelance projects, McEvenue has worked on commercial campaigns across Canada, the U.S. and Uganda as a co-founder of the film production company DAY JOB.

“Omniverse takes away a lot of the friction in dealing with exporting formats, recreating shaders or linking materials,” McEvenue said. “And it’s very intuitively designed, so it’s incredibly easy to get up and running.”

Watch a Twitch stream replay that dives deeper into McEvenue’s workflow, as well as highlights the launch of the Unreal Engine 5 Omniverse Connector.

Join in on the Creation

Creators across the world can download NVIDIA Omniverse for free, and enterprise teams can use the platform for their 3D projects.

Join the #MadeInMachinima contest, running through June 27, for a chance to win the latest NVIDIA Studio laptop.

Learn more about Omniverse by watching GTC sessions on demand — featuring visionaries from the Omniverse team, Adobe, Autodesk, Bentley Systems, Epic Games, Pixar, Unity, Walt Disney Studios and more.

Connect your workflows to Omniverse with software from Adobe, Autodesk, Epic Games, Maxon, Reallusion and more.

Follow Omniverse on Instagram, Twitter, YouTube and Medium for additional resources and inspiration. Check out the Omniverse forums and join our Discord Server to chat with the community.
