Applying voice classification in an Amazon Connect telemedicine contact flow

Given the rising demand for fast and effective COVID-19 detection, customers are exploring the usage of respiratory sound data, like coughing, breathing, and counting, to automatically diagnose COVID-19 based on machine learning (ML) models. University of Cambridge researchers built a COVID-19 sound application and demonstrated that a simple binary ML classifier can classify healthy and COVID-19 coughs with over 80% area under the curve (AUC) for all tasks. Massachusetts Institute of Technology (MIT) researchers published a similar open voice model, and their Convolutional Neural Network (CNN) based binary classifier achieves COVID-19 sensitivity of 98.5% with a specificity of 94.2% (AUC 0.97). Carnegie Mellon University also built a COVID voice detector to develop an automated AI system to diagnose a COVID-19 infection based on the human voice. The promising results of these preliminary studies based on crowdsourced audio signals shows the power of AI in the medical industry for disease diagnosis and detection.

Although the research has shown a lot of promise, it’s still difficult to create a scalable solution that takes advantage of these promising models. In this post, we demonstrate a smart call center application workflow that integrates a voice classification model to detect COVID-19 infections or other types of respiratory diseases in people calling in to the call center. For the purposes of creating an end-to-end workflow, we train the model on the open-source Coswara data, which relies on a variety of sounds like deep or shallow breathing, coughing, and counting to distinguish healthy versus unhealthy sound. You can replace this model and training data with any other model or datasets to achieve the level of performance as demonstrated in the research papers.

Overview of solution

This solution uses Amazon Connect, an easy-to-use omnichannel cloud contact center contact flow to make real-time inference to an ML model trained and deployed using Amazon SageMaker. The audio recordings are labeled as healthy (negative) and unhealthy (positive), meaning a COVID-19 infection and other respiratory illness. Because the distribution of positive and negative labels are highly imbalanced, we use the oversampling technique from the Python imbalanced learn library to improve the ratio. We used the PyTorch acoustic classification model, which relies on deep Convolutional Neural Network (CNN) for this audio-based COVID prediction. The trained CNN model is deployed to a SageMaker inference endpoint. The AWS Lambda function triggered by the Amazon Connect contact flow is used to make real-time inference based on the audio streams from an Amazon Connect phone call recording in Amazon Kinesis Video Streams.

The following is the architecture diagram for integrating online ML inference in a telemedicine contact flow via Amazon Connect.

The following is the architecture diagram for integrating online ML inference in a telemedicine contact flow via Amazon Connect.

Training and deploying a voice classification model using SageMaker

We first create a SageMaker notebook instance, on which we build a voice classification deep learning model to predict the likelihood of respiratory diseases using the open-source Coswara dataset. To deploy the AWS CloudFormation stack for the notebook instance, choose Launch Stack:

Feel free to change the notebook instance type if necessary. The deployment also clones the following two GitHub repositories:

Go to the Jupyter notebook coswara-audio-classification.ipynb under the applying-voice-classification-in-amazon-connect-contact-flow/sagemaker-voice-classification/notebook folder.

The notebook walks you through the following tasks:

The notebook walks you through the following tasks:

  1. Preprocess the Coswara data, including uncompressing files and generating the metadata CSV files for each type of audio recording.
  2. Build and upload the Docker container image for SageMaker training and inference jobs to Amazon Elastic Container Registry (Amazon ECR).
  3. Upload Coswara data to an Amazon Simple Storage Service (Amazon S3) bucket for the SageMaker training job.
  4. Train a Pytorch CNN estimator for voice classification given the sample hyperparameters.
  5. Create a hyperparameter optimization (HPO) job (optional).
  6. Deploy the trained PyTorch estimator to the SageMaker inference endpoint.
  7. Test batch prediction and invoke the endpoint.

Because this dataset is highly unbalanced, we labeled healthy samples as negative and all non-healthy samples as positive, and over-sampled the positive ones using imbalanced-learn library in the file under the notebook folder:

import torch
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
for data, target in data_loader:
    data_resampled, target_resampled = ros.fit_resample(np.squeeze(data), target)
    data = torch.from_numpy(data_resampled)
    data = data.unsqueeze_(-2)
    target = torch.tensor(target_resampled)

In the preceding code, the data and target are torch tensors returned by the getitem function defined in the CoswareDataset class in the file. The oversampling approach improved the prediction performance by approximately 40%. We implemented a very deep CNN for voice classification in the file with the default number of classes as two, and applied different metrics in the Scikit-learn Python library to evaluate the prediction performance:

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, fbeta_score, roc_auc_score
accuracy = accuracy_score(actuals, predictions)
rocauc = roc_auc_score(actuals, np.exp(prediction_probs))
precision = precision_score(actuals, predictions, average='weighted')
recall = recall_score(actuals, predictions, average='weighted')
f1 = f1_score(actuals, predictions, average='weighted')
f2 = fbeta_score(actuals, predictions, average='weighted', beta=0.5)

The tuning job tries to maximize the F-beta score, which is the weighted harmonic mean of precision and recall. When you’re satisfied with the prediction performance of the training job, you can deploy a SageMaker inference endpoint:

from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(model_data=model_location, 
predictor = pytorch_model.deploy(initial_instance_count=1, instance_type='ml.c5.2xlarge', wait=True)

After deploying the estimator for online prediction, take note of the inference endpoint name, which you use in the next step.

After deploying the estimator for online prediction, take note of the inference endpoint name, which you use in the next step.

It’s noteworthy that the inference endpoint can be invoked by two types of request body defined in the file:

  • A text string for the S3 object of the audio recording WAV file
  • A pickled NumPy array

See the following code:

def input_fn(request_body, request_content_type):
    if request_content_type == 'text/csv':
        print("bucket: {}".format(bucket))
        print("object: {}".format(obj))
        s3.download_file(bucket, obj, '/audioinput.wav')
        print("audio input file size: {}".format(os.path.getsize('/audioinput.wav')))
        waveform, sample_rate = torchaudio.load('/audioinput.wav')
        waveform = torchaudio.transforms.Resample(sample_rate, new_sr)(waveform[0, :].view(1, -1))
        const_len = new_sr * audio_len
        tempData = torch.zeros([1, const_len])
        if waveform.shape[1] < const_len:
            tempData[0, : waveform.shape[1]] = waveform[:]
            tempData[0, :] = waveform[0, :const_len]
        sound = tempData
        tempData = torch.zeros([1, const_len])
        if sound.shape[1] < const_len:
            tempData[0, : sound.shape[1]] = sound[:]
            tempData[0, :] = sound[0, :const_len]
        sound = tempData
        new_const_len = const_len // sampling_ratio
        soundFormatted = torch.zeros([1, 1, new_const_len])
        soundFormatted[0, 0, :] = sound[0, ::5]
        return soundFormatted
    elif request_content_type in ['application/x-npy', 'application/python-pickle']:
        return torch.tensor(np.load(BytesIO(request_body), allow_pickle=True))
        print("unknown request content type: {}".format(request_content_type))
        return request_body

The output is the probability of the positive class from 0 to 1, which indicates how likely the voice is unhealthy in this use case, defined in as well:

def predict_fn(input_data, model):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    with torch.no_grad():
        output = model(
        output = output.permute(1, 0, 2)[0]
        pred_prob = np.exp( output.cpu().detach().numpy()[:,1] )
        return pred_prob[0]

Deploying a CloudFormation template for Lambda functions for audio streaming inference

You can deploy the Lambda function with the following CloudFormation stack one-click deployment in the us-east-1 Region:

You need to fill in the S3 bucket name for the audio recording and the SageMaker inference endpoint as parameters.

You need to fill in the S3 bucket name for the audio recording and the SageMaker inference endpoint as parameters.

If you want to deploy this stack in AWS Regions other than us-east-1, or if you want to change the Lambda functions, go to the connect-audio-stream-solution folder and follow the steps to build and deploy the Serverless Application Model (AWS SAM) stack. Take note of the CloudFormation stack outputs for the Lambda function ARNs, which you use in the next step.

Take note of the CloudFormation stack outputs for the Lambda function ARNs, which you use in the next step.

Setting up an interactive voice response using Amazon Connect

We use an Amazon Connect contact flow to trigger Lambda functions, created in the previous step, to process the captured audio recording in Kinesis Video Streams, assuming you have an Amazon Connect instance ready to use. For instructions on setting up an Amazon Connect instance, see Create an Amazon Connect instance. You also need to enable live audio streaming for your instance. Your instance should be created in the same AWS Region as your previous CloudFormation stack, because your video stream should be created in the same Region for Lambda functions to consume.

You can create a new inbound contact flow by importing the flow configuration file. You need to claim a phone number and associate it with the newly created contact flow. There are two Lambda functions to be configured here: the ARNs of ContactFlowlambdaInitArn and ContactFlowlambdaTriggerArn, located on the Outputs tab of the CloudFormation stack you deployed in the previous step.

You can create a new inbound contact flow by importing the flow configuration file.

After changing the ARNs for the Lambda functions, save and publish the contact flow. Now you’re ready to test it by calling the associated phone number with this contact flow.

Cleaning up

To avoid unexpected future charges, clean up your resources:

  1. Delete the SageMaker inference endpoint.
  2. Empty and delete the S3 bucket DefaultS3Bucket.
  3. Delete the CloudFormation stack for the SageMaker notebook instances and Lambda functions used by Amazon Connect.


This solution was inspired and built upon the following GitHub repos:


In this post, we demonstrated how to predict the likelihood of COVID-19 or other respiratory diseases just based on voice classification. To further improve the ML prediction performance, you can incorporate other related information into the model, like age, gender, or existing symptoms. Audio data augmentation plus handcrafted features can help yield better prediction results, according to existing studies. You can use the audio-based diagnostic prediction in an Amazon Connect contact flow to triage the targeted group of incoming calls and escalate to a doctor to follow up if necessary. The intelligence provided by the acoustic classification can be used by call center agents in conjunction with Contact Lens for Amazon Connect, which provides a turn-by-turn transcript, real-time alerts, automated call categorization based on keywords and phrases, sentiment analysis, issue detection (the reason the customer contacted the call center), and sensitive data redaction.

To find the latest developments to this solution, check out the GitHub repo.

About the Authors

Gang Fu is a Senior Healthcare Solution Architect at AWS. He holds a PhD in Pharmaceutical Science from the University of Mississippi and has over 10 years of technology and biomedical research experience. He is passionate about technology and the impact it can make on healthcare.



Ujjwal Ratan is a Principal Machine Learning Specialist in the Global Healthcare and Life Sciences team at Amazon Web Services. He works on the application of machine learning and deep learning to real-world industry problems like medical imaging, unstructured clinical text, genomics, precision medicine, clinical trials, and quality of care improvement. He has expertise in scaling machine learning and deep learning algorithms on the AWS Cloud for accelerated training and inference. In his free time, he enjoys listening to (and playing) music and taking unplanned road trips with his family.


Wei Yih YapWei Yih Yap is a Senior Data Scientist with AWS Professional Services, where he works with customers to address business challenges using machine learning on AWS. In his spare time, he enjoys spending time with his family.

Read More