Scientists’ paper on probabilistic symbolic execution has significantly influenced software testing and analysis.Read More
A Podcast With Teeth: How Overjet Brings AI to Dentists’ Offices
Dentists get a bad rap. Dentists also get more people out of more aggravating pain than just about anyone.
Which is why the more technology dentists have, the better.
Overjet, a member of the NVIDIA Inception program for startups, is moving fast to bring AI to dentists’ offices.
On this episode of the NVIDIA AI Podcast, host Noah Kravitz talks to Dr. Wardha Inam, CEO of Overjet, about how her company uses AI to improve patient care.
Overjet’s AI-powered technology analyzes and annotates X-rays for dentists and insurance providers.
It’s a step that promises to take the subjectivity out of X-ray interpretations, boosting medical services.
Interested in learning more about healthcare life sciences AI and simulation innovation at GTC? Check out our Accelerate Healthcare Life Science Innovation with Industry Makers and Breakers at this week’s GTC that will cover innovations in imaging, devices, genomics, drug discovery, and metaverse.
You Might Also Like
Artem Cherkasov and Olexandr Isayev on Democratizing Drug Discovery With NVIDIA GPUs
It may seem intuitive that AI and deep learning can speed up workflows — including novel drug discovery, a typically years-long and several-billion-dollar endeavor. However, there is a dearth of recent research reviewing how accelerated computing can impact the process. Professors Artem Cherkasov and Olexandr Isayev discuss how GPUs can help democratize drug discovery.
Lending a Helping Hand: Jules Anh Tuan Nguyen on Building a Neuroprosthetic
Is it possible to manipulate things with your mind? Possibly. University of Minnesota postdoctoral researcher Jules Anh Tuan Nguyen discusses allowing amputees to control their prosthetic limbs with their thoughts, using neural decoders and deep learning.
Wild Things: 3D Reconstructions of Endangered Species With NVIDIA’s Sifei Liu
Studying endangered species can be difficult, as they’re elusive, and the act of observing them can disrupt their lives. Sifei Liu, a senior research scientist at NVIDIA, discusses how scientists can avoid these pitfalls by studying AI-generated 3D representations of these endangered species.
Subscribe to the AI Podcast: Now Available on Amazon Music
You can now listen to the AI Podcast through Amazon Music.
Also get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.
Make the AI Podcast better: Have a few minutes to spare? Fill out our listener survey.
The post A Podcast With Teeth: How Overjet Brings AI to Dentists’ Offices appeared first on NVIDIA Blog.
What it’s like to be a parent on the Core Data Science team at Meta
What it’s like to be a parent on the Core Data Science team at MetaRead More
Introducing Whisper
We’ve trained and are open-sourcing a neural net called Whisper that approaches human level robustness and accuracy on English speech recognition.OpenAI Blog
Configure a custom Amazon S3 query output location and data retention policy for Amazon Athena data sources in Amazon SageMaker Data Wrangler
Amazon SageMaker Data Wrangler reduces the time that it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. With Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. You can import data from multiple data sources such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Snowflake, and 26 federated query data sources supported by Amazon Athena.
Starting today, when importing data from Athena data sources, you can configure the S3 query output location and data retention period to import data in Data Wrangler to control where and how long Athena stores the intermediary data. In this post, we walk you through this new feature.
Solution overview
Athena is an interactive query service that makes it easy to browse the AWS Glue Data Catalog, and analyze data in Amazon S3 and 26 federated query data sources using standard SQL. When you use Athena to import data, you can use Data Wrangler’s default S3 location for the Athena query output, or specify an Athena workgroup to enforce a custom S3 location. Previously, you had to implement cleanup workflows to remove this intermediary data, or manually set up S3 lifecycle configuration to control storage cost and meet your organization’s data security requirements. This is a big operational overhead, and not scalable.
Data Wrangler now supports custom S3 locations and data retention periods for your Athena query output. With this new feature, you can change the Athena query output location to a custom S3 bucket. You now have a default data retention policy of 5 days for the Athena query output, and you can change this to meet your organization’s data security requirements. Based on the retention period, the Athena query output in the S3 bucket gets cleaned up automatically. After you import the data, you can perform exploratory data analysis on this dataset and store the clean data back to Amazon S3.
The following diagram illustrates this architecture.
For our use case, we use a sample bank dataset to walk through the solution. The workflow consists of the following steps:
- Download the sample dataset and upload it to an S3 bucket.
- Set up an AWS Glue crawler to crawl the schema and store the metadata schema in the AWS Glue Data Catalog.
- Use Athena to access the Data Catalog to query data from the S3 bucket.
- Create a new Data Wrangler flow to connect to Athena.
- When creating the connection, set the retention TTL for the dataset.
- Use this connection in the workflow and store the clean data in another S3 bucket.
For simplicity, we assume that you have already set up the Athena environment (steps 1–3). We detail the subsequent steps in this post.
Prerequisites
To set up the Athena environment, refer to the User Guide for step-by-step instructions, and complete steps 1–3 as outlined in the previous section.
Import your data from Athena to Data Wrangler
To import your data, complete the following steps:
- On the Studio console, choose the Resources icon in the navigation pane.
- Choose Data Wrangler on the drop-down menu.
- Choose New flow.
- On the Import tab, choose Amazon Athena.
A detail page opens where you can connect to Athena and write a SQL query to import from the database. - Enter a name for your connection.
- Expand Advanced configuration.
When connecting to Athena, Data Wrangler uses Amazon S3 to stages the queried data. By default, this data is staged at the S3 locations3://sagemaker-{region}-{account_id}/athena/
with a retention period of 5 days. - For Amazon S3 location of query results, enter your S3 location.
- Select Data retention period and set the data retention period (for this post, 1 day).
If you deselect this option, the data will persist indefinitely.Behind the scenes, Data Wrangler attaches an S3 lifecycle configuration policy to that S3 location to automatically clean up. See the following example policy:
You need
s3:GetLifecycleConfiguration
ands3:PutLifecycleConfiguration
for your SageMaker execution role to correctly apply the lifecycle configuration policies. Without these permissions, you get error messages when you try to import the data.The following error message is an example of missing the
GetLifecycleConfiguration
permission.
The following error message is an example of missing the
PutLifecycleConfiguration
permission. - Optionally, for Workgroup, you can specify an Athena workgroup.
An Athena workgroup isolates users, teams, applications, or workloads into groups, each with its own permissions and configuration settings. When you specify a workgroup, Data Wrangler inherits the workgroup setting defined in Athena. For example, if a workgroup has an S3 location defined to store query results and enable Override client side settings, you can’t edit the S3 query result location.By default, Data Wrangler also saves the Athena connection for you. This is displayed as a new Athena tile in the Import tab. You can always reopen that connection to query and bring different data into Data Wrangler.
- Deselect Save connection if you don’t want to save the connection.
- To configure the Athena connection, choose None for Sampling to import the entire dataset.
For large datasets, Data Wrangler allows you to import a subset of your data to build out your transformation workflow, and only process the entire dataset when you’re ready. This speeds up the iteration cycle and save processing time and cost. To learn more about different data sampling options available, visit Amazon SageMaker Data Wrangler now supports random sampling and stratified sampling. - For Data catalog¸ choose AwsDataCatalog.
- For Database, choose your database.
Data Wrangler displays the available tables. You can choose each table to check the schema and preview the data.
- Enter the following code in the query field:
- Choose Run to preview the data.
- If everything looks good, choose Import.
- Enter a dataset name and choose Add to import the data into your Data Wrangler workspace.
Analyze and process data with Data Wrangler
After you load the data in to Data Wrangler, you can do exploratory data analysis (EDA) and prepare the data for machine learning.
- Choose the plus sign next to the
bank-data
dataset in the data flow, and choose Add analysis.
Data Wrangler provides built-in analyses, including a Data Quality and Insights Report, data correlation, a pre-training bias report, a summary of your dataset, and visualizations (such as histograms and scatter plots). Additionally, you can create your own custom visualization.
- For Analysis type¸ choose Data Quality and Insight Report.
This automatically generates visualizations, analyses to identify data quality issues, and recommendations for the right transformations required for your dataset. - For Target column, choose Y.
- Because this is a classification problem statement, for Problem type, select Classification.
- Choose Create.
Data Wrangler creates a detailed report on your dataset. You can also download the report to your local machine.
- For data preparation, choose the plus sign next to the bank-data dataset in the data flow, and choose Add transform.
- Choose Add step to start building your transformations.
At the time of this writing, Data Wrangler provides over 300 built-in transformations. You can also write your own transformations using Pandas or PySpark.
You can now start building your transforms and analyses based on your business requirements.
Clean up
To avoid ongoing costs, delete the Data Wrangler resources using the steps below when you’re finished.
- Select Running Instances and Kernels icon.
- Under RUNNING APPS, click on the shutdown icon next to the
sagemaker-data-wrangler-1.0 app
. - Choose Shut down all to confirm.
Conclusion
In this post, we provided an overview of customizing your S3 location and enabling S3 lifecycle configurations for importing data from Athena to Data Wrangler. With this feature, you can store intermediary data in a secured S3 location, and automatically remove the data copy after the retention period to reduce the risk for unauthorized access to data. We encourage you to try out this new feature. Happy building!
To learn more about Athena and SageMaker, visit the Athena User Guide and Amazon SageMaker Documentation.
About the authors
Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.
Harish Rajagopalan is a Senior Solutions Architect at Amazon Web Services. Harish works with enterprise customers and helps them with their cloud journey.
James Wu is a Senior AI/ML Specialist Solution Architect at AWS. helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in marketing & advertising industries.
Use RStudio on Amazon SageMaker to create regulatory submissions for the life sciences industry
Pharmaceutical companies seeking approval from regulatory agencies such as the US Food & Drug Administration (FDA) or Japanese Pharmaceuticals and Medical Devices Agency (PMDA) to sell their drugs on the market must submit evidence to prove that their drug is safe and effective for its intended use. A team of physicians, statisticians, chemists, pharmacologists, and other clinical scientists review the clinical trial submission data and proposed labeling. If the review establishes that the there is sufficient statistical evidence to prove that the health benefits of the drug outweigh the risks, the drug is approved for sale.
The clinical trial submission package consists of tabulated data, analysis data, trial metadata, and statistical reports consisting of statistical tables, listings, and figures. In the case of the US FDA, the electronic common technical document (eCTD) is the standard format for submitting applications, amendments, supplements, and reports to the FDA’s Center for Biologics Evaluation and Research (CBER) and Center for Drug Evaluation and Research (CDER). For the FDA and Japanese PMDA, it’s a regulatory requirement to submit tabulated data in CDISC Standard Data Tabulation Model (SDTM), analysis data in CDISC Analysis Dataset Model (ADaM), and trial metadata in CDISC Define-XML (based on Operational Data Model (ODM)).
In this post, we demonstrate how we can use RStudio on Amazon SageMaker to create such regulatory submission deliverables. This post describes the clinical trial submission process, how we can ingest clinical trial research data, tabulate and analyze the data, and then create statistical reports—summary tables, data listings, and figures (TLF). This method can enable pharmaceutical customers to seamlessly connect to clinical data stored in their AWS environment, process it using R, and help accelerate the clinical trial research process.
Drug development process
The drug development process can broadly be divided into five major steps, as illustrated in the following figure.
It takes on an average 10–15 years and approximately USD $1–3 billion for one drug to receive a successful approval out of around 10,000 potential molecules. During the early phases of research (the drug discovery phase), promising drug candidates are identified, which move further to preclinical research. During the preclinical phase, researchers try to find out the toxicity of the drug by performing in vitro experiments in the lab and in vivo experiments on animals. After preclinical testing, drugs move on the clinical trial research phase, where they must be tested on humans to ascertain their safety and efficacy. The researchers design clinical trials and detail the study plan in the clinical trial protocol. They define the different clinical research phases—from small Phase 1 studies to determine drug safety and dosage, to a bigger Phase 2 trials to determine drug efficacy and side effects, to even bigger Phase 3 and 4 trials to determine drug efficacy, safety, and monitoring adverse reactions. After successful human clinical trials, the drug sponsor files a New Drug Application (NDA) to market the drug. The regulatory agencies review all the data, work with the sponsor on prescription labeling information, and approve the drug. After the drug’s approval, the regulatory agencies review post-market safety reports to ensure the complete product’s safety.
In 1997, Clinical Data Interchange Standards Consortium (CDISC), a global, non-profit organization comprising of pharmaceutical companies, CROs, biotech, academic institutions, healthcare providers, and government agencies, was started as volunteer group. CDISC has published data standards to streamline the flow of data from collection through submissions, and facilitated data interchange between partners and providers. CDISC has published the following standards:
- CDASH (Clinical Data Acquisition Standards Harmonization) – Standards for collected data
- SDTM (Study Data Tabulation Model) – Standards for submitting tabulated data
- ADaM (Analysis Data Model) – Standards for analysis data
- SEND (Standard for Exchange of Nonclinical Data) – Standards for nonclinical data
- PRM (Protocol Representation Model) – Standards for protocol
These standards can help trained reviewers analyze data more effectively and quickly using standard tools, thereby reducing drug approval times. It’s a regulatory requirement from the US FDA and Japanese PMDA to submit all tabulated data using the SDTM format.
R for clinical trial research submissions
SAS and R are two of the most used statistical analysis software used within the pharmaceutical industry. When development of the SDTM standards was started by CDISC, SAS was in almost universal use in the pharmaceutical industry and at the FDA. However, R is gaining tremendous popularity nowadays because it’s open source, and new packages and libraries are continuously added. Students primarily use R during their academics and research, and they take this familiarity with R to their jobs. R also offers support for emerging technologies such as advanced deep learning integrations.
Cloud providers such as AWS have now become the platform of choice for pharmaceutical customers to host their infrastructure. AWS also provides managed services such as SageMaker, which makes it effortless to create, train, and deploy machine learning (ML) models in the cloud. SageMaker also allows access to the RStudio IDE from anywhere via a web browser. This post details how statistical programmers and biostatisticians can ingest their clinical data into the R environment, how R code can be run, and how results are stored. We provide snippets of code that allow clinical trial data scientists to ingest XPT files into the R environment, create R data frames for SDTM and ADaM, and finally create TLF that can be stored in an Amazon Simple Storage Service (Amazon S3) object storage bucket.
RStudio on SageMaker
On November 2, 2021, AWS in collaboration with RStudio PBC announced the general availability of RStudio on SageMaker, the industry’s first fully managed RStudio Workbench IDE in the cloud. You can now bring your current RStudio license to easily migrate your self-managed RStudio environments to SageMaker in just a few simple steps. To learn more about this exciting collaboration, check out Announcing RStudio on Amazon SageMaker.
Along with the RStudio Workbench, the RStudio suite for R developers also offers RStudio Connect and RStudio Package Manager. RStudio Connect is designed to allow data scientists to publish insights, dashboards, and web applications. It makes it easy to share ML and data science insights from data scientists’ complicated work and put it in the hands of decision-makers. RStudio Connect also makes hosting and managing content simple and scalable for wide consumption.
Solution overview
In the following sections, we discuss how we can import raw data from a remote repository or S3 bucket in RStudio on SageMaker. It’s also possible to connect directly to Amazon Relational Database Service (Amazon RDS) and data warehouses like Amazon Redshift (see Connecting R with Amazon Redshift) directly from RStudio; however, this is outside the scope of this post. After data has been ingested from a couple of different sources, we process it and create R data frames for a table. Then we convert the table data frame into an RTF file and store the results back in an S3 bucket. These outputs can then potentially be used for regulatory submission purposes, provided the R packages used in the post have been validated for use for regulatory submissions by the customer.
Set up RStudio on SageMaker
For instructions on setting up RStudio on SageMaker in your environment, refer to Get started with RStudio on SageMaker. Make sure that the execution role of RStudio on SageMaker has access to download and upload data to the S3 bucket in which data is stored. To learn more about how to manage R packages and publish your analysis using RStudio on SageMaker, refer to Announcing Fully Managed RStudio on SageMaker for Data Scientists.
Ingest data into RStudio
In this step, we ingest data from various sources to make it available for our R session. We import data in SAS XPT format; however, the process is similar if you want to ingest data in other formats. One of the advantages of using RStudio on SageMaker is that if the source data is stored in your AWS accounts, then SageMaker can natively access the data using AWS Identity and Access Management (IAM) roles.
Access data stored in a remote repository
In this step, we import ADaM data from the FDA’s GitHub repository. We create a local directory called data
in the RStudio environment to store the data and download demographics data (dm.xpt
) from the remote repository. In this context, the local directory refers to a directory created on the your private Amazon EFS storage that is attached by default to your R session environment. See the following code:
When this step is complete, you can see dm.xpt
being downloaded by navigating to Files, data, dm.xpt.
Access data stored in Amazon S3
In this step, we download data stored in an S3 bucket in our account. We have copied contents from the FDA’s GitHub repository to the S3 bucket named aws-sagemaker-rstudio
for this example. See the following code:
When the step is complete, you can see pp.xpt
being downloaded by navigating to Files, data, pp.xpt.
Process XPT data
Now that we have SAS XPT files available in the R environment, we need to convert them into R data frames and process them. We use the haven
library to read XPT files. We merge CDISC SDTM datasets dm
and pp
to create ADPP dataset. Then we create a summary statistic table using the ADPP data frame. The summary table is then exported in RTF format.
First, XPT files are read using the read_xpt
function of the haven library. Then an analysis dataset is created using the sqldf
function of the sqldf
library. See the following code:
Then, an output data frame is created using functions from the Tplyr
and dplyr
libraries:
The output data frame is then stored as an RTF file in the output folder in the RStudio environment:
Upload outputs to Amazon S3
After the output has been generated, we put the data back in an S3 bucket. We can achieve this by creating a SageMaker session again, if a session isn’t active already, and uploading the contents of the output folder to an S3 bucket using the session$upload_data
function:
With these steps, we have ingested data, processed it, and uploaded the results to be made available for submission to regulatory authorities.
Clean up
To avoid incurring any unintended costs, you need to quit your current session. On the top right corner of the page, choose the power icon. This will automatically stop the underlying instance and therefore stop incurring any unintended compute costs.
Challenges
The post has outlined steps for ingesting raw data stored in an S3 bucket or from a remote repository. However, there are many other sources of raw data for a clinical trial, primarily eCRF (electronic case report forms) data stored in EDC (electronic data capture) systems such as Oracle Clinical, Medidata Rave, OpenClinica, or Snowflake; lab data; data from eCOA (clinical outcome assessment) and ePRO (electronic Patient-Reported Outcomes); real-world data from apps and medical devices; and electronic health records (EHRs) at the hospitals. Significant preprocessing is involved before this data can be made usable for regulatory submissions. Building connectors to various data sources and collecting them in a centralized data repository (CDR) or a clinical data lake, while maintaining proper access controls, poses significant challenges.
Another key challenge to overcome is that of regulatory compliance. The computer system used for creating regulatory submission outputs must be compliant with appropriate regulations, such as 21 CFR Part 11, HIPAA, GDPR, or any other GxP requirements or ICH guidelines. This translates to working in a validated and qualified environment with controls for access, security, backup, and auditability in place. This also means that any R packages that are used to create regulatory submission outputs must be validated before use.
Conclusion
In this post, we saw that the some of the key deliverables for an eCTD submission were CDISC SDTM, ADaM datasets, and TLF. This post outlined the steps needed to create these regulatory submission deliverables by first ingesting data from a couple of sources into RStudio on SageMaker. We then saw how we can process the ingested data in XPT format; convert it into R data frames to create SDTM, ADaM, and TLF; and then finally upload the results to an S3 bucket.
We hope that with the broad ideas laid out in the post, statistical programmers and biostatisticians can easily visualize the end-to-end process of loading, processing, and analyzing clinical trial research data into RStudio on SageMaker and use the learnings to define a custom workflow suited for your regulatory submissions.
Can you think of any other applications of using RStudio to help researchers, statisticians, and R programmers to make their lives easier? We would love to hear about your ideas! And if you have any questions, please share them in the comments section.
Resources
For more information, visit the following links:
- The Drug Development Process
- Get started with RStudio on Amazon SageMaker
- Announcing Fully Managed RStudio on Amazon SageMaker for Data Scientists
- SageMaker SDK documentation
- Productionizing R workload with Amazon SageMaker
About the authors
Rohit Banga is a Global Clinical Development Industry Specialist based out of London, UK. He is a biostatistician by training and helps Healthcare and LifeScience customers deploy innovative clinical development solutions on AWS. He is passionate about how data science, AI/ML, and emerging technologies can be used to solve real business problems within the Healthcare and LifeScience industry. In his spare time, Rohit enjoys skiing, BBQing, and spending time with family and friends.
Georgios Schinas is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in London and works closely with customers in UK and Ireland. Georgios helps customers design and deploy machine learning applications in production on AWS with a particular interest in MLOps practices and enabling customers to perform machine learning at scale. In his spare time, he enjoys traveling, cooking and spending time with friends and family.
Churn prediction using Amazon SageMaker built-in tabular algorithms LightGBM, CatBoost, TabTransformer, and AutoGluon-Tabular
Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. These algorithms and models can be used for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.
Customer churn is a problem faced by a wide range of companies, from telecommunications to banking, where customers are typically lost to competitors. It’s in a company’s best interest to retain existing customers rather than acquire new customers because it usually costs significantly more to attract new customers. Mobile operators have historical records in which customers continued using the service or ultimately ended up churning. We can use this historical information of a mobile operator’s churn to train an ML model. After training this model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have it predict whether this customer is going to churn or not.
In this post, we train and deploy four recently released SageMaker algorithms—LightGBM, CatBoost, TabTransformer, and AutoGluon-Tabular—on a churn prediction dataset. We use SageMaker Automatic Model Tuning (a tool for hyperparameter optimization) to find the best hyperparameters for each model, and compare their performance on a holdout test dataset to select the optimal one.
You can also use this solution as a template to search over a collection of state-of-the-art tabular algorithms and use hyperparameter optimization to find the best overall model. You can easily replace the example dataset with your own to solve real business problems you’re interested in. If you want to jump straight into the SageMaker SDK code we go through in this post, you can refer to the following sample Jupyter notebook.
Benefits of SageMaker built-in algorithms
When selecting an algorithm for your particular type of problem and data, using a SageMaker built-in algorithm is the easiest option, because doing so comes with the following major benefits:
- Low coding – The built-in algorithms require little coding to start running experiments. The only inputs you need to provide are the data, hyperparameters, and compute resources. This allows you to run experiments more quickly, with less overhead for tracking results and code changes.
- Efficient and scalable algorithm implementations – The built-in algorithms come with parallelization across multiple compute instances and GPU support right out of the box for all applicable algorithms. If you have a lot of data with which to train your model, most built-in algorithms can easily scale to meet the demand. Even if you already have a pre-trained model, it may still be easier to use its corollary in SageMaker and input the hyperparameters you already know rather than port it over and write a training script yourself.
- Transparency – You’re the owner of the resulting model artifacts. You can take that model and deploy it on SageMaker for several different inference patterns (check out all the available deployment types) and easy endpoint scaling and management, or you can deploy it wherever else you need it.
Data visualization and preprocessing
First, we gather our customer churn dataset. It’s a relatively small dataset with 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes range from the US state where the customer resides, to the number of calls they placed to customer service, to the cost they are billed for daytime calls. We’re trying to predict whether the customer will churn or not, which is a binary classification problem. The following is a subset of those features look like, with the label as the last column.
The following are some insights for each column, specifically the summary statistics and histogram of selected features.
We then preprocess the data, split it into training, validation, and test sets, and upload the data to Amazon Simple Storage Service (Amazon S3).
Automatic model tuning of tabular algorithms
Hyperparameters control how our underlying algorithms operate and influence the performance of the model. Those hyperparameters can be the number of layers, learning rate, weight decay rate, and dropout for neural network-based models, or the number of leaves, iterations, and maximum tree depth for tree ensemble models. To select the best model, we apply SageMaker automatic model tuning to each of the four trained SageMaker tabular algorithms. You need only select the hyperparameters to tune and a range for each parameter to explore. For more information about automatic model tuning, refer to Amazon SageMaker Automatic Model Tuning: Using Machine Learning for Machine Learning or Amazon SageMaker automatic model tuning: Scalable gradient-free optimization.
Let’s see how this works in practice.
LightGBM
We start by running automatic model tuning with LightGBM, and adapt that process to the other algorithms. As is explained in the post Amazon SageMaker JumpStart models and algorithms now available via API, the following artifacts are required to train a pre-built algorithm via the SageMaker SDK:
- Its framework-specific container image, containing all the required dependencies for training and inference
- The training and inference scripts for the selected model or algorithm
We first retrieve these artifacts, which depend on the model_id
(lightgbm-classification-model
in this case) and version:
We then get the default hyperparameters for LightGBM, set some of them to selected fixed values such as number of boosting rounds and evaluation metric on the validation data, and define the value ranges we want to search over for others. We use the SageMaker parameters ContinuousParameter
and IntegerParameter
for this:
Finally, we create a SageMaker Estimator, feed it into a HyperarameterTuner, and start the hyperparameter tuning job with tuner.fit()
:
The max_jobs
parameter defines how many total jobs will be run in the automatic model tuning job, and max_parallel_jobs
defines how many concurrent training jobs should be started. We also define the objective to “Maximize”
the model’s AUC (area under the curve). To dive deeper into the available parameters exposed by HyperParameterTuner
, refer to HyperparameterTuner.
Check out the sample notebook to see how we proceed to deploy and evaluate this model on the test set.
CatBoost
The process for hyperparameter tuning on the CatBoost algorithm is the same as before, although we need to retrieve model artifacts under the ID catboost-classification-model
and change the range selection of hyperparameters:
TabTransformer
The process for hyperparameter tuning on the TabTransformer model is the same as before, although we need to retrieve model artifacts under the ID pytorch-tabtransformerclassification-model
and change the range selection of hyperparameters.
We also change the training instance_type
to ml.p3.2xlarge
. TabTransformer is a model recently derived from Amazon research, which brings the power of deep learning to tabular data using Transformer models. To train this model in an efficient manner, we need a GPU-backed instance. For more information, refer to Bringing the power of deep learning to data in tables.
AutoGluon-Tabular
In the case of AutoGluon, we don’t run hyperparameter tuning. This is by design, because AutoGluon focuses on ensembling multiple models with sane choices of hyperparameters and stacking them in multiple layers. This ends up being more performant than training one model with the perfect selection of hyperparameters and is also computationally cheaper. For details, check out AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.
Therefore, we switch the model_id
to autogluon-classification-ensemble
, and only fix the evaluation metric hyperparameter to our desired AUC score:
Instead of calling tuner.fit()
, we call estimator.fit()
to start a single training job.
Benchmarking the trained models
After we deploy all four models, we send the full test set to each endpoint for prediction and calculate accuracy, F1, and AUC metrics for each (see code in the sample notebook). We present the results in the following table, with an important disclaimer: results and relative performance between these models will depend on the dataset you use for training. These results are representative, and even though the tendency for certain algorithms to perform better is based on relevant factors (for example, AutoGluon intelligently ensembles the predictions of both LightGBM and CatBoost models behind the scenes), the balance in performance might change given a different data distribution.
. | LightGBM with Automatic Model Tuning | CatBoost with Automatic Model Tuning | TabTransformer with Automatic Model Tuning | AutoGluon-Tabular |
Accuracy | 0.8977 | 0.9622 | 0.9511 | 0.98 |
F1 | 0.8986 | 0.9624 | 0.9517 | 0.98 |
AUC | 0.9629 | 0.9907 | 0.989 | 0.9979 |
Conclusion
In this post, we trained four different SageMaker built-in algorithms to solve the customer churn prediction problem with low coding effort. We used SageMaker automatic model tuning to find the best hyperparameters to train these algorithms with, and compared their performance on a selected churn prediction dataset. You can use the related sample notebook as a template, replacing the dataset with your own to solve your desired tabular data-based problem.
Make sure to try these algorithms on SageMaker, and check out sample notebooks on how to use other built-in algorithms available on GitHub.
About the authors
Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.
João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is mostly focused on NLP use-cases and helping customers optimize Deep Learning model training and deployment. He is also an active proponent of low-code ML solutions and ML-specialized hardware.
FindIt: Generalized Object Localization with Natural Language Queries
Natural language enables flexible descriptive queries about images. The interaction between text queries and images grounds linguistic meaning in the visual world, facilitating a better understanding of object relationships, human intentions towards objects, and interactions with the environment. The research community has studied object-level visual grounding through a range of tasks, including referring expression comprehension, text-based localization, and more broadly object detection, each of which require different skills in a model. For example, object detection seeks to find all objects from a predefined set of classes, which requires accurate localization and classification, while referring expression comprehension localizes an object from a referring text and often requires complex reasoning on prominent objects. At the intersection of the two is text-based localization, in which a simple category-based text query prompts the model to detect the objects of interest.
Due to their dissimilar task properties, referring expression comprehension, detection, and text-based localization are mostly studied through separate benchmarks with most models only dedicated to one task. As a result, existing models have not adequately synthesized information from the three tasks to achieve a more holistic visual and linguistic understanding. Referring expression comprehension models, for instance, are trained to predict one object per image, and often struggle to localize multiple objects, reject negative queries, or detect novel categories. In addition, detection models are unable to process text inputs, and text-based localization models often struggle to process complex queries that refer to one object instance, such as “Left half sandwich.” Lastly, none of the models can generalize sufficiently well beyond their training data and categories.
To address these limitations, we are presenting “FindIt: Generalized Localization with Natural Language Queries” at ECCV 2022. Here we propose a unified, general-purpose and multitask visual grounding model, called FindIt, that can flexibly answer different types of grounding and detection queries. Key to this architecture is a multi-level cross-modality fusion module that can perform complex reasoning for referring expression comprehension and simultaneously recognize small and challenging objects for text-based localization and detection. In addition, we discover that a standard object detector and detection losses are sufficient and surprisingly effective for all three tasks without the need for task-specific design and losses common in existing works. FindIt is simple, efficient, and outperforms alternative state-of-the-art models on the referring expression comprehension and text-based localization benchmarks, while being competitive on the detection benchmark.
![]() |
FindIt is a unified model for referring expression comprehension (col. 1), text-based localization (col. 2), and the object detection task (col. 3). FindIt can respond accurately when tested on object types/classes not known during training, e.g. “Find the desk” (col. 4). Compared to existing baselines (MattNet and GPV), FindIt can perform these tasks well and in a single model. |
Multi-level Image-Text Fusion
Different localization tasks are created with different semantic understanding objectives. For example, because the referring expression task primarily references prominent objects in the image rather than small, occluded or faraway objects, low resolution images generally suffice. In contrast, the detection task aims to detect objects with various sizes and occlusion levels in higher resolution images. Apart from these benchmarks, the general visual grounding problem is inherently multiscale, as natural queries can refer to objects of any size. This motivates the need for a multi-level image-text fusion model for efficient processing of higher resolution images over different localization tasks.
The premise of FindIt is to fuse the higher level semantic features using more expressive transformer layers, which can capture all-pair interactions between image and text. For the lower-level and higher-resolution features, we use a cheaper dot-product fusion to save computation and memory cost. We attach a detector head (e.g., Faster R-CNN) on top of the fused feature maps to predict the boxes and their classes.
![]() |
FindIt accepts an image and a query text as inputs, and processes them separately in image/text backbones before applying the multi-level fusion. We feed the fused features to Faster R-CNN to predict the boxes referred to by the text. The feature fusion uses more expressive transformers at higher levels and cheaper dot-product at the lower levels. |
Multitask Learning
Apart from the multi-level fusion described above, we adapt the text-based localization and detection tasks to take the same inputs as the referring expression comprehension task. For the text-based localization task, we generate a set of queries over the categories present in the image. For any present category, the text query takes the form “Find the [object
],” where [object
] is the category name. The objects corresponding to that category are labeled as foreground and the other objects as background. Instead of using the aforementioned prompt, we use a static prompt for the detection task, such as “Find all the objects.”. We found that the specific choice of prompts is not important for text-based localization and detection tasks.
After adaptation, all tasks in consideration share the same inputs and outputs — an image input, a text query, and a set of output bounding boxes and classes. We then combine the datasets and train on the mixture. Finally, we use the standard object detection losses for all tasks, which we found to be surprisingly simple and effective.
Evaluation
We apply FindIt to the popular RefCOCO benchmark for referring expression comprehension tasks. When only the COCO and RefCOCO dataset is available, FindIt outperforms the state-of-the-art-model on all tasks. In the settings where external datasets are allowed, FindIt sets a new state of the art by using COCO and all RefCOCO splits together (no other datasets). On the challenging Google and UMD splits, FindIt outperforms the state of the art by a 10% margin, which, taken together, demonstrate the benefits of multitask learning.
![]() |
Comparison with the state of the art on the popular referring expression benchmark. FindIt is superior on both the COCO and unconstrained settings (additional training data allowed). |
On the text-based localization benchmark, FindIt achieves 79.7%, higher than the GPV (73.0%), and Faster R-CNN baselines (75.2%). Please refer to the paper for more quantitative evaluation.
We further observe that FindIt generalizes better to novel categories and super-categories in the text-based localization task compared to competitive single-task baselines on the popular COCO and Objects365 datasets, shown in the figure below.
Efficiency
We also benchmark the inference times on the referring expression comprehension task (see Table below). FindIt is efficient and comparable with existing one-stage approaches while achieving higher accuracy. For fair comparison, all running times are measured on one GTX 1080Ti GPU.
Model | Image Size | Backbone | Runtime (ms) | |||
MattNet | 1000 | R101 | 378 | |||
FAOA | 256 | DarkNet53 | 39 | |||
MCN | 416 | DarkNet53 | 56 | |||
TransVG | 640 | R50 | 62 | |||
FindIt (Ours) | 640 | R50 | 107 | |||
FindIt (Ours) | 384 | R50 | 57 |
Conclusion
We present Findit, which unifies referring expression comprehension, text-based localization, and object detection tasks. We propose multi-scale cross-attention to unify the diverse localization requirements of these tasks. Without any task-specific design, FindIt surpasses the state of the art on referring expression and text-based localization, shows competitive performance on detection, and generalizes better to out-of-distribution data and novel classes. All of these are accomplished in a single, unified, and efficient model.
Acknowledgements
This work is conducted by Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, and Anelia Angelova. We would like to thank Ashish Vaswani, Prajit Ramachandran, Niki Parmar, David Luan, Tsung-Yi Lin, and other colleagues at Google Research for their advice and helpful discussions. We would like to thank Tom Small for preparing the animation.
No Hang Ups With Hangul: KT Trains Smart Speakers, Customer Call Centers With NVIDIA AI
South Korea’s most popular AI voice assistant, GiGA Genie, converses with 8 million people each day.
The AI-powered speaker from telecom company KT can control TVs, offer real-time traffic updates and complete a slew of other home-assistance tasks based on voice commands. It has mastered its conversational skills in the highly complex Korean language thanks to large language models (LLMs) — machine learning algorithms that can recognize, understand, predict and generate human languages based on huge text datasets.
The company’s models are built using the NVIDIA DGX SuperPOD data center infrastructure platform and the NeMo Megatron framework for training and deploying LLMs with billions of parameters.
The Korean language, known as Hangul, reliably shows up in lists of the world’s most challenging languages. It includes four types of compound verbs, and words are often composed of two or more roots.
KT — South Korea’s leading mobile operator with over 22 million subscribers — improved the smart speaker’s understanding of such words by developing LLMs with around 40 billion parameters. And through integration with Amazon Alexa, GiGA Genie can converse with users in English, too.
“With transformer-based models, we’ve achieved significant quality improvements for the GiGA Genie smart speaker, as well as our customer services platform AI Contact Center, or AICC,” said Hwijung Ryu, LLM development team lead at KT.
AICC is an all-in-one, cloud-based platform that offers AI voice agents and other customer service-related applications.
It can receive calls and provide requested information — or quickly connect customers to human agents for answers to more detailed inquiries. AICC without human intervention manages more than 100,000 calls daily across Korea, according to Ryu.
“LLMs enable GiGA Genie to gain better language understanding and generate more human-like sentences, and AICC to reduce consultation times by 15 seconds as it summarizes and classifies inquiry types more quickly,” he added.
Training Large Language Models
Developing LLMs can be an expensive, time-consuming process that requires deep technical expertise and full-stack technology investments.
The NVIDIA AI platform simplified and sped up this process for KT.
“We trained our LLM models more effectively with NVIDIA DGX SuperPOD’s powerful performance — as well as NeMo Megatron’s optimized algorithms and 3D parallelism techniques,” Ryu said. “NeMo Megatron is continuously adopting new features, which is the biggest advantage we think it offers in improving our model accuracy.”
3D parallelism — a distributed training method in which an extremely large-scale deep learning model is partitioned across multiple devices — was crucial for training KT’s LLMs. NeMo Megatron enabled the team to easily accomplish this task with the highest throughput, according to Ryu.
“We considered using other platforms, but it was difficult to find an alternative that provides full-stack environments — from the hardware level to the inference level,” he added. “NVIDIA also provides exceptional expertise from product, engineering teams and more, so we easily solved several technical issues.”
Using hyperparameter optimization tools in NeMo Megatron, KT trained its LLMs 2x faster than with other frameworks, Ryu said. These tools allow users to automatically find the best configurations for LLM training and inference, easing and speeding the development and deployment process.
KT is also planning to use the NVIDIA Triton Inference Server to provide an optimized real-time inference service, as well as NVIDIA Base Command Manager to easily monitor and manage hundreds of nodes in its AI cluster.
“Thanks to LLMs, KT can release competitive products faster than ever,” Ryu said. “We also believe that our technology can drive innovation from other companies, as it can be used to improve their value and create innovative products.”
KT plans to release more than 20 natural language understanding and natural language generation APIs for developers in November. The application programming interfaces can be used for tasks including document summarization and classification, emotion recognition, and filtering of potentially inappropriate content.
Learn more about breakthrough technologies for the era of AI and the metaverse at NVIDIA GTC, running online through Thursday, Sept. 22.
Watch NVIDIA founder and CEO Jensen Huang’s keynote address in replay below:
The post No Hang Ups With Hangul: KT Trains Smart Speakers, Customer Call Centers With NVIDIA AI appeared first on NVIDIA Blog.
New NVIDIA DGX System Software and Infrastructure Solutions Supercharge Enterprise AI
At GTC today, NVIDIA unveiled a number of updates to its DGX portfolio to power new breakthroughs in enterprise AI development.
NVIDIA DGX H100 systems are now available for order. These infrastructure building blocks support NVIDIA’s full-stack enterprise AI solutions.
With 32 petaflops of performance at FP8 precision, NVIDIA DGX H100 delivers a leap in efficiency for enterprise AI development. It offers 3x lower total cost of ownership and 3.5x more energy efficiency compared to the previous generation.
New NVIDIA Base Command software, which simplifies and speeds AI development, powers every DGX system — from single nodes to DGX SuperPODs.
Also unveiled was NVIDIA DGX BasePOD — the evolution of DGX POD — which makes enterprise data-center AI deployments simpler and faster for IT teams to acquire, deploy and manage.
Many of the world’s AI leaders are building technological breakthroughs — from self-driving cars to voice assistants — using NVIDIA DGX systems and software, and the pace of innovation is not slowing down.
New NVIDIA Base Command Features
NVIDIA Base Command provides enterprise-grade orchestration and cluster management, and it now features a full software stack for maximizing AI developer productivity, IT manageability and workload performance.
The workflow management features of Base Command now include support for on-premises DGX SuperPOD environments, enabling businesses to gain centralized control of AI development projects with simplified collaboration for project teams, and integrated monitoring and reporting dashboards.
Base Command works with the NVIDIA AI Enterprise software suite, which is now included with every DGX system. The NVIDIA AI software enables end-to-end AI development and deployment with supported AI and data science tools, optimized frameworks and pretrained models.
Additionally, it offers enterprise-workflow management and MLOps integrations with DGX-Ready Software providers Domino Data Lab, Run.ai, Weights & Biases and NVIDIA Inception member Rescale. It also includes libraries that optimize and accelerate compute, storage and network infrastructure — while ensuring maximized system uptime, security and reliability.
New DGX BasePOD Reference Architecture
DGX BasePOD provides a reference architecture for DGX systems that incorporates design best practices for integrating compute, networking, storage and software.
Customers are already using NVIDIA DGX POD to power the development of a broad range of enterprise applications. DGX BasePOD builds on the success of DGX POD with new industry solutions targeting the biggest AI opportunities, including natural language processing, healthcare and life sciences, and fraud detection.
Delivered as fully integrated, ready-to-deploy offerings through the NVIDIA Partner Network, DGX BasePOD solutions range in size, from two to hundreds of DGX systems, with certified high-performance storage from NVIDIA DGX storage technology partners including DDN, Dell, NetApp, Pure Storage, VAST Data and WEKA.
Leaders Power AI Breakthroughs With DGX Systems
Enterprises around the world choose NVIDIA DGX systems to power their most advanced AI workloads. Among the AI innovators developing mission-critical AI capabilities on DGX A100 systems:
- ML research and product lab Adept is building an AI teammate powered by a large language model prototyped on NVIDIA DGX Foundry, and then scaled with NVIDIA A100 GPUs and NVIDIA Megatron on Oracle Cloud Infrastructure.
- Hyundai Motor Group is using a 40-node DGX SuperPOD to explore hyperscale AI workloads.
- Telecom company KT is developing a LLM with around 40 billion parameters for a variety of Korean-language applications, including the GiGA Genie smart speaker, using the NVIDIA NeMo Megatron framework, NVIDIA DGX SuperPOD and NVIDIA Base Command software.
- The University of Wisconsin-Madison is quickly bringing AI to medical imaging devices using NVIDIA DGX systems with the Flywheel research platform and the NVIDIA Clara healthcare application framework. Using the NVIDIA Federated Learning Application Runtime Environment, or NVIDIA FLARE, in collaboration with other hospitals, the university is securely training AI models on DGX systems for medical imaging, annotation and classification.
Learn more about the AI breakthroughs powered by NVIDIA DGX systems by watching NVIDIA founder and CEO Jensen Huang’s GTC keynote in replay. And join the GTC session, “Designing Your AI Center of Excellence,” with Charlie Boyle, vice president of DGX systems at NVIDIA.
The post New NVIDIA DGX System Software and Infrastructure Solutions Supercharge Enterprise AI appeared first on NVIDIA Blog.