Tune ML models for additional objectives like fairness with SageMaker Automatic Model Tuning


Model tuning is the experimental process of finding the optimal parameters and configurations for a machine learning (ML) model that result in the best possible desired outcome with a validation dataset. Single objective optimization with a performance metric is the most common approach for tuning ML models. However, in addition to predictive performance, there may be multiple objectives which need to be considered for certain applications. For example,

  1. Fairness – The aim here is to encourage models to mitigate bias in model outcomes between certain sub-groups in the data, especially when humans are subject to algorithmic decisions. For example, a credit lending application should not only be accurate but also unbiased to different population sub-groups.
  2. Inference time – The aim here is to reduce the inference time during model invocation. For example, a speech recognition system must not only understand different dialects of the same language accurately, but also operate within a specified time limit that is acceptable by the business process.
  3. Energy efficiency – The aim here is to train smaller energy-efficient models. For example, neural network models are compressed for usage on mobile devices and thus naturally reduce their energy consumption by reducing the number of FLOPS required for a pass through the network.

Multi-objective optimization methods represent different trade-offs between the desired metrics. This can involve finding a global minimum of an objective function subject to a set of constraints on different metrics being simultaneously satisfied.
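As a rough illustration of the constrained view, the sketch below picks the most accurate candidate among those that satisfy a limit on a second metric. The candidate names, scores, and the 0.40 threshold are all hypothetical values chosen for illustration; they are not results from this post.

```python
# Hypothetical candidates: (name, accuracy, bias). Values are illustrative only.
candidates = [
    ("job-a", 0.76, 0.60),
    ("job-b", 0.73, 0.35),
    ("job-c", 0.70, 0.20),
    ("job-d", 0.65, 0.10),
]


def best_under_constraint(candidates, max_bias):
    """Return the most accurate candidate whose bias is at most max_bias.

    Returns None when no candidate satisfies the constraint.
    """
    feasible = [c for c in candidates if c[2] <= max_bias]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c[1])


# Single-objective choice: highest accuracy, bias ignored.
unconstrained = max(candidates, key=lambda c: c[1])
# Constrained choice: highest accuracy among candidates with bias <= 0.40.
constrained = best_under_constraint(candidates, max_bias=0.40)

print(unconstrained[0])  # job-a
print(constrained[0])    # job-b
```

Tightening the threshold trades accuracy for lower bias, which is the kind of trade-off a multi-objective method makes explicit.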

Amazon SageMaker Automatic Model Tuning (AMT) finds the best version of a model by running many SageMaker training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in the best-performing model, as measured by a metric that you define (for example, accuracy, AUC, or recall). SageMaker AMT supports several search strategies, including Bayesian optimization, random search, grid search, and Hyperband.

Amazon SageMaker Clarify can detect potential bias during data preparation, after model training, and in your deployed model. Currently, it offers 21 different metrics to choose from. These metrics are also openly available with the smclarify python package and github repository here. You can learn more about measuring bias with metrics from Amazon SageMaker Clarify at Learn how Amazon SageMaker Clarify helps detect bias.

In this post, we show you how to automatically tune an ML model with Amazon SageMaker AMT for both accuracy and fairness objectives by creating a single combined metric. We demonstrate a financial services use case of credit risk prediction with an accuracy metric of Area Under the Curve (AUC) to measure performance and a bias metric of Difference in Positive Proportions in Predicted Labels (DPPL) from SageMaker Clarify to measure the imbalance in model predictions for different demographic groups. The code for this example is available on GitHub.

Fairness in Credit Risk prediction

The credit lending industry relies heavily on credit scores for processing loan applications. Generally, credit scores reflect an applicant’s history of borrowing and paying back money, and lenders refer to them when determining an individual’s creditworthiness. Payment firms and banks are interested in building systems that can help identify the risk associated with a particular application and provide competitive credit products. Machine learning (ML) models can be used to build such a system that processes historical applicant data and predicts the credit risk profile. Data can include the financial and employment history of the applicant, their demographics, and the new credit/loan context. There is always some statistical uncertainty with any model that predicts whether a particular applicant will default in the future. The systems need to provide a tradeoff between rejecting applications that might default over time and accepting applications that are eventually creditworthy.

Business owners of such a system need to ensure the validity and quality of the models as per existing and upcoming regulatory compliance requirements. They are obligated to treat customers fairly and provide transparency in their decision making. They might want to ensure that positive model predictions are not imbalanced across different groups (for example, gender, race, ethnicity, immigration status and others). Once the required data is collected, the ML model training typically optimizes for prediction performance as a primary objective with a metric like classification accuracy or AUC score. Alternatively, a model with a given performance objective can be constrained with a fairness metric to ensure certain requirements are maintained. One such technique to constrain the model is fairness-aware hyperparameter tuning. By applying these strategies, the best candidate model can have lower bias than the unconstrained model while maintaining a high predictive performance.

Scenario - Credit Risk

In the scenario depicted in this schematic,

  1. The ML model is built with historical customer credit profile data. The model training and hyperparameter tuning process maximizes for multiple objectives including classification accuracy and fairness. The model is deployed to an existing business process in a production system.
  2. A new customer credit profile is evaluated for credit risk. If low risk, it can go through an automated process. High risk applications could include human review before a final acceptance or rejection decision.

The decisions and metrics gathered during design and development, deployment and operations can be documented with SageMaker Model Cards and shared with the stakeholders.

This use case demonstrates how to reduce model bias against a specific group by fine-tuning hyperparameters for a combined objective metric of both accuracy and fairness with SageMaker Automatic Model Tuning. We use the South German Credit dataset (South German Credit Data Set).

The applicant data can be split into the following categories:

  1. Demographic
  2. Financial Data
  3. Employment History
  4. Loan purpose

Credit Risk Data Structure

In this example, we specifically look at the ‘Foreign worker’ demographic and tune a model that predicts credit application decisions with high accuracy and low bias against that particular subgroup.

There are various bias metrics that can be used to evaluate fairness of the system with respect to specific sub-groups in the data. Here, we use the absolute value of Difference in Positive Proportions in Predicted Labels (DPPL) from SageMaker Clarify. In simple terms, DPPL measures the difference in positive class (good credit) assignments between non-foreign workers and foreign workers.

For example, if 4.5% of all foreign workers are assigned the positive label by the model, and 13.7% of all non-foreign workers are assigned the positive label, then DPPL = 0.137 – 0.045 = 0.092.
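The arithmetic above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than the SageMaker Clarify implementation: the `dppl` helper and the toy group sizes (1,000 applicants per group) are assumptions made so that the proportions match the 13.7% and 4.5% in the example.

```python
def dppl(predicted_positive, is_sensitive_group):
    """Difference in positive proportions in predicted labels.

    predicted_positive: list of bools, True where the model predicts the positive class.
    is_sensitive_group: list of bools, True for members of the sensitive group.
    Returns (positive rate of the other group) - (positive rate of the sensitive group).
    """
    other = [p for p, s in zip(predicted_positive, is_sensitive_group) if not s]
    sensitive = [p for p, s in zip(predicted_positive, is_sensitive_group) if s]
    if not other or not sensitive:
        raise ValueError("both groups need at least one member")
    return sum(other) / len(other) - sum(sensitive) / len(sensitive)


# Toy data: 1,000 non-foreign workers (137 predicted positive)
# followed by 1,000 foreign workers (45 predicted positive).
predicted = [True] * 137 + [False] * 863 + [True] * 45 + [False] * 955
foreign_worker = [False] * 1000 + [True] * 1000

value = dppl(predicted, foreign_worker)
print(round(value, 3))  # 0.092
```

A value of 0 would mean both groups receive the positive label at the same rate; taking the absolute value, as this post does, ignores which group is favored.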

Solution Architecture

The figure below displays a high level overview of the architecture of an Automatic Model Tuning job with XGBoost on Amazon SageMaker.

High Level Solution Architecture

In the solution, SageMaker Processing preprocesses the training dataset from Amazon S3. Amazon SageMaker Automatic Model Tuning instantiates multiple SageMaker training jobs with their associated EC2 instances and EBS volumes. The container for the algorithm (XGBoost) is loaded from Amazon ECR in each job. SageMaker AMT finds the best version of a model by running many training jobs on the preprocessed dataset using the specified algorithm script and range of hyperparameters. The output metrics are logged in Amazon CloudWatch for monitoring.

The hyperparameters we are tuning in this use case are as follows:

  1. eta – Step size shrinkage used in updates to prevent overfitting.
  2. min_child_weight – Minimum sum of instance weight (hessian) needed in a child.
  3. gamma – Minimum loss reduction required to make a further partition on a leaf node of the tree.
  4. max_depth – Maximum depth of a tree.

The definition of these hyperparameters and others available with SageMaker AMT can be found here.

First, we demonstrate a baseline scenario of a single performance objective metric for tuning hyperparameters with Automatic Model Tuning. Then, we demonstrate the optimized scenario of a multi-objective metric specified as a combination of performance metric and fairness metric.

Single Metric Hyperparameter Tuning (Baseline)

A tuning job can use one of several metrics to evaluate the individual training jobs. As shown in the code snippet below, we specify the single objective metric as objective_metric_name. The hyperparameter tuning job returns the training job that gave the best value for the chosen objective metric.

In this baseline scenario, we are tuning for Area Under Curve (AUC) as seen below. It is important to note that we are only optimizing AUC, and not optimizing for other metrics such as fairness.

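from time import gmtime, strftime

import sagemaker
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

# Search ranges for the four XGBoost hyperparameters being tuned
hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                         'min_child_weight': IntegerParameter(1, 10),
                         'gamma': IntegerParameter(1, 5),
                         'max_depth': IntegerParameter(1, 10)}

# Single objective: AUC on the validation set
objective_metric_name = 'validation:auc'

# estimator, train_data_path and val_data_path are assumed to be defined earlier.
# The arguments after `estimator` are cut off in the source; max_jobs=100 is taken
# from the surrounding text, and the rest of the call is an assumption.
tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=100)

tuning_job_name = "xgb-tuner-{}".format(strftime("%d-%H-%M-%S", gmtime()))
inputs = {'train': train_data_path, 'validation': val_data_path}
tuner.fit(inputs, job_name=tuning_job_name)
tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)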

In this context, the max jobs setting specifies the maximum number of training jobs that the tuning job runs; the best training job is then selected from among them.

Multi Objective Hyperparameter Tuning (Fairness Optimized)

We want to optimize multiple objective metrics with hyperparameter tuning as described in this paper. However, SageMaker AMT still accepts only a single metric as input.

To address this challenge, we express multiple metrics as a single metric function and optimize this metric:

  • $\max\, M(y_1, y_2, \theta)$
  • $y_1, y_2$ are different metrics, for example AUC score and DPPL.
  • $M(\cdot, \cdot, \theta)$ is a scalarization function parameterized by a fixed parameter $\theta$.
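One common choice of scalarization function is a normalized weighted sum. As a sketch, the form below matches the 3:1 weighting this post applies to AUC ($y_1$) and DPPL ($y_2$); the notation $\theta_1, \theta_2$ for the individual weights is an assumption introduced here, and $(1 - y_2)$ appears because DPPL is to be minimized while the combined metric is maximized.

```latex
M(y_1, y_2; \theta) = \frac{\theta_1\, y_1 + \theta_2\,(1 - y_2)}{\theta_1 + \theta_2},
\qquad \theta = (\theta_1, \theta_2) = (3, 1)
```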

A higher weight favors that particular objective in model tuning. Weights may vary from case to case, and you might need to try different weights for your use case. In this example, the weights for AUC and DPPL have been set heuristically. Let’s walk through what this looks like in code. The training job returns a single metric based on a combination function of the AUC score for performance and DPPL for fairness. The hyperparameter ranges for multiple objectives are the same as for the single objective. We pass the validation metric as “auc”, but behind the scenes we return the result of the combined metric function, which is the last of the functions listed below:

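Here is the tuner setup for multi-objective optimization:

objective_metric_name = 'validation:auc'
# The arguments after `estimator` are cut off in the source; max_jobs=100 is taken
# from the surrounding text, and the rest of the call is an assumption.
tuner = HyperparameterTuner(estimator, objective_metric_name, hyperparameter_ranges, max_jobs=100)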

Here is the function for computing AUC score:

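from sklearn.metrics import roc_auc_score

def eval_auc_score(predt, dtrain):
    # Threshold predicted probabilities at 0.5 to get hard class labels
    fY = [1 if p > 0.5 else 0 for p in predt]
    # True labels stored in the XGBoost DMatrix
    y = dtrain.get_label()
    # Note: AUC is computed on the thresholded labels, not the raw scores
    auc_score = roc_auc_score(y, fY)
    return auc_score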

Here is the function for computing DPPL score:

import pandas as pd

def eval_dppl(predt, dtrain):
    # dmatrix_to_numpy and DPPL are not defined in this excerpt; they are
    # presumably a notebook helper and the SageMaker Clarify metric, respectively.
    dtrain_np = dmatrix_to_numpy(dtrain)
    # groups: an np array containing 1 or 2
    groups = dtrain_np[:, -1]
    # sensitive_facet_index: boolean column indicating sensitive group
    sensitive_facet_index = pd.Series(groups - 1, dtype=bool)
    # positive_label_index: boolean column indicating positive predicted labels
    positive_label_index = pd.Series(predt > 0.5)
    return abs(DPPL(predt, sensitive_facet_index, positive_label_index))

Here is the function for the Combined Metric:

def eval_combined_metric(predt, dtrain):
    auc_score = eval_auc_score(predt, dtrain)
    # Lowercase name avoids shadowing the DPPL metric function
    dppl = eval_dppl(predt, dtrain)
    # Assign weight of 3 to AUC and 1 to DPPL
    # Maximize (1-DPPL) for the purpose of minimizing DPPL
    combined_metric = ((3 * auc_score) + (1 - dppl)) / 4
    print("DPPL, AUC Score, Combined Metric: ", dppl, auc_score, combined_metric)
    # Reported under the name "auc" so that AMT tracks it as validation:auc
    return "auc", combined_metric

Experiments & Results

Synthetic data generation for bias dataset

The original South German Credit dataset contained 1000 records, and we generated 100 more records synthetically to create a dataset where the bias in model predictions disfavors Foreign Workers. This is done to simulate bias that could manifest itself in the real world. New records of foreign workers labeled as “bad credit” applicants were extrapolated from existing foreign workers with the same label.

There are many libraries and techniques for creating synthetic data; we use Synthetic Data Vault (SDV).

The following code snippet shows how synthetic data is generated with SDV from the South German Credit Data Set:

# Assumes the pre-1.0 SDV API, where GaussianCopula lives in sdv.tabular
from sdv.tabular import GaussianCopula

# Parameters for generated data
# How many rows of synthetic data
num_rows = 100

# Select all foreign workers who were accepted (foreign_worker value 1 credit_risk 1)
ForeignWorkerData = training_data.loc[(training_data['foreign_worker'] == 1) & (training_data['credit_risk'] == 1)]

# Fit Foreign Worker data to SDV model
model = GaussianCopula()
model.fit(ForeignWorkerData)

# Generate Synthetic foreign worker data based on rows stated
SynthForeignWorkers = model.sample(num_rows)

We generated 100 new synthetic records of Foreign Workers based on Foreign Workers who were accepted in the original dataset. We now take those records and change the “credit_risk” label to 0 (bad credit). This unfairly marks these Foreign Workers as bad credit, which inserts bias into our dataset.

SynthForeignWorkers.loc[SynthForeignWorkers['credit_risk'] == 1, 'credit_risk'] = 0

We explore the bias in the dataset through the graphs below.
Credit Risk Ratio for Foreign Workers

The pie graph on top shows the percentage of Non-Foreign Workers labeled as good credit or bad credit, and the bottom pie graph shows the same for Foreign Workers. The percentage of Foreign Workers labeled as “bad credit” is 75.90%, which far outweighs the 30.70% of Non-Foreign Workers labeled the same. The stacked bar shows the nearly equal percentage breakdown of total workers across the Foreign and Non-Foreign Worker categories.

We want to prevent the ML model from learning a strong bias against Foreign Workers, either through explicit features or through implicit proxy features in the data. With an additional fairness objective, we guide the ML model to mitigate the bias of lower creditworthiness towards Foreign Workers.

Model performance after tuning for both performance and fairness

This chart depicts the density plot of up to 100 tuning jobs run by SageMaker AMT and their corresponding combined objective metric values. Although we set max jobs to 100, you can change this value as needed. The combined metric was a combination of AUC and DPPL, computed as (3*AUC + (1-DPPL)) / 4. We use (1-DPPL) instead of DPPL because we want to maximize the combined objective while driving DPPL as low as possible (a lower DPPL means lower bias against foreign workers). The plot shows how AMT helps identify the best hyperparameters for the XGBoost model, which returns the highest combined evaluation metric value of 0.68.

Model Performance with Combined Metrics

Below we take a look at the Pareto front chart for the individual metrics of AUC and DPPL. A Pareto front chart visually represents the trade-offs between multiple objectives, in this case the two metric values (AUC and DPPL). Points on the Pareto front are considered equally good: one metric cannot be improved without degrading the other. The chart allows us to see how different jobs performed against the baseline (red circle) in terms of both metrics. It also shows us the best job (green triangle). The positions of the red circle and green triangle are important because they allow us to check whether our combined metric is performing as expected and truly optimizing for both metrics. The code to generate the Pareto front chart is included in the notebook on GitHub.

Pareto Chart

In this scenario, a lower DPPL value is more desirable (less bias), while higher AUC is better (increased performance).
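To make the idea of a Pareto front concrete, here is a minimal sketch that filters a list of (AUC, DPPL) pairs down to the non-dominated ones. It is not the plotting code from the notebook, and the job scores are hypothetical values chosen for illustration.

```python
def pareto_front(points):
    """Return the (auc, dppl) points that no other point dominates.

    A point dominates another if its AUC is at least as high and its DPPL is
    at least as low, with a strict improvement in at least one of the two.
    """
    front = []
    for i, (auc_i, dppl_i) in enumerate(points):
        dominated = False
        for j, (auc_j, dppl_j) in enumerate(points):
            if j == i:
                continue
            no_worse = auc_j >= auc_i and dppl_j <= dppl_i
            strictly_better = auc_j > auc_i or dppl_j < dppl_i
            if no_worse and strictly_better:
                dominated = True
                break
        if not dominated:
            front.append((auc_i, dppl_i))
    # Sort by DPPL so the front reads left to right on a DPPL-vs-AUC chart
    return sorted(front, key=lambda p: p[1])


# Hypothetical (AUC, DPPL) results from five tuning jobs
jobs = [(0.75, 0.70), (0.71, 0.40), (0.69, 0.52), (0.64, 0.28), (0.60, 0.33)]

front = pareto_front(jobs)
print(front)  # [(0.64, 0.28), (0.71, 0.4), (0.75, 0.7)]
```

The two jobs that drop out are each beaten on both metrics by another job, so choosing them would give up AUC or fairness for nothing in return.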

Here, the baseline (red circle) represents the scenario where the objective metric is AUC alone. In other words, the baseline does not consider DPPL at all and optimizes only for AUC (no fine tuning for fairness). We see the baseline has a good AUC score of 0.74, but does not perform well on fairness with a DPPL score of 0.75.

The optimized model (green triangle) represents the best candidate model when fine-tuned for a combined metric with a weight ratio of 3:1 for AUC:DPPL. We see the optimized model has a good AUC score of 0.72 and a much lower DPPL score of 0.43 (lower bias). This tuning job found a model configuration where DPPL is significantly lower than the baseline, without a significant drop in AUC. Models with even lower DPPL scores can be identified by moving further left along the Pareto front from the green triangle. We thus achieved the combined objective of a well-performing model with fairness for the Foreign Worker sub-group.
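As a quick check of these numbers, the sketch below plugs the reported AUC and DPPL values into the combined metric formula. The `combined_metric` helper with adjustable weights is an illustrative assumption, not code from the notebook; the 20:1 weighting at the end is likewise hypothetical and only shows how sensitive the outcome is to the weights.

```python
def combined_metric(auc, dppl, auc_weight=3.0, dppl_weight=1.0):
    """Weighted combination of AUC and (1 - DPPL), normalized by the total weight."""
    return (auc_weight * auc + dppl_weight * (1.0 - dppl)) / (auc_weight + dppl_weight)


# AUC and DPPL values reported in this post
baseline = {"auc": 0.74, "dppl": 0.75}
optimized = {"auc": 0.72, "dppl": 0.43}

base_score = combined_metric(**baseline)
opt_score = combined_metric(**optimized)
print(f"{base_score:.4f}")  # 0.6175
print(f"{opt_score:.4f}")   # 0.6825

# With a hypothetical 20:1 weighting, AUC dominates and the baseline scores higher.
base_heavy = combined_metric(**baseline, auc_weight=20.0)
opt_heavy = combined_metric(**optimized, auc_weight=20.0)
print(base_heavy > opt_heavy)  # True
```

With the 3:1 weighting, the optimized model scores about 0.68, which is consistent with the highest combined metric value reported earlier, while the AUC-only baseline scores about 0.62 on the same scale.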

In the chart below, we can see the results of the predictions from the baseline model and the optimized model. The optimized model, with a combined objective of performance and fairness, predicts a positive outcome for 30.6% of Foreign Workers, as opposed to 13.9% for the baseline model. The optimized model thus reduces the model bias against this sub-group.

Percentage of Accepted Workers


This post shows you how to implement multi-objective optimization with SageMaker Automatic Model Tuning for real-world applications. In many instances, data collected in the real world may be biased against certain subgroups. Multi-objective optimization using automatic model tuning enables customers to easily build ML models that optimize for fairness in addition to accuracy. We demonstrate an example of credit risk prediction and specifically look at fairness for foreign workers. We show that it is possible to maximize another metric like fairness while continuing to train models with high performance. If what you have read has piqued your interest, you can try out the code example hosted on GitHub.

About the authors

Munish Dabra is a Senior Solutions Architect at Amazon Web Services (AWS). His current areas of focus are AI/ML, Data Analytics and Observability. He has a strong background in designing and building scalable distributed systems. He enjoys helping customers innovate and transform their business in AWS. LinkedIn: /mdabra

Hasan Poonawala is a Senior AI/ML Specialist Solutions Architect at AWS. Hasan helps customers design and deploy machine learning applications in production on AWS. He has over 12 years of work experience as a data scientist, machine learning practitioner, and software developer. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Mohammad (Moh) Tahsin is an associate AI/ML Specialist Solutions Architect for AWS. Moh has experience teaching students about responsible AI concepts, and is passionate about conveying these concepts through cloud based architectures. In his spare time he loves to lift weights, play games, and explore nature.

Xingchen Ma is an Applied Scientist at AWS. He works on the service team for SageMaker Automatic Model Tuning.

Rahul Sureka is an Enterprise Solution Architect at AWS based out of India. Rahul has more than 22 years of experience in architecting and leading large business transformation programs across multiple industry segments. His areas of interests are data and analytics, streaming, and AI/ML applications.

Responsible AI: The research collaboration behind new open-source tools offered by Microsoft

Flowchart showing how responsible AI tools are used together for targeted debugging of machine learning models: the Responsible AI Dashboard for the identification of failures; followed by the Responsible AI Dashboard and Mitigations Library for the diagnosis of failures; then the Responsible AI Mitigations Library for mitigating failures; and lastly the Responsible AI Tracker for tracking, comparing, and validating mitigation techniques, from which an arrow points back to the identification phase of the cycle to indicate the repetition of the process as models and data continue to evolve during the ML lifecycle.

As computing and AI advancements spanning decades are enabling incredible opportunities for people and society, they’re also raising questions about responsible development and deployment. For example, the machine learning models powering AI systems may not perform the same for everyone or every condition, potentially leading to harms related to safety, reliability, and fairness. Single metrics often used to represent model capability, such as overall accuracy, do little to demonstrate under which circumstances or for whom failure is more likely; meanwhile, common approaches to addressing failures, like adding more data and compute or increasing model size, don’t get to the root of the problem. Plus, these blanket trial-and-error approaches can be resource intensive and financially costly.

Through its Responsible AI Toolbox, a collection of tools and functionalities designed to help practitioners maximize the benefits of AI systems while mitigating harms, and other efforts for responsible AI, Microsoft offers an alternative: a principled approach to AI development centered around targeted model improvement. Improving models through targeting methods aims to identify solutions tailored to the causes of specific failures. This is a critical part of a model improvement life cycle that not only includes the identification, diagnosis, and mitigation of failures but also the tracking, comparison, and validation of mitigation options. The approach supports practitioners in better addressing failures without introducing new ones or eroding other aspects of model performance.

“With targeted model improvement, we’re trying to encourage a more systematic process for improving machine learning in research and practice,” says Besmira Nushi, a Microsoft Principal Researcher involved with the development of tools for supporting responsible AI. She is a member of the research team behind the toolbox’s newest additions: the Responsible AI Mitigations Library, which enables practitioners to more easily experiment with different techniques for addressing failures, and the Responsible AI Tracker, which uses visualizations to show the effectiveness of the different techniques for more informed decision-making.

Targeted model improvement: From identification to validation

The tools in the Responsible AI Toolbox, available in open source and through the Azure Machine Learning platform offered by Microsoft, have been designed with each stage of the model improvement life cycle in mind, informing targeted model improvement through error analysis, fairness assessment, data exploration, and interpretability.

For example, the new mitigations library bolsters mitigation by offering a means of managing failures that occur in data preprocessing, such as those caused by a lack of data or lower-quality data for a particular subset. For tracking, comparison, and validation, the new tracker brings model, code, visualizations, and other development components together for easy-to-follow documentation of mitigation efforts. The tracker’s main feature is disaggregated model evaluation and comparison, which breaks down model performance by data subset to present a clearer picture of a mitigation’s effects on the intended subset, as well as other subsets, helping to uncover hidden performance declines before models are deployed and used by individuals and organizations. Additionally, the tracker allows practitioners to look at performance for subsets of data across iterations of a model to help practitioners determine the most appropriate model for deployment.

“Data scientists could build many of the functionalities that we offer with these tools; they could build their own infrastructure,” says Nushi. “But to do that for every project requires a lot of effort and time. The benefit of these tools is scale. Here, they can accelerate their work with tools that apply to multiple scenarios, freeing them up to focus on the work of building more reliable, trustworthy models.”

Besmira Nushi, Microsoft Principal Researcher

Building tools for responsible AI that are intuitive, effective, and valuable can help practitioners consider potential harms and their mitigation from the beginning when developing a new model. The result can be more confidence that the work they’re doing is supporting AI that is safer, fairer, and more reliable because it was designed that way, says Nushi. The benefits of using these tools can be far-reaching—from contributing to AI systems that more fairly assess candidates for loans by having comparable accuracy across demographic groups to traffic sign detectors in self-driving cars that can perform better across conditions like sun, snow, and rain.

Converting research into tools for responsible AI

Creating tools that can have the impact researchers like Nushi envision often begins with a research question and involves converting the resulting work into something people and teams can readily and confidently incorporate in their workflows.

“Making that jump from a research paper’s code on GitHub to something that is usable involves a lot more process in terms of understanding what is the interaction that the data scientist would need, what would make them more productive,” says Nushi. “In research, we come up with many ideas. Some of them are too fancy, so fancy that they cannot be used in the real world because they cannot be operationalized.”

Multidisciplinary research teams consisting of user experience researchers, designers, and machine learning and front-end engineers have helped ground the process as have the contributions of those who specialize in all things responsible AI. Microsoft Research works closely with the incubation team of Aether, the advisory body for Microsoft leadership on AI ethics and effects, to create tools based on the research. Equally important has been partnership with product teams whose mission is to operationalize AI responsibly, says Nushi. For Microsoft Research, that is often Azure Machine Learning, the Microsoft platform for end-to-end ML model development. Through this relationship, Azure Machine Learning can offer what Microsoft Principal PM Manager Mehrnoosh Sameki refers to as customer “signals,” essentially a reliable stream of practitioner wants and needs directly from practitioners on the ground. And, Azure Machine Learning is just as excited to leverage what Microsoft Research and Aether have to offer: cutting-edge science. The relationship has been fruitful.

As the current Azure Machine Learning platform made its debut five years ago, it was clear tooling for responsible AI was going to be necessary. In addition to aligning with the Microsoft vision for AI development, customers were seeking out such resources. They approached the Azure Machine Learning team with requests for explainability and interpretability features, robust model validation methods, and fairness assessment tools, recounts Sameki, who leads the Azure Machine Learning team in charge of tooling for responsible AI. Microsoft Research, Aether, and Azure Machine Learning teamed up to integrate tools for responsible AI into the platform, including InterpretML for understanding model behavior, Error Analysis for identifying data subsets for which failures are more likely, and Fairlearn for assessing and mitigating fairness-related issues. InterpretML and Fairlearn are independent community-driven projects that power several Responsible AI Toolbox functionalities.

Before long, Azure Machine Learning approached Microsoft Research with another signal: customers wanted to use the tools together, in one interface. The research team responded with an approach that enabled interoperability, allowing the tools to exchange data and insights and facilitating a seamless ML debugging experience. Over the course of two to three months, the teams met weekly to conceptualize and design “a single pane of glass” from which practitioners could use the tools collectively. As Azure Machine Learning developed the project, Microsoft Research stayed involved, from providing design expertise to contributing to how the story and capabilities of what had become the Responsible AI dashboard would be communicated to customers.

After the release, the teams dived into the next open challenge: enabling practitioners to better mitigate failures. Enter the Responsible AI Mitigations Library and the Responsible AI Tracker, which were developed by Microsoft Research in collaboration with Aether. Microsoft Research was well-equipped with the resources and expertise to figure out the most effective visualizations for doing disaggregated model comparison (there was very little previous work available on it) and navigating the proper abstractions for the complexities of applying different mitigations to different subsets of data with a flexible, easy-to-use interface. Throughout the process, the Azure team provided insight into how the new tools fit into the existing infrastructure.

With the Azure team bringing practitioner needs and the platform to the table and research bringing the latest in model evaluation, responsible testing, and the like, it is the perfect fit, says Sameki.

An open-source approach to tooling for responsible AI

While making these tools available through Azure Machine Learning supports customers in bringing their products and services to market responsibly, making these tools open source is important to cultivating an even larger landscape of responsibly developed AI. When release-ready, these tools for responsible AI are made open source and then integrated into the Azure Machine Learning platform. The reasons for going with an open-source-first approach are numerous, say Nushi and Sameki:

  • freely available tools for responsible AI are an educational resource for learning and teaching the practice of responsible AI;
  • more contributors, both internal to Microsoft and external, add quality, longevity, and excitement to the work and topic; and
  • the ability to integrate them into any platform or infrastructure encourages more widespread use.

The decision also represents one of the Microsoft AI principles in action—transparency.

“In the space of responsible AI, being as open as possible is the way to go, and there are multiple reasons for that,” says Sameki. “The main reason is for building trust with the users and with the consumers of these tools. In my opinion, no one would trust a machine learning evaluation technique or an unfairness mitigation algorithm that is unclear and closed source. Also, this field is very new. Innovating in the open nurtures better collaborations in the field.”

Mehrnoosh Sameki, Microsoft Principal PM Manager

Looking ahead

AI capabilities are only advancing. The larger research community, practitioners, the tech industry, government, and other institutions are working in different ways to steer these advancements in a direction in which AI is contributing value and its potential harms are minimized. Practices for responsible AI will need to continue to evolve with AI advancements to support these efforts.

For Microsoft researchers like Nushi and product managers like Sameki, that means fostering cross-company, multidisciplinary collaborations in their continued development of tools that encourage targeted model improvement guided by the step-by-step process of identification, diagnosis, mitigation, and comparison and validation—wherever those advances lead.

“As we get better in this, I hope we move toward a more systematic process to understand what data is actually useful, even for the large models; what is harmful that really shouldn’t be included in those; and what is the data that has a lot of ethical issues if you include it,” says Nushi. “Building AI responsibly is crosscutting, requiring perspectives and contributions from internal teams and external practitioners. Our growing collection of tools shows that effective collaboration has the potential to impact—for the better—how we create the new generation of AI systems.”

The post Responsible AI: The research collaboration behind new open-source tools offered by Microsoft appeared first on Microsoft Research.

NVIDIA Chief Scientist Inducted Into Silicon Valley’s Engineering Hall of Fame

From scaling mountains in the annual California Death Ride bike challenge to creating a low-cost, open-source ventilator in the early days of the COVID-19 pandemic, NVIDIA Chief Scientist Bill Dally is no stranger to accomplishing near-impossible feats.

On Friday, he achieved another rare milestone: induction into the Silicon Valley Engineering Council’s Hall of Fame.

The aim of the council — a coalition of engineering societies, including the Institute of Electrical and Electronics Engineers, SAE International and the Association for Computing Machinery — is to promote engineering programs and enhance society through science.

Since 1990, its Hall of Fame has honored engineers who have accomplished significant professional achievements while serving their profession and the wider community.

Previous inductees include industry luminaries such as Intel founders Robert Noyce and Gordon Moore, former president of Stanford University and MIPS founder John Hennessy, and Google distinguished engineer and professor emeritus at UC Berkeley David Patterson.

Recognizing ‘an Industry Leader’

In accepting the distinction, Dally said, “I am honored to be inducted into the Silicon Valley Hall of Fame. The work for which I am being recognized is part of a large team effort. Many faculty and students participated in the stream processing research at Stanford, and a very large team at NVIDIA was involved in translating this research into GPU computing. It is a really exciting time to be a computer engineer.”

“The future is bright with a lot more demanding applications waiting to be accelerated using the principles of stream processing and accelerated computing.”

His induction kicked off with a video featuring colleagues and friends, spanning his career across Caltech, MIT, Stanford and NVIDIA.

In the video, NVIDIA founder and CEO Jensen Huang describes Dally as “an extraordinary scientist, engineer, leader and amazing person.”

Fei-Fei Li, professor of computer science at Stanford and co-director of the Stanford Institute for Human-Centered AI, commended Dally’s journey “from an academic scholar and a world-class researcher to an industry leader” who is spearheading one of the “biggest digital revolutions of our time in terms of AI — both software and hardware.”

Following the tribute video, Fred Barez, chair of the Hall of Fame committee and professor of mechanical engineering at San Jose State University, took the stage. He said of Dally: “This year’s inductee has made significant contributions, not just to his profession, but to Silicon Valley and beyond.”

Underpinning the GPU Revolution

As the leader of NVIDIA Research for nearly 15 years, Dally has built a team of more than 300 scientists around the globe, with groups covering a wide range of topics, including AI, graphics, simulation, computer vision, self-driving cars and robotics.

Prior to NVIDIA, Dally advanced the state of the art in engineering at some of the world’s top academic institutions. His development of stream processing at Stanford led directly to GPU computing, and his contributions are responsible for much of the technology used today in high-performance computing networks.

NVIDIA Unveils GPU-Accelerated AI-on-5G System for Edge AI, 5G and Omniverse Digital Twins

Telcos are seeking industry-standard solutions that can run 5G, AI applications and immersive graphics workloads on the same server — including for computer vision and the metaverse.

To meet this need, NVIDIA is developing a new AI-on-5G solution that combines 5G vRAN, edge AI and digital twin workloads on an all-in-one, hyperconverged and GPU-accelerated system.

The lower cost of ownership enabled by such a system would help telcos drive revenue growth in smart cities, as well as the retail, entertainment and manufacturing industries, to support a multitrillion-dollar, 5G-enabled ecosystem.

The AI-on-5G system consists of:

  • Fujitsu’s virtualized 5G Open RAN product suite, which was developed as part of the 5G Open RAN ecosystem experience (OREX) project promoted by NTT DOCOMO. It also includes Fujitsu’s virtualized central unit (vCU) and distributed unit (vDU), plus other virtualized software functions of vRAN from Fujitsu.
  • The NVIDIA Aerial™ software development kit for 5G vRAN; NVIDIA Omniverse for building and operating custom 3D pipelines and large-scale simulations; NVIDIA RTX Virtual Workstation (vWS) software; and NVIDIA CloudXR for streaming extended reality.
  • Hardware includes the NVIDIA A100X and L40 converged accelerators.

OREC has supported performance verification and evaluation tests for this system.

Collaborating With Fujitsu

“Fujitsu is delivering a fully virtualized 5G vRAN together with multi-access edge computing on the same high-performance, energy-efficient, versatile and scalable computing infrastructure,” said Masaki Taniguchi, senior vice president and head of mobile systems at Fujitsu. “This combination, powered by AI and XR applications, enables telcos to deliver ultra-low latency services, highly optimized TCO and energy-efficient performance.”

The announcement is a step toward accomplishing the O-RAN alliance’s goal of enabling software-defined, AI-driven, cloud-native, fully programmable, energy-efficient and commercially ready telco-grade 5G Open RAN solutions. It’s also consistent with OREC’s goal of implementing a widely adopted, high-performance and multi-vendor 5G vRAN for both public and enterprise 5G deployments.

The all-in-one system uses GPUs to accelerate the software-defined 5G vRAN, as well as the edge AI and graphics applications, without bespoke hardware accelerators or a specific telecom CPU. This ensures that the GPUs can accelerate the vRAN (based on NVIDIA Aerial), AI video analytics (based on NVIDIA Metropolis), streaming immersive extended reality (XR) experiences (based on NVIDIA CloudXR) and digital twins (based on NVIDIA Omniverse).

“Telcos and their customers are exploring new ways to boost productivity, efficiency and creativity through immersive experiences delivered over 5G networks,” said Ronnie Vasishta, senior vice president of telecom at NVIDIA. “At Mobile World Congress, we are bringing those visions into reality, showcasing how a single GPU-enabled server can support workloads such as NVIDIA Aerial for 5G, CloudXR for streaming virtual reality and Omniverse for digital twins.”

The AI-on-5G system is part of a growing portfolio of 5G solutions from NVIDIA that are driving transformation in the telecommunications industry. Anchored on the NVIDIA Aerial SDK and A100X converged accelerators — combined with BlueField DPUs and a suite of AI frameworks — NVIDIA provides a high-performance, software-defined, cloud-native, AI-enabled 5G for on-premises and telco operators’ RAN.

Telcos working with NVIDIA can gain access to thousands of software vendors and applications in the ecosystem, which can help address enterprise needs in smart cities and in the retail, manufacturing, industrial and mining sectors.

NVIDIA and Fujitsu will demonstrate the new AI-on-5G system at Mobile World Congress in Barcelona, running Feb. 27-March 2, at hall 4, stand 4E20.

Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR

Accurate prediction of the user intent to interact with a voice assistant (VA) on a device (e.g. a smartphone) is critical for achieving naturalistic, engaging, and privacy-centric interactions with the VA. To this end, we present a novel approach to predict the user intention (whether the user is speaking to the device or not) directly from acoustic and textual information encoded at subword tokens which are obtained via an end-to-end (E2E) ASR model. Modeling directly the subword tokens, compared to modeling of the phonemes and/or full words, has at least two advantages: (i) it provides a…

Apple Machine Learning Research

Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation

The neural transducer is an end-to-end model for automatic speech recognition (ASR). While the model is well-suited for streaming ASR, the training process remains challenging. During training, the memory requirements may quickly exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence lengths. In this work, we analyze the time and space complexity of a typical transducer training setup. We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample. We present optimizations to increase the efficiency and parallelism of the…

Apple Machine Learning Research
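
The general idea of computing loss and gradients sample by sample can be sketched with a toy example. The code below is not the paper's method and does not implement a transducer loss: it assumes a hypothetical linear model with squared error, and the function names are illustrative only. What it shows is that when a batch loss is a sum over samples, the batch gradient can be accumulated one sample at a time, so only one sample's intermediate values need to be held at once.

```python
def sample_loss_and_grad(w, x, y):
    """Squared error of a linear model on ONE sample, plus its gradient w.r.t. w."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    err = pred - y
    loss = err * err
    grad = [2.0 * err * xi for xi in x]
    return loss, grad


def batch_loss_and_grad_samplewise(w, xs, ys):
    """Accumulate the summed batch loss and gradient one sample at a time."""
    total_loss = 0.0
    total_grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        # Only this sample's intermediates exist during the iteration.
        loss, grad = sample_loss_and_grad(w, x, y)
        total_loss += loss
        total_grad = [acc + g for acc, g in zip(total_grad, grad)]
    return total_loss, total_grad


# Hypothetical toy batch of three samples with two features each.
w = [1.0, -1.0]
xs = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0]]
ys = [0.0, 1.0, -2.0]

loss, grad = batch_loss_and_grad_samplewise(w, xs, ys)
print(loss, grad)
```

The accumulated result equals what a whole-batch computation of the same summed loss would give; the trade-off is less parallelism per step in exchange for a smaller peak memory footprint.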