Join the TensorFlow Special Interest Groups (SIGs)

Posted by Joana Carrasqueira, TensorFlow Program Manager and Thea Lamkin, Open Source Program Manager, in collaboration with TensorFlow SIG Leads.

TensorFlow SIGs (Special Interest Groups) organize community contributions to key parts of the TensorFlow ecosystem, and enable community members to contribute and maintain new features in important areas.

SIG leads and members work together to build and support important TensorFlow use cases, and are a vital part of our open source community. It all started with SIG Build, and we now have 13 active SIGs, with more on the way.

In this article, you’ll learn about the SIGs that exist today, and how you can get involved. Many SIGs are led by members of the open source community, from industry collaborators to Machine Learning Google Developer Experts (ML GDEs). TensorFlow’s success is due in large part to the hard work and contributions of our vibrant community. We welcome contributors to join the SIGs working on the parts of TensorFlow’s ecosystem they are most excited to collaborate on. Here is an overview of the SIGs and their areas of focus, contributed by their leads:

SIG Addons

In a fast-moving field like machine learning, many new developments cannot be integrated into core TensorFlow. SIG Addons was created to tackle this problem by maintaining a repository of bleeding-edge contributions that conform to well-established API patterns but implement new functionality not available in core TensorFlow. It also adopted parts of tf.contrib.

To contribute to TensorFlow Addons, join the conversation at our monthly meeting.

SIG Build

Started as a forum for development topics like new architecture support and packaging improvements, SIG Build grew into a discussion center dedicated to building, testing, packaging, and distributing TensorFlow, bridging internal and external TensorFlow development. The goal of this group is to ensure TensorFlow is a good citizen in the wider OSS ecosystem (Python, C++, Linux, Windows, macOS).

To contribute to TensorFlow Build, join the conversation at our monthly meeting.

SIG IO

SIG IO maintains a repository of dataset, streaming, and file system extensions for TensorFlow. Recent accomplishments include the release of v0.13.0 (with TF 2.2), a new Visual Studio Code tutorial, and support for the AVIF image file format.

To contribute to TensorFlow IO, join the conversation at our monthly meeting.

SIG JVM

SIG JVM provides comprehensive support for building, training, and serving TensorFlow models on top of the Java Virtual Machine (JVM). This group focuses on Java but also includes other popular JVM languages, like Kotlin and Scala. Recent accomplishments include adding n-dimensional data access in native memory and creating a high-level API similar to Keras for building models.

To contribute to TensorFlow JVM, join the conversation at our monthly meeting.

SIG Keras

This group focuses on the care and feeding of the tf.keras API (new features, docs, guides), Keras Tuner, AutoKeras, and Keras applications.

To contribute to TensorFlow Keras, join the conversation at our bi-monthly meeting.

SIG Micro

SIG Micro is a discussion and collaboration group around running TensorFlow models on microcontrollers, DSPs, and other highly resource-constrained embedded devices.

To contribute to TensorFlow Micro, join the conversation at our monthly meeting.

SIG MLIR

The goal of this group is to foster an open discussion on high performance compilers and how optimization techniques can be applied to TensorFlow graphs. Ultimately this project aims to create a common intermediate representation that reduces the cost of new hardware and improves usability for existing TensorFlow users.

To contribute to TensorFlow MLIR, join the conversation at our monthly meeting.

SIG Networking

SIG Networking aims to add support for different network fabrics and protocols. The group evaluates proposals and designs in this area and maintains code in the tensorflow/networking repository. Join us if you are interested in improving TensorFlow on different types of networks or underlying drivers and libraries!

To contribute to TensorFlow Networking, join the conversation at our monthly meeting.

SIG Recommenders (New!)

SIG Recommenders was created to drive discussion and collaborations around using TensorFlow for large scale recommendation systems (Recommenders). We hope to encourage sharing of best practices in the industry, get consensus and product feedback to help evolve TensorFlow support for recommenders, and facilitate the contributions of RFCs and PRs in this domain.

To contribute to TensorFlow Recommenders, join the mailing list to get updates about our upcoming meetings.

SIG Rust

SIG Rust was created for users and contributors on the TensorFlow Rust binding project. It provides stable support for running models created in other languages, and can both train and evaluate.

To contribute to TensorFlow Rust, join the conversation at our monthly meeting.

SIG Swift

The purpose of SIG Swift is to host design reviews, discuss upcoming API changes, share project roadmap, and encourage collaboration in the Swift for TensorFlow (S4TF) open-source community.

To contribute to TensorFlow Swift, join the conversation at our monthly meeting.

SIG TensorBoard

SIG TensorBoard was created for discussion and collaboration around TensorBoard, the visualization tool for TensorFlow. The goal of this group is to engage the TensorBoard user and developer community and get feedback; encourage development of new TensorBoard plugins; promote collaborative ML via TensorBoard.dev; and encourage community improvements to TensorBoard.

To contribute to TensorFlow TensorBoard, join the conversation at our monthly meeting.

SIG TF.js (New!)

SIG TF.js was created to facilitate community-contributed components to tensorflow/tfjs (and potential community-maintained libraries). The core TensorFlow.js engineering team has been working on building the infrastructure and tooling to enable ML to run in JavaScript-powered applications, and has an active contributor community of individual developers, GDEs, and enterprise users. We want to accelerate community involvement in the project to help it continue to meet user needs and to drive new directions for the project.

To contribute to TensorFlow TF.js, join the conversation at our monthly meeting.

Thank you to our SIG Leads for their work and leadership:

Picture: 1st TensorFlow Contributor Summit, Santa Clara, 2019.

Sean Morgan, Tzu-Wei Sung | SIG Addons

Jason Zaman, Austin Anderson | SIG Build

Yong Tang, Anthony Dmitriev, Derek Murray | SIG IO

Karl Lessard, Adam Pocock, Rajagopal Ananthanarayanan | SIG JVM

Francois Chollet | SIG Keras

Neil Tan, Pete Warden | SIG Micro

Tatiana Shpeisman, Pankaj Kanwar | SIG MLIR

Bairen Yi, Jeroen Bedorf | SIG Networking

Bo Liu, Haidong Rong, Yong Li, Wei Wei | SIG Recommenders

Adam Crume | SIG Rust

Ewa Matejska | SIG Swift

Mani Varadarajan, Gal Oshri | SIG TensorBoard

Sandeep Gupta, Ping Yu | SIG TF.js

Sparkles in the Rough: NVIDIA’s Video Gems from a Hardscrabble 2020

Much of 2020 may look best in the rearview mirror, but the year also held many moments of outstanding work, gems worth hitting the rewind button to see again.

So, here’s a countdown — roughly in order of ascending popularity — of 10 favorite NVIDIA videos that hit YouTube in 2020. With two exceptions for videos that deserve a wide audience, all got at least 200,000 views and most, but not all, can be found on the NVIDIA YouTube channel.

#10 Coronavirus Gets a Close-Up

The pandemic was clearly the story of the year.

We celebrated the work of many healthcare providers and researchers pushing science forward to combat it, including the team that won a prestigious Gordon Bell award for using high performance computing and AI to see how the coronavirus works, something they explained in detail in their own video here.

In another one of the many responses to COVID-19, the Folding@Home project received donations of time on more than 200,000 NVIDIA GPUs to study the coronavirus. Using NVIDIA Omniverse, we created a visualization (described below) of data they amassed on their virtual exascale computer.

#9 Cruising into a Ray-Traced Future

Despite the challenging times, many companies continued to deliver top-notch work. For example, Autodesk VRED 2021 showed the shape of things to come in automotive design.

The demo below displays the power of ray tracing and AI to deliver realistic 3D visualizations in real time using RTX technology, snagging nearly a quarter million views. (Note: There’s no audio on this one, just amazing images.)

#8 A Test Drive in the Latest Mercedes

Just for fun — yes, even 2020 included fun — we look back at NVIDIA CEO Jensen Huang taking a spin in the latest Mercedes-Benz S-Class as part of the world premiere of the flagship sedan. He shared the honors with Grammy award-winning Alicia Keys and Formula One champ Lewis Hamilton.

The S-Class uses AI to deliver intelligent features like a voice assistant personalized for each driver. An engineer and a car enthusiast at heart, Huang gave kudos to the work of hundreds of engineers who delivered a vehicle that, with over-the-air software updates, will get better and better.

#7 Playing Marbles After Dark

The NVIDIA Omniverse team pointed the way to a future of photorealistic games and simulations rendered in real time. They showed how a distributed team of engineers and artists can integrate multiple tools to play more than a million polygons smoothly with ray-traced lighting at 1440p on a single GeForce RTX 3090.

The mesmerizing video captured the eyeballs of nearly half a million viewers.

#6 An AI Platform for the Rest of Us

Great things sometimes come in small packages. In October, we debuted the DGX Station A100, a supercomputer that plugs into a standard wall socket to let data scientists do world-class work in AI. More than 400,000 folks tuned in.

#5 Seeing Virtual Meetings Through a New AI

With online gatherings the new norm, NVIDIA Maxine attracted a lot of eyeballs. More than 800,000 viewers tuned into this demo of how we’re using generative adversarial networks to lower the bandwidth and turn up the quality of video conferencing.

#4 What’s Jensen Been Cooking?

Our most energy-efficient video of 2020 was a bit of a tease. It lasted less than 30 seconds, but Jensen Huang’s preview of the first NVIDIA Ampere architecture GPU drew nearly a million viewers.

#3 Voila, Jensen Whips Up the First Kitchen Keynote

In the days of the Great Depression, vacuum tubes flickered with fireside chats. The 2020 pandemic spawned a slew of digital events with GTC among the first of them.

In May, Jensen recorded in his California home the first kitchen keynote. In a playlist of nine virtual courses, he served a smorgasbord where the NVIDIA A100 GPU was an entrée surrounded by software side dishes that included frameworks for conversational AI (Jarvis) and recommendation systems (Merlin). The first chapter alone attracted more than 300,000 views.

And we did it all again in October when we featured the first DPU, its DOCA software and a framework to accelerate drug discovery.

#2 Delivering Enterprise AI in a Box

The DGX A100 emerged as one of the favorite dishes from our May kitchen keynote. The 5-petaflops system packs AI training, inference and analytics for any data center.

Some 1.3 million viewers clicked to get a virtual tour of the eight A100 GPUs and 200 Gbit/second InfiniBand links inside it.

#1 Enough of All This Hard Work, Let’s Have Fun!

By September it was high time to break away from a porcupine of a year. With the GeForce RTX 30 Series GPUs, we rolled out engines to create lush new worlds for those whose go-to escape is gaming.

The launch video, viewed more than 1.5 million times, begins with a brief tour of the history of computer games. Good days remembered, good days to come.

For Dessert: Two Bytes of Chocolate

We’ll end 2020, happily, with two special mentions.

Our most watched video of the year was a blistering five-minute clip of game play on DOOM Eternal running all out on a GeForce RTX 3080 in 4K.

And perhaps our sweetest feel good moment of 2020 was delivered by an NVIDIA engineer, Bryce Denney, who hacked a way to let choirs sing together safely in the pandemic. Play it again, Bryce!

 

The post Sparkles in the Rough: NVIDIA’s Video Gems from a Hardscrabble 2020 appeared first on The Official NVIDIA Blog.

Controlling and auditing data exploration activities with Amazon SageMaker Studio and AWS Lake Formation

Highly-regulated industries, such as financial services, are often required to audit all access to their data. This includes auditing exploratory activities performed by data scientists, who usually query data from within machine learning (ML) notebooks.

This post walks you through the steps to implement access control and auditing capabilities on a per-user basis, using Amazon SageMaker Studio notebooks and AWS Lake Formation access control policies. This is a how-to guide based on the Machine Learning Lens for the AWS Well-Architected Framework, following the design principles described in the Security Pillar:

  • Restrict access to ML systems
  • Ensure data governance
  • Enforce data lineage
  • Enforce regulatory compliance

Additional ML governance practices for experiments and models using Amazon SageMaker are described in the whitepaper Machine Learning Best Practices in Financial Services.

Overview of solution

This implementation uses Amazon Athena and the PyAthena client on a Studio notebook to query data on a data lake registered with Lake Formation.

SageMaker Studio is the first fully integrated development environment (IDE) for ML. Studio provides a single, web-based visual interface where you can perform all the steps required to build, train, and deploy ML models. Studio notebooks are collaborative notebooks that you can launch quickly, without setting up compute instances or file storage beforehand.

Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run.

Lake Formation is a fully managed service that makes it easier for you to build, secure, and manage data lakes. Lake Formation simplifies and automates many of the complex manual steps that are usually required to create data lakes, including securely making that data available for analytics and ML.

For an existing data lake registered with Lake Formation, the following diagram illustrates the proposed implementation.

The workflow includes the following steps:

  1. Data scientists access the AWS Management Console using their AWS Identity and Access Management (IAM) user accounts and open Studio using individual user profiles. Each user profile has an associated execution role, which the user assumes while working on a Studio notebook. The diagram depicts two data scientists that require different permissions over data in the data lake. For example, in a data lake containing personally identifiable information (PII), user Data Scientist 1 has full access to every table in the Data Catalog, whereas Data Scientist 2 has limited access to a subset of tables (or columns) containing non-PII data.
  2. The Studio notebook is associated with a Python kernel. The PyAthena client allows you to run exploratory ANSI SQL queries on the data lake through Athena, using the execution role assumed by the user while working with Studio.
  3. Athena sends a data access request to Lake Formation, with the user profile execution role as principal. Data permissions in Lake Formation offer database-, table-, and column-level access control, restricting access to metadata and the corresponding data stored in Amazon S3. Lake Formation generates short-term credentials to be used for data access, and tells Athena which columns the principal is allowed to access.
  4. Athena uses the short-term credentials provided by Lake Formation to access the data lake storage in Amazon S3, and retrieves the data matching the SQL query. Before returning the query result, Athena filters out columns that aren’t included in the data permissions reported by Lake Formation.
  5. Athena returns the SQL query result to the Studio notebook.
  6. Lake Formation records data access requests and other activity history for the registered data lake locations. AWS CloudTrail also records these and other API calls made to AWS during the entire flow, including Athena query requests.
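Step 2 of the workflow can be sketched in Python as follows. This is a minimal illustration rather than the notebook code used in this post: the helper names `build_select` and `run_query` are hypothetical, and the staging location and Region you pass in are your own. The `connect`, `cursor`, `execute`, and `fetchall` calls come from PyAthena, which is assumed to be installed in the notebook kernel.

```python
def build_select(database, table, columns=None, limit=10):
    """Build a simple exploratory SELECT statement for Athena."""
    column_list = ", ".join(columns) if columns else "*"
    return f"SELECT {column_list} FROM {database}.{table} LIMIT {int(limit)}"


def run_query(sql, s3_staging_dir, region_name):
    """Run a query through Athena with the credentials of the current role.

    In a Studio notebook, those credentials belong to the user profile's
    execution role, so Lake Formation evaluates permissions for that role.
    """
    # Imported lazily so build_select() can be used without PyAthena installed.
    from pyathena import connect

    cursor = connect(s3_staging_dir=s3_staging_dir,
                     region_name=region_name).cursor()
    cursor.execute(sql)
    return cursor.fetchall()


# Default database and table names from the CloudFormation template.
sql = build_select("amazon_reviews_db", "amazon_reviews_parquet",
                   ["product_id", "star_rating"], limit=5)
print(sql)
```

Calling `run_query(sql, "s3://<your-query-results-bucket>/", "<your-region>")` from the notebook would then return the rows that the execution role is allowed to see.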

Walkthrough overview

In this walkthrough, I show you how to implement access control and audit using a Studio notebook and Lake Formation. You perform the following activities:

  1. Register a new database in Lake Formation.
  2. Create the required IAM policies, roles, group, and users.
  3. Grant data permissions with Lake Formation.
  4. Set up Studio.
  5. Test Lake Formation access control policies using a Studio notebook.
  6. Audit data access activity with Lake Formation and CloudTrail.

If you prefer to skip the initial setup activities and jump directly to testing and auditing, you can deploy the following AWS CloudFormation template in a Region that supports Studio and Lake Formation:

You can also deploy the template by downloading the CloudFormation template. When deploying the CloudFormation template, you provide the following parameters:

  • User name and password for a data scientist with full access to the dataset. The default user name is data-scientist-full.
  • User name and password for a data scientist with limited access to the dataset. The default user name is data-scientist-limited.
  • Names for the database and table to be created for the dataset. The default names are amazon_reviews_db and amazon_reviews_parquet, respectively.
  • VPC and subnets that are used by Studio to communicate with the Amazon Elastic File System (Amazon EFS) volume associated to Studio.

If you decide to deploy the CloudFormation template, after the CloudFormation stack is complete, you can go directly to the section Testing Lake Formation access control policies in this post.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • A data lake set up in Lake Formation with a Lake Formation Admin. For general guidance on how to set up Lake Formation, see Getting started with AWS Lake Formation.
  • Basic knowledge of creating IAM policies, roles, users, and groups.

Registering a new database in Lake Formation

For this post, I use the Amazon Customer Reviews Dataset to demonstrate how to provide granular access to the data lake for different data scientists. If you already have a dataset registered with Lake Formation that you want to use, you can skip this section and go to Creating required IAM roles and users for data scientists.

To register the Amazon Customer Reviews Dataset in Lake Formation, complete the following steps:

  1. Sign in to the console with the IAM user configured as Lake Formation Admin.
  2. On the Lake Formation console, in the navigation pane, under Data catalog, choose Databases.
  3. Choose Create Database.
  4. In Database details, select Database to create the database in your own account.
  5. For Name, enter a name for the database, such as amazon_reviews_db.
  6. For Location, enter s3://amazon-reviews-pds.
  7. Under Default permissions for newly created tables, make sure to clear the option Use only IAM access control for new tables in this database.

  8. Choose Create database.

The Amazon Customer Reviews Dataset is currently available in TSV and Parquet formats. The Parquet dataset is partitioned on Amazon S3 by product_category. To create a table in the data lake for the Parquet dataset, you can use an AWS Glue crawler or manually create the table using Athena, as described in Amazon Customer Reviews Dataset README file.

  9. On the Athena console, create the table.

If you haven’t specified a query result location before, follow the instructions in Specifying a Query Result Location.

  10. Choose the data source AwsDataCatalog.
  11. Choose the database created in the previous step.
  12. In the Query Editor, enter the following query:
    CREATE EXTERNAL TABLE amazon_reviews_parquet(
      marketplace string, 
      customer_id string, 
      review_id string, 
      product_id string, 
      product_parent string, 
      product_title string, 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine string, 
      verified_purchase string, 
      review_headline string, 
      review_body string, 
      review_date bigint, 
      year int)
    PARTITIONED BY (product_category string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://amazon-reviews-pds/parquet/'

  13. Choose Run query.

You should receive a Query successful response when the table is created.

  14. Enter the following query to load the table partitions:
    MSCK REPAIR TABLE amazon_reviews_parquet

  15. Choose Run query.
  16. On the Lake Formation console, in the navigation pane, under Data catalog, choose Tables.
  17. For Table name, enter a table name.
  18. Verify that you can see the table details.

  19. Scroll down to see the table schema and partitions.

Finally, you register the database location with Lake Formation so the service can start enforcing data permissions on the database.

  20. On the Lake Formation console, in the navigation pane, under Register and ingest, choose Data lake locations.
  21. On the Data lake locations page, choose Register location.
  22. For Amazon S3 path, enter s3://amazon-reviews-pds/.
  23. For IAM role, you can keep the default role.
  24. Choose Register location.
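If you prefer scripting to the console, the database creation and location registration above could be approximated with boto3. Treat this as a sketch under assumptions: the helper names are hypothetical, passing an empty `CreateTableDefaultPermissions` list is my understanding of how the API expresses clearing the Use only IAM access control option, and `UseServiceLinkedRole=True` is assumed to correspond to keeping the default role. The clients are passed in so the request-building logic can be inspected without AWS access.

```python
def database_input(name, location_uri):
    """Build a Glue DatabaseInput for a database governed by Lake Formation."""
    return {
        "Name": name,
        "LocationUri": location_uri,
        # Assumption: an empty list means new tables do not get the default
        # IAM-only permissions, so Lake Formation permissions apply.
        "CreateTableDefaultPermissions": [],
    }


def register_location_request(bucket):
    """Build the arguments for registering an S3 location with Lake Formation."""
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket}",
        # Assumption: the console's default role is the service-linked role.
        "UseServiceLinkedRole": True,
    }


def create_and_register(glue_client, lakeformation_client, name, bucket):
    """Create the database, then register its S3 location."""
    glue_client.create_database(
        DatabaseInput=database_input(name, f"s3://{bucket}"))
    lakeformation_client.register_resource(**register_location_request(bucket))


print(database_input("amazon_reviews_db", "s3://amazon-reviews-pds"))
```

With credentials for the Lake Formation Admin, you would call `create_and_register(boto3.client("glue"), boto3.client("lakeformation"), "amazon_reviews_db", "amazon-reviews-pds")`.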

Creating required IAM roles and users for data scientists

To demonstrate how you can provide differentiated access to the dataset registered in the previous step, you first need to create IAM policies, roles, a group, and users. The following diagram illustrates the resources you configure in this section.

In this section, you complete the following high-level steps:

  1. Create an IAM group named DataScientists containing two users: data-scientist-full and data-scientist-limited, to control their access to the console and to Studio.
  2. Create a managed policy named DataScientistGroupPolicy and assign it to the group.

The policy allows users in the group to access Studio, but only using a SageMaker user profile that matches their IAM user name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only.

  3. For each IAM user, create individual IAM roles, which are used as user profile execution roles in Studio later.

The naming convention for these roles consists of a common prefix followed by the corresponding IAM user name. This allows you to audit activities on Studio notebooks—which are logged using Studio’s execution roles—and trace them back to the individual IAM users who performed the activities. For this post, I use the prefix SageMakerStudioExecutionRole_.

  4. Create a managed policy named SageMakerUserProfileExecutionPolicy and assign it to each of the IAM roles.

The policy establishes coarse-grained access permissions to the data lake.

Follow the remainder of this section to create the IAM resources described. The permissions configured in this section grant common, coarse-grained access to data lake resources for all the IAM roles. In a later section, you use Lake Formation to establish fine-grained access permissions to Data Catalog resources and Amazon S3 locations for individual roles.
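The role naming convention described in this section is what makes audit entries traceable to a person. A minimal sketch of the mapping in both directions, using the prefix from this post (the function names are hypothetical):

```python
ROLE_PREFIX = "SageMakerStudioExecutionRole_"


def execution_role_name(iam_user_name):
    """Return the execution role name for a given IAM user name."""
    return ROLE_PREFIX + iam_user_name


def iam_user_from_role(role_name):
    """Recover the IAM user name from an execution role name seen in a log."""
    if not role_name.startswith(ROLE_PREFIX):
        raise ValueError(f"{role_name!r} does not follow the naming convention")
    return role_name[len(ROLE_PREFIX):]


for user in ("data-scientist-full", "data-scientist-limited"):
    role = execution_role_name(user)
    print(role, "->", iam_user_from_role(role))
```

When you review audit records later, applying `iam_user_from_role` to the role name in an entry gives you the IAM user who ran the notebook.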

Creating the required IAM group and users

To create your group and users, complete the following steps:

  1. Sign in to the console using an IAM user with permissions to create groups, users, roles, and policies.
  2. On the IAM console, use the JSON tab to create a new IAM managed policy named DataScientistGroupPolicy.
    1. Use the following JSON policy document to provide permissions, providing your AWS Region and AWS account ID:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": [
                      "sagemaker:DescribeDomain",
                      "sagemaker:ListDomains",
                      "sagemaker:ListUserProfiles",
                      "sagemaker:ListApps"
                  ],
                  "Resource": "*",
                  "Effect": "Allow"
              },
              {
                  "Action": [
                      "sagemaker:CreatePresignedDomainUrl",
                      "sagemaker:DescribeUserProfile"
                  ],
                  "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:user-profile/*/${aws:username}",
                  "Effect": "Allow"
              },
              {
                  "Action": [
                      "sagemaker:CreatePresignedDomainUrl",
                      "sagemaker:DescribeUserProfile"
                  ],
                  "Effect": "Deny",
                  "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:user-profile/*/${aws:username}"
              },
              {
                  "Action": "sagemaker:*App",
                  "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/${aws:username}/*",
                  "Effect": "Allow"
              },
              {
                  "Action": "sagemaker:*App",
                  "Effect": "Deny",
                  "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/${aws:username}/*"
              },
              {
                  "Action": [
                      "sagemaker:CreatePresignedNotebookInstanceUrl",
                      "sagemaker:*NotebookInstance",
                      "sagemaker:*NotebookInstanceLifecycleConfig",
                      "sagemaker:CreateUserProfile",
                      "sagemaker:DeleteDomain",
                      "sagemaker:DeleteUserProfile"
                  ],
                  "Resource": "*",
                  "Effect": "Deny"
              }
          ]
      }

This policy forces an IAM user to open Studio using a SageMaker user profile with the same name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only.
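To see why the `${aws:username}` variable ties each IAM user to a single user profile, the following toy model substitutes the user name into the resource pattern and checks a profile ARN against it. It is only an approximation of IAM evaluation: `fnmatchcase` stands in for IAM's wildcard matching, and the Region, account ID, and domain ID are placeholder values I chose for illustration.

```python
from fnmatch import fnmatchcase

# Same shape as the Resource / NotResource pattern in the policy above.
PROFILE_PATTERN = "arn:aws:sagemaker:{region}:{account}:user-profile/*/{username}"


def can_open_profile(iam_user_name, profile_arn, region, account):
    """Approximate the Allow/Deny pair on CreatePresignedDomainUrl.

    The Allow statement matches only ARNs that end in the caller's own user
    name; the Deny statement (NotResource) covers every other profile.
    """
    pattern = PROFILE_PATTERN.format(
        region=region, account=account, username=iam_user_name)
    return fnmatchcase(profile_arn, pattern)


# Placeholder Region, account ID, and domain ID.
own_profile = ("arn:aws:sagemaker:us-east-1:111122223333:"
               "user-profile/d-example/data-scientist-full")
print(can_open_profile("data-scientist-full", own_profile,
                       "us-east-1", "111122223333"))
print(can_open_profile("data-scientist-limited", own_profile,
                       "us-east-1", "111122223333"))
```

The first call reports that the user may open the profile and the second that the user may not, which mirrors the behavior the policy is designed to enforce.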

  3. Create an IAM group.
    1. For Group name, enter DataScientists.
    2. Search for and attach the AWS managed policy named DataScientist and the IAM policy created in the previous step.
  4. Create two IAM users named data-scientist-full and data-scientist-limited.

Alternatively, you can provide names of your choice, as long as they’re a combination of lowercase letters, numbers, and hyphen (-). Later, you also give these names to their corresponding SageMaker user profiles, which at the time of writing only support those characters.
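If you choose your own names, a quick check against the character rule stated above can catch problems before you create the users. This sketch encodes only the rule as described in this post (lowercase letters, numbers, and hyphens); it does not enforce any length limits or other constraints the service may apply.

```python
import re

# Lowercase letters, digits, and hyphens only, as described above.
_NAME_RULE = re.compile(r"[a-z0-9-]+")


def is_valid_user_name(name):
    """Return True if the name uses only the characters allowed above."""
    return _NAME_RULE.fullmatch(name) is not None


for candidate in ("data-scientist-full", "Data_Scientist_1"):
    print(candidate, is_valid_user_name(candidate))
```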

Creating the required IAM roles

To create your roles, complete the following steps:

  1. On the IAM console, create a new managed policy named SageMakerUserProfileExecutionPolicy.
    1. Use the following policy code:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": [
                      "lakeformation:GetDataAccess",
                      "glue:GetTable",
                      "glue:GetTables",
                      "glue:SearchTables",
                      "glue:GetDatabase",
                      "glue:GetDatabases",
                      "glue:GetPartitions"
                  ],
                  "Resource": "*",
                  "Effect": "Allow"
              },
              {
                  "Action": "sts:AssumeRole",
                  "Resource": "*",
                  "Effect": "Deny"
              }
          ]
      }

This policy provides common coarse-grained IAM permissions to the data lake, leaving Lake Formation permissions to control access to Data Catalog resources and Amazon S3 locations for individual users and roles. This is the recommended method for granting access to data in Lake Formation. For more information, see Methods for Fine-Grained Access Control.

  2. Create an IAM role for the first data scientist (data-scientist-full), which is used as the corresponding user profile’s execution role.
    1. On the Attach permissions policy page, search for and attach the AWS managed policy AmazonSageMakerFullAccess.
    2. For Role name, use the naming convention introduced at the beginning of this section to name the role SageMakerStudioExecutionRole_data-scientist-full.
  3. To add the remaining policies, on the Roles page, choose the role name you just created.
  4. Under Permissions, choose Attach policies.
  5. Search for and select the SageMakerUserProfileExecutionPolicy and AmazonAthenaFullAccess policies.
  6. Choose Attach policy.
  7. To restrict the Studio resources that can be created within Studio (such as image, kernel, or instance type) to only those belonging to the user profile associated with the first IAM role, embed an inline policy in the IAM role.
    1. Use the following JSON policy document to scope down permissions for the user profile, providing the Region, account ID, and IAM user name associated with the first data scientist (data-scientist-full). You can name the inline policy DataScientist1IAMRoleInlinePolicy.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": "sagemaker:*App",
                  "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/<IAMUSERNAME>/*",
                  "Effect": "Allow"
              },
              {
                  "Action": "sagemaker:*App",
                  "Effect": "Deny",
                  "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/<IAMUSERNAME>/*"
              }
          ]
      }

  8. Repeat the previous steps to create an IAM role for the second data scientist (data-scientist-limited).
    1. Name the role SageMakerStudioExecutionRole_data-scientist-limited and the second inline policy DataScientist2IAMRoleInlinePolicy.

Granting data permissions with Lake Formation

Before data scientists are able to work on a Studio notebook, you grant the individual execution roles created in the previous section access to the Amazon Customer Reviews Dataset (or your own dataset). For this post, we implement different data permission policies for each data scientist to demonstrate how to grant granular access using Lake Formation.

  1. Sign in to the console with the IAM user configured as Lake Formation Admin.
  2. On the Lake Formation console, in the navigation pane, choose Tables.
  3. On the Tables page, select the table you created earlier, such as amazon_reviews_parquet.
  4. On the Actions menu, under Permissions, choose Grant.
  5. Provide the following information to grant full access to the Amazon Customer Reviews Dataset table for the first data scientist:
    1. Select My account.
    2. For IAM users and roles, choose the execution role associated to the first data scientist, such as SageMakerStudioExecutionRole_data-scientist-full.
    3. For Table permissions and Grantable permissions, select Select.
    4. Choose Grant.
  6. Repeat the previous step to grant limited access to the dataset for the second data scientist, providing the following information:
    1. Select My account.
    2. For IAM users and roles, choose the execution role associated to the second data scientist, such as SageMakerStudioExecutionRole_data-scientist-limited.
    3. For Columns, choose Include columns.
    4. Choose a subset of columns, such as: product_category, product_id, product_parent, product_title, star_rating, review_headline, review_body, and review_date.
    5. For Table permissions and Grantable permissions, select Select.
    6. Choose Grant.
  7. To verify the data permissions you have granted, on the Lake Formation console, in the navigation pane, choose Tables.
  8. On the Tables page, select the table you created earlier, such as amazon_reviews_parquet.
  9. On the Actions menu, under Permissions, choose View permissions to open the Data permissions menu.

You see a list of permissions granted for the table, including the permissions you just granted and permissions for the Lake Formation Admin.

If you see the principal IAMAllowedPrincipals listed on the Data permissions menu for the table, you must remove it. Select the principal and choose Revoke. On the Revoke permissions page, choose Revoke.
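
The same two grants can also be expressed programmatically. The sketch below only builds the request parameters for the Lake Formation GrantPermissions API (exposed in boto3 as `grant_permissions`); it does not call AWS, so it runs without credentials. The helper name `build_select_grant`, the account ID, and the database name `<DATABASE>` are placeholders assumed for illustration.

```python
def build_select_grant(role_arn, database, table, columns=None):
    """Build keyword arguments for a Lake Formation SELECT grant on one table."""
    if columns:
        # Column-level grant: only the listed columns are visible to the principal.
        resource = {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        }
    else:
        # Table-level grant: all columns are visible.
        resource = {"Table": {"DatabaseName": database, "Name": table}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": resource,
        "Permissions": ["SELECT"],
        "PermissionsWithGrantOption": ["SELECT"],
    }


ROLE_ARN = "arn:aws:iam::123456789012:role/SageMakerStudioExecutionRole_{}"  # placeholder account ID
LIMITED_COLUMNS = [
    "product_category", "product_id", "product_parent", "product_title",
    "star_rating", "review_headline", "review_body", "review_date",
]

full_grant = build_select_grant(
    ROLE_ARN.format("data-scientist-full"), "<DATABASE>", "amazon_reviews_parquet"
)
limited_grant = build_select_grant(
    ROLE_ARN.format("data-scientist-limited"), "<DATABASE>", "amazon_reviews_parquet",
    columns=LIMITED_COLUMNS,
)

# With credentials for the Lake Formation Admin, each request could be sent with:
#   boto3.client("lakeformation").grant_permissions(**full_grant)
print(limited_grant["Resource"])
```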

Setting up SageMaker Studio

You now onboard to Studio and create two user profiles, one for each data scientist.

When you onboard to Studio using IAM authentication, Studio creates a domain for your account. A domain consists of a list of authorized users, configuration settings, and an Amazon EFS volume, which contains data for the users, including notebooks, resources, and artifacts.

Each user receives a private home directory within Amazon EFS for notebooks, Git repositories, and data files. All traffic between the domain and the Amazon EFS volume is communicated through specified subnet IDs. By default, all other traffic goes over the internet through a SageMaker system Amazon Virtual Private Cloud (Amazon VPC).

Alternatively, instead of using the default SageMaker internet access, you could secure how Studio accesses resources by assigning a private VPC to the domain. This is beyond the scope of this post, but you can find additional details in Securing Amazon SageMaker Studio connectivity using a private VPC.

If you already have a Studio domain running, you can skip the onboarding process and follow the steps to create the SageMaker user profiles.

Onboarding to Studio

To onboard to Studio, complete the following steps:

  1. Sign in to the console using an IAM user with service administrator permissions for SageMaker.
  2. On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
  3. On the Studio menu, under Get started, choose Standard setup.
  4. For Authentication method, choose AWS Identity and Access Management (IAM).
  5. Under Permission, for Execution role for all users, choose an option from the role selector.

You’re not using this execution role for the SageMaker user profiles that you create later. If you choose Create a new role, the Create an IAM role dialog opens.

  1. For S3 buckets you specify, choose None.
  2. Choose Create role.

SageMaker creates a new IAM role named AmazonSageMaker-ExecutionPolicy role with the AmazonSageMakerFullAccess policy attached.

  1. Under Network and storage, for VPC, choose the private VPC that is used for communication with the Amazon EFS volume.
  2. For Subnet(s), choose multiple subnets in the VPC from different Availability Zones.
  3. Choose Submit.
  4. On the Studio Control Panel, under Studio Summary, wait for the status to change to Ready and the Add user button to be enabled.

Creating the SageMaker user profiles

To create your SageMaker user profiles, complete the following steps:

  1. On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
  2. On the Studio Control Panel, choose Add user.
  3. For User name, enter data-scientist-full.
  4. For Execution role, choose Enter a custom IAM role ARN.
  5. Enter arn:aws:iam::<AWSACCOUNT>:role/SageMakerStudioExecutionRole_data-scientist-full, providing your AWS account ID.
  6. After creating the first user profile, repeat the previous steps to create a second user profile.
    1. For User name, enter data-scientist-limited.
    2. For Execution role, enter the associated IAM role ARN.
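
For reference, the two user profiles can be described as requests to the SageMaker CreateUserProfile API (boto3: `create_user_profile`). This sketch only assembles the parameters and does not contact AWS; the helper name, the domain ID `d-xxxxxxxxxxxx`, and the account ID are placeholders assumed for illustration.

```python
def build_user_profile_request(domain_id, account_id, user_name):
    """Build keyword arguments for creating one Studio user profile."""
    # The role name follows the convention used earlier in this post.
    role_arn = f"arn:aws:iam::{account_id}:role/SageMakerStudioExecutionRole_{user_name}"
    return {
        "DomainId": domain_id,
        "UserProfileName": user_name,
        "UserSettings": {"ExecutionRole": role_arn},
    }


requests = [
    build_user_profile_request("d-xxxxxxxxxxxx", "123456789012", name)  # placeholder IDs
    for name in ("data-scientist-full", "data-scientist-limited")
]

# With suitable credentials, each request could be sent with:
#   boto3.client("sagemaker").create_user_profile(**request)
for request in requests:
    print(request["UserProfileName"], "->", request["UserSettings"]["ExecutionRole"])
```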

Testing Lake Formation access control policies

You now test the implemented Lake Formation access control policies by opening Studio using both user profiles. For each user profile, you run the same Studio notebook containing Athena queries. You should see different query outputs for each user profile, matching the data permissions implemented earlier.

  1. Sign in to the console with IAM user data-scientist-full.
  2. On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
  3. On the Studio Control Panel, choose user name data-scientist-full.
  4. Choose Open Studio.
  5. Wait for SageMaker Studio to load.

Due to the IAM policies attached to the IAM user, you can only open Studio with a user profile matching the IAM user name.

  1. In Studio, on the top menu, under File, under New, choose Terminal.
  2. At the command prompt, run the following command to import a sample notebook to test Lake Formation data permissions:
    git clone https://github.com/aws-samples/amazon-sagemaker-studio-audit.git

  1. In the left sidebar, choose the file browser icon.
  2. Navigate to amazon-sagemaker-studio-audit.
  3. Open the notebook folder.
  4. Choose sagemaker-studio-audit-control.ipynb to open the notebook.
  5. In the Select Kernel dialog, choose Python 3 (Data Science).
  6. Choose Select.
  7. Wait for the kernel to load.

  1. Starting from the first code cell in the notebook, press Shift + Enter to run the code cell.
  2. Continue running all the code cells, waiting for the previous cell to finish before running the following cell.

After running the last SELECT query, because the user has full SELECT permissions for the table, the query output includes all the columns in the amazon_reviews_parquet table.

  1. On the top menu, under File, choose Shut Down.
  2. Choose Shutdown All to shut down all the Studio apps.
  3. Close the Studio browser tab.
  4. Repeat the previous steps in this section, this time signing in as the user data-scientist-limited and opening Studio with this user.
  5. Don’t run the code cell in the section Create S3 bucket for query output files.

For this user, after running the same SELECT query in the Studio notebook, the query output only includes a subset of columns for the amazon_reviews_parquet table.

Auditing data access activity with Lake Formation and CloudTrail

In this section, we explore the events associated to the queries performed in the previous section. The Lake Formation console includes a dashboard where it centralizes all CloudTrail logs specific to the service, such as GetDataAccess. These events can be correlated with other CloudTrail events, such as Athena query requests, to get a complete view of the queries users are running on the data lake.

Alternatively, instead of filtering individual events in Lake Formation and CloudTrail, you could run SQL queries to correlate CloudTrail logs using Athena. Such integration is beyond the scope of this post, but you can find additional details in Using the CloudTrail Console to Create an Athena Table for CloudTrail Logs and Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena.

Auditing data access activity with Lake Formation

To review activity in Lake Formation, complete the following steps:

  1. Sign out of the AWS account.
  2. Sign in to the console with the IAM user configured as Lake Formation Admin.
  3. On the Lake Formation console, in the navigation pane, choose Dashboard.

Under Recent access activity, you can find the events associated to the data access for both users.

  1. Choose the most recent event with event name GetDataAccess.
  2. Choose View event.

Among other attributes, each event includes the following:

  • Event date and time
  • Event source (Lake Formation)
  • Athena query ID
  • Table being queried
  • IAM user embedded in the Lake Formation principal, based on the chosen role name convention

Auditing data access activity with CloudTrail

To review activity in CloudTrail, complete the following steps:

  1. On the CloudTrail console, in the navigation pane, choose Event history.
  2. In the Event history menu, for Filter, choose Event name.
  3. Enter StartQueryExecution.
  4. Expand the most recent event, then choose View event.

This event includes additional parameters that are useful to complete the audit analysis, such as the following:

  • Event source (Athena).
  • Athena query ID, matching the query ID from Lake Formation’s GetDataAccess event.
  • Query string.
  • Output location. The query output is stored in CSV format in this Amazon S3 location. Files for each query are named using the query ID.
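
The correlation described in this section, matching a Lake Formation GetDataAccess event to an Athena StartQueryExecution event through the query ID, can be sketched as a simple join. The records below are flattened, hand-written examples with hypothetical field names; real CloudTrail events are nested JSON documents and would need to be parsed into this shape first.

```python
ROLE_PREFIX = "SageMakerStudioExecutionRole_"


def user_from_principal(principal_arn):
    """Recover the IAM user name from the role name convention used in this post."""
    role_part = principal_arn.split(ROLE_PREFIX, 1)[1]
    return role_part.split("/", 1)[0]


def correlate_by_query_id(data_access_events, query_events):
    """Join GetDataAccess records with StartQueryExecution records on the query ID."""
    queries = {event["query_id"]: event for event in query_events}
    audit_trail = []
    for access in data_access_events:
        query = queries.get(access["query_id"])
        if query is None:
            continue  # no matching Athena request among the supplied events
        audit_trail.append({
            "user": user_from_principal(access["principal"]),
            "table": access["table"],
            "event_time": access["event_time"],
            "query_string": query["query_string"],
            "output_location": query["output_location"],
        })
    return audit_trail


# Hypothetical records for illustration only.
QUERY_ID = "11111111-2222-3333-4444-555555555555"
data_access_events = [{
    "query_id": QUERY_ID,
    "principal": "arn:aws:iam::123456789012:role/SageMakerStudioExecutionRole_data-scientist-limited",
    "table": "amazon_reviews_parquet",
    "event_time": "2020-01-01T12:00:05Z",
}]
query_events = [{
    "query_id": QUERY_ID,
    "query_string": "SELECT * FROM amazon_reviews_parquet LIMIT 10",
    "output_location": f"s3://sagemaker-audit-control-query-results-example/{QUERY_ID}.csv",
}]

audit_trail = correlate_by_query_id(data_access_events, query_events)
for row in audit_trail:
    print(row["user"], "queried", row["table"], "with:", row["query_string"])
```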

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

If you followed this walkthrough using the CloudFormation template, after shutting down the Studio apps for each user profile, deleting the stack deletes the remaining resources.

If you encounter any errors, open the Studio Control Panel and verify that all the apps for every user profile are in Deleted state before deleting the stack.

If you didn’t use the CloudFormation template, you can manually delete the resources you created:

  1. On the Studio Control Panel, for each user profile, choose User Details.
  2. Choose Delete user.
  3. When all users are deleted, choose Delete Studio.
  4. On the Amazon EFS console, delete the volume that was automatically created for Studio.
  5. On the Lake Formation console, delete the table and the database created for the Amazon Customer Reviews Dataset.
  6. Remove the data lake location for the dataset.
  7. On the IAM console, delete the IAM users, group, and roles created for this walkthrough.
  8. Delete the policies you created for these principals.
  9. On the Amazon S3 console, empty and delete the bucket created for storing Athena query results (starting with sagemaker-audit-control-query-results-), and the bucket created by Studio to share notebooks (starting with sagemaker-studio-).

Conclusion

This post described how to implement access control and auditing capabilities on a per-user basis in ML projects, using Studio notebooks, Athena, and Lake Formation to enforce access control policies when performing exploratory activities in a data lake.

I thank you for following this walkthrough and I invite you to implement it using the associated CloudFormation template. You’re also welcome to visit the GitHub repo for the project.


About the Author

Rodrigo Alarcon is a Sr. Solutions Architect with AWS based out of Santiago, Chile. Rodrigo has over 10 years of experience in IT security and network infrastructure. His interests include machine learning and cybersecurity.

InSpace: A new video conferencing platform that uses TensorFlow.js for toxicity filters in chat

A guest post by Narine Hall, Assistant Professor at Champlain College, CEO of InSpace

InSpace is a communication and virtual learning platform that gives people the ability to interact, collaborate, and educate in familiar physical ways, but in a virtual space. InSpace is built by educators for educators, putting education at the center of the platform.

  • InSpace is designed to mirror the fluid, personal, and interactive nature of a real classroom. It allows participants to break free of “Brady Bunch” boxes in existing conference solutions to create a fun, natural, and engaging environment that fosters interaction and collaboration.
  • Each person is represented in a video circle that can freely move around the space. When people are next to each other, they can hear and engage in conversation, and as they move away, the audio fades, allowing them to find new conversations.
  • As participants zoom out, they can see the entire space, which provides visual social cues. People can seamlessly switch from class discussions to private conversations or group/team-based work, similar to the format of a lab or classroom.
  • Teachers can speak to everyone when needed, move between individual students and groups for more private discussion, and place groups of students in audio-isolated rooms for collaboration while still belonging to one virtual space.

Because InSpace is a collaboration platform, chat is one of its fundamental features. From day one, we wanted to provide a mechanism that helps keep users from sending and receiving toxic messages. For example, a teacher in a classroom setting with young people may want a way to prevent students from typing inappropriate comments. Or, a moderator of a large discussion may want a way to reduce inappropriate spam. Or individual users may want to filter out such messages on their own.

A simple way to identify toxic comments would be to check for the presence of a list of words, including profanity. Moving a step beyond this, we did not want to identify toxic messages just by the words contained in the message; we also wanted to consider the context. So, we decided to use machine learning to accomplish that goal.

After some research, we found a pre-trained model for toxicity detection in TensorFlow.js that could be easily integrated into our platform. Importantly, this model runs entirely in the browser, which means that we can warn users against sending toxic comments without their messages ever being stored or processed by a server.

Performance-wise, we found that running the toxicity process in a browser’s main thread would be detrimental to the user experience. We decided a good approach was to use the Web Workers API to separate message toxicity detection from the main application so the processes are independent and non-blocking.

GIF moderating toxic comments

Web Workers connect to the main application by sending and receiving messages, in which you can wrap your data. Whenever a user sends a message, it is automatically added to a so-called queue and is sent from the main application to the web worker. When the web worker receives the message from the main app, it starts classification of the message, and when the output is ready, it sends the results back to the main application. Based on the results from the web worker, the main app either sends the message to all participants or warns the user that it is toxic.

Chart showing how toxicity filter works

Below is the pseudocode for the main application. We initialize the web worker by providing its path as an argument, set the callback that is called each time the worker sends a message, and declare the callback that is called when the user submits a message.

// main application
// initialize the web worker by passing the path to its script
const toxicityFilter = new Worker('toxicity-filter.worker.js');
// set the callback that processes the results sent back by the worker
toxicityFilter.onmessage = ({ data: { message, isToxic } }) => {
  if (isToxic) {
    markAsToxic(message);
  } else {
    sendToAll(message);
  }
};

When the user sends the message, we pass it to the web worker:

const onMessageSubmit = message => {
  toxicityFilter.postMessage(message);
  addToQueue(message);
};

After the worker is initialized, it starts listening to the data messages from the main app, and handling them using the declared onmessage callback, which then sends a message back to the main app.

// toxicity-filter worker
// import dependencies
importScripts(
  // the main library to run TensorFlow in the browser
  'https://cdn.jsdelivr.net/npm/@tensorflow/tfjs',
  // pre-trained model for toxicity detection
  'https://cdn.jsdelivr.net/npm/@tensorflow-models/toxicity',
);
// threshold point for the decision
const threshold = 0.9;
// the main model promise, used to classify each message
const modelPromise = toxicity.load(threshold);
// registered callback that runs when the main app sends a message
onmessage = ({ data: message }) => {
  modelPromise.then(model => {
    model.classify([message.body]).then(predictions => {
      // as we want to check the toxicity for all labels,
      // `predictions` will contain the results for all 7 labels,
      // so we check whether there is a match for any of them
      const isToxic = predictions.some(prediction => prediction.results[0].match);
      // send the message back to the main app together with the result
      postMessage({ message, isToxic });
    });
  });
};

As you can see, the toxicity detector package is straightforward to integrate with an app, and does not require significant changes to existing architecture. The main application only needs a small “connector,” and the logic of the filter is written in a separate file.

To learn more about InSpace visit https://inspace.chat.

Blue People v. City of Ney

Introduction

Discriminatory behavior towards certain groups by machine learning (ML) models is especially concerning in critical applications such as hiring. This blog post explains one source of discrimination: the reliance of ML models on different groups’ data distributions. We will show that when ML models use noisy features (which are pervasive in the real world, e.g., exam scores), they’re incentivized to devalue a good candidate from a lower-performing group. This blog post is based on:

Fereshte Khani and Percy Liang, “Feature Noise Induces Loss Discrepancy
Across Groups.” International Conference on Machine Learning. PMLR, 2020

The findings are illustrated by reviewing the hiring process in the
fictitious city of Ney, where recently a group of people has accused the
government of discrimination.

Hiring people in Ney

The government of Ney wants to hire qualified people. Each person in Ney has a skill level that is normally distributed with a mean of \(\mu\) and a standard deviation of \(\sigma_\text{skill}\). A person is qualified if their skill level is greater than 0 and non-qualified otherwise. The government wants to hire qualified people (all people with skills greater than 0). For example, Alice, with a skill level of 2, is qualified, but Bob, with a skill level of -1, is not.


The skill level of the people in Ney is normally distributed with a mean of \(\mu\) and a standard deviation of \(\sigma_\text{skill}\).

To assess people’s skills, the government created an exam. The exam score is a noisy indicator of the applicant’s skill since it cannot capture the true skill of a person (e.g., the same applicant would score differently on different versions of the SAT). In the city of Ney, exam noise is nice and simple: if an individual has skill \(z\), then their score is distributed as \(\mathcal{N}(z, \sigma_\text{noise}^2)\), where \(\sigma_\text{noise}^2\) indicates the variance of noise on the exam.


The exam score of an individual with a skill of \(z\) is a random variable normally distributed with a mean of \(z\) and a standard deviation of \(\sigma_\text{noise}\).

The government wants to choose a threshold \(\tau\) and hire all people whose exam scores are greater than \(\tau\). There are two kinds of errors that the government can make:

  1. Not hiring a qualified person (\(z > 0 \land x \le \tau\))
  2. Hiring a non-qualified person (\(z \le 0 \land x > \tau\))

For simplicity, let’s assume the government cares about these two types
of errors equally and wants to minimize the overall error, i.e., the
number of non-qualified hired people plus the number of qualified
non-hired people.


The government’s goal is to find a cut-off threshold such that it minimizes the error.

Given all exam scores and knowledge of the skill distribution of the people, what cut-off threshold should the government use to minimize the error defined above? Is it a good strategy for the government to simply use 0 as the threshold and hire all individuals with scores greater than zero?

Let’s consider an example where the skill distribution is \(\mathcal{N}(-1, 1)\), and the exam noise has a standard deviation of \(\sigma_\text{noise}=1\). The following lines of code plot the average error for various thresholds for this example. As illustrated, 0 is not the best threshold to use. In fact, in this example, a threshold of \(\tau=1\) leads to minimum error.


A simple example with \(\mu=-1\) and \(\sigma_\text{skill}=\sigma_\text{noise}=1\). As shown on the right, accepting individuals with a score higher than \(0\) does not result in the minimum error.

The government wants to minimize the number of hired people with negative skill levels + the number of non-hired people with positive skill levels. Hiring all people with positive exam scores (a noisy indicator of the skill) is not optimal.
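
The code behind the original figure is not reproduced here, so the following is a stand-in sketch rather than the authors' code: a small Monte Carlo simulation of the running example (\(\mu=-1\), \(\sigma_\text{skill}=\sigma_\text{noise}=1\)) that estimates the average error for a grid of thresholds and prints a table instead of plotting. The sample size, seed, and threshold grid are arbitrary choices.

```python
import random

random.seed(0)

MU, SIGMA_SKILL, SIGMA_NOISE = -1.0, 1.0, 1.0
N = 50_000

# Draw (skill, exam score) pairs: score = skill + Gaussian exam noise.
skills = [random.gauss(MU, SIGMA_SKILL) for _ in range(N)]
people = [(z, z + random.gauss(0.0, SIGMA_NOISE)) for z in skills]


def average_error(tau):
    """Fraction of people who are qualified but rejected, or non-qualified but hired."""
    mistakes = sum(1 for z, x in people if (z > 0) != (x > tau))
    return mistakes / len(people)


thresholds = [i * 0.25 for i in range(-8, 13)]  # -2.0, -1.75, ..., 3.0
errors = {tau: average_error(tau) for tau in thresholds}
best_tau = min(errors, key=errors.get)

for tau in thresholds:
    print(f"tau = {tau:5.2f}  error = {errors[tau]:.4f}")
print("threshold with the lowest simulated error:", best_tau)
```

Because the error curve is flat near its minimum, the simulated best threshold can land on a neighboring grid point, but the error at a threshold of 1 is clearly lower than at 0.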

If 0 is not always the optimal threshold, then what is the optimal threshold for minimizing error for different values of \(\mu\), \(\sigma_\text{skill}\), and \(\sigma_\text{noise}\)? Generally, given a person’s exam score (\(x\)) and the skill level distribution (\(\mathbb{P}(z)\)), what can we infer about their real skill (\(z\))? Here is where Bayesian inference comes in.

Bayesian inference  

Let’s see what we can infer about a person’s skill given their exam score and knowing the skill level distribution \(\mathbb{P}(z)\) (known as the prior distribution since it shows the prior over a person’s skill). Using Bayes’ rule, we can calculate \(\mathbb{P}(z \mid x)\) (known as the posterior distribution since it shows the distribution over a person’s skill after observing their score).

Let’s first consider two extreme cases:

  1. If the exam is completely precise (i.e., \(\sigma_\text{noise}=0\)), then the exam score is the exact indicator of a person’s skill (irrespective of the prior distribution).
  2. If the exam is pure noise (i.e., \(\sigma_\text{noise} \rightarrow \infty\)), then the exam score is meaningless, and the best estimate for a person’s skill is the average skill \(\mu\) (irrespective of the exam score).

Intuitively, when the noise variance has a value between \(0\) and \(\infty\), the best estimate of a person’s skill is a number between their exam score (\(x\)) and the average skill (\(\mu\)). The figure below shows the standard formulation of the posterior distribution \(\mathbb{P}(z \mid x)\) after observing an exam score (\(x_0\)). For more details on how to derive this formula, see this.


Posterior distribution of a person’s skill after observing their exam score (\(x_0\)).

Based on this formula (and as hypothesized), depending on the amount of noise, \(\mathbb{E}[z \mid x]\) is a number between \(x\) and \(\mu\).

An applicant’s expected skill level is between their exam score and the average skill among Ney people. If the exam is noisier, it is closer to the average skill; if the exam is more precise, it is closer to the exam score.
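
As a numerical illustration, the sketch below computes the posterior mean and variance using the standard result for a normal prior combined with normal observation noise: the posterior mean is a weighted average of the exam score and the prior mean, with the weight on the score equal to \(\sigma_\text{skill}^2 / (\sigma_\text{skill}^2 + \sigma_\text{noise}^2)\). The function name `posterior` is an assumption for illustration, and the function assumes \(\sigma_\text{skill} > 0\).

```python
def posterior(x, mu, sigma_skill, sigma_noise):
    """Return (mean, variance) of the skill z after observing exam score x.

    Assumes z ~ N(mu, sigma_skill^2) and x | z ~ N(z, sigma_noise^2).
    """
    var_skill, var_noise = sigma_skill ** 2, sigma_noise ** 2
    # Weight on the exam score: 1 for a precise exam, near 0 for a very noisy one.
    weight = var_skill / (var_skill + var_noise)
    mean = weight * x + (1 - weight) * mu
    variance = var_skill * var_noise / (var_skill + var_noise)
    return mean, variance


# An applicant scores 2 in a population with average skill -1.
precise_mean, _ = posterior(2.0, -1.0, 1.0, 0.0)     # noise-free exam
moderate_mean, _ = posterior(2.0, -1.0, 1.0, 1.0)    # equal skill and noise spread
noisy_mean, _ = posterior(2.0, -1.0, 1.0, 1000.0)    # almost pure noise

print("precise exam :", precise_mean)
print("moderate exam:", moderate_mean)
print("noisy exam   :", round(noisy_mean, 4))
```

The three cases move from the exam score itself, to a value halfway between score and average skill, to a value that is almost exactly the average skill.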

Optimal threshold

Now that we have exactly characterized the posterior distribution (\(\mathbb{P}(z \mid x)\)), the government can find the optimal threshold. For any exam score \(x\), if the government hires people with score \(x\), it incurs \(\mathbb{P}(z \le 0 \mid x)\) error (probability of hiring non-qualified people). On the other hand, if it does not hire people with score \(x\), it incurs \(\mathbb{P}(z > 0 \mid x)\) error (probability of not hiring qualified people). Thus, in order to minimize the error, the government should hire a person iff \(\mathbb{P}(z > 0 \mid x) > \mathbb{P}(z \le 0 \mid x)\). Since the posterior distribution is a normal distribution, the government must hire an applicant iff \(\mathbb{E}[z \mid x] > 0\).

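Using the formulation in the previous section, we have:

\[ \mathbb{E}[z \mid x] = \frac{\sigma_\text{skill}^2 \, x + \sigma_\text{noise}^2 \, \mu}{\sigma_\text{skill}^2 + \sigma_\text{noise}^2} > 0 \iff x > -\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2} \mu \]

Therefore, the optimal threshold is:

\[ \tau = -\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2} \mu \]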

In our running example with average skill \(\mu=-1\) and \(\sigma_\text{skill} = \sigma_\text{noise}=1\), the optimal threshold is 1. The figure below shows how the optimal threshold varies according to \(\mu\) and \(\sigma_\text{noise}\). As \(\sigma_\text{noise}\) increases or \(\mu\) decreases, the optimal threshold moves farther away from \(0\).


(left) The optimal threshold increases as the average of the prior distribution decreases (with a fixed exam noise \(\sigma_\text{noise} > 0\)). (right) The optimal threshold increases if the exam noise increases (with a fixed average skill \(\mu < 0\)). Note that, if exam scores are not noisy or the average skill is zero, then the optimal threshold is zero.

As exams become more noisy or the average skill becomes more negative, the optimal threshold moves further away from 0.
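
To make the dependence concrete, the sketch below evaluates the threshold at which the posterior mean of the normal-normal model crosses zero, \(\tau = -(\sigma_\text{noise}^2 / \sigma_\text{skill}^2)\,\mu\), for a few noise levels and average skills. The function name and the chosen parameter values are illustrative assumptions.

```python
def optimal_threshold(mu, sigma_skill, sigma_noise):
    """Exam score above which the expected skill E[z | x] is positive."""
    return -(sigma_noise ** 2 / sigma_skill ** 2) * mu


# Fixed average skill, increasing exam noise.
by_noise = {s: optimal_threshold(-1.0, 1.0, s) for s in (0.5, 1.0, 2.0)}
# Fixed exam noise, decreasing average skill.
by_mu = {m: optimal_threshold(m, 1.0, 1.0) for m in (-0.5, -1.0, -2.0)}

for sigma_noise, tau in by_noise.items():
    print(f"mu = -1.0, sigma_noise = {sigma_noise}: threshold = {tau}")
for mu, tau in by_mu.items():
    print(f"mu = {mu}, sigma_noise = 1.0: threshold = {tau}")
```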

What does machine learning have to do with all of this?

So far, we precisely identified the optimal cut-off threshold given the exact knowledge of \(\mu\), \(\sigma_\text{skill}\), and \(\sigma_\text{noise}\). But how can the government find the optimal threshold using observational data? This is where machine learning (ML) comes into the picture.

Let’s imagine very favorable conditions. Let’s assume everyone (an infinite number of them!) takes the exam, the government hires all of them and observes their true skills. Further, assume the modeling assumption is perfectly correct (i.e., both the true prior distribution and conditional distribution are normal). What would happen if the government trains a model with an infinite number of \((x, z)\) pairs?


The government has collected lots of data and now wants to use ML models to predict the best threshold that minimizes the error.

Before delving into this, we would like to note that in real-world
scenarios, we do not have infinite data (finite data issues); the
government does not hire everyone (selection bias issues), and the true
skill is not perfectly observable (target noise/biases issues).
Furthermore, the modeling assumptions are often incorrect (model
misspecification issues). Each of these issues may affect the model
adversely; however, in this blog post our goal is to analyze the model
decisions when none of these issues exist. In the next section, we will show that discrimination occurs even under these ideal conditions.

Under these very favorable conditions and the right loss function, machine learning algorithms can perfectly predict \(\mathbb{E}[z \mid x]\) from \(x\) and can therefore find the optimal threshold that minimizes the error. The following few lines of Python code show how linear regression and logistic regression fit the data. In this example, we set \(\mu = -1, \sigma_\text{skill}=\sigma_\text{noise}=1\), and as shown in the figure on the right, the cut-off threshold predicted by the model is one, which matches the optimal threshold as we observed previously.


A simple example along with the predicted cut-off
threshold for linear and logistic regression. The predicted cut-off
threshold results in the minimum error, as previously discussed.

Under very favorable conditions, machine learning models find the optimal threshold, which is a function of average skill, exam noise, and skill variance among people.
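
The original code for this figure is not included in the text, so the following is a stand-in sketch under the same parameters (\(\mu=-1\), \(\sigma_\text{skill}=\sigma_\text{noise}=1\)). It fits an ordinary least-squares line that predicts skill from exam score on simulated data, using only the standard library, and reads off the score at which the predicted skill crosses zero. It covers only the linear regression case; the sample size and seed are arbitrary.

```python
import random

random.seed(0)

MU, SIGMA_SKILL, SIGMA_NOISE = -1.0, 1.0, 1.0
N = 50_000

# Simulated training data: true skills and the noisy exam scores they produce.
skills = [random.gauss(MU, SIGMA_SKILL) for _ in range(N)]
scores = [z + random.gauss(0.0, SIGMA_NOISE) for z in skills]

# Closed-form simple linear regression of skill (z) on exam score (x).
mean_x = sum(scores) / N
mean_z = sum(skills) / N
cov_xz = sum((x - mean_x) * (z - mean_z) for x, z in zip(scores, skills)) / N
var_x = sum((x - mean_x) ** 2 for x in scores) / N

slope = cov_xz / var_x
intercept = mean_z - slope * mean_x

# The model hires when predicted skill > 0, i.e. when x > -intercept / slope.
learned_threshold = -intercept / slope

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"learned cut-off threshold = {learned_threshold:.3f}")
```

With enough data the fitted line approaches the posterior mean, so the learned cut-off lands close to the optimal threshold of 1 for this example.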

Optimal thresholds for different groups

So far, we have shown how to calculate the optimal threshold and illustrated that ML models also recover this threshold. Let’s now analyze the optimal threshold when different groups exist in the population. There are two kinds of people in the city of Ney: blue and red. The blue people’s skills are normally distributed centered on \(\mu_\text{blue}\), and the red people’s skills are normally distributed centered on \(\mu_\text{red}\). The standard deviation for both groups is \(\sigma_\text{skill}\). There can be various reasons for disparities between groups; for example, historically blue people might not have been allowed to attend school.


In Ney, people are divided into two groups: blue and red. The blue people have a lower average skill level than the red people.

First of all, let’s see what happens if the exam is completely precise. As
previously discussed in this case, the optimal threshold to use is 0 for
both groups independent of their distribution. Thus, both groups are
held to the same standard, and the error for the government is 0.

If there is no noise in the exam, then zero is the optimal threshold for both groups and leads to zero error.

Now let’s analyze the case where the exam is noisy (\(\sigma_\text{noise} > 0\)). As discussed in the prior sections, the optimal threshold depends on the average of the prior distribution, thus the optimal threshold differs between blue and red groups. Therefore, if the government knows the demographic information, then it’s a better strategy for the government to classify different groups separately (in order to minimize the error). In particular, the government can calculate the optimal threshold for blue and red people using Bayesian inference.

People in a group that has lower average skills need to pass a higher bar for hiring! Not only do blue people need to overcome other associated effects of being in a group with lower average skills, they also need to pass a higher bar to get hired.                  


The cut-off threshold for hiring is higher for blue people in comparison to the red people.

As stated, the government uses a higher threshold for people in a group
with a lower average skill! Consider two individuals with the same skill
level but from different groups. The blue person is less likely to get
hired by the government than the red person. Surprisingly, blue people
who are already in a group with a lower average skill (which probably
affects their confidence and society’s view of them) need to also pass a
higher bar to get hired!

Finally, note that the gap between thresholds for the different groups
grows as the noise increases.


As the exam noise increases, the gap between the optimal thresholds among different groups widens. Blue people need to get a better score than red people on the exam to get hired.

A blue person has a lower chance of getting hired in comparison with a red person with the same skill.
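
The effect can be checked numerically. The sketch below uses hypothetical group means (\(\mu_\text{blue}=-1\), \(\mu_\text{red}=-0.5\)) and the threshold of the normal-normal model, \(\tau = -(\sigma_\text{noise}^2 / \sigma_\text{skill}^2)\,\mu\), to compare the hiring probability of a blue and a red applicant who have the same true skill, and to show how the gap between the two thresholds grows with exam noise. All names and values are assumptions for illustration.

```python
import math


def optimal_threshold(mu, sigma_skill, sigma_noise):
    """Exam score above which the expected skill E[z | x] is positive."""
    return -(sigma_noise ** 2 / sigma_skill ** 2) * mu


def hire_probability(skill, threshold, sigma_noise):
    """P(exam score > threshold) when score ~ N(skill, sigma_noise^2)."""
    return 0.5 * math.erfc((threshold - skill) / (sigma_noise * math.sqrt(2)))


MU_BLUE, MU_RED = -1.0, -0.5   # hypothetical group averages
SIGMA_SKILL, SIGMA_NOISE = 1.0, 1.0
SKILL = 1.0                    # two qualified applicants with identical skill

tau_blue = optimal_threshold(MU_BLUE, SIGMA_SKILL, SIGMA_NOISE)
tau_red = optimal_threshold(MU_RED, SIGMA_SKILL, SIGMA_NOISE)
p_blue = hire_probability(SKILL, tau_blue, SIGMA_NOISE)
p_red = hire_probability(SKILL, tau_red, SIGMA_NOISE)

print(f"blue: threshold = {tau_blue}, hire probability = {p_blue:.3f}")
print(f"red : threshold = {tau_red}, hire probability = {p_red:.3f}")

# The gap between the group thresholds widens as the exam gets noisier.
gaps = [
    optimal_threshold(MU_BLUE, SIGMA_SKILL, s) - optimal_threshold(MU_RED, SIGMA_SKILL, s)
    for s in (0.5, 1.0, 2.0)
]
print("threshold gaps for sigma_noise = 0.5, 1.0, 2.0:", gaps)
```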

Conclusion

We examined the discriminatory effect of relying on noisy features. When ML models use noisy features, they’re naturally incentivized to devalue a good score when the candidate in question comes from an overall lower-performing group. Note that noisy features are prevalent in any real-world application (here, we assumed that noise is the same among all individuals, but it’s usually worse for disadvantaged groups). Ideally, we would like to improve the features to better reflect a candidate’s skill/potential or make the features more closely approximate the job requirements. If that’s not possible, it’s important to be conscious that the “optimal decision” is to discriminate, and we should adjust our process (e.g., hiring) in acknowledgment that group membership can shade an individual’s evaluation.


Frequently asked questions

Can we just remove the group membership information, so the model treats individuals from both groups similarly?

Unlike this example where group membership is a removable feature,
real-world datasets are more complex. Usually, datasets contain many
features such that the group membership can be predicted from them
(recall that ML models benefit from predicting group membership since it
lowers error). Thus, it is not obvious how to remove group membership in
these datasets. See [1,2,3] for some efforts on removing group
information.

Why should we treat these two groups similarly when their distributions are inherently different? Utilizing group membership information reduces error overall and for both groups!

Fairness in machine learning usually studies the impact of ML algorithms
on groups according to protected attributes such as sex, sexual
orientation, race, etc. Usually, there has been some discrimination
towards these groups throughout history, which leads to huge disparities
among their distributions. For example, women (because of their sex)
were not allowed to go to universities. Thus, these disparities are not
inherent and could (and probably should!) change over time. For
instance, see women in the labor force [4].

Another reason to avoid relying on disparities among protected groups in
models is feedback loops. Feedback loops might exacerbate distributional
disparities among protected groups over time (e.g., few women get
accepted → self-doubt among women increases → women perform worse on
the exam → fewer women get accepted, and so on). For instance, see [5]
and [6].

Finally, note that although the government’s objective may be to minimize
error by weighting the cost of hiring an unqualified candidate and the cost
of rejecting a qualified one equally, it is not clear whether the group
objectives should be the same. For example, a group might be worse off
as a result of the government not hiring its qualified members than if
the government had hired its non-qualified members (for example, in
settings where the lack of minority role models in higher-level
positions leads to a lower perceived sense of belonging in other members
of a group). Thus, using group membership to minimize the error is not
necessarily the most beneficial outcome for a group; and depending on
the context we might need to minimize other objectives.

What about other notions of fairness in machine learning?

In this blog post, we studied the ML model’s prediction for two similar individuals (here, with the same z) from different groups (blue vs. red). This is referred to as the counterfactual notion of fairness. Another common notion, the statistical notion of fairness, looks at the groups as a whole and compares the error each group incurs (it is also common to compare the error incurred by the qualified members of each group, known as equal opportunity [7]). Statistical and counterfactual notions of fairness are independent of each other, and satisfying one does not guarantee satisfying the other. Feature noise also causes a trade-off between these two notions of fairness, which is beyond this blog post’s scope. See our paper [8] for critiques of these two notions and the effect of feature noise on statistical notions of fairness.

Acknowledgement

I would like to thank Percy Liang, Megha Srivastava, Frieda Rong, Rishi Bommasani, Yeganeh Alimohammadi, and Michelle Lee for their useful comments.
