Mission-Driven: Takeaways From Our Corporate Responsibility Report

NVIDIA’s latest corporate responsibility report shares our efforts in empowering employees and putting to work our technologies for the benefit of humanity.

Amid ongoing global economic concerns and pandemic challenges, this year’s report highlights our ability to attract and retain talented people who come here to do their life’s work while tackling some of the world’s greatest technology and societal challenges.

Taking Care of Our People 

NVIDIA scored the highest grade for workplaces, ranking No. 1 on Glassdoor’s Best Places to Work list for large U.S. companies. Some 95% of employees indicated they’d recommend NVIDIA to a friend.

We make the health of our employees and their families a top priority. Our family leave policy allows U.S. employees 12 weeks of fully paid leave to care for family members. And we’ve selected eight days each year in which we shut down all but essential operations globally, so employees can unwind without having to return to a full inbox.

We’ve recently added surrogacy benefits and fertility education resources to our award-winning list of family-forming benefits, which include adoption support and a generous parental leave program of up to 22 weeks of fully paid leave.

And we worked with our LGBTQ+ colleagues to expand gender affirmation resources and support.

Supporting Communities

Last year we established the Ignite program to prepare students from underrepresented communities for NVIDIA summer internships. Sixty-five percent of these students are returning for our internship program, and we saw a 100% increase in applications for this summer’s Ignite program.

We supported professional organizations, including Black Women in AI, Women in Data and Women-ai, to increase access to AI education and technology.

We launched NVIDIA Emerging Chapters, a new program that enables developers in emerging regions to build and scale their AI, data science and graphics expertise through technology access, educational resources and co-marketing opportunities.

We announced a three-year partnership with the Boys & Girls Clubs of Western Pennsylvania to expand access to AI and robotics to students in communities traditionally underrepresented in tech. Core to this is an open-source curriculum that will make it easy for Boys & Girls Clubs nationwide to deliver AI education to their students.

Our employees remained committed to donating resources to those in need, with nearly 40% of them participating in the NVIDIA Foundation’s Inspire 365 efforts during fiscal year 2022. That brought the unique participation rate since the initiative’s start to 68%.

Despite in-person volunteering remaining paused due to COVID, NVIDIANs still logged more than 16,500 volunteer hours through individual and virtual efforts, up more than 76% from the previous fiscal year.

NVIDIANs also joined the company in contributing more than $22 million to charitable causes in the last fiscal year. And during the Ukraine crisis, employees and NVIDIA have donated more than $4.6 million to date for humanitarian relief.

Developing Climate Solutions 

NVIDIA GPUs are enabling progress in responding to the crisis of climate change. With recent advances in AI, weather forecasting models can now run four to five orders of magnitude faster than with traditional computing methods.

We plan to build Earth-2, an AI supercomputer that will create a digital twin of the Earth, enabling scientists to do ultra-high-resolution climate modeling and put tools into the hands of cities and nations to simulate the impact of mitigation and adaptation strategies.

Digital twins are also being used to predict costly maintenance needs at power plants and to model new energy sources, such as fusion reactor designs.

NVIDIA scientists, along with leading institutions, are using AI to model the most efficient way to capture greenhouse gases in the atmosphere and lock them away underground.

Startups from the NVIDIA Inception program are jumping into the climate challenge as well. In Kenya, a company is using AI to monitor the health of bee colonies. And a German startup is monitoring the ocean floor to help scientists understand how natural carbon sinks can be better utilized.

Building Energy-Efficient Technologies 

These solutions are not only bringing innovation to the climate challenge, but are built on a foundation of energy-efficient technology.

We aim to make every new generation of our GPUs faster and more energy efficient than its predecessor. As AI models and HPC applications increase exponentially in size, moving to new-generation GPUs will help our customers complete their work with lower energy consumption and get results more quickly.

NVIDIA GPUs are typically 20x more energy efficient for AI and HPC workloads than CPUs. If we switched all the CPU-only servers running AI and HPC worldwide to GPU-accelerated systems, the world could save nearly 12 trillion watt-hours of energy a year, equivalent to the electricity requirements of nearly 1.7 million U.S. homes.

Leaning Into Trustworthy AI

We’re committed to the advancement of trustworthy AI, recognizing that technology can have a profound impact on people and the world. We’ve set priorities that are rooted in fostering positive change and enabling trust and transparency in AI development.

We’re developing practices and methodologies enabling construction of AI products that are trustworthy by design, including datasets, machine learning tools and processes, AI model development, and software development and testing.

Running a Mission-Driven Company

As NVIDIA CEO Jensen Huang notes in the opening letter of our corporate responsibility report, creating a place where people can do impactful work means building a culture strong enough to take on the most pressing problems.

The impacts of accelerated computing, which we have driven over the last two decades, are already being felt in areas as wide ranging as self-driving cars, healthcare and, increasingly, in climate change. We’re proud to have built this organization with more than 20,000 of the brightest minds and look forward to what they choose to tackle next.


Feature engineering at scale for healthcare and life sciences with Amazon SageMaker Data Wrangler

Machine learning (ML) is disrupting many industries at an unprecedented pace. The healthcare and life sciences (HCLS) industry has evolved rapidly in recent years, embracing ML across a multitude of use cases to deliver quality care and improve patient outcomes.

In a typical ML lifecycle, data engineers and scientists spend the majority of their time on the data preparation and feature engineering steps before even getting started with model building and training. Having a tool that can lower the barrier to entry for data preparation, and thereby improve productivity, is highly desirable for these personas. Amazon SageMaker Data Wrangler is purpose-built by AWS to reduce the learning curve and enable data practitioners to accomplish data preparation, cleaning, and feature engineering tasks with less effort and time. It offers a graphical interface with many built-in functions and integrations with other AWS services, such as Amazon Simple Storage Service (Amazon S3) and Amazon SageMaker Feature Store, as well as partner data sources including Snowflake and Databricks.

In this post, we demonstrate how to use Data Wrangler to prepare healthcare data for training a model to predict heart failure, given a patient’s demographics, prior medical conditions, and lab test result history.

Solution overview

The solution consists of the following steps:

  1. Acquire a healthcare dataset as input to Data Wrangler.
  2. Use Data Wrangler’s built-in transformation functions to transform the dataset. This includes dropping columns, featurizing date/time, joining datasets, imputing missing values, encoding categorical variables, scaling numeric values, balancing the dataset, and more.
  3. Use Data Wrangler’s custom transform function (Pandas or PySpark code) to supplement additional transformations required beyond the built-in transformations and demonstrate the extensibility of Data Wrangler. This includes filtering rows, grouping data, forming new dataframes based on conditions, and more.
  4. Use Data Wrangler’s built-in visualization functions to perform visual analysis. This includes target leakage, feature correlation, quick model, and more.
  5. Use Data Wrangler’s built-in export options to export the transformed dataset to Amazon S3.
  6. Launch a Jupyter notebook to use the transformed dataset in Amazon S3 as input to train a model.

Generate a dataset

Now that we have settled on the ML problem statement, we first set our sights on acquiring the data we need. Research studies such as Heart Failure Prediction may provide data that’s already in good shape. However, we often encounter scenarios where the data is quite messy and requires joining, cleansing, and several other transformations that are very specific to the healthcare domain before it can be used for ML training. We want to find or generate data that is messy enough and walk you through the steps of preparing it using Data Wrangler. With that in mind, we picked Synthea as a tool to generate synthetic data that fits our goal. Synthea is an open-source synthetic patient generator that models the medical history of synthetic patients. To generate your dataset, complete the following steps:

  1. Follow the instructions as per the quick start documentation to create an Amazon SageMaker Studio domain and launch Studio.
    This is a prerequisite step. It is optional if Studio is already set up in your account.
  2. After Studio is launched, on the Launcher tab, choose System terminal.
    This launches a terminal session that gives you a command line interface to work with.
  3. To install Synthea and generate the dataset in CSV format, run the following commands in the launched terminal session:
    $ sudo yum install -y java-1.8.0-openjdk-devel
    $ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
    $ export PATH=$JAVA_HOME/bin:$PATH
    $ git clone https://github.com/synthetichealth/synthea
    $ cd synthea
    $ git checkout v3.0.0
    $ ./run_synthea --exporter.csv.export=true -p 10000

We supply the -p parameter to generate the datasets with a population size of 10,000. Note that this population size denotes the number of living members of the population; Synthea also generates data for deceased members, which may add a few extra data points on top of the specified sample size.

Wait until the data generation is complete. This step usually takes around an hour or less. Synthea generates multiple datasets, including patients, medications, allergies, conditions, and more. For this post, we use three of the resulting datasets:

  • patients.csv – This dataset is about 3.2 MB and contains approximately 11,000 rows of patient data (25 columns including patient ID, birthdate, gender, address, and more)
  • conditions.csv – This dataset is about 47 MB and contains approximately 370,000 rows of medical condition data (six columns including patient ID, condition start date, condition code, and more)
  • observations.csv – This dataset is about 830 MB and contains approximately 5 million rows of observation data (eight columns including patient ID, observation date, observation code, value, and more)

There is a one-to-many relationship between the patients and conditions datasets. There is also a one-to-many relationship between the patients and observations datasets. For a detailed data dictionary, refer to CSV File Data Dictionary.
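
If you want to sanity-check these relationships outside Data Wrangler, a minimal Pandas sketch like the following works against the generated CSV files. The column names used here (Id in patients.csv, PATIENT in the other two files) reflect Synthea’s CSV export and are assumptions you may need to adjust for your version.

    import pandas as pd

    # Paths assume the Synthea ./output/csv directory created earlier; adjust as needed.
    patients = pd.read_csv("output/csv/patients.csv")
    conditions = pd.read_csv("output/csv/conditions.csv")
    observations = pd.read_csv("output/csv/observations.csv")

    # One-to-many: a single patient Id maps to many condition and observation rows.
    print(conditions["PATIENT"].value_counts().head())
    print(observations["PATIENT"].value_counts().head())

    # Every condition and observation row should reference a known patient.
    print(conditions["PATIENT"].isin(patients["Id"]).all())
    print(observations["PATIENT"].isin(patients["Id"]).all())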

  1. To upload the generated datasets to a source bucket in Amazon S3, run the following commands in the terminal session:
    $ cd ./output/csv
    $ aws s3 sync . s3://<source bucket name>/

Launch Data Wrangler

Choose SageMaker resources in the navigation pane in Studio, and on the Projects menu, choose Data Wrangler to create a Data Wrangler data flow. For detailed steps on how to launch Data Wrangler from within Studio, refer to Get Started with Data Wrangler.

Import data

To import your data, complete the following steps:

  1. Choose Amazon S3 and locate the patients.csv file in the S3 bucket.
  2. In the Details pane, choose First K for Sampling.
  3. Enter 1100 for Sample size.
    In the preview pane, Data Wrangler pulls the first 100 rows from the dataset and lists them as a preview.
  4. Choose Import.
    Data Wrangler selects the first 1,100 patients from the total patients (11,000 rows) generated by Synthea and imports the data. The sampling approach lets Data Wrangler only process the sample data. It enables us to develop our data flow with a smaller dataset, which results in quicker processing and a shorter feedback loop. After we create the data flow, we can submit the developed recipe to a SageMaker processing job to horizontally scale out the processing for the full or larger dataset in a distributed fashion.
  5. Repeat this process for the conditions and observations datasets.
    1. For the conditions dataset, enter 37000 for Sample size, which is 1/10 of the total 370,000 rows generated by Synthea.
    2. For the observations dataset, enter 500000 for Sample size, which is 1/10 of the total 5 million observation rows generated by Synthea.

You should see three datasets as shown in the following screenshot.

Transform the data

Data transformation is the process of changing the structure, value, or format of one or more columns in the dataset. The process is usually developed by a data engineer and can be challenging for people with a smaller data engineering skillset to decipher the logic proposed for the transformation. Data transformation is part of the broader feature engineering process, and the correct sequence of steps is another important criterion to keep in mind while devising such recipes.

Data Wrangler is designed to be a low-code tool to reduce the barrier of entry for effective data preparation. It comes with over 300 preconfigured data transformations for you to choose from without writing a single line of code. In the following sections, we see how to transform the imported datasets in Data Wrangler.

Drop columns in patients.csv

We first drop some columns from the patients dataset. Dropping redundant columns removes non-relevant information from the dataset and helps reduce the amount of computing resources required to process the dataset and train a model. In this section, we drop columns such as SSN or passport number, based on the common-sense judgment that these columns have no predictive value. In other words, they don’t help our model predict heart failure. Our study is also not concerned with the influence of other columns, such as birthplace or healthcare expenses, on a patient’s heart failure, so we drop them as well. Redundant columns can also be identified by running the analyses built into Data Wrangler, such as target leakage, feature correlation, and multicollinearity. For more details on the supported analysis types, refer to Analyze and Visualize. Additionally, you can use the Data Quality and Insights Report to perform automated analyses on the datasets to arrive at a list of redundant columns to eliminate.

  1. Choose the plus sign next to Data types for the patients.csv dataset and choose Add transform.
  2. Choose Add step and choose Manage columns.
  3. For Transform, choose Drop column.
  4. For Columns to drop, choose the following columns:
    1. SSN
    2. DRIVERS
    3. PASSPORT
    4. PREFIX
    5. FIRST
    6. LAST
    7. SUFFIX
    8. MAIDEN
    9. RACE
    10. ETHNICITY
    11. BIRTHPLACE
    12. ADDRESS
    13. CITY
    14. STATE
    15. COUNTY
    16. ZIP
    17. LAT
    18. LON
    19. HEALTHCARE_EXPENSES
    20. HEALTHCARE_COVERAGE
  5. Choose Preview to review the transformed dataset, then choose Add.

    You should see the step Drop column in your list of transforms.
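
For reference, the same operation expressed as a custom transform in Pandas (where Data Wrangler exposes the dataframe as df) would look roughly like the sketch below. We use the built-in transform here, so this is purely illustrative.

    # Illustrative Pandas equivalent of the built-in Drop column transform.
    drop_cols = [
        "SSN", "DRIVERS", "PASSPORT", "PREFIX", "FIRST", "LAST", "SUFFIX", "MAIDEN",
        "RACE", "ETHNICITY", "BIRTHPLACE", "ADDRESS", "CITY", "STATE", "COUNTY",
        "ZIP", "LAT", "LON", "HEALTHCARE_EXPENSES", "HEALTHCARE_COVERAGE",
    ]
    df = df.drop(columns=drop_cols)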

Featurize date/time in patients.csv

Now we use the Featurize date/time function to generate the new feature Year from the BIRTHDATE column in the patients dataset. We use the new feature in a subsequent step to calculate a patient’s age at the time an observation takes place.

  1. In the Transforms pane of your Drop column page for the patients dataset, choose Add step.
  2. Choose the Featurize date/time transform.
  3. Choose Extract columns.
  4. For Input columns, add the column BIRTHDATE.
  5. Select Year and deselect Month, Day, Hour, Minute, Second.

  6. Choose Preview, then choose Add.
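
Under the hood, this step is equivalent to extracting the year component from the parsed date. A rough Pandas sketch of the same transformation (the output column name BIRTHDATE_year matches the one referenced later in this post) is:

    import pandas as pd

    # Illustrative equivalent of Featurize date/time with only Year selected.
    df["BIRTHDATE_year"] = pd.to_datetime(df["BIRTHDATE"]).dt.year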

Add transforms in observations.csv

Data Wrangler supports custom transforms using Python (user-defined functions), PySpark, Pandas, or PySpark (SQL). You can choose your transform type based on your familiarity with each option and preference. For the latter three options, Data Wrangler exposes the variable df for you to access the dataframe and apply transformations on it. For a detailed explanation and examples, refer to Custom Transforms. In this section, we add three custom transforms to the observations dataset.

  1. Add a transform to observations.csv and drop the DESCRIPTION column.
  2. Choose Preview, then choose Add.
  3. In the Transforms pane, choose Add step and choose Custom transform.
  4. On the drop-down menu, choose Python (Pandas).
  5. Enter the following code:
    df = df[df["CODE"].isin(['8867-4','8480-6','8462-4','39156-5','777-3'])]

    These are LOINC codes that correspond to the following observations we’re interested in using as features for predicting heart failure:

    Heart rate: 8867-4
    Systolic blood pressure: 8480-6
    Diastolic blood pressure: 8462-4
    Body mass index (BMI): 39156-5
    Platelets [#/volume] in Blood: 777-3

  6. Choose Preview, then choose Add.
  7. Add a transform to extract Year and Quarter from the DATE column.
  8. Choose Preview, then choose Add.
  9. Choose Add step and choose Custom transform.
  10. On the drop-down menu, choose Python (PySpark).

    The five types of observations may not always be recorded on the same date. For example, a patient may visit their family doctor on January 21 and have their systolic blood pressure, diastolic blood pressure, heart rate, and body mass index measured and recorded. However, a lab test that includes platelets may be done at a later date, on February 2. Therefore, it’s not always possible to join the dataframes by the exact observation date. Here we join the dataframes at a coarser granularity: by quarter.
  11. Enter the following code:
    from pyspark.sql.functions import col
    
    systolic_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
                       .withColumnRenamed("value", "systolic")
                       .filter((col("code") == "8480-6"))
      )
    
    diastolic_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
                       .withColumnRenamed('value', 'diastolic')
                       .filter((col("code") == "8462-4"))
        )
    
    hr_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
                       .withColumnRenamed('value', 'hr')
                       .filter((col("code") == "8867-4"))
        )
    
    bmi_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
                       .withColumnRenamed('value', 'bmi')
                       .filter((col("code") == "39156-5"))
        )
    
    platelets_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
                       .withColumnRenamed('value', 'platelets')
                       .filter((col("code") == "777-3"))
        )
    
    df = (
        systolic_df.join(diastolic_df, ["patient", "DATE_year", "DATE_quarter"])
                                .join(hr_df, ["patient", "DATE_year", "DATE_quarter"])
                                .join(bmi_df, ["patient", "DATE_year", "DATE_quarter"])
                                .join(platelets_df, ["patient", "DATE_year", "DATE_quarter"])
    )

  12. Choose Preview, then choose Add.
  13. Choose Add step, then choose Manage rows.
  14. For Transform, choose Drop duplicates.
  15. Choose Preview, then choose Add.
  16. Choose Add step and choose Custom transform.
  17. On the drop-down menu, choose Python (Pandas).
  18. Enter the following code to take an average of data points that share the same time value:
    import pandas as pd
    df.loc[:, df.columns != 'patient']=df.loc[:, df.columns != 'patient'].apply(pd.to_numeric)
    df = df.groupby(['patient','DATE_year','DATE_quarter']).mean().round(0).reset_index()

  19. Choose Preview, then choose Add.

Join patients.csv and observations.csv

In this step, we showcase how to effectively and easily perform complex joins on datasets without writing any code via Data Wrangler’s powerful UI. To learn more about the supported types of joins, refer to Transform Data.

  1. To the right of Transform: patients.csv, choose the plus sign next to Steps and choose Join.
    You can see the transformed patients.csv file listed under Datasets in the left pane.
  2. To the right of Transform: observations.csv, choose Steps to initiate the join operation.
    The transformed observations.csv file is now listed under Datasets in the left pane.
  3. Choose Configure.
  4. For Join Type, choose Inner.
  5. For Left, choose Id.
  6. For Right, choose patient.
  7. Choose Preview, then choose Add.
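
Conceptually, this is the same as a Pandas inner merge between the two transformed tables. A minimal sketch, assuming the transformed data has been loaded as patients_df and observations_df (hypothetical names), is:

    # Illustrative inner join on the patient identifier.
    joined_df = patients_df.merge(
        observations_df,
        how="inner",
        left_on="Id",        # patient identifier in patients.csv
        right_on="patient",  # patient identifier in the transformed observations
    )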

Add a custom transform to the joined datasets

In this step, we calculate a patient’s age at the time of observation. We also drop columns that are no longer needed.

  1. Choose the plus sign next to 1st Join and choose Add transform.
  2. Add a custom transform in Pandas:
    df['age'] = df['DATE_year'] - df['BIRTHDATE_year']
    df = df.drop(columns=['BIRTHDATE','DEATHDATE','BIRTHDATE_year','patient'])

  3. Choose Preview, then choose Add.

Add custom transforms to conditions.csv

  1. Choose the plus sign next to Transform: conditions.csv and choose Add transform.
  2. Add a custom transform in Pandas:
    df = df[df["CODE"].isin(['84114007', '88805009', '59621000', '44054006', '53741008', '449868002', '49436004'])]
    df = df.drop(columns=['DESCRIPTION','ENCOUNTER','STOP'])

Note: As we demonstrated earlier, you can drop columns either with custom code or with the built-in transformations provided by Data Wrangler. Custom transformations within Data Wrangler provide the flexibility to bring your own transformation logic in the form of code snippets in the supported frameworks. These snippets can later be searched and applied if needed.

The codes in the preceding transform are SNOMED-CT codes that correspond to the following conditions. The heart failure or chronic congestive heart failure condition becomes the label. We use the remaining conditions as features for predicting heart failure. We also drop a few columns that are no longer needed.

Heart failure: 84114007
Chronic congestive heart failure: 88805009
Hypertension: 59621000
Diabetes: 44054006
Coronary Heart Disease: 53741008
Smokes tobacco daily: 449868002
Atrial Fibrillation: 49436004
  1. Next, let’s add a custom transform in PySpark:
    from pyspark.sql.functions import col, when
    
    heartfailure_df = (
        df.select("patient", "start")
                          .withColumnRenamed("start", "heartfailure")
                       .filter((col("code") == "84114007") | (col("code") == "88805009"))
      )
    
    hypertension_df = (
        df.select("patient", "start")
                       .withColumnRenamed("start", "hypertension")
                       .filter((col("code") == "59621000"))
      )
    
    diabetes_df = (
        df.select("patient", "start")
                       .withColumnRenamed("start", "diabetes")
                       .filter((col("code") == "44054006"))
      )
    
    coronary_df = (
        df.select("patient", "start")
                       .withColumnRenamed("start", "coronary")
                       .filter((col("code") == "53741008"))
      )
    
    smoke_df = (
        df.select("patient", "start")
                       .withColumnRenamed("start", "smoke")
                       .filter((col("code") == "449868002"))
      )
    
    atrial_df = (
        df.select("patient", "start")
                       .withColumnRenamed("start", "atrial")
                       .filter((col("code") == "49436004"))
      )
    
    df = (
        heartfailure_df.join(hypertension_df, ["patient"], "leftouter").withColumn("has_hypertension", when(col("hypertension") < col("heartfailure"), 1).otherwise(0))
        .join(diabetes_df, ["patient"], "leftouter").withColumn("has_diabetes", when(col("diabetes") < col("heartfailure"), 1).otherwise(0))
        .join(coronary_df, ["patient"], "leftouter").withColumn("has_coronary", when(col("coronary") < col("heartfailure"), 1).otherwise(0))
        .join(smoke_df, ["patient"], "leftouter").withColumn("has_smoke", when(col("smoke") < col("heartfailure"), 1).otherwise(0))
        .join(atrial_df, ["patient"], "leftouter").withColumn("has_atrial", when(col("atrial") < col("heartfailure"), 1).otherwise(0))
    )

    We perform a left outer join to keep all entries in the heart failure dataframe. A new column has_xxx is calculated for each condition other than heart failure based on the condition’s start date. We’re only interested in medical conditions that were recorded prior to the heart failure and use them as features for predicting heart failure.

  2. Add a built-in Manage columns transform to drop the redundant columns that are no longer needed:
    1. hypertension
    2. diabetes
    3. coronary
    4. smoke
    5. atrial
  3. Extract Year and Quarter from the heartfailure column.
    This matches the granularity we used earlier in the transformation of the observations dataset.
  4. We should have a total of 6 steps for conditions.csv.

Join conditions.csv to the joined dataset

We now perform a new join to join the conditions dataset to the joined patients and observations dataset.

  1. Choose Transform: 1st Join.
  2. Choose the plus sign and choose Join.
  3. Choose Steps next to Transform: conditions.csv.
  4. Choose Configure.
  5. For Join Type, choose Left outer.
  6. For Left, choose Id.
  7. For Right, choose patient.
  8. Choose Preview, then choose Add.

Add transforms to the joined datasets

Now that we have all three datasets joined, let’s apply some additional transformations.

  1. Add the following custom transform in PySpark so has_heartfailure becomes our label column:
    from pyspark.sql.functions import col, when
    df = (
        df.withColumn("has_heartfailure", when(col("heartfailure").isNotNull(), 1).otherwise(0))
    )

  2. Add the following custom transformation in PySpark:
    from pyspark.sql.functions import col
    
    df = (
        df.filter(
          (col("has_heartfailure") == 0) | 
          ((col("has_heartfailure") == 1) & ((col("date_year") <= col("heartfailure_year")) | ((col("date_year") == col("heartfailure_year")) & (col("date_quarter") <= col("heartfailure_quarter")))))
        )
    )

    We’re only interested in observations recorded before the heart failure condition is diagnosed and use them as features for predicting heart failure. Observations taken after heart failure is diagnosed may be affected by the medication a patient takes, so we want to exclude them.

  3. Drop the redundant columns that are no longer needed:
    1. Id
    2. DATE_year
    3. DATE_quarter
    4. patient
    5. heartfailure
    6. heartfailure_year
    7. heartfailure_quarter
  4. On the Analysis tab, for Analysis type, choose Table summary.
    A quick scan through the summary shows that the MARITAL column has missing data.
  5. Choose the Data tab and add a step.
  6. Choose Handle Missing.
  7. For Transform, choose Fill missing.
  8. For Input columns, choose MARITAL.
  9. For Fill value, enter S.
    Our strategy here is to assume the patient is single if the marital status has a missing value; you can choose a different strategy. A Pandas sketch of these fill steps follows this list.
  10. Choose Preview, then choose Add.
  11. Fill the missing values with 0 for has_hypertension, has_diabetes, has_coronary, has_smoke, and has_atrial.
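
As a point of reference, the same fill logic expressed in Pandas (for example, inside a custom transform) is a one-liner per column:

    # Illustrative equivalent of the Fill missing steps above.
    df["MARITAL"] = df["MARITAL"].fillna("S")  # assume single when marital status is missing
    for col in ["has_hypertension", "has_diabetes", "has_coronary", "has_smoke", "has_atrial"]:
        df[col] = df[col].fillna(0)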

Marital and Gender are categorical variables. Data Wrangler has a built-in function to encode categorical variables.

  1. Add a step and choose Encode categorical.
  2. For Transform, choose One-hot encode.
  3. For Input columns, choose MARITAL.
  4. For Output style, choose Column.
    This output style produces encoded values in separate columns.
  5. Choose Preview, then choose Add.
  6. Repeat these steps for the Gender column.

The one-hot encoding splits the Marital column into Marital_M (married) and Marital_S (single), and splits the Gender column into Gender_M (male) and Gender_F (female). Because Marital_M and Marital_S are mutually exclusive (as are Gender_M and Gender_F), we can drop one column to avoid redundant features.

  1. Drop Marital_S and Gender_F.
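
The built-in one-hot encoding followed by dropping one column per pair is roughly what pandas.get_dummies with drop_first=True does in a single call. The sketch below is illustrative only; note that drop_first drops the first category alphabetically, so it may keep a different level than the Marital_M/Gender_M columns kept above, and the exact column names depend on your dataset.

    import pandas as pd

    # Illustrative one-hot encoding that drops one level per categorical column
    # to avoid perfectly collinear (redundant) features.
    df = pd.get_dummies(df, columns=["MARITAL", "GENDER"], drop_first=True)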

Numeric features such as systolic, heart rate, and age use different unit standards. For a linear regression-based model, we would need to normalize these numeric features first; otherwise, features with larger absolute values could gain an unwarranted advantage over features with smaller absolute values and hurt model performance. Data Wrangler has the built-in Min-max scaler transform to normalize the data. For a decision tree-based classification model, normalization isn’t required, and because our study is a classification problem, we don’t apply normalization here.

Imbalanced classes are a common problem in classification. Imbalance happens when the training dataset contains a severely skewed class distribution. For example, when our dataset contains disproportionately more patients without heart failure than patients with heart failure, the model can become biased toward predicting no heart failure and perform poorly. Data Wrangler has a built-in function to tackle this problem.
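
Outside Data Wrangler, the same balancing idea can be sketched with the imbalanced-learn package (an assumption here, not part of the Data Wrangler flow); the built-in Balance data transform used below does this for us without any code:

    # Conceptual sketch of SMOTE oversampling; requires numeric features.
    from collections import Counter
    from imblearn.over_sampling import SMOTE

    X = df.drop(columns=["has_heartfailure"])
    y = df["has_heartfailure"]

    X_balanced, y_balanced = SMOTE(sampling_strategy=1.0, random_state=42).fit_resample(X, y)
    print(Counter(y), Counter(y_balanced))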

  1. Add a custom transform in Pandas to convert data type of columns from “object” type to numeric type:
    import pandas as pd
    df=df.apply(pd.to_numeric)

  2. Choose the Analysis tab.
  3. For Analysis type, choose Histogram.
  4. For X axis, choose has_heartfailure.
  5. Choose Preview.

    It’s obvious that we have an imbalanced class (more data points labeled as no heart failure than data points labeled as heart failure).
  6. Go back to the Data tab. Choose Add step and choose Balance data.
  7. For Target column, choose has_heartfailure.
  8. For Desired ratio, enter 1.
  9. For Transform, choose SMOTE.

    SMOTE stands for Synthetic Minority Over-sampling Technique. It’s a technique to create new minority instances and add to the dataset to reach class balance. For detailed information, refer to SMOTE: Synthetic Minority Over-sampling Technique.
  10. Choose Preview, then choose Add.
  11. Repeat the earlier histogram analysis (steps 2–5). The result is now a balanced class.

Visualize target leakage and feature correlation

Next, we’re going to perform a few visual analyses using Data Wrangler’s rich toolset of advanced ML-supported analysis types. First, we look at target leakage. Target leakage occurs when data in the training dataset is strongly correlated with the target label, but isn’t available in real-world data at inference time.

  1. On the Analysis tab, for Analysis type, choose Target Leakage.
  2. For Problem Type, choose classification.
  3. For Target, choose has_heartfailure.
  4. Choose Preview.

    Based on the analysis, hr is a target leakage; we’ll drop it in a subsequent step. age is also flagged as target leakage. It’s reasonable to expect that a patient’s age will be available at inference time, so we keep age as a feature. Systolic and diastolic are also flagged as likely target leakage. We expect to have these two measurements at inference time, so we keep them as features as well.
  5. Choose Add to add the analysis.

Then, we look at feature correlation. We want to select features that are correlated with the target but are uncorrelated among themselves.

  1. On the Analysis tab, for Analysis type, choose Feature Correlation.
  2. For Correlation Type, choose linear.
  3. Choose Preview.

The coefficient scores indicate strong correlations between the following pairs:

  • systolic and diastolic
  • bmi and age
  • has_hypertension and has_heartfailure (label)

When features are strongly correlated, the resulting feature matrix is ill-conditioned and computationally difficult to invert, which can lead to numerically unstable estimates. To mitigate the correlation, we can simply remove one feature from each pair. We drop diastolic and bmi and keep systolic and age in a subsequent step.
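
If you prefer to double-check these pairs in code, a quick Pandas approximation of the linear (Pearson) correlation analysis looks like this; it is a rough sketch and assumes the dataframe contains only numeric feature columns at this point:

    import numpy as np

    # Absolute pairwise Pearson correlations; mask the diagonal so self-correlation
    # doesn't dominate. Each pair appears twice because the matrix is symmetric.
    corr = df.corr().abs()
    mask = ~np.eye(len(corr), dtype=bool)
    print(corr.where(mask).stack().sort_values(ascending=False).head(10))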

Drop the hr, diastolic, and bmi columns

Add additional transform steps to drop the hr, diastolic, and bmi columns using the built-in Drop column transform.

Generate the Data Quality and Insights Report

AWS recently announced the new Data Quality and Insights Report feature in Data Wrangler. This report automatically verifies data quality and detects abnormalities in your data. Data scientists and data engineers can use this tool to efficiently and quickly apply domain knowledge to process datasets for ML model training. This step is optional. To generate this report on our datasets, complete the following steps:

  1. On the Analysis tab, for Analysis type, choose Data Quality and Insights Report.
  2. For Target column, choose has_heartfailure.
  3. For Problem type, select Classification.
  4. Choose Create.

In a few minutes, it generates a report with a summary, visuals, and recommendations.

Generate a Quick Model analysis

We have completed our data preparation, cleaning, and feature engineering. Data Wrangler has a built-in function that provides a rough estimate of the expected prediction quality and the predictive power of the features in our dataset.

  1. On the Analysis tab, for Analysis type, choose Quick Model.
  2. For Label, choose has_heartfailure.
  3. Choose Preview.

As per our Quick Model analysis, we can see the feature has_hypertension has the highest feature importance score among all features.

Export the data and train the model

Now let’s export the transformed, ML-ready features to a destination S3 bucket and scale the entire feature engineering pipeline we have created so far (using the samples) to the full dataset in a distributed fashion.

  1. Choose the plus sign next to the last box in the data flow and choose Add destination.
  2. Choose Amazon S3.
  3. Enter a Dataset name. For Amazon S3 location, choose an S3 bucket, then choose Add Destination.
  4. Choose Create job to launch a distributed PySpark processing job to perform the transformation and output the data to the destination S3 bucket.

    Depending on the size of the datasets, this option lets us easily configure the cluster and scale horizontally in a no-code fashion. We don’t have to worry about partitioning the datasets or managing the cluster and Spark internals; all of this is automatically taken care of for us by Data Wrangler.
  5. On the left pane, choose Next, 2. Configure job.

  6. Then choose Run.

Alternatively, we can also export the transformed output to Amazon S3 via a Jupyter notebook. With this approach, Data Wrangler automatically generates a Jupyter notebook with all the code needed to kick off a processing job that applies the data flow steps (created using a sample) to the larger, full dataset, and to use the transformed dataset as features for a training job later. The notebook code can be run as is or with changes. Let’s now walk through how to do this via Data Wrangler’s UI.

  1. Choose the plus sign next to the last step in the data flow and choose Export to.
  2. Choose Amazon S3 (via Jupyter Notebook).
  3. It automatically opens a new tab with a Jupyter notebook.
  4. In the Jupyter notebook, locate the cell in the (Optional) Next Steps section and change run_optional_steps from False to True.
    The enabled optional steps in the notebook perform the following:

    • Train a model using XGBoost
  5. Go back to the top of the notebook and on the Run menu, choose Run All Cells.

If you use the generated notebook as is, it launches a SageMaker processing job that scales out the processing across two m5.4xlarge instances to process the full dataset in the S3 bucket. You can adjust the number of instances and instance types based on the dataset size and the time you need to complete the job.

Wait until the training job from the last cell is complete. It generates a model in the SageMaker default S3 bucket.
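
For intuition, the optional training step in the notebook is conceptually similar to fitting a gradient-boosted classifier on the exported features. The following local sketch is an assumption-laden stand-in (it supposes the transformed data has been downloaded to train.csv with has_heartfailure as the label); the generated notebook itself uses SageMaker’s managed XGBoost training rather than this code:

    import pandas as pd
    import xgboost as xgb
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Hypothetical local copy of the transformed dataset exported by Data Wrangler.
    data = pd.read_csv("train.csv")
    X = data.drop(columns=["has_heartfailure"])
    y = data["has_heartfailure"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = xgb.XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    model.fit(X_train, y_train)
    print("Validation AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))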

The trained model is ready for deployment for either real-time inference or batch transformation. Note that we used synthetic data to demonstrate Data Wrangler’s functionality and used the processed data to train the model. Because the data is synthetic, inference results from the trained model are not meant for real-world medical diagnosis or as a substitute for the judgment of medical practitioners.

You can also directly export your transformed dataset to Amazon S3 by choosing Export at the top of the transform preview page. The direct export option only exports the transformed sample if sampling was enabled during the import. This option is best suited if you’re dealing with smaller datasets. The transformed data can also be ingested directly into a feature store. For more information, refer to Amazon SageMaker Feature Store. The data flow can also be exported as a SageMaker pipeline that can be orchestrated and scheduled as per your requirements. For more information, see Amazon SageMaker Pipelines.

Conclusion

In this post, we showed how to use Data Wrangler to process healthcare data and perform scalable feature engineering in a tool-driven, low-code fashion. We learned how to apply the built-in transformations and analyses wherever needed, combining them with custom transformations to add even more flexibility to our data preparation workflow. We also walked through the different options for scaling out the data flow recipe via distributed processing jobs, and saw how the transformed data can easily be used to train a model that predicts heart failure.

There are many other features in Data Wrangler we haven’t covered in this post. Explore what’s possible in Prepare ML Data with Amazon SageMaker Data Wrangler and learn how to leverage Data Wrangler for your next data science or machine learning project.


About the Authors

Forrest Sun is a Senior Solution Architect with the AWS Public Sector team in Toronto, Canada. He has worked in the healthcare and finance industries for the past two decades. Outside of work, he enjoys camping with his family.

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.


Enabling Creative Expression with Concept Activation Vectors

Advances in computer vision and natural language processing continue to unlock new ways of exploring billions of images available on public and searchable websites. Today’s visual search tools make it possible to search with your camera, voice, text, images, or multiple modalities at the same time. However, it remains difficult to input subjective concepts, such as visual tones or moods, into current systems. For this reason, we have been working collaboratively with artists, photographers, and image researchers to explore how machine learning (ML) might enable people to use expressive queries as a way of visually exploring datasets.

Today, we are introducing Mood Board Search, a new ML-powered research tool that uses mood boards as a query over image collections. This enables people to define and evoke visual concepts on their own terms. Mood Board Search can be useful for subjective queries, such as “peaceful”, or for words and individual images that may not be specific enough to produce useful results in a standard search, such as “abstract details in overlooked scenes” or “vibrant color palette that feels part memory, part dream.” We developed, and will continue to develop, this research tool in alignment with our AI Principles.

Search Using Mood Boards
With Mood Board Search, our goal is to design a flexible and approachable interface so people without ML expertise can train a computer to recognize a visual concept as they see it. The tool interface is inspired by mood boards, commonly used by people in creative fields to communicate the “feel” of an idea using collections of visual materials.

With Mood Board Search, users can train a computer to recognize visual concepts in image collections.

To get started, simply drag and drop a small number of images that represent the idea you want to convey. Mood Board Search returns the best results when the images share a consistent visual quality, so results are more likely to be relevant with mood boards that share visual similarities in color, pattern, texture, or composition.

It’s also possible to signal which images are more important to a visual concept by upweighting or downweighting images, or by adding images that are the opposite of the concept. Then, users can review and inspect search results to understand which part of an image best matches the visual concept. Focus mode does this by revealing a bounding box around part of the image, while AI crop cuts in directly, making it easier to draw attention to new compositions.

Supported interactions, like AI crop, allow users to see which part of an image best matches their visual concept.

Powered by Concept Activation Vectors (CAVs)
Mood Board Search takes advantage of pre-trained computer vision models, such as GoogLeNet and MobileNet, and a machine learning approach called Concept Activation Vectors (CAVs).

CAVs are a way for machines to represent images (what we understand) using numbers or directions in a neural net’s embedding space (which can be thought of as what machines understand). CAVs can be used as part of a technique, Testing with CAVs (TCAV), to quantify the degree to which a user-defined concept is important to a classification result; e.g., how sensitive a prediction of “zebra” is to the presence of stripes. This is a research approach we open-sourced in 2018, and the work has since been widely applied to medical applications and science to build ML applications that can provide better explanations for what machines see. You can learn more about embedding vectors in general in this Google AI blog post, and our approach to working with TCAVs in Been Kim’s Keynote at ICLR.

In Mood Board Search, we use CAVs to find a model’s sensitivity to a mood board created by the user. In other words, each mood board creates a CAV — a direction in embedding space — and the tool searches an image dataset, surfacing images that are the closest match to the CAV. However, the tool takes it one step further, by segmenting each image in the dataset in 15 different ways, to uncover as many relevant compositions as possible. This is the approach behind features like Focus mode and AI crop.
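
To make the idea concrete, here is an illustrative sketch (not the Mood Board Search implementation) of how a CAV can be derived and used for ranking: train a linear classifier to separate mood-board embeddings from random-image embeddings, treat its weight vector as the concept direction, and score dataset images by their similarity to that direction. Embedding extraction from a pre-trained model is assumed to happen elsewhere.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def concept_activation_vector(concept_embeddings, random_embeddings):
        # Fit a linear separator between concept and random examples; its weight
        # vector points in the direction of the concept in embedding space.
        X = np.vstack([concept_embeddings, random_embeddings])
        y = np.concatenate([np.ones(len(concept_embeddings)), np.zeros(len(random_embeddings))])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        cav = clf.coef_[0]
        return cav / np.linalg.norm(cav)

    def rank_by_concept(dataset_embeddings, cav):
        # Higher dot product means the image embedding lies further along the concept direction.
        scores = dataset_embeddings @ cav
        return np.argsort(scores)[::-1]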

Three artists created visual concepts to share their way of seeing, shown here in an experimental app by design invention studio, Nord Projects.

Because embedding vectors can be learned and re-used across models, tools like Mood Board Search can help us express our perspective to other people. Early collaborations with creative communities have shown value in being able to create and share subjective experiences with others, resulting in feelings of being able to “break out of visually-similar echo chambers” or “see the world through another person’s eyes”. Even misalignment between model and human understanding of a concept frequently resulted in unexpected and inspiring connections for collaborators. Taken together, these findings point towards new ways of designing collaborative ML systems that embrace personal and collective subjectivity.

Conclusions and Future Work
Today, we’re open-sourcing the code to Mood Board Search, including three visual concepts made by our collaborators, and a Mood Board Search Python Library for people to tap the power of CAVs directly into their own websites and apps. While these tools are early-stage prototypes, we believe this capability can have a wide range of applications, from exploring unorganized image collections to externalizing ways of seeing into collaborative and shareable artifacts. Already, an experimental app by design invention studio Nord Projects, made using Mood Board Search, investigates the opportunities for running CAVs in camera, in real time. In future work, we plan to use Mood Board Search to learn about new forms of human-machine collaboration and expand ML models and inputs — like text and audio — to allow even deeper subjective discoveries, regardless of medium.

If you’re interested in a demo of this work for your team or organization, email us at cav-experiments-support@google.com.

Acknowledgments
This blog presents research by (in alphabetical order): Kira Awadalla, Been Kim, Eva Kozanecka, Alison Lentz, Alice Moloney, Emily Reif, and Oliver Siy, in collaboration with design invention studio Nord Projects. We thank our co-author, Eva Kozanecka, our artist collaborators, Alexander Etchells, Tom Hatton, Rachel Maggart, the Imaging team at The British Library for their participation in beta previews, and Blaise Agüera y Arcas, Jess Holbrook, Fernanda Viegas, and Martin Wattenberg for their support of this research project.


Does AutoML work for diverse tasks?

Over the past decade, machine learning (ML) has grown rapidly in both popularity and complexity. Driven by advances in deep neural networks, ML is now being applied far beyond its traditional domains like computer vision and text processing, with applications in areas as diverse as solving partial differential equations (PDEs), tracking credit card fraud, and predicting medical conditions from gene sequences. However, progress in such areas has often required expert-driven development of complex neural network architectures, expensive hyperparameter tuning, or both. Given that such resource intensive iteration is expensive and inaccessible to most practitioners, AutoML has emerged with an overarching goal of enabling any team of ML developers to deploy ML on arbitrary new tasks. Here we ask about the current status of AutoML, namely: can available AutoML tools quickly and painlessly attain near-expert performance on diverse learning tasks?

This blog post is dedicated to two recent but related efforts that measure the field’s current effectiveness at achieving this goal: NAS-Bench-360 and the AutoML Decathlon. The first is a benchmark suite focusing on the burgeoning field of neural architecture search (NAS), which seeks to automate the development of neural network models. With evaluations on ten diverse tasks—including a precomputed tabular benchmark on three of them—NAS-Bench-360 is the first NAS testbed that goes beyond traditional AI domains such as vision, text, and audio signals. Specifically, the 10 tasks vary in their domain (including image, finance time series, audio, and natural sciences), problem type (including regression, single-label, and multi-label classification), and scale (ranging from several thousands to hundreds of thousands of observations).

The second is a NeurIPS 2022 competition (which we are soft-launching today!) that builds on our NAS-Bench-360 work yet has a broader vision of understanding what is truly the best approach for a practitioner to take when faced with a modern ML problem.  During the public development phase of the competition we will release a set of diverse tasks that will be representative of (but distinct from) the final set of test tasks on which evaluation will be performed. Unlike most past competitions in the AutoML community, competitors in the AutoML Decathlon are free (and in fact encouraged) to consider a wide range of approaches from traditional hyperparameter optimization and ensembling methods to modern techniques such as NAS and large-scale transfer learning.

You can learn more about getting involved with either of these efforts at the bottom of this post.

NAS-Bench-360: A NAS Benchmark for diverse tasks

NAS-Bench-360 is a benchmark suite consisting of ten ML tasks that we developed jointly with Renbo Tu, Nick Roberts, Junhong Shen, and Fred Sala. These tasks represent a diverse set of signals, including various kinds of imaging sources, simulation data, genomic data, and more. At the same time, we constrain all tasks to be amenable to modern NAS search spaces, i.e. we do not include tabular or graph-based data, thus allowing for the application of most NAS methods. Our evaluation on NAS-Bench-360 is thus a robustness test that checks whether the massive amount of largely computer vision-driven progress in the field of NAS is actually indicative of wider success of AutoML across a variety of applications, data types, and tasks. More importantly, the benchmark will serve as a useful tool to develop and evaluate new, better methods for NAS.

So can AutoML tools—specifically NAS methods—quickly and painlessly attain near-expert performance on NAS-Bench-360? In positive news, searching over a large search space such as DARTS using a state-of-the-art algorithm such as GAEA does yield models that outperform available expert architectures on half of the tasks, in addition to consistently beating perennial Kaggle favorite XGBoost and a recent attempt at a general-purpose architecture, Perceiver IO. On the other hand it fails catastrophically on several tasks, doing little better than a simple baseline, namely a tuned Wide ResNet (Figure 1, left panel). Indeed, despite being developed on CIFAR-10 it does surprisingly poorly on 2D classification tasks from the medical and audio domains. Furthermore, in a resource-constrained setting where AutoML methods are not given much more time than running a single architecture, the leading NAS method DenseNAS does worse than an untuned Wide ResNet (Figure 1, right panel).

Figure 2:  Whereas high-performance architectures on vision datasets often perform well on other vision datasets (left), we use NAS-Bench-360 to show that this does not translate to diverse tasks (right).

Our evaluation of modern NAS methods on NAS-Bench-360 demonstrates the need for such a benchmark and a lack of robustness in the field. NAS-Bench-360 is also useful for understanding past and future search spaces and algorithms, specifically whether current beliefs about NAS extend to diverse tasks. For example, Figure 2 shows that high-performing architectures transfer well between vision tasks—a quality used extensively in NAS research—but not between diverse tasks. Other examples of scientific uses of NAS-Bench-360—such as one investigating a recent paper on operation redundancy—are provided in our paper and in a recent ICLR 2022 blog post on zero-cost proxies. We also expect NAS-Bench-360 to be used for the development of new NAS methods; to further this, for two of the datasets we provide precomputed models for all architectures in the NAS-Bench-201 search space; together with existing CIFAR-100 precompute results this means three NAS-Bench-360 datasets have precomputed tabular benchmarks to accelerate search algorithm development. 

The AutoML Decathlon: A competition focused on diverse tasks and methods

Our goal in releasing NAS-Bench-360 is to spur the development of NAS methods that work well on diverse tasks. However, given the mixed performance of NAS on this benchmark, there remains a question of whether automatic architecture design should even be the focus of AutoML research more broadly. Building on our efforts from NAS-Bench-360, a group of researchers at CMU, Hewlett Packard Enterprise (HPE), Wisconsin-Madison, and Morgan Stanley are organizing the AutoML Decathlon competition at NeurIPS 2022 precisely to ask the following broader question: what automated technique(s) are best for diverse tasks?

This competition is designed to address two gaps between research and practice:

  1. Lack of task diversity. The field of NAS is no exception here, as the vast majority of recent AutoML benchmarking and competition efforts have focused on computer vision or other well studied tasks in speech and language processing. Evaluating AutoML methods on such well-studied tasks does not give a good indication of their utility on more far-afield applications.
  2. Siloed methodological development. Many developments in AutoML narrowly focus on particular techniques rather than the downstream benefits to the end user. A practitioner with a specific ML task ultimately cares about the quality of the resulting model (in terms of accuracy and other non-accuracy metrics), as opposed to the underlying technical details of the procedure yielding this model, e.g., whether the model is the result of a weight-sharing NAS method, a fine-tuned large model,  a more classical non-deep learning AutoML technique, or some other automated procedure.

By designing our competition in a practitioner-centric fashion and accounting for the two aforementioned gaps, our competition aims to spur innovation in AutoML with results that are directly transferable to ML practitioners. We envision that the results of our competition will provide novel empirical insights into several open practical and scientific questions, including:

  • Given the growing methodological diversity of (Auto)ML approaches, what methods should I consider as a practitioner in 2022?  
  • How do leading NAS methods compare to the increasingly popular pre-training/fine-tuning paradigm?
  • How do either of these more modern approaches compare to classical AutoML approaches or to standard baselines such as XGBoost or a tuned ResNet?
  • Should I consider using any AutoML procedure given that I’m working on a specific scientific, technological, or industrial problem that seemingly differs drastically from well-studied tasks in computer vision and NLP?  
  • Given a reasonable computational budget, can any AutoML approach (whether classical or more modern) consistently outperform bespoke models that were hand-crafted by domain experts, ML experts, or both?

Figure 3: Summary of the AutoML Decathlon competition timeline. To ensure efficiency, the evaluation will be conducted under a fixed computational budget. To ensure robustness, the performance profile methodology described above will be used for determining the winners.

We note that while AutoML is not a new research area, we view our competition as being particularly timely given (1) rapid growth of ML task diversity, (2) progress in ML model development, and (3) acceleration in the scale of both datasets and available compute resources. Indeed, recent progress along these three dimensions has led us to make remarkably different design choices from those of past competitions like the AutoDL competition, which was launched just three years ago. For instance, we work with bigger datasets, allow larger computational budgets, consider an expanding set of applications, and perform more robust evaluations based on performance profiles. Relatedly, while over the past three years we’ve witnessed significant progress in NAS and the emergence of the pretrain/fine-tuning paradigm in various settings, neither of these types of approaches featured prominently in the AutoDL competition (or other past competitions). In contrast, we hypothesize that these approaches will be more prominently featured in the AutoML Decathlon.
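
Performance profiles here presumably follow the standard Dolan-Moré formulation: for each task, every method’s score is divided by the best score achieved on that task, and the profile reports the fraction of tasks on which a method stays within a factor tau of the best. A minimal sketch of that reading (our interpretation, not the competition’s official scoring code):

    import numpy as np

    def performance_profile(scores, taus):
        # scores[i, j]: loss (lower is better) of method j on task i.
        # Returns rho[j, k]: fraction of tasks where method j is within a factor
        # taus[k] of the best method on that task.
        ratios = scores / scores.min(axis=1, keepdims=True)
        return np.stack([(ratios <= t).mean(axis=0) for t in taus], axis=1)

    # Toy example: 3 tasks, 2 methods.
    toy_scores = np.array([[1.0, 2.0], [3.0, 3.0], [10.0, 5.0]])
    print(performance_profile(toy_scores, taus=[1.0, 2.0, 4.0]))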

The AutoML Decathlon is built around a set of 20 datasets that we have curated which represent a broad spectrum of practical applications in scientific, technological, and industrial domains. As explained in Figure 3, ten of the tasks will be used for development and an additional ten tasks will be used for final evaluation and revealed only after the competition. We will provide computational resources to participants as needed, with funding provided by Morgan Stanley. The results of our performance-profile based evaluation will determine monetary prizes, including a $15K first prize, with sponsorship provided by HPE.

Getting Involved: Using NAS-Bench-360 and competing in the AutoML Decathlon

Our goal with both NAS-Bench-360 and the AutoML Decathlon is to encourage community participation in evaluating what AutoML is already good at, what areas need improving, and what directions seem most promising for future work. We hope that these rigorous benchmarking activities will help the field more rapidly move towards a truly democratized ML toolkit that can be used by researchers and practitioners alike.

To learn more, check out the following links:

  • NAS-Bench-360: You can download the ten datasets on the website, and learn more about the benchmark and our various insights from our paper.
  • AutoML Decathlon: The competition officially starts next week and runs through mid-October, but we are soft-launching today to spread the word. You can learn more about the details at the competition website and the associated CodaLab website.

Also, stay tuned for a follow-up blog post in which our collaborator Junhong Shen describes our recent algorithmic NAS work targeting diverse tasks.

Read More

GFN Thursday Brings New Games to GeForce NOW for the Perfect Summer Playlist

Nothing beats the summer heat like GFN Thursday. Get ready for four new titles streaming at GeForce quality across nearly any device.

Buckle up for some great gaming, whether poolside, in the car for a long road trip, or in the air-conditioned comfort of home.

Speaking of summer, it’s also last call for this year’s Steam Summer Sale. Check out the special row in the GeForce NOW app for some great gaming deals before the sale ends today at 10am PDT.

Choose Your Adventure

With more than 1,300 games in the GeForce NOW library, there’s something for everyone. Single-player adventures? Check. Multiplayer battles? Got that, too. GFN Thursday brings more games each week, and it’s nearly impossible to play them all.

Catch up on titles you’ve been eyeing and put together a gaming playlist that fits the perfect summer mood. From blockbuster free-to-play action role-playing games like Genshin Impact and Lost Ark to story-driven sagas like Life is Strange: True Colors, high-speed action in NASCAR 21: Ignition and more, there are plenty of options to keep gamers busy.

There’s something for everyone on GeForce NOW.

Find your next adventure in the native GeForce NOW app or on play.geforcenow.com. Search for a game or genre using the top bar to build out the perfect gaming library. Streaming the game from GeForce-powered servers enables gamers to keep the action going, even on a Mac, mobile device, Chromebook and more.

Even better: RTX 3080 members can play at up to 4K resolution and 60 frames per second on PC and Mac, or take the action to the living room on the recently updated SHIELD TV. They can also take on opponents with ultra-low latency and turn RTX ON in supported titles for the most cinematic visuals.

Press Play

Stand with the squad on the front lines in “Arma Reforger.”

Not sure where to start? Check out this week’s new additions to squad up in Arma Reforger, bring home the trophy in Matchpoint – Tennis Championships and more.

Here’s what’s coming to GeForce NOW this week:

  • Matchpoint – Tennis Championships (New release on Steam July 7)
  • Starship Troopers – Terran Command (New release on Epic Games Store July 7)
  • Sword and Fairy Inn 2 (New release on Steam, July 8)
  • Arma Reforger (Steam)

rFactor 2 was previously announced as coming to GeForce NOW, but at this time the title will not be coming to the service.

Finally, speaking of your summer playlist, we have a question that may get you a bit nostalgic. Let us know your answer on Twitter or in the comments below.

The post GFN Thursday Brings New Games to GeForce NOW for the Perfect Summer Playlist appeared first on NVIDIA Blog.

Read More

Wordle for AI: Santiago Valderrama on Getting Smarter on Machine Learning

Want to learn about AI and machine learning? There are plenty of resources out there to help — blogs, podcasts, YouTube tutorials — perhaps too many.

Machine learning engineer Santiago Valderrama has taken a far more focused approach to helping us all get smarter about the field.

He’s created a following by posing one machine learning question every day on his website bnomial.com.

Think of it as Wordle for those of us who want to learn more about machine learning.

As Valderrama wrote in a LinkedIn post: “I got together with a couple of friends and built bnomial, a site with a simple goal, a non-BS simple way to learn something new as fast as possible. We published one machine learning question every day. That’s it. You load the page, answer the question and return the next day. Rinse and repeat.”

NVIDIA AI Podcast host Noah Kravitz spoke with Valderrama about bnomial, how to get smarter about machine learning, and his own journey in the field.

You Might Also Like

What Is Conversational AI? ZeroShot Bot CEO Jason Mars Explains

In addition to being an entrepreneur and CEO of several startups, including ZeroShot Bot, Jason Mars is an associate professor of computer science at the University of Michigan and the author of Breaking Bots: Inventing a New Voice in the AI Revolution. He discusses how the latest AI techniques intersect with the very ancient art of conversation.

Recommender Systems 101: NVIDIA’s Even Oldridge Breaks It Down

Even Oldridge, senior manager for the Merlin team at NVIDIA, digs into how recommender systems work — and why these systems are being harnessed by companies in industries around the globe.

NVIDIA’s Jonah Alben Talks AI

Imagine building an engine with 54 billion parts. Now imagine each piece is the size of a gnat’s eyelash. That gives you some idea of the scale Jonah Alben works at. Alben is the co-leader of GPU engineering at NVIDIA. The engines he builds are GPUs — which these days do much of the heavy lifting for the latest and greatest form of computing: AI.

Subscribe to the AI Podcast: Now Available on Amazon Music

The AI Podcast is now available through Amazon Music.

In addition, get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

The post Wordle for AI: Santiago Valderrama on Getting Smarter on Machine Learning appeared first on NVIDIA Blog.

Read More

AI4Science to empower the fifth paradigm of scientific discovery

Christopher Bishop, Distinguished Scientist, Managing Director, Microsoft Research Cambridge Lab

Over the coming decade, deep learning looks set to have a transformational impact on the natural sciences. The consequences are potentially far-reaching and could dramatically improve our ability to model and predict natural phenomena over widely varying scales of space and time. Could this capability represent the dawn of a new paradigm of scientific discovery?

Jim Gray, a Turing Award winner, and former Microsoft Technical Fellow, characterised the historical evolution of scientific discovery through four paradigms. With origins dating back thousands of years, the first paradigm was purely empirical and based on direct observation of natural phenomena. While many regularities were apparent in these observations, there was no systematic way to capture or express them. The second paradigm was characterised by theoretical models of nature, such as Newton’s laws of motion in the seventeenth century, or Maxwell’s equations of electrodynamics in the nineteenth century. Derived by induction from empirical observation, such equations allowed generalization to a much broader range of situations than those observed directly. While these equations could be solved analytically for simple scenarios, it was not until the development of digital computers in the twentieth century that they could be solved in more general cases, leading to a third paradigm based on numerical computation. By the dawn of the twenty-first century computation was again transforming science, this time through the ability to collect, store and process large volumes of data, leading to the fourth paradigm of data-intensive scientific discovery. Machine learning forms an increasingly important component of the fourth paradigm, allowing the modelling and analysis of large volumes of experimental scientific data. These four paradigms are complementary and coexist. 

The pioneering quantum physicist Paul Dirac commented in 1929 that “The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble.” For example, Schrödinger’s equation describes the behaviour of molecules and materials at the subatomic level with exquisite precision, and yet numerical solution with high accuracy is only possible for very small systems consisting of a handful of atoms. Scaling to larger systems requires increasingly drastic approximations leading to a challenging trade-off between scale and accuracy. Even so, quantum chemistry calculations are already of such high practical value that they form one of the largest supercomputer workloads. 
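
For concreteness, the equation Dirac had in mind can be written in its standard time-independent, Born-Oppenheimer electronic form as follows; the key point is that the wavefunction depends jointly on the coordinates of all N electrons, which is why exact numerical treatment becomes intractable beyond a handful of atoms.

```latex
% Time-independent electronic Schrödinger equation (Born-Oppenheimer form).
% The wavefunction \Psi lives in 3N dimensions for N electrons, so the cost of
% an exact treatment grows exponentially with system size.
\hat{H}\,\Psi(\mathbf{r}_1,\dots,\mathbf{r}_N) = E\,\Psi(\mathbf{r}_1,\dots,\mathbf{r}_N),
\qquad
\hat{H} = -\sum_{i=1}^{N}\frac{\hbar^2}{2m_e}\nabla_i^2
          \;-\;\sum_{i,A}\frac{Z_A e^2}{4\pi\varepsilon_0\,\lvert\mathbf{r}_i-\mathbf{R}_A\rvert}
          \;+\;\sum_{i<j}\frac{e^2}{4\pi\varepsilon_0\,\lvert\mathbf{r}_i-\mathbf{r}_j\rvert}
```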

However, over the last year or two, we have seen the emergence of a new way to exploit deep learning, as a powerful tool to address this speed-versus-accuracy trade-off for scientific discovery. This is a very different use of machine learning from the modelling of data that characterizes the fourth paradigm, because the data that is used to train the neural networks itself comes from numerical solution of the fundamental equations of science rather than from empirical observation. We can view the numerical solutions of scientific equations as simulators of the natural world that can be used, at high computational cost, to compute quantities of interest in applications such as forecasting the weather, modelling the collision of galaxies, optimizing the design of fusion reactors, or calculating the binding affinities of candidate drug molecules to a target protein. From a machine learning perspective, however, the intermediate details of the simulation can be viewed as training data which can be used to train deep learning emulators. Such data is perfectly labelled, and the quantity of data is limited only by computational budget. Once trained, the emulator can perform new calculations with high efficiency, achieving significant improvements in speed, sometimes by several orders of magnitude. 
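
To make the workflow concrete, here is a toy sketch of the simulate-then-emulate loop described above, using a deliberately simple stand-in simulator (a damped oscillator) rather than any of the scientific codes mentioned. The structure is the same: the simulator produces perfectly labelled data, a neural network is fitted to it, and the trained emulator then answers new queries far faster than re-running the simulation.

```python
# Toy sketch of the simulate -> train -> emulate workflow; the "simulator" here
# is a stand-in (a damped oscillator), not a real scientific code.
import numpy as np
import torch
import torch.nn as nn

def simulate(params, n_steps=2000, dt=1e-3):
    """'Expensive' numerical solver: integrate a damped oscillator and return
    the final displacement for a given (stiffness, damping) pair."""
    k, c = params
    x, v = 1.0, 0.0
    for _ in range(n_steps):
        a = -k * x - c * v
        v += a * dt
        x += v * dt
    return x

# 1) The training labels come from the simulator itself, not from experiments.
rng = np.random.default_rng(0)
params = rng.uniform([0.5, 0.0], [5.0, 1.0], size=(1000, 2))
targets = np.array([simulate(p) for p in params])

# 2) Fit a small neural-network emulator to the simulated data.
X = torch.tensor(params, dtype=torch.float32)
y = torch.tensor(targets, dtype=torch.float32).unsqueeze(1)
emulator = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, 1))
opt = torch.optim.Adam(emulator.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(emulator(X), y)
    loss.backward()
    opt.step()

# 3) Once trained, the emulator evaluates new parameter settings in microseconds
#    instead of re-running the full numerical integration.
print(emulator(torch.tensor([[2.0, 0.3]])).item(), simulate((2.0, 0.3)))
```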

This ‘fifth paradigm’ of scientific discovery represents one of the most exciting frontiers for machine learning as well as for the natural sciences. While there is a long way to go before these emulators are sufficiently fast, robust, and general-purpose to become mainstream, the potential for real-world impact is clear. For example, the number of small-molecule drug candidates alone is estimated at 10^60, while the total number of stable materials is thought to be around 10^180 (roughly the square of the number of atoms in the known universe). Finding more efficient ways to explore these vast spaces would transform our ability to discover new substances such as better drugs to treat disease, improved substrates for capturing atmospheric carbon dioxide, better materials for batteries, new electrodes for fuel cells to power the hydrogen economy, and myriad others.

AI4Science is an effort deeply rooted in Microsoft’s mission, applying the full breadth of our AI capabilities to develop new tools for scientific discovery so that we and others in the scientific community can confront some of humanity’s most important challenges. Microsoft Research has a 30+ year legacy of curiosity and discovery, and I believe that the AI4Science team – spanning geographies and scientific fields – has the potential to yield extraordinary contributions to that legacy.

Kevin Scott, Executive Vice President and Chief Technology Officer, Microsoft

I’m delighted to announce today that I will be leading a new global team in Microsoft Research, spanning the UK, China and the Netherlands, to focus on bringing this fifth paradigm to reality. Our AI4Science team encompasses world experts in machine learning, quantum physics, computational chemistry, molecular biology, fluid dynamics, software engineering, and other disciplines who are working together to tackle some of the most pressing challenges in this field.

An example project is Graphormer, led by my colleague Tie-Yan Liu in our China team. This is a deep learning package that allows researchers and developers to train custom models for molecular modelling tasks such as materials science or drug discovery. Recently, Graphormer won the Open Catalyst Challenge, a molecular dynamics competition that aims to model catalyst-adsorbate reaction systems with AI and provides more than 0.66 million catalyst-adsorbate relaxation systems (144 million structure-energy frames) simulated by density functional theory (DFT) software. Another project, from our team in Cambridge in collaboration with Novartis, is Generative Chemistry, where together we are empowering scientists with AI to speed up the discovery and development of breakthrough medicines.

As Iya Khalil, Global Head of the AI Innovation Lab at Novartis, recently noted, the work is no longer science fiction but science-in-action:

“Not only can AI learn from our past experiments, but, with each new iteration of designing and testing in the lab, the machine learning algorithms can identify new patterns and help guide the early drug discovery and development process. Hopefully in doing this we can augment our human scientists’ expertise so they can design better molecules faster.”

The team has since used the platform to generate several promising early-stage molecules which have been synthesised for further exploration.

Alongside our teams in China and the UK, we have been growing a team in the Netherlands, including hiring the world-renowned machine learning expert Max Welling. I am also excited to announce today that our brand-new lab in Amsterdam will be housed in Matrix One, which is currently under construction on the Amsterdam Science Park. This purpose-built space is in close proximity to the University of Amsterdam and the Vrije Universiteit Amsterdam, and we will maintain strong affiliations with both institutions through the co-supervision of PhD students.

Matrix One building in Amsterdam

It is with pride and excitement that we take this next step, coming together as a cross-geographical team and following in the footsteps of the pioneers before us, to contribute to this next paradigm of scientific discovery and, in doing so, to help address many important societal challenges. If you share our excitement and ambition and would like to join us, I encourage you to look at our open positions or get in touch to talk to anyone on the team.

The post AI4Science to empower the fifth paradigm of scientific discovery appeared first on Microsoft Research.

Read More

Smart textiles sense how their users are moving

Using a novel fabrication process, MIT researchers have produced smart textiles that snugly conform to the body so they can sense the wearer’s posture and motions.

By incorporating a special type of plastic yarn and using heat to slightly melt it — a process called thermoforming — the researchers were able to greatly improve the precision of pressure sensors woven into multilayered knit textiles, which they call 3DKnITS.

They used this process to create a “smart” shoe and mat, and then built a hardware and software system to measure and interpret data from the pressure sensors in real time. The machine-learning system predicted motions and yoga poses performed by an individual standing on the smart textile mat with about 99 percent accuracy.

Their fabrication process, which takes advantage of digital knitting technology, enables rapid prototyping and can be easily scaled up for large-scale manufacturing, says Irmandy Wicaksono, a research assistant in the MIT Media Lab and lead author of a paper presenting 3DKnITS.

The technique could have many applications, especially in health care and rehabilitation. For example, it could be used to produce smart shoes that track the gait of someone who is learning to walk again after an injury, or socks that monitor pressure on a diabetic patient’s foot to prevent the formation of ulcers.

“With digital knitting, you have this freedom to design your own patterns and also integrate sensors within the structure itself, so it becomes seamless and comfortable, and you can develop it based on the shape of your body,” Wicaksono says.

He wrote the paper with MIT undergraduate students Peter G. Hwang, Samir Droubi, and Allison N. Serio through the Undergraduate Research Opportunities Program; Franny Xi Wu, a recent graduate of Wellesley College; Wei Yan, an assistant professor at Nanyang Technological University; and senior author Joseph A. Paradiso, the Alexander W. Dreyfoos Professor and director of the Responsive Environments group within the Media Lab. The research will be presented at the IEEE Engineering in Medicine and Biology Society Conference.

“Some of the early pioneering work on smart fabrics happened at the Media Lab in the late ’90s. The materials, embeddable electronics, and fabrication machines have advanced enormously since then,” Paradiso says. “It’s a great time to see our research returning to this area, for example through projects like Irmandy’s — they point at an exciting future where sensing and functions diffuse more fluidly into materials and open up enormous possibilities.”

Knitting know-how

To produce a smart textile, the researchers use a digital knitting machine that weaves together layers of fabric with rows of standard and functional yarn. The multilayer knit textile is composed of two layers of conductive yarn knit sandwiched around a piezoresistive knit, which changes its resistance when squeezed. Following a pattern, the machine stitches this functional yarn throughout the textile in horizontal and vertical rows. Where the functional fibers intersect, they create a pressure sensor, Wicaksono explains.

But yarn is soft and pliable, so the layers shift and rub against each other when the wearer moves. This generates noise and variability that make the pressure sensors much less accurate.

Wicaksono came up with a solution to this problem while working in a knitting factory in Shenzhen, China, where he spent a month learning to program and maintain digital knitting machines. He watched workers making sneakers using thermoplastic yarns that would start to melt when heated above 70 degrees Celsius, which slightly hardens the textile so it can hold a precise shape.

He decided to try incorporating melting fibers and thermoforming into the smart textile fabrication process.

“The thermoforming really solves the noise issue because it hardens the multilayer textile into one layer by essentially squeezing and melting the whole fabric together, which improves the accuracy. That thermoforming also allows us to create 3D forms, like a sock or shoe, that actually fit the precise size and shape of the user,” he says.

Once he perfected the fabrication process, Wicaksono needed a system to accurately process pressure sensor data. Since the fabric is knit as a grid, he crafted a wireless circuit that scans through rows and columns on the textile and measures the resistance at each point. He designed this circuit to overcome artifacts caused by “ghosting” ambiguities, which occur when the user exerts pressure on two or more separate points simultaneously.
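
The paper’s actual circuit design isn’t reproduced here, but the scanning idea can be sketched in a few lines: drive one row at a time, hold the remaining rows at ground so current can’t sneak through neighbouring sensing points (the source of ghosting), and sample every column before moving on. The hardware hooks drive_row and read_column below are hypothetical placeholders for whatever microcontroller interface is used.

```python
# Hypothetical sketch of row/column scanning for a resistive pressure grid;
# drive_row and read_column are placeholder hardware hooks, not a real API.
import numpy as np

class SensorMatrixScanner:
    def __init__(self, drive_row, read_column, n_rows, n_cols):
        self.drive_row = drive_row        # drive_row(i, active): energize row i or ground it
        self.read_column = read_column    # read_column(j): ADC reading for column j
        self.n_rows = n_rows
        self.n_cols = n_cols

    def scan(self):
        """Return one (n_rows, n_cols) frame of raw pressure readings."""
        frame = np.zeros((self.n_rows, self.n_cols))
        for i in range(self.n_rows):
            # Energize exactly one row; grounding the rest blocks the sneak
            # current paths that cause ghosting when several points are pressed.
            for r in range(self.n_rows):
                self.drive_row(r, active=(r == i))
            for j in range(self.n_cols):
                frame[i, j] = self.read_column(j)
        return frame
```

Each completed scan yields a 2D frame that can then be rendered as the heat map described next.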

Inspired by deep-learning techniques for image classification, Wicaksono devised a system that displays pressure sensor data as a heat map. Those images are fed to a machine-learning model, which is trained to detect the posture, pose, or motion of the user based on the heat map image.
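
As a rough illustration of that image-classification framing (not the authors’ exact architecture), a small convolutional network over the single-channel pressure frames might look like the following; the 32×32 grid size and seven-class output are assumptions chosen to match the yoga-pose example.

```python
# Illustrative sketch: a small CNN that classifies each pressure heat map;
# the 32x32 input size and 7 output classes are assumptions, not the paper's.
import torch
import torch.nn as nn

class PoseClassifier(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, n_classes)

    def forward(self, x):
        # x: (batch, 1, 32, 32) normalised pressure frames from the scanner
        return self.classifier(self.features(x).flatten(1))

# A single scanned frame, scaled to [0, 1], would be classified like this:
logits = PoseClassifier()(torch.rand(1, 1, 32, 32))
predicted_pose = logits.argmax(dim=1)
```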

Analyzing activities

Once the model was trained, it could classify the user’s activity on the smart mat (walking, running, doing push-ups, etc.) with 99.6 percent accuracy and could recognize seven yoga poses with 98.7 percent accuracy.

They also used a circular knitting machine to create a form-fitted smart textile shoe with 96 pressure sensing points spread across the entire 3D textile. They used the shoe to measure pressure exerted on different parts of the foot when the wearer kicked a soccer ball.   

The high accuracy of 3DKnITS could make them useful for applications in prosthetics, where precision is essential. A smart textile liner could measure the pressure a prosthetic limb places on the socket, enabling a prosthetist to easily see how well the device fits, Wicaksono says.

He and his colleagues are also exploring more creative applications. In collaboration with a sound designer and a contemporary dancer, they developed a smart textile carpet that drives musical notes and soundscapes based on the dancer’s steps, to explore the bidirectional relationship between music and choreography. This research was recently presented at the ACM Creativity and Cognition Conference.

“I’ve learned that interdisciplinary collaboration can create some really unique applications,” he says.

Now that the researchers have demonstrated the success of their fabrication technique, Wicaksono plans to refine the circuit and machine learning model. Currently, the model must be calibrated to each individual before it can classify actions, which is a time-consuming process. Removing that calibration step would make 3DKnITS easier to use. The researchers also want to conduct tests on smart shoes outside the lab to see how environmental conditions like temperature and humidity impact the accuracy of sensors.

“It’s always amazing to see technology advance in ways that are so meaningful. It is incredible to think that the clothing we wear, an arm sleeve or a sock, can be created in ways that its three-dimensional structure can be used for sensing,” says Eric Berkson, assistant professor of orthopaedic surgery at Harvard Medical School and sports medicine orthopaedic surgeon at Massachusetts General Hospital, who was not involved in this research. “In the medical field, and in orthopedic sports medicine specifically, this technology provides the ability to better detect and classify motion and to recognize force distribution patterns in real-world (out of the laboratory) situations. This is the type of thinking that will enhance injury prevention and detection techniques and help evaluate and direct rehabilitation.”

This research was supported, in part, by the MIT Media Lab Consortium.

Read More