Cyberpunk 2077 Brings a Taste of the Future with DLSS

Analyst reports. Academic papers. Ph.D. programs. There are a lot of places you can go to get a glimpse of the future. But the best place might just be El Coyote Cojo, a whiskey-soaked dive bar that doesn’t exist in real life.

Fire up Cyberpunk 2077 and you’ll see much more than the watering hole’s colorful clientele. You’ll see refractions and reflections, shadows and smoke, all in the service of creating more than just eye candy — each element works in tandem with the game’s expansive and engaging story.

Patching In: Cyberpunk 2077’s DLSS 3 Upgrade

It’s a tale that gets more mesmerizing with every patch — the updates game developers periodically release to keep their games at the cutting edge. Today’s addition brings NVIDIA DLSS 3, the latest in neural graphics.

DLSS 3 is a package that includes a number of sophisticated technologies. Combining DLSS Super Resolution, all-new DLSS Frame Generation, and NVIDIA Reflex, running on the new hardware capabilities of GeForce RTX 40 Series GPUs, DLSS 3 multiplies performance while maintaining great image quality and responsiveness.

The performance uplift this delivers lets PC gamers experience more of Cyberpunk 2077’s gritty glory. And it sets the stage for the pending Ray Tracing Overdrive Mode, an update that will escalate the game’s ray tracing, a technique long used to create blockbuster films and enhance the game’s already-incredible visuals.

The gaming press — perhaps the most brutal critics of the visual arts — are already raving about DLSS 3.

“I’m deeply in love with DLSS with Frame Generation,” gushes PC Gamer. “DLSS 3 is incredible, and NVIDIA’s tech is undeniably a selling point for the [GeForce RTX] 4080,” asserts PCGamesN. “[I]t’s a phenomenal achievement in graphics performance,” states Digital Foundry.

Twenty-one games now support DLSS 3, including Dying Light 2 Stay Human, Hitman 3, Marvel’s Midnight Suns, Microsoft Flight Simulator, Portal with RTX, The Witcher 3: Wild Hunt and Warhammer 40,000: Darktide. More are coming, including Atomic Heart, ILL SPACE and Warhaven.

Playing with the Future

There are many tales on the increasingly immersive streets of Cyberpunk 2077’s Night City, but the one even non-gamers should pay attention to is the story behind these stories: gaming as a proving ground for the technologies that will shape the future that Cyberpunk 2077 is simulating right before our eyes.

This is the best of the best. CD PROJEKT RED is known for supporting its flagship titles like Cyberpunk 2077 and The Witcher 3: Wild Hunt for extended periods of time with a variety of patches that take advantage of modern hardware. It has earned a reputation as a game development studio that embraces emerging technologies.

That makes its games more than a cultural phenomenon: they’re a technology proving ground, a position held over the past two decades by a string of titles revered by gamers, such as Crysis, Metro and Far Cry.

PC Games Unleash Global Innovation

Building digital worlds such as these is the hard computing problem — the meanest streets in our increasingly digital world — out of which the parallel computing engines that are GPUs emerged.

A decade ago, GPUs sparked the deep-learning revolution that has upended trillion-dollar industries around the world, one that continues with the latest advancements in generative AI such as ChatGPT and Dall-E that have erupted over the past month into a global cultural sensation.

It’s a case study in the disruptive innovations Harvard Business School Professor Clayton Christensen identified as lurking in unexpected places.

DLSS brings that revolution full circle, using the same deep-learning techniques harnessed for everything from cutting-edge science to self-driving cars to advance the visual quality of games.

Trained on NVIDIA’s supercomputers, DLSS enhances a new generation of games that demand ever more performance. And the use of DLSS 3 is just one example of this benchmark game’s innovations — innovations woven into the texture of the game’s storytelling.

CD PROJEKT RED uses DirectX Ray Tracing, for example, a lighting technique that emulates the way light reflects and refracts in the real world to provide a more believable environment than what’s typically seen using static lighting in more traditional games.

The game uses several ray-tracing techniques to render a massive future city at incredible levels of detail. The current version of the game uses ray-traced shadows, reflections, diffuse illumination and ambient occlusion.

And if you turn on “Psycho mode” in the game’s ray-traced lighting settings, you’ll even see ray-traced global illumination as sunlight bounces realistically around the scene.

Cyberpunk 2077’s Visual Storytelling Packs a Punch 

The result of all these features is a visually stunning experience that complements the world’s story and tone: sprawling cityscapes that use subtle shadows to define depth, districts bathed in neon lights, and windows, mirrors and puddles glistening with accurate reflections.

With realistic shadows and lighting and the added performance of NVIDIA DLSS 3, no other platform will compare to the Cyberpunk 2077 experience on a GeForce RTX-powered PC.

But that’s just part of the bigger story.

Games like these offer a window into the kind of visual capabilities now at the fingertips of architects and designers. It’s a taste of the simulation capabilities being put to work by engineers at NASA and Lawrence Livermore Labs. And it shows what’s possible in the next-generation environments for digital collaboration and simulation now being harnessed at scale by manufacturers such as BMW.

So muscle the geek in your life aside from the PC for an evening, grab the latest patch for Cyberpunk 2077 and a GeForce RTX 40 Series GPU, and gawk at the game’s abundance of power and potential, put on display right in front of your face.

It’s where we’ll see the future first, and that future is looking better than ever.

Find out more on GeForce.com.

New AI classifier for indicating AI-written text

We’re launching a classifier trained to distinguish between AI-written and human-written text.

We’ve trained a classifier to distinguish between text written by a human and text written by AIs from a variety of providers. While it is impossible to reliably detect all AI-written text, we believe good classifiers can inform mitigations for false claims that AI-generated text was written by a human: for example, running automated misinformation campaigns, using AI tools for academic dishonesty, and positioning an AI chatbot as a human.

Our classifier is not fully reliable. In our evaluations on a “challenge set” of English texts, our classifier correctly identifies 26% of AI-written text (true positives) as “likely AI-written,” while incorrectly labeling human-written text as AI-written 9% of the time (false positives). Our classifier’s reliability typically improves as the length of the input text increases. Compared to our previously released classifier, this new classifier is significantly more reliable on text from more recent AI systems.

We’re making this classifier publicly available to get feedback on whether imperfect tools like this one are useful. Our work on the detection of AI-generated text will continue, and we hope to share improved methods in the future.

Limitations

Our classifier has a number of important limitations. It should not be used as a primary decision-making tool, but instead as a complement to other methods of determining the source of a piece of text.

  1. The classifier is very unreliable on short texts (below 1,000 characters). Even longer texts are sometimes incorrectly labeled by the classifier.
  2. Sometimes human-written text will be incorrectly but confidently labeled as AI-written by our classifier.
  3. We recommend using the classifier only for English text. It performs significantly worse in other languages and it is unreliable on code.
  4. Text that is very predictable cannot be reliably identified. For example, it is impossible to predict whether a list of the first 1,000 prime numbers was written by AI or humans, because the correct answer is always the same.
  5. AI-written text can be edited to evade the classifier. Classifiers like ours can be updated and retrained based on successful attacks, but it is unclear whether detection has an advantage in the long-term.
  6. Classifiers based on neural networks are known to be poorly calibrated outside of their training data. For inputs that are very different from text in our training set, the classifier is sometimes extremely confident in a wrong prediction.

Training the classifier

Our classifier is a language model fine-tuned on a dataset of pairs of human-written text and AI-written text on the same topic. We collected this dataset from a variety of sources that we believe to be written by humans, such as the pretraining data and human demonstrations on prompts submitted to InstructGPT. We divided each text into a prompt and a response. On these prompts we generated responses from a variety of different language models trained by us and other organizations. For our web app, we adjust the confidence threshold to keep the false positive rate very low; in other words, we only mark text as likely AI-written if the classifier is very confident.
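To make the thresholding idea concrete, here is a minimal sketch (not OpenAI’s actual implementation) of picking a decision threshold from classifier scores on known human-written validation text so that the false positive rate stays below a target; the score distribution and the 1% target are hypothetical.

import numpy as np

def pick_threshold(human_scores, target_fpr=0.01):
    # The false positive rate at threshold t is the fraction of
    # human-written validation documents scoring >= t, so taking the
    # (1 - target_fpr) quantile caps the FPR at roughly target_fpr.
    return float(np.quantile(human_scores, 1.0 - target_fpr))

# Hypothetical scores (higher = "more likely AI-written") for a
# validation set of known human-written text.
rng = np.random.default_rng(0)
human_scores = rng.beta(2, 5, size=10_000)

threshold = pick_threshold(human_scores, target_fpr=0.01)
print(f"Flag text as likely AI-written only when score >= {threshold:.3f}")

A higher threshold trades recall (more AI-written text slips through) for fewer false accusations, which is the direction the post says the web app leans.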

Impact on educators and call for input

We recognize that identifying AI-written text has been an important point of discussion among educators, and equally important is recognizing the limits and impacts of AI-generated text classifiers in the classroom. We have developed a preliminary resource on the use of ChatGPT for educators, which outlines some of the uses and associated limitations and considerations. While this resource is focused on educators, we expect our classifier and associated classifier tools to have an impact on journalists, mis/disinformation researchers, and other groups.

We are engaging with educators in the US to learn what they are seeing in their classrooms and to discuss ChatGPT’s capabilities and limitations, and we will continue to broaden our outreach as we learn. These are important conversations to have as part of our mission, which is to deploy large language models safely, in direct contact with affected communities.

If you’re directly impacted by these issues (including but not limited to teachers, administrators, parents, students, and education service providers), please provide us with feedback using this form. Direct feedback on the preliminary resource is helpful, and we also welcome any resources that educators are developing or have found helpful (e.g., course guidelines, honor code and policy updates, interactive tools, AI literacy programs).


Contributors

Michael Lampe, Joanne Jang, Pamela Mishkin, Andrew Mayne, Henrique Ponde de Oliveira Pinto, Valerie Balcom, Michelle Pokrass, Jeff Belgum, Madelaine Boyd, Heather Schmidt, Sherwin Wu, Logan Kilpatrick, Thomas Degry

OpenAI

Broadcaster ‘Nilson1489’ Shares Livestreaming Techniques and More This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

Broadcasters have an arsenal of new features and technologies at their disposal.

These include the eighth-generation NVIDIA video encoder on RTX 40 Series GPUs with support for the open AV1 video-coding format; new NVIDIA Broadcast app effects like Eye Contact and Vignette; and support for AV1 streaming in Discord — joining integrations with software including OBS Studio, Blackmagic Design’s DaVinci Resolve, Adobe Premiere Pro via the Voukoder plugin, Wondershare Filmora and Jianying.

Livestreamer, video editor and entertainer Nilson1489 steps In the NVIDIA Studio this week to demonstrate how these broadcasting advancements elevate his livestreams — in style and substance — using a GeForce RTX 4090 GPU and the power of AI.

In addition, the Warbb World Challenge, hosted by famed 3D artist Warbb, is underway. It invites artists to create their own 3D worlds. Prizes include an NVIDIA Studio laptop, RTX 40 Series GPUs from MSI and ArtStation gift cards. Learn more below.

Better Broadcast Benefits

Content creators looking to get into the livestreaming hustle, professional YouTubers and other broadcasters, regardless of skill level or audience, can benefit from using GeForce RTX 40 Series GPUs — featuring the eighth-generation NVIDIA video encoder, NVENC, with support for AV1.

The new AV1 encoder delivers 40% better efficiency. This means livestreams will appear as if bandwidth was increased by 40% — a big boost in image quality — in popular broadcast apps like OBS Studio.

Discord, a communication platform with over 150 million active monthly users, has enabled end-to-end livestreams with AV1. This dramatically improves screen sharing — whether for livestreaming, online classes or virtual hangouts with friends — with crisp, clear image quality at up to 4K resolution and 60 frames per second.

AV1 delivers up to 40% better encoding efficiency, boosting effective bandwidth and video quality.

The integration takes advantage of AV1’s advanced compression efficiency, so users with AV1 decode-capable hardware will experience even higher-quality video. Plus, users with slower internet connections can now enjoy higher-quality video streams at up to 4K resolution and 60fps.

In addition, NVIDIA Studio recently released NVIDIA Broadcast 1.4 — a tool for livestreaming and video conferencing that turns virtually any room into a home studio — with two effects, Eye Contact and Vignette, as well as an enhancement to Virtual Background that uses temporal information. Learn more about Broadcast — available for all RTX GPU owners including this week’s featured artist, Nilson1489.

Give a Boost to Broadcasts

Hailing from Hamburg, Germany, Nilson1489 is a self-taught livestreamer. He possesses a deep passion, stemming from his involvement in the livestreaming community, for helping to improve the creative workflows of emerging broadcasters who are eager to learn.

Nilson1489 said he invested in a GeForce RTX 4090 GPU expecting better visual livestreaming quality across the board and considerable time savings in his creative workflows. And that’s exactly what he experienced.

“With NVIDIA Broadcast, I’m able to look on my display to read notes or focus on tutorial elements without losing eye contact with the audience.”
—Nilson1489

“NVIDIA RTX GPUs have the best GPU acceleration for my creative apps as well as the best quality when it comes to recording inside OBS Studio,” the livestreamer said.

Nilson1489 streams primarily in OBS Studio, where the AV1 encoder’s 40% efficiency gain acts like a 40% bandwidth boost, dramatically improving video quality.

As a teacher for creators and consultant for various brands and clients, Nilson1489 leads daily calls and workshops over Microsoft Teams, Zoom and other video conference apps supported by NVIDIA Broadcast. He can read notes and present while keeping strong eye contact with his followers, made possible by NVIDIA Broadcast’s new Eye Contact feature.

His GeForce RTX 4090 GPU proved especially handy when exporting final video files with its dual AV1 video encoders, he said. When enabled in video-editing and livestreaming apps — such as Adobe Premiere Pro via the Voukoder plug-in, DaVinci Resolve, Wondershare Filmora and Jianying — export times are cut in half, with improved video quality. This enabled Nilson1489 to export from Premiere Pro and upload his videos to YouTube at least twice as fast as his competitors.

NVIDIA GeForce RTX GPUs.

The right GeForce RTX GPU can make a massive difference in the quality and quantity of content creation, as it did for Nilson1489.

Livestreamer Nilson1489.

Check out Nilson1489’s YouTube channel for streaming tutorials.

Create a 3D World, Win Serious Studio Hardware

3D talent Robin Snijders, aka Warbb, together with NVIDIA Studio, presents the Warbb World Challenge, where 3D artists are invited to transform a traditionally boring space into an extraordinary scene using assets provided by Warbb. Everyone starts with the same template: an empty room, table, laptop and person.

A panel of creative talents, including Warbb, In the NVIDIA Studio artist I Am Fesq, Noxx_art and two NVIDIA reps will judge entries based on creativity, originality and visual appeal. Contest winners will receive incredible prizes, including an MSI Creator Z16P 3080 Ti Studio Laptop, RTX 40 Series GPUs from MSI and ArtStation gift cards.

The Warbb World Challenge’s grand prize: an MSI Creator Z16P Studio Laptop equipped with an NVIDIA RTX 3080 Ti GPU.

Enter by downloading the challenge assets, uploading your submission to ArtStation, and sharing it on your social media channels, tagging both posts with the hashtags #WarbbWorld and #NVIDIAStudio. NVIDIA Studio could feature you in an in-depth interview to add exposure to your world.

The challenge runs through Sunday, Feb. 19. Terms and conditions apply.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Amazon SageMaker built-in LightGBM now offers distributed training using Dask

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

Starting today, the SageMaker LightGBM algorithm offers distributed training using the Dask framework for both tabular classification and regression tasks. It’s available through the SageMaker Python SDK. The supported data format can be either CSV or Parquet. We conducted extensive benchmarking experiments on four publicly available datasets with various settings to validate its performance.

Customers are increasingly interested in training models on large datasets with SageMaker LightGBM, which can take a day or even longer. In these cases, you might be able to speed up the process by distributing training over multiple machines or processes in a cluster. This post discusses how SageMaker LightGBM helps you set up and launch distributed training, without the expense and difficulty of directly managing your training clusters.

Problem statement

Machine learning has become an essential tool for extracting insights from large amounts of data. From image and speech recognition to natural language processing and predictive analytics, ML models have been applied to a wide range of problems. As datasets continue to grow in size and complexity, traditional training methods can become increasingly time-consuming and resource-intensive. This is where distributed training comes into play.

Distributed training is a technique that allows for the parallel processing of large amounts of data across multiple machines or devices. By splitting the data and training multiple models in parallel, distributed training can significantly reduce training time and improve the performance of models on big data. In recent years, distributed training has been a popular mechanism in training deep neural networks for use cases such as large language models (LLMs), image generation and classification, and text generation tasks using frameworks like PyTorch, TensorFlow, and MXNet. In this post, we discuss how distributed training can be applied to tabular data (a common type of data found in many industries such as finance, healthcare, and retail) using Dask and the LightGBM algorithm for tasks such as regression and classification.

Dask is an open-source parallel computing library that allows for distributed parallel processing of large datasets in Python. It’s designed to work with the existing Python and data science ecosystem, such as NumPy and pandas. When it comes to distributed training, Dask can be used to parallelize the data loading, preprocessing, and model training tasks, and it integrates well with popular ML algorithms like LightGBM. LightGBM is a gradient boosting framework that uses tree-based learning algorithms and is designed to be efficient and scalable for training large models on big data. Combining these two powerful libraries, LightGBM v3.2.0 and later integrate with Dask to allow distributed learning across multiple machines to produce a single model.

How distributed training works

Distributed training for tree-based algorithms is a technique that is used when the dataset is too large to be processed on a single instance or when the computational resources of a single instance are not sufficient to train the tree-based model in a reasonable amount of time. It allows a model to be trained across multiple instances or machines, rather than on a single machine. This is done by dividing the dataset into smaller subsets, called chunks, and distributing them among the available instances. Each instance then trains a model on its assigned chunk of data, and the results are later combined using aggregation algorithms to form a single model.

In tree-based models like LightGBM, the main computational cost is in the building of the tree structure. This is typically done by sorting and selecting subsets of the data.

Now, let’s explore how LightGBM does the parallel training. LightGBM can use three types of parallelism:

  • Data parallelism – This is the most basic form of data parallelism. The data is divided horizontally into smaller subsets and distributed among multiple instances. Each instance constructs its local histogram, and all histograms are merged, then a split is performed using a reduce scatter algorithm. A histogram in local instances is constructed by dividing the subset of the local data into discrete bins, and counting the number of data points in each bin. This histogram-based algorithm helps speed up the training and reduces memory usage.
  • Feature parallelism – In feature parallelism, each machine is responsible for training a subset of the features of the model, rather than a subset of the data. This can be useful when working with datasets that have a large number of features, because it allows for more efficient use of resources. It works by finding the best local split point in each instance, then communicates the best split with the other instances. LightGBM implementation maintains all features of the data in every machine to reduce the cost of communicating the best splits.
  • Voting parallelism – In voting parallelism, the data is divided into smaller subsets and distributed among multiple machines. Each machine trains a model on its assigned subset of data, and the results are later combined to form a single, larger model. However, instead of using the gradients from all the machines to update the model parameters, a voting mechanism is used to decide which gradients to use. This can be useful when working with datasets that have a lot of noise or outliers, because it can help reduce the impact of these on the final model. At the time of writing this post, LightGBM integration with Dask only supports data and voting parallelism types.

SageMaker will automatically set up and manage a Dask cluster when using multiple instances with the LightGBM built-in container.
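Outside of SageMaker, the same integration is exposed directly by the open-source library. Below is a minimal sketch using lightgbm’s Dask estimators (available since LightGBM v3.2.0); the synthetic data, worker count, and hyperparameters are illustrative, and the SageMaker built-in container manages the equivalent scheduler/worker setup for you.

import dask.array as da
import lightgbm as lgb
from dask.distributed import Client, LocalCluster

# Stand up a local Dask cluster; on SageMaker, one instance runs the
# scheduler and the rest run workers, as described above.
cluster = LocalCluster(n_workers=4)
client = Client(cluster)

# Synthetic tabular data, partitioned into chunks across workers.
X = da.random.random((100_000, 20), chunks=(25_000, 20))
y = (da.random.random(100_000, chunks=(25_000,)) > 0.5).astype(int)

# tree_learner selects the parallelism type: "data" or "voting"
# (feature parallelism is not supported with Dask).
model = lgb.DaskLGBMClassifier(n_estimators=100, tree_learner="data")
model.fit(X, y)

preds = model.predict(X)  # returns a Dask array
print(preds[:5].compute())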

Solution overview

When a training job using LightGBM is started with multiple instances, we first create a Dask cluster. One instance acts as the Dask scheduler, and the remaining instances have Dask workers, where each worker has multiple threads. Each worker in the cluster has part of the data to perform the distributed computations, as illustrated in the following figure.

Enable distributed training

The requirements for the input data are as follows:

  • The supported input data format for training can be either CSV or Parquet. You are allowed to put more than one data file under both train and validation channels. If multiple files are identified, the algorithm will concatenate all of them as the training or validation data. The name of the data file can be any string as long as it ends with .csv or .parquet.
  • For each data file, the algorithm requires that the target variable is in the first column and that it should not have a header record. This follows the convention of the SageMaker XGBoost algorithm.
  • If your predictors include categorical features, you can provide a JSON file named cat_index.json in the same location as your training data. This file should contain a Python dictionary, where the key can be any string and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical features in your data file. The index starts with value 1, because value 0 corresponds to the target variable. The cat_index.json file should be put under the training data directory, as shown in the following example.
  • The instance type supported by distributed training is CPU.

Let’s use data in CSV format as an example. The train and validation data can be structured as follows:

-- training_dataset_s3_path
    -- data_1.csv
    -- data_2.csv
    -- data_3.csv
    -- cat_index.json
    
-- validation_dataset_s3_path
    -- data_1.csv
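
For illustration, a hypothetical cat_index.json for data files whose categorical features sit in columns 3 and 7 (column 0 being the target) could be produced as follows; the key name is arbitrary, per the requirements above.

import json

# Hypothetical mapping: categorical features live in columns 3 and 7
# of each data file. Indices start at 1 because column 0 is the target.
cat_index = {"cat_features": [3, 7]}

with open("cat_index.json", "w") as f:
    json.dump(cat_index, f)
# Resulting file contents: {"cat_features": [3, 7]}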

You can specify the input type to be either text/csv or application/x-parquet:

from sagemaker.inputs import TrainingInput

content_type = "text/csv" # or "application/x-parquet"

train_input = TrainingInput(
    training_dataset_s3_path, content_type=content_type
)

validation_input = TrainingInput(
    validation_dataset_s3_path, content_type=content_type
)

Before distributed training, you can retrieve the default hyperparameters of LightGBM and override them with custom values:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for LightGBM
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters[
    "num_boost_round"
] = "500" 

hyperparameters["tree_learner"] = "voting" ### specify either 'data' or 'voting' parallelism for distributed training. Unfortnately, for dask lightgbm, the 'feature' is not supported. See github issue: https://github.com/microsoft/LightGBM/issues/3834

To enable distributed training, you can simply set the instance_count argument of the sagemaker.estimator.Estimator class to more than 1. The rest of the work is taken care of under the hood. See the following example code:

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

training_job_name = name_from_base("sagemaker-built-in-distributed-lgb")

# Create SageMaker Estimator instance
tabular_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=4, ### select the instance count you would like to use for distributed training
    volume_size=30, ### volume_size (int or PipelineVariable): Size in GB of the storage volume to use for storing input and output data during training (default: 30).
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a SageMaker Training job by passing s3 path of the training data
tabular_estimator.fit(
    {
        "train": train_input,
        "validation": validation_input,
    }, logs=True, job_name=training_job_name
)

The following screenshots show a successful training job log from the notebook. The logs from different Amazon Elastic Compute Cloud (Amazon EC2) machines are marked by different colors.

The distributed training is also compatible with SageMaker automatic model tuning. For details, see the example notebook.
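
As a hedged sketch of pairing the estimator above with automatic model tuning (the objective metric name, its log regex, and the parameter ranges below are illustrative assumptions; consult the example notebook for the exact values the container emits):

from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Illustrative search space over standard LightGBM hyperparameters.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.01, 0.3, scaling_type="Logarithmic"),
    "num_boost_round": IntegerParameter(100, 1000),
}

tuner = HyperparameterTuner(
    estimator=tabular_estimator,  # the distributed Estimator defined above
    objective_metric_name="rmse",  # assumed metric name
    objective_type="Minimize",
    metric_definitions=[{"Name": "rmse", "Regex": "rmse: ([0-9\\.]+)"}],  # assumed log format
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({"train": train_input, "validation": validation_input})

Each tuning trial launches its own multi-instance training job, so the Dask cluster setup described earlier applies per trial.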

Benchmarking

We conducted benchmarking experiments to validate the performance of distributed training in SageMaker LightGBM on four different publicly available datasets for regression, binary, and multi-class classification tasks. The experiment details are as follows:

  • Each dataset is split into training, validation, and test data following an 80/10/10 split rule. For each dataset and instance type and count, we train LightGBM on the training data; record metrics such as billable time (per instance), total runtime, average training loss at the end of the last built tree over all instances, and validation loss at the end of the last built tree; and evaluate its performance on the hold-out test data.
  • For each trial, we use the exact same set of hyperparameter values, with the number of trees being 500 except for the lending dataset. For the lending dataset, we use 100 as the number of trees because it’s sufficient to get optimal results on the hold-out test data.
  • Each number presented in the table is averaged over three trials.
  • Because each model is trained with one fixed set of hyperparameter values, the evaluation metric numbers on the hold-out test data can be further improved with hyperparameter optimization.

Billable time refers to the absolute wall-clock time. The total runtime is the elastic time running the distributed training, which includes the billable time and time to spin up instances and install dependencies. For the validation loss at the end of the last built tree, we didn’t do the average over all the instances as the training loss because all of the validation data is assigned to a single instance and therefore only that instance has the validation loss metric. Out of Memory (OOM) means the dataset hit the out of memory error during training. The loss function and evaluation metrics used are binary and multi-class logloss, L2, accuracy, F1, ROC AUC, F1 macro, F1 micro, R2, MAE, and MSE.

The expectation is that as the instance count increases, the billable time (per instance) and total runtime decrease, while the average training loss and validation loss at the end of the last built tree and evaluation scores on the hold-out test data remain the same.

We conducted three experiments:

  • Benchmark on three publicly available datasets using CSV as the input data format
  • Benchmark on a different dataset using Parquet as the input data format
  • Compare the model performance on different instance types given a certain instance count

The datasets we used are lending club loan data, ad-tracking fraud detection data, code data, and NYC taxi data. The data statistics are presented as follows.

Dataset                     | Size    | Number of Examples | Number of Features | Problem Type
lending club loan           | ~10 GB  | 1,439,141          | 955                | Binary classification
ad-tracking fraud detection | ~10 GB  | 145,716,493        | 9                  | Binary classification
code                        | ~10 GB  | 18,268,221         | 9                  | Multi-class classification (10 classes)
NYC taxi                    | ~0.5 GB | 83,601,440         | 8                  | Regression

The following table contains the benchmarking results for the first three datasets using CSV as the data input format. For demonstration purposes, we removed the categorical features for the lending club loan data. The data statistics are shown in the table. The experiment results matched our expectations.

(All times are in seconds. Losses are measured at the end of the last built tree; training loss is averaged over all instances.)

lending club loan (loss: binary logloss)

Instance Count (ml.m5.2xlarge) | Billable Time per Instance | Total Runtime | Avg Training Loss | Validation Loss | Accuracy (%) | F1 (%) | ROC AUC (%)
1  | Out of Memory
2  | Out of Memory
4  | 461 | 614 | 0.034 | 0.039 | 98.9 | 96.6 | 99.7
6  | 375 | 561 | 0.034 | 0.039 | 98.9 | 96.6 | 99.7
8  | 359 | 549 | 0.034 | 0.039 | 98.9 | 96.7 | 99.7
10 | 338 | 522 | 0.036 | 0.037 | 98.9 | 96.6 | 99.7

ad-tracking fraud detection (loss: binary logloss)

Instance Count (ml.m5.2xlarge) | Billable Time per Instance | Total Runtime | Avg Training Loss | Validation Loss | Accuracy (%) | F1 (%) | ROC AUC (%)
1  | Out of Memory
2  | Out of Memory
4  | 2649 | 2773 | 0.038 | 0.039 | 99.8 | 43.2 | 80.4
6  | 1926 | 2047 | 0.039 | 0.040 | 99.8 | 44.5 | 79.7
8  | 1677 | 1780 | 0.040 | 0.040 | 99.8 | 45.3 | 79.2
10 | 1595 | 1744 | 0.040 | 0.041 | 99.8 | 43.0 | 79.3

code (loss: multi-class logloss)

Instance Count (ml.m5.2xlarge) | Billable Time per Instance | Total Runtime | Avg Training Loss | Validation Loss | Accuracy (%) | F1 Macro (%) | F1 Micro (%)
1  | 5329 | 5414 | 0.937 | 0.947 | 65.6 | 59.3 | 65.6
2  | 3175 | 3294 | 0.940 | 0.942 | 65.5 | 59.0 | 65.5
4  | 2593 | 2695 | 0.937 | 0.942 | 65.6 | 59.3 | 65.6
8  | 2253 | 2377 | 0.938 | 0.943 | 65.6 | 59.3 | 65.6
10 | 2160 | 2285 | 0.937 | 0.942 | 65.6 | 59.3 | 65.6

The following table contains the benchmarking results using NYC taxi data with Parquet as the input data format. For the NYC taxi data, we use the yellow trip taxi records from 2009–2022. We follow the example notebook to conduct feature processing. The processed data takes 8.5 GB of disk space when saved in CSV format, and only 0.55 GB when saved in Parquet format.

A pattern similar to the one in the preceding table is observed. As the instance count increases, the billable time (per instance) and total runtime decrease, while the average training loss and validation loss at the end of the last built tree and evaluation scores on the hold-out test data remain roughly the same.

NYC taxi (loss: L2; times in seconds)

Instance Count (ml.m5.4xlarge) | Billable Time per Instance | Total Runtime | Avg Training Loss | Validation Loss | R2 (%) | MSE  | MAE
1 | 951 | 1036 | 6.543 | 6.543 | 54.7 | 42.8 | 2.7
2 | 635 | 727  | 6.545 | 6.545 | 54.7 | 42.8 | 2.7
4 | 501 | 628  | 6.637 | 6.639 | 53.4 | 44.1 | 2.8
6 | 435 | 552  | 6.740 | 6.740 | 52.0 | 45.4 | 2.8
8 | 410 | 510  | 6.919 | 6.924 | 52.3 | 44.9 | 2.9

We also conduct benchmarking experiments and compare the performance under different instance types using the code dataset. For a certain instance count, as the instance type becomes larger, the billable time and total runtime decrease.

Instance Count | ml.m5.2xlarge Billable / Total (s) | ml.m5.4xlarge Billable / Total (s) | ml.m5.12xlarge Billable / Total (s)
1 | 5329 / 5414 | 2793 / 2904 | 1302 / 1394
2 | 3175 / 3294 | 1911 / 2000 | 1006 / 1098
4 | 2593 / 2695 | 1451 / 1557 | 891 / 973

Conclusion

With the power of Dask’s distributed computing framework and LightGBM’s efficient gradient boosting algorithm, data scientists and developers can train models on large datasets faster and more efficiently than using traditional single-node methods. The SageMaker LightGBM algorithm makes the process of setting up distributed training using the Dask framework for both tabular classification and regression tasks much easier. The algorithm is now available through the SageMaker Python SDK. The supported data format can be either CSV or Parquet. Extensive benchmarking experiments were conducted on four publicly available datasets with various settings to validate its performance.

You can bring your own dataset and try these new algorithms on SageMaker, and check out the example notebook to use the built-in algorithms available on GitHub.


About the authors

Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.

Will Badr is a Principal AI/ML Specialist SA who works as part of the global Amazon Machine Learning team. Will is passionate about using technology in innovative ways to positively impact the community. In his spare time, he likes to go diving, play soccer and explore the Pacific Islands.

Dr. Li Zhang is a Principal Product Manager-Technical for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms, a service that helps data scientists and machine learning practitioners get started with training and deploying their models, and uses reinforcement learning with Amazon SageMaker. His past work as a principal research staff member and master inventor at IBM Research has won the test of time paper award at IEEE INFOCOM.

Build a water consumption forecasting solution for a water utility agency using Amazon Forecast

Amazon Forecast is a fully managed service that uses machine learning (ML) to generate highly accurate forecasts, without requiring any prior ML experience. Forecast is applicable in a wide variety of use cases, including estimating supply and demand for inventory management, travel demand forecasting, workforce planning, and computing cloud infrastructure usage.

You can use Forecast to seamlessly conduct what-if analyses up to 80% faster to analyze and quantify the potential impact of business levers on your demand forecasts. A what-if analysis helps you investigate and explain how different scenarios might affect the baseline forecast created by Forecast. With Forecast, there are no servers to provision or ML models to build manually. Additionally, you only pay for what you use, and there is no minimum fee or upfront commitment. To use Forecast, you only need to provide historical data for what you want to forecast, and, optionally, any additional data that you believe may impact your forecasts.

Water utility providers have several forecasting use cases, but primary among them is predicting water consumption in an area or building to meet the demand. Also, it’s important for utility providers to forecast the increased consumption demand because of more apartments added in a building or more houses in the area. Predicting water consumption accurately is critical to avoid any service interruptions to the customer.

This post explores using Forecast to address this use case by using historical time series data.

Solution overview

Water is a natural resource and very critical to industry, agriculture, households, and our lives. Accurate water consumption forecasting is critical to make sure that an agency can run day-to-day operations efficiently. Water consumption forecasting is particularly challenging because demand is dynamic, and seasonal weather changes can have an impact. Predicting water consumption accurately is important so customers don’t face any service interruptions and in order to provide a stable service while maintaining low prices. Improved forecasting enables you to plan ahead to structure more cost-effective future contracts. The following are the two most common use cases:

  • Better demand management – As a utility provider agency, you need to find a balance between water demand and supply. The agency collects information like number of people living in an apartment and number of apartments in a building before providing service. As a utility agency, you must balance aggregate supply and demand. You need to store sufficient water in order to meet the demand. Moreover, demand forecasting has become more challenging for the following reasons:
    • The demand isn’t stable at all times and varies throughout the day. For example, water consumption at midnight is much less compared to in the morning.
    • Weather can also have an impact on the overall consumption. For example, water consumption is higher in the summer than the winter in the northern hemisphere, and the other way around in the southern hemisphere.
    • Rainfall, water storage mechanisms (lakes, reservoirs), or water filtering capacity may be insufficient, and during the summer, supply can’t always keep up with demand. Water agencies have to forecast carefully to acquire other sources, which may be more expensive. Therefore, it’s critical for utility agencies to find alternative water sources like harvesting rainwater, capturing condensation from air handling units, or reclaiming wastewater.
  • Conducting a what-if analysis for increased demand – Demand for water is rising due to multiple reasons. This includes a combination of population growth, economic development, and changing consumption patterns. Let’s imagine a scenario where an existing apartment building builds an extension and the number of households and people increase by a certain percentage. Now you need to do an analysis to forecast the supply for increased demand. This also helps you make a cost-effective contract for increased demand.

Forecasting can be challenging because you first need accurate models to forecast demand and then a quick and simple way to reproduce the forecast across a range of scenarios.

This post focuses on a solution to perform water consumption forecasting and a what-if analysis. This post doesn’t consider weather data for model training. However, you can add weather data, given its correlation to water consumption.

Prerequisites

Before getting started, we set up our resources. For this post, we use the us-east-1 Region.

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket for storing the historical time series data. For instructions, refer to Create your first S3 bucket.
  2. Download data files from the GitHub repo and upload to the newly created S3 bucket.
  3. Create a new AWS Identity and Access Management (IAM) role. For instructions, see Set Up Permissions for Amazon Forecast. Be sure to provide the name of your S3 bucket.

Create a dataset group and datasets

This post demonstrates two use cases related to water demand forecast: forecasting the water demand based on past water consumption, and conducting a what-if analysis for increased demand.

Forecast can accept three types of datasets: target time series (TTS), related time series (RTS), and item metadata (IM). Target time series data defines the historical demand for the resources you’re predicting. The target time series dataset is mandatory. A related time series dataset includes time-series data that isn’t included in a target time series dataset and might improve the accuracy of your predictor.

In our example, the target time series dataset contains item_id and timestamp dimensions, and the complementary related time series dataset includes no_of_consumer. An important note with this dataset: the TTS ends on 2023-01-01, and the RTS ends on 2023-01-15. When performing what-if scenarios, it’s important to manipulate RTS variables beyond your known time horizon in TTS.

To conduct a what-if analysis, we need to import two CSV files representing the target time series data and the related time series data. Our example target time series file contains the item_id, timestamp, and demand, and our related time series file contains the item_id, timestamp, and no_of_consumer.
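
To make the layout concrete, the following sketch writes toy versions of both files with pandas; the item ID and date ranges follow the description above (TTS through 2023-01-01, RTS through 2023-01-15), but the demand and consumer values are invented.

import pandas as pd

# Toy target time series: daily demand per item, ending 2023-01-01.
tts = pd.DataFrame({
    "item_id": "A_10001",
    "timestamp": pd.date_range("2022-12-28", "2023-01-01", freq="D"),
    "demand": [1520.0, 1498.5, 1610.2, 1575.0, 1590.8],  # invented values
})

# Toy related time series: extends two weeks past the TTS so what-if
# transformations can act on the forecast horizon.
rts = pd.DataFrame({
    "item_id": "A_10001",
    "timestamp": pd.date_range("2022-12-28", "2023-01-15", freq="D"),
    "no_of_consumer": 120,  # constant headcount in this toy example
})

tts.to_csv("target_time_series.csv", index=False)
rts.to_csv("related_time_series.csv", index=False)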

To import your data, complete the following steps:

  1. On the Forecast console, choose View dataset groups.

  2. Choose Create dataset group.

  3. For Dataset group name, enter a name (for this post, water_consumption_datasetgroup).
  4. For Forecasting domain, choose a forecasting domain (for this post, Custom).
  5. Choose Next.
  6. On the Create target time series dataset page, provide the dataset name, frequency of your data, and data schema.
  7. On the Dataset import details page, enter a dataset import name.
  8. For Import file type, select CSV and enter the data location.
  9. Choose the IAM role you created earlier as a prerequisite.
  10. Choose Start.

You’re redirected to the dashboard that you can use to track progress.

  1. To import the related time series file, on the dashboard, choose Import.
  2. On the Create related time series dataset page, provide the dataset name and data schema.
  3. On the Dataset import details page, enter a dataset import name.
  4. For Import file type, select CSV and enter the data location.
  5. Choose the IAM role you created earlier.
  6. Choose Start.

Train a predictor

Next, we train a predictor.

  1. On the dashboard, choose Start under Train a predictor.
  2. On the Train predictor page, enter a name for your predictor.
  3. Specify how long in the future you want to forecast and at what frequency.
  4. Specify the number of quantiles you want to forecast for.

Forecast uses AutoPredictor to create predictors. For more information, refer to Training Predictors.

  1. Choose Create.

Create a forecast

After our predictor is trained (this can take approximately 3.5 hours), we create a forecast. You will know that your predictor is trained when you see the View predictors button on your dashboard.

  1. Choose Start under Generate forecasts on the dashboard.
  2. On the Create a forecast page, enter a forecast name.
  3. For Predictor, choose the predictor that you created.
  4. Optionally, specify the forecast quantiles.
  5. Specify the items to generate a forecast for.
  6. Choose Start.

Query your forecast

You can query a forecast using the Query forecast option. By default, the complete range of the forecast is returned. You can request a specific date range within the complete forecast. When you query a forecast, you must specify filtering criteria. A filter is a key-value pair. The key is one of the schema attribute names (including forecast dimensions) from one of the datasets used to create the forecast. The value is a valid value for the specified key. You can specify multiple key-value pairs. The returned forecast will only contain items that satisfy all the criteria.

  1. Choose Query forecast on the dashboard.
  2. Provide the filter criteria for start date and end date.
  3. Specify your forecast key and value.
  4. Choose Get Forecast.

The following screenshot shows the forecasted water consumption for the apartment (item ID A_10001) using the forecast model.
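
The same query can also be issued programmatically. Here is a hedged boto3 sketch using the forecastquery client; the forecast ARN below is a placeholder.

import boto3

forecast_query = boto3.client("forecastquery", region_name="us-east-1")

response = forecast_query.query_forecast(
    # Placeholder ARN; use the ARN of the forecast you created.
    ForecastArn="arn:aws:forecast:us-east-1:123456789012:forecast/water_demand_forecast",
    StartDate="2023-01-02T00:00:00",
    EndDate="2023-01-15T00:00:00",
    Filters={"item_id": "A_10001"},
)

# Predictions are keyed by quantile, e.g. "p10", "p50", "p90".
for quantile, points in response["Forecast"]["Predictions"].items():
    print(quantile, points[:3])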

Create a what-if analysis

At this point, we have created our baseline forecast and can now conduct a what-if analysis. Let’s imagine a scenario where an existing apartment building adds an extension, and the number of households and people increases by 20%. Now you need to do an analysis to forecast increased supply based on increased demand.

There are three stages to conducting a what-if analysis: setting up the analysis, creating the what-if forecast by defining what is changed in the scenario, and comparing the results.

  1. To set up your analysis, choose Explore what-if analysis on the dashboard.
  2. Choose Create.
  3. Enter a unique name and choose the baseline forecast.
  4. Choose the items in your dataset you want to conduct a what-if analysis for. You have two options:
    • Select all items is the default, which we choose in this post.
    • If you want to pick specific items, choose Select items with a file and import a CSV file containing the unique identifier for the corresponding item and any associated dimensions.
  5. Choose Create what-if analysis.

Create a what-if forecast

Next, we create a what-if forecast to define the scenario we want to analyze.

  1. In the What-if forecast section, choose Create.
  2. Enter a name of your scenario.
  3. You can define your scenario through two options:
    • Use transformation functions – Use the transformation builder to transform the related time series data you imported. For this walkthrough, we evaluate how the demand for an item in our dataset changes when the number of consumers increases by 20% compared to the baseline forecast.
    • Define the what-if forecast with a replacement dataset – Replace the related time series dataset you imported.

For our example, we create a scenario where we increase no_of_consumer, a feature in the dataset, by 20% for item ID A_10001. You need this analysis to forecast and meet the water supply for increased demand. This analysis also helps you make a cost-effective contract based on the water demand forecast.

  1. For What-if forecast definition method, select Use transformation functions.
  2. Choose Multiply as our operator, no_of_consumer as our time series, and enter 1.2.
  3. Choose Add condition.
  4. Choose Equals as the operation and enter A_10001 for item_id.
  5. Choose Create.
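
The same scenario maps onto the Forecast API’s what-if operations. Below is a hedged boto3 sketch of CreateWhatIfForecast; the names and ARN are placeholders.

import boto3

forecast = boto3.client("forecast", region_name="us-east-1")

response = forecast.create_what_if_forecast(
    WhatIfForecastName="water_demand_whatif_forecast",
    # Placeholder ARN; use the ARN of the what-if analysis created above.
    WhatIfAnalysisArn="arn:aws:forecast:us-east-1:123456789012:what-if-analysis/water_demand_whatif_analysis",
    TimeSeriesTransformations=[
        {
            # Multiply no_of_consumer by 1.2 (a 20% increase) ...
            "Action": {
                "AttributeName": "no_of_consumer",
                "Operation": "MULTIPLY",
                "Value": 1.2,
            },
            # ... only where item_id equals A_10001.
            "TimeSeriesConditions": [
                {
                    "AttributeName": "item_id",
                    "AttributeValue": "A_10001",
                    "Condition": "EQUALS",
                }
            ],
        }
    ],
)
print(response["WhatIfForecastArn"])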

Compare the forecasts

We can now compare the what-if forecasts for both our scenarios, comparing a 20% increase in consumers with the baseline demand.

  1. On the analysis insights page, navigate to the Compare what-if forecasts section.
  2. For item_id, enter the item to analyze (in our scenario, enter A_10001).
  3. For What-if forecasts, choose water_demand_whatif_analysis.
  4. Choose Compare what-if.
  5. You can choose the baseline forecast for the analysis.

The following graph shows the resulting demand for our scenario. The red line shows the forecast of future water consumption with a 20% population increase. The P90 forecast type indicates the true value is expected to be lower than the predicted value 90% of the time. You can use this demand forecast to effectively manage water supply for increased demand and avoid any service interruptions.

Export your data

To export your data to CSV, complete the following steps:

  1. Choose Create export.
  2. Enter a name for your export file (for this post, water_demand_export).
  3. Specify the scenarios to be exported by selecting the scenarios on the What-If Forecast drop-down menu.

You can export multiple scenarios at once in a combined file.

  1. For Export location, specify the Amazon S3 location.
  2. To begin the export, choose Create Export.
  3. To download the export, navigate to the file’s S3 path on the Amazon S3 console, select the file, and choose Download.

The export file will contain the timestamp, item_id, and forecasts for each quantile for all scenarios selected (including the base scenario).

Clean up the resources

To avoid incurring future charges, remove the resources created by this solution:

  1. Delete the Forecast resources you created.
  2. Delete the S3 bucket.

Conclusion

In this post, we showed how easy it is to use Forecast and its underlying system architecture to predict water demand using water consumption data. A what-if scenario analysis is a critical tool to help navigate through the uncertainties of business. It provides foresight and a mechanism to stress-test ideas, leaving businesses more resilient, better prepared, and in control of their future. Other utility providers like electricity or gas providers can use Forecast to build solutions and meet utility demand in a cost-effective way.

The steps in this post demonstrated how to build the solution on the AWS Management Console. To directly use Forecast APIs for building the solution, follow the notebook in our GitHub repo.

We encourage you to learn more by visiting the Amazon Forecast Developer Guide and try out the end-to-end solution enabled by these services with a dataset relevant to your business KPIs.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.
