1,650+ Global Interns Gleam With NVIDIA Green

A record number of interns calls for a record-sized celebration.

In our largest contingent ever, over 1,650 interns from 350+ schools started with NVIDIA worldwide over the past year.

Amidst busy work days tackling real-world projects across engineering, automation, robotics and more, the group’s also finishing up a three-day celebration, culminating today with National Intern Day. Events ranged from tech demos to virtual meditation and yoga classes to an exclusive Q&A with NVIDIA founder and CEO Jensen Huang.

The three stories below highlight the meaningful work of our interns — who are roughly half undergrads, half grad students — and the connections they’ve been forging.

Bailey Tinkers His Way Into Computer Engineering

Ever since he started tinkering with PC parts at age 12, Darryl Bailey — a computer engineering student at Georgia Tech — knew he wanted to work with computers.

He’s doing just that this summer as an ASIC verification intern on the Compute Express Link team, which ensures bug-free pre-silicon design across multiple GPUs. So far, he’s worked on a dashboard that will simplify the workflows of design verification engineers.

“This project specifically has a high impact because the script is going to be used by all design verification engineers at NVIDIA,” he said. “It’s a really cool feeling to have my code out there in action.”

Bailey hasn’t just been gaining technical skills here. He’s also honed his work style and learned how to use soft skills to most effectively wield the hard skills he’s acquired at school.

“The most important thing I’ve gotten out of this internship is that we’re all just one big team,” he said. “I realized that as much as I want to learn and dive into everything related to computers, it’s also okay to just focus on one thing, because we’re all working together towards the final goal.”

Kulkarni Pioneers a Path in Engineering

This summer, Seema Kulkarni, an electrical and computer engineering student at the University of Texas at Austin, joined NVIDIA as a software R&D intern working on NVIDIA Omniverse, a virtual world simulation and collaboration platform for 3D workflows.

Kulkarni comes from a background with limited early exposure to careers in tech. Her academic journey initially suggested a future in finance and marketing. But, wanting to make a more tangible impact, she switched over to engineering.

“I really loved sitting in on the Women’s Leadership Panel hosted by the University Recruiting team because it felt inspiring to know that even as a woman, you can stay in technical fields for a long time and love it,” she said. “Seeing these female leaders drive innovation here at NVIDIA affirmed that no matter where you’re at, there’s always room to flourish.”

So far, Kulkarni has been working on projects like building asset validators that will simplify the user interface for NVIDIA Omniverse users and debugging Universal Scene Description code to resolve critical issues. It's taught her to be a better software engineer because it's challenged her to think the way the engineers who came before her did.

Kim Kicks Off Her Second Round in Technical Writing

Writing may not be the first thing that comes to mind when one thinks about an internship at NVIDIA.

But as JJ Kim, a marketing major from Boston University, points out, communication is key in any business, even in tech.

This summer, Kim is returning for her second stint on the enterprise marketing team as a technical writing intern. She's assisted with SIGGRAPH preparations and is churning out explainer blogs, which break down technical concepts in a digestible, approachable way.

“I always feel like I’m learning new things when I write these explainer blogs because they do such a good job of helping someone like me — who has limited technical knowledge — understand what it is that NVIDIA technology does and the impact that it’s making,” she said.

Kim says what’s brought her back for a second internship is NVIDIA’s inclusive, welcoming culture.

“Everyone is so willing to help, which makes work feel like such a safe environment,” she said. “I’m not afraid to try new things out or ask questions because I have such an amazing team of experienced people to work with.”

Read more about NVIDIA’s internship program. Applications are accepted year-round.  


New hardware offers faster computation for artificial intelligence, with much less energy

As scientists push the boundaries of machine learning, the amount of time, energy, and money required to train increasingly complex neural network models is skyrocketing. A new area of artificial intelligence called analog deep learning promises faster computation with a fraction of the energy usage.

Programmable resistors are the key building blocks in analog deep learning, just like transistors are the core elements for digital processors. By repeating arrays of programmable resistors in complex layers, researchers can create a network of analog artificial “neurons” and “synapses” that execute computations just like a digital neural network. This network can then be trained to achieve complex AI tasks like image recognition and natural language processing.

A multidisciplinary team of MIT researchers set out to push the speed limits of a type of human-made analog synapse that they had previously developed. They utilized a practical inorganic material in the fabrication process that enables their devices to run 1 million times faster than previous versions, which is also about 1 million times faster than the synapses in the human brain.

Moreover, this inorganic material also makes the resistor extremely energy-efficient. Unlike materials used in the earlier version of their device, the new material is compatible with silicon fabrication techniques. This change has enabled fabricating devices at the nanometer scale and could pave the way for integration into commercial computing hardware for deep-learning applications.

“With that key insight, and the very powerful nanofabrication techniques we have at MIT.nano, we have been able to put these pieces together and demonstrate that these devices are intrinsically very fast and operate with reasonable voltages,” says senior author Jesús A. del Alamo, the Donner Professor in MIT’s Department of Electrical Engineering and Computer Science (EECS). “This work has really put these devices at a point where they now look really promising for future applications.”

“The working mechanism of the device is electrochemical insertion of the smallest ion, the proton, into an insulating oxide to modulate its electronic conductivity. Because we are working with very thin devices, we could accelerate the motion of this ion by using a strong electric field, and push these ionic devices to the nanosecond operation regime,” explains senior author Bilge Yildiz, the Breene M. Kerr Professor in the departments of Nuclear Science and Engineering and Materials Science and Engineering.

“The action potential in biological cells rises and falls with a timescale of milliseconds, since the voltage difference of about 0.1 volt is constrained by the stability of water,” says senior author Ju Li, the Battelle Energy Alliance Professor of Nuclear Science and Engineering and professor of materials science and engineering. “Here we apply up to 10 volts across a special solid glass film of nanoscale thickness that conducts protons, without permanently damaging it. And the stronger the field, the faster the ionic devices.”

These programmable resistors vastly increase the speed at which a neural network is trained, while drastically reducing the cost and energy to perform that training. This could help scientists develop deep learning models much more quickly, which could then be applied in uses like self-driving cars, fraud detection, or medical image analysis.

“Once you have an analog processor, you will no longer be training networks everyone else is working on. You will be training networks with unprecedented complexities that no one else can afford to, and therefore vastly outperform them all. In other words, this is not a faster car, this is a spacecraft,” adds lead author and MIT postdoc Murat Onen.

Co-authors include Frances M. Ross, the Ellen Swallow Richards Professor in the Department of Materials Science and Engineering; postdocs Nicolas Emond and Baoming Wang; and Difei Zhang, an EECS graduate student. The research is published today in Science.

Accelerating deep learning

Analog deep learning is faster and more energy-efficient than its digital counterpart for two main reasons. First, computation is performed in memory, so enormous loads of data are not transferred back and forth from memory to a processor. Second, analog processors conduct operations in parallel. If the matrix size expands, an analog processor doesn’t need more time to complete new operations because all computation occurs simultaneously.

The key element of MIT’s new analog processor technology is known as a protonic programmable resistor. These resistors, which are measured in nanometers (one nanometer is one billionth of a meter), are arranged in an array, like a chess board.

In the human brain, learning happens due to the strengthening and weakening of connections between neurons, called synapses. Deep neural networks have long adopted this strategy, where the network weights are programmed through training algorithms. In the case of this new processor, increasing and decreasing the electrical conductance of protonic resistors enables analog machine learning.

The conductance is controlled by the movement of protons. To increase the conductance, more protons are pushed into a channel in the resistor, while to decrease conductance protons are taken out. This is accomplished using an electrolyte (similar to that of a battery) that conducts protons but blocks electrons.
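
To make the analogy concrete, here is a minimal, purely illustrative NumPy sketch (not the researchers' code or device model) of how an array of programmable conductances behaves like a layer of synaptic weights: the forward pass follows from Ohm's and Kirchhoff's laws, and a "weight update" amounts to nudging a single conductance up or down, much as protons are pushed into or pulled out of a resistor's channel.

```python
import numpy as np

# Illustrative simulation of an analog crossbar (not the MIT device itself).
# Each entry of G is the conductance of one programmable resistor (siemens);
# the whole matrix plays the role of a layer's weights.
rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-5, size=(4, 3))      # 4 output rows x 3 input columns

def forward(G, v_in):
    """Input voltages drive every column at once; each row wire sums its
    currents (Kirchhoff's current law), so the matrix-vector product happens
    'in memory' and in parallel rather than by shuttling weights to a CPU."""
    return G @ v_in                            # output currents, I = G * V

def program(G, row, col, delta_g):
    """Raise or lower one conductance, analogous to pushing protons into
    (or pulling them out of) that resistor's channel during training."""
    G[row, col] = np.clip(G[row, col] + delta_g, 1e-7, 1e-4)
    return G

v = np.array([0.1, 0.05, 0.2])                 # input voltages (volts)
print(forward(G, v))                           # currents read out in parallel
G = program(G, row=0, col=2, delta_g=5e-7)     # one analog "weight update"
```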

To develop a super-fast and highly energy efficient programmable protonic resistor, the researchers looked to different materials for the electrolyte. While other devices used organic compounds, Onen focused on inorganic phosphosilicate glass (PSG).

PSG is basically silicon dioxide, which is the powdery desiccant material found in tiny bags that come in the box with new furniture to remove moisture. It is studied as a proton conductor under humidified conditions for fuel cells. It is also the most well-known oxide used in silicon processing. To make PSG, a tiny bit of phosphorus is added to the silicon to give it special characteristics for proton conduction.

Onen hypothesized that an optimized PSG could have a high proton conductivity at room temperature without the need for water, which would make it an ideal solid electrolyte for this application. He was right.

Surprising speed

PSG enables ultrafast proton movement because it contains a multitude of nanometer-sized pores whose surfaces provide paths for proton diffusion. It can also withstand very strong, pulsed electric fields. This is critical, Onen explains, because applying more voltage to the device enables protons to move at blinding speeds.

“The speed certainly was surprising. Normally, we would not apply such extreme fields across devices, in order to not turn them into ash. But instead, protons ended up shuttling at immense speeds across the device stack, specifically a million times faster compared to what we had before. And this movement doesn’t damage anything, thanks to the small size and low mass of protons. It is almost like teleporting,” he says.

“The nanosecond timescale means we are close to the ballistic or even quantum tunneling regime for the proton, under such an extreme field,” adds Li.

Because the protons don’t damage the material, the resistor can run for millions of cycles without breaking down. This new electrolyte enabled a programmable protonic resistor that is a million times faster than their previous device and can operate effectively at room temperature, which is important for incorporating it into computing hardware.

Thanks to the insulating properties of PSG, almost no electric current passes through the material as protons move. This makes the device extremely energy efficient, Onen adds.

Now that they have demonstrated the effectiveness of these programmable resistors, the researchers plan to reengineer them for high-volume manufacturing, says del Alamo. Then they can study the properties of resistor arrays and scale them up so they can be embedded into systems.

At the same time, they plan to study the materials to remove bottlenecks that limit the voltage that is required to efficiently transfer the protons to, through, and from the electrolyte.

“Another exciting direction that these ionic devices can enable is energy-efficient hardware to emulate the neural circuits and synaptic plasticity rules that are deduced in neuroscience, beyond analog deep neural networks. We have already started such a collaboration with neuroscience, supported by the MIT Quest for Intelligence,” adds Yildiz.

“The collaboration that we have is going to be essential to innovate in the future. The path forward is still going to be very challenging, but at the same time it is very exciting,” del Alamo says.

“Intercalation reactions such as those found in lithium-ion batteries have been explored extensively for memory devices. This work demonstrates that proton-based memory devices deliver impressive and surprising switching speed and endurance,” says William Chueh, associate professor of materials science and engineering at Stanford University, who was not involved with this research. “It lays the foundation for a new class of memory devices for powering deep learning algorithms.”

“This work demonstrates a significant breakthrough in biologically inspired resistive-memory devices. These all-solid-state protonic devices are based on exquisite atomic-scale control of protons, similar to biological synapses but at orders of magnitude faster rates,” says Elizabeth Dickey, the Teddy & Wilton Hawkins Distinguished Professor and head of the Department of Materials Science and Engineering at Carnegie Mellon University, who was not involved with this work. “I commend the interdisciplinary MIT team for this exciting development, which will enable future-generation computational devices.”

This research is funded, in part, by the MIT-IBM Watson AI Lab.


Load-testing TensorFlow Serving’s REST Interface

Posted by Chansung Park and Sayak Paul (ML-GDEs)

In this post, we’ll share the lessons learned and findings from conducting load tests for an image classification model across numerous deployment configurations. These configurations involve REST-based deployments with TensorFlow Serving. In this way, we aim to equip readers with a holistic understanding of the differences between the configurations.

This post is less about code and more about the architectural decisions we had to make for performing the deployments. We’ll first provide an overview of our setup including the technical specifications. We’ll also share our commentaries on the design choices we made and their impact.

Technical Setup

TensorFlow Serving is feature-rich, with targeted design choices embedded in it (more on this later). For online prediction scenarios, the model is usually exposed as some kind of service.

To perform our testing, we use a pre-trained ResNet50 model, which can classify a variety of images into different categories. We then serve this model with TensorFlow Serving, exposing it as a REST service.
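
TensorFlow Serving loads models in TensorFlow's SavedModel format. As a rough sketch, and not the exact export code we used (which lives in the repository mentioned below), a pre-trained Keras ResNet50 could be exported like so; the path and the model name "resnet" are placeholders for illustration:

```python
import tensorflow as tf

# Load a ResNet50 pre-trained on ImageNet from Keras Applications.
model = tf.keras.applications.ResNet50(weights="imagenet")

# Export it as a SavedModel under a numeric version directory, which is the
# layout TensorFlow Serving expects (model_base_path/<version>/...).
export_path = "./models/resnet/1"
tf.saved_model.save(model, export_path)
```

TensorFlow Serving then picks up the highest version number it finds under the model's base path.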

Our deployment platform (nodes on the Kubernetes Cluster) is CPU-based. We don’t employ GPUs at any stage of our processes. For this purpose, we can build a CPU-optimized TensorFlow Serving image and take advantage of a few other options which can reduce the latency and boost the overall throughput of the system. We will discuss these later in the post.

You can find all the code and learn how the deployments were performed in this repository. Here, you’ll find example notebooks and detailed setup instructions for playing around with the code. As such, we won’t be discussing the code line by line, but will instead shed light on the most important parts where necessary.

Throughout the rest of this post, we’ll discuss the key considerations for the deployment experiments respective to TensorFlow Serving including its motivation, limitations, and our experimental results.

With the emergence of serverless offerings like Vertex AI, it has never been easier to deploy models and scale them securely and reliably. These services help reduce the time-to-market tremendously and increase overall developer productivity. That said, there might still be instances where you’d like more granular control over things. This is one of the reasons why we wanted to do these experiments in the first place.

Considerations

TensorFlow Serving has its own sets of constraints and design choices that can impact a deployment. In this section, we provide a concise overview of these considerations.

Deployment infrastructure: We chose GKE because Kubernetes is a standard deployment platform when using GCP, and GKE lets us focus on the ML parts without worrying about the infrastructure since it is a fully managed Google Cloud Platform service. Our main interest is in how to deploy models for CPU-based environments, so we have prepared a CPU-optimized TensorFlow Serving image.

Trade-off between more or fewer servers: We started the TensorFlow Serving experiments with the simplest possible VMs, equipped with 2vCPUs and 4GB RAM, then gradually upgraded the specification up to 8vCPUs and 64GB RAM. At the same time, we decreased the number of nodes in the Kubernetes cluster from 8 to 2, to explore the trade-off between many cheaper servers and fewer, more expensive ones.

Options to benefit multi-core environments: We wanted to see if high-end VMs can outperform simpler VMs when given options that take advantage of the multi-core environment, even though there are fewer nodes. To this end, we experimented with different numbers of inter_op_parallelism and intra_op_parallelism threads for the TensorFlow Serving deployment, set according to the number of CPU cores.
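
To illustrate what these two knobs control, here is a small sketch using plain TensorFlow's threading API. Note that for the TensorFlow Serving deployments themselves the equivalent values are passed as tensorflow_model_server command-line flags (to the best of our knowledge, --tensorflow_intra_op_parallelism and --tensorflow_inter_op_parallelism) and injected into the deployment configuration, rather than set in Python:

```python
import tensorflow as tf

# intra-op threads: parallelism *within* a single op (e.g., one large matmul).
# inter-op threads: parallelism *across* independent ops in the graph.
# 8 matches the vCPU count of the largest node type in our experiments.
NUM_CORES = 8

tf.config.threading.set_intra_op_parallelism_threads(NUM_CORES)
tf.config.threading.set_inter_op_parallelism_threads(4)
```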

Dynamic batching and other considerations: Modern ML frameworks such as TensorFlow Serving usually support dynamic batching, initial model warm-up, multiple deployments of multiple versions of different models, and more out of the box. For our purpose of online prediction, we have not tested these features carefully. However, according to the official documentation, the dynamic batching capability is also worth exploring to enhance performance. We have seen that the default batching configuration can reduce latency a little, though those results are not included in this blog post.

Experiments

We have prepared the following environments. In TensorFlow Serving, the number of intra_op_parallelism_threads is set equal to the number of CPU cores, while the number of inter_op_parallelism_threads is set from 2 to 8 for experimental purposes, as it controls the number of threads used to parallelize the execution of independent operations. Below we provide the details of the adjustments we made to the number of vCPUs, the RAM size, and the number of nodes for each Kubernetes cluster. Note that the number of vCPUs and the RAM size apply to each cluster node individually.

The load tests are conducted using Locust. We ran each load test for 5 minutes. The number of requests is controlled by the number of users and depends on circumstances on the client side. We increased the number of users by one every second, up to 150, which is where we found the number of handled requests plateaus, and new requests are spawned every second to understand how TensorFlow Serving behaves. So you can assume that the requests/second figures don’t reflect a real-world situation where clients may try to send requests at any time.
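
For reference, a minimal locustfile along these lines could look like the sketch below. The model name "resnet", the host, and the synthetic image payload are placeholders rather than our exact test script, which is in the repository:

```python
import json
import numpy as np
from locust import HttpUser, task, constant

# Placeholder payload: one 224x224 RGB image as nested lists, which is what
# TensorFlow Serving's REST predict API expects under the "instances" key.
FAKE_IMAGE = np.random.rand(224, 224, 3).tolist()
PAYLOAD = json.dumps({"instances": [FAKE_IMAGE]})

class ServingUser(HttpUser):
    # Each simulated user waits 1 second between requests.
    wait_time = constant(1)

    @task
    def predict(self):
        # TensorFlow Serving's REST endpoint format is
        # /v1/models/<model_name>:predict ("resnet" is a placeholder name).
        self.client.post(
            "/v1/models/resnet:predict",
            data=PAYLOAD,
            headers={"Content-Type": "application/json"},
        )
```

Running it with, for example, `locust -f locustfile.py --host http://<cluster-endpoint>:8501 -u 150 -r 1 -t 5m` ramps up one user per second to 150 users over a five-minute test, mirroring the setup described above (8501 is TensorFlow Serving's default REST port).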

We experimented with the following node configurations on the Kubernetes cluster, listed as the per-node vCPU count, the per-node RAM, and the number of nodes:

  • 2vCPUs, 4GB RAM, 8 Nodes
  • 4vCPUs, 8GB RAM, 4 Nodes
  • 8vCPUs, 16GB RAM, 2 Nodes
  • 8vCPUs, 64GB RAM, 2 Nodes

You can find code for experimenting with these different configurations in the above-mentioned repository. The deployment for each experiment is provisioned through Kustomize to overlay the base configurations, and file-based configurations are injected through ConfigMap.

Results

This section presents the results for each of the above configurations and suggests which configuration is best for the environments we considered. As shown in Figure 1, the best-performing setup was 2 nodes with 8vCPUs and 16GB RAM each, with both intra_op_parallelism_threads and inter_op_parallelism_threads set to 8.

Figure 1: Comparison between different configurations of TensorFlow Serving.

From these experiments, we observed the following:

• TensorFlow Serving is more efficient when deployed on fewer, larger (more CPU and RAM) machines, but the RAM capacity doesn’t have much impact on handling more requests. It is important to find the right number of inter_op_parallelism_threads through experimentation; a higher number does not always guarantee better performance, even when the nodes are equipped with high-capacity hardware.

• TensorFlow Serving focuses more on reliability than on raw throughput. We believe it sacrifices some throughput to achieve reliability, and this is the expected behavior of TensorFlow Serving, as stated in the official documentation. Even though handling as many requests as possible is important, keeping the server reliable is just as important when dealing with a production system.

• There is a trade-off between performance and reliability, so you must choose the balance that is right for your system. That said, the throughput of TensorFlow Serving seems close enough to results from other frameworks such as FastAPI, and if you want to factor in richer features such as dynamic batching and sharing GPU resources efficiently between models, we believe TensorFlow Serving is the right choice.

Note on gRPC and TensorFlow Serving

We are dealing with an image classification model for the deployments, and the input to the model will include images. Hence the size of the request payload can spiral up depending on the image resolution and fidelity. Therefore it’s particularly important to ensure the message transmission is as lightweight as possible. Generally, message transmission is quite a bit faster in gRPC than REST. This post provides a good discussion on the main differences between REST and gRPC APIs.
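
As a rough back-of-the-envelope illustration of why payload size matters here (an estimate, not a measurement from our experiments), consider the payload for a single 224x224 RGB image:

```python
# Back-of-the-envelope payload estimate for one 224x224 RGB image.
values = 224 * 224 * 3                 # 150,528 scalar values
binary_bytes = values * 4              # ~590 KB as raw float32
# As JSON text, each float costs roughly 10-20 characters plus delimiters,
# so the same tensor easily grows to around 2 MB or more on the wire,
# which is part of why binary protocols such as gRPC tend to be lighter.
json_bytes_estimate = values * 15      # rough midpoint estimate
print(binary_bytes / 1024, json_bytes_estimate / (1024 * 1024))
```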

TensorFlow Serving can serve a model over gRPC seamlessly, but comparing the performance of a gRPC API and a REST API is non-trivial, which is why we did not include it in this post. Interested readers can check out this repository, which follows a similar setup but uses a gRPC server instead.

Costs

We used the GCP cost estimator for this purpose, assuming each experiment configuration would be live for 24 hours per month (which was sufficient for our experiments).

Machine Configuration (E2 series)    Pricing (USD)
2vCPUs, 4GB RAM, 8 Nodes             11.15
4vCPUs, 8GB RAM, 4 Nodes             11.15
8vCPUs, 16GB RAM, 2 Nodes            11.15
8vCPUs, 64GB RAM, 2 Nodes            18.21

Conclusion

In this post, we discussed some critical lessons we learned from our experience of load-testing a standard image classification model. We considered TensorFlow Serving, an industry-grade framework for exposing the model to end users. While our load-testing setup may not fully resemble what happens in the wild, we hope our findings will at least act as a good starting point for the community. Even though the post demonstrated our approaches with an image classification model, the approaches should be fairly task-agnostic.

In the interest of brevity, we didn’t do much to push the efficiency aspects of the model further. With modern CPUs, software stacks, and OS-level optimizations, it’s possible to improve the latency and throughput of the model. We redirect the interested reader to the following resources that might be relevant:

Acknowledgements

We are grateful to the ML Ecosystem team that provided GCP credits for supporting our experiments. We also thank Hannes Hapke and Robert Crowe for providing us with helpful feedback and guidance.
