Microsoft at ICALP 2023: Deploying cloud capacity robustly against power failures

This research paper was accepted by the 50th EATCS International Colloquium on Automata, Languages and Programming (ICALP 2023), which is dedicated to advancing the field of theoretical computer science.

In the rapidly evolving landscape of cloud computing, escalating demand for cloud resources is placing immense pressure on cloud providers, driving them to consistently invest in new hardware to accommodate datacenters’ expanding capacity needs. Consequently, the ability to power all this hardware has emerged as a key bottleneck, as devices that power datacenters have limited capacity and necessitate efficient utilization. Efficiency is crucial, not only to lower operational costs and subsequently prices to consumers, but also to support the sustainable use of resources, ensure their long-term availability, and preserve the environment for future generations.

At the same time, it is of the utmost importance to ensure power availability for servers, particularly in the event of a power device failure. Modern datacenter architectures have adopted a strategy to mitigate this risk by avoiding the reliance on a single power source for each server. Instead, each server is powered by two devices. Under normal operations, a server draws half its required power from each device. In the event of a failover, the remaining device steps in to support the server’s full power demand, potentially operating at an increased capacity during the failover period.

Figure 1: This diagram depicts the power allocation of three servers (1, 2, and 3) to the power devices (a, b, and c) that serve them, during both normal operations and in a failover scenario. The height of each power device represents its capacity, and the height of each server within those power devices shows its power consumption. During failover, the servers have fewer devices from which to draw power, resulting in increased energy consumption from the resources available.

Challenges of optimizing power allocation

In our paper, “Online Demand Scheduling with Failovers,” which we’re presenting at the 50th EATCS International Colloquium on Automata, Languages and Programming (ICALP 2023), we explore a simplified model that emphasizes power management to determine how to optimally place servers within a datacenter. Our model contains multiple power devices and new servers (demands), where each power device has a normal capacity of 1 and a larger failover capacity, B. Demands arrive over time, each with a power requirement, and we must irrevocably assign each demand to a pair of power devices with sufficient capacity. The goal is to maximize the total power of the assigned demands until we are forced to reject a demand due to lack of power, thereby maximizing the usage of the available power devices.
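
To make the model concrete, here is a minimal Python sketch of the feasibility check it implies. The data representation (a map from device pairs to the demand sizes placed on them) and the numeric tolerance are illustrative choices for this sketch, not notation from the paper.

```python
def is_feasible(assignment, B=2.0, eps=1e-9):
    """Check the normal and failover constraints of the model.

    `assignment` maps a frozenset pair of device ids to the list of demand
    sizes placed on that pair. A demand of size s draws s/2 from each of its
    two devices in normal operation, and the full s from the surviving device
    if its partner fails. (Illustrative helper, not code from the paper.)
    """
    devices = set().union(*assignment.keys()) if assignment else set()

    # Normal load of each device: half of every demand it hosts.
    normal = {d: 0.0 for d in devices}
    for pair, sizes in assignment.items():
        for d in pair:
            normal[d] += sum(sizes) / 2

    # Normal capacity is 1 on every device.
    if any(load > 1 + eps for load in normal.values()):
        return False

    # If device f fails, each surviving device d absorbs the other half of
    # every demand it shares with f and must stay within failover capacity B.
    for f in devices:
        for d in devices - {f}:
            extra = sum(assignment.get(frozenset((d, f)), [])) / 2
            if normal[d] + extra > B + eps:
                return False
    return True
```

Under this check with B=1, the “good” assignment of Figure 2 (six demands, each drawing ¼ from each of its two devices, one demand per pair across four devices) is feasible, while adding a fifth such demand to the “bad” assignment fails the failover test, whichever pair receives it.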

One crucial aspect of this model is the presence of uncertainty. We assign capacity for each demand without knowing future needs. This uncertainty adds complexity, as selection of device pairs for each demand must be carefully executed to avoid allocations that could hinder the placement of future demands. Figure 2 provides an illustration.

Figure 2: This example shows four power devices a, b, c, and d, with failover capacity B=1, and six demands that arrive sequentially, with a per-device power requirement of ¼ each. Suppose four demands have arrived so far. The example on the left represents a bad assignment that cannot accept the remaining two demands. This is because if we placed an additional demand, say, on device a, its failover capacity would be exceeded if b failed. On the other hand, a good assignment, depicted on the right, allows for the placement of all six demands.

The example in Figure 2 suggests that we should spread the demands across device pairs. Otherwise, pairs with large loads could have a big impact on the remaining devices should one device fail. On the other hand, there is a danger in spreading out the demands too much and not leaving enough devices free, as shown in Figure 3.

Figure 3: This example involves four power devices, labeled a, b, c, and d, each with failover capacity B=1. The scenario also includes seven demands that arrive sequentially, with the first six requiring little power per device (epsilon = 0.01 units of power, for example), and the last requiring 0.5 units of power. In (a), when the small demands are distributed across pairs, the final demand cannot be accommodated because the failover capacity is exceeded. However, by grouping the small demands on a single pair of power devices, as shown in (b), all demands can be successfully accommodated.

In analyzing the examples in Figures 2 and 3, it becomes clear that striking the right balance is crucial: spreading demands across pairs to limit the consequences of a failover, while keeping enough devices free to serve future demand. Attaining this balance avoids prematurely ending up with a demand that cannot be assigned.

To tackle this challenge, we developed algorithms that are guaranteed to efficiently utilize the available power resources. In fact, our allocations are provably close to optimal, even without upfront knowledge of the demands. Our algorithms essentially conquer the inherent uncertainty of the problem.

Optimizing for the worst-case scenario

Our first algorithm takes a conservative approach, protecting against the worst-case scenario. It guarantees that, regardless of the sequence of demand requests, it will utilize at least half of the power when compared with an optimal solution that has prior knowledge of all demands. As we show in the paper, this result represents the optimal outcome achievable in the most challenging instances.

To strike this balance between distributing demands across devices and keeping enough devices available, the algorithm groups demands based on their power requirements. Each group is then allocated separately to an appropriately sized collection of devices. The algorithm aims to consolidate similar demands in controlled regions, enabling efficient utilization of failover resources. Notably, we assign at most one demand per pair of devices in each collection, except in the case of minor demands, which are handled separately.
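
As a rough illustration of this grouping idea (not the paper’s algorithm or its guarantee), the sketch below buckets demands into geometric power classes, gives each class its own pool of reserved device pairs, and places at most one non-small demand per pair. The class boundaries, the small-demand threshold, and the greedy reuse rule are all assumptions made for illustration.

```python
import math
from collections import defaultdict

SMALL = 0.1  # illustrative cutoff for "minor" demands; not the paper's value

def power_class(size):
    """Group demands with similar power: sizes in (2**-(k+1), 2**-k] map to
    class k, while tiny demands get a dedicated class of their own."""
    return "small" if size <= SMALL else math.floor(-math.log2(size))

def assign(demands, devices, B=2.0):
    """Greedy sketch of the grouping idea: each class gets its own pool of
    reserved device pairs, a regular demand occupies a pair exclusively, and
    small demands may share a pair as long as their total still fits on a
    single device during failover. Returns a dict of pair -> demand sizes.
    (Demands larger than min(B, 2) can never be served and are not handled.)"""
    free = list(devices)                     # devices not yet reserved
    pools = defaultdict(list)                # class -> reserved pairs
    load = defaultdict(list)                 # pair  -> demand sizes on it

    for size in demands:
        cls = power_class(size)
        # Reuse an already-reserved pair of this class if it still has room.
        pair = next((p for p in pools[cls]
                     if (cls != "small" and not load[p]) or
                        (cls == "small" and sum(load[p]) + size <= min(B, 2.0))),
                    None)
        if pair is None:
            if len(free) < 2:
                break                        # out of devices: reject and stop
            pair = (free.pop(), free.pop())  # reserve a fresh pair
            pools[cls].append(pair)
        load[pair].append(size)
    return dict(load)
```

On the demands of Figure 3 with four devices and B=1, this sketch reproduces the “good” assignment: the six tiny demands share one reserved pair, leaving a fresh pair for the larger final demand.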

Optimizing for real-world demand

Because the previous algorithm prioritizes robustness against worst-case demand sequences, it may leave as much as half of the available power unused, a loss that is unavoidable in that setting. However, these worst-case scenarios are uncommon in typical datacenter operations. Accordingly, we shifted our focus to a more realistic model where demands arise from an unknown probability distribution. We designed our second algorithm for this stochastic arrival model, demonstrating that as the number of demands and power devices increases, its assignment progressively converges to the optimal omniscient solution, ensuring that no power is wasted.

To achieve this, the algorithm learns from historical data, enabling informed assignment decisions based on past demands. By creating “allocation templates” derived from previous demands, we learn how to allocate future demands. To implement this concept and prove its guarantee, we have developed new tools in probability and optimization that may be valuable in addressing similar problems in the future.
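
As a very loose illustration of this learn-then-commit flow (not the paper’s construction or its analysis), one could sample past demands, pack them offline, and treat the resulting placement as a set of reusable slots for future arrivals. The sketch below does exactly that, reusing the hypothetical assign helper from the previous snippet.

```python
import random

def build_template(history, devices, B=2.0, sample_size=1000):
    """Pack a sample of historical demands offline (here with the greedy
    `assign` sketch above) and record which pair served which demand size.
    Purely illustrative of the allocation-template idea."""
    sample = random.sample(list(history), k=min(len(history), sample_size))
    packing = assign(sorted(sample, reverse=True), devices, B)
    # Flatten the packing into reusable slots: (device pair, slot size).
    return [(pair, s) for pair, sizes in packing.items() for s in sizes]

def place(demand, slots):
    """Serve a new demand with the smallest unused template slot that fits,
    removing that slot from the pool; returns None if nothing fits."""
    fit = min((slot for slot in slots if slot[1] >= demand),
              key=lambda slot: slot[1], default=None)
    if fit is not None:
        slots.remove(fit)
    return fit
```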

Sierra Division Studios Presents Three Epic Projects Built With NVIDIA Omniverse

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

Jacob Norris is a 3D artist and the president, co-founder and creative director of Sierra Division Studios — an outsource studio specializing in digital 3D content creation. The studio was founded with a single goal in mind: to make groundbreaking artwork at the highest level.

His team is entirely remote — giving employees added flexibility to work from anywhere in the world while increasing the pool of prospective artists who have a vast array of experiences and skill sets that the studio can draw from.

Norris envisions a future where incredible 3D content can be made regardless of location, time or even language. It’s a future in which NVIDIA Omniverse, a platform for connecting and building custom 3D tools and metaverse applications, will play a critical role.

Omniverse is also a powerful tool for making SimReady assets — 3D objects with accurate physical properties. Combined with synthetic data, these assets can help solve real-world problems in simulation, including for AI-powered 3D artists. Learn more about AI and access creative resources to level up your passion projects on the NVIDIA Studio creative side hustle page.

Plus, check out the new community challenge, #StartToFinish. Use the hashtag to submit a screenshot of a favorite project featuring both its beginning and ending stages for a chance to be showcased on the @NVIDIAStudio and @NVIDIAOmniverse social channels.

Tapping Omniverse for Omnipresent Work 

“Omniverse is an incredibly powerful tool for our team in the collaboration process,” said Norris. He noted that the Universal Scene Description format, aka OpenUSD, is key to achieving efficient content creation.

“We used OpenUSD to build a massive library of all the assets from our team,” Norris said. “We accomplished this by adding every mesh and element of a single model into a large, easily viewable overview scene for kitbashing, which is the process of combining elements from several assets into an entirely new model.”

The byproduct of kitbashing.

“Since everything is shared in OpenUSD, our asset library is easily accessible and reduces the time needed to access materials and make edits on the fly,” Norris added. “This helps spur inspirational and imaginational forces.”
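
As a minimal sketch of what such a shared overview scene could look like in code, the snippet below uses the pxr Python bindings that ship with OpenUSD to reference a handful of assets into one stage. The file paths, prim names, and spacing are hypothetical, not Sierra Division’s actual pipeline.

```python
from pxr import Gf, Usd, UsdGeom

# Hypothetical asset files from the shared library.
asset_files = ["assets/pipe_a.usd", "assets/valve_b.usd", "assets/railing_c.usd"]

stage = Usd.Stage.CreateNew("asset_library_overview.usda")
UsdGeom.Xform.Define(stage, "/Library")

for i, path in enumerate(asset_files):
    # Reference rather than copy each asset, so edits to the source files
    # show up in the overview scene automatically.
    xform = UsdGeom.Xform.Define(stage, f"/Library/asset_{i:03d}")
    xform.GetPrim().GetReferences().AddReference(path)
    # Lay the assets out in a row so they can be browsed side by side.
    UsdGeom.XformCommonAPI(xform.GetPrim()).SetTranslate(Gf.Vec3d(i * 250.0, 0.0, 0.0))

stage.GetRootLayer().Save()
```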

During the review phase, the team can compare photorealistic models with incredible visual fidelity side by side in a shared space, ensuring the models are “created to the highest set of standards,” said Norris.

The Last Oil Rig on Earth 

Sierra Division’s The Oil Rig video is set on Earth’s last operational fossil fuel rig, which is visited by a playful drone named Quark. The piece’s storytelling takes the audience through an impeccably detailed environment.

Real or rendered?

A scene as complex as the one above required blockouts in Unreal Engine. The team snapped models together from a set of greybox modular pieces, ensuring the environment bits were easy to work with. Once satisfied with the environment concept and layout, the team added further detail to the models.

Building blocks in Unreal Engine.

Norris’ Lenovo ThinkPad P73 NVIDIA Studio laptop with NVIDIA RTX A5000 graphics powered NVIDIA DLSS technology to increase the interactivity of the viewport — by using AI to upscale frames rendered at lower resolution while retaining high-fidelity detail.

Sierra Division then created tiling textures, trim sheets and materials to apply to near-finalized models. The studio used Adobe Substance 3D Painter to design custom textures with edge wear and grunge, taking advantage of RTX-accelerated light and ambient occlusion for baking and optimizing assets in seconds.

Oil rig ocean materials refined in Adobe Substance 3D Painter and Designer.

Next, lighting scenarios were tested in the Omniverse USD Composer app with the Unreal Engine Connector, which eliminates the need to upload, download and refile formats, thanks to OpenUSD.

Stunning detail.

“With OpenUSD, it’s very easy to open the same file you’re viewing in the engine and quickly make edits without having to re-import,” said Norris.

Team-wide review of renders made easier with Omniverse USD Composer.

Sierra Division analyzed daytime, nighttime, rainy and cloudy scenarios to see how the scene resonated emotionally, helping to decide the mood of the story they wanted to tell with their in-progress assets. They settled on a cloudy environment with well-placed lights to evoke feelings of mystery and intrigue.

Don’t underestimate the value of emotion. ‘Oil Rig’ re-light artwork by Ted Mebratu.

“From story-building to asset and scene creation to final renders with RTX, AI and GPU-accelerated features helped us every step of the way.” — Jacob Norris

From here, the team added cameras to the scene to determine compositions for final renders.

“If we were to try to compose the entire environment without cameras or direction, it would take much longer, and we wouldn’t have perfectly laid-out camera shots nor specifically lit renders,” said Norris. “It’s just much easier and more fun to do it this way and to pick camera shots earlier on.”

Final renders were exported lightning fast with Norris’ RTX A5000 GPU into Adobe Photoshop. Over 30 GPU-accelerated features gave Norris plenty of options to play with colors and contrast, and make final image adjustments smoothly and quickly.

Pick camera angles before final composition.

The Oil Rig modular set is available for purchase on Epic Games Unreal Marketplace. Sierra Division donates a portion of every sale to Ocean Conservancy — a nonprofit working to reduce trash, create sustainable fisheries and preserve wildlife.

The Explorer’s Room

For another video, called “The Explorer’s Room,” Sierra Division collaborated with 3D artist Mostafa Sohbi. An environment originally created by Sohbi was a great starting point to expand on the idea of an “explorer” that collects artifacts, gets into precarious situations and uses tools to help him out.

Norris and Sierra Division’s creative workflow for this piece closely mirrored the team’s work on The Oil Rig.

The Explorer’s Room.

“We all know Nathan Drake, Lara Croft, Indiana Jones and other adventurous characters,” said Norris. “They were big inspirations for us to create a living environment that tells our version of the story, while also allowing others to take the same assets and work on their own versions of the story, adding or changing elements in it.”

Don’t lose your way.

Norris stressed the importance of GPU technology in his creative workflow. “Having the fastest-performing GPU allowed us to focus more on the creative process and telling our story, instead of trying to work around slow technology or accommodating for poor performance with lower-quality artwork,” said Norris.

It’s the little details that make this render extraordinary.

“We simply made what we thought was awesome, looked awesome and felt great to share with others,” Norris said. “So it was a no-brainer for us to use NVIDIA RTX GPUs.”

Traveling soon?

Money Heist 

Norris said much of the content Sierra Division creates offers the opportunity for others to use the studio’s assets to tell their own stories. “We don’t always want to impose our own ideas on a scene or an environment, but we do want to show what is possible,” he added.

Don’t get any ideas.

Sierra Division created the Heist Essentials and Tools Collection set of props to share with game developers, content creators and virtual production teams.

Photorealistic detail.

“It’s always a thrill to recreate props inspired by movies like Ocean’s Eleven and Mission Impossible, and create assets someone might use during these types of missions and sequences,” said Norris.

How much money is that?

Try to spot all of the hidden treasures.

Jacob Norris, president, co-founder and creative director of Sierra Division Studios.

Check out Sierra Division on the studio’s website, ArtStation, Twitter and Instagram.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Get started with NVIDIA Omniverse by downloading the standard license free, or learn how Omniverse Enterprise can connect your team. Developers can get started with Omniverse resources. Stay up to date on the platform by subscribing to the newsletter, and follow NVIDIA Omniverse on Instagram, Medium and Twitter. For more, join the Omniverse community and check out the Omniverse forums, Discord server, Twitch and YouTube channels.

Exploring institutions for global AI governance

New white paper investigates models and functions of international institutions that could help manage opportunities and mitigate risks of advanced AI.

Growing awareness of the global impact of advanced artificial intelligence (AI) has inspired public discussions about the need for international governance structures to help manage opportunities and mitigate the risks involved. Many discussions have drawn on analogies with the ICAO (International Civil Aviation Organization) in civil aviation, CERN (European Organization for Nuclear Research) in particle physics, the IAEA (International Atomic Energy Agency) in nuclear technology, and intergovernmental and multi-stakeholder organisations in many other domains. And yet, while analogies can be a useful start, the technologies emerging from AI will be unlike aviation, particle physics, or nuclear technology. To succeed with AI governance, we need to better understand what specific benefits and risks we need to manage internationally, what governance functions those benefits and risks require, and what organisations can best provide those functions.