An Ultimate GFN Thursday: 41 New Games, Plus ‘Baldur’s Gate 3’ Full Release and First Bethesda Titles to Join the Cloud in August


The Ultimate upgrade is complete — GeForce NOW Ultimate performance is now streaming all throughout North America and Europe, delivering RTX 4080-class power for gamers across these regions. Celebrate this month with 41 new games, on top of the full release of Baldur’s Gate 3 and the first Bethesda titles coming to the cloud as the NVIDIA and Microsoft partnership benefits gamers everywhere.

And catch GeForce NOW at QuakeCon — the popular bring-your-own-PC mega-event running Aug. 10-13 — where the in-person and digital GeForce NOW Ultimate challenge will kick off.

Plus, game on with gaming peripherals and accessories company SteelSeries, which will be giving away codes for three-day GeForce NOW Ultimate and Priority memberships, along with popular GeForce NOW games and in-game goodies.

The Ultimate Rollout

Ultimate members everywhere have unlocked their maximum PC gaming potential.

The rollout of GeForce RTX 4080 SuperPODs across the world this year lit up cities with cutting-edge performance from the cloud. RTX 3080 members were introduced to the Ultimate membership, featuring gaming at 4K resolution and 120 frames per second, or even up to 240 fps with ultra-low latency thanks to NVIDIA Reflex technology.

Ultimate memberships also bring the benefits of the NVIDIA Ada Lovelace architecture — including DLSS 3 with frame generation for the highest frame rates and visual fidelity, and full ray tracing for the most immersive, cinematic, in-game lighting experiences. Plus, ultrawide resolutions were supported for the first time ever from the cloud.

And members can experience it all without having to upgrade a single piece of hardware. With RTX 4080-class servers fully deployed, gamers can now experience ultra-high fps streaming from GeForce RTX 4080-class power in the cloud and see how an Ultimate membership raises the bar on cloud gaming.

To celebrate, the GeForce NOW team will be showing off Ultimate at QuakeCon with a special GeForce NOW Ultimate challenge. Members can register now to be first in line to get a free one-day upgrade to an Ultimate membership and see how their skills improve with 240 fps gaming when the challenge launches next week. Top scorers at QuakeCon can win various prizes, along with those participating in the challenge from home. Keep an eye out on GeForce NOW’s Twitter and Facebook accounts for more details.

It’s Party Time

The best things to pair with an Ultimate membership are the best games in the cloud. Members have been enjoying early access to Baldur’s Gate 3 from Larian Studios, the role-playing game set in the world of Dungeons and Dragons that raised the bar for the RPG genre.

Roll a nat 20 when streaming from the cloud.

Now, the full PC game launches and is streamable from GeForce NOW today. Choose from a wide selection of D&D races and classes, or play as an origin character with a handcrafted background. Adventure, loot, battle and romance while journeying through the Forgotten Realms and beyond. The game features a turn-based combat system, a dialogue system with choices and consequences, and a rich story that adapts to player actions and decisions.

Stream it across devices, whether solo or with others in online co-op mode. Those playing from the cloud will be able to enjoy it without worrying about download times or system requirements.

The Ultimate Shooters

Several titles from Bethesda’s well-known franchises — DOOM, Quake and Wolfenstein — will join the cloud this month for a mix of modern and classic first-person shooter games to enjoy across nearly all devices.

Feel the heat with the DOOM franchise, recognizable through its fast-paced epic gameplay and iconic heavy-metal soundtrack. Players take on the role of the DOOM Slayer to fight hordes of invading demons.

In addition, the Quake series features single- and multiplayer campaigns with gritty gameplay and epic music scores in which members can enjoy two sides of the legendary series.

The first Bethesda titles to heat up the cloud.

The modern Wolfenstein games feature intense first-person combat against oversized Nazi robots, hulking super soldiers and elite shock troops. Discover an unfamiliar world ruled by a familiar enemy — one that’s changed and twisted history as you know it.

Experience all of these iconic franchises with an Ultimate or Priority membership. Priority members get faster access to GeForce RTX servers in the cloud over free members, along with up to six-hour gaming sessions. Ultimate members can raze their enemies in epic 4K and ultrawide resolution, with up to eight-hour gaming sessions.

Ready, Set, Play!

Game on!

GeForce NOW and SteelSeries are rewarding gamers throughout August as part of SteelSeries’ Game On sweepstakes.

Each week, gamers will have a chance to win three-day GeForce NOW Ultimate and Priority codes bundled with popular titles supported in the cloud — RuneScape, Genshin Impact, Brawlhalla and Dying Light 2 — as well as in-game goodies.

Check GFN Thursday each week to see what the reward drop will be and head over to the SteelSeries Games site for more details on how to enter. Plus, save 20% with code “NVIDIAGAMEON” this month for premium SteelSeries products, which are perfect to pair with GeForce NOW cloud gaming.

Members can look forward to the 10 new games joining this week:

  • F1 Manager 2023 (New release on Steam, July 31)
  • Bloons TD 6 (Free on Epic Games Store, Aug. 3)
  • Bloons TD Battles 2 (Steam)
  • Brick Rigs (Steam)
  • Demonologist (Steam)
  • Empires of the Undergrowth (Steam)
  • Stardeus (Steam)
  • The Talos Principle (Steam)
  • Teenage Mutant Ninja Turtles: Shredder’s Revenge (Steam)
  • Yet Another Zombie Survivors (Steam)

And here’s what the rest of August looks like:

  • WrestleQuest (New release on Steam, Aug. 7)
  • I Am Future (New release on Steam, Aug. 8)
  • Atlas Fallen (New release on Steam, Aug. 10)
  • Sengoku Dynasty (New release on Steam, Aug. 10)
  • Tales & Tactics (New release on Steam, Aug. 10)
  • Moving Out 2 (New release on Steam, Aug. 15)
  • Hammerwatch II (New release on Steam, Aug. 15)
  • Desynced (New release on Steam, Aug. 15)
  • Wayfinder (New release on Steam, Aug. 15)
  • The Cosmic Wheel Sisterhood (New release on Steam, Aug. 16)
  • Gord (New release on Steam, Aug. 17)
  • Book of Hours (New release on Steam, Aug. 17)
  • Shadow Gambit: The Cursed Crew (New release on Steam, Aug. 17)
  • The Texas Chain Saw Massacre (New release on Steam, Aug. 18)
  • Bomb Rush Cyberfunk (New release on Steam, Aug. 18)
  • Jumplight Odyssey (New release on Steam, Aug. 21)
  • Blasphemous 2 (New release on Steam, Aug. 24)
  • RIDE 5 (New release on Steam, Aug. 24)
  • Sea of Stars (New release on Steam, Aug. 29)
  • Trine 5: A Clockwork Conspiracy (New release on Steam, Aug. 31)
  • Deceit 2 (New release on Steam, Aug. 31)
  • Inkbound (Steam)
  • LEGO Brawls (Epic Games Store)
  • Regiments (Steam)
  • Session (Epic Games Store)
  • Smalland: Survive the Wilds (Epic Games Store)
  • Superhot (Epic Games Store)
  • Terra Invicta (Epic Games Store)
  • Wall World (Steam)
  • Wild West Dynasty (Epic Games Store)
  • WRECKFEST (Epic Games Store)
  • Xenonauts 2 (Epic Games Store)

A Jammin’ July

On top of the 14 games announced in July, four extra joined the cloud last month:

  • Let’s School (New release on Steam, July 26)
  • Grand Emprise: Time Travel Survival (New release on Steam, July 27)
  • Dragon’s Dogma: Dark Arisen (Steam)
  • OCTOPATH TRAVELER (Epic Games Store)

What are you looking forward to streaming this month? Let us know your answer on Twitter or in the comments below.


Collaborators: Data-driven decision-making with Jina Suh and Shamsi Iqbal



Episode 144 | August 3, 2023

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a new Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with.

In this episode of the podcast, Dr. Gretchen Huizinga welcomes Principal Researcher Dr. Jina Suh and Principal Applied and Data Science Manager Dr. Shamsi Iqbal to the show to discuss their most recent work together, a research project aimed at developing data-driven tools to support organizational leaders and executives in their decision-making. The longtime collaborators explore how a long history of collaboration helps them thrive in their work to help workplaces thrive, how their relationship has evolved over the years, particularly with Iqbal’s move from the research side to the product side, and how research and product can align to achieve impact.

Transcript

[TEASER] [MUSIC PLAYS UNDER DIALOGUE]

JINA SUH: So in this particular project we’re working on now, we’re focusing our attention on people leaders. And these people leaders have to make decisions that impact the work practices, work culture, you know, eventually the wellbeing of the team. And so the question we’re raising is how do we design tools that support these people leaders to attend to their work practices, their team, and their environment to enable more decisive and effective action in a data-driven way?

SHAMSI IQBAL: And so we need to think big, think from an organizational point of view. And then we need to think about if we walk it back, how does this impact teams? And if we want teams to function well, how do we enable and empower the individuals within those teams?

GRETCHEN HUIZINGA: You’re listening to Collaborators, a Microsoft Research Podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga.

[MUSIC ENDS]


I’m here today with Dr. Jina Suh, a Principal Researcher in the Human Understanding and Empathy group at Microsoft Research, and Dr. Shamsi Iqbal, the Principal Applied and Data Science Manager for the Viva Insights team in the Microsoft Data Platform and Growth organization. Jina and Shamsi are collaborating on a research project for Viva Insights that they hope will help people leaders reduce uncertainties and make informed and data-driven decisions for their teams. Before we unpack how they hope to do that, let’s meet our collaborators. Jina, I’ll start with you. Tell us a bit more about the Human Understanding and Empathy group at Microsoft Research and your role there. So what lines of research does the group pursue? And what are your particular interests and passions?

JINA SUH: Thank you for having me here, first of all. So our group does exactly what the name says. [LAUGHTER] We use technologies to gain understanding about people, and we try to design and develop technologies that use this understanding towards human wellbeing. Um, so we try to build technologies that are more empathic, you know, therapeutic; they can assist and augment people and so forth. And so my particular area of interest is in, um, identifying challenges and barriers for mental wellbeing and designing technology interventions and tools to support mental wellbeing. And so while I’ve done this research in the clinical domains, my interest of late has been around workplace wellbeing or mental health in non-clinical domains.

HUIZINGA: Mm hmm. So tell me a little bit more; when you say human understanding and empathy, and yet you’re working with machines, um, are you focused more on the psychological aspects of understanding people and applying those to machine technologies?

SUH: That and, and some more. So we use technologies to gain better understanding about people, their psychologies, their physiology, the contexts around them, whether they’re at work in front of a computer or they’re working out. But we also use technology to bring interventions in a way that actually is more accessible than traditional, I guess, going to therapists, um, and seeing, seeing somebody in person. So we try to bring technologies and interventions in the moment of wherever you are.

HUIZINGA: Yeah, we could have a whole podcast on “can machines have empathy?” but we won’t … today! Maybe I’ll have you back for that. Uh, Shamsi, you’ve had research space in Building 99, and now you’re embedded in a product group. So give us a brief guided tour of Microsoft Viva, and then tell us about the specific work you’re doing there.

SHAMSI IQBAL: Well, thanks, Gretchen. I’m super excited to be here and especially with my good friend Jina today. So let me talk a little bit about Microsoft Viva first. So, um, this is an employee experience platform that is built for organizations, teams, and individuals to support their workplace experience needs. And by experience needs, what I mean is, uh, things that help them thrive at work. So you could imagine these are data-driven insights about how they work and how they achieve their goals, curated knowledge that helps them get to their goals. There are opportunities to foster employee communications and collaborations. There is also learning opportunities tailored to their needs … um, all elements that are really important for people thriving at work.

HUIZINGA: So give me like a 10,000-foot view. I understand there’s four sort of major elements to, to Viva, and you’re particularly in the Insights. What are the other ones, and what does each of them do kind of in the context of what you just said?

IQBAL: So there are a few, and there are a few that are also coming on board soon.

HUIZINGA: Ooohhh! [LAUGHS]

IQBAL: So there is, for example, there is Viva Engage, uh, that helps with employee communication and collaboration. There is Viva Goals that helps you with exactly what I said— goals—and helps you achieve outcomes. There is Viva Topics that helps you with the knowledge generation and knowledge curation and contextually, uh, help people get access to that knowledge. So, so these are a few examples of the modules that are within Viva. And I work in Viva Insights. I lead a team of applied scientists and data scientists, and our particular charter is to bring deep science into the product. And we do this through incubation of new ideas, where we get to partner with MSR and OAR and all these other cool research groups that exist in, in Microsoft. And then we are also tasked with translating some of these complex research findings into practical product directions. So these are kind of like the two main things that we are responsible for.

HUIZINGA: So without giving away any industry secrets, can you hint at what is coming online, or is that all under wraps right now?

IQBAL: Well, I mean, some things are obviously under wraps, so I shouldn’t be letting out all the secrets. But in general, right now, we are focusing on really organizations and organizational leaders and how we can help them achieve the outcomes that they’re trying to achieve. And we are trying to do this in a, in a data-driven way where we can show them the data that is going to be important for them, to help them, uh, reach their organizational decisions. And also at the same time, we want to be able to point them towards actions and provide them support for actions that will help them better achieve those outcomes.

HUIZINGA: Yeah. I’ll be interested when we get into the meat of the project, how what Jina does with human understanding and empathy plays into what you’re doing with kind of business/workplace productivity and I guess we’ll call it wellbeing, because if you’re doing a good job, you usually feel good, right?

IQBAL: Right. So yeah, I think that that’s really important, and that’s where Jina and I kind of started, is that thinking about productivity and wellbeing really being intertwined together. So I mean, if you’re not feeling good about yourself, you’re really not going to be productive. That is at an individual level …

HUIZINGA: … and vice versa. Shamsi, before we go on, I want you to talk a little bit more about your move from research to product. And I’ve often called this the human version of tech transfer, but what prompted the move, and how is your life the same or different now that you’re no longer sort of a research official?

IQBAL: Well, I still like to think of myself as a research official, but I’ll, I’ll give you a little bit of history behind that. So I was in Microsoft Research for, what, 13 years and kind of like settled into my role thinking that, well, going for the next big project and working on the research insights and kind of like making impact at the research level was satisfying. And then COVID happened. 2020. The workplace transformed completely. And I took a step back, and I was thinking that, well, I mean, this may be the time to take the research that we have done for so many years in the lab and take it to practice.

HUIZINGA: Yeah.

IQBAL: And so there was this opportunity in Viva Insights at that point, which was just announced, and I felt that, well, let me go and see if I can do actually some tech transfer there, so bring in some of the work that we have been doing in MSR for years and see how it applies in a practical setting.

HUIZINGA: Yeah. And, and obviously you’re connected to a lot of the people here like Jina. So having followed both of you—and not in a stalker kind of way—but following you and your work, um, over the last couple of years, I know you have a rich history of both research and academic collaborations, but I want to know what you’re working on now and kind of give us an explanation of the project and how it came about. I like to call this “how I met your mother.” Although you guys have known each other for years. So, Jina, why don’t you take the lead on this, and Shamsi can corroborate if the story’s accurate on how you guys got together on this, but what are you doing with this project, and, um, how did it come about?

SUH: Yeah, so I wanted to kind of go back to what Shamsi was saying before. We’ve been collaborating for a while, but our common passion area is really at the intersection of productivity and, and wellbeing. And I think this is like, I don’t know, a match made in heaven … is to help people be productive by, um, you know, helping them achieve wellbeing at work and vice versa. And so my focus area is individual wellbeing. And as prior literature should have warned me, and I had, I had been ignoring, helping individuals can only go so far. There are organizational factors that make it difficult for individuals to perform activities that help them thrive. And so Shamsi and I have been working on several projects that started as individual tools, but later it really revealed fundamental organizational issues where we couldn’t just ignore factors outside of an individual. So in this particular project we’re working on now, we’re focusing our attention on people leaders. These are organizational leaders and executives like C-suites, as well as middle managers and line managers. And these people leaders have to make decisions that impact the work practices, work culture, you know, eventually the wellbeing of the team. And sometimes these decisions are made with a hunch based on anecdotes or these decisions are not made at all. And so the question we’re raising is how do we design tools that support these people leaders to attend to their work practices, um, their team, and their environment to enable more decisive and effective action in a data-driven way? And so what is the role of data in this process, and how do we facilitate that, you know, reflexive conversation with data to reduce uncertainty about these decisions?

HUIZINGA: Mmmm. You know what I love, and this is just a natural give-and-take in the conversation, but the idea that the, the individual is a situated being in a larger cultural or societal or workplace setting, and you can’t just say, well, if I make you happy, everything’s cool. So you’ve got a lot of factors coming in. So it’s really interesting to see where this work might go. I love that. Some collaborators just started working with each other and you guys, because you’ve been together for some time, have an established relationship. How do you think, Shamsi, that that has impacted the work you do or … if at all? I mean, because we’re focusing on collaboration, I’m kind of interested to tease out some of the people on the show who have just started working together. They don’t know each other. They’re in different states, different countries. Are there any advantages to working together for years and now doing a collaboration?

IQBAL: Oh! I can name so many advantages! Jina and I have worked closely for so long. We know each other’s strengths, weaknesses, working styles. So I mean, when I moved into Viva Insights, I knew that Jina was going to be a close collaborator not just because of the projects that we had worked on and the natural connection and alignment to Viva Insights, but it’s just because I knew that whenever I have a question, I can go to Jina and she’s going to go and dig out all the research. She maybe already knew that I was going to ask that question, and she had that already ready. I don’t know. I mean, she seems to read my mind before even I know the questions that I was going to ask her. So this was very natural for me to continue with that collaboration. And my colleagues over in Viva Insights, they also know Jina from previous, uh, interactions. So whenever I say that, “Well, I’m collaborating with Jina Suh in MSR,” and they say, “Oh yeah, we know Jina! So we are in good hands.”

HUIZINGA: Jina, is that true? You can read minds?

SUH: I am …

HUIZINGA: Or just hers? [LAUGHTER]

SUH: I’m … I’m sweating right now! [LAUGHTER] I’m so nervous.

HUIZINGA: Oh my gosh! … Well, so how about you? I mean, you had talked earlier about sort of a language barrier when you don’t know each other and, and because you’ve both been researchers, so … what advantages can you identify from this kind of connection?

SUH: Yeah, I think having Shamsi in the product group and having Shamsi being, uh, in Microsoft Research before, she knows how to translate my words into the product words, and I, I, I’m pretty sure, Shamsi, you, you might have struggled at the beginning. I’m not sure … at the product group.

IQBAL: I did, I did. I still do! [LAUGHTER]

SUH: But I think that struggle, I mean, she knows how to amplify my research and how to, um, talk about it in a way that the product groups will appreciate it. And she finds and identifies opportunities where research is needed, where I could actually shine. You know, before it was like, “I did all this research! Come look at me! Come look at me!” Shamsi will like, you know, find me, find opportunities for me, and say, “Hey, we have this gap. Can you come and speak about that?” And so, I don’t know, having that bridge I think really helps. And Shamsi is more than a collaborator. She’s more of a mentor for me, um, in that regard.

HUIZINGA: Awesome … so academics definitely have a particular way of talking and writing … and communicating, and product people do, too. So what might be an example of what would need that kind of level of translation, if you will? What, what might not someone get?

IQBAL: I think one of the things that I am still learning, and it took me a while to get to that point where I really started understanding that I need to fill this gap because there is a substantial gap between where research findings are and how that actually resonates with a product team.

HUIZINGA: Interesting.

IQBAL: And I think that the biggest question there is that taking the research findings and situating it in a real-world problem. So in the product world, we talk a lot about customer needs. And coming from research, I had the idea that, well, I will identify the problems and if it’s a compelling problem and I have a good solution, the product teams will come. I have no responsibility in figuring out the customer need because, come on, I already identified a great problem. And I think that I have been humbled over the past couple of years that that is quite not how it works. But being in that space now allows me … every time I come across a research question or I’m setting up some kind of a hypothesis, I take a step back and think about, OK, so how does it relate to the product? What customer need, for now, are we solving, or a future customer need are we going to solve?

HUIZINGA: Right.

IQBAL: And I think that with Jina, she keeps me grounded in the research, but she also has an engineering background, as well. So it is not that she does not understand the space. She understands the constraints in implementation in building something. So that is really good for me because I can borrow that knowledge. And when I talk to my product colleagues, I, I can leverage that, as well.

HUIZINGA: That’s, that’s hilarious because you’ve just identified a third group, which is engineering. Jina, I’m interested to know how previous collaborations might have informed the approach to the work you do now, so briefly talk about your early work together and what learnings could you share from any early successes or failures?

SUH: Yeah, maybe this is a little one-sided than a story about a collaboration, but I’ve, I’ve always looked up to Shamsi as kind of like the expert in productivity and wellbeing, so …

IQBAL: I am now blushing! [LAUGHTER]

SUH: So I think I’m more on the receiving end in this collaborative relationship …

IQBAL: That is not true! [LAUGHTER]

SUH: So, you know, I’ve always been passionate about mental health and emotional wellbeing, and unfortunately, mental health isn’t a high priority for a lot of people and organizations. And, you know, in the workplace, it’s sometimes tricky whether this concept of mental health should or should not be part of the work equation. And I’ve always admired how Shamsi was able to naturally … I mean, it’s, it’s kind of amazing how seamlessly she’s integrating aspects of mental health into the research that she does without really calling it mental health. [LAUGHTER] So, for example, like helping people transition in and out of work and disengage from work. I mean, how close to mental health could that be? You know, taking breaks at work, helping people focus more and distract less … like all these studies around attention that she’s done for years. So these kinds of, um, way that Shamsi is able to bring aspects of something that I’m really passionate about into, uh, into the workplace and into a language where product groups and the businesses really care about, that’s one of my biggest learnings from looking up to Shamsi and working together. You know, she’s constantly helping me, trying to understand, um, how do we actually formulate and, and talk about wellbeing in the context of the workplace so that leaders and organizational leaders, as well as product and business owners, as well, Microsoft in general, appreciate the work that we do. So that’s been really a blessing to have Shamsi be my partner …

HUIZINGA: Yeah. Shamsi, do you want to spread the love back on her? [LAUGHS]

IQBAL: Yeah, I think that I get motivated by Jina every single day, and, um, I think one of the things, which … I was going to interrupt you, but you were, you were, you were articulating this so nicely that I felt that I needed to wait and not interrupt and then pick an opportune moment to interrupt! So I, I, I am bringing back my PhD thesis into this conversation. [LAUGHS]

HUIZINGA: Right on!

IQBAL: Yeah! So, so one thing which was super interesting to me when I moved over to the product group and I was starting to look deeply into how we can bring some of the wellbeing-related work into the product. And I started digging into the organizational behavior literature. And what was fascinating is that everything we talked about in terms of wellbeing had a different definition in terms of like workplace thriving. And so things about giving people mental space to work and giving people opportunity to grow and belonging and all of these constructs that we typically relate to mental health, those are actually important workplace wellbeing constructs that have a direct relationship to workplace outcomes. And so I tried to reframe it in that way so that it doesn’t come across as a “good to have”; it comes across as a “really necessary thing to have.” What has happened over the past few months or so, I would say, there has been a shift in how organizations are thinking about people and individuals and wellbeing and productivity, and this is just a function of how the world is right now, right? So organizations and leaders are thinking that maybe now is the time to bring the focus back onto outcomes— productivity, revenue. And it seems that all the other things that we were focusing on … 2020, 2021 … about really being employee-centric and allowing employees to bring their best selves to work, it seems on the surface that that has gone sideways. But I’m going to argue that we should not be doing that because at the end of the day, the individuals are the ones whose work is going to aggregate up to the organizational outcomes. So if we want organizations to succeed, we need individuals to succeed. And so, um, at the beginning of the podcast, Gretchen, you were talking about individuals being parts of organizations. So individuals are embedded in teams; teams are embedded in organizations. 
So if organizations figure out what they want to do, it kind of bubbles down to the individuals. And so we need to think big, think from an organizational point of view, because that keeps us scoped and constrained. And then we need to think about if we walk it back, how does this impact teams? And if we want teams to function well, how do we enable and empower the individuals within those teams? So, so it’s a, it’s a bigger construct than what we had originally started with. And I think that now I am also pushing myself to think beyond just individuals and thinking about how we can best support individuals to thinking about how that actually bubbles up to an organization.

HUIZINGA: Right.

SUH: This is exactly what I’m talking about. [LAUGHTER]

HUIZINGA: Yeah, no, I’m …

SUH: This is exactly … She’s helping me. [LAUGHS]

HUIZINGA: It’s a podcast and you can’t see Jina smiling and nodding her head as Shamsi’s talking. Umm, let’s, let’s drill in a little bit on alignment between product and research, because we talked a little bit earlier about the language barrier and sometimes the outcome difference. And it’s not necessarily conflicting, but it might be different. How do you get to what I’ll call the Goldilocks position of alignment, and what role do you think collaboration plays, if any, in facilitating that alignment?

IQBAL: So it is not easy. And, I mean, obviously, and I think that again, this is where I’m starting to learn how to do this better. I think that starting off with a problem that a product team cares about—and the leaders in the product team care about—I think that that’s where we really want to start. And in this particular collaboration that Jina and I are right now, um, we started off with having a completely different focus, and then in January I came back and told Jina, scratch that; we’ll have to go back to the drawing board and change things! And she didn’t bat an eyelash. Because I was expecting that she would push back and say that, well, I, I have things to deliver, as well. You can’t come and randomize me. But I mean, knowing Jina, she was completely on board. I mean, I was worried. She was less worried than I was. But I think that, um, going back to your original question, I think that alignment in terms of picking a problem that product teams care about, I think that that’s super important. I think that then, going back to the original point about translating the research findings, for this particular project, what we are doing is that we are looking at something that is not going to block the product right now in any way. We are looking at something in the future that will hopefully help, and we are really trying to understand this population in, in a much more deeper way than what we have done before.

HUIZINGA: Right, right. Jina on that same note, you know, Shamsi’s actually sitting over in product right now, so she’s talking about finding a problem that product people care about. But what about from the research angle? How do product people get you on board?

SUH: Yeah, I think for me, I, I wonder about my role in Microsoft Research. You know, why am I in Microsoft doing research? Why am I not doing research somewhere else? So I try to make a concerted effort to see if there are collaborators outside of just research to justify and to make sure that my impact is beyond just research. And so it was just natural, like, you know, me and Shamsi having shared interests, as well as her, you know … collaborating together with Shamsi and her moving to Viva was a natural kind of transition for me. And so having connections into Viva and making sure that I participate in, you know, Viva share-outs or other things where I learn about the product’s priorities, as well as the questions that they have, concerns, challenges that they’re facing. Those are all great opportunities for me to learn about what the product is going through and how I can maybe think about my research direction a little bit differently. You know, I feel like every research question can be morphed into different things, can be looked at it from different perspectives. But having that extra, um, signal from the product group helps me, you know, think about it in a different way, and then I can approach the product group and say, hey, I heard your challenges, and I thought about it. Here’s my research direction. I think it aligns. You know, it’s kind of a back-and-forth dance we have to play, and sometimes it doesn’t quite work out. Sometimes, you know, we just don’t have the resources or interests. But you know, in this case with Shamsi and Viva, I think our interests are just like perfectly aligned. So, you know, Shamsi talked about pivoting … Shamsi has given me enough warnings or, you know, kind of signals that things are changing early enough that I was already thinking about, OK, well, what does it mean for us to pivot? So it wasn’t that big of a deal.

HUIZINGA: Well, yeah, and we’ll get to pivot in a second. So the interesting thing to me right now is on this particular project, where you’re working on data-driven insights for people leaders to make data-driven decisions, how do you then balance say, you have a job at Microsoft Research, you have a lane that you’re running in in terms of deliverables for your bosses, does that impact you in terms of other things you’re doing? Do you have more than one project on the go at a time, or are you pretty focused on this? How does it look?

IQBAL: Jina is smiling! [LAUGHTER]

SUH: I think the DNA of a researcher is that you have way too many things going [LAUGHS] on than you have hands and arms to handle them, so, yes, I have a lot of things going on …

HUIZINGA: What about the researchers? Do the product people have to figure out something that the researchers care about?

IQBAL: So when we started first conceptualizing this project, I think that we started off with the intent that we will have research outcomes and research contributions, but that would be constrained within a particular product space. I think that that’s how we kept it kind of like both interesting for research and for product.

HUIZINGA: Got it.

IQBAL: I mean, one of the responsibilities that I have in my new role is that I also have to kind of deliver ideas that are not going to be immediately relevant maybe but towards the future. And so this gives me the opportunity to explore and incubate those new ideas. Maybe it works out; maybe it doesn’t. Maybe it creates a new direction. The product team is not going to hold me accountable for that because they have given me that, that flexibility that I can go and explore.

HUIZINGA: Have some runway …

IQBAL: Yeah. And so that’s, that’s why I tried to pick something—or Jina and I worked together to pick something—which would have enough interest as a research contribution as well as something that could be picked up by product leader, as well.

HUIZINGA: That’s a perfect way of putting it. You know, speaking of incubation, in some ways, Microsoft is well known for its internships. And you have an intern working on this project right now. So it’s sort of a Microsoft/Microsoft Research/university, um, collaboration. Jina, tell us about the student you’re working with and then talk about how Microsoft Research internships are beneficial, maybe … and anchor that on this particular project.

SUH: So the intern that is working on our project is Pranav Khadpe. He’s a PhD student in the, uh, Human-Computer Interaction Institute at Carnegie Mellon University. So Pranav studies and builds infrastructures that strengthen interdependence and collaboration in occupational communities, which is, I think, really aligned to what this podcast is trying to do!

HUIZINGA: Yeah, absolutely.

SUH: So for example, he builds technologies that support people seeking help and getting feedback, getting mentoring and a sense of belonging through interaction with others within those communities. And I think internships primarily have the benefit of mentoring for the future generation of researchers, right? We’re actually building this pipeline of researchers into technology companies. We’re giving them opportunities to, to experience what it’s like to be in the industry research and use that experience and, and entice them to come work for us, right? [LAUGHS]

HUIZINGA: Right. It’s a farm team!

SUH: Right. [LAUGHTER] Um, so I feel like I have this dual role at Microsoft Research. On one hand, we are researchers like Shamsi and I. We need to continue to push the boundaries of scientific knowledge and disseminate it with the rest of the world. But on the other hand, I need to bring value of that research back into our products and business, right? And so, um, internships that are designed with product partners are really forcing functions for us to think about this dual role, right? It’s a learning opportunity for all of us involved. So from the research side, we learn how to find the right balance between research and product, um, and ensuring that we do successful technology transfer. But from the product perspective, they learn how to apply scientific rigor in their product decisions or decisions that they make … or designs that they make. And it’s a, it’s a really great opportunity for Pranav to be sitting in between the product and research. He’s not only learning what he’s already being trained to do in his PhD, being mainly an independent researcher, but he’s also learning how to bring that back into the product. So now he’s being trained not only to be a researcher in MSR but perhaps an applied scientist in the industry, as well. So I think there’s that benefit.

HUIZINGA: And that gets back to the individual being situated within an organization. And so being an independent researcher is cool, but you’re always going to be working with a team of some kind … if you want a paycheck. [LAUGHS] Or you can just go off and invent. Shamsi, I always ask what could possibly go wrong? Some people hate that question, but I think it’s worth asking. And while I know that data driven—quotation marks around that, air quotes around that—is generally a positive buzz-phrase in decision-making today, I wonder how you’re addressing this augment-not-replace mandate in the work you’re doing. How do you keep humans in the loop with real life wisdom and emotions and prevent the march toward what I would call outsourcing decision-making, writ large, to machines?

IQBAL: I think it’s a, it’s a great question, and it’s a very timely one, right? And, uh, the way that I like to think about it, being a human-computer interaction researcher is—who is now dealing with a lot of data—is that data can show but not necessarily tell. And I think that the “telling” part comes from the humans. Maybe in the future, AI and data will be able to do the telling job better, but right now, humans have the context. So a human being who has the context can look at the data and explain why it is showing certain things. And I think that that’s where I feel that the integration of the human in that process is so important. The challenge is showing them the right data. I think that that is also where the human comes in, in figuring out what they need to see. The data can show them that, and then the human gets to tell the story around it.

HUIZINGA: OK. Jina, do you have any insights on that?

SUH: One of the challenges is that, um, we’re not only just building tools to help people look at the data and contextualize it and explain it and understand it but also show them how powerful it can be in changing their behaviors, changing their organization. And it takes work. So one challenge that, you know, we were just having a discussion over lunch is that, how do we actually get people to be motivated enough to interact with the data, have a conversation with the data? And so these are some of the questions that we are still … uh, we don’t have an answer for; we’re still trying to answer. But, you know, our role is not just to feed data and help them look at it and tell a story about it, but also demonstrate that it’s empowering so that they can have more engaging experience with that data.

HUIZINGA: Tell me what you mean by having a conversation with the data. I mean, what does that look like?

SUH: The obvious … most obvious example would be with the large language models.

HUIZINGA: OK!

SUH: You can have a …

HUIZINGA: An actual conversation!

SUH: An actual conversation with the data. But you can also do that through user experience, right? You can be asking questions. I think a lot of these things happen naturally in your head. You’re formulating questions about data. You’re finding insights. You move on to the next question. You become curious. You ask the next question. You explain it. You bring your context to it and then you explain it. So that sort of experience. But I think that takes a lot of work. And we need to make sure that we entice them to, to make sure that there’s value in doing that extra work.

HUIZINGA: Well, and the fact that you’re embedded in a group called Human Understanding and Empathy and that your intern is on human-computer interaction, the human-centeredness is a huge factor in this kind of work today. Ummm. The path from lab to life, as they say—wait, it’s what I say!—is not necessarily straight and linear and sometimes you have to pivot or, as I like to say, add a kick ball change to the research choreography. How did this work begin? Shamsi, you told a story, um, early on, and I think people like stories. I’m going to have you both address this, but I want Shamsi to go first. How did it begin, and then how did it change? What were the forcing functions that made the pivot happen and that you both reacted to quite eagerly, um, both to meet the research and organizational outcomes?

IQBAL: As we said many times in this podcast, I mean, Jina and I, we, we naturally gravitate towards the same research problems. And so, we were looking at one aspect of Viva Insights last year with another intern, and that internship project, apart from being really impactful and well-received in Viva Insights, as well, I think it was just a lot of fun doing a joint internship project with Jina. And so this time when the application period came around, it was a no-brainer. We were going to submit another proposal. And at that point … based on some of the work that we had done last year … so we were really going to focus on something around how we can help people managers and their reports have better conversations and … towards their goals. Right, Jina? I think that that’s where we had kind of like decided that we were going to focus on. And then we started interviewing interns with that project in mind, and then it was December or January where things shifted, the tech world went through quite a tumultuous time, and, uh, we just had to pivot because our organization had also changed directions and figured that, well, we need to focus more on supporting organizations and organization leaders through this time of change. Which meant that we could still do our internship project, but it just didn’t seem right in terms of what we could do, in terms of impact and maybe immediate impact, too, uh, for the field. So Jina and I talked. I recommended that we shift the intern that we had talked to. I think that we had already talked to Pranav. I mean, he seemed really versatile and smart. And then we just decided, I think he’ll be OK. I think that he will be fine. I think that he would actually be even a better fit for the project that we have in mind.

HUIZINGA: Yeah. You know, as you’re talking, I’m thinking, OK, yeah, we think of the workers in a workplace, but the pressure on leaders or managers is intense to make the right decision. So it seems really in line with the empathy and the productivity to bring those two together, to, to help the people who have the severe pressure of making the right decisions at the right time for their teams, so that’s awesome. Jina, do you have anything to add to the pivot path? I mean, from your perspective.

SUH: Yeah, I mean, like I was saying, I think Shamsi was giving us, or me, plenty of signals that this might be happening. So it gave me plenty of opportunities to think about the research. And really, we didn’t make too many changes. I mean, I’d, I’d like to think that we’re, we’re trying to get at the same problem but from a slightly different angle. And so, you know, before it was individual and manager conversations. Now we’re going from, you know, managers to organizational leaders. At the end of the day, like the real transformative change in an organization happens through the leadership. And so, I think before, we were just kind of trying to connect the individual needs to their, to their immediate managers. But now I think we’re going at the problem in a more fundamental way, really tackling the organizational leaders, helping them make the right decisions to, to help their organizations thrive. And I’m more excited about this new direction than before.

HUIZINGA: Yeah. You know, I hadn’t asked this, and I should be asking it to everyone that comes in the booth or on screen. What are the primary methodologies you engage with in this research? I mean, quantitative, qualitative, mixed?

SUH: Yeah, we, we do everything, I think. [LAUGHS] Um, I, I think that’s the beauty of the human-computer interaction … the space is huge. So we do anything from qualitative interviews, uh, you know, contextual inquiry, like observing people, understanding their work practices, to interviewing people, as well as running survey studies. We’ve done purely quantitative studies, as well, looking at Viva Insights data and understanding the correlation between different signals that Viva Insights is providing with workplace stress at a large scale, at high fidelity, so …

IQBAL: I think another thing that I would add is that sometimes we also build prototypes based on the ideas that we come up with and so we get to evaluate those prototypes in, in smaller scale but in far more depth. And so those kinds of results are also super important for the product teams because that helps bring those ideas to life, is that well, I understand your research and I understand your research findings, but what do I do with it? And so if you have a prototype and that shows that, well, this is something that you might be able to do, and then it’s up to them to figure out whether or not this is actually scalable.

HUIZINGA: Well, as we wrap up, I’d like to give each of you, uh, the chance to do a little future envisioning. And I know that that’s kind of a researcher’s brain anyway, is to say what kind of a future do I want to help build? But how will the workplace be different or better because of the collaborative work you’ve done? Jina, why don’t you go first.

SUH: As researchers, I think it’s our job to get ahead of the curve and to really teach the world how to design technology in a, in a way that considers both its positive and negative impacts. So in the context of using data at work, or data about work, how the data gets used for and against people at work …

HUIZINGA: Ooohh!

SUH: … there’s a lot of fear. Yes, there’s a lot of fear about workplace surveillance.

HUIZINGA: Yeah!

SUH: And so the question for us is, you know, how do we demonstrate that this data can be used ethically and responsibly and that there is value in, in this data. So I, I am hoping that, you know, through this collaboration, I’m hoping that we can pave the way for how to design these technologies responsibly, um, and, and develop data-driven practices.

HUIZINGA: Shamsi, close the show with us and tell me your preferred future. What, what are you going to contribute to the workplace world?

IQBAL: So I would just add one more thing. I think that data responsibility, transparency, and ethical use of data, I think it’s at the core of Microsoft’s mission, and I think it’s on us to show that in our products. I think that the other thing, which is a little away from the data, I think that just going back to this concept of leaders and, uh, individuals, I have always maintained that there is oftentimes a tension between what an individual’s goals might be and what an organization’s goals might be. And I’m hoping through this work that we can kind of like help resolve some of those tensions, that once organization leaders are provided with the right kind of insights and data, they will be more motivated to take actions that will be also beneficial to individuals. Oftentimes, that connection is not very clear, uh, but I’m hoping that we can shed some light on it.

HUIZINGA: Jina, Shamsi, so good to see you again. Smiles so big. Thanks for coming in and sharing your “insights” today.

SUH: Thank you for having us.

IQBAL: Thank you so much. This was a lot of fun.

The post Collaborators: Data-driven decision-making with Jina Suh and Shamsi Iqbal appeared first on Microsoft Research.


Hugging Face Joins the PyTorch Foundation as a Premier Member



The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Hugging Face has joined as a premier member.

Hugging Face has been a longtime supporter of and contributor to the PyTorch ecosystem, providing powerful models and resources that accelerate research, development, and adoption of AI technologies, particularly in the field of natural language processing.

“Our mission has always been to democratize AI and make it accessible to everyone. We’re truly aligned with PyTorch’s objective of reducing the barrier of entry to practitioners. By joining the PyTorch Foundation, we can further amplify that impact and support this very important framework of the ecosystem that is PyTorch,” said Lysandre Debut, Head of Open Source at Hugging Face. “We believe the two ecosystems have significant overlap, and collaborating with the foundation will allow us to bridge the gap to provide the best software, the best tools to the machine learning community at large.”

Hugging Face’s Model Hub and open source libraries promote collaboration and knowledge sharing within the AI open source community, making Hugging Face a great match to the growing PyTorch Foundation. They continue to drive industry adoption and collaboration by creating user-friendly tools and resources and providing accessible and well-documented libraries.

“Hugging Face’s commitment to open source development and their exceptional contributions to the PyTorch ecosystem have truly impressed us. With their help, we will drive innovation, foster collaboration, and empower the global AI community to create transformative solutions for the AI community,” said PyTorch Foundation Executive Director Ibrahim Haddad. “We welcome Hugging Face to the PyTorch Foundation and look forward to the achievements that lie ahead.”

As a premier member, Hugging Face is granted one seat on the PyTorch Foundation Governing Board. The Board sets policy through our bylaws, mission and vision statements, describing the overarching scope of foundation initiatives, technical vision, and direction.

Lysandre Debut

We’re happy to welcome Lysandre Debut, Head of Open Source at Hugging Face, to our board. Lysandre has been at Hugging Face since the company’s pivot to open-source, and was the first engineer to focus entirely on the open-source mission. Now leading the open-source part of the organization, Lysandre remains technically involved by being a core maintainer of the Transformers library.

To learn more about how you can be a part of the PyTorch Foundation, visit our website.

About Hugging Face

Hugging Face is a community and company dedicated to lowering the barrier of entry to Machine Learning and Deep Learning. Strong advocates for open-source and open-science, their model Hub hosts more than 250,000 public models and 50,000 public datasets that are very simple to use. Transformers, Diffusers, PEFT, Accelerate, and Datasets are some of the open-source tools made available by Hugging Face.

About PyTorch Foundation

The PyTorch Foundation is a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem. The PyTorch Foundation is supported by its members and leading contributors to the PyTorch open source project. The Foundation leverages resources provided by members and contributors to enable community discussions and collaboration.

About The Linux Foundation

The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects are critical to the world’s infrastructure including Linux, Kubernetes, Node.js, ONAP, PyTorch, RISC-V, SPDX, OpenChain, and more. The Linux Foundation focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org. The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see its trademark usage page: www.linuxfoundation.org/trademark-usage. Linux is a registered trademark of Linus Torvalds.


Build a personalized avatar with generative AI using Amazon SageMaker


Generative AI has become a common tool for enhancing and accelerating the creative process across various industries, including entertainment, advertising, and graphic design. It enables more personalized experiences for audiences and improves the overall quality of the final products.

One significant benefit of generative AI is creating unique and personalized experiences for users. For example, streaming services use generative AI to generate personalized titles and artwork based on a user’s viewing history and preferences, with the goal of increasing viewer engagement. The system generates thousands of variations of a title’s artwork and tests them to determine which version most attracts the user’s attention. In some cases, personalized artwork for TV series significantly increased clickthrough rates and view rates as compared to shows without personalized artwork.

In this post, we demonstrate how you can use generative AI models like Stable Diffusion to build a personalized avatar solution on Amazon SageMaker and reduce inference cost with multi-model endpoints (MMEs) at the same time. The solution demonstrates how, by uploading 10–12 images of yourself, you can fine-tune a personalized model that can then generate avatars based on any text prompt, as shown in the following screenshots. Although this example generates personalized avatars, you can apply the technique to any creative art generation by fine-tuning on specific objects or styles.

Solution overview

The following architecture diagram outlines the end-to-end solution for our avatar generator.

The scope of this post and the example GitHub code we provide focus only on the model training and inference orchestration (the green section in the preceding diagram). You can reference the full solution architecture and build on top of the example we provide.

Model training and inference can be broken down into four steps:

  1. Upload images to Amazon Simple Storage Service (Amazon S3). In this step, we ask you to provide a minimum of 10 high-resolution images of yourself. The more images, the better the result, but the longer it will take to train.
  2. Fine-tune a Stable Diffusion 2.1 base model using SageMaker asynchronous inference. We explain the rationale for using an inference endpoint for training later in this post. The fine-tuning process starts with preparing the images, including face cropping, background variation, and resizing for the model. Then we use Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning technique originally developed for large language models (LLMs), to fine-tune the model. Finally, in postprocessing, we package the fine-tuned LoRA weights with the inference script and configuration files (tar.gz) and upload them to an S3 bucket location for SageMaker MMEs.
  3. Host the fine-tuned models using SageMaker MMEs with GPU. SageMaker will dynamically load and cache the model from the Amazon S3 location based on the inference traffic to each model.
  4. Use the fine-tuned model for inference. After the Amazon Simple Notification Service (Amazon SNS) notification indicating the fine-tuning is sent, you can immediately use that model by supplying a target_model parameter when invoking the MME to create your avatar.
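As a rough illustration of step 4, the following minimal sketch assembles the arguments for the SageMaker runtime invoke_endpoint call, whose TargetModel parameter selects which artifact on the MME serves the request. The endpoint name, artifact name, prompt, and JSON payload schema are all hypothetical; the actual schema is defined by the inference script packaged with the model.

```python
import json


def build_mme_request(endpoint_name, target_model, prompt):
    """Assemble keyword arguments for the sagemaker-runtime invoke_endpoint call."""
    # Hypothetical payload schema; the real one is set by the inference script.
    payload = {"prompt": prompt}
    return {
        "EndpointName": endpoint_name,
        # Artifact path relative to the S3 prefix the MME was created with.
        "TargetModel": target_model,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }


request = build_mme_request(
    endpoint_name="avatar-mme",                 # hypothetical endpoint name
    target_model="user-123.tar.gz",             # hypothetical fine-tuned artifact
    prompt="a portrait photo as an astronaut",  # hypothetical prompt
)

# With AWS credentials and a deployed endpoint, the request could be sent with:
# import boto3
# response = boto3.client("sagemaker-runtime").invoke_endpoint(**request)
```

Keeping the request construction separate from the network call makes it easy to check which model a given user is routed to before anything is sent.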

We explain each step in more detail in the following sections and walk through some of the sample code snippets.

Prepare the images

To achieve the best results from fine-tuning Stable Diffusion to generate images of yourself, you typically need to provide a large quantity and variety of photos of yourself from different angles, with different expressions, and against different backgrounds. However, with our implementation, you can now achieve a high-quality result with as few as 10 input images. We have also added automated preprocessing to extract your face from each photo. All you need is to capture the essence of how you look clearly from multiple perspectives. Include a front-facing photo, a profile shot from each side, and photos from angles in between. You should also include photos with different facial expressions like smiling, frowning, and a neutral expression. Having a mix of expressions will allow the model to better reproduce your unique facial features. The input images dictate the quality of the avatar you can generate. To make sure this is done properly, we recommend an intuitive front-end UI experience to guide the user through the image capture and upload process.

The following are example selfie images at different angles with different facial expressions.

Fine-tune a Stable Diffusion model

After the images are uploaded to Amazon S3, we can invoke the SageMaker asynchronous inference endpoint to start our training process. Asynchronous endpoints are intended for inference use cases with large payloads (up to 1 GB) and long processing times (up to 1 hour). They also provide a built-in mechanism for queuing requests and a task completion notification mechanism via Amazon SNS, in addition to other native features of SageMaker hosting such as auto scaling.

Even though fine-tuning is not an inference use case, we chose to use asynchronous inference here in lieu of SageMaker training jobs because of its built-in queuing and notification mechanisms and managed auto scaling, including the ability to scale down to 0 instances when the service is not in use. This allows us to easily scale the fine-tuning service to a large number of concurrent users and eliminates the need to implement and manage the additional components. However, it does come with the drawbacks of the 1 GB payload limit and the 1 hour maximum processing time. In our testing, we found that 20 minutes is sufficient time to get reasonably good results with roughly 10 input images on an ml.g5.2xlarge instance. However, SageMaker training would be the recommended approach for larger-scale fine-tuning jobs.
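To make the scale-to-zero behavior concrete, here is a minimal sketch of the arguments you might pass to the Application Auto Scaling register_scalable_target call so that the endpoint variant can shrink to 0 instances. The endpoint name, the variant name AllTraffic, and the maximum capacity are assumptions for illustration, not values from the example repository.

```python
def scale_to_zero_target(endpoint_name, variant_name="AllTraffic", max_instances=2):
    """Build arguments for application-autoscaling register_scalable_target."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 0,  # lets the endpoint scale in fully when the queue is empty
        "MaxCapacity": max_instances,
    }


target = scale_to_zero_target("sd-fine-tuning-async")  # hypothetical endpoint name

# With AWS credentials, the target could be registered with:
# import boto3
# boto3.client("application-autoscaling").register_scalable_target(**target)
```

Registering the target only sets the bounds; a scaling policy, for example one that tracks the queue backlog per instance, is still needed to drive the instance count up and down.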

To host the asynchronous endpoint, we must complete several steps. The first is to define our model server. For this post, we use the Large Model Inference Container (LMI). LMI is powered by DJL Serving, which is a high-performance, programming language-agnostic model serving solution. We chose this option because the SageMaker managed inference container already has many of the training libraries we need, such as Hugging Face Diffusers and Accelerate. This greatly reduces the amount of work required to customize the container for our fine-tuning job.

The following code snippet shows the version of the LMI container we used in our example:

# `region` is assumed to be defined earlier, e.g. region = sagemaker.Session().boto_region_name
inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.3-cu117"
)
print(f"Image going to be used is: {inference_image_uri}")

In addition to that, we need to have a serving.properties file that configures the serving properties, including the inference engine to use, the location of the model artifact, and dynamic batching. Lastly, we must have a model.py file that loads the model into the inference engine and prepares the data input and output from the model. In our example, we use the model.py file to spin up the fine-tuning job, which we explain in greater detail in a later section. Both the serving.properties and model.py files are provided in the training_service folder.
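For orientation, the following sketch renders a small serving.properties file from a Python dictionary. The key names (engine, option.model_id, batch_size, max_batch_delay) reflect common DJL Serving conventions as we understand them, and the values are hypothetical; none of this is copied from the file in the training_service folder, so check the DJL Serving documentation for the container version you use.

```python
def render_serving_properties(settings):
    """Render a dict as key=value lines, the format serving.properties uses."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# Assumed key names and hypothetical values, for illustration only.
settings = {
    "engine": "Python",                                   # inference engine
    "option.model_id": "s3://my-bucket/path/to/model/",   # hypothetical artifact location
    "batch_size": 1,                                      # dynamic batching: requests per batch
    "max_batch_delay": 100,                               # dynamic batching: wait time
}

properties_text = render_serving_properties(settings)
print(properties_text)
```

Generating the file from code keeps the three concerns named above (engine, artifact location, and batching) in one place when you package the model.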

The next step after defining our model server is to create an endpoint configuration that defines how our asynchronous inference will be served. For our example, we are just defining the maximum concurrent invocation limit and the output S3 location. With the ml.g5.2xlarge instance, we have found that we are able to fine-tune up to two models concurrently without encountering an out-of-memory (OOM) exception, and therefore we set max_concurrent_invocations_per_instance to 2. This number may need to be adjusted if we’re using a different set of tuning parameters or a smaller instance type. We recommend setting this to 1 initially and monitoring the GPU memory utilization in Amazon CloudWatch.

from sagemaker.async_inference import AsyncInferenceConfig

# create async endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=f"s3://{bucket}/{s3_prefix}/async_inference/output",  # Where our results will be stored
    max_concurrent_invocations_per_instance=2,
    notification_config={  # Amazon SNS topics notified when a job succeeds or fails
        "SuccessTopic": "...",
        "ErrorTopic": "...",
    },
)

Finally, we create a SageMaker model that packages the container information, model files, and AWS Identity and Access Management (IAM) role into a single object. The model is deployed using the endpoint configuration we defined earlier:

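import sagemaker
from sagemaker.model import Model

# model_data, role, env, instance_type, endpoint_name, and sagemaker_session
# are assumed to be defined earlier in the notebook
model = Model(
    image_uri=inference_image_uri,  # LMI container image defined earlier
    model_data=model_data,
    role=role,
    env=env
)

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    async_inference_config=async_config  # endpoint configuration defined earlier
)

predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session
)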

When the endpoint is ready, we use the following sample code to invoke the asynchronous endpoint and start the fine-tuning process:

sm_runtime = boto3.client("sagemaker-runtime")

input_s3_loc = sess.upload_data("data/jw.tar.gz", bucket, s3_prefix)

response = sm_runtime.invoke_endpoint_async(
    EndpointName=sd_tuning.endpoint_name,
    InputLocation=input_s3_loc)
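Because the call is asynchronous, the response returns immediately and the result appears in Amazon S3 later. The following is a minimal sketch of how a caller might wait for it, assuming the response's OutputLocation field holds the S3 URI of the result. The helper names and the object_exists callback are hypothetical and not part of the original solution; with boto3, the callback could wrap s3.head_object.

```python
import time
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_output(output_location, object_exists,
                    timeout_s=3600, poll_s=30, sleep=time.sleep):
    """Poll until the async result object exists, or raise TimeoutError.

    object_exists(bucket, key) -> bool is a hypothetical callback supplied by
    the caller; it should return False while the object is not there yet.
    """
    bucket, key = split_s3_uri(output_location)
    waited = 0
    while waited <= timeout_s:
        if object_exists(bucket, key):
            return bucket, key
        sleep(poll_s)
        waited += poll_s
    raise TimeoutError(f"no result at {output_location} after {timeout_s}s")
```

A call such as wait_for_output(response["OutputLocation"], object_exists) would then block until the fine-tuning job has written its output. Subscribing to the success and error topics in the notification configuration is an alternative to polling.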

For more details about LMI on SageMaker, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

After invocation, the asynchronous endpoint starts queueing our fine-tuning job. Each job runs through the following steps: prepare the images, perform Dreambooth and LoRA fine-tuning, and prepare the model artifacts. Let’s dive deeper into the fine-tuning process.

Prepare the images

As we mentioned earlier, the quality of the input images directly impacts the quality of the fine-tuned model. For the avatar use case, we want the model to focus on the facial features. Instead of requiring users to provide carefully curated images of exact size and content, we implement a preprocessing step using computer vision techniques to alleviate this burden. In the preprocessing step, we first use a face detection model to isolate the largest face in each image. Then we crop and pad the image to the required size of 512 x 512 pixels for our model. Finally, we segment the face from the background and add random background variations. This helps highlight the facial features, allowing our model to learn from the face itself rather than the background. The following images illustrate the three steps in this process.

Step 1: Face detection using computer vision
Step 2: Crop and pad the image to 512 x 512 pixels
Step 3 (Optional): Segment and add background variation
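The geometry behind steps 1 and 2 can be sketched in a few lines. This is an illustration only, not the preprocessing code from the solution. It assumes face boxes arrive as (left, top, right, bottom) pixel tuples from whichever detector is used, and the margin parameter is a hypothetical knob for how much context to keep around the face.

```python
def largest_face(boxes):
    """Pick the box with the largest area from (left, top, right, bottom) tuples."""
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


def square_crop_with_padding(face_box, image_size, margin=0.5):
    """Return (crop_box, padding) for a square region centred on a face.

    crop_box is clipped to the image bounds. padding is the number of pixels
    (left, top, right, bottom) to add so that the clipped crop is square again
    before it is resized to 512 x 512.
    """
    left, top, right, bottom = face_box
    width, height = image_size
    side = int(round(max(right - left, bottom - top) * (1 + margin)))
    cx, cy = (left + right) // 2, (top + bottom) // 2
    x0, y0 = cx - side // 2, cy - side // 2
    x1, y1 = x0 + side, y0 + side
    crop_box = (max(x0, 0), max(y0, 0), min(x1, width), min(y1, height))
    padding = (max(-x0, 0), max(-y0, 0), max(x1 - width, 0), max(y1 - height, 0))
    return crop_box, padding
```

Padding is only needed when the face sits close to an image border, so most well-framed photos end up with a plain square crop.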

Dreambooth and LoRA fine-tuning

For fine-tuning, we combined the techniques of Dreambooth and LoRA. Dreambooth allows you to personalize your Stable Diffusion model, embedding a subject into the model’s output domain using a unique identifier and expanding the model’s language vision dictionary. It uses a method called prior preservation to preserve the model’s semantic knowledge of the class of the subject, in this case a person, and use other objects in the class to improve the final image output. This is how Dreambooth can achieve high-quality results with just a few input images of the subject.
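To make prior preservation concrete, the following sketch shows how the two loss terms are typically combined in Dreambooth-style training: one term for the subject images and one, scaled by a weight, for generated images of the class. The trainer used in this post is not shown here, so treat the function names and the plain mean squared error as assumptions for illustration.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_target,
                    class_pred, class_target, prior_loss_weight=1.0):
    """Subject loss plus a weighted prior-preservation loss.

    instance_* come from the user's photos (prompted with the unique identifier),
    class_* come from generated photos of the class (for example, a person).
    """
    instance_loss = mse(instance_pred, instance_target)
    prior_loss = mse(class_pred, class_target)
    return instance_loss + prior_loss_weight * prior_loss
```

With a weight of 0, the class images are ignored and the model is free to drift away from what it knew about people in general. A weight of 1.0 gives both terms equal influence.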

The following code snippet shows the inputs to our trainer.py class for our avatar solution. Notice we chose <<TOK>> as the unique identifier. This is purposely done to avoid picking a name that may already be in the model’s dictionary. If the name already exists, the model has to unlearn and then relearn the subject, which may lead to poor fine-tuning results. The subject class is set to “a photo of person”, which enables prior preservation by first generating photos of people to feed in as additional inputs during the fine-tuning process. This helps reduce overfitting because the model tries to preserve its previous knowledge of a person using the prior preservation method.

status = trn.run(base_model="stabilityai/stable-diffusion-2-1-base",
    resolution=512,
    n_steps=1000,
    concept_prompt="photo of <<TOK>>", # << unique identifier of the subject
    learning_rate=1e-4,
    gradient_accumulation=1,
    fp16=True,
    use_8bit_adam=True,
    gradient_checkpointing=True,
    train_text_encoder=True,
    with_prior_preservation=True,
    prior_loss_weight=1.0,
    class_prompt="a photo of person", # << subject class
    num_class_images=50,
    class_data_dir=class_data_dir,
    lora_r=128,
    lora_alpha=1,
    lora_bias="none",
    lora_dropout=0.05,
    lora_text_encoder_r=64,
    lora_text_encoder_alpha=1,
    lora_text_encoder_bias="none",
    lora_text_encoder_dropout=0.05
)

A number of memory-saving options have been enabled in the configuration, including fp16, use_8bit_adam, and gradient_checkpointing. Together, these reduce the memory footprint to under 12 GB, which allows for fine-tuning of up to two models concurrently on an ml.g5.2xlarge instance.
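The concurrency limit follows from simple arithmetic. The sketch below assumes the single NVIDIA A10G GPU on an ml.g5.2xlarge has 24 GB of memory; the helper name and the headroom parameter are hypothetical.

```python
def max_concurrent_jobs(gpu_memory_gb, per_job_gb, headroom_gb=0.0):
    """How many fine-tuning jobs fit on one GPU, leaving optional headroom."""
    if per_job_gb <= 0:
        raise ValueError("per_job_gb must be positive")
    usable = gpu_memory_gb - headroom_gb
    return max(int(usable // per_job_gb), 0)


# 24 GB GPU, jobs that stay at or under 12 GB each
print(max_concurrent_jobs(24, 12))                  # 2
# Reserving 2 GB of headroom leaves room for only one job
print(max_concurrent_jobs(24, 12, headroom_gb=2))   # 1
```

This is only an estimate. Actual usage depends on the tuning parameters, which is why monitoring GPU memory utilization remains the safer way to pick the final value.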

LoRA is an efficient fine-tuning technique for LLMs that freezes most of the weights and attaches a small adapter network to specific layers of the pre-trained LLM, allowing for faster training and optimized storage. For Stable Diffusion, the adapter is attached to the text encoder and U-Net components of the inference pipeline. The text encoder converts the input prompt to a latent space that is understood by the U-Net model, and the U-Net model uses the latent meaning to generate the image in the subsequent diffusion process. The output of the fine-tuning is just the text_encoder and U-Net adapter weights. At inference time, these weights can be reattached to the base Stable Diffusion model to reproduce the fine-tuning results.
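As an illustration of the adapter idea, the following pure-Python sketch applies a LoRA update to a single linear layer: the frozen weight W is left untouched, and a low-rank product of two small matrices, scaled by alpha / r, is added to its output. The function names and the tiny matrices are illustrative only; in practice a library such as PEFT performs this on tensors.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained.

    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out, d_in, r):
    """Adapter parameters for one layer, versus d_out * d_in for the full weight."""
    return r * (d_in + d_out)
```

For a hypothetical 1024 x 1024 layer with r=8, the adapter has 16,384 trainable parameters instead of 1,048,576. With the values used in the configuration shown earlier (lora_alpha=1, lora_r=128), the scaling factor would be 1/128.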

The following figures are detailed diagrams of LoRA fine-tuning provided by the original authors: Cheng-Han Chiang, Yung-Sung Chuang, Hung-yi Lee, “AACL_2022_tutorial_PLMs,” 2022.

By combining both methods, we were able to generate a personalized model while tuning an order-of-magnitude fewer parameters. This resulted in a much faster training time and reduced GPU utilization. Additionally, storage was optimized with the adapter weight being only 70 MB, compared to 6 GB for a full Stable Diffusion model, representing a 99% size reduction.

Prepare the model artifacts

After fine-tuning is complete, the postprocessing step will TAR the LoRA weights with the rest of the model serving files for NVIDIA Triton. We use a Python backend, which means the Triton config file and the Python script used for inference are required. Note that the Python script has to be named model.py. The final model TAR file should have the following file structure:

|--sd_lora
   |--config.pbtxt
   |--1
      |--model.py
      |--output #LoRA weights
         |--text_encoder
         |--unet
         |--train.sh
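A minimal sketch of how such an archive could be produced with the Python standard library follows. It is not the postprocessing code from the solution; the function name is hypothetical, and it only checks for the two files that the text says are required.

```python
import tarfile
from pathlib import Path


def package_model(model_root, tar_path):
    """Archive model_root (for example .../sd_lora) with its folder name as the top-level entry."""
    model_root = Path(model_root)
    required = [model_root / "config.pbtxt", model_root / "1" / "model.py"]
    missing = [str(p) for p in required if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing required files: {missing}")
    # Write the archive outside model_root so it does not include itself.
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(model_root, arcname=model_root.name)
    return tar_path
```

Calling package_model("work/sd_lora", "sd_lora.tar.gz") would produce an archive whose entries start with sd_lora/, matching the structure shown above.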

Host the fine-tuned models using SageMaker MMEs with GPU

After the models have been fine-tuned, we host the personalized Stable Diffusion models using a SageMaker MME. A SageMaker MME is a powerful deployment feature that allows hosting multiple models in a single container behind a single endpoint. It automatically manages traffic and routing to your models to optimize resource utilization, save costs, and minimize operational burden of managing thousands of endpoints. In our example, we run on GPU instances, and SageMaker MMEs support GPU using Triton Server. This allows you to run multiple models on a single GPU device and take advantage of accelerated compute. For more detail on how to host Stable Diffusion on SageMaker MMEs, refer to Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker.

For our example, we made an additional optimization to load the fine-tuned models faster during cold start situations. This is possible because of LoRA’s adapter design. Because the base model weights and Conda environments are the same for all fine-tuned models, we can share these common resources by pre-loading them onto the hosting container. This leaves only the Triton config file, Python backend (model.py), and LoRA adapter weights to be dynamically loaded from Amazon S3 after the first invocation. The following diagram provides a side-by-side comparison.

This significantly reduces the model TAR file from approximately 6 GB to 70 MB, which makes it much faster to load and unpack. To do the preloading in our example, we created a utility Python backend model in models/model_setup. The script simply copies the base Stable Diffusion model and Conda environment from Amazon S3 to a common location to share across all the fine-tuned models. The following is the code snippet that performs the task:

import shutil
import tarfile
from pathlib import Path

def initialize(self, args):
    # conda env setup: copy the packed environment to a shared location once
    self.conda_pack_path = Path(args['model_repository']) / "sd_env.tar.gz"
    self.conda_target_path = Path("/tmp/conda")

    self.conda_env_path = self.conda_target_path / "sd_env.tar.gz"

    if not self.conda_env_path.exists():
        self.conda_env_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(self.conda_pack_path, self.conda_env_path)

    # base diffusion model setup: unpack the shared base model under /tmp
    self.base_model_path = Path(args['model_repository']) / "stable_diff.tar.gz"

    try:
        with tarfile.open(self.base_model_path) as tar:
            tar.extractall('/tmp')

        self.response_message = "Model env setup successful."

    except Exception as e:
        # print the exception message
        print(f"Caught an exception: {e}")
        self.response_message = f"Caught an exception: {e}"

Then each fine-tuned model will point to the shared location on the container. The Conda environment is referenced in the config.pbtxt.

name: "pipeline_0"
backend: "python"
max_batch_size: 1

...

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/tmp/conda/sd_env.tar.gz"}
}

The Stable Diffusion base model is loaded from the initialize() function of each model.py file. We then apply the personalized LoRA weights to the unet and text_encoder model to reproduce each fine-tuned model:

...

class TritonPythonModel:

    def initialize(self, args):
        self.output_dtype = pb_utils.triton_string_to_numpy(
            pb_utils.get_output_config_by_name(json.loads(args["model_config"]),
                                               "generated_image")["data_type"])
        
        self.model_dir = args['model_repository']
    
        device='cuda'
        self.pipe = StableDiffusionPipeline.from_pretrained('/tmp/stable_diff',
                                                            torch_dtype=torch.float16,
                                                            revision="fp16").to(device)
                                                            
        # Load the LoRA weights
        self.pipe.unet = PeftModel.from_pretrained(self.pipe.unet, unet_sub_dir)

        if os.path.exists(text_encoder_sub_dir):
            self.pipe.text_encoder = PeftModel.from_pretrained(self.pipe.text_encoder, text_encoder_sub_dir)

Use the fine-tuned model for inference

Now we can try our fine-tuned model by invoking the MME endpoint. The input parameters we exposed in our example include prompt, negative_prompt, and gen_args, as shown in the following code snippet. We set the data type and shape of each input item in the dictionary and convert them into a JSON string. Finally, the string payload and TargetModel are passed into the request to generate your avatar picture.

import json
import random

prompt = """<<TOK>> epic portrait, zoomed out, blurred background cityscape, bokeh,
 perfect symmetry, by artgem, artstation ,concept art,cinematic lighting, highly 
 detailed, octane, concept art, sharp focus, rockstar games, post processing, 
 picture of the day, ambient lighting, epic composition"""

negative_prompt = """
beard, goatee, ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, blurry, bad anatomy, blurred, 
watermark, grainy, signature, cut off, draft, amateur, multiple, gross, weird, uneven, furnishing, decorating, decoration, furniture, text, poor, low, basic, worst, juvenile, 
unprofessional, failure, crayon, oil, label, thousand hands
"""

seed = random.randint(1, 1000000000)

gen_args = json.dumps(dict(num_inference_steps=50, guidance_scale=7, seed=seed))

inputs = dict(prompt = prompt, 
              negative_prompt = negative_prompt, 
              gen_args = gen_args)

payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}

response = sm_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel="sd_lora.tar.gz",
)
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
original_image = decode_image(output[0]["data"][0])
original_image

Clean up

Follow the instructions in the cleanup section of the notebook to delete the resources provisioned as part of this post to avoid unnecessary charges. Refer to Amazon SageMaker Pricing for details regarding the cost of the inference instances.

Conclusion

In this post, we demonstrated how to create a personalized avatar solution using Stable Diffusion on SageMaker. By fine-tuning a pre-trained model with just a few images, we can generate avatars that reflect the individuality and personality of each user. This is just one of many examples of how we can use generative AI to create customized and unique experiences for users. The possibilities are endless, and we encourage you to experiment with this technology and explore its potential to enhance the creative process. We hope this post has been informative and inspiring. We encourage you to try the example and share your creations with us using hashtags #sagemaker #mme #genai on social platforms. We would love to see what you make.

In addition to Stable Diffusion, many other generative AI models are available on Amazon SageMaker JumpStart. Refer to Getting started with Amazon SageMaker JumpStart to explore their capabilities.


About the Authors

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in marketing & advertising industries.

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia, USA. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization, and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.

Lana Zhang is a Senior Solutions Architect on the AWS WWSO AI Services team, specializing in AI and ML for content moderation, computer vision, and natural language processing. With her expertise, she is dedicated to promoting AWS AI/ML solutions and assisting customers in transforming their business solutions across diverse industries, including social media, gaming, e-commerce, and advertising & marketing.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.


SageMaker Distribution is now available on Amazon SageMaker Studio

SageMaker Distribution is a pre-built Docker image containing many popular packages for machine learning (ML), data science, and data visualization. This includes deep learning frameworks like PyTorch, TensorFlow, and Keras; popular Python packages like NumPy, scikit-learn, and pandas; and IDEs like JupyterLab. In addition to this, SageMaker Distribution supports conda, micromamba, and pip as Python package managers.

In May 2023, we launched SageMaker Distribution as an open-source project at JupyterCon. This launch helped you use SageMaker Distribution to run experiments on your local environments. We are now natively providing that image in Amazon SageMaker Studio so that you gain the high performance, compute, and security benefits of running your experiments on Amazon SageMaker.

Compared to the earlier open-source launch, you have the following additional capabilities:

  • The open-source image is now available as a first-party image in SageMaker Studio. You can now simply choose the open-source SageMaker Distribution from the list when choosing an image and kernel for your notebooks, without having to create a custom image.
  • The SageMaker Python SDK package is now built-in with the image.

In this post, we show the features and advantages of using the SageMaker Distribution image.

Use SageMaker Distribution in SageMaker Studio

If you have access to an existing Studio domain, you can launch SageMaker Studio. To create a Studio domain, follow the directions in Onboard to Amazon SageMaker Domain.

  1. In the SageMaker Studio UI, choose File from the menu bar, choose New, and choose Notebook.
  2. When prompted for the image and instance, choose the SageMaker Distribution v0 CPU or SageMaker Distribution v0 GPU image.
  3. Choose your Kernel, then choose Select.

You can now start running your commands without needing to install common ML packages and frameworks! You can also run notebooks running on supported frameworks such as PyTorch and TensorFlow from the SageMaker examples repository, without having to switch the active kernels.

Run code remotely using SageMaker Distribution

In the public beta announcement, we discussed graduating notebooks from local compute environments to SageMaker Studio, and also operationalizing the notebook using notebook jobs.

Additionally, you can directly run your local notebook code as a SageMaker training job by simply adding a @remote decorator to your function.

Let’s try an example. Add the following code to your Studio notebook running on the SageMaker Distribution image:

from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.xlarge", dependencies='./requirements.txt')
def divide(x, y):
    return x / y

divide(2, 3.0)

When you run the cell, the function will run as a remote SageMaker training job on an ml.m5.xlarge instance, and the SDK automatically picks up the SageMaker Distribution image as the training image in Amazon Elastic Container Registry (Amazon ECR). For deep learning workloads, you can also run your script on multiple parallel instances.

Reproduce Conda environments from SageMaker Distribution elsewhere

SageMaker Distribution is available as a public Docker image. However, for data scientists more familiar with Conda environments than Docker, the GitHub repository also provides the environment files for each image build so you can build Conda environments for both CPU and GPU versions.

The build artifacts for each version are stored under the sagemaker-distribution/build_artifacts directory. To create the same environment as any of the available SageMaker Distribution versions, run the following commands, replacing the --file parameter with the right environment files:

conda create --name conda-sagemaker-distribution \
  --file sagemaker-distribution/build_artifacts/v0/v0.2/v0.2.1/cpu.env.out
# activate the environment
conda activate conda-sagemaker-distribution

Customize the open-source SageMaker Distribution image

The open-source SageMaker Distribution image has the most commonly used packages for data science and ML. However, data scientists might require access to additional packages, and enterprise customers might have proprietary packages that provide additional capabilities for their users. In such cases, there are multiple options to have a runtime environment with all required packages. In order of increasing complexity, they are listed as follows:

  • You can install packages directly on the notebook. We recommend Conda and micromamba, but pip also works.
  • Data scientists familiar with Conda for package management can reproduce the Conda environment from SageMaker Distribution elsewhere and install and manage additional packages in that environment going forward.
  • If administrators want a repeatable and controlled runtime environment for their users, they can extend SageMaker Distribution’s Docker images and maintain their own image. See Bring your own SageMaker image for detailed instructions to create and use a custom image in Studio.

Clean up

If you experimented with SageMaker Studio, shut down all Studio apps to avoid paying for unused compute usage. See Shut down and Update Studio Apps for instructions.

Conclusion

Today, we announced the launch of the open-source SageMaker Distribution image within SageMaker Studio. We showed you how to use the image in SageMaker Studio as one of the available first-party images, how to operationalize your scripts using the SageMaker Python SDK @remote decorator, how to reproduce the Conda environments from SageMaker Distribution outside Studio, and how to customize the image. We encourage you to try out SageMaker Distribution and share your feedback through GitHub!


About the authors

Durga Sury is an ML Solutions Architect in the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. In her 4 years at AWS, she has helped set up AI/ML platforms for enterprise customers. When she isn’t working, she loves motorcycle rides, mystery novels, and hiking with her 5-year-old husky.

Ketan Vijayvargiya is a Senior Software Development Engineer in Amazon Web Services (AWS). His focus areas are machine learning, distributed systems and open source. Outside work, he likes to spend his time self-hosting and enjoying nature.


Automate caption creation and search for images at enterprise scale using generative AI and Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines search for your websites and applications so your employees and customers can easily find the content they are looking for, even when it’s scattered across multiple locations and content repositories within your organization.

Amazon Kendra supports a variety of document formats, such as Microsoft Word, PDF, and text from various data sources. In this post, we focus on extending the document support in Amazon Kendra to make images searchable by their displayed content. Images can often be searched using supplemented metadata such as keywords. However, it takes a lot of manual effort to add detailed metadata to potentially thousands of images. Generative AI (GenAI) can be helpful in generating the metadata automatically. By generating textual captions, the GenAI caption predictions offer descriptive metadata for images. The Amazon Kendra index can then be enriched with the generated metadata during document ingestion to enable searching the images without any manual effort.

As an example, a GenAI model can be used to generate a textual description for the following image as “a dog laying on the ground under an umbrella” during document ingestion of the image.

Image of a dog laying under an umbrella as an example of what can be searched in this solution

An object recognition model can still detect keywords such as “dog” and “umbrella,” but a GenAI model offers deeper understanding of what is represented in the image by identifying that the dog lies under the umbrella. This helps us build more refined searches in the image search process. The textual description is added as metadata to an Amazon Kendra search index via an automated custom document enrichment (CDE). Users searching for terms like “dog” or “umbrella” will then be able to find the image, as shown in the following screenshot.

Image of Kendra search tool

In this post, we show how to use CDE in Amazon Kendra using a GenAI model deployed on Amazon SageMaker. We demonstrate CDE using simple examples and provide a step-by-step guide for you to experience CDE in an Amazon Kendra index in your own AWS account. It allows users to quickly and easily find the images they need without having to manually tag or categorize them. This solution can also be customized and scaled to meet the needs of different applications and industries.

Image captioning with GenAI

Image description with GenAI involves using ML algorithms to generate textual descriptions of images. The process is also known as image captioning, and operates at the intersection of computer vision and natural language processing (NLP). It has applications in areas where data is multi-modal such as ecommerce, where data contains text in the form of metadata as well as images, or in healthcare, where data could contain MRIs or CT scans along with doctor’s notes and diagnoses, to name a few use cases.

GenAI models learn to recognize objects and features within the images, and then generate descriptions of those objects and features in natural language. The state-of-the-art models use an encoder-decoder architecture, where the image information is encoded in the intermediate layers of the neural network and decoded into textual descriptions. These can be considered as two distinct stages: feature extraction from images and textual caption generation. In the feature extraction stage (encoder), the GenAI model processes the image to extract relevant visual features, such as object shapes, colors, and textures. In the caption generation stage (decoder), the model generates a natural language description of the image based on the extracted visual features.

GenAI models are typically trained on vast amounts of data, which make them suitable for various tasks without additional training. Adapting to custom datasets and new domains is also easily achievable through few-shot learning. Pre-training methods allow multi-modal applications to be easily trained using state-of-the-art language and image models. These pre-training methods also allow you to mix and match the vision model and language model that best fits your data.

The quality of the generated image descriptions depends on the quality and size of the training data, the architecture of the GenAI model, and the quality of the feature extraction and caption generation algorithms. Although image description with GenAI is an active area of research, it shows very good results in a wide range of applications, such as image search, visual storytelling, and accessibility for people with visual impairments.

Use cases

GenAI image captioning is useful in the following use cases:

  • Ecommerce – A common industry use case where images and text occur together is retail. Ecommerce in particular stores vast amounts of data as product images along with textual descriptions. The textual description or metadata is important to ensure that the best products are displayed to the user based on the search queries. Moreover, with the trend of ecommerce sites obtaining data from third-party vendors, the product descriptions are often incomplete, which leads to numerous manual hours and significant overhead for tagging the right information in the metadata columns. GenAI-based image captioning is particularly useful for automating this laborious process. Fine-tuning the model on custom fashion data, such as fashion images along with text describing the attributes of fashion products, can generate metadata that improves a user’s search experience.
  • Marketing – Another use case of image search is digital asset management. Marketing firms store vast amounts of digital data that needs to be centralized, easily searchable, and scalable, which data catalogs make possible. A centralized data lake with informative data catalogs would reduce duplication efforts and enable wider sharing of creative content and consistency between teams. For graphic design platforms popularly used for enabling social media content generation, or presentations in corporate settings, a faster search could result in an improved user experience by rendering the correct search results for the images that users want to look for and enabling users to search using natural language queries.
  • Manufacturing – The manufacturing industry stores a lot of image data like architecture blueprints of components, buildings, hardware, and equipment. The ability to search through such data enables product teams to easily recreate designs from a starting point that already exists and eliminates a lot of design overhead, thereby speeding up the process of design generation.
  • Healthcare – Doctors and medical researchers can catalog and search through MRIs and CT scans, specimen samples, images of the ailment such as rashes and deformities, along with doctor’s notes, diagnoses, and clinical trials details.
  • Metaverse or augmented reality – Advertising a product is about creating a story that users can imagine and relate to. With AI-powered tools and analytics, it has become easier than ever to build not just one story but customized stories that appeal to end-users’ unique tastes and sensibilities. This is where image-to-text models can be a game changer. Visual storytelling can assist in creating characters, adapting them to different styles, and captioning them. It can also be used to power stimulating experiences in the metaverse or augmented reality and immersive content including video games. Image search enables developers, designers, and teams to search their content using natural language queries, which can maintain consistency of content between various teams.
  • Accessibility of digital content for blind and low vision – This is primarily enabled by assistive technologies such as screenreaders, Braille systems that allow touch reading and writing, and special keyboards for navigating websites and applications across the internet. Images, however, need to be delivered as textual content that can then be communicated as speech. Image captioning using GenAI algorithms is a crucial piece for redesigning the internet and making it more inclusive by providing everyone a chance to access, understand, and interact with online content.

Model details and model fine-tuning for custom datasets

In this solution, we take advantage of the vit-gpt2-image-captioning model available from Hugging Face, which is licensed under Apache 2.0, without performing any further fine-tuning. ViT is a foundational model for image data, and GPT-2 is a foundational model for language. The multi-modal combination of the two offers the capability of image captioning. Hugging Face hosts state-of-the-art image captioning models, which can be deployed in AWS in a few clicks and offer simple-to-deploy inference endpoints. Although we can use this pre-trained model directly, we can also customize the model to fit domain-specific datasets, more data types such as video or spatial data, and unique use cases. Several GenAI models are available, some of which perform best with certain datasets, and your team might already be using particular vision and language models. This solution offers the flexibility of choosing the best-performing vision and language model as the image captioning model through straightforward replacement of the model we have used.

For customization of the models to unique industry applications, open-source models available on AWS through Hugging Face offer several possibilities. A pre-trained model can be tested for the unique dataset or trained on samples of the labeled data to fine-tune it. Novel research methods also allow any combination of vision and language models to be combined efficiently and trained on your dataset. This newly trained model can then be deployed in SageMaker for the image captioning described in this solution.

An example of a customized image search is Enterprise Resource Planning (ERP). In ERP, image data collected from different stages of logistics or supply chain management could include tax receipts, vendor orders, payslips, and more, which need to be automatically categorized for the purview of different teams within the organization. Another example is to train on medical scans and doctors’ diagnoses so that new medical images can be captioned and automatically classified. The vision model extracts features from the MRI, CT, or X-ray images, and the text model captions them with the medical diagnoses.

Solution overview

The following diagram shows the architecture for image search with GenAI and Amazon Kendra.

Architecture of proposed solution

We ingest images from Amazon Simple Storage Service (Amazon S3) into Amazon Kendra. During ingestion to Amazon Kendra, the GenAI model hosted on SageMaker is invoked to generate an image description. Additionally, text visible in an image is extracted by Amazon Textract. The image description and the extracted text are stored as metadata and made available to the Amazon Kendra search index. After ingestion, images can be searched via the Amazon Kendra search console, API, or SDK.
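The text extraction step can be sketched as follows. This is a minimal illustration, not the code shipped with the solution: it assumes a boto3 Textract client is passed in, and the function name extract_visible_text is our own.

```python
def extract_visible_text(textract_client, bucket, key):
    """Return the text lines Amazon Textract detects in an image stored in S3.

    textract_client is assumed to be boto3.client("textract").
    """
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # Keep only LINE blocks; WORD blocks repeat the same text word by word.
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return " ".join(lines)


# Usage (requires AWS credentials; bucket and key are placeholders):
#   import boto3
#   client = boto3.client("textract", region_name="us-east-1")
#   print(extract_visible_text(client, "my-image-bucket", "images/receipt.png"))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.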

We use the advanced operations of CDE in Amazon Kendra to call the GenAI model and Amazon Textract during the image ingestion step. However, CDE supports a wider range of use cases. With CDE, you can create, modify, or delete document attributes and content when you ingest your documents into Amazon Kendra. This means you can manipulate and ingest your data as needed. This can be achieved by invoking pre- and post-extraction AWS Lambda functions during ingestion, which allows for data enrichment or modification. For example, we can use Amazon Comprehend Medical when ingesting medical textual data to add ML-generated insights to the search metadata.
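A pre-extraction Lambda function for this kind of enrichment could look roughly like the following sketch. It assumes the event and response shape documented for CDE pre-extraction Lambda functions (s3Bucket, s3ObjectKey, and metadataUpdates). The attribute names image_caption and image_text, and the injected caption_fn and text_fn helpers, are hypothetical and stand in for the SageMaker and Amazon Textract calls.

```python
def build_metadata_updates(caption, extracted_text):
    """Build the metadataUpdates list returned to Amazon Kendra.

    The attribute names are illustrative, not the ones used by the solution.
    """
    updates = []
    if caption:
        updates.append({"name": "image_caption", "value": {"stringValue": caption}})
    if extracted_text:
        updates.append({"name": "image_text", "value": {"stringValue": extracted_text}})
    return updates


def handle_pre_extraction(event, caption_fn, text_fn):
    """Enrich one document during ingestion.

    caption_fn(bucket, key) and text_fn(bucket, key) are placeholders for the
    calls to the captioning endpoint and to Amazon Textract.
    """
    bucket = event["s3Bucket"]
    key = event["s3ObjectKey"]
    caption = caption_fn(bucket, key)
    text = text_fn(bucket, key)
    return {
        "version": "v0",
        "s3ObjectKey": key,  # the document itself is passed through unchanged
        "metadataUpdates": build_metadata_updates(caption, text),
    }
```

In a deployed function, a thin lambda_handler(event, context) wrapper would call handle_pre_extraction with the real service helpers.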

You can use our solution to search images through Amazon Kendra by following these steps:

  1. Upload images to an image repository like an S3 bucket.
  2. The image repository is then indexed by Amazon Kendra, which is a search engine that can be used to search for structured and unstructured data. During indexing, the GenAI model as well as Amazon Textract are invoked to generate the image metadata. You can trigger the indexing manually or on a predefined schedule.
  3. You can then search for images using natural language queries, such as “Find images of red roses” or “Show me pictures of dogs playing in the park,” through the Amazon Kendra console, SDK, or API. These queries are processed by Amazon Kendra, which uses ML algorithms to understand the meaning behind the queries and retrieve relevant images from the indexed repository.
  4. The search results are presented to you, along with their corresponding textual descriptions, allowing you to quickly and easily find the images you are looking for.
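Step 3 can also be done programmatically. The following sketch assumes a boto3 Kendra client and an index ID that you supply; the helper name search_images and the shape of the returned dictionaries are our own choices.

```python
def search_images(kendra_client, index_id, query_text, max_results=5):
    """Run a natural language query and return a simplified result list.

    kendra_client is assumed to be boto3.client("kendra").
    """
    response = kendra_client.query(
        IndexId=index_id, QueryText=query_text, PageSize=max_results
    )
    results = []
    for item in response.get("ResultItems", []):
        results.append(
            {
                "document_id": item.get("DocumentId"),
                "title": item.get("DocumentTitle", {}).get("Text", ""),
                "excerpt": item.get("DocumentExcerpt", {}).get("Text", ""),
                "confidence": item.get("ScoreAttributes", {}).get("ScoreConfidence"),
            }
        )
    return results


# Usage (requires AWS credentials; the index ID is a placeholder):
#   import boto3
#   kendra = boto3.client("kendra", region_name="us-east-1")
#   for hit in search_images(kendra, "<your-index-id>", "dog under an umbrella"):
#       print(hit["confidence"], hit["document_id"], hit["excerpt"])
```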

Prerequisites

You must have the following prerequisites:

  • An AWS account
  • Permissions to provision and invoke the following services via AWS CloudFormation: Amazon S3, Amazon Kendra, Lambda, and Amazon Textract.

Cost estimate

The following table projects the cost of deploying this solution as a proof of concept. To keep costs low, we use Amazon Kendra Developer Edition, which is not recommended for production workloads but provides a low-cost option for developers. We assume that the search functionality of Amazon Kendra is used for 20 working days for 3 hours each day, and therefore calculate associated costs for 60 monthly active hours.

Service | Time Consumed | Cost Estimate per Month
Amazon S3 | Storage of 10 GB with data transfer | 2.30 USD
Amazon Kendra | Developer Edition with 60 hours/month | 67.90 USD
Amazon Textract | 100% detect document text on 10,000 images | 15.00 USD
Amazon SageMaker | Real-time inference with ml.g4dn.xlarge for one model deployed on one endpoint for 3 hours every day for 20 days | 44.00 USD
Total | | 129.20 USD
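As a quick check of the arithmetic, the monthly total and the active hours can be recomputed from the figures in the table. The per-service amounts are copied from the table; nothing here queries actual AWS pricing.

```python
from decimal import Decimal

# Monthly estimates in USD, copied from the cost table.
MONTHLY_COSTS_USD = {
    "Amazon S3": Decimal("2.30"),
    "Amazon Kendra": Decimal("67.90"),
    "Amazon Textract": Decimal("15.00"),
    "Amazon SageMaker": Decimal("44.00"),
}


def total_monthly_cost(costs):
    """Sum the per-service estimates without floating-point rounding."""
    return sum(costs.values(), Decimal("0"))


active_hours = 20 * 3  # 20 working days at 3 hours per day
total = total_monthly_cost(MONTHLY_COSTS_USD)
print(f"{active_hours} active hours, {total} USD per month")
```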

Deploy resources with AWS CloudFormation

The CloudFormation stack deploys the following resources:

  • A Lambda function that downloads the image captioning model from Hugging Face hub and subsequently builds the model assets
  • A Lambda function that populates the inference code and zipped model artifacts to a destination S3 bucket
  • An S3 bucket for storing the zipped model artifacts and inference code
  • An S3 bucket for storing the uploaded images and Amazon Kendra documents
  • An Amazon Kendra index for searching through the generated image captions
  • A SageMaker real-time inference endpoint for deploying the Hugging Face image captioning model
  • A Lambda function that is triggered while enriching the Amazon Kendra index on demand. It invokes Amazon Textract and a SageMaker real-time inference endpoint.

Additionally, AWS CloudFormation deploys all the necessary AWS Identity and Access Management (IAM) roles and policies, a VPC along with subnets, a security group, and an internet gateway in which the custom resource Lambda function is run.

Complete the following steps to provision your resources:

  1. Choose Launch stack to launch the CloudFormation template in the us-east-1 Region.
  2. Choose Next.
  3. On the Specify stack details page, leave the template URL and S3 URI of the parameters file at their defaults, then choose Next.
  4. Continue to choose Next on the subsequent pages.
  5. Choose Create stack to deploy the stack.

Monitor the status of the stack. When the status shows as CREATE_COMPLETE, the deployment is complete.
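If you prefer to monitor the stack from a script instead of the console, a polling loop like the following is one option. It is a sketch that assumes a boto3 CloudFormation client and the stack name kendra-genai-image-search; the helper name and the polling defaults are our own.

```python
import time


def wait_for_stack(cfn_client, stack_name, target="CREATE_COMPLETE",
                   poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll the stack until it reaches the target status.

    cfn_client is assumed to be boto3.client("cloudformation").
    Raises RuntimeError on a failed or rolled-back stack.
    """
    for _ in range(max_polls):
        stacks = cfn_client.describe_stacks(StackName=stack_name)["Stacks"]
        status = stacks[0]["StackStatus"]
        if status == target:
            return status
        if status.endswith("FAILED") or "ROLLBACK" in status:
            raise RuntimeError(f"Stack {stack_name} ended in status {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"Stack {stack_name} did not reach {target} in time")


# Usage (requires AWS credentials):
#   import boto3
#   cfn = boto3.client("cloudformation", region_name="us-east-1")
#   wait_for_stack(cfn, "kendra-genai-image-search")
```

boto3 also ships a built-in waiter, cfn.get_waiter("stack_create_complete"), which covers the same case with less code; the explicit loop is shown only to make the status check visible.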

Ingest and search example images

Complete the following steps to ingest and search your images:

  1. On the Amazon S3 console, create a folder called images in the kendra-image-search-stack-imagecaptions S3 bucket in the us-east-1 Region.
  2. Upload the following images to the images folder.

Test images for the Kendra image search using automated text captioning:

  • Image of a beach
  • Image of a dog celebrating a birthday
  • Image of a dog under an umbrella
  • Image of a tablet, notebook, and coffee on a desk

  3. Navigate to the Amazon Kendra console in the us-east-1 Region.
  4. In the navigation pane, choose Indexes, then choose your index (kendra-index).
  5. Choose Data sources, then choose generated_image_captions.
  6. Choose Sync now.

Wait for the synchronization to be complete before continuing to the next steps.

  7. In the navigation pane, choose Indexes, then choose kendra-index.
  8. Navigate to the search console.
  9. Try the following queries individually or combined: “dog,” “umbrella,” and “newsletter,” and find out which images are ranked high by Amazon Kendra.

Feel free to test your own queries that fit the uploaded images.
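The upload and sync steps above can also be scripted. The following sketch assumes boto3 S3 and Kendra clients; the bucket name comes from the steps above, while the index ID and data source ID are placeholders you would look up in your own account.

```python
import os


def upload_images(s3_client, local_paths, bucket, prefix="images/"):
    """Upload local image files under the given prefix and return their keys.

    s3_client is assumed to be boto3.client("s3").
    """
    keys = []
    for path in local_paths:
        key = prefix + os.path.basename(path)
        s3_client.upload_file(path, bucket, key)
        keys.append(key)
    return keys


def start_sync(kendra_client, index_id, data_source_id):
    """Start a data source sync job and return its execution ID.

    kendra_client is assumed to be boto3.client("kendra").
    """
    response = kendra_client.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id
    )
    return response["ExecutionId"]


# Usage (requires AWS credentials; the IDs are placeholders):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   kendra = boto3.client("kendra", region_name="us-east-1")
#   upload_images(s3, ["beach.jpg", "dog.jpg"], "kendra-image-search-stack-imagecaptions")
#   start_sync(kendra, "<your-index-id>", "<your-data-source-id>")
```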

Clean up

To deprovision all the resources, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the stack kendra-genai-image-search and choose Delete.

Wait until the stack status changes to DELETE_COMPLETE.

Conclusion

In this post, we saw how Amazon Kendra and GenAI can be combined to automate the creation of meaningful metadata for images. State-of-the-art GenAI models are extremely useful for generating text captions describing the content of an image. This has several industry use cases, ranging from healthcare and life sciences to retail and ecommerce, digital asset platforms, and media. Image captioning is also crucial for building a more inclusive digital world and redesigning the internet, metaverse, and immersive technologies to cater to the needs of visually challenged sections of society.

Caption-based image search makes digital content easily searchable without manual effort for these applications, and removes duplicated effort. The CloudFormation template we provided makes it straightforward to deploy this solution to enable image search using Amazon Kendra. A simple architecture of images stored in Amazon S3 and GenAI to create textual descriptions of the images can be used with CDE in Amazon Kendra to power this solution.

This is only one application of GenAI with Amazon Kendra. To dive deeper into how to build GenAI applications with Amazon Kendra, refer to Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models. For building and scaling GenAI applications, we recommend checking out Amazon Bedrock.


About the Authors

Charalampos Grouzakis is a Data Scientist within AWS Professional Services. He has over 11 years of experience in developing and leading data science, machine learning, and big data initiatives. Currently he is helping enterprise customers modernize their AI/ML workloads within the cloud using industry best practices. Prior to joining AWS, he was consulting customers in various industries such as Automotive, Manufacturing, Telecommunications, Media & Entertainment, Retail and Financial Services. He is passionate about enabling customers to accelerate their AI/ML journey in the cloud and to drive tangible business outcomes.


Bharathi Srinivasan is a Data Scientist at AWS Professional Services where she loves to build cool things on SageMaker. She is passionate about driving business value from machine learning applications, with a focus on ethical AI. Outside of building new AI experiences for customers, Bharathi loves to write science fiction and challenge herself with endurance sports.

Jean-Michel Lourier is a Senior Data Scientist within AWS Professional Services. He leads teams implementing data-driven applications side by side with AWS customers to generate business value out of their data. He’s passionate about diving into tech and learning about AI, machine learning, and their business applications. He is also an enthusiastic cyclist, taking long bike-packing trips.

Tanvi Singhal is a Data Scientist within AWS Professional Services. Her skills and areas of expertise include data science, machine learning, and big data. She supports customers in developing machine learning models and MLOps solutions within the cloud. Prior to joining AWS, she was also a consultant in various industries such as Transportation Networking, Retail and Financial Services. She is passionate about enabling customers on their data/AI journey to the cloud.

Abhishek Maligehalli Shivalingaiah is a Senior AI Services Solution Architect at AWS with a focus on Amazon Kendra. He is passionate about building applications using Amazon Kendra, generative AI, and NLP. He has around 10 years of experience in building Data & AI solutions to create value for customers and enterprises. He has built a (personal) chatbot for fun to answer questions about his career and professional journey. Outside of work he enjoys making portraits of family & friends, and loves creating artworks.


Research Focus: Week of July 31, 2023

Microsoft Research Focus 21 | Week of July 31, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

Anonymous Tokens with Stronger Metadata Bit Hiding from Algebraic MACs

Protecting the web from malicious activities such as bots or DoS attacks is an important goal. Researchers and practitioners have identified different approaches to balance user experience and security. For example, anonymous tokens allow an issuer to ensure that a user has been vetted while also protecting the user’s privacy. However, in some cases, the issuance or absence of a token can inform an adversary about the strategies used to distinguish honest users from bots or attackers.

In a recent paper: Anonymous Tokens with Stronger Metadata Bit Hiding from Algebraic MACs, researchers from Microsoft show how they designed an anonymous token protocol between a client and an issuer (also a verifier) that enables the issuer to support its fraud detection mechanisms while preserving users’ privacy.

Spotlight: Microsoft Research Podcast

AI Frontiers: AI for health and the future of research with Peter Lee

Peter Lee, head of Microsoft Research, and Ashley Llorens, AI scientist and engineer, discuss the future of AI research and the potential for GPT-4 as a medical copilot.

NEW RESEARCH

Survival Instinct in Offline Reinforcement Learning

On many benchmark datasets, offline reinforcement learning (RL) can produce well-performing and safe policies, even when trained with “wrong” reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL’s return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design.

In a new paper: Survival Instinct in Offline Reinforcement Learning, researchers from the University of Washington and Microsoft demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and a certain bias implicit in common data collection practices. This work shows that pessimism endows the agent with a “survival instinct” – an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. The researchers argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating new ones. This research suggests a new paradigm for RL, whereby an agent is “nudged” to learn a desirable behavior with imperfect reward but purposely biased data coverage.


NEW RESEARCH

Nimble: Rollback Protection for Confidential Cloud Services

Cloud providers today offer confidential computing services in which virtual machines (VMs) support trusted execution environments (TEEs) that isolate a customer’s code from other code (including the hypervisor). TEEs offer security properties such as memory confidentiality and execution integrity, even if the provider is compromised. However, TEEs provide volatile state storage, not persistent state storage. So, if a TEE crashes or is maliciously restarted, its data can be lost.

A common way that TEEs today avoid such data loss is to persist an encrypted version of their data in a fault-tolerant cloud storage system such as Azure Table Storage or Cosmos DB. While authenticated encryption ensures that unauthorized parties cannot see the sensitive data or change its contents, encryption does not prevent a compromised provider from returning encryptions of old data. This is known as a “rollback attack,” in which an attacker can return an application running in a TEE to a previous state, potentially one that is vulnerable to attacks or that causes the application to perform incorrect actions.  

In a recent paper, Nimble: Rollback Protection for Confidential Cloud Services, researchers from Microsoft and academic colleagues introduce Nimble, a cloud service that helps applications running in TEEs detect rollback attacks.


NEW RESEARCH

Improving machine learning force fields for molecular dynamics simulations with fine-grained force metrics

Machine learning force fields (MLFFs) provide a cost-effective alternative to ab initio molecular dynamics (MD) simulations – a computational method used in theoretical chemistry and materials science to simulate the behavior of molecules and materials at the atomic level. While they typically produce only small errors on the test set, MLFFs inherently encounter generalization and robustness issues during MD simulations.

In a recent paper: Improving machine learning force fields for molecular dynamics simulations with fine-grained force metrics, researchers from Microsoft propose alleviating those issues using global force metrics and fine-grained metrics from element and conformation aspects to systematically measure MLFFs for every atom and every conformation of molecules. Such force metrics can directly examine MLFFs without running costly MD simulations, reducing the computational cost of MLFF evaluation.

The researchers show that an accurate force prediction by MLFFs for all kinds of atom types and all possible conformations plays a crucial role in their usefulness in MD simulations. In addition, they designed continued learning and fine-tuning approaches to improve the performance of MLFFs.


NEW RESEARCH

Project Rumi: Multimodal paralinguistic prompting for LLMs

Large language models (LLMs) are algorithms that process and generate natural language, which can be used to create powerful new productivity tools. However, LLMs may not fully reflect the context and nuances of a conversation. Their performance depends in part on the quality and specificity of the user’s input, or prompt. User input data is a lexical entry, which lacks paralinguistic information (intonation, gestures, facial expressions, etc.) that may convey a speaker’s intentions. This can lead to misinterpretation, misunderstanding, or inappropriate responses from the LLM.

Conveying unspoken meaning and intention is an essential component in the next generation of AI interaction. To improve the quality of the underlying communication, researchers from Microsoft are developing a system called Project Rumi, which incorporates paralinguistic input into prompt-based interactions with LLMs. This system leverages separately trained vision and audio-based models to detect and analyze non-verbal cues extracted from data streams, assessing sentiment from cognitive and physiological data in real time. This multimodal, multi-step architecture integrates with all pretrained text-based LLMs to provide additional information on the user’s sentiment and intention that is not captured by text-based models.

The post Research Focus: Week of July 31, 2023 appeared first on Microsoft Research.
