Unlocking Japanese LLMs with AWS Trainium: Innovators Showcase from the AWS LLM Development Support Program


Amazon Web Services (AWS) is committed to supporting the development of cutting-edge generative artificial intelligence (AI) technologies by companies and organizations across the globe. As part of this commitment, AWS Japan announced the AWS LLM Development Support Program (LLM Program), through which we’ve had the privilege of working alongside some of Japan’s most innovative teams. From startups to global enterprises, these trailblazers are harnessing the power of large language models (LLMs) and foundation models (FMs) to boost productivity, create differentiated customer experiences, and drive meaningful progress across a variety of industries by taking advantage of purpose-built generative AI infrastructure on AWS. Notably, 12 of the 15 organizations that successfully participated in the program used the powerful compute capabilities of AWS Trainium to train their models and are now exploring AWS Inferentia for inference. Earlier this year, the program concluded with a media briefing, where several pioneering companies presented their results and stories. In this blog post, we share a recap of those results and cover how the participating organizations used the LLM Program to accelerate their generative AI initiatives.

AWS LLM Development Support Program in Japan

Since its launch, the LLM Program has welcomed 15 diverse companies and organizations, each with a unique vision for how to use LLMs to drive progress in their respective industries. The program provides comprehensive support through guidance on securing high-performance compute infrastructure, technical assistance and troubleshooting for distributed training, cloud credits, and support for go-to-market. The program also facilitated collaborative knowledge-sharing sessions, where the leading LLM engineers came together to discuss the technical complexities and commercial considerations of their work. This holistic approach enabled participating organizations to rapidly advance their generative AI capabilities and bring transformative solutions to market.

Let’s dive in and explore how these organizations are transforming what’s possible with generative AI on AWS.

Ricoh innovates with curriculum learning to train a bilingual LLM

Ricoh recognized that the development of Japanese LLMs was lagging behind that of English and multilingual LLMs. To address this, the company’s Digital Technology Development Center developed a Japanese-English bilingual LLM through a carefully crafted curriculum learning strategy.

Takeshi Suzuki, Deputy Director, Digital Technology Development Center, Digital Strategy Division, Ricoh


Takeshi Suzuki, Deputy Director of the Digital Technology Development Center, explains Ricoh’s approach:

“Although new model architectures for FMs and LLMs are rapidly emerging, we focused on refining our training methodologies to create a competitive advantage, rather than solely pursuing architectural novelty.”

This led them to adopt a curriculum learning approach that gradually introduced increasingly complex data to their model.

“If a large amount of difficult Japanese data is introduced from the start into the initial English-trained weights of Llama 2 13B Chat, it can lead to a forgetting effect, hindering learning,” Suzuki says. “Therefore, we started with a substantial amount of English data, then gradually incorporated lower-quality English and Japanese data, before finally fine-tuning on high-quality Japanese content.”

To bring this innovative curriculum learning methodology to life, Ricoh used Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances, powered by Trainium. Using an on-demand cluster of 64 trn1.32xlarge instances (1,024 Trainium chips) with support from the LLM Program, Ricoh performed large-scale distributed training of its 13-billion-parameter bilingual LLM (based on Llama 2). In benchmarks using the Japanese llm-jp-eval benchmark, the model demonstrated strong logical reasoning performance, which is important in industrial applications.
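
As a rough illustration of the curriculum Suzuki describes (a minimal sketch only; the data source names and mixing weights below are hypothetical and are not Ricoh’s actual training configuration), such a schedule can be expressed as an ordered list of data-mixture phases that a training loop steps through:

# Hypothetical curriculum schedule: each phase lists data sources and sampling weights.
# Later phases shift from broad English data toward high-quality Japanese data.
curriculum = [
    {"phase": 1, "mix": {"english_web": 0.95, "japanese_web": 0.05}},
    {"phase": 2, "mix": {"english_web": 0.50, "japanese_web": 0.50}},
    {"phase": 3, "mix": {"japanese_curated": 0.90, "english_web": 0.10}},
]

for stage in curriculum:
    print(f"Phase {stage['phase']}: sampling weights {stage['mix']}")
    # train_one_phase(model, stage["mix"])  # hypothetical hook into a distributed training loop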

Stockmark mitigates hallucination by pre-training a Japanese LLM

Stockmark wanted to build highly reliable LLMs for industrial applications and decided to pretrain a Japanese LLM to tackle the challenge of hallucination (factually inaccurate output)—a critical concern in many real-world use cases.

Kosuke Arima, CTO and Co-founder (left) and Dr. Takahiro Omi, VP of Research (right), Stockmark


“In the industrial world, there is a demand for LLMs where hallucination is suppressed even more than it is in ChatGPT.”

– Kosuke Arima, CTO and co-founder of Stockmark.

Hallucination mitigation depends heavily on the amount of knowledge in LLMs. For the multilingual LLMs commonly used around the world, Japanese makes up only about 0.1 percent of the training data. Stockmark determined that Retrieval Augmented Generation (RAG) alone was insufficient to meet the needs of enterprise search or application search, because the underlying LLMs weren’t proficient in Japanese. So, they decided to develop Japanese LLMs in-house.

“To support practical business use cases, we pre-trained a 13-billion-parameter LLM from scratch using a total of 220 billion tokens of Japanese text data, including not only public data but also original web corpus and patent data for business domains.”

– Dr. Takahiro Omi, VP of Research of Stockmark.

Stockmark developed the Stockmark-13b LLM in about 30 days using 16 Trn1 instances powered by Trainium chips. Furthermore, to deploy Stockmark-13b into their own services, they conducted a technical validation of inference on the AWS Inferentia2 chip and published the results in a notebook.

NTT builds lightweight, high-performance LLMs for sustainable AI

The NTT group, together with Intel and Sony, has established the Innovative Optical and Wireless Network (IOWN) initiative, a new industry forum whose mission is to meet the social and technological needs of society through innovative and sustainable technology. As part of this effort, NTT Human Informatics Laboratories is developing the lightweight, high-performance LLM tsuzumi (named after a traditional Japanese percussion instrument). Instead of increasing the parameter size, tsuzumi enhances the quality and quantity of Japanese training data, enabling high Japanese processing ability with a lightweight model. As described in their press release, tsuzumi demonstrates high Japanese language proficiency, as evaluated by the Rakuda benchmark, and has multi-modal capabilities that are currently under development.

Kyosuke Nishida, Senior Distinguished Researcher, NTT Human Informatics Laboratories


“Tsuzumi’s high Japanese language proficiency and multi-modal capabilities can benefit a variety of industry-specific and customer support use cases. In the healthcare and life sciences domain, tsuzumi can help parse electronic medical records, contributing to personalized medical care and accelerating drug discovery,” he explains. “For contact centers, tsuzumi’s multi-modal capabilities, such as visual understanding of manuals and charts, are expected to enhance both customer experience and employee experience.”

– Dr. Kyosuke Nishida, Senior Distinguished Researcher at NTT Human Informatics Laboratories.

By participating in the LLM Program, NTT was able to quickly launch a cluster of 96 NVIDIA H100 GPUs (12 EC2 P5 instances using AWS ParallelCluster). This enabled highly efficient, distributed training through the Elastic Fabric Adapter’s high-speed 3,200 Gbps inter-node communication. The AWS team also provided technical expertise to help NTT seamlessly migrate and validate its environment on AWS.

Customer innovations in domain-specific, multilingual, and multimodal generative AI

From intelligent chatbots that engage in witty banter to multimodal frameworks for autonomous vehicle systems, the LLM Program participants demonstrated the transformative potential of generative AI by using Trainium.

Domain-specific models: Trainium enabled the creation of LLMs tailored to specific domains and tasks, unlocking new frontiers of efficiency and specialization. KARAKURI built an LLM (karakuri-ai/karakuri-lm-70b-chat-v0.1) to create customer support chatbots that not only have Japanese proficiency but also respond with a helpful demeanor. Meanwhile, Watashiha injected a dose of humor into the AI realm, developing OGIRI—a humor-focused foundation model that delivers delightfully funny responses to user queries. Poetics created an LLM adept at deciphering the nuances of online business meetings for their meeting analysis tool Jamroll. The Matsuo Institute pre-trained an LLM based on elyza/ELYZA-japanese-Llama-2-7b to develop an LLM-powered recommendation system that can intelligently curate personalized experiences for retail and travel customers. Aiming to build an LLM that specializes in specific tasks, Lightblue developed a small, lightweight LLM that will also reduce inference costs. To address the scalability challenges posed by a shrinking workforce, Recruit built an LLM through continued pre-training (with C4-ja, Wikipedia-ja, Pile, and in-house corpora) and instruction tuning (with databricks-dolly-15k-ja, ichikara-instruction, and in-house instruction data) on elyza/ELYZA-japanese-Llama-2-7b-fast and meta-llama/Llama-2-13b-hf models.

Multi-modal models: Several participants, such as Sparticle, have ventured into the realm of multimodal AI, weaving together language and visual modalities. Turing, with its innovative multi-modal Heron framework, is enhancing LLMs with the ability to interpret and navigate the visual landscape. Preferred Networks (PFN) has crafted a general-purpose vision FM that can seamlessly integrate and process both textual and visual information. As part of their future work, PFN will continue to develop multi-modal FMs based on PLaMo LLM, using the development method established in the LLM Program.

Linguistically diverse models: The program participants also experimented with the training data, changing the ratio of English to Japanese or using training corpora in other languages. CyberAgent used Trainium to evaluate LLM performance when changing the ratio of Japanese to English in the training data, expanded their experiments to grouped query attention (GQA), and verified architectures such as RetNet and Sparse Mixture of Experts (MoE) for their use cases. Using Trainium, Rinna built Nekomata 14B, based on the Qwen model trained on Chinese and English, by continued pre-training with 66 billion tokens of Japanese data, in just 6.5 days. Ubitus developed and released Taiwan LLM 13B (Taiwan-LLM-13B-v2.0-base) through joint research with National Taiwan University.

Fueling generative AI innovation in Japan

From startups to enterprises, organizations of all sizes have successfully trained their generative AI foundation models and large language models in the LLM Program. This success was further underscored by the involvement and support of Japan’s Ministry of Economy, Trade and Industry (METI). Several of the LLM Program participants will continue to develop their FMs and LLMs as part of the Generative AI Accelerator Challenge (GENIAC), for which AWS will provide compute resources, as announced by METI and described in the AWS Japan blog.

AWS will continue to support companies and organizations in their efforts to deploy these transformative models and bring generative AI innovation into real-world applications. We see immense potential for FMs and LLMs to bolster Japan’s national strengths if implemented widely across various sectors. From a global perspective, AWS is committed to facilitating the development and adoption of these technologies worldwide, driving innovation and progress that will shape the future.

Visit AWS Trainium to learn how you can harness the power of purpose-built AI chips to build your next innovative foundation models while lowering costs.

This post is contributed by AWS LLM Development Support Program Executive Committee members Yoshitaka Haribara, Akihiro Tsukada, Daishi Okada, and Shoko Utsunomiya, and Technical Core Team members Hiroshi Tokoyo, Keita Watanabe, and Masaru Isaka, with Executive Sponsorship represented by Yukiko Sato.


About the Authors

Yoshitaka Haribara is a Senior Startup ML Solutions Architect at AWS Japan. In this role, Yoshitaka helps startup customers build generative AI foundation models and large language models on AWS, and came up with the idea of the LLM Program. In his spare time, Yoshitaka enjoys playing the drums.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.


Use the ApplyGuardrail API with long-context inputs and streaming outputs in Amazon Bedrock


As generative artificial intelligence (AI) applications become more prevalent, maintaining responsible AI principles becomes essential. Without proper safeguards, large language models (LLMs) can potentially generate harmful, biased, or inappropriate content, posing risks to individuals and organizations. Applying guardrails helps mitigate these risks by enforcing policies and guidelines that align with ethical principles and legal requirements. Guardrails for Amazon Bedrock evaluates user inputs and model responses based on use case-specific policies, and provides an additional layer of safeguards regardless of the underlying foundation model (FM). Guardrails can be applied across all LLMs on Amazon Bedrock, including fine-tuned models and even generative AI applications outside of Amazon Bedrock. You can create multiple guardrails, each configured with a different combination of controls, and use these guardrails across different applications and use cases. You can configure guardrails in multiple ways, including to deny topics, filter harmful content, remove sensitive information, and perform contextual grounding checks.

The new ApplyGuardrail API enables you to assess any text using your preconfigured guardrails in Amazon Bedrock, without invoking the FMs. In this post, we demonstrate how to use the ApplyGuardrail API with long-context inputs and streaming outputs.

ApplyGuardrail API overview

The ApplyGuardrail API offers several key features:

  • Ease of use – You can integrate the API anywhere in your application flow to validate data before processing or serving results to users. For example, in a Retrieval Augmented Generation (RAG) application, you can now evaluate the user input prior to performing the retrieval instead of waiting until the final response generation.
  • Decoupled from FMs – The API is decoupled from FMs, allowing you to use guardrails without invoking FMs from Amazon Bedrock. For example, you can now use the API with models hosted on Amazon SageMaker, self-hosted models, or models from third-party providers. All that’s needed is to take the input or output and request an assessment using the API.

You can use the assessment results from the ApplyGuardrail API to design the experience on your generative AI application, making sure it adheres to your defined policies and guidelines.

The ApplyGuardrail API request allows you to pass all your content that should be guarded using your defined guardrails. The source field should be set to INPUT when the content to be evaluated is from a user, typically the LLM prompt. The source should be set to OUTPUT when guardrails should be enforced on the model output, typically an LLM response. An example request looks like the following code:

{
    "source": "INPUT" | "OUTPUT",
    "content": [{
        "text": {
            "text": "This is a sample text snippet...",
        }
    }]
}

For more information about the API structure, refer to Guardrails for Amazon Bedrock.

Streaming output

LLMs can generate text in a streaming manner, where the output is produced token by token or word by word, rather than generating the entire output at once. This streaming output capability is particularly useful in scenarios where real-time interaction or continuous generation is required, such as conversational AI assistants or live captioning. Incrementally displaying the output allows for a more natural and responsive user experience. Although it’s advantageous in terms of responsiveness, streaming output introduces challenges when it comes to applying guardrails in real time as the output is generated. Unlike the input scenario, where the entire text is available upfront, the output is generated incrementally, making it difficult to assess the complete context and potential violations.

One of the main challenges is the need to evaluate the output as it’s being generated, without waiting for the entire output to be complete. This requires a mechanism to continuously monitor the streaming output and apply guardrails in real time, while also considering the context and coherence of the generated text. Furthermore, the decision to halt or continue the generation process based on the guardrail assessment needs to be made in real time, which can impact the responsiveness and user experience of the application.

Solution overview: Use guardrails on streaming output

To address the challenges of applying guardrails on streaming output from LLMs, a strategy that combines batching and real-time assessment is required. This strategy involves collecting the streaming output into smaller batches or chunks, evaluating each batch using the ApplyGuardrail API, and then taking appropriate actions based on the assessment results.

The first step in this strategy is to batch the streaming output chunks into batches that are close to a text unit, which is 1,000 characters. If a batch is smaller, such as 600 characters, you’re still charged for an entire text unit (1,000 characters). For cost-effective usage of the API, it’s recommended that batches be on the order of whole text units, such as 1,000 characters, 2,000 characters, and so on. This way, you minimize the risk of incurring unnecessary costs.

By batching the output into smaller batches, you can invoke the ApplyGuardrail API more frequently, allowing for real-time assessment and decision-making. The batching process should be designed to maintain the context and coherence of the generated text. This can be achieved by making sure the batches don’t split words or sentences, and by carrying over any necessary context from the previous batch. Chunking varies between use cases; for the sake of simplicity, this post showcases simple character-level chunking, but it’s recommended to explore options such as semantic chunking or hierarchical chunking while still adhering to the guidelines mentioned in this post.
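
For instance, a batching helper could cut each batch at the last sentence terminator before the text-unit boundary so that sentences aren’t split across batches. The following is a minimal sketch of that idea (an illustration only, not the chunking used in the rest of this post, which is character-level for simplicity):

import re

def split_at_sentence_boundary(buffer_text: str, text_unit: int = 1000):
    """Return (batch, remainder), where the batch ends at the last sentence terminator
    at or before text_unit characters, so sentences aren't split across batches."""
    if len(buffer_text) <= text_unit:
        return buffer_text, ""
    window = buffer_text[:text_unit]
    ends = [m.end() for m in re.finditer(r"[.!?。]", window)]  # sentence-ending punctuation
    cut = ends[-1] if ends else text_unit  # fall back to a hard cut if no terminator is found
    return buffer_text[:cut], buffer_text[cut:]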

After the streaming output has been batched into smaller chunks, each chunk can be passed to the API for evaluation. The API will assess the content of each chunk against the defined policies and guidelines, identifying any potential violations or sensitive information.

The assessment results from the API can then be used to determine the appropriate action for the current batch. If a severe violation is detected, the API assessment suggests halting the generation process, and a preset message or response can be displayed to the user instead. In other cases, no severe violation is detected, but the guardrail was configured to process the request and pass it through, for example when sensitiveInformationPolicyConfig is set to anonymize detected entities instead of blocking. If such an intervention occurs, the output is masked or modified accordingly before being displayed to the user. For latency-sensitive applications, you can also consider creating multiple buffers and multiple guardrails, each with different policies, and then processing them with the ApplyGuardrail API in parallel. This way, instead of waiting on assessments for one guardrail at a time, you can obtain assessments from multiple guardrails and multiple batches concurrently, although this technique hasn’t been implemented in this example.
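
A minimal sketch of that parallel fan-out is shown below (an illustration only; it assumes you have two separately configured guardrails, the guardrail IDs are placeholders, and error handling is omitted):

from concurrent.futures import ThreadPoolExecutor

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical guardrails, each configured with a different combination of policies.
guardrails = [
    {"id": "<pii_guardrail_id>", "version": "1"},
    {"id": "<topic_guardrail_id>", "version": "1"},
]

def assess(text, source, guardrail):
    # One ApplyGuardrail call per guardrail; the calls run concurrently in the pool below.
    return bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail["id"],
        guardrailVersion=guardrail["version"],
        source=source,
        content=[{"text": {"text": text}}],
    )

def assess_in_parallel(text, source="OUTPUT"):
    with ThreadPoolExecutor(max_workers=len(guardrails)) as pool:
        responses = list(pool.map(lambda g: assess(text, source, g), guardrails))
    # Treat the batch as intervened if any of the guardrails intervened.
    return any(r["action"] == "GUARDRAIL_INTERVENED" for r in responses), responses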

Example use case: Apply guardrails to streaming output

In this section, we provide an example of how such a strategy could be implemented. Let’s begin with creating a guardrail. You can use the following code sample to create a guardrail in Amazon Bedrock:

import boto3
REGION_NAME = "us-east-1"

bedrock_client = boto3.client("bedrock", region_name=REGION_NAME)
bedrock_runtime = boto3.client("bedrock-runtime", region_name=REGION_NAME)

response = bedrock_client.create_guardrail(
    name="<name>",
    description="<description>",
    ...
)
# alternatively provide the id and version for your own guardrail
guardrail_id = response['guardrailId'] 
guardrail_version = response['version']

Proper assessment of the policies must be conducted to verify whether the input should be sent to an LLM, or whether the output generated by the LLM should be displayed to the user. In the following code, we examine the assessments, which are part of the response from the ApplyGuardrail API, for potential severe violations leading to a BLOCKED intervention by the guardrail:

from typing import List, Dict
def check_severe_violations(violations: List[Dict]) -> int:
    """
    When the guardrail intervenes, the action on the request is either BLOCKED or NONE.
    This method counts the violations that lead to blocking the request.

    Args:
        violations (List[Dict]): A list of violation dictionaries, where each dictionary has an 'action' key.

    Returns:
        int: The number of severe violations (where the 'action' is 'BLOCKED').
    """
    severe_violations = [violation['action']=='BLOCKED' for violation in violations]
    return sum(severe_violations)

def is_policy_assessement_blocked(assessments: List[Dict]) -> bool:
    """
    While creating the guardrail, you can specify multiple types of policies.
    At assessment time, all the policies are checked for potential violations.
    If there is even one violation that blocks the request, the entire request is blocked.
    This method checks whether the policy assessment is blocked based on the given assessments.

    Args:
        assessments (list[dict]): A list of assessment dictionaries, where each dictionary may contain 'topicPolicy', 'wordPolicy', 'sensitiveInformationPolicy', and 'contentPolicy' keys.

    Returns:
        bool: True if the policy assessment is blocked, False otherwise.
    """
    blocked = []
    for assessment in assessments:
        if 'topicPolicy' in assessment:
            blocked.append(check_severe_violations(assessment['topicPolicy']['topics']))
        if 'wordPolicy' in assessment:
            if 'customWords' in assessment['wordPolicy']:
                blocked.append(check_severe_violations(assessment['wordPolicy']['customWords']))
            if 'managedWordLists' in assessment['wordPolicy']:
                blocked.append(check_severe_violations(assessment['wordPolicy']['managedWordLists']))
        if 'sensitiveInformationPolicy' in assessment:
            if 'piiEntities' in assessment['sensitiveInformationPolicy']:
                blocked.append(check_severe_violations(assessment['sensitiveInformationPolicy']['piiEntities']))
            if 'regexes' in assessment['sensitiveInformationPolicy']:
                blocked.append(check_severe_violations(assessment['sensitiveInformationPolicy']['regexes']))
        if 'contentPolicy' in assessment:
            blocked.append(check_severe_violations(assessment['contentPolicy']['filters']))
    severe_violation_count = sum(blocked)
    print(f'::Guardrail:: {severe_violation_count} severe violations detected')
    return severe_violation_count>0

We can then define how to apply the guardrail. If the response from the API leads to action == 'GUARDRAIL_INTERVENED', it means the guardrail has detected a potential violation. We then need to check whether the violation was severe enough to block the request, or whether the request can pass through with either the same text as the input or an alternate text modified according to the defined policies:

def apply_guardrail(text, source, guardrail_id, guardrail_version):
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version, 
        source=source,
        content=[{"text": {"text": text}}]
    )
    if response['action'] == 'GUARDRAIL_INTERVENED':
        is_blocked = is_policy_assessement_blocked(response['assessments'])
        alternate_text = ' '.join([output['text'] for output in response['output']])
        return is_blocked, alternate_text, response
    else:
        # Return the default response in case of no guardrail intervention
        return False, text, response

Let’s now apply our strategy for streaming output from an LLM. We can maintain a buffer_text, which accumulates a batch of chunks received from the stream. As soon as len(buffer_text + new_text) > TEXT_UNIT, meaning the batch is close to a text unit (1,000 characters), it’s ready to be sent to the ApplyGuardrail API. With this mechanism, we make sure we don’t incur the unnecessary cost of invoking the API on smaller chunks, and we also make sure enough context is available inside each batch for the guardrail to make meaningful assessments. Additionally, when generation is complete, the final batch must also be tested for potential violations. If at any point the API detects severe violations, further consumption of the stream is halted and the preset message defined at guardrail creation is displayed to the user.

In the following example, we ask the LLM to generate three names and tell us what a bank is. This generation will lead to a GUARDRAIL_INTERVENED action, but it won’t block the generation; instead, the text is anonymized (the names are masked) and generation continues.

input_message = "List 3 names of prominent CEOs and later tell me what is a bank and what are the benefits of opening a savings account?"

model_id = "anthropic.claude-3-haiku-20240307-v1:0"
text_unit= 1000 # characters

response = bedrock_runtime.converse_stream(
    modelId=model_id,
    messages=[{
        "role": "user",
        "content": [{"text": input_message}]
    }],
    system=[{"text": "You are an assistant that helps with tasks from users. Be as elaborate as possible"}],
)

stream = response.get('stream')
buffer_text = ""
if stream:
    for event in stream:
        if 'contentBlockDelta' in event:
            new_text = event['contentBlockDelta']['delta']['text']
            if len(buffer_text + new_text) > text_unit:
                is_blocked, alt_text, guardrail_response = apply_guardrail(buffer_text, "OUTPUT", guardrail_id, guardrail_version)
                # print(alt_text, end="")
                if is_blocked:
                    break
                buffer_text = new_text
            else: 
                buffer_text += new_text

        if 'messageStop' in event:
            # print(f"\nStop reason: {event['messageStop']['stopReason']}")
            is_blocked, alt_text, guardrail_response = apply_guardrail(buffer_text, "OUTPUT", guardrail_id, guardrail_version)
            # print(alt_text)

After running the preceding code, we receive an example output with masked names:

Certainly! Here are three names of prominent CEOs:

1. {NAME} - CEO of Apple Inc.
2. {NAME} - CEO of Microsoft Corporation
3. {NAME} - CEO of Amazon

Now, let's discuss what a bank is and the benefits of opening a savings account.

A bank is a financial institution that accepts deposits, provides loans, and offers various other financial services to its customers. Banks play a crucial role in the economy by facilitating the flow of money and enabling financial transactions.

Long-context inputs

RAG is a technique that enhances LLMs by incorporating external knowledge sources. It allows LLMs to reference authoritative knowledge bases before generating responses, producing output tailored to specific contexts while providing relevance, accuracy, and efficiency. The input to the LLM in a RAG scenario can be quite long, because it includes the user’s query concatenated with the retrieved information from the knowledge base. This long-context input poses challenges when applying guardrails, because the input may exceed the character limits imposed by the ApplyGuardrail API. To learn more about the quotas applied to Guardrails for Amazon Bedrock, refer to Guardrails quotas.

In the previous section, we evaluated a strategy to mitigate the risk from the model’s response. In the case of inputs, the risk could lie in the user’s query itself, or in the query combined with the context retrieved for that query.

The retrieved information from the knowledge base may contain sensitive or potentially harmful content, which needs to be identified and handled appropriately, for example masking sensitive information, before being passed to the LLM for generation. Therefore, guardrails must be applied to the entire input to make sure it adheres to the defined policies and constraints.

Solution overview: Use guardrails on long-context inputs

The ApplyGuardrail API has a default limit of 25 text units (approximately 25,000 characters) per second. If the input exceeds this limit, it needs to be chunked and processed sequentially to avoid throttling. Therefore, the strategy is relatively straightforward: if the input text is shorter than 25 text units (25,000 characters), it can be evaluated in a single request; otherwise, it needs to be broken down into smaller pieces. The chunk size can vary depending on application behavior and the type of context in the application; you can start with 12 text units and iterate to find the most suitable chunk size. This way, we maximize the allowed default limit while keeping most of the context intact in a single request. Even if the guardrail action is GUARDRAIL_INTERVENED, it doesn’t mean the input is BLOCKED. It could also be the case that the input is processed and sensitive information is masked; in this case, the input text must be recompiled with the processed response from the applied guardrail.

from textwrap import wrap

text_unit = 1000 # characters
limit_text_unit = 25
max_text_units_in_chunk = 12
def apply_guardrail_with_chunking(text, guardrail_id, guardrail_version="DRAFT"):
    text_length = len(text)
    filtered_text = ''
    if text_length <= limit_text_unit * text_unit:
        return apply_guardrail(text, "INPUT", guardrail_id, guardrail_version)
    else:
        # If the text length is greater than the default text unit limits then it's better to chunk the text to avoid throttling.
        for i, chunk in enumerate(wrap(text, max_text_units_in_chunk * text_unit)):
            print(f'::Guardrail::Applying guardrails at chunk {i+1}')
            is_blocked, alternate_text, response = apply_guardrail(chunk, "INPUT", guardrail_id, guardrail_version)
            if is_blocked:
                filtered_text = alternate_text
                break
            # It could be the case that guardrails intervened and anonymized PII in the input text,
            # we can then take the output from guardrails to create filtered text response.
            filtered_text += alternate_text
        return is_blocked, filtered_text, response

Run the full notebook to test this strategy with long-context input.

Best practices and considerations

When applying guardrails, it’s essential to follow best practices to maintain efficient and effective content moderation:

  • Optimize chunking strategy – Carefully consider the chunking strategy. The chunk size should balance the trade-off between minimizing the number of API calls and making sure the context isn’t lost due to overly small chunks. Similarly, the chunking strategy should take into account the context split; a critical piece of text could span two (or more) chunks if not carefully divided.
  • Asynchronous processing – Implement asynchronous processing for RAG contexts. This can help decouple the guardrail application process from the main application flow, improving responsiveness and overall performance. For frequently retrieved context from vector databases, ApplyGuardrail could be applied one time and the results stored in metadata (see the sketch after this list), avoiding redundant API calls for the same content. This can significantly improve performance and reduce costs.
  • Develop comprehensive test suites – Create a comprehensive test suite that covers a wide range of scenarios, including edge cases and corner cases, to validate the effectiveness of your guardrail implementation.
  • Implement fallback mechanisms – There could be scenarios where the guardrail you created doesn’t cover all possible vulnerabilities and is unable to catch edge cases. For such scenarios, it’s wise to have a fallback mechanism. One option could be to bring a human into the loop, or to use an LLM as a judge to evaluate both the input and the output.
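
The following is a minimal sketch of the caching idea mentioned in the asynchronous processing item above. It assumes the apply_guardrail helper defined earlier in this post and uses a simple in-memory dictionary; a real application would more likely persist the results in the vector store’s document metadata:

import hashlib

guardrail_cache = {}  # illustrative in-memory cache; persist in document metadata in practice

def apply_guardrail_cached(text, source, guardrail_id, guardrail_version):
    # Key the cached assessment by the guardrail and the exact content evaluated.
    key = hashlib.sha256(f"{guardrail_id}:{guardrail_version}:{source}:{text}".encode()).hexdigest()
    if key not in guardrail_cache:
        # First time this content is seen: call ApplyGuardrail and remember the verdict.
        guardrail_cache[key] = apply_guardrail(text, source, guardrail_id, guardrail_version)
    return guardrail_cache[key]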

In addition to the aforementioned considerations, it’s also good practice to regularly audit your guardrail implementation, continuously refine and adapt your guardrail implementation, and implement logging and monitoring mechanisms to capture and analyze the performance and effectiveness of your guardrails.

Clean up

The only resource we created in this example is a guardrail. To delete the guardrail, complete the following steps:

  1. On the Amazon Bedrock console, under Safeguards in the navigation pane, choose Guardrails.
  2. Select the guardrail you created and choose Delete.

Alternatively, you can use the SDK:

bedrock_client.delete_guardrail(guardrailIdentifier = "<your_guardrail_id>")

Key takeaways

Applying guardrails is crucial for maintaining responsible and safe content generation. With the ApplyGuardrail API from Amazon Bedrock, you can effectively moderate both inputs and outputs, protecting your generative AI application against violations and maintaining compliance with your content policies.

Key takeaways from this post include:

  • Understand the importance of applying guardrails in generative AI applications to mitigate risks and maintain content moderation standards
  • Use the ApplyGuardrail API from Amazon Bedrock to validate inputs and outputs against defined policies and rules
  • Implement chunking strategies for long-context inputs and batching techniques for streaming outputs to efficiently utilize the ApplyGuardrail API
  • Follow best practices, optimize performance, and continuously monitor and refine your guardrail implementation to maintain effectiveness and alignment with evolving content moderation needs

Benefits

By incorporating the ApplyGuardrail API into your generative AI application, you can unlock several benefits:

  • Content moderation at scale – The API allows you to moderate content at scale, so your application remains compliant with content policies and guidelines, even when dealing with large volumes of data
  • Customizable policies – You can define and customize content moderation policies tailored to your specific use case and requirements, making sure your application adheres to your organization’s standards and values
  • Real-time moderation – The API enables real-time content moderation, allowing you to detect and mitigate potential violations as they occur, providing a safe and responsible user experience
  • Integration with any LLM – ApplyGuardrail is an independent API, so it can be integrated with any of your LLMs of choice, allowing you to take advantage of the power of generative AI while maintaining control over the content being generated
  • Cost-effective solution – With its pay-per-use pricing model and efficient text unit-based billing, the API provides a cost-effective solution for content moderation, especially when dealing with large volumes of data

Conclusion

By using the ApplyGuardrail API from Amazon Bedrock and following the best practices outlined in this post, you can make sure your generative AI application remains safe, responsible, and compliant with content moderation standards, even with long-context inputs and streaming outputs.

To further explore the capabilities of the ApplyGuardrail API and its integration with your generative AI application, consider experimenting with the API using the following resources:

  • Refer to Guardrails for Amazon Bedrock for detailed information on the ApplyGuardrail API, its usage, and integration examples
  • Check out the AWS samples GitHub repository for sample code and reference architectures demonstrating the integration of the ApplyGuardrail API with various generative AI applications
  • Participate in AWS-hosted workshops and tutorials focused on responsible AI and content moderation, where you can learn from experts and gain hands-on experience with the ApplyGuardrail API



About the Author

Talha Chattha is a Generative AI Specialist Solutions Architect at Amazon Web Services, based in Stockholm. Talha helps establish practices to ease the path to production for generative AI workloads. He is an expert in Amazon Bedrock and supports customers across EMEA. Talha is passionate about meta-agents, scalable on-demand inference, advanced RAG solutions, and cost-optimized prompt engineering with LLMs. When not shaping the future of AI, he explores the scenic European landscapes and delicious cuisines. Connect with Talha on LinkedIn at /in/talha-chattha/.


NVIDIA and Zoox Pave the Way for Autonomous Ride-Hailing


In celebration of Zoox’s 10th anniversary, NVIDIA founder and CEO Jensen Huang recently joined the robotaxi company’s CEO, Aicha Evans, and its cofounder and CTO, Jesse Levinson, to discuss the latest in autonomous vehicle (AV) innovation and experience a ride in the Zoox robotaxi.

In a fireside chat at Zoox’s headquarters in Foster City, Calif., the trio reflected on the two companies’ decade of collaboration. Evans and Levinson highlighted how Zoox pioneered the concept of a robotaxi purpose-built for ride-hailing and created groundbreaking innovations along the way, using NVIDIA technology.

“The world has never seen a robotics company like this before,” said Huang. “Zoox started out solely as a sustainable robotics company that delivers robots into the world as a fleet.”

Since 2014, Zoox has been on a mission to create fully autonomous, bidirectional vehicles purpose-built for ride-hailing services. This sets it apart in an industry largely focused on retrofitting existing cars with self-driving technology.

A decade later, the company is operating its robotaxi, powered by NVIDIA GPUs, on public roads.

Computing at the Core

Zoox robotaxis are, at their core, supercomputers on wheels. They’re built on multiple NVIDIA GPUs dedicated to processing the enormous amounts of data generated in real time by their sensors.

The sensor array includes cameras, lidar, radar, long-wave infrared sensors and microphones. The onboard computing system rapidly processes the raw sensor data collected and fuses it to provide a coherent understanding of the vehicle’s surroundings.

The processed data then flows through a perception engine and prediction module to planning and control systems, enabling the vehicle to navigate complex urban environments safely.

NVIDIA GPUs deliver the immense computing power required for the Zoox robotaxis’ autonomous capabilities and continuous learning from new experiences.

Using Simulation as a Virtual Proving Ground

Key to Zoox’s AV development process is its extensive use of simulation. The company uses NVIDIA GPUs and software tools to run a wide array of simulations, testing its autonomous systems in virtual environments before real-world deployment.

These simulations range from synthetic scenarios to replays of real-world scenarios created using data collected from test vehicles. Zoox uses retrofitted Toyota Highlanders equipped with the same sensor and compute packages as its robotaxis to gather driving data and validate its autonomous technology.

This data is then fed back into simulation environments, where it can be used to create countless variations and replays of scenarios and agent interactions.

Zoox also uses what it calls “adversarial simulations,” carefully crafted scenarios designed to test the limits of the autonomous systems and uncover potential edge cases.

The company’s comprehensive approach to simulation allows it to rapidly iterate and improve its autonomous driving software, bolstering AV safety and performance.

“We’ve been using NVIDIA hardware since the very start,” said Levinson. “It’s a huge part of our simulator, and we rely on NVIDIA GPUs in the vehicle to process everything around us in real time.”

A Neat Way to Seat

Zoox’s robotaxi, with its unique bidirectional design and carriage-style seating, is optimized for autonomous operation and passenger comfort, eliminating traditional concepts of a car’s “front” and “back” and providing equal comfort and safety for all occupants.

“I came to visit you when you were zero years old, and the vision was compelling,” Huang said, reflecting on Zoox’s evolution over the years. “The challenge was incredible. The technology, the talent — it is all world-class.”

Using NVIDIA GPUs and tools, Zoox is poised to redefine urban mobility, pioneering a future of safe, efficient and sustainable autonomous transportation for all.

From Testing Miles to Market Projections

As the AV industry gains momentum, recent projections highlight the potential for explosive growth in the robotaxi market. Guidehouse Insights forecasts over 5 million robotaxi deployments by 2030, with numbers expected to surge to almost 34 million by 2035.

The regulatory landscape reflects this progress, with 38 companies currently holding valid permits to test AVs with safety drivers in California. Zoox is currently one of only six companies permitted to test AVs without safety drivers in the state.

As the industry advances, Zoox has created a next-generation robotaxi by combining cutting-edge onboard computing with extensive simulation and development.

In the image at top, NVIDIA founder and CEO Jensen Huang stands with Zoox CEO Aicha Evans and Zoox cofounder and CTO Jesse Levinson in front of a Zoox robotaxi.


NVIDIA Researchers Harness Real-Time Gen AI to Build Immersive Desert World


NVIDIA researchers used NVIDIA Edify, a multimodal architecture for visual generative AI, to build a detailed 3D desert landscape within a few minutes in a live demo at SIGGRAPH’s Real-Time Live event on Tuesday.

During the event — one of the prestigious graphics conference’s top sessions — NVIDIA researchers showed how, with the support of an AI agent, they could build and edit a desert landscape from scratch within five minutes. The live demo highlighted how generative AI can act as an assistant to artists by accelerating ideation and generating custom secondary assets that would otherwise have been sourced from a repository.

By drastically decreasing ideation time, these AI technologies will empower 3D artists to be more productive and creative — giving them the tools to explore concepts faster and expedite parts of their workflows. They could, for example, generate the background assets or 360 HDRi environments that the scene needs in minutes, instead of spending hours finding or creating them.

From Idea to 3D Scene in Three Minutes

Creating a full 3D scene is a complex, time-consuming task. Artists must support their hero asset with plenty of background objects to create a rich scene, then find an appropriate background and an environment map to light it. Due to time constraints, they’ve often had to make a trade-off between rapid results and creative exploration.

With the support of AI agents, creative teams can achieve both goals: quickly bring concepts to life and continue iterating to achieve the right look.

In the Real-Time Live demo, the researchers used an AI agent to instruct an NVIDIA Edify-powered model to generate dozens of 3D assets, including cacti, rocks and the skull of a bull — with previews produced in just seconds.

They next directed the agent to harness other models to create potential backgrounds and a layout of how the objects would be placed in the scene — and showcased how the agent could adapt to last-minute changes in creative direction by quickly swapping the rocks for gold nuggets.

With a design plan in place, they prompted the agent to create full-quality assets and render the scene as a photorealistic image in NVIDIA Omniverse USD Composer, an app for virtual world-building.

NVIDIA Edify Accelerates Environment Generation 

NVIDIA Edify models can help creators focus on hero assets while accelerating the creation of background environments and objects using AI-powered scene generation tools. The Real-Time Live demo showcased two Edify models: 

  • Edify 3D generates ready-to-edit 3D meshes from text or image prompts. Within seconds, the model can generate previews, including rotating animations of each object, to help creators rapidly prototype before committing to a specific design.
  • Edify 360 HDRi uses text or image prompts to generate up to 16K high-dynamic range images (HDRi) of nature landscapes, which can be used as backgrounds and to light scenes.

During the demo, the researchers also showcased an AI agent powered by a large language model, and USD Layout, an AI model that generates scene layouts using OpenUSD, a platform for 3D workflows.

At SIGGRAPH, NVIDIA also announced that two leading creative content companies are giving designers and artists new ways to boost productivity with generative AI using tools powered by NVIDIA Edify.

Shutterstock has launched in commercial beta its Generative 3D service, which lets creators quickly prototype and generate 3D assets using text or image prompts. Its 360 HDRi generator based on Edify also entered early access.

Getty Images updated its Generative AI by Getty Images service with the latest version of NVIDIA Edify. Users can now create images twice as fast, with improved output quality and prompt adherence, and advanced controls and fine-tuning.

Harnessing Universal Scene Description in NVIDIA Omniverse

The 3D objects, environment maps and layouts generated using Edify models are structured with USD, a standard format for describing and composing 3D worlds. This compatibility allows artists to immediately import Edify-powered creations into Omniverse USD Composer.

Within Composer, they can use popular digital content creation tools to further modify the scene by, for example, changing the position of objects, modifying their appearance or adjusting lighting.

Real-Time Live is one of the most anticipated events at SIGGRAPH, featuring about a dozen real-time applications including generative AI, virtual reality and live performance capture technology.


Oracle Cloud Infrastructure Expands NVIDIA GPU-Accelerated Instances for AI, Digital Twins and More


Enterprises are rapidly adopting generative AI, large language models (LLMs), advanced graphics and digital twins to increase operational efficiencies, reduce costs and drive innovation.

However, to adopt these technologies effectively, enterprises need access to state-of-the-art, full-stack accelerated computing platforms. To meet this demand, Oracle Cloud Infrastructure (OCI) today announced NVIDIA L40S GPU bare-metal instances available to order and the upcoming availability of a new virtual machine accelerated by a single NVIDIA H100 Tensor Core GPU. This new VM expands OCI’s existing H100 portfolio, which includes an NVIDIA HGX H100 8-GPU bare-metal instance.

Paired with NVIDIA networking and running the NVIDIA software stack, these platforms deliver powerful performance and efficiency, enabling enterprises to advance generative AI.

NVIDIA L40S Now Available to Order on OCI

The NVIDIA L40S is a universal data center GPU designed to deliver breakthrough multi-workload acceleration for generative AI, graphics and video applications. Equipped with fourth-generation Tensor Cores and support for the FP8 data format, the L40S GPU excels in training and fine-tuning small- to mid-size LLMs and in inference across a wide range of generative AI use cases.

For example, a single L40S GPU (FP8) can generate up to 1.4x more tokens per second than a single NVIDIA A100 Tensor Core GPU (FP16) for Llama 3 8B with NVIDIA TensorRT-LLM at an input and output sequence length of 128.

The L40S GPU also has best-in-class graphics and media acceleration. Its third-generation NVIDIA Ray Tracing Cores (RT Cores) and multiple encode/decode engines make it ideal for advanced visualization and digital twin applications.

The L40S GPU delivers up to 3.8x the real-time ray-tracing performance of its predecessor, and supports NVIDIA DLSS 3 for faster rendering and smoother frame rates. This makes the GPU ideal for developing applications on the NVIDIA Omniverse platform, enabling real-time, photorealistic 3D simulations and AI-enabled digital twins. With Omniverse on the L40S GPU, enterprises can develop advanced 3D applications and workflows for industrial digitalization that will allow them to design, simulate and optimize products, processes and facilities in real time before going into production.

OCI will offer the L40S GPU in its BM.GPU.L40S.4 bare-metal compute shape, featuring four NVIDIA L40S GPUs, each with 48GB of GDDR6 memory. This shape includes local NVMe drives with 7.38TB capacity, 4th Generation Intel Xeon CPUs with 112 cores and 1TB of system memory.

These shapes eliminate the overhead of any virtualization for high-throughput and latency-sensitive AI or machine learning workloads with OCI’s bare-metal compute architecture. The accelerated compute shape features the NVIDIA BlueField-3 DPU for improved server efficiency, offloading data center tasks from CPUs to accelerate networking, storage and security workloads. The use of BlueField-3 DPUs furthers OCI’s strategy of off-box virtualization across its entire fleet.

OCI Supercluster with NVIDIA L40S enables ultra-high performance with 800Gbps of internode bandwidth and low latency for up to 3,840 GPUs. OCI’s cluster network uses NVIDIA ConnectX-7 NICs over RoCE v2 to support high-throughput and latency-sensitive workloads, including AI training.

“We chose OCI AI infrastructure with bare-metal instances and NVIDIA L40S GPUs for 30% more efficient video encoding,” said Sharon Carmel, CEO of Beamr Cloud. “Videos processed with Beamr Cloud on OCI will have up to 50% reduced storage and network bandwidth consumption, speeding up file transfers by 2x and increasing productivity for end users. Beamr will provide OCI customers video AI workflows, preparing them for the future of video.”

Single-GPU H100 VMs Coming Soon on OCI 

The VM.GPU.H100.1 compute virtual machine shape, accelerated by a single NVIDIA H100 Tensor Core GPU, is coming soon to OCI. This will provide cost-effective, on-demand access for enterprises looking to use the power of NVIDIA H100 GPUs for their generative AI and HPC workloads.

A single H100 provides a good platform for smaller workloads and LLM inference. For example, one H100 GPU can generate more than 27,000 tokens per second for Llama 3 8B (up to 4x more throughput than a single A100 GPU at FP16 precision) with NVIDIA TensorRT-LLM at an input and output sequence length of 128 and FP8 precision.

The VM.GPU.H100.1 shape includes 2×3.4TB of NVMe drive capacity, 13 cores of 4th Gen Intel Xeon processors and 246GB of system memory, making it well-suited for a range of AI tasks.

“Oracle Cloud’s bare-metal compute with NVIDIA H100 and A100 GPUs, low-latency Supercluster and high-performance storage delivers up to 20% better price-performance for Altair’s computational fluid dynamics and structural mechanics solvers,” said Yeshwant Mummaneni, chief engineer of data management analytics at Altair. “We look forward to leveraging these GPUs with virtual machines for the Altair Unlimited virtual appliance.”

GH200 Bare-Metal Instances Available for Validation

OCI has also made available the BM.GPU.GH200 compute shape for customer testing. It features the NVIDIA Grace Hopper Superchip and NVLink-C2C, a high-bandwidth, cache-coherent 900GB/s connection between the NVIDIA Grace CPU and NVIDIA Hopper GPU. This provides over 600GB of accessible memory, enabling up to 10x higher performance for applications running terabytes of data compared to the NVIDIA A100 GPU.

Optimized Software for Enterprise AI 

Enterprises have a wide variety of NVIDIA GPUs to accelerate their AI, HPC and data analytics workloads on OCI. However, maximizing the full potential of these GPU-accelerated compute instances requires an optimized software layer.

NVIDIA NIM, part of the NVIDIA AI Enterprise software platform available on the OCI Marketplace, is a set of easy-to-use microservices designed for the secure, reliable deployment of high-performance AI model inference to build world-class generative AI applications.

Optimized for NVIDIA GPUs, NIM pre-built containers offer developers improved cost of ownership, faster time to market and security. NIM microservices for popular community models, found on the NVIDIA API Catalog, can be deployed easily on OCI.

Performance will continue to improve over time with upcoming GPU-accelerated instances, including NVIDIA H200 Tensor Core GPUs and NVIDIA Blackwell GPUs.

Order the L40S GPU and test the GH200 Superchip by reaching out to OCI. To learn more, join Oracle and NVIDIA at SIGGRAPH, the world’s premier graphics conference, running through Aug. 1.



Research Focus: Week of July 29, 2024


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


Scalable Differentiable Causal Discovery in the Presence of Latent Confounders with Skeleton Posterior

Differentiable causal discovery has made significant advancements in the learning of directed acyclic graphs. However, its application to real-world datasets remains restricted due to the ubiquity of latent confounders and the requirement to learn maximal ancestral graphs (MAGs). Previous differentiable MAG learning algorithms have been limited to small datasets and failed to scale to larger ones (e.g., with more than 50 variables).

In a recent paper: Scalable Differentiable Causal Discovery in the Presence of Latent Confounders with Skeleton Posterior, researchers from Microsoft and external colleagues explore the potential for causal skeleton, which is the undirected version of the causal graph, to improve accuracy and reduce the search space of the optimization procedure, thereby enhancing the performance of differentiable causal discovery. They propose SPOT (Skeleton Posterior-guided OpTimization), a two-phase framework that harnesses skeleton posterior for differentiable causal discovery in the presence of latent confounders.

Extensive experiments on various datasets show that SPOT substantially outperforms state-of-the-art methods for MAG learning. SPOT also demonstrates its effectiveness in the accuracy of skeleton posterior estimation in comparison with non-parametric bootstrap-based, or more recently, variational inference-based methods. The adoption of skeleton posterior exhibits strong promise in various causal discovery tasks.


Evaluating the Feasibility of Visual Imagery for an EEG-Based Brain–Computer Interface

Brain signals recorded via non-invasive electroencephalography (EEG) could help patients with severe neuromuscular disorders communicate with and control the world around them. Brain-computer interface (BCI) technology could use visual imagery, or the mental simulation of visual information from memory, as an effective control paradigm, directly conveying the user’s intention.

Initial investigations have been unable to fully evaluate the capabilities of true spontaneous visual mental imagery. One major limitation is that the target image is typically displayed immediately preceding the imagery period. This paradigm does not capture spontaneous mental imagery, as would be necessary in an actual BCI application, but something more akin to short-term retention in visual working memory.

In a recent paper: Evaluating the Feasibility of Visual Imagery for an EEG-Based Brain–Computer Interface, researchers from Microsoft and external colleagues show that short-term visual imagery following the presentation of a specific target image provides a stronger, more easily classifiable neural signature in EEG than spontaneous visual imagery from long-term memory following an auditory cue for the image. This research, published in IEEE Transactions on Neural Systems and Rehabilitation Engineering, provides the first direct comparison of short-term and long-term visual imagery tasks and provides greater insight into the feasibility of using visual imagery as a BCI control strategy.



Evolving Roles and Workflows of Creative Practitioners in the Age of Generative AI

Many creative practitioners – designers, software developers, and architects, for example – are using generative AI models to produce text, images, and other assets. While human-computer interaction (HCI) research explores specific generative AI models and creativity support tools, little is known about practitioners’ evolving roles and workflows with models across a project’s stages. This knowledge could help guide the development of the next generation of creativity support tools.

In a recent paper: Evolving Roles and Workflows of Creative Practitioners in the Age of Generative AI, researchers from Microsoft and the University of California San Diego contribute to this knowledge by employing a triangulated method to capture information from interviews, videos, and survey responses of creative practitioners reflecting on projects they completed with generative AI. Their observations help uncover a set of factors that capture practitioners’ perceived roles, challenges, benefits, and interaction patterns when creating with generative AI. From these factors, the researchers offer insights and propose design opportunities and priorities that encourage reflection from the wider community of creativity support tools and generative AI stakeholders, such as systems creators, researchers, and educators, on how to develop systems that meet the needs of creatives in human-centered ways.


“It’s like a rubber duck that talks back”: Understanding Generative AI-Assisted Data Analysis Workflows through a Participatory Prompting Study

End-user tools based on generative AI can help people complete many tasks. One such task is data analysis, which is notoriously challenging for non-experts, but also holds much potential for AI. To understand how data analysis workflows can be assisted or impaired by generative AI, researchers from Microsoft conducted a study using Bing Chat via participatory prompting, a newer methodology in which users and researchers reflect together on tasks through co-engagement with generative AI. The recent paper: “It’s like a rubber duck that talks back”: Understanding Generative AI-Assisted Data Analysis Workflows through a Participatory Prompting Study, demonstrates the value of the participatory prompting method. The researchers found that generative AI benefits the information foraging and sensemaking loops of data analysis in specific ways, but also introduces its own barriers and challenges, arising from the difficulties of query formulation, specifying context, and verifying results. Based on these findings, the paper presents several implications for future AI research and the design of new generative AI interactions.

