Secure AccountantAI Chatbot: Lili’s journey with Amazon Bedrock

This post was written in collaboration with Liran Zelkha and Eyal Solnik from Lili.

Small business proprietors tend to prioritize the operational aspects of their enterprises over administrative tasks, such as maintaining financial records and accounting. While hiring a professional accountant can provide valuable guidance and expertise, it can be cost-prohibitive for many small businesses. Moreover, the availability of accountants might not always align with the immediate needs of business owners, leaving them with unanswered questions or delayed decision-making processes.

In the rapidly evolving world of large language models (LLMs) and generative artificial intelligence (AI), Lili recognized an opportunity to use this technology to address the financial advisory needs of their small business customers. Using Anthropic’s Claude 3 Haiku on Amazon Bedrock, Lili developed an intelligent AccountantAI chatbot capable of providing on-demand accounting advice tailored to each customer’s financial history and unique business requirements. The AccountantAI chatbot serves as a virtual assistant, offering affordable and readily available financial guidance, empowering small business owners to focus on their core expertise while ensuring the financial health of their operations.

About Lili

Lili is a financial platform designed specifically for businesses, offering a combination of advanced business banking with built-in accounting and tax preparation software.

By consolidating financial tools into a user-friendly interface, Lili streamlines and simplifies the management of business finances, making it an attractive solution for business owners seeking a centralized and efficient way to run their financial operations.

In this post, we explore how Lili used Amazon Bedrock to build a secure and intelligent AccountantAI chatbot for small business owners. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like Anthropic, Meta, Mistral AI, Stability AI, Cohere, AI21 Labs, and Amazon through a single API, along with a broad set of capabilities that you need to build generative AI applications with security, privacy, and responsible AI.

Solution overview

The AccountantAI chatbot provides small business owners with accurate and relevant financial accounting advice in a secure manner. To achieve this, the solution is designed to address two key requirements:

  • Question validation: Implementing guardrails to ensure that the user’s input is a valid and legitimate financial accounting question. This step helps filter out irrelevant or inappropriate queries, maintaining the integrity of the system.
  • Context enrichment: Augmenting the user’s question with relevant contextual data, such as up-to-date accounting information and user-specific financial data. This step ensures that the chatbot’s responses are tailored to the individual user’s business and financial situation, providing more personalized and actionable advice.

To address the two key requirements of question validation and context enrichment, the AccountantAI solution employs a two-stage architecture comprising an ingestion workflow and a retrieval workflow.

Ingestion workflow


The ingestion workflow is an offline process that prepares the system for serving customer queries. For this stage, Lili curated a comprehensive golden collection of financial accounting questions, drawing from common inquiries as well as real-world questions from their customer base over the years. This diverse, high-quality collection serves as a reference corpus, ensuring that the chatbot can handle a wide range of relevant queries. The ingestion workflow transforms these curated questions into vector embeddings using the Amazon Titan Text Embeddings model API. This process occurs over AWS PrivateLink for Amazon Bedrock, providing a protected, private connection within your VPC. The vector embeddings are persisted in the application’s in-memory vector store and are later used to validate user input during the retrieval workflow.
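The following sketch illustrates this embedding step. It assumes a boto3 Bedrock Runtime client and a plain Python list as the in-memory vector store; the golden-collection entries shown are hypothetical examples rather than Lili’s actual data.

import json
import boto3

# Bedrock Runtime client; with AWS PrivateLink, traffic to this endpoint stays inside the VPC.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list:
    """Create a vector embedding with the Amazon Titan Text Embeddings model."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

# Hypothetical golden collection: each curated question is paired with its prompt template.
golden_collection = [
    {"question": "How do I categorize a business meal expense?", "template": "meal_expense_prompt"},
    {"question": "Which quarterly estimated taxes do I owe?", "template": "estimated_tax_prompt"},
]

# Offline ingestion: embed every curated question and keep the vectors in memory.
vector_store = [{"embedding": embed(item["question"]), **item} for item in golden_collection]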

Each curated vector embedding is paired with the prompt template that testing showed to be most effective for that question.

Example prompt template

<role>
Provides context about the agent's role as Lili's AI assistant for financial questions and outlines the general guidelines applied to all queries.
</role>

<about>
Provides details on Lili platform.
</about>

<features>
Lists all of Lili's product features in detail, ensuring that answers are aligned with the Lili platform. For instance, when addressing questions about tax reduction management, the prompt highlights the relevant features that Lili offers and that customers should be familiar with.
</features>

<output_format>
Outlines the required formatting for the response to ensure it meets the expected structure.
</output_format>

<user_data>
Data relevant to answering the customer's question.
</user_data>

<knowledge>
Specific accounting knowledge that is relevant to the question and that the model is not familiar with, such as updated data for 2024.
</knowledge>

<question>
Contains the user's actual question.
</question>

<instructions>
Provides the core instructions on how to approach answering the question appropriately and meet expectations. It also defines the steps in providing a detailed and high-quality answer.
</instructions>

<reminders>
Important guidelines to remind the agent and make sure it follows them, such as the exact format of the answer.
</reminders>

Retrieval workflow


Lili’s chatbot web interface allows users to submit queries and receive real-time responses. When a customer asks a question, it’s sent to the backend system for processing.

  1. The system first converts the query into a vector embedding using the Amazon Titan Text Embeddings model API, which is accessed securely through PrivateLink.
  2. Next, the system performs a similarity search on the pre-computed embeddings of the golden collection to find the most relevant matches for the user’s query. The system evaluates the similarity scores of the search results against a predetermined threshold. If the user’s question yields only matches with low similarity scores, it’s deemed malformed or unclear, and the user is prompted to rephrase or refine their query.
  3. However, if the user’s question produces matches with high similarity scores, it’s considered a legitimate query. In this case, Lili’s backend system proceeds with further processing using the golden question that has the highest similarity score to the user’s query.
  4. Based on the golden question with the highest similarity score, the system retrieves the corresponding prompt template.

This template is augmented with up-to-date accounting information and the customer’s specific financial data from external sources such as Amazon RDS for MySQL. The resulting contextualized prompt is sent to Anthropic’s Claude 3 Haiku on Amazon Bedrock, which generates a tailored response addressing the customer’s query within their unique business context.
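The following sketch outlines the retrieval flow under the same assumptions as the ingestion sketch above (it reuses the bedrock client, the embed function, and vector_store defined there). The similarity threshold and the load_template and fetch_user_data helpers are hypothetical; the request body and model ID follow the Amazon Bedrock Messages API for Anthropic’s Claude 3 Haiku.

import json
import numpy as np

SIMILARITY_THRESHOLD = 0.80  # hypothetical value; tuned per application

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question: str, user_id: str) -> str:
    # 1. Embed the incoming question with the same Titan model used during ingestion.
    q_vec = embed(question)

    # 2-3. Validate the question against the golden collection with a similarity threshold.
    best = max(vector_store, key=lambda item: cosine(q_vec, item["embedding"]))
    if cosine(q_vec, best["embedding"]) < SIMILARITY_THRESHOLD:
        return "Could you rephrase your question? I can only help with accounting topics."

    # 4. Retrieve the matching prompt template and enrich it with up-to-date accounting
    #    information and user-specific data (hypothetical helpers, e.g. backed by Amazon RDS for MySQL).
    prompt = load_template(best["template"]).format(
        user_data=fetch_user_data(user_id), question=question
    )

    # Send the contextualized prompt to Anthropic's Claude 3 Haiku on Amazon Bedrock.
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return json.loads(response["body"].read())["content"][0]["text"]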

Because model providers continually enhance their offerings, Amazon Bedrock makes it straightforward to adopt emerging advancements in generative AI across multiple model providers. This approach demonstrated its advantages from the initial rollout of AccountantAI: Lili transitioned from Anthropic’s Claude Instant to Claude 3 within two weeks of its official release on Amazon Bedrock and three weeks after its general availability.

Lili selected Anthropic’s Claude model family for AccountantAI after reviewing industry benchmarks and conducting their own quality assessment. Anthropic Claude on Amazon Bedrock consistently outperformed other models in understanding financial concepts, generating coherent natural language, and providing accurate, tailored recommendations.

After the initial release of AccountantAI, Amazon Bedrock introduced Anthropic’s Claude 3 Haiku model, which Lili evaluated against Anthropic Claude Instant. The Anthropic Claude 3 Haiku model demonstrated significant improvements across three key evaluation metrics:

  • Quality – Anthropic Claude 3 Haiku delivered higher quality outputs, providing more detailed and better-phrased responses compared to its predecessor.
  • Response time – Anthropic Claude 3 Haiku exhibited a 10 percent to 20 percent improvement in response times over Claude Instant, offering faster performance.
  • Cost – Anthropic Claude 3 Haiku on Amazon Bedrock is the most cost-effective choice. For instance, it is up to 68 percent less costly per 1,000 input/output tokens compared to Anthropic Claude Instant, while delivering higher levels of intelligence and performance. See Anthropic’s Claude 3 models on Amazon Bedrock for more information.

For customers like Lili, this underscores the importance of having access to a fully managed service like Amazon Bedrock, which offers a choice of high-performing foundation models to meet diverse enterprise AI needs. There is no “one size fits all” model, and the ability to select from a range of cutting-edge FMs is crucial for organizations seeking to use the latest advancements in generative AI effectively and cost-efficiently.

Conclusion

The AccountantAI feature, exclusively available to Lili customers, reduces the need for hiring a professional accountant. While professional accountants can provide valuable guidance and expertise, their services can be cost-prohibitive for many small businesses. AccountantAI has already answered thousands of questions, delivering real value to businesses and providing quality responses to financial, tax, and accounting inquiries.

Using Amazon Bedrock for easy, secure, and reliable access to high-performing foundation models from leading AI companies, Lili integrates accounting knowledge at scale with each customer’s unique data. This innovative solution offers affordable expertise on optimizing cash flow, streamlining tax planning, and enabling informed decisions to drive growth. AccountantAI bridges the gap in accounting resources, democratizing access to high-quality financial intelligence for every business.

Explore Lili’s AccountantAI feature powered by Amazon Bedrock to gain affordable and accessible financial intelligence for your business today, or use Amazon Bedrock Playgrounds to experiment with running inference on different models on your data.


About the authors

Doron Bleiberg is a senior AWS Startups Solutions Architect helping Fintech customers in their cloud journey.

Liran Zelkha is the co-founder and CTO at Lili, leading our development and data efforts.

Eyal Solnik is the head of Data at Lili and leads our AccountantAI product.

Read More

How Mend.io unlocked hidden patterns in CVE data with Anthropic Claude on Amazon Bedrock

This post is co-written with Maciej Mensfeld from Mend.io.

In the ever-evolving landscape of cybersecurity, the ability to effectively analyze and categorize Common Vulnerabilities and Exposures (CVEs) is crucial. This post explores how Mend.io, a cybersecurity firm, used Anthropic Claude on Amazon Bedrock to classify and identify CVEs containing specific attack requirement details. By using the power of large language models (LLMs), Mend.io streamlined the analysis of over 70,000 vulnerabilities, automating a process that would have been nearly impossible to accomplish manually and saving 200 days of human experts’ work. This also allows them to provide higher-quality verdicts to their customers, helping them prioritize vulnerabilities better and giving Mend.io a competitive advantage. This initiative not only underscores the transformative potential of AI in cybersecurity, but also provides valuable insights into the challenges and best practices for integrating LLMs into real-world applications.

The post delves into the challenges faced, such as managing quota limitations, estimating costs, and handling unexpected model responses. We also provide insights into the model selection process, results analysis, conclusions, recommendations, and Mend.io’s future outlook on integrating artificial intelligence (AI) in cybersecurity.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Mend.io is a cybersecurity company dedicated to safeguarding digital ecosystems through innovative solutions. With a deep commitment to using cutting-edge technologies, Mend.io has been at the forefront of integrating AI and machine learning (ML) capabilities into its operations. By continuously pushing the boundaries of what’s possible, Mend.io empowers organizations to stay ahead of evolving cyber threats and maintain a proactive, intelligent approach to security.

Uncovering attack requirements in CVE data

In the cybersecurity domain, the constant influx of CVEs presents a significant challenge. Each year, thousands of new vulnerabilities are reported, with descriptions varying in clarity, completeness, and structure. These reports, often contributed by a diverse global community, can be concise, ambiguous, or lack crucial details, burying critical information such as attack requirements, potential impact, and suggested mitigation steps. The unstructured nature of CVE reports poses a significant obstacle in extracting actionable insights. Automated systems struggle to accurately parse and comprehend the inconsistent and complex narratives, increasing the risk of overlooking or misinterpreting vital details—a scenario with severe implications for security postures.

For cybersecurity professionals, one of the most daunting tasks is identifying the attack requirements—the specific conditions and prerequisites needed for a vulnerability to be successfully exploited—from these vast and highly variable natural language descriptions. Determining whether attack requirements are present or absent is equally crucial, as this information is vital for assessing and mitigating potential risks. With tens of thousands of CVE reports to analyze, manually sifting through each description to extract this nuanced information is impractical, given the sheer volume of data involved.

The decision to use Anthropic Claude on Amazon Bedrock and the advantages it offered

In the face of this daunting challenge, the power of LLMs offered a promising solution. These advanced generative AI models are great at understanding and analyzing vast amounts of text, making them the perfect tool for sifting through the flood of CVE reports to pinpoint those containing attack requirement details.

The decision to use Anthropic Claude on Amazon Bedrock was a strategic one. During evaluations, Mend.io found that although other LLMs like GPT-4 also showed strong performance in analyzing CVE descriptions, their specific requirements were better aligned with Anthropic Claude’s capabilities. Mend.io structured its prompts with tags like <example-attack-requirement>, and when it evaluated other models with both structured and unstructured prompts, Anthropic Claude’s ability to precisely follow the structured prompts and include the expected tags made it a better fit for Mend.io’s use case during testing.

Anthropic Claude’s ability to recognize XML tags within prompts gave it a distinct advantage. This capability enabled Mend.io to structure the prompts in a way that improved precision and value, ensuring that Anthropic Claude’s analysis was tailored to Mend.io’s specific needs. Furthermore, the seamless integration with Amazon Bedrock provided a robust and secure platform for handling sensitive data. The proven security infrastructure of AWS strengthened confidence, allowing Mend.io to process and analyze CVE information without compromising data privacy and security—a critical consideration in the world of cybersecurity.

Crafting the prompt

Crafting the perfect prompt for Anthropic Claude was both an art and a science. It required a deep understanding of the model’s capabilities and a thorough process to make sure Anthropic Claude’s analysis was precise and grounded in practical applications. They composed the prompt with rich context, provided examples, and clearly defined the differences between attack complexity and attack requirements as defined in the Common Vulnerability Scoring System (CVSS) v4.0. This level of detail was crucial to make sure Anthropic Claude could accurately identify the nuanced details within CVE descriptions.

The use of XML tags was a game-changer in structuring the prompt. These tags allowed them to isolate different sections, guiding Anthropic Claude’s focus and improving the accuracy of its responses. With this unique capability, Mend.io could direct the model’s attention to specific aspects of the CVE data, streamlining the analysis process and increasing the value of the insights derived.

With a well-crafted prompt and the power of XML tags, Mend.io equipped Anthropic Claude with the context and structure necessary to navigate the intricate world of CVE descriptions, enabling it to pinpoint the critical attack requirement details that would arm security teams with invaluable insights for prioritizing vulnerabilities and fortifying defenses.

The following example illustrates how to craft a prompt effectively using tags with the goal of identifying phishing emails:

<Instructions>
        Analyze emails to identify potential spam or phishing threats. Users should provide the full email content, including headers, by copy-pasting or uploading the email file directly.
</Instructions>
<AnalysisProcess>
        <StepOne>
            <Title>Analyze Sender Information</Title>
            <Description>Verify the sender's email address and domain. Assess additional contacts, date, and time to evaluate potential legitimacy and context.</Description>
        </StepOne>
        <StepTwo>
            <Title>Examine Email Content</Title>
            <Description>Analyze the subject line and body content for relevance and legitimacy. Caution against quick offers. Evaluate personalization and sender legitimacy.</Description>
        </StepTwo>
        <StepThree>
            <Title>Check for Unsolicited Attachments or Links</Title>
            <Description>Identify and scrutinize hyperlinks for potential phishing or spam indicators. Advise on verifying link legitimacy without direct interaction. Use tools like VirusTotal or Google Safe Browsing for safety checks.</Description>
        </StepThree>
</AnalysisProcess>
<Conclusion>
        Based on the analysis, provide an estimation of the email's likelihood of being spam or phishing, expressed as a percentage to indicate the assessed risk level. This comprehensive analysis helps users make informed decisions about the email's authenticity while emphasizing security and privacy.
</Conclusion>
<DataHandling>
         Refer to uploaded documents as 'knowledge source'. Strictly adhere to facts provided, avoiding speculation. Prioritize documented information over baseline knowledge or external sources. If no answer is found within the documents, state this explicitly.
</DataHandling>

The challenges

While using Anthropic Claude, Mend.io experienced the flexibility and scalability of the service firsthand. As the analysis workload grew to encompass 70,000 CVEs, they encountered opportunities to optimize their usage of the service’s features and cost management capabilities. When using the on-demand model deployment of Amazon Bedrock across AWS Regions, Mend.io proactively managed the API request per minute (RPM) and tokens per minute (TPM) quotas by parallelizing model requests and adjusting the degree of parallelization to operate within the quota limits. They also took advantage of the built-in retry logic in the Boto3 Python library to handle any occasional throttling scenarios seamlessly. For workloads requiring even higher quotas, the Amazon Bedrock Provisioned Throughput option offers a straightforward solution, though it didn’t align with Mend.io’s specific usage pattern in this case.
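The following sketch illustrates this pattern rather than Mend.io’s actual code. The prompt, model choice, worker count, and input list are placeholder assumptions; the retry configuration uses standard botocore options.

import json
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore.config import Config

# The built-in retry logic in boto3/botocore handles occasional throttling transparently.
bedrock = boto3.client(
    "bedrock-runtime",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

MAX_WORKERS = 4  # hypothetical: chosen so concurrent requests stay within the RPM/TPM quotas

PROMPT_TEMPLATE = (  # abridged, hypothetical prompt
    "Does the following CVE description include attack requirement details? "
    "Answer YES or NO.\n\n<cve>{cve}</cve>"
)

cve_descriptions = ["Example CVE description ..."]  # placeholder input

def classify_cve(description: str) -> str:
    """Ask Claude whether a CVE description contains attack requirement details."""
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # hypothetical model choice
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": PROMPT_TEMPLATE.format(cve=description)}],
        }),
    )
    return json.loads(response["body"].read())["content"][0]["text"]

# Parallelize model requests; adjust MAX_WORKERS to operate within the quota limits.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    verdicts = list(pool.map(classify_cve, cve_descriptions))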

Although the initial estimate for classifying all 70,000 CVEs was lower, the final cost came in higher because more complex input data resulted in longer input and output sequences. This highlighted the importance of comprehensive testing and benchmarking. The flexible pricing models in Amazon Bedrock allow organizations to optimize costs by considering alternative model options or data partitioning strategies, where simpler cases can be processed by more cost-effective models, while higher-capacity models are reserved for the most challenging instances.
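A back-of-the-envelope calculation shows why longer sequences move the total; the per-1,000-token prices below are placeholders, not actual Amazon Bedrock pricing.

# Placeholder prices (USD per 1,000 tokens); substitute current Amazon Bedrock pricing.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

def estimate_cost(num_requests: int, avg_input_tokens: int, avg_output_tokens: int) -> float:
    """Estimate the total cost of a batch of classification requests."""
    input_cost = num_requests * avg_input_tokens / 1000 * PRICE_PER_1K_INPUT
    output_cost = num_requests * avg_output_tokens / 1000 * PRICE_PER_1K_OUTPUT
    return input_cost + output_cost

# Longer-than-expected sequences raise the estimate noticeably:
print(estimate_cost(70_000, avg_input_tokens=3_000, avg_output_tokens=50))   # baseline assumption
print(estimate_cost(70_000, avg_input_tokens=4_500, avg_output_tokens=200))  # more complex inputs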

When working with advanced language models like those available through Amazon Bedrock, it’s crucial to craft prompts that align precisely with the desired output format. In Mend.io’s case, the expectation was to receive straightforward YES/NO answers, which would streamline subsequent data curation steps. However, the model often provided additional context, justifications, or explanations beyond the anticipated succinct responses. Although these expanded responses offered valuable insights, they introduced unanticipated complexity into Mend.io’s data processing workflow. This experience highlighted the importance of prompt refinement to make sure the model’s output aligns closely with the specific requirements of the use case. By iterating on prompt formulation and fine-tuning the prompts, organizations can optimize the model’s responses to better match their desired response format, ultimately enhancing the efficiency and effectiveness of their data processing pipelines.
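One lightweight mitigation, sketched here as a general pattern rather than Mend.io’s actual pipeline, is to pin the expected format in the prompt and normalize any extra commentary on the way out.

import re

FORMAT_REMINDER = (
    "Answer with exactly one word, YES or NO. "
    "Do not add explanations or any other text."
)

def normalize_verdict(raw_answer: str) -> str:
    """Map a possibly verbose model response to YES, NO, or UNEXPECTED."""
    match = re.search(r"\b(YES|NO)\b", raw_answer.strip(), flags=re.IGNORECASE)
    return match.group(1).upper() if match else "UNEXPECTED"

print(normalize_verdict("YES"))                                  # YES
print(normalize_verdict("No, the description lacks details."))   # NO
print(normalize_verdict("It depends on the configuration."))     # UNEXPECTED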

Results

Despite the challenges Mend.io faced, their diligent efforts paid off. They successfully identified CVEs with attack requirement details, arming security teams with precious insights for prioritizing vulnerabilities and fortifying defenses. This outcome was a significant achievement, because understanding the specific prerequisites for a vulnerability to be exploited is crucial in assessing risk and developing effective mitigation strategies. By using the power of Anthropic Claude, Mend.io was able to sift through tens of thousands of CVE reports, extracting the nuanced information about attack requirements that would have been nearly impossible to obtain through manual analysis. This feat not only saved valuable time and resources but also provided cybersecurity teams with a comprehensive view of the threat landscape, enabling them to make informed decisions and prioritize their efforts effectively.

Mend.io conducted an extensive evaluation of Anthropic Claude, issuing 68,378 requests without considering any quota limitations. Based on their initial experiment of analyzing a sample of 100 vulnerabilities to understand attack vectors, they could determine the accuracy of Claude’s direct YES or NO answers. As shown in the following table, Anthropic Claude demonstrated exceptional performance, providing direct YES or NO answers for 99.9883% of the requests. In the few instances where a straightforward answer was not given, Anthropic Claude still provided sufficient information to determine the appropriate response. This evaluation highlights Anthropic Claude’s robust capabilities in handling a wide range of queries with high accuracy and reliability.

  • Character count of the prompt (without CVE-specific details): 13,935
  • Number of tokens for the prompt (without CVE-specific details): 2,733
  • Total requests: 68,378
  • Unexpected answers: 8
  • Failures (quota limitations excluded): 0
  • Answer quality success rate: 99.9883%
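The reported success rate follows directly from these counts:

total_requests = 68_378
unexpected_answers = 8

success_rate = (total_requests - unexpected_answers) / total_requests * 100
print(f"{success_rate:.4f}%")  # 99.9883%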

Future plans

The successful application of Anthropic Claude in identifying attack requirement details from CVE data is just the beginning of the vast potential that generative AI holds for the cybersecurity domain. As these advanced models continue to evolve and mature, their capabilities will expand, opening up new frontiers in automating vulnerability analysis, threat detection, and incident response. One promising avenue is the use of generative AI for automating vulnerability categorization and prioritization. By using these models’ ability to analyze and comprehend technical descriptions, organizations can streamline the process of identifying and addressing the most critical vulnerabilities, making sure limited resources are allocated effectively. Furthermore, generative AI models can be trained to detect and flag potential malicious code signatures within software repositories or network traffic. This proactive approach can help cybersecurity teams stay ahead of emerging threats, enabling them to respond swiftly and mitigate risks before they can be exploited.

Beyond vulnerability management and threat detection, generative AI also holds promise in incident response and forensic analysis. These models can assist in parsing and making sense of vast amounts of log data, network traffic records, and other security-related information, accelerating the identification of root causes and enabling more effective remediation efforts. As generative AI continues to advance, its integration with other cutting-edge technologies, such as ML and data analytics, will unlock even more powerful applications in the cybersecurity domain. The ability to process and understand natural language data at scale, combined with the predictive power of ML algorithms, could revolutionize threat intelligence gathering, enabling organizations to anticipate and proactively defend against emerging cyber threats.

Conclusion

As the field of cybersecurity continues to advance, the integration of generative AI models like Anthropic Claude, powered by the robust infrastructure of Amazon Bedrock, represents a significant step forward in digital defense. Mend.io’s successful application of this technology in extracting attack requirement details from CVE data is a testament to the transformative potential of language AI in the vulnerability management and threat analysis domains. By utilizing the power of these advanced models, Mend.io has demonstrated that the complex task of sifting through vast amounts of unstructured data can be tackled with precision and efficiency. This initiative not only empowers security teams with crucial insights for prioritizing vulnerabilities, but also paves the way for future innovations in automating vulnerability analysis, threat detection, and incident response. Anthropic and AWS have played a pivotal role in enabling organizations like Mend.io to take advantage of these cutting-edge technologies.

Looking ahead, the possibilities are truly exciting. As language models continue to evolve and integrate with other emerging technologies, such as ML and data analytics, the potential for revolutionizing threat intelligence gathering and proactive defense becomes increasingly tangible.

If you’re a cybersecurity professional looking to unlock the full potential of language AI in your organization, we encourage you to explore the capabilities of Amazon Bedrock and the Anthropic Claude models. By integrating these cutting-edge technologies into your security operations, you can streamline your vulnerability management processes, enhance threat detection, and bolster your overall cybersecurity posture. Take the first step today and discover how Mend.io’s success can inspire your own journey towards a more secure digital future.


About the Authors

Hemmy Yona is a Solutions Architect at Amazon Web Services based in Israel. With 20 years of experience in software development and group management, Hemmy is passionate about helping customers build innovative, scalable, and cost-effective solutions. Outside of work, you’ll find Hemmy enjoying sports and traveling with family.

Tzahi Mizrahi is a Solutions Architect at Amazon Web Services, specializing in container solutions with over 10 years of experience in development and DevOps lifecycle processes. His expertise includes designing scalable, container-based architectures and optimizing deployment workflows. In his free time, he enjoys music and plays the guitar.

Gili Nachum is a Principal solutions architect at AWS, specializing in Generative AI and Machine Learning. Gili is helping AWS customers build new foundation models, and to leverage LLMs to innovate in their business. In his spare time Gili enjoys family time and Calisthenics.

Maciej Mensfeld is a principal product architect at Mend, focusing on data acquisition, aggregation, and AI/LLM security research. He’s the creator of diffend.io (acquired by Mend) and Karafka. As a Software Architect, Security Researcher, and conference speaker, he teaches Ruby, Rails, and Kafka. Passionate about OSS, Maciej actively contributes to various projects, including Karafka, and is a member of the RubyGems security team.

Read More

What’s new in TensorFlow 2.17

Posted by the TensorFlow team

TensorFlow 2.17 has been released! Highlights of this release (and 2.16) include a CUDA update, upcoming NumPy 2.0 support, and more. For the full release notes, please click here.

Note: Release updates on the new multi-backend Keras will be published on keras.io, starting with Keras 3.0. For more information, please see https://keras.io/keras_3/.

TensorFlow Core

CUDA Update

TensorFlow binary distributions now ship with dedicated CUDA kernels for GPUs with a compute capability of 8.9. This improves performance on popular Ada-generation GPUs like the NVIDIA RTX 40 series, L4, and L40.

To keep Python wheel sizes in check, we made the decision to no longer ship CUDA kernels for compute capability 5.0. That means the oldest NVIDIA GPU generation supported by the precompiled Python packages is now the Pascal generation (compute capability 6.0). For Maxwell support, we recommend either sticking with TensorFlow version 2.16 or compiling TensorFlow from source. The latter will be possible as long as the CUDA version in use still supports Maxwell GPUs.
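To check whether a given GPU meets the new minimum, TensorFlow can report its compute capability; a minimal check using the tf.config device queries:

import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    details = tf.config.experimental.get_device_details(gpu)
    major, minor = details.get("compute_capability", (0, 0))
    supported = (major, minor) >= (6, 0)  # Pascal (6.0) is the oldest generation in the prebuilt wheels
    print(f"{details.get('device_name', gpu.name)}: compute capability {major}.{minor}, "
          f"{'supported' if supported else 'requires TF 2.16 or a source build'}")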

Numpy 2.0

The upcoming TensorFlow 2.18 release will include support for NumPy 2.0. This may break some edge cases of TensorFlow API usage.

Drop TensorRT support

Starting with TensorFlow 2.18, support for TensorRT will be dropped. TensorFlow 2.17 will be the last version to include it.

Read More

Mistral AI and NVIDIA Unveil Mistral NeMo 12B, a Cutting-Edge Enterprise AI Model

Mistral AI and NVIDIA today released a new state-of-the-art language model, Mistral NeMo 12B, that developers can easily customize and deploy for enterprise applications supporting chatbots, multilingual tasks, coding and summarization.

By combining Mistral AI’s expertise in training data with NVIDIA’s optimized hardware and software ecosystem, the Mistral NeMo model offers high performance for diverse applications.

“We are fortunate to collaborate with the NVIDIA team, leveraging their top-tier hardware and software,” said Guillaume Lample, cofounder and chief scientist of Mistral AI. “Together, we have developed a model with unprecedented accuracy, flexibility, high-efficiency and enterprise-grade support and security thanks to NVIDIA AI Enterprise deployment.”

Mistral NeMo was trained on the NVIDIA DGX Cloud AI platform, which offers dedicated, scalable access to the latest NVIDIA architecture.

NVIDIA TensorRT-LLM for accelerated inference performance on large language models and the NVIDIA NeMo development platform for building custom generative AI models were also used to advance and optimize the process.

This collaboration underscores NVIDIA’s commitment to supporting the model-builder ecosystem.

Delivering Unprecedented Accuracy, Flexibility and Efficiency 

Excelling in multi-turn conversations, math, common sense reasoning, world knowledge and coding, this enterprise-grade AI model delivers precise, reliable performance across diverse tasks.

With a 128K context length, Mistral NeMo processes extensive and complex information more coherently and accurately, ensuring contextually relevant outputs.

Released under the Apache 2.0 license, which fosters innovation and supports the broader AI community, Mistral NeMo is a 12-billion-parameter model. Additionally, the model uses the FP8 data format for model inference, which reduces memory size and speeds deployment without any degradation to accuracy.

The model learns tasks better and handles diverse scenarios more effectively, making it ideal for enterprise use cases.

Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines.

This containerized format allows for easy deployment anywhere, providing enhanced flexibility for various applications.

As a result, models can be deployed anywhere in minutes, rather than several days.

NIM features enterprise-grade software that’s part of NVIDIA AI Enterprise, with dedicated feature branches, rigorous validation processes, and enterprise-grade security and support.

It includes comprehensive support, direct access to an NVIDIA AI expert and defined service-level agreements, delivering reliable and consistent performance.

The open model license allows enterprises to integrate Mistral NeMo into commercial applications seamlessly.

Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU, the Mistral NeMo NIM offers high efficiency, low compute cost, and enhanced security and privacy.

Advanced Model Development and Customization 

The combined expertise of Mistral AI and NVIDIA engineers has optimized training and inference for Mistral NeMo.

Trained with Mistral AI’s expertise, especially on multilinguality, code and multi-turn content, the model benefits from accelerated training on NVIDIA’s full stack.

It’s designed for optimal performance, utilizing efficient model parallelism techniques, scalability and mixed precision with Megatron-LM.

The model was trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of NVIDIA AI architecture, including accelerated computing, network fabric and software to increase training efficiency.

Availability and Deployment

With the flexibility to run anywhere — cloud, data center or RTX workstation — Mistral NeMo is ready to revolutionize AI applications across various platforms.

Experience Mistral NeMo as an NVIDIA NIM today via ai.nvidia.com, with a downloadable NIM coming soon.

See notice regarding software product information.

Read More

Hot Deal, Cool Prices: GeForce NOW Summer Sale Offers Priority and Ultimate Memberships Half Off

It’s time for a sweet treat — the GeForce NOW Summer Sale offers high-performance cloud gaming at half off for a limited time.

And starting today, gamers can directly access supported PC games on GeForce NOW via Xbox.com game pages, enabling them to get into their favorite Xbox PC games even faster.

It all comes with nine new games joining the cloud this week.

We Halve a Deal

Summer Sale on GeForce NOW
Unlock the power of cloud gaming with GeForce NOW’s sizzling summer sale.

Take advantage of a special new discount — one-month and six-month GeForce NOW Priority or Ultimate memberships are now 50% off until Aug. 18. It’s perfect for members wanting to level up their gaming experience or those looking to try GeForce NOW for the first time to access and stream an ever-growing library of over 1,900 games with top-notch performance.

Priority members enjoy more benefits over free users, including faster access to gaming servers and gaming sessions of up to six hours. They can also stream beautifully ray-traced graphics across multiple devices with RTX ON for the most immersive experience in supported games.

For those looking for top-notch performance, the Ultimate tier provides members with exclusive access to servers and the ability to stream at up to 4K resolution and 120 frames per second, or up to 240 fps — even without upgraded hardware. Ultimate members get all the same benefits as GeForce RTX 40 series GPU owners, including NVIDIA DLSS 3 for the smoothest frame rates and NVIDIA Reflex for the lowest-latency streaming from the cloud.

Strike while it’s hot — this scorching summer sale ends soon.

Path of the Goddess

Kunitsu-Gami: Path of the Goddess on GeForce NOW
Rinse and repeat.

Capcom’s latest release, Kunitsu-Gami: Path of the Goddess is a unique Japanese-inspired, single-player Kagura Action Strategy game.

The game takes place on a mountain covered in defilement. During the day, purify the villages and prepare for sundown. During the night, protect the Maiden against the hordes of the Seethe. Repeat the day-and-night cycle until the mountain has been cleansed of defilement and peace has returned to the land.

Walk the path of the goddess in the cloud with extended gaming sessions for Ultimate and Priority members. Ultimate members can also enjoy seeing supernatural and human worlds collide in ultrawide resolutions for an even more immersive experience.

Slay New Games

Dungeons of Hinterberg on GeForce NOW
Having a holiday in Hinterberg.

In Dungeons of Hinterberg from Microbird Games, play as Luisa, a burnt-out law trainee taking a break from her fast-paced corporate life. Explore the beautiful alpine village of Hinterberg armed with just a sword and a tourist guide, and uncover the magic hidden within its dungeons. Master magic, solve puzzles and slay monsters — all from the cloud.

Check out the list of new games this week:

  • The Crust (New release on Steam, July 15)
  • Gestalt: Steam & Cinder (New release on Steam, July 16)
  • Nobody Wants to Die (New release on Steam, July 17)
  • Dungeons of Hinterberg (New release on Steam and Xbox, available on PC Game Pass, July 18)
  • Flintlock: The Siege of Dawn  (New release on Steam and Xbox, available on PC Game Pass, July 18)
  • Norland (New release on Steam, July 18)
  • Kunitsu-Gami: Path of the Goddess (New release on Steam, July 19)
  • Content Warning (Steam)
  • Crime Boss: Rockay City (Steam)

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

Abstracts: July 18, 2024

Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Researcher Arindam Mitra joins host Gretchen Huizinga to discuss “AgentInstruct: Toward Generative Teaching with Agentic Flows.” In their paper, Mitra and his coauthors introduce an automated multi-agent framework for creating diverse, high-quality synthetic data at scale for language model post-training. In contrast to methods that create data from a seed set of existing prompts and responses, AgentInstruct uses raw data and specifications provided by model builders. The work—which post-trains a model, Orca-3, on AgentInstruct-generated data—is part of project Orca. Orca aims to develop techniques for creating small language models that can perform as well as large language models. Like Orca-3, the earlier Orca, Orca-2, and Orca-Math models show the effectiveness of leveraging synthetic data in training. 

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

I’m here today with Dr. Arindam Mitra, a senior researcher at Microsoft Research and the lead researcher for Microsoft’s Orca project. Dr. Mitra is coauthor of a paper called “AgentInstruct: Toward Generative Teaching with Agentic Flows.” Arindam, it’s a pleasure to have you on Abstracts today.

ARINDAM MITRA: Thank you, Gretchen.

HUIZINGA: So let’s start with a brief overview of your paper. What problem does your research address, and why does it matter?


MITRA: So the post-training phase is very important for language models. You can really improve the model a lot by creating high-quality synthetic data. The problem is, however, though, high-quality synthetic data creation requires lots of human effort and expertise. The problem that we’re trying to tackle is, how do you reduce human effort? How can you create high-quality data with really low amount of human effort? When you have a language model and, let’s say, you want to apply it somewhere, you might have to train a generic model before. Which could be small or big. Doesn’t matter. After that, you can specialize it on the domain that you are looking for, and when you want to do that—to make it really fast, this particular process—it’s best if you go for synthetic data. If you have a way to, actually, generate very high-quality synthetic data, you can fast-track this part of specialization process. Not only single model. So this year, you’re going to see a lot more multi-agent models. And when you are trying to build these multi-agent models, you’re fearing like, OK, it might increase the cost too much, the latency too much. So it’s also very much important that you have a multi-agent system and you can, sort of, replace some of those agents with specialized small models. And when you’re trying to address these goals, you want this process to be something which you know works fast. So that’s why we are trying to make sure we have a very good way to create synthetic data for your specific need.

HUIZINGA: No research exists in a vacuum, and most of it fills some kind of a gap. So tell us what’s already been done in this field and how this work is building on it.

MITRA: So previously, actually, we have seen that in post-training, the more data you have, the better the performance goes for the model you’re training. So what we wanted to test is how much we can scale and what happens if we scale a lot and lot. But we didn’t have the tools for it. So the other approaches people previously used was you had a small set of data and how do we expand this dataset into much larger and larger amount of data. That’s where people were mostly focusing. But it’s not that easy to create that initial seed set. [LAUGHTER] You need to be very expert. The way that we’re doing is, actually, rather you define what you want to create. Like, OK, you want to create tool-use data. So you say, OK, I have a bunch of tools, and I am looking for data in the scenarios where someone can just come give me a description and then maybe that person interact with the AI to figure out how to get the job done. It’s not a one-step thing. And maybe you also have a setting where it’s more like an app developer. You have a bunch of APIs in your phone. You just want to figure out which one is best for the user request, which came through voice command. So different scenarios could be there. So what we’re saying [is], OK, we are not going through the method where you have to come up with your initial own seed data and then we expand. It is more like you define what you want to do. It’s much more abstract. And then, we are, sort of, automating the effort of data creation. So this setting actually of synthetic data creation, we are referring [to] it as generative teaching, and that’s where we are, sort of, differing. So previously, it was more like expansion, and now we are trying from specification to the data that you need.

HUIZINGA: Gotcha. Well talk a little bit more about your methodology and how you went about conducting this research.

MITRA: So first of all, what we are proposing actually is a multi-agent solution. So you start with first describing what you really need. So you describe in detail, like, I need data for this specific skill or this specific scenario. Then, what we do is like, OK, you have some unstructured data or raw data like text documents or code files that you gather from web with permissible license or use something that you own. We don’t care much about what the content is really. So it’s more like we got some random stuff, some random content. And then we’ll guide you how to convert this random something which is not meaningful for you into something which is meaningful for your data creation. For example, like, if you are creating data to teach how to use APIs, you might think about, you need lots of APIs and how do you get these APIs. So what we are saying is, like, we can take something like code and we’ll have agents which will convert these raw code files into list of APIs which is more like a library. So you create automatically this input that is very meaningful for data creation. And then once we have that, we have basically the seed instruction creation step based on your specification. Like, what do you want to create data for? So you have all these different scenarios, and we have multiple agents creating data for different scenarios. And then the last step is actually what we call refinement step. So it’s more like whatever data you created, we’ll go through them and we’ll make them better and better—improve the quality, improve the complexity, improve the trickiness, we’ll teach when not to answer, etc., etc. So make sure we cover the whole space. So by changing the stochastic seed, we are trying to cover the entire possible data space.

HUIZINGA: Right.

MITRA: So that’s the key thing. The way we, sort of, conducted this research is actually we defined 17 skills. Skills meaning reading comprehension, tool use, text modification, content creation, RAG (retrieval-augmented generation) … we have, like, list of 17 skills … conversation … and then we created one multi-agent flow for each of the skills and we generate data. So one key thing I want to highlight is, like, this work, compared to other work, it was not benchmark driven. We want to teach a skill. We don’t care which benchmarks we’re trying to evaluate it on. So we define the skill, like tool use means this to us, reading comprehension means this to us, text modification means this to us. And then we, sort of, generate the data to teach everything for that skill. And then what we did, we created actually 22 million instructions. And we had previously in Orca series, we had 3 million, around, instructions. So the 25 million is what we, sort of, have at the end. And that’s where we actually trained a Mistral model as of now. And we’re going to measure, like, how much we improve the Mistral model by this post-training.

HUIZINGA: Moving from methods to findings, I always look forward to the part of the research paper that finishes the sentence “and what we found was … ,” so give us a quick overview of your results. What did you find?

MITRA: Yes, so the results were actually very exciting for us. So Mistral 7B was our main, sort of, baseline because that’s where we’re trying to showcase, like, how much improvement we are getting. On the other side, we have, like, frontier models—ChatGPT, GPT-4. We want to also measure how far we are from those frontier models, so that’s, sort of, our evaluation setup. So on average actually, we got like 20 percent performance gain over the Mistral, and we evaluated that across 14 benchmarks that test reasoning, content creation, instruction following, format following, etc. But what was more important to us was to do a skill-specific evaluation because we are trying to teach certain skills, and we had, like, 17 skills as we mentioned earlier. So, for example, like, if you are focusing on reading comprehension as a skill, we took LSAT, SAT, and DROP, and many other benchmarks; we created a collection of reading comprehension-based benchmark. And there, we are observing, like, 20 percent improvement over Mistral, and what it means, like, we’re actually achieving GPT-4–level performance. Similarly, if I’m focusing on math skill, there are many datasets which test, like, elementary math, high school math, college-level math. And we improved actually across all these different levels of math. So we see from 40 percent to 150 percent of improvement on different benchmarks of math. So it was more like what we wanted to see. We’re not optimizing for a particular benchmark. We wanted to optimize the skill, and that’s what you’re observing. So you’re observing improvement in math across all these levels, from elementary to high school to college to middle school, etc., everything. The same goes for RAG, as well. We’re observing on RAG skill 92 percent, around, improvement over Mistral. The format following numbers are pretty interesting to us. So format following is very important for SLMs (small language models). You want to make these models practical. You want to make sure that they follow the format so you can parse the result. And we were able to take Mistral beyond Gemini Pro. So that was a very strong performance from the post-training that we did. For summarization, actually we were able to reduce the hallucination rate by 31 percent while achieving the GPT-4–level quality. So overall, all these results were, sort of, highlighting that the methodology that we have, which we’re calling AgentInstruct, is very promising.

HUIZINGA: I think it’s important to get practical and talk about real-world impact. So tell us who you think this research will benefit most and why.

MITRA: Yeah, so again the model builders will, sort of, find it most beneficial. So the significance of our work actually lies in the way we are trying to revolutionize the language model development through scalable, low-effort synthetic creation. And the scalable and low effort is, sort of, the key thing, right. We have shown that we can create very high-quality data. That’s what the numbers are telling us. We want to mention that this is very scalable and low effort, and that’s what we think might help the most for model builders.

HUIZINGA: So, Arindam, let’s borrow a phrase from the machine learning lexicon and go for a little one-shot learning here: if you had to boil down why your work is important, what’s the one thing you want our listeners to take away from this research?

MITRA: The key takeaway would be, like, the AgentInstruct method enables the generation of vast, diverse, and high-quality synthetic data with very minimal human input. So that’s one thing I would, like, to remember from this paper.

HUIZINGA: So as we close, talk briefly about the limitations that you encountered in this project and directions for future research. What are the outstanding challenges in this field, and what’s on your research agenda to overcome them?

MITRA: Yes, so we’re exploring further automation. But apart from making this data creation more automated and less human involvement needed, we’re trying to focus on two other aspects. One is automated model debugging, and the other is automated model repairing. So now that we have the ability to generate data for a particular skill, let’s say math, for model debugging, what we need is basically an error handler. Like something we can plug in which takes the question and the answer coming from a different model and verifies if the answer is correct or not. So that’s the part we’re working on right now, figuring out this error handler. And the second aspect is repairing. So once we have the error, we figure out, OK, this is where the model is struggling. How can we give feedback or how can we give more knowledge so it can basically correct those errors? So those are some things we’re working on right now.

[MUSIC PLAYS]

HUIZINGA: Well, Arindam Mitra, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find a preprint on arXiv. See you next time on Abstracts!

[MUSIC FADES]


Read More

Samplable Anonymous Aggregation for Private Federated Data Analytics

We revisit the problem of designing scalable protocols for private statistics and private federated learning when each device holds its private data. Locally differentially private algorithms require little trust but are (provably) limited in their utility. Centrally differentially private algorithms can allow significantly better utility but require a trusted curator. This gap has led to significant interest in the design and implementation of simple cryptographic primitives, that can allow central-like utility guarantees without having to trust a central server.
Our first contribution is to…

Apple Machine Learning Research