MobileDiffusion: Rapid text-to-image generation on-device

Text-to-image diffusion models have shown exceptional capabilities in generating high-quality images from text prompts. However, leading models feature billions of parameters and are consequently expensive to run, requiring powerful desktops or servers (e.g., Stable Diffusion, DALL·E, and Imagen). While recent advancements in inference solutions on Android via MediaPipe and iOS via Core ML have been made in the past year, rapid (sub-second) text-to-image generation on mobile devices has remained out of reach.

To that end, in “MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices”, we introduce a novel approach with the potential for rapid text-to-image generation on-device. MobileDiffusion is an efficient latent diffusion model specifically designed for mobile devices. We also adopt DiffusionGAN to achieve one-step sampling during inference, which fine-tunes a pre-trained diffusion model while leveraging a GAN to model the denoising step. We have tested MobileDiffusion on iOS and Android premium devices, and it can run in half a second to generate a 512×512 high-quality image. Its comparatively small model size of just 520M parameters makes it uniquely suited for mobile deployment.

Rapid text-to-image generation on-device.

Background

The relative inefficiency of text-to-image diffusion models arises from two primary challenges. First, the inherent design of diffusion models requires iterative denoising to generate images, necessitating multiple evaluations of the model. Second, the complexity of the network architecture in text-to-image diffusion models involves a substantial number of parameters, regularly reaching into the billions and resulting in computationally expensive evaluations. As a result, despite the potential benefits of deploying generative models on mobile devices, such as enhancing user experience and addressing emerging privacy concerns, this direction remains relatively unexplored within the current literature.

The optimization of inference efficiency in text-to-image diffusion models has been an active research area. Previous studies predominantly concentrate on addressing the first challenge, seeking to reduce the number of function evaluations (NFEs). By leveraging advanced numerical solvers (e.g., DPM) or distillation techniques (e.g., progressive distillation, consistency distillation), the number of necessary sampling steps has been reduced significantly, from several hundred to single digits. Some recent techniques, like DiffusionGAN and Adversarial Diffusion Distillation, even reduce it to a single step.

However, on mobile devices, even a small number of evaluation steps can be slow due to the complexity of the model architecture. Thus far, the architectural efficiency of text-to-image diffusion models has received comparatively less attention. A handful of earlier works briefly touch upon this matter, for example by removing redundant neural network blocks (e.g., SnapFusion). However, these efforts lack a comprehensive analysis of each component within the model architecture, thereby falling short of providing a holistic guide for designing highly efficient architectures.

MobileDiffusion

Effectively overcoming the challenges imposed by the limited computational power of mobile devices requires an in-depth and holistic exploration of the model’s architectural efficiency. In pursuit of this objective, our research undertakes a detailed examination of each constituent and computational operation within Stable Diffusion’s UNet architecture. We present a comprehensive guide for crafting highly efficient text-to-image diffusion models, culminating in MobileDiffusion.

The design of MobileDiffusion follows that of latent diffusion models. It contains three components: a text encoder, a diffusion UNet, and an image decoder. For the text encoder, we use CLIP-ViT/L14, which is a small model (125M parameters) suitable for mobile. We then turn our focus to the diffusion UNet and image decoder.
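
To make the three-component pipeline concrete, the following PyTorch-style sketch shows how latent-diffusion inference flows through them. The function signature and module interfaces are hypothetical stand-ins rather than MobileDiffusion's actual code; the 8-channel latent at 1/8 the spatial resolution follows the decoder description later in this post.

import torch

@torch.no_grad()
def generate(prompt, text_encoder, unet, decoder, num_steps=1):
    # Hypothetical callables: text_encoder (CLIP-ViT/L14), the diffusion UNet, and the image decoder.
    text_emb = text_encoder(prompt)              # text conditioning
    z = torch.randn(1, 8, 64, 64)                # 8-channel latent; 512 / 8 = 64 spatial size
    for t in reversed(range(num_steps)):         # a single step when using DiffusionGAN
        z = unet(z, t, text_emb)                 # predict a cleaner latent
    return decoder(z)                            # lightweight image decoder -> 512x512 RGB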

Diffusion UNet

As illustrated in the figure below, diffusion UNets commonly interleave transformer blocks and convolution blocks. We conduct a comprehensive investigation of these two fundamental building blocks. Throughout the study, we control the training pipeline (e.g., data, optimizer) to study the effects of different architectures.

In classic text-to-image diffusion models, a transformer block consists of a self-attention layer (SA) for modeling long-range dependencies among visual features, a cross-attention layer (CA) to capture interactions between text conditioning and visual features, and a feed-forward layer (FF) to post-process the output of attention layers. These transformer blocks hold a pivotal role in text-to-image diffusion models, serving as the primary components responsible for text comprehension. However, they also pose a significant efficiency challenge, given the computational expense of the attention operation, which is quadratic to the sequence length. We follow the idea of UViT architecture, which places more transformer blocks at the bottleneck of the UNet. This design choice is motivated by the fact that the attention computation is less resource-intensive at the bottleneck due to its lower dimensionality.

Our UNet architecture incorporates more transformers in the middle, and skips self-attention (SA) layers at higher resolutions.
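
As a rough illustration of the block structure described above (not the exact MobileDiffusion implementation, and with placeholder layer sizes), a text-to-image transformer block can be sketched in PyTorch as follows:

import torch.nn as nn

class TransformerBlock(nn.Module):
    # Self-attention + cross-attention + feed-forward, as in classic text-to-image UNets.
    def __init__(self, dim, text_dim, heads=8, use_self_attn=True):
        super().__init__()
        self.use_self_attn = use_self_attn  # self-attention is skipped at high resolutions in MobileDiffusion
        if use_self_attn:
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, text_emb):
        # x: (batch, num_visual_tokens, dim); text_emb: (batch, num_text_tokens, text_dim)
        if self.use_self_attn:
            h = self.norm1(x)
            x = x + self.self_attn(h, h, h)[0]           # quadratic in the number of visual tokens
        x = x + self.cross_attn(self.norm2(x), text_emb, text_emb)[0]
        x = x + self.ff(self.norm3(x))
        return x

The quadratic cost of the self-attention line is why MobileDiffusion concentrates these blocks at the low-resolution bottleneck and drops self-attention at higher resolutions.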

Convolution blocks, in particular ResNet blocks, are deployed at each level of the UNet. While these blocks are instrumental for feature extraction and information flow, the associated computational costs, especially at high-resolution levels, can be substantial. One proven approach in this context is separable convolution. We observed that replacing regular convolution layers with lightweight separable convolution layers in the deeper segments of the UNet yields similar performance.
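
A minimal sketch of this substitution (a depthwise convolution followed by a pointwise 1×1 convolution), along with the parameter savings it buys at a hypothetical 512-channel layer:

import torch.nn as nn

class SeparableConv2d(nn.Module):
    # Depthwise + pointwise convolution: far fewer parameters and FLOPs than a full 3x3 convolution.
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Weight count at 512 channels in and out:
#   full 3x3 convolution:   512 * 512 * 3 * 3        ~= 2.36M
#   separable convolution:  512 * 3 * 3 + 512 * 512  ~= 0.27M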

In the figure below, we compare the UNets of several diffusion models. Our MobileDiffusion exhibits superior efficiency in terms of FLOPs (floating-point operations) and number of parameters.

Comparison of some diffusion UNets.

Image decoder

In addition to the UNet, we also optimized the image decoder. We trained a variational autoencoder (VAE) to encode an RGB image into an 8-channel latent variable whose spatial size is 8× smaller than that of the image. A latent variable can be decoded back into an image that is 8× larger in each spatial dimension. To further enhance efficiency, we design a lightweight decoder architecture by pruning the width and depth of the original. The resulting lightweight decoder leads to a significant performance boost, with nearly 50% latency improvement and better quality. For more details, please refer to our paper.

VAE reconstruction. Our VAE decoders have better visual quality than SD (Stable Diffusion).

Decoder      #Params (M)    PSNR↑    SSIM↑    LPIPS↓
SD           49.5           26.7     0.76     0.037
Ours         39.3           30.0     0.83     0.032
Ours-Lite     9.8           30.2     0.84     0.032

Quality evaluation of VAE decoders. Our lite decoder is much smaller than SD, with better quality metrics, including peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS).

One-step sampling

In addition to optimizing the model architecture, we adopt a DiffusionGAN hybrid to achieve one-step sampling. Training DiffusionGAN hybrid models for text-to-image generation involves several intricacies. Notably, the discriminator, a classifier distinguishing real data from generated data, must make judgments based on both texture and semantics. Moreover, the cost of training text-to-image models can be extremely high, particularly in the case of GAN-based models, where the discriminator introduces additional parameters. Purely GAN-based text-to-image models (e.g., StyleGAN-T, GigaGAN) confront similar complexities, resulting in highly intricate and expensive training.

To overcome these challenges, we use a pre-trained diffusion UNet to initialize both the generator and the discriminator. We postulate that the internal features of the diffusion model contain rich information about the intricate interplay between textual and visual data. This initialization strategy significantly streamlines training.

The figure below illustrates the training procedure. After initialization, a noisy image is sent to the generator for one-step diffusion. The result is evaluated against ground truth with a reconstruction loss, similar to diffusion model training. We then add noise to the output and send it to the discriminator, whose result is evaluated with a GAN loss, effectively adopting the GAN to model a denoising step. By using pre-trained weights to initialize the generator and the discriminator, the training becomes a fine-tuning process, which converges in less than 10K iterations.

Illustration of DiffusionGAN fine-tuning.
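
A simplified sketch of one generator update in this fine-tuning procedure is shown below. The schedule object (with num_steps and add_noise) is a hypothetical stand-in for a standard diffusion noise schedule, and the loss weighting and discriminator update are omitted; the paper describes the full recipe.

import torch
import torch.nn.functional as F

def generator_step(generator, discriminator, x0, text_emb, schedule, lambda_gan=1.0):
    # Both generator and discriminator are initialized from the pre-trained diffusion UNet.
    b = x0.shape[0]
    t = torch.randint(0, schedule.num_steps, (b,))
    x_t = schedule.add_noise(x0, torch.randn_like(x0), t)                # noisy input

    x0_pred = generator(x_t, t, text_emb)                                # one-step denoising
    loss_rec = F.mse_loss(x0_pred, x0)                                   # reconstruction vs. ground truth

    s = torch.randint(0, schedule.num_steps, (b,))
    x_s_fake = schedule.add_noise(x0_pred, torch.randn_like(x0), s)      # re-noise the prediction
    loss_gan = F.softplus(-discriminator(x_s_fake, s, text_emb)).mean()  # non-saturating GAN loss

    return loss_rec + lambda_gan * loss_gan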

Results

Below we show example images generated by our MobileDiffusion with DiffusionGAN one-step sampling. With such a compact model (520M parameters in total), MobileDiffusion can generate high-quality diverse images for various domains.

Images generated by our MobileDiffusion

We measured the performance of our MobileDiffusion on both iOS and Android devices, using different runtime optimizers. The latency numbers are reported below. We see that MobileDiffusion is very efficient and can run within half a second to generate a 512×512 image. This lightning speed potentially enables many interesting use cases on mobile devices.

Latency measurements (s) on mobile devices.

Conclusion

With its superior efficiency in terms of latency and size, MobileDiffusion is a promising option for mobile deployments, enabling a rapid image generation experience as users type text prompts. We will ensure that any application of this technology is in line with Google’s responsible AI practices.

Acknowledgments

We would like to thank our collaborators and contributors who helped bring MobileDiffusion on-device: Zhisheng Xiao, Yanwu Xu, Jiuqiang Tang, Haolin Jia, Lutz Justen, Daniel Fenner, Ronald Wotzlaw, Jianing Wei, Raman Sarokin, Juhyun Lee, Andrei Kulik, Chuo-Ling Chang, and Matthias Grundmann.


Microsoft Research Forum: New series explores bold ideas in technology research in the era of AI

Microsoft Research Forum is a new series of conversations that explore recent advances, bold new ideas, and important discussions within the global research community. Leading Microsoft researchers will share insights into their work, followed by live online discussions with audience participants.

This post provides an overview of the inaugural Microsoft Research Forum conversation, with a summary of each presentation. Full details, including the copilot experience and replays of each session, are available on demand. Register now to attend upcoming Research Forum events.

Keynote: Research in the era of AI

Research Forum January 2024 - Peter Lee

Peter Lee, CVP, Microsoft Research & Incubations

2023 was an incredible year for AI research, with rapid change and the emerging sparks of artificial general intelligence. Generative AI now influences everything in research, and research has never mattered more for developing technology that will benefit society. And while there is plenty of reason for optimism, we must also be clear-eyed about risks and limitations—another direction where research can play an important role.

In this environment, openness and collaboration are essential, not just to advance the research, but to ensure technology is developed with a commitment to safety and ethical use. Microsoft continues to invest in its commitment to responsible AI (RAI), which is deeply integrated not only into every engineering group across the company, but also across functions like finance, security, and legal teams. Additional progress will require close collaboration with the broader research community.

Some of the most promising and tangible advances are coming in medicine and materials science. Examples include work by Microsoft AI4Science, a Microsoft Research lab, which is working with the Global Health Drug Discovery Institute to accelerate discovery of new treatments for infectious diseases.

Panel discussion: AI Frontiers

Research Forum January 2024 - panel discussion with Ashley Llorens, Sebastien Bubeck, Ahmed Awadallah, and Ece Kamar

Ashley Llorens, VP and Distinguished Scientist, Microsoft
Ece Kamar, Managing Director, Microsoft Research AI Frontiers
Sébastien Bubeck, VP, Microsoft GenAI

Ahmed Awadallah, Senior Principal Research Manager, Microsoft Research AI Frontiers

The panelists explored their aspirations for AI in the near future, as well as the challenges to overcome. Examples include:

  • Going beyond language to build AI systems that become helpers in the physical world. AI can do more than just answer questions; it can better understand our goals and intentions and make a difference in people’s lives.
  • Beyond trying to get AI to mimic the human mind, can AI actually illuminate how the human mind works and uncover the building blocks of reasoning?
  • Making AI technology smaller would help reduce the cost and increase the performance of current AI systems. How can we divide problems into smaller pieces to solve? And how can we lower the requirements of big data, large neural networks, and massive computing resources?
  • Can we create a virtuous feedback loop, where AI learns from people that use it, rather than simply delivering answers from a static base of information?

The panelists also explored the rapid pace of technology development. Historical timelines of three to five years are now condensed into mere weeks. In this environment, collaboration is essential to quickly develop ideas and scale up experimentation across organizations. This also amplifies existing concerns about optimizing for safety and alleviating bias in language models.

Lightning Talks

Improving reasoning in language models with LASER: Layer-Selective Rank Reduction

Research Forum January 2024 - Dipendra Misra

Dipendra Misra, Senior Researcher, Microsoft Research NYC and AI Frontiers

Large language models (LLMs) have revolutionized machine learning. As researchers continue to advance this technology, one approach involves performing an intervention in the models and observing how that affects their performance. This talk presents LASER, a new method of intervention that can increase LLMs’ accuracy while reducing their memory footprint.

Evaluation and understanding of foundation models

Research Forum January 2024 - Besmira Nushi

Besmira Nushi, Principal Researcher, Microsoft Research AI Frontiers

Model evaluation and understanding serve as guides to AI innovation. But evaluation is hard, and new generative tasks pose new challenges in evaluation and understanding. This talk explores efforts to measure, inform, and accelerate model improvement, which help the scientific community understand and study new forms and levels of intelligence.

Generative AI meets structural biology: Equilibrium distribution prediction

Research Forum January 2024 - Shuxin Zheng

Shuxin Zheng, Principal Researcher, Microsoft Research AI4Science

Distributional Graphormer (DIG) is a deep learning framework for predicting protein structures with greater accuracy, a fundamental challenge in molecular science. Using generative AI to solve the problem of predicting equilibrium distribution, DIG opens exciting new possibilities. By learning about different states and behaviors of molecules, scientists can make breakthroughs in developing new drugs, creating advanced materials, and understanding biological processes.

Augmenting human cognition and decision making with AI

Research Forum January 2024 - Jake Hofman

Jake Hofman, Senior Principal Researcher, Microsoft Research NYC

How can AI help people make better decisions, be more productive, and improve themselves in a sustainable way? Some technology can help in the short term without providing lasting solutions. For example, relying on a spell checker may not improve one’s ability to spell correctly. This talk explores choices in the design and use of AI tools to help with decision making and the importance of rigorous measurement and experimentation to maximize the benefits and minimize the risks.

Kahani: Visual storytelling through culturally nuanced images

Research Forum January 2024 - Sameer Segal

Sameer Segal, Principal Research Software Development Engineer, Microsoft Research India

Image generation models can produce visually stunning images from natural language descriptions, but they often lack cultural awareness and nuance. These models may rely on stereotypes and fail to understand local words, shortcomings that can require heavy fixes like modifying or significantly fine-tuning the model. Image generation can also require sophisticated prompting beyond the abilities of many laypeople.

This talk looks at Kahani, a Microsoft Research project focused on developing a visual storytelling prototype that allows people to create visually striking and culturally nuanced images just by describing them in their local languages. Kahani leverages state-of-the-art techniques like inpainting and models like Segment Anything and GPT-4V(ision) to generate feedback for the candidate images.

Closing remarks and announcements

Research Forum January 2024 - Ashley Llorens

Ashley Llorens, VP and Distinguished Scientist, Microsoft

The acceleration of AI underscores the importance of engagement across disciplines, organizations, and geographies. This session introduced the first cohort of fellows for Microsoft Research’s AI & Society Fellows program, which aims to foster deep interdisciplinary collaboration that maximizes the value of AI for people and society. The session also provided an update on the Accelerate Foundation Models Research (AFMR) program, which issues grants that make leading models, hosted through Microsoft Azure, accessible to academic research teams. To date, AFMR grants are supporting nearly 200 projects across 80 research institutions around the world. These projects include work in AI model innovation and evaluation, responsible AI, health, AI for scientific discovery, and more.



Build a movie chatbot for TV/OTT platforms using Retrieval Augmented Generation in Amazon Bedrock

Improving how users discover new content is critical to increasing user engagement and satisfaction on media platforms. Keyword search alone struggles to capture semantics and user intent, leading to results that lack relevant context, for example when searching for date night or Christmas-themed movies. This can drive lower retention rates if users can’t reliably find the content they want. With large language models (LLMs), however, there is an opportunity to solve these semantic and user intent challenges. By combining embeddings that capture semantics with a technique called Retrieval Augmented Generation (RAG), you can generate more relevant answers based on retrieved context from your own data sources.

In this post, we show you how to securely create a movie chatbot by implementing RAG with your own data using Knowledge Bases for Amazon Bedrock. We use the IMDb and Box Office Mojo dataset to simulate a catalog for media and entertainment customers and showcase how you can build your own RAG solution in just a couple of steps.

Solution overview

The IMDb and Box Office Mojo Movies/TV/OTT licensable data package provides a wide range of entertainment metadata, including over 1.6 billion user ratings; credits for more than 13 million cast and crew members; 10 million movie, TV, and entertainment titles; and global box office reporting data from more than 60 countries. Many AWS media and entertainment customers license IMDb data through AWS Data Exchange to improve content discovery and increase customer engagement and retention.

Introduction to Knowledge Bases for Amazon Bedrock

To equip an LLM with up-to-date proprietary information, organizations use RAG, a technique that involves fetching data from company data sources and enriching the prompt with that data to deliver more relevant and accurate responses. Knowledge Bases for Amazon Bedrock provide a fully managed RAG capability that allows you to customize LLM responses with contextual and relevant company data. Knowledge Bases automate the end-to-end RAG workflow, including ingestion, retrieval, prompt augmentation, and citations, eliminating the need for you to write custom code to integrate data sources and manage queries. Knowledge Bases for Amazon Bedrock also support multi-turn conversations, so that the LLM can answer complex, follow-up user queries correctly.

We use the following services as part of this solution:

We walk through the following high-level steps:

  1. Preprocess the IMDb data to create documents from every movie record and upload the data into an Amazon Simple Storage Service (Amazon S3) bucket.
  2. Create a knowledge base.
  3. Sync your knowledge base with your data source.
  4. Use the knowledge base to answer semantic queries about the movie catalog.

Prerequisites

The IMDb data used in this post requires a commercial content license and paid subscription to IMDb and Box Office Mojo Movies/TV/OTT licensing package on AWS Data Exchange. To inquire about a license and access sample data, visit developer.imdb.com. To access the dataset, refer to Power recommendation and search using an IMDb knowledge graph – Part 1 and follow the Access the IMDb data section.

Preprocess the IMDb data

Before we create a knowledge base, we need to preprocess the IMDb dataset into text files and upload them to an S3 bucket. In this post, we simulate a customer catalog using the IMDb dataset, taking 10,000 popular movies from it to build the catalog.

Use the following notebook to create the dataset with additional info like actors, director, and producer names. We use the following code to create a single file per movie, with all of its information stored as unstructured text that can be understood by LLMs:

def create_txt_files_imdb(row):
    # Build an unstructured text document for one movie record and write it to a .txt file
    full_text = ""
    full_text += f"{row['originalTitle']} ({row['titleId']}) was shot in year {int(row['year'])} with rating {row['rating']} and poster url {row['poster_url']}.\n\n"
    full_text += f"{row['originalTitle']} has genres {', '.join(row['genres'])}.\n\n"
    full_text += f"{row['originalTitle']} has actors {', '.join(row['Actors'])}.\n\n"
    full_text += f"{row['originalTitle']} has directors {', '.join(row['Directors'])}.\n\n"
    full_text += f"{row['originalTitle']} has producers {', '.join(row['Producers'])}.\n\n"
    full_text += f"{row['originalTitle']} has keyword {', '.join([x.replace('-', ' ') for x in row['keyword']])}.\n\n"
    full_text += f"{row['originalTitle']} has location {', '.join(row['location'])}.\n\n"
    full_text += f"{row['originalTitle']} has plot {row['plot']}.\n\n"
    with open(f"<path>/data/imdb_data/{row['titleId']}.txt", "w") as f:
        f.write(full_text)
    return full_text

After you have the data in .txt format, you can upload the data into Amazon S3 using the following command:

aws s3 cp <path to local data> s3://<bucket-name>/<path>/ --recursive

Create the IMDb Knowledge Base

Complete the following steps to create your knowledge base:

  1. On the Amazon Bedrock console, choose Knowledge base in the navigation pane.
  2. Choose Create knowledge base.
  3. For Knowledge base name, enter imdb.
  4. For Knowledge base description, enter an optional description, such as Knowledge base for ingesting and storing imdb data.
  5. For IAM permissions, select Create and use a new service role, then enter a name for your new service role.
  6. Choose Next.

knowledge base details console page

  1. For Data source name, enter imdb-s3.
  2. For S3 URI, enter the S3 URI that you uploaded the data to.
  3. In the Advanced settings – optional section, for Chunking strategy, choose No chunking.
  4. Choose Next.

Knowledge bases enable you to chunk your documents into smaller segments to make it straightforward to process large documents. In our case, we have already chunked the data into smaller documents (one per movie).

knowledge base console 2

  1. In the Vector database section, select Quick create a new vector store.

Amazon Bedrock will automatically create a fully managed OpenSearch Serverless vector search collection and configure the settings for embedding your data sources using the chosen Titan Embedding G1 – Text embedding model.

knowledge base vector store page

  1. Choose Next.

  1. Review your settings and choose Create knowledge base.

Sync your data with the knowledge base

Now that you have created your knowledge base, you can sync the knowledge base with your data.

  1. On the Amazon Bedrock console, navigate to your knowledge base.
  2. In the Data source section, choose Sync.

knowledge base sync

After the data source is synced, you’re ready to query the data.

Improve search using semantic results

Complete the following steps to test the solution and improve your search using semantic results:

  1. On the Amazon Bedrock console, navigate to your knowledge base.
  2. Select your knowledge base and choose Test knowledge base.
  3. Choose Select model, and choose Anthropic Claude v2.1.
  4. Choose Apply.

Now you are ready to query the data.

We can ask some semantic questions, such as “Recommend me some Christmas themed movies.”

query Recommend me some Christmas themed movies.

Knowledge base responses contain citations that you can explore for response correctness and factuality.

knowledge base citations
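
Beyond the console test panel, the same knowledge base can be queried programmatically. The following sketch uses the RetrieveAndGenerate API through boto3; the knowledge base ID and model ARN are placeholders, and the exact request fields should be checked against the current Amazon Bedrock documentation.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Recommend me some Christmas themed movies."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<your-knowledge-base-id>",
            "modelArn": "arn:aws:bedrock:<region>::foundation-model/anthropic.claude-v2:1",
        },
    },
)

print(response["output"]["text"])        # generated answer
print(response.get("citations", []))     # retrieved source chunks backing the answer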

You can also drill down on any information that you need from these movies. In the following example, we ask “who directed nightmare before christmas?”

“who directed nightmare before christmas?”

You can also ask more specific questions related to the genres and ratings, such as “show me classic animated movies with ratings greater than 7?”

show me classic animated movies with ratings greater than 7?

Augment your knowledge base with agents

Agents for Amazon Bedrock help you automate complex tasks. Agents can break down the user query into smaller tasks and call custom APIs or knowledge bases to supplement information for running actions. With Agents for Amazon Bedrock, developers can integrate intelligent agents into their apps, accelerating the delivery of AI-powered applications and saving weeks of development time. With agents, you can augment your knowledge base by adding more functionality like recommendations from Amazon Personalize for user-specific recommendations or performing actions such as filtering movies based on user needs.

Conclusion

In this post, we showed how to build a conversational movie chatbot with Amazon Bedrock in a few steps, delivering semantic search and conversational experiences based on your own data and the IMDb and Box Office Mojo Movies/TV/OTT licensed dataset. In the next post, we go through the process of adding more functionality to your solution using Agents for Amazon Bedrock. To get started with knowledge bases on Amazon Bedrock, refer to Knowledge Bases for Amazon Bedrock.


About the Authors

Gaurav Rele is a Senior Data Scientist at the Generative AI Innovation Center, where he works with AWS customers across different verticals to accelerate their use of generative AI and AWS Cloud services to solve their business challenges.

Divya Bhargavi is a Senior Applied Scientist Lead at the Generative AI Innovation Center, where she solves high-value business problems for AWS customers using generative AI methods. She works on image/video understanding & retrieval, knowledge graph augmented large language models and personalized advertising use cases.

Suren Gunturu is a Data Scientist working in the Generative AI Innovation Center, where he works with various AWS customers to solve high-value business problems. He specializes in building ML pipelines using Large Language Models, primarily through Amazon Bedrock and other AWS Cloud services.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.


How Mendix is transforming customer experiences with generative AI and Amazon Bedrock

This post was co-written with Ricardo Perdigao, Solution Architecture Manager at Mendix, a Siemens business.

Mendix, a Siemens business, offers the low-code platform with the vision and execution designed for today’s complex software development challenges. Since 2005, we’ve helped thousands of organizations worldwide reimagine how they develop applications with our platform’s cutting-edge capabilities. Mendix allows enterprises to turn ideas into outcomes by delivering sophisticated applications to market faster than ever. Mendix has been named an industry leader by Gartner and Forrester.

In a world where customer-centric innovation is the key to staying ahead of the competition, businesses increasingly turn to advanced technology to revolutionize product offerings. For Mendix, integrating the cutting-edge generative AI capabilities of Amazon Bedrock has been a game changer in redefining our customer experience landscape.

In this post, we share how Mendix is revolutionizing customer experiences using Amazon Bedrock and generative AI.

Overview

In 2016, a new era of innovation began when Mendix announced a strategic collaboration with AWS. By taking advantage of the robust cloud infrastructure of AWS, Mendix was able to provide a secure, scalable, and reliable solution for enterprises across the globe. The primary objective of this partnership was to enable organizations to build, deploy, and manage enterprise-grade applications quickly and effectively.

This unique collaboration combines the agility of the Mendix low-code application development platform with the robustness of AWS Cloud services. This combination facilitates rapid application development and deployment, empowering businesses to respond swiftly to market changes and customer demands, achieving successful digital transformation at an accelerated pace.

However, the rapid evolution of technology brings new challenges. The rise of generative AI presents a unique opportunity to redefine how applications are developed and used. Integrating these advanced AI capabilities into a low-code environment is a complex task, requiring a solution that is innovative, scalable, secure, and easy to use, that nonetheless delivers significant value to users.

Recognizing the importance of staying competitive in this rapidly evolving technological landscape, Mendix is committed to enhancing its platform with seamless AI integrations. Mendix not only offers an AI-enabled low-code application development experience, but also seeks to equip our customers with user-friendly tools necessary for implementing generative AI in the solutions they build.

By combining the power of AI with the accessibility of low-code development, we are setting the stage for unprecedented innovation in the industry, empowering our customers to create AI-enhanced applications with ease and efficiency.

The solution: Integrating generative AI capabilities provided by Amazon Bedrock

As a first step on our journey, we embraced Amazon Bedrock to infuse our low-code platform with generative AI capabilities. Amazon Bedrock offers many ready-to-use AI models. These models excel at writing clear text, creating images from just descriptions, and translating between different languages.

The Mendix AWS Bedrock Connector in the Mendix Marketplace eliminates traditional complexities and streamlines the integration of generative AI within our ecosystem. This pivotal service, accompanied by a wealth of shared knowledge via samples and blog posts, has paved a frictionless path for integration.

Generative AI from Amazon Bedrock is reshaping the landscape for low-code developers, offering a remarkable foundation for rapidly creating sophisticated applications that were previously only possible with extensive development time. These AI models are designed to equip developers with the ability to develop applications that not only learn and adapt but to do so with a previously unseen depth of understanding. As we integrate these technologies into the Mendix environment, we’re ushering in a new era of democratizing generative AI.

Imagine a world where applications are no longer static but can dynamically generate personalized content, such as images and interfaces, by analyzing individual user data like browsing habits, geographic location, and the time of day. This capability of generative AI to tailor experiences to each user promises to boost engagement and satisfaction dramatically. In today’s data-driven enterprise world, where the sheer volume of information can be overwhelming, generative AI stands as a powerful technology, turning complex data into accessible insights, streamlining reports for executives, identifying trends, and predicting future outcomes faster than ever before.

Taking a step further, generative AI also offers actionable recommendations, not just interpretations. This feature is set to transform sectors like customer service, where it can advise service agents on the best subsequent actions or automate routine responses based on a profound comprehension of customer data. By bringing these innovations to the Mendix platform, we’re moving towards a future where applications anticipate and meet user needs proactively, turning every interaction into an opportunity for innovation and a delightful customer experience.

But our vision extends beyond the horizons of today’s achievements. With our sight set on redefining the fabric of low-code application development, we’ve immersed ourselves in pioneering research. Using the Mendix Extensibility framework, we explore generative AI’s potential to transform our industry from the ground up. Our investigative forays have already yielded exciting prospects—from conjuring comprehensive domain models with simple narrative inputs to automating data mapping and sample generation with AI’s interpretative prowess. We’re even on the cusp of enabling Mendix UI setups through intuitive AI-powered dialogs. These nascent concepts—demonstrated in the following video—are still being experimented on. But they promise to herald a new dawn for low-code innovation in the future.

In the following sections, we discuss our needs when selecting the right platform to build generative AI, and how Amazon Bedrock exceeded our expectations.

Ease of implementation

At Mendix, our diverse applications span text generation, summarization, virtual assistance, and multimodal image generation. Amazon Bedrock is our go-to platform for building and implementing generative AI, providing seamless access to cutting-edge generative AI foundational models. Its unified API simplifies experimentation and upgrades, facilitating quick and efficient integration of various models into our systems. This streamlined approach significantly reduces development and deployment efforts, enhancing overall efficiency and quickly innovating on disruptive technologies like generative AI.

Security is our top priority

At Mendix, generative AI is vital for innovation, yet security remains critical. With Amazon Bedrock, we customized models using labeled data in Amazon Simple Storage Service (Amazon S3). We also added encryption with AWS Key Management Service (AWS KMS) and used Amazon Virtual Private Cloud (Amazon VPC) with AWS PrivateLink to establish private connectivity from a VPC to Amazon Bedrock, hardening data, security, and access and fulfilling our stringent enterprise security needs.

As we learned with Amazon Bedrock, a private copy of the base model is launched when we fine-tune a foundation model. Our data is neither shared with the model providers nor used to improve the base models. The providers of Amazon Titan and of the other leading foundation models from AI21 Labs, Anthropic, Cohere, Meta, and Stability AI have no access to our data (prompts and completion results), ensuring responsible development and exceeding our security requirements.

Thanks to AWS and Amazon Bedrock, balancing the power of generative AI with robust security measures ensures responsible and safe development, fostering technological advancement with confidence. Amazon Bedrock exceeded our security requirements.

Continual updates and support

With Amazon Bedrock, users can benefit from continual updates and support for the available models. Enterprises like us have access to the latest advancements and improvements in AI technology, allowing us to remain ahead of the curve and adjust to evolving market trends and demands.

We can’t wait to further experiment with the new features of Amazon Bedrock announced at AWS re:Invent 2023, including Agents for Amazon Bedrock and Knowledge Bases for Amazon Bedrock, to accelerate our generative AI innovation on AWS.

Cost-effectiveness

The diverse model offerings in Amazon Bedrock empower you to select cost-effective large language models based on your use case. This flexibility enables companies to optimize AI investments, allocate resources efficiently, and achieve the best return on investment by choosing the most suitable model for each use case.

Conclusion

Mendix’s strategic integration of generative AI using Amazon Bedrock represents a groundbreaking move in democratizing tech innovation. Aligning with our legacy of empowering businesses with agility, Amazon Bedrock enhances our platform with advanced AI, ensuring every Mendix solution is not just current but future-engineered. We’re not merely adapting to low-code development evolution, but pioneering it.

To learn more about how your business can use the power of Amazon Bedrock and AWS to drive innovation and revolutionize customer experiences, see Why Using Amazon Bedrock in Your Mendix Apps is a Must.


About the Authors

Ricardo Perdigao has worked in the Enterprise Software industry for nearly 30 years, bringing a wealth of experience and innovation to his role as a Solutions Architect at Mendix, a high-productivity application development platform that empowers the creation and continuous improvement of mobile and web applications at scale. In 2023, Ricardo was honored with the prestigious Mendix Innovation Award for his groundbreaking research in Generative AI, further cementing his reputation as a forward-thinking leader in technology.

Suresh Patnam is the principal GTM Specialist AI/ML and Generative AI at AWS. He is passionate about helping businesses of all sizes transform into fast-moving digital organizations focusing on data, AI/ML, and generative AI. Suresh holds an MBA from Duke University-Fuqua School of Business. In his spare time, Suresh enjoys playing tennis and spending time with his family.


Train and host a computer vision model for tampering detection on Amazon SageMaker: Part 2

In the first part of this three-part series, we presented a solution that demonstrates how you can automate detecting document tampering and fraud at scale using AWS AI and machine learning (ML) services for a mortgage underwriting use case.

In this post, we present an approach to develop a deep learning-based computer vision model to detect and highlight forged images in mortgage underwriting. We provide guidance on building, training, and deploying deep learning networks on Amazon SageMaker.

In Part 3, we demonstrate how to implement the solution on Amazon Fraud Detector.

Solution overview

To meet the objective of detecting document tampering in mortgage underwriting, we employ a computer vision model hosted on SageMaker for our image forgery detection solution. This model receives a testing image as input and generates a likelihood prediction of forgery as its output. The network architecture is as depicted in the following diagram.

Tampering detection model architecture

Image forgery mainly involves four techniques: splicing, copy-move, removal, and enhancement. Depending on the characteristics of the forgery, different clues can be used as the foundation for detection and localization. These clues include JPEG compression artifacts, edge inconsistencies, noise patterns, color consistency, visual similarity, EXIF consistency, and camera model.

Given the expansive realm of image forgery detection, we use the Error Level Analysis (ELA) algorithm as an illustrative method for detecting forgeries. We selected the ELA technique for this post for the following reasons:

  • It is quicker to implement and can easily catch tampering of images.
  • It works by analyzing the compression levels of different parts of an image. This allows it to detect inconsistencies that may indicate tampering—for example, if one area was copied and pasted from another image that had been saved at a different compression level.
  • It is good at detecting more subtle or seamless tampering that may be hard to spot with the naked eye. Even small changes to an image can introduce detectable compression anomalies.
  • It doesn’t rely on having the original unmodified image for comparison. ELA can identify tampering signs within only the questioned image itself. Other techniques often require the unmodified original to compare against.
  • It is a lightweight technique that only relies on analyzing compression artifacts in the digital image data. It doesn’t depend on specialized hardware or forensics expertise. This makes ELA accessible as a first-pass analysis tool.
  • The output ELA image can clearly highlight differences in compression levels, making tampered areas visibly obvious. This allows even a non-expert to recognize signs of possible manipulation.
  • It works on many image types (such as JPEG, PNG, and GIF) and requires only the image itself to analyze. Other forensic techniques may be more restricted in formats or original image requirements.

However, in real-world scenarios where you may have a combination of input documents (JPEG, PNG, GIF, TIFF, PDF), we recommend employing ELA in conjunction with various other methods, such as detecting inconsistencies in edges, noise patterns, color uniformity, EXIF data consistency, camera model identification, and font uniformity. We aim to update the code for this post with additional forgery detection techniques.

ELA’s underlying premise assumes that the input images are in JPEG format, known for its lossy compression. Nevertheless, the method can still be effective even if the input images were originally in a lossless format (such as PNG, GIF, or BMP) and later converted to JPEG during the tampering process. When ELA is applied to original lossless formats, it typically indicates consistent image quality without any deterioration, rendering it challenging to pinpoint altered areas. In JPEG images, the expected norm is for the entire picture to exhibit similar compression levels. However, if a particular section within the image displays a markedly different error level, it often suggests a digital alteration has been made.

ELA highlights differences in the JPEG compression rate. Regions with uniform coloring will likely have a lower ELA result (for example, a darker color compared to high-contrast edges). The things to look for to identify tampering or modification include the following:

  • Similar edges should have similar brightness in the ELA result. All high-contrast edges should look similar to each other, and all low-contrast edges should look similar. With an original photo, low-contrast edges should be almost as bright as high-contrast edges.
  • Similar textures should have similar coloring under ELA. Areas with more surface detail, such as a close-up of a basketball, will likely have a higher ELA result than a smooth surface.
  • Regardless of the actual color of the surface, all flat surfaces should have about the same coloring under ELA.

JPEG images use a lossy compression system. Each re-encoding (resave) of the image adds more quality loss to the image. Specifically, the JPEG algorithm operates on an 8×8 pixel grid. Each 8×8 square is compressed independently. If the image is completely unmodified, then all 8×8 squares should have similar error potentials. If the image is unmodified and resaved, then every square should degrade at approximately the same rate.

ELA saves the image at a specified JPEG quality level. This resave introduces a known amount of errors across the entire image. The resaved image is then compared against the original image. If an image is modified, then every 8×8 square that was touched by the modification should be at a higher error potential than the rest of the image.

The results from ELA are directly dependent on the image quality. You may want to know if something was added, but if the picture is copied multiple times, then ELA may only permit detecting the resaves. Try to find the best quality version of the picture.

With training and practice, ELA can also learn to identify image scaling, quality, cropping, and resave transformations. For example, if a non-JPEG image contains visible grid lines (1 pixel wide in 8×8 squares), then it means the picture started as a JPEG and was converted to non-JPEG format (such as PNG). If some areas of the picture lack grid lines or the grid lines shift, then it denotes a splice or drawn portion in the non-JPEG image.

In the following sections, we demonstrate the steps for configuring, training, and deploying the computer vision model.

Prerequisites

To follow along with this post, complete the following prerequisites:

  1. Have an AWS account.
  2. Set up Amazon SageMaker Studio. You can swiftly initiate SageMaker Studio using default presets, facilitating a rapid launch. For more information, refer to Amazon SageMaker simplifies the Amazon SageMaker Studio setup for individual users.
  3. Open SageMaker Studio and launch a system terminal.
    Setup system terminal
  4. Run the following command in the terminal:
    git clone https://github.com/aws-samples/document-tampering-detection.git
  5. Note that the total cost of running SageMaker Studio for one user with the notebook environment configured as described is $7.314 USD per hour.

Set up the model training notebook

Complete the following steps to set up your training notebook:

  1. Open the tampering_detection_training.ipynb file from the document-tampering-detection directory.
  2. Set up the notebook environment with the image TensorFlow 2.6 Python 3.8 CPU or GPU Optimized.
    You may run into an issue of insufficient availability or hit the quota limit for GPU instances in your AWS account when selecting GPU-optimized instances. To increase the quota, visit the Service Quotas console and increase the service limit for the specific instance type you need. You can also use a CPU-optimized notebook environment in such cases.
  3. For Kernel, choose Python3.
  4. For Instance type, choose ml.m5d.24xlarge or any other large instance.

We selected a larger instance type to reduce the training time of the model. With an ml.m5d.24xlarge notebook environment, the cost is $7.258 USD per hour.

Run the training notebook

Run each cell in the notebook tampering_detection_training.ipynb in order. We discuss some cells in more detail in the following sections.

Prepare the dataset with a list of original and tampered images

Before you run the following cell in the notebook, prepare a dataset of original and tampered documents based on your specific business requirements. For this post, we use a sample dataset of tampered paystubs and bank statements. The dataset is available within the images directory of the GitHub repository.

Prepare dataset

The notebook reads the original and tampered images from the images/training directory.

The dataset for training is created using a CSV file with two columns: the path to the image file and the label for the image (0 for original image and 1 for tampered image).

Label dataset

Process the dataset by generating the ELA results of each training image

In this step, we generate the ELA result (at 90% quality) of the input training image. The function convert_to_ela_image takes two parameters: path, which is the path to an image file, and quality, representing the quality parameter for JPEG compression. The function performs the following steps (a minimal sketch of the function follows the list):

  1. Convert the image to RGB format and resave the image as a JPEG file with the specified quality under the name tempresaved.jpg.
  2. Compute the pixel-wise difference between the original image and the resaved JPEG image (the ELA image), and determine the maximum difference in pixel values between the two.
  3. Calculate a scale factor based on the maximum difference to adjust the brightness of the ELA image.
  4. Enhance the brightness of the ELA image using the calculated scale factor.
  5. Resize the ELA result to 128x128x3, where 3 is the number of color channels, to reduce the input size for training.
  6. Return the ELA image.
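
The following is a minimal version of convert_to_ela_image implementing these steps with PIL; the repository’s implementation may differ in details such as file handling and scaling.

from PIL import Image, ImageChops, ImageEnhance

def convert_to_ela_image(path, quality=90):
    # Resave the image as JPEG at the given quality and return the amplified difference image.
    original = Image.open(path).convert("RGB")
    original.save("tempresaved.jpg", "JPEG", quality=quality)      # step 1: resave
    resaved = Image.open("tempresaved.jpg")

    ela = ImageChops.difference(original, resaved)                 # step 2: pixel-wise difference
    max_diff = max(channel_max for _, channel_max in ela.getextrema()) or 1

    scale = 255.0 / max_diff                                       # step 3: brightness scale factor
    ela = ImageEnhance.Brightness(ela).enhance(scale)              # step 4: amplify differences

    return ela.resize((128, 128))                                  # step 5: 128x128x3 input for training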

In lossy image formats such as JPEG, the initial saving process leads to considerable color loss. However, when the image is loaded and subsequently re-encoded in the same lossy format, there’s generally less added color degradation. ELA outcomes emphasize the image areas most susceptible to color degradation upon resaving. Generally, alterations appear prominently in regions exhibiting higher potential for degradation compared to the rest of the image.

Next, the images are processed into a NumPy array for training. We then split the input dataset randomly into training and test or validation data (80/20). You can ignore any warnings when running these cells.

Convert to ELA for training

Depending on the size of the dataset, running these cells could take some time to complete. For the sample dataset we provided in this repository, it could take 5–10 minutes.

Configure the CNN model

In this step, we construct a minimal version of the VGG network with small convolutional filters. The VGG-16 consists of 13 convolutional layers and three fully connected layers. The following screenshot illustrates the architecture of our Convolutional Neural Network (CNN) model.

Tensorflow model architecture

Note the following configurations (a minimal Keras sketch follows the list):

  • Input – The model takes in an image input size of 128x128x3.
  • Convolutional layers – The convolutional layers use a minimal receptive field (3×3), the smallest possible size that still captures up/down and left/right. Each is followed by a rectified linear unit (ReLU) activation function that reduces training time. ReLU is a piecewise linear function that outputs the input if it is positive; otherwise, the output is zero. The convolution stride is fixed at the default (1 pixel) to keep the spatial resolution preserved after convolution (stride is the number of pixel shifts over the input matrix).
  • Fully connected layers – The network has two fully connected layers. The first dense layer uses ReLU activation, and the second uses softmax to classify the image as original or tampered.
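
A minimal Keras sketch matching this description is shown below; the exact filter counts and layer depth in the repository’s notebook may differ.

from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(128, 128, 3), num_classes=2):
    # Minimal VGG-style CNN: stacked 3x3 convolutions with ReLU, max pooling, then two dense layers.
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),                 # first fully connected layer
        layers.Dense(num_classes, activation="softmax"),      # original (0) vs. tampered (1)
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",     # integer labels 0/1 from the CSV
                  metrics=["accuracy"])
    return model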

You can ignore any warnings when running these cells.

Save the model artifacts

Save the trained model with a unique file name—for example, based on the current date and time—into a directory named model.

Save the tensorflow model artifacts

The model is saved in Keras format with the extension .keras. We also save the model artifacts as a directory named 1 containing serialized signatures and the state needed to run them, including variable values and vocabularies to deploy to a SageMaker runtime (which we discuss later in this post).

Measure model performance

The following loss curve shows the progression of the model’s loss over training epochs (iterations).

model accuracy plot

The loss function measures how well the model’s predictions match the actual targets. Lower values indicate better alignment between predictions and true values. Decreasing loss over epochs signifies that the model is improving. The accuracy curve illustrates the model’s accuracy over training epochs. Accuracy is the ratio of correct predictions to the total number of predictions. Higher accuracy indicates a better-performing model. Typically, accuracy increases during training as the model learns patterns and improves its predictive ability. These will help you determine if the model is overfitting (performing well on training data but poorly on unseen data) or underfitting (not learning enough from the training data).

The following confusion matrix visually represents how well the model accurately distinguishes between the positive (forged image, represented as value 1) and negative (untampered image, represented as value 0) classes.

Confusion matrix plot

Following the model training, our next step involves deploying the computer vision model as an API. This API will be integrated into business applications as a component of the underwriting workflow. To achieve this, we use Amazon SageMaker Inference, a fully managed service. This service seamlessly integrates with MLOps tools, enabling scalable model deployment, cost-efficient inference, enhanced model management in production, and reduced operational complexity. In this post, we deploy the model as a real-time inference endpoint. However, it’s important to note that, depending on the workflow of your business applications, the model deployment can also be tailored as batch processing, asynchronous handling, or through a serverless deployment architecture.

Set up the model deployment notebook

Complete the following steps to set up your model deployment notebook:

  1. Open the tampering_detection_model_deploy.ipynb file from document-tampering-detection directory.
  2. Set up the notebook environment with the image Data Science 3.0.
  3. For Kernel, choose Python3.
  4. For Instance type, choose ml.t3.medium.

With an ml.t3.medium notebook environment, the cost per hour is $0.056 USD.

Create a custom inline policy for the SageMaker role to allow all Amazon S3 actions

The AWS Identity and Access Management (IAM) role for SageMaker will be in the format AmazonSageMaker-ExecutionRole-<random numbers>. Make sure you’re using the correct role. The role name can be found under the user details within the SageMaker domain configurations.

Update the IAM role to include an inline policy to allow all Amazon Simple Storage Service (Amazon S3) actions. This will be required to automate the creation and deletion of S3 buckets that will store the model artifacts. You can limit the access to specific S3 buckets. Note that we used a wildcard for the S3 bucket name in the IAM policy (tamperingdetection*).
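
For reference, an inline policy along these lines could be attached to the role with boto3; the role and policy names below are placeholders, and you can narrow the resource ARNs further if your security requirements demand it.

import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",   # all Amazon S3 actions, as required by the deployment notebook
            "Resource": [
                "arn:aws:s3:::tamperingdetection*",
                "arn:aws:s3:::tamperingdetection*/*",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="AmazonSageMaker-ExecutionRole-<random numbers>",   # placeholder: use your actual role name
    PolicyName="tampering-detection-s3-access",                  # placeholder policy name
    PolicyDocument=json.dumps(policy),
)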

Run the deployment notebook

Run each cell in the notebook tampering_detection_model_deploy.ipynb in order. We discuss some cells in more detail in the following sections.

Create an S3 bucket

Run the cell to create an S3 bucket. The bucket will be named tamperingdetection<current date time> and in the same AWS Region as your SageMaker Studio environment.

Create Amazon S3 bucket

Create the model artifact archive and upload to Amazon S3

Create a tar.gz file from the model artifacts. We have saved the model artifacts as a directory named 1, containing serialized signatures and the state needed to run them, including variable values and vocabularies to deploy to the SageMaker runtime. You can also include a custom inference file called inference.py within the code folder in the model artifact. The custom inference can be used for preprocessing and postprocessing of the input image.

Tar file with model artifacts

Upload model artifacts to Amazon S3
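
A sketch of the packaging and upload steps performed by the notebook (bucket name and paths are placeholders):

import tarfile
import boto3

# Package the SavedModel directory named "1" (and, optionally, code/inference.py).
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("1", arcname="1")
    # tar.add("code/inference.py", arcname="code/inference.py")   # optional custom inference script

s3 = boto3.client("s3")
s3.upload_file("model.tar.gz", "tamperingdetection<current date time>", "model/model.tar.gz")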

Create a SageMaker inference endpoint

The cell to create a SageMaker inference endpoint may take a few minutes to complete.

Create Amazon SageMaker Inference endpoint
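
As a rough sketch of what this cell does (the framework version, instance type, and S3 path are assumptions, not the notebook’s exact values):

# Sketch: deploy the packaged TensorFlow model as a real-time SageMaker endpoint.
from datetime import datetime
import sagemaker
from sagemaker.tensorflow import TensorFlowModel

role = sagemaker.get_execution_role()
model_data = "s3://tamperingdetection<current date time>/model/model.tar.gz"  # from the previous step

tf_model = TensorFlowModel(
    model_data=model_data,
    role=role,
    framework_version="2.12",     # assumed; match the version used for training
)

predictor = tf_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",  # assumed instance type
    endpoint_name="tamperingdetection-" + datetime.now().strftime("%Y%m%d%H%M%S"),
)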

Test the inference endpoint

The function check_image preprocesses an image as an ELA image, sends it to the SageMaker endpoint for inference, retrieves and processes the model’s predictions, and prints the results. The model takes a NumPy array of the input image, converted to an ELA image, and outputs 0 for an untampered image or 1 for a forged image.

Test Amazon SageMaker Inference endpoint
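
A minimal sketch of a check_image-style helper is shown below. The ELA preprocessing mirrors the approach described earlier in this series, but the resize dimensions, JPEG quality, payload format, and helper names here are assumptions and may differ from the notebook.

# Sketch of a check_image-style helper (hypothetical; details may differ from the notebook).
import io
import json
import numpy as np
import boto3
from PIL import Image, ImageChops

runtime = boto3.client("sagemaker-runtime")

def to_ela_array(path, quality=90, size=(128, 128)):
    # Convert an image to its Error Level Analysis (ELA) representation
    original = Image.open(path).convert("RGB")
    buffer = io.BytesIO()
    original.save(buffer, "JPEG", quality=quality)        # resave at a known quality
    resaved = Image.open(io.BytesIO(buffer.getvalue()))
    ela = ImageChops.difference(original, resaved).resize(size)
    return np.array(ela, dtype=np.float32) / 255.0

def check_image(path, endpoint_name):
    ela = to_ela_array(path)
    payload = json.dumps({"instances": [ela.tolist()]})   # TensorFlow Serving JSON format
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=payload,
    )
    prediction = json.loads(response["Body"].read())["predictions"][0]
    # A sigmoid output yields one probability; a softmax output yields two scores
    label = int(np.argmax(prediction)) if isinstance(prediction, list) else int(prediction > 0.5)
    print("1 (forged image)" if label == 1 else "0 (untampered image)")
    return label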

Let’s invoke the model with an untampered image of a paystub and check the result.

Test an original image

The model outputs the classification as 0, representing an untampered image.

Now let’s invoke the model with a tampered image of a paystub and check the result.

Test a forged image

The model outputs the classification as 1, representing a forged image.

Limitations

Although ELA is an excellent tool for helping detect modifications, there are a number of limitations, such as the following:

  • A single pixel change or minor color adjustment may not generate a noticeable change in the ELA because JPEG operates on a grid.
  • ELA only identifies what regions have different compression levels. If a lower-quality image is spliced into a higher-quality picture, then the lower-quality image may appear as a darker region.
  • Scaling, recoloring, or adding noise to an image will modify the entire image, creating a higher error level potential.
  • If an image is resaved multiple times, then it may be entirely at a minimum error level, where more resaves do not alter the image. In this case, the ELA will return a black image and no modifications can be identified using this algorithm.
  • With Photoshop, the simple act of saving the picture can auto-sharpen textures and edges, creating a higher error level potential. This artifact doesn’t identify intentional modification; it identifies that an Adobe product was used. Technically, ELA appears as a modification because Adobe automatically performed a modification, but the modification was not necessarily intentional by the user.

We recommend using ELA alongside the other techniques discussed earlier in this series to detect a greater range of image manipulation cases. ELA can also serve as an independent tool for visually examining image disparities, especially when training a CNN-based model is challenging.

Clean up

To remove the resources you created as part of this solution, complete the following steps:

  1. Run the notebook cells under the Cleanup section. This will delete the following:
    1. SageMaker inference endpoint – The inference endpoint name will be tamperingdetection-<datetime>.
    2. Objects within the S3 bucket and the S3 bucket itself – The bucket name will be tamperingdetection<datetime>.
  2. Shut down the SageMaker Studio notebook resources.

Conclusion

In this post, we presented an end-to-end solution for detecting document tampering and fraud using deep learning and SageMaker. We used ELA to preprocess images and identify discrepancies in compression levels that may indicate manipulation. Then we trained a CNN model on this processed dataset to classify images as original or tampered.

The model can achieve strong performance, reaching over 95% accuracy when trained on a dataset of forged and original documents suited to your business requirements. This indicates that it can reliably detect forged documents like paystubs and bank statements. The trained model is deployed to a SageMaker endpoint to enable low-latency inference at scale. By integrating this solution into mortgage workflows, institutions can automatically flag suspicious documents for further fraud investigation.

Although powerful, ELA has some limitations in identifying certain types of more subtle manipulation. As next steps, the model could be enhanced by incorporating additional forensic techniques into training and using larger, more diverse datasets. Overall, this solution demonstrates how you can use deep learning and AWS services to build impactful solutions that boost efficiency, reduce risk, and prevent fraud.

In Part 3, we demonstrate how to implement the solution on Amazon Fraud Detector.


About the authors


Anup Ravindranath
is a Senior Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada working with Financial Services organizations. He helps customers to transform their businesses and innovate on cloud.

Vinnie Saini is a Senior Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada. She has been helping Financial Services customers transform on cloud, with AI and ML driven solutions laid on strong foundational pillars of Architectural Excellence.

Read More

Cardiac Clarity: Dr. Keith Channon Talks Revolutionizing Heart Health With AI

Here’s some news to still-beating hearts: AI is helping bring some clarity to cardiology. Caristo Diagnostics has developed an AI-powered solution for detecting coronary inflammation in cardiac CT scans. In this episode of NVIDIA’s AI Podcast, Dr. Keith Channon, the Field Marshal Earl Alexander Professor at the University of Oxford, and the cofounder and chief medical officer at the startup, speaks with host Noah Kravitz about the technology. Called Caristo, it analyzes radiometric features in CT scan data to identify inflammation in the fat tissue surrounding coronary arteries, a key indicator of heart disease. Tune in to learn more about how Caristo uses AI to improve treatment plans and risk predictions by providing physicians with a patient-specific readout of inflammation levels.

Show Notes

1:56: What is Caristo and how does it work?
7:11: The key signal of a heart attack
10:34: How did Channon come up with the idea of using AI to drive breakthroughs?
22:40: How much has the CT scan changed over the years?
26:01: What’s ahead for Caristo?
30:14: How to take care of your own heart health

You Might Also Like

Immunai Uses Deep Learning to Develop New Drugs – Ep. 176
What if we could map our immune system to create drugs that can help our bodies win the fight against cancer and other diseases? That’s the big idea behind immunotherapy. The problem: the immune system is incredibly complex. Enter Immunai, a biotechnology company using AI technology to map the human immune system and speed the development of new immunotherapies against cancer and autoimmune diseases.

Overjet on Bringing AI to Dentistry – Ep. 179
Dentists get a bad rap. Dentists also get more people out of more aggravating pain than just about anyone, which is why the more technology dentists have, the better. Overjet, a member of the NVIDIA Inception program for startups, is moving fast to bring AI to dentists’ offices.

Democratizing Drug Discovery With Deep Learning – Ep. 172
It may seem intuitive that AI and deep learning can speed up workflows — including novel drug discovery, a typically years-long and several-billion-dollar endeavor. But, professors Artem Cherkasov and Olexandr Isayev were surprised that no recent academic papers provided a comprehensive, global research review of how deep learning and GPU-accelerated computing impact drug discovery.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Amazon Music, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

Read More

Building an early warning system for LLM-aided biological threat creation

We’re developing a blueprint for evaluating the risk that a large language model (LLM) could aid someone in creating a biological threat. In an evaluation involving both biology experts and students, we found that GPT-4 provides at most a mild uplift in biological threat creation accuracy. While this uplift is not large enough to be conclusive, our finding is a starting point for continued research and community deliberation. OpenAI Blog

Singtel, NVIDIA to Bring Sovereign AI to Southeast Asia

Asia’s lion city is roaring ahead in AI.

Singtel, a leading communications services provider based in Singapore, will bring the NVIDIA AI platform to businesses in the island nation and beyond.

The mobile and broadband company is building energy-efficient data centers across Southeast Asia accelerated with NVIDIA Hopper architecture GPUs and using NVIDIA AI reference architectures proven to deliver optimal performance.

The data centers will serve as sovereign national resources — AI factories that process the private datasets of companies, startups, universities and governments safely on shore to produce valuable insights.

Singtel’s first AI services will spin up in Singapore, with future data centers under construction in Indonesia and Thailand. From its hub in Singapore, the company has operations that stretch from Australia to India.

Trusted Engines of AI

The new data centers will act as trusted engines of generative AI. Generative AI, the most transformative technology of our time, is attracting users worldwide with its ability to amplify human intelligence and productivity.

Nations are creating large language models tuned to their local dialects, cultures and practices. Singtel sits at the center of such opportunities among Southeast Asia’s vibrant Chinese, Indian, Malay and other communities.

Singtel’s initiative supports Singapore’s national AI strategy to empower its citizens with the latest technology. The plan calls for significantly expanding the country’s compute infrastructure as well as its talent pool of machine learning specialists.

For businesses in the region, having a known, local provider of these computationally intensive services provides a safe, easy on-ramp to generative AI. They can enhance and personalize their products and services while protecting sensitive corporate data.

Taking the Green Path

Singtel is committed to democratizing AI and decarbonizing its operations.

Its latest data centers are being built with an eye to sustainability, including in the selection of materials and the use of liquid cooling. They adopt best practices to deliver a power usage effectiveness (PUE) of less than 1.3, a key metric of data center efficiency.

Singtel will use its Paragon software platform to orchestrate how the new AI applications work in concert with its mobile and broadband services. The combination will enable edge computing services like powering robots and other autonomous systems from AI models running in the cloud.

A Full-Stack Foundation

The company will offer its customers NVIDIA AI Enterprise, a software platform for building and deploying AI applications, including generative AI. Singtel will also be an NVIDIA Cloud Partner, delivering optimized AI services on the NVIDIA platform.

Because Singtel’s data centers use NVIDIA’s proven reference architectures for AI computing, users can employ its services, knowing they’re optimized for leading AI performance.

Singtel already has hands-on experience delivering edge services with NVIDIA AI.

Last May, it demonstrated a digital avatar created with the NVIDIA Omniverse and NVIDIA NeMo platforms that users could interact with over its 5G network. And in 2021, Singtel delivered GPU services as part of a testbed for local government agencies.

New AI Role for Telcos

Singapore’s service provider joins pioneers in France, India, Italy and Switzerland deploying AI factories that deliver generative AI services with data sovereignty.

To learn more about how Singtel and other telcos are embracing generative AI, register for a session on the topic at NVIDIA GTC. The global AI conference runs March 18-21, starting with a keynote by NVIDIA founder and CEO Jensen Huang.

Read More

Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker – Part 1

With the advent of generative AI, today’s foundation models (FMs), such as the large language models (LLMs) Claude 2 and Llama 2, can perform a range of generative tasks such as question answering, summarization, and content creation on text data. However, real-world data exists in multiple modalities, such as text, images, video, and audio. Take a PowerPoint slide deck, for example. It could contain information in the form of text, or embedded in graphs, tables, and pictures.

In this post, we present a solution that uses multimodal FMs such as the Amazon Titan Multimodal Embeddings model and LLaVA 1.5 and AWS services including Amazon Bedrock and Amazon SageMaker to perform similar generative tasks on multimodal data.

Solution overview

The solution provides an implementation for answering questions using information contained in the text and visual elements of a slide deck. The design relies on the concept of Retrieval Augmented Generation (RAG). Traditionally, RAG has been associated with textual data that can be processed by LLMs. In this post, we extend RAG to include images as well. This provides a powerful search capability to extract contextually relevant content from visual elements like tables and graphs along with text.

There are different ways to design a RAG solution that includes images. We have presented one approach here and will follow up with an alternate approach in the second post of this three-part series.

This solution includes the following components:

  • Amazon Titan Multimodal Embeddings model – This FM is used to generate embeddings for the content in the slide deck used in this post. As a multimodal model, this Titan model can process text, images, or a combination as input and generate embeddings. The Titan Multimodal Embeddings model generates vectors (embeddings) of 1,024 dimensions and is accessed via Amazon Bedrock.
  • Large Language and Vision Assistant (LLaVA) – LLaVA is an open source multimodal model for visual and language understanding and is used to interpret the data in the slides, including visual elements such as graphs and tables. We use the 7-billion parameter version LLaVA 1.5-7b in this solution.
  • Amazon SageMaker – The LLaVA model is deployed on a SageMaker endpoint using SageMaker hosting services, and we use the resulting endpoint to run inferences against the LLaVA model. We also use SageMaker notebooks to orchestrate and demonstrate this solution end to end.
  • Amazon OpenSearch Serverless – OpenSearch Serverless is an on-demand serverless configuration for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the Titan Multimodal Embeddings model. An index created in the OpenSearch Serverless collection serves as the vector store for our RAG solution.
  • Amazon OpenSearch Ingestion (OSI) – OSI is a fully managed, serverless data collector that delivers data to OpenSearch Service domains and OpenSearch Serverless collections. In this post, we use an OSI pipeline to deliver data to the OpenSearch Serverless vector store.

Solution architecture

The solution design consists of two parts: ingestion and user interaction. During ingestion, we process the input slide deck by converting each slide into an image, generate embeddings for these images, and then populate the vector data store. These steps are completed prior to the user interaction steps.

In the user interaction phase, a question from the user is converted into embeddings and a similarity search is run on the vector database to find a slide that could potentially contain answers to the user’s question. We then provide this slide (as an image file) along with the user question as a prompt to the LLaVA model to generate an answer to the query. All the code for this post is available in the GitHub repo.

The following diagram illustrates the ingestion architecture.

Ingestion architecture diagram

The workflow steps are as follows:

  1. Slides are converted to image files (one per slide) in JPG format and passed to the Titan Multimodal Embeddings model to generate embeddings. In this post, we use the slide deck titled Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia from the AWS Summit in Toronto, June 2023, to demonstrate the solution. The sample deck has 31 slides, so we generate 31 sets of vector embeddings, each with 1,024 dimensions. We add additional metadata fields to these generated vector embeddings and create a JSON file. These additional metadata fields can be used to perform rich search queries using OpenSearch’s powerful search capabilities.
  2. The generated embeddings are put together in a single JSON file that is uploaded to Amazon Simple Storage Service (Amazon S3).
  3. Via Amazon S3 Event Notifications, an event is put in an Amazon Simple Queue Service (Amazon SQS) queue.
  4. This event in the SQS queue acts as a trigger to run the OSI pipeline, which in turn ingests the data (JSON file) as documents into the OpenSearch Serverless index. Note that the OpenSearch Serverless index is configured as the sink for this pipeline and is created as part of the OpenSearch Serverless collection.

The following diagram illustrates the user interaction architecture.

User interaction architecture

The workflow steps are as follows:

  1. A user submits a question related to the slide deck that has been ingested.
  2. The user input is converted into embeddings using the Titan Multimodal Embeddings model accessed via Amazon Bedrock. An OpenSearch vector search is performed using these embeddings. We perform a k-nearest neighbor search with k=1 to retrieve the single most relevant slide embedding for the user question.
  3. The metadata of the response from OpenSearch Serverless contains a path to the image corresponding to the most relevant slide.
  4. A prompt is created by combining the user question and the image path and provided to LLaVA hosted on SageMaker. The LLaVA model is able to understand the user question and answer it by examining the data in the image.
  5. The result of this inference is returned to the user.

These steps are discussed in detail in the following sections. See the Results section for screenshots and details on the output.

Prerequisites

To implement the solution provided in this post, you should have an AWS account and familiarity with FMs, Amazon Bedrock, SageMaker, and OpenSearch Service.

This solution uses the Titan Multimodal Embeddings model. Ensure that this model is enabled for use in Amazon Bedrock. On the Amazon Bedrock console, choose Model access in the navigation pane. If Titan Multimodal Embeddings is enabled, the access status will state Access granted.

Manage model access in Amazon Bedrock

If the model is not available, enable access to the model by choosing Manage Model Access, selecting Titan Multimodal Embeddings G1, and choosing Request model access. The model is enabled for use immediately.

Request model access in Amazon Bedrock

Use an AWS CloudFormation template to create the solution stack

Use one of the following AWS CloudFormation templates (depending on your Region) to launch the solution resources.

AWS Region Link
us-east-1 Link to CloudFormation template for us-east-1
us-west-2 Link to CloudFormation template for us-west-2

After the stack is created successfully, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the value for MultimodalCollectionEndpoint, which we use in subsequent steps.

Resources created by the CloudFormation template

The CloudFormation template creates the following resources:

  • IAM roles – The following AWS Identity and Access Management (IAM) roles are created. Update these roles to apply least-privilege permissions.
    • SMExecutionRole with Amazon S3, SageMaker, OpenSearch Service, and Bedrock full access.
    • OSPipelineExecutionRole with access to specific Amazon SQS and OSI actions.
  • SageMaker notebook – All the code for this post is run via this notebook.
  • OpenSearch Serverless collection – This is the vector database for storing and retrieving embeddings.
  • OSI pipeline – This is the pipeline for ingesting data into OpenSearch Serverless.
  • S3 bucket – All data for this post is stored in this bucket.
  • SQS queue – The events for triggering the OSI pipeline run are put in this queue.

The CloudFormation template configures the OSI pipeline with Amazon S3 and Amazon SQS as the source and an OpenSearch Serverless index as the sink. Any objects created in the specified S3 bucket and prefix (multimodal/osi-embeddings-json) will trigger SQS notifications, which are used by the OSI pipeline to ingest data into OpenSearch Serverless.

The CloudFormation template also creates network, encryption, and data access policies required for the OpenSearch Serverless collection. Update these policies to apply least-privilege permissions.

Note that the CloudFormation template name is referenced in the SageMaker notebooks. If you change the default template name, make sure you update it in globals.py.

Test the solution

After the prerequisite steps are complete and the CloudFormation stack has been created successfully, you’re now ready to test the solution:

  1. On the SageMaker console, choose Notebooks in the navigation pane.
  2. Select the MultimodalNotebookInstance notebook instance and choose Open JupyterLab.
    Notebook instance in Amazon SageMaker
  3. In File Browser, traverse to the notebooks folder to see the notebooks and supporting files.

The notebooks are numbered in the sequence in which they’re run. Instructions and comments in each notebook describe the actions performed by that notebook. We run these notebooks one by one.

  1. Choose 0_deploy_llava.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook deploys the LLaVA-v1.5-7B model to a SageMaker endpoint. In this notebook, we download the LLaVA-v1.5-7B model from HuggingFace Hub, replace the inference.py script with llava_inference.py, and create a model.tar.gz file for this model. The model.tar.gz file is uploaded to Amazon S3 and used to deploy the model to a SageMaker endpoint. The llava_inference.py script has additional code to allow reading an image file from Amazon S3 and running inference on it.
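
Conceptually, the deployment in this notebook resembles the following minimal sketch; the container versions, instance type, and S3 path shown here are assumptions rather than the notebook’s exact values.

# Sketch: deploy the packaged LLaVA-v1.5-7B model.tar.gz to a SageMaker endpoint.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

llava_model = HuggingFaceModel(
    model_data="s3://<your-bucket-name>/llava-v1.5-7b/model.tar.gz",  # tar.gz built in this notebook
    role=role,
    transformers_version="4.28",  # assumed versions; match the notebook
    pytorch_version="2.0",
    py_version="py310",
)

predictor = llava_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",  # a GPU instance is needed for LLaVA-v1.5-7B
)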

  1. Choose 1_data_prep.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook downloads the slide deck, converts each slide into a JPG file, and uploads the images to the S3 bucket used for this post.
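
A rough sketch of this data preparation step follows; it assumes the deck is available as a PDF and uses pdf2image (which requires the poppler system package), which may differ from the library the notebook actually uses. The URL and bucket name are placeholders.

# Sketch: download the slide deck, convert each slide to JPG, and upload to Amazon S3.
import os
import boto3
import requests
from pdf2image import convert_from_path

SLIDE_DECK_URL = "https://<url-to-slide-deck>.pdf"  # placeholder; see SLIDE_DECK in globals.py
BUCKET = "<your-bucket-name>"                       # placeholder
PREFIX = "multimodal/slides"

local_pdf = "slide_deck.pdf"
with open(local_pdf, "wb") as f:
    f.write(requests.get(SLIDE_DECK_URL, timeout=60).content)

s3 = boto3.client("s3")
for i, page in enumerate(convert_from_path(local_pdf), start=1):
    filename = f"slide_{i:02d}.jpg"
    page.save(filename, "JPEG")
    s3.upload_file(filename, BUCKET, f"{PREFIX}/{filename}")
    os.remove(filename)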

  1. Choose 2_data_ingestion.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

We do the following in this notebook:

  • We create an index in the OpenSearch Serverless collection. This index stores the embeddings data for the slide deck. See the following code:
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# `g` refers to the globals module used throughout the notebooks
session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
  hosts = [{'host': host, 'port': 443}],
  http_auth = auth,
  use_ssl = True,
  verify_certs = True,
  connection_class = RequestsHttpConnection,
  pool_maxsize = 20
)

index_body = """
{
  "settings": {
      "index.knn": true
  },
  "mappings": {
      "properties": {
          "vector_embedding": {
              "type": "knn_vector",
              "dimension": 1024,
              "method": {
                  "name": "hnsw",
                  "engine": "nmslib",
                  "parameters": {}
              }
          },
          "image_path": {
              "type": "text"
          },
          "metadata": {
              "properties": {
                  "slide_filename": {
                      "type": "text"
                  },
                  "model_id": {
                      "type": "text"
                  },
                  "slide_description": {
                      "type": "text"
                  }
              }
          }
      }
  }
}
"""
index_body = json.loads(index_body)
try:
  response = os_client.indices.create(index_name, body=index_body)
  logger.info(f"response received for the create index -> {response}")
except Exception as e:
  logger.error(f"error in creating index={index_name}, exception={e}")
  • We use the Titan Multimodal Embeddings model to convert the JPG images created in the previous notebook into vector embeddings. These embeddings and additional metadata (such as the S3 path of the image file) are stored in a JSON file and uploaded to Amazon S3. Note that a single JSON file is created, which contains documents for all the slides (images) converted into embeddings. The following code snippet shows how an image (in the form of a Base64-encoded string) is converted into embeddings:
import json
import botocore
import numpy as np

def get_multimodal_embeddings(bedrock: botocore.client, image: str) -> np.ndarray:
    # The image is passed to the Titan Multimodal Embeddings model as a Base64-encoded string
    body = json.dumps(dict(inputImage=image))
    try:
        response = bedrock.invoke_model(
            body=body, modelId=g.FMC_MODEL_ID, accept=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
        )
        response_body = json.loads(response.get("body").read())
        embeddings = np.array([response_body.get("embedding")]).astype(np.float32)
    except Exception as e:
        logger.error(f"exception while embedding image (truncated)={image[:10]}, exception={e}")
        embeddings = None

    return embeddings
  • Uploading the JSON file to Amazon S3 triggers the OpenSearch Ingestion pipeline, which processes the file and ingests it into the OpenSearch Serverless index. The following is a sample of the JSON file created. (Only a few dimensions of the vector are shown in the example; the Titan Multimodal Embeddings model generates 1,024 dimensions.)
[
  {
    "image_path": "s3://<your-bucket-name>/path/to/file1.json",
    "metadata": {
      "slide_filename": "mypowerpoint1.pptx",
      "model_id": "amazon.titan-embed-image-v1",
      "slide_description": "This is a test slide deck"
    },
    "vector_embedding": [
      657.6052386529958,
      0.8865137233123771,
      763.870264592026
    ]
  }
] 
  1. Choose 3_rag_inference.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook implements the RAG solution: we convert the user question into embeddings, find a similar image (slide) from the vector database, and provide the retrieved image to LLaVA to generate an answer to the user question. We use the following prompt template:

prompt_template: str = """Pretend that you are a helpful assistant that answers questions about content in a slide deck. 
  Using only the information in the provided slide image answer the following question.
  If you do not find the answer in the image then say I did not find the answer to this question in the slide deck.

  {question}
"""

The following code snippet provides the RAG workflow:

# create prompt and convert to embeddings
question: str = "As per the AI/ML flywheel, what do the AWS AI/ML services provide?"
prompt = prompt_template.format(question=question)
text_embeddings = get_text_embeddings(bedrock, question)

# vector db search
vector_db_response: Dict = find_similar_data(text_embeddings)

# download image for local notebook display
s3_img_path = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('image_path')
logger.info(f"going to answer the question="{question}" using the image "{s3_img_path}"")

!aws s3 cp {s3_img_path} .
local_img_path = os.path.basename(s3_img_path)
display(Image(filename=local_img_path))

# Ask LLaVA
data = {
    "image" : s3_img_path,
    "question" : prompt,
    "temperature" : 0.1,
}
output = predictor.predict(data)
logger.info(f"Image={s3_img_path}nQuestion: {question}nAnswer: {output}nn")

Results

The following is a compilation of some user questions and responses generated by our implementation. The Question column captures the user question, and the Answer column is the textual response generated by LLaVA. Image is the k-nearest (k=1) slide match returned by the OpenSearch Serverless vector search. This slide is interpreted by LLaVA to provide the answer.

Multimodal RAG results

Question: How does Inf2 compare in performance to comparable EC2 instances? I need numbers.
Answer: According to the slide deck, Inf2 instances by AWS Inferentia2 offer up to 4x higher throughput and 10x lower latency compared to comparable EC2 instances.
Image: Image for Question 1 in the Results table

Question: As per the AI/ML flywheel, what do the AWS AI/ML services provide?
Answer: The AWS AI/ML services provide better $/perfer capabilities, new capabilities, and investment in innovation.
Image: Image for Question 2 in the Results table

Question: Compared to GPT-2, how many more parameters does GPT-3 have? What is the numerical difference between the parameter size of GPT-2 and GPT-3?
Answer: According to the slide, GPT-3 has 175 billion parameters, while GPT-2 has 1.5 billion parameters. The numerical difference between the parameter size of GPT-2 and GPT-3 is 173.5 billion.
Image: Image for Question 3 in the Results table

Question: What are quarks in particle physics?
Answer: I did not find the answer to this question in the slide deck.
Image: Image for Question 4 in the Results table

Feel free to extend this solution to your slide decks. Simply update the SLIDE_DECK variable in globals.py with a URL to your slide deck and run the ingestion steps detailed in the previous section.

Tip

You can use OpenSearch Dashboards to interact with the OpenSearch API to run quick tests on your index and ingested data. The following screenshot shows an OpenSearch dashboard GET example.

View of OpenSearch Dashboards

Clean up

To avoid incurring future charges, delete the resources you created. You can do this by deleting the stack via the CloudFormation console.

Deletion of the CloudFormation stack

Additionally, delete the SageMaker inference endpoint created for LLaVA inferencing. You can do this by uncommenting the cleanup step in 3_rag_inference.ipynb and running the cell, or by deleting the endpoint via the SageMaker console: choose Inference and Endpoints in the navigation pane, then select the endpoint and delete it.
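
If you prefer to delete the endpoint programmatically, a minimal sketch is shown below; the endpoint name is a placeholder for the one created by 0_deploy_llava.ipynb.

# Sketch: delete the LLaVA endpoint and its endpoint configuration with boto3.
import boto3

sm = boto3.client("sagemaker")
endpoint_name = "<your-llava-endpoint-name>"  # placeholder

config_name = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=config_name)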

Conclusion

Enterprises generate new content all the time, and slide decks are a common mechanism used to share and disseminate information internally within the organization and externally with customers or at conferences. Over time, rich information can remain buried and hidden in non-text modalities like graphs and tables in these slide decks. You can use this solution and the power of multimodal FMs such as the Titan Multimodal Embeddings model and LLaVA to discover new information or uncover new perspectives on content in slide decks.

We encourage you to learn more by exploring Amazon SageMaker JumpStart, Amazon Titan models, Amazon Bedrock, and OpenSearch Service, and building a solution using the sample implementation provided in this post.

Look out for two additional posts as part of this series. Part 2 covers another approach you could take to talk to your slide deck. This approach generates and stores LLaVA inferences and uses those stored inferences to respond to user queries. Part 3 compares the two approaches.


About the authors

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Manju Prasad is a Senior Solutions Architect within Strategic Accounts at Amazon Web Services. She focuses on providing technical guidance in a variety of domains, including AI/ML to a marquee M&E customer. Prior to joining AWS, she designed and built solutions for companies in the financial services sector and also for a startup.

Archana Inapudi is a Senior Solutions Architect at AWS supporting strategic customers. She has over a decade of experience helping customers design and build data analytics and database solutions. She is passionate about using technology to provide value to customers and achieve business outcomes.

Antara Raisa is an AI and ML Solutions Architect at Amazon Web Services supporting strategic customers based out of Dallas, Texas. She also has previous experience working with large enterprise partners at AWS, where she worked as a Partner Success Solutions Architect for digital native customers.

Read More