This blog post is co-written with Dr. Ebtesam Almazrouei, Executive Director–Acting Chief AI Researcher of the AI-Cross Center Unit and Project Lead for LLM Projects at TII.
The United Arab Emirates’ (UAE) Technology Innovation Institute (TII), the applied research pillar of Abu Dhabi’s Advanced Technology Research Council, has launched Falcon LLM, a foundational large language model (LLM) with 40 billion parameters. TII is a leading global research center dedicated to pushing the frontiers of knowledge. TII’s team of scientists, researchers, and engineers works to deliver discovery science and transformative technologies. TII’s work focuses on breakthroughs that will future-proof our society. Trained on 1 trillion tokens, TII Falcon LLM boasts top-notch performance while remaining incredibly cost-effective. Falcon-40B matches the performance of other high-performing LLMs, and is the top-ranked open-source model in the public Hugging Face Open LLM leaderboard. It’s available as open source in two sizes, Falcon-40B and Falcon-7B, and was built from scratch using data preprocessing and model training jobs built on Amazon SageMaker. Open-sourcing Falcon-40B enables users to construct and customize AI tools that cater to unique user needs, facilitating seamless integration and ensuring the long-term preservation of data assets. The model weights are available to download, inspect, and deploy anywhere.
Starting June 7th, both Falcon LLMs will also be available in Amazon SageMaker JumpStart, SageMaker’s machine learning (ML) hub that offers pre-trained models, built-in algorithms, and pre-built solution templates to help you quickly get started with ML. You can deploy and use the Falcon LLMs with a few clicks in SageMaker Studio or programmatically through the SageMaker Python SDK. To deploy and run inference against Falcon LLMs, refer to the Introduction to SageMaker JumpStart – Text Generation with Falcon LLMs example notebook.
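As a rough illustration of the programmatic path, the following sketch uses the JumpStartModel class from the SageMaker Python SDK. The model ID, the request payload shape, and the generation parameter shown here are assumptions made for illustration; the example notebook remains the authoritative reference. Note that deploying creates a billable endpoint, which the sketch deletes after one request.

```python
def build_payload(prompt, max_new_tokens=64):
    """Build a text-generation request body (payload shape is an assumption)."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def deploy_and_query(prompt, model_id="huggingface-llm-falcon-7b-bf16"):
    """Deploy a JumpStart model, send one prompt, then delete the endpoint.

    The default model_id is an assumption; look up the current identifier
    in SageMaker JumpStart before running this.
    """
    # Imported lazily so build_payload works without the SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(model_id=model_id)
    predictor = model.deploy()  # provisions a real-time endpoint (billable)
    try:
        return predictor.predict(build_payload(prompt))
    finally:
        predictor.delete_endpoint()  # clean up to stop incurring charges
```

Calling deploy_and_query requires AWS credentials, an execution role, and sufficient instance quota in the account.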
Dr. Ebtesam Almazrouei, Executive Director–Acting Chief AI Researcher of the AI-Cross Center Unit and Project Lead for LLM Projects at TII, shares:
“We proudly announce the official open-source release of Falcon-40B, the world’s top-ranking open-source language model, developed by TII. Falcon-40B has surpassed renowned models like LLaMA-65B, StableLM, RedPajama, and MPT on the public leaderboard maintained by Hugging Face, demonstrating its exceptional performance without specialized fine-tuning.”
“This impressive achievement reflects the UAE’s dedication to push the boundaries of AI innovation,” continues Dr. Almazrouei. “By releasing Falcon-40B as an open-source model, we provide researchers, businesses, and organizations with the opportunity to leverage its powerful capabilities across various sectors. Falcon-40B’s open-source release empowers organizations to harness its exceptional capabilities and drive advancements in AI-driven solutions. It represents a significant milestone in our commitment to fostering AI innovation and exemplifies the profound scientific contributions of the UAE. To explore Falcon-40B’s remarkable potential, please visit FalconLLM.tii.ae. Join us in leveraging the power of Falcon-40B to shape the future of AI and revolutionize industries.”
In this post, we talk in depth with Dr. Almazrouei about Falcon LLM training on SageMaker, data curation, optimization, performance, and next steps.
A new generation of LLMs
LLMs are software algorithms trained to complete natural text sequences. Due to their size and the volume of training data they interact with, LLMs have impressive text processing abilities, including summarization, question answering, in-context learning, and more.
In early 2020, research organizations across the world emphasized model size, observing that accuracy correlated with the number of parameters. For example, GPT-3 (2020) and BLOOM (2022) feature around 175 billion parameters, Gopher (2021) has 280 billion parameters, and MT-NLG (2021) has 530 billion parameters. In 2022, Hoffmann et al. observed that the prevailing balance of compute between model parameters and dataset size was suboptimal, and published empirical scaling laws suggesting that shifting the compute budget towards smaller models trained on more data could lead to better-performing models. They implemented their guidance in the 70B parameter Chinchilla (2022) model, which outperformed much bigger models.
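To make the parameter-versus-data trade-off concrete, the following sketch compares tokens per parameter and approximate training compute. It relies on the widely used approximation that training a dense transformer costs about 6 FLOPs per parameter per token; that rule of thumb and the 1.4 trillion token count for Chinchilla come from the scaling-law literature rather than from this post, and the function names are illustrative.

```python
def tokens_per_param(n_params, n_tokens):
    """Ratio of training tokens to model parameters."""
    return n_tokens / n_params


def training_flops(n_params, n_tokens):
    """Approximate training compute using the C ~= 6 * N * D rule of thumb."""
    return 6 * n_params * n_tokens


# (parameters, training tokens)
models = {
    "Chinchilla-70B": (70e9, 1.4e12),  # token count from Hoffmann et al.
    "Falcon-40B": (40e9, 1e12),        # figures stated in this post
}

for name, (n, d) in models.items():
    print(f"{name}: {tokens_per_param(n, d):.0f} tokens/param, "
          f"~{training_flops(n, d):.2e} training FLOPs")
```

Under these assumptions, Falcon-40B sits at a slightly higher tokens-per-parameter ratio than Chinchilla, consistent with the trend towards smaller models trained on more data.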
LLM training on SageMaker
SageMaker is a collection of managed APIs for developing, training, tuning, and hosting machine learning (ML) models, including LLMs. Numerous customers rely on SageMaker for their LLM workloads, such as Stability AI, AI21 Labs, and LG AI. SageMaker Training provisions compute clusters with user-defined hardware configuration and code. Compute jobs are billed per run, pro-rated to the second, meaning that users are not charged for GPU capacity when not using the service. TII used transient clusters provided by the SageMaker Training API to train the Falcon LLM, scaling up to 48 ml.p4d.24xlarge instances for a total of 384 NVIDIA A100 GPUs. TII is now training the next Falcon LLM and has scaled its training to 3,136 A100 GPUs (392 ml.p4d instances).
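The cluster sizes and the per-second billing model above can be checked with a few lines of arithmetic. In this sketch, the count of 8 GPUs per ml.p4d.24xlarge instance follows from the figures in this post, while the hourly price is a hypothetical placeholder, not an actual AWS rate.

```python
GPUS_PER_P4D = 8  # 384 GPUs / 48 instances, per the figures in this post


def total_gpus(instance_count):
    """Number of A100 GPUs in a cluster of ml.p4d.24xlarge instances."""
    return instance_count * GPUS_PER_P4D


def job_cost(instance_count, runtime_seconds, hourly_price_per_instance):
    """Cost of a transient job billed per second of runtime.

    hourly_price_per_instance is a placeholder; look up real pricing.
    """
    return instance_count * hourly_price_per_instance * runtime_seconds / 3600


print(total_gpus(48))   # cluster used for Falcon-40B
print(total_gpus(392))  # cluster used for the next Falcon LLM
```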
An unprecedented amount of custom innovations went into all layers of the project in order to raise the bar of science quality and training speed. In the next sections, we describe the optimizations TII conducted at all layers of the deep learning (DL) training system.
Scalable data curation
Latest-generation LLMs get their strength from the size and quality of their training data. The team took particular care in crafting a high-quality trillion-token dataset. Several SageMaker Training CPU jobs transformed petabytes of cheap, scalable web data into a curated, safe training dataset. Automated systems filtered and deduplicated the data; for example, ML classifiers were used to filter profanity. CPU jobs running on ml.c5.18xlarge (72 vCPUs, 144 GB RAM) were instantiated in a few API calls via SageMaker Training to run data transformation tasks. The team used both single-instance and multi-instance CPU jobs for different use cases. Some tasks ran as hundreds of parallel share-nothing architecture (SNA) jobs, each on a single machine. For tasks requiring inter-worker synchronization, the team launched multi-instance jobs, totaling dozens of instances and thousands of vCPUs. Anecdotally, on a downstream dataset preparation task, the team went up to 257 ml.c5.18xlarge instances in a single SageMaker Training job, totaling 18,504 vCPUs and 37 TB of memory.
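The share-nothing pattern can be illustrated with a small sketch. This is not TII's pipeline: exact-match hashing stands in for whatever deduplication method the team used, and all function names are hypothetical. The idea is that documents are routed to shards by content hash, so identical documents always land in the same shard and each worker can deduplicate its shard without talking to any other worker.

```python
import hashlib


def shard_of(text, num_shards):
    """Route a document to a shard based on a hash of its content."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def dedup_shard(docs):
    """Drop exact duplicates within one shard, keeping first occurrences."""
    seen, unique = set(), []
    for doc in docs:
        fingerprint = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(doc)
    return unique


def run_share_nothing(docs, num_shards):
    """Partition documents, then deduplicate each shard independently.

    In a real deployment each shard would be its own single-machine job;
    here the shards are processed in a loop for illustration.
    """
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_of(doc, num_shards)].append(doc)
    return [dedup_shard(shard) for shard in shards]
```

Because routing depends only on content, per-shard deduplication is also globally exact, which is what makes the jobs independent.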
Maximizing training throughput
To minimize both training costs and time-to-market, the team pursued several directions of optimization to accelerate training speed, which is proportional to the number of training tokens processed per second and is measured in TFLOPs/GPU. The team used a fully custom 3D-parallel LLM training framework, featuring custom optimized layers written in compiled GPU code. The team went as far as writing their own custom matrix multiplication implementation to gain further speed! The team also developed logic that adapts parallel communication to the underlying network topology. During their initial scaling experiments, TII was able to reach 166 TFLOPs/GPU on a 147B model on 256 GPUs, and 173 TFLOPs/GPU on a 13B model on 16 GPUs, to our knowledge the fastest-known model TFLOPs achieved in the cloud at the time of the test in late 2022.
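To relate TFLOPs/GPU to tokens per second, the following sketch applies the common approximation of 6 FLOPs per parameter per token, which ignores attention-specific terms. It also compares the reported figures with the NVIDIA A100 dense BF16 tensor-core peak of 312 TFLOPs. Both the approximation and the choice of peak value are assumptions of this sketch, not details given by TII.

```python
A100_PEAK_TFLOPS = 312.0  # A100 dense BF16/FP16 tensor-core peak (assumed baseline)


def utilization(achieved_tflops_per_gpu):
    """Fraction of the assumed A100 peak reached by the training job."""
    return achieved_tflops_per_gpu / A100_PEAK_TFLOPS


def tokens_per_second(tflops_per_gpu, num_gpus, n_params):
    """Cluster-wide token throughput under the 6 * N FLOPs-per-token rule."""
    cluster_flops_per_second = tflops_per_gpu * 1e12 * num_gpus
    return cluster_flops_per_second / (6 * n_params)


for label, tflops, gpus, params in [
    ("147B model on 256 GPUs", 166, 256, 147e9),
    ("13B model on 16 GPUs", 173, 16, 13e9),
]:
    print(f"{label}: {utilization(tflops):.1%} of peak, "
          f"~{tokens_per_second(tflops, gpus, params):,.0f} tokens/s")
```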
LLM training is storage intensive; several terabytes of training data need to be channeled to the training cluster, and several terabytes of model checkpoints regularly travel back from the cluster to the permanent storage. Checkpoints also need to reach the training cluster as fast as possible in the event of job restart. In traditional high-performance computing (HPC), computing nodes are connected to distributed file systems, which provide high-performance I/O and throughput via a POSIX-like interface. In AWS, customers regularly use the Amazon FSx for Lustre file system for this purpose (for more details, refer to Speed up training on Amazon SageMaker using Amazon FSx for Lustre and Amazon EFS file systems), and we also documented the self-managed use of BeeGFS in a distributed computer vision case study. Due to their focus on costs and operational simplicity, the team decided not to implement and operate file system servers, but instead took up the challenge of building exclusively on top of serverless object storage Amazon Simple Storage Service (Amazon S3). A custom S3 dataset class was built using the AWS SDK for Python (Boto3), and provided satisfactory performance while enabling the scientists to iterate autonomously on I/O engineering and model science within the same codebase.
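A dataset class of this kind might look like the following minimal sketch. It is not TII's implementation: the fixed-size record layout, the class name, and the constructor arguments are assumptions made for illustration. It uses the Boto3 get_object call with an HTTP Range header so that each sample is fetched directly from Amazon S3 without a file system in between.

```python
class S3ByteRangeDataset:
    """Map-style dataset reading fixed-size records from S3 objects.

    keys_and_sizes is a list of (object key, object size in bytes) pairs;
    the fixed record size is an assumption of this sketch.
    """

    def __init__(self, bucket, keys_and_sizes, record_bytes, client=None):
        if client is None:
            import boto3  # imported lazily so a stub client can be injected
            client = boto3.client("s3")
        self.client = client
        self.bucket = bucket
        self.record_bytes = record_bytes
        # Precompute (key, byte offset) for every complete record.
        self.index = [
            (key, offset)
            for key, size in keys_and_sizes
            for offset in range(0, size - record_bytes + 1, record_bytes)
        ]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        key, offset = self.index[i]
        end = offset + self.record_bytes - 1  # HTTP Range end is inclusive
        response = self.client.get_object(
            Bucket=self.bucket, Key=key, Range=f"bytes={offset}-{end}"
        )
        return response["Body"].read()
```

Because the class exposes only __len__ and __getitem__, it can be wrapped by a standard data loader, and the injected client makes it easy to test without network access.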
An LLM project rarely consists of a single training job; numerous jobs are needed to conduct initial tests and experiments. Over the course of the main production training, several jobs may be chained, for example to update configuration or software versions, deploy patches, or recover from failures. Scientists from TII conducted significant engineering to build custom clients adapted to LLM training. A launcher client was built on top of the SageMaker Training SDK in order to pack together multiple functionalities in one command, for example code versioning, Docker image building, and job launch. Additionally, an AWS Lambda serverless compute function was designed to watch, monitor, and intervene on jobs as needed.
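The job-launch part of such a client could be sketched as follows. This is a hypothetical illustration that calls the low-level Boto3 create_training_job API rather than the higher-level SDK TII built on, and it leaves out Docker image building. The naming scheme, the tag, and the default volume size and runtime are assumptions.

```python
import time


def build_training_request(job_prefix, image_uri, role_arn, output_s3_uri,
                           instance_count, git_commit,
                           instance_type="ml.p4d.24xlarge",
                           max_runtime_seconds=86400):
    """Assemble a CreateTrainingJob request tied to a code version."""
    # Embed the commit in the job name so every run is traceable to code.
    job_name = f"{job_prefix}-{git_commit[:7]}-{int(time.time())}"
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 500,  # placeholder value
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
        "Tags": [{"Key": "git_commit", "Value": git_commit}],
    }


def launch(request, client=None):
    """Submit the request; a stub client can be injected for dry runs."""
    if client is None:
        import boto3
        client = boto3.client("sagemaker")
    return client.create_training_job(**request)
```

Separating request construction from submission keeps the launcher easy to review and test before any capacity is provisioned.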
Using Slack bots for inference quality audits
Towards the end of training, the team deployed the model on an internal SageMaker Hosting GPU endpoint for real-time interaction. The team went as far as creating a Slack bot to dialog with, to get realistic feedback and run qualitative quality audits of the model.
Training and performance monitoring
Training an LLM requires large amounts of computational resources, including CPU, GPU, and memory resources. Therefore, TII needed to monitor the performance and idle time of the training job to ensure optimal utilization of the computational resources and their cost-effectiveness.
To build an automated monitoring solution, TII used Amazon CloudWatch alarms to monitor the utilization of GPU, CPU, and memory for the training jobs. CloudWatch collects raw data from the underlying container instances being used in the SageMaker Training job and processes it into readable, near-real-time metrics. TII then set thresholds for each of these metrics; if any metric falls below its threshold, an alarm is triggered. This alarm notifies TII’s team of the low resource utilization, allowing them to take corrective actions to rectify resource utilization constraints.
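A low-utilization alarm of this kind could be defined as in the sketch below, which uses the Boto3 CloudWatch put_metric_alarm call. The namespace, metric name, and Host dimension follow the conventions SageMaker uses for training instance metrics as we understand them; the period, evaluation window, threshold, and Amazon SNS notification target are assumptions and not TII's actual settings. Note that on multi-GPU instances the reported GPU utilization may be aggregated across GPUs, so the threshold should be chosen accordingly.

```python
def low_gpu_alarm_kwargs(job_name, host, threshold_percent, sns_topic_arn):
    """Build put_metric_alarm arguments for a low GPU utilization alarm."""
    return {
        "AlarmName": f"{job_name}-{host}-low-gpu",
        "Namespace": "/aws/sagemaker/TrainingJobs",
        "MetricName": "GPUUtilization",
        # Training instance metrics are keyed by "<job name>/<host>".
        "Dimensions": [{"Name": "Host", "Value": f"{job_name}/{host}"}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint (assumed)
        "EvaluationPeriods": 3,   # consecutive low datapoints before alarming
        "Threshold": threshold_percent,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # missing metrics count as a problem
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(kwargs, client=None):
    """Create or update the alarm; a stub client can be injected."""
    if client is None:
        import boto3
        client = boto3.client("cloudwatch")
    client.put_metric_alarm(**kwargs)
```

One alarm per host (for example algo-1, algo-2, and so on) would be needed to cover a multi-instance job.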
In addition to monitoring resource utilization, TII could also monitor the idle time of the training job resources. If the training job resources were idle for a prolonged period of time, it could indicate a bottleneck at any stage of the training cycle and require manual investigation. In some instances, the resource utilization was still relatively optimal, but the training process itself wasn’t progressing. For these cases, TII integrated CloudWatch alarms with Lambda functions to query and read the generated training logs, then take automatic actions based on either the generated error or the idleness of the log generation process (which indicates that the cluster is halted). The alarm triggers an action to stop the training job, which ensures that TII doesn’t incur unnecessary costs when the resources are not being utilized.
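The stalled-job case can be sketched as a Lambda handler that checks how long ago the job last wrote a log event and stops it if the silence exceeds a limit. This is a hypothetical illustration, not TII's function: the event fields, the idle limit, and the use of the log streams' last event timestamp (which CloudWatch Logs updates with some delay) are assumptions. Pagination of log streams is omitted for brevity.

```python
import time

LOG_GROUP = "/aws/sagemaker/TrainingJobs"  # log group used by training jobs


def is_stalled(last_event_ms, now_ms, max_idle_seconds):
    """True if no log event was written within the allowed idle window."""
    return (now_ms - last_event_ms) > max_idle_seconds * 1000


def latest_log_event_ms(logs_client, job_name):
    """Most recent log event time across the job's streams, or None."""
    response = logs_client.describe_log_streams(
        logGroupName=LOG_GROUP, logStreamNamePrefix=job_name
    )
    stamps = [s["lastEventTimestamp"] for s in response["logStreams"]
              if "lastEventTimestamp" in s]
    return max(stamps) if stamps else None


def handler(event, context, logs_client=None, sm_client=None):
    """Lambda entry point; clients can be injected for local testing."""
    if logs_client is None or sm_client is None:
        import boto3
        logs_client = logs_client or boto3.client("logs")
        sm_client = sm_client or boto3.client("sagemaker")
    job_name = event["job_name"]  # hypothetical event field
    last = latest_log_event_ms(logs_client, job_name)
    now_ms = int(time.time() * 1000)
    max_idle = event.get("max_idle_seconds", 1800)
    if last is not None and is_stalled(last, now_ms, max_idle):
        sm_client.stop_training_job(TrainingJobName=job_name)
        return {"stopped": True}
    return {"stopped": False}
```

Keeping the stall decision in a pure function makes the policy easy to test separately from the AWS calls.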
Using SageMaker paired with proprietary, custom innovation, TII was able to train a model that is state-of-the-art in multiple dimensions: technological breakthrough, science quality, training speed, and also operational simplicity.
“Our Falcon LLM illustrates the technology leadership of the UAE, and paves the way for AI-powered innovation in the region. In line with the UAE National AI Strategy 2031, the UAE’s participation in global technological advancements like Falcon LLM is a critical component in our journey towards a knowledge-based economy. The UAE chooses to actively involve itself in the broader conversation by investing in and developing AI solutions that will help create new economic, social, and educational opportunities. As part of this commitment, the open-source release of Falcon LLM showcases the UAE’s dedication to fostering collaboration, promoting transparency, and supporting innovation and research in the field of AI. By making Falcon LLM open source, we aim to enable widespread access to its advanced tech capabilities and empower researchers and organizations worldwide. This significant step exemplifies the UAE’s commitment to driving advancements in AI and solidifies its position as a leader in the global AI community. Next steps include contributing to further advancements in the field of AI and advanced technologies, with new models on the horizon, and promoting the utilization of advanced AI tech within UAE organizations and businesses.”
– Dr. Almazrouei
About the Authors
Dr. Ebtesam Almazrouei is Executive Director–Acting Chief AI Researcher of the AI-Cross Center Unit and Project Lead for LLM Projects at TII. Her work focuses on delivering AI and advanced tech solutions across multiple industries, including healthcare, telecommunication, education, energy, and security. Dr. Almazrouei plays a pivotal role in building LLMs and stepping up the UAE’s capability in this space, leading the team behind Falcon LLM. In addition, she led the development of Noor, the world’s largest Arabic LLM to date.
Will Badr is a Sr. Manager AI/ML Solutions Architects based in Dubai – UAE who works as part of the global Amazon Machine Learning team. Will is passionate about using technology in innovative ways to positively impact the community. In his spare time, he likes to go diving, play soccer and explore the Pacific Islands.
Olivier Cruchant is a Machine Learning Specialist Solutions Architect at AWS, based in France. Olivier helps AWS customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.