Abhinav Venigalla (Guest Author)

Cloudflare R2 and MosaicML enable training LLMs on any compute, anywhere in the world, with zero switching costs

Building the large language models (LLMs) and diffusion models that power generative AI requires massive infrastructure. The most obvious component is compute – hundreds to thousands of GPUs – but an equally critical (and often overlooked) component is the data storage infrastructure. Training datasets can be terabytes to petabytes in size, and this data needs to be read in parallel by thousands of processes. In addition, model checkpoints need to be saved frequently throughout a training run, and for LLMs these checkpoints can each be hundreds of gigabytes!
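To make “hundreds of gigabytes” concrete, here is a minimal back-of-envelope sketch. The per-parameter byte counts are common defaults for mixed-precision training with the Adam optimizer, not figures from this post:

```python
# Back-of-envelope checkpoint size for an LLM trained with Adam in
# mixed precision. Assumed per-parameter storage (typical defaults):
#   2 bytes - bf16 model weights
#   4 bytes - fp32 master weights
#   8 bytes - fp32 Adam momentum + variance
BYTES_PER_PARAM = 2 + 4 + 8  # ~14 bytes per parameter

def checkpoint_size_gb(num_params: float) -> float:
    """Approximate full training checkpoint size in gigabytes."""
    return num_params * BYTES_PER_PARAM / 1e9

for params in (7e9, 30e9, 70e9):
    print(f"{params / 1e9:>4.0f}B params -> ~{checkpoint_size_gb(params):,.0f} GB")
# 7B -> ~98 GB, 30B -> ~420 GB, 70B -> ~980 GB
```

At roughly 14 bytes per parameter, even a 7B-parameter model produces checkpoints near 100 GB, and saving them frequently throughout a run multiplies that storage footprint.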

To manage storage costs and scalability, many machine learning teams have been moving to object storage to host their datasets and checkpoints. Unfortunately, most object store providers use egress fees to “lock in” users to their platform. This makes it very difficult to leverage GPU capacity across multiple cloud providers, or to take advantage of lower or dynamic pricing elsewhere, since the data and model checkpoints are too expensive to move. At a time when cloud GPUs are scarce and new hardware options are entering the market, it’s more important than ever to stay flexible.
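As a rough illustration of why egress fees act as lock-in, the sketch below prices a one-time dataset move. The $0.09/GB rate is an assumed typical public-cloud list price, not a figure from this post; R2, by contrast, charges no egress fees:

```python
# Rough cost of moving a training dataset out of an object store.
# The default rate below is an assumption based on typical public
# cloud egress list prices; Cloudflare R2's egress rate is $0.
def egress_cost_usd(dataset_tb: float, rate_per_gb: float = 0.09) -> float:
    """Cost in USD to egress `dataset_tb` terabytes at `rate_per_gb` USD/GB."""
    return dataset_tb * 1000 * rate_per_gb

for tb in (10, 100, 1000):
    print(f"{tb:>5} TB -> ${egress_cost_usd(tb):,.0f} per full move")
# 10 TB -> $900, 100 TB -> $9,000, 1000 TB -> $90,000
```

Note that this cost is paid on every move, so shifting a training run between clouds to chase available GPUs can incur it repeatedly.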

In addition to high egress fees, there is a technical barrier to object-store-centric machine learning training. Reading and …