The Foundation Model Development Cheatsheet

The pace of foundation model releases and progress has continued to accelerate over the past few years, with many new models released by organizations of all kinds worldwide. Beyond releasing the models themselves, it is also important to make the tools used to create them (large-scale training libraries, data processing and creation tooling, and more) widely available. In April 2023 we released the Pythia model suite, the first LLMs with a fully released and reproducible technical pipeline from start to finish. We are excited to see other organizations following suit, with the LLM360 project releasing Amber later that year and AI2 releasing OLMo, both as fully transparent artifact releases spanning the entire language model development process. Additionally, many other organizations have released new tools for underserved aspects of the development pipeline. Without full-pipeline transparency, there is no accountability for undisclosed design decisions, and independent research and auditing are limited in their ability to draw robust conclusions or accurately assess harms.

As a continuation of EleutherAI’s mission to lower barriers to entry for research and to provide mentorship and educational resources about large-scale AI model development, we have collaborated with researchers from MIT, AI2, Hugging Face, Stanford, Princeton, Masakhane, MLCommons, and more to release “The Foundation Model Development Cheatsheet”, a quick-start guide that familiarizes new developers with useful tools and resources for developing new open models. The topics covered span the entire model development cycle, from data collection to licensing and release practices, and aim to give a jumping-off point and high-level survey of all the important steps for responsibly and successfully developing new models. We hope that the Cheatsheet will be a useful learning resource and reference that exposes newer developers not just to the technical aspects of model creation, which rightfully receive much attention already, but also to the crucially important practices around responsible development and release management.

We hope the Cheatsheet will serve as an entry point into responsible and well-documented model development, and help raise awareness of these crucial issues. You can read the paper for full details, or explore the collection of resources via the interactive website. It is intended as a living resource: all are welcome to submit new resources and be recognized for their contributions!