Very interesting claims from #RedPajama. It seems they are about to build a competitive LLM from scratch, releasing everything needed to train such models fully reproducibly, starting from open training data. If true, this is highly relevant for FAIR research on and with LLMs.
"The most capable foundation models today are closed behind commercial APIs, which limits research, customization, and their use with sensitive data. Fully open-source models hold the promise of removing these limitations, if the open community can close the quality gap between open and closed models. Recently, there has been much progress along this front. In many ways, AI is having its Linux moment. Stable Diffusion showed that open-source can not only rival the quality of commercial offerings like DALL-E but can also lead to incredible creativity from broad participation by communities around the world. A similar movement has now begun around large language models with the recent release of semi-open models like LLaMA, Alpaca, Vicuna, and Koala; as well as fully-open models like Pythia, OpenChatKit, Open Assistant and Dolly.
We are launching RedPajama, an effort to produce a reproducible, fully-open, leading language model. RedPajama is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute. RedPajama has three key components:
1. Pre-training data, which needs to be both high quality and have broad coverage
2. Base models, which are trained at scale on this data
3. Instruction tuning data and models, which improve the base model to make it usable and safe
Today, we are releasing the first component, pre-training data."
Source: www.together.xyz/blog/redpajam…
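For anyone who wants to poke at the released pre-training data, here is a minimal sketch of streaming it with the Hugging Face `datasets` library. The dataset identifier `togethercomputer/RedPajama-Data-1T-Sample` and the `"text"` field are assumptions about how the release is hosted, not something stated in the announcement.

```python
# Minimal sketch, not from the announcement: stream a sample of the
# released RedPajama pre-training corpus from the Hugging Face Hub.
# The dataset ID below is an assumption and may differ from the release.
from datasets import load_dataset

# streaming=True avoids downloading the full corpus up front.
dataset = load_dataset(
    "togethercomputer/RedPajama-Data-1T-Sample",
    split="train",
    streaming=True,
)

# Inspect a few records; each is assumed to carry a "text" field.
for i, record in enumerate(dataset):
    print(record["text"][:200])
    if i >= 2:
        break
```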