#RedPajama

PressMind Labs (@pressmind)
2025-12-18

Adobe in trouble: lawsuit over the use of pirated books in AI

Can "ethical" AI be built on other people's books? Adobe is painfully finding out what a shortcut through someone else's library costs.

Read more:
pressmind.org/adobe-w-tarapata

Illustration showing a courtroom with books and an AI-themed screen.
2025-02-11
The #RedPajama #LLM is so painfully close to being truly #OpenSource. Just a few tweaks needed:
- Dropping CommonCrawl/C4 entirely
- Fixing the Gutenberg crawler to stick to public domain books
- Filtering arXiv to return only CC-By(-SA) papers
huggingface.co/datasets/togeth…
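The third tweak above, restricting arXiv to CC-BY and CC-BY-SA papers, amounts to a simple license filter over the paper metadata. A minimal sketch in Python; the `license` field name and the record layout are illustrative assumptions, not the actual RedPajama metadata schema:

```python
# Hypothetical sketch: keep only arXiv records whose license is CC-BY or
# CC-BY-SA. The "license" key and its URL values are assumptions for
# illustration, not RedPajama's real metadata fields.

ALLOWED_LICENSES = {
    "http://creativecommons.org/licenses/by/4.0/",
    "http://creativecommons.org/licenses/by-sa/4.0/",
}

def filter_cc_by(records):
    """Yield only records carrying a CC-BY or CC-BY-SA license URL."""
    for rec in records:
        if rec.get("license") in ALLOWED_LICENSES:
            yield rec

papers = [
    {"id": "2304.00001", "license": "http://creativecommons.org/licenses/by/4.0/"},
    {"id": "2304.00002", "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"},
    {"id": "2304.00003", "license": "http://creativecommons.org/licenses/by-sa/4.0/"},
]

kept = [p["id"] for p in filter_cc_by(papers)]
print(kept)  # ['2304.00001', '2304.00003']
```

In practice the filter would run over the dataset's per-document metadata rather than an in-memory list, but the allow-list logic is the same.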
GripNews (@GripNews)
2023-10-31

🌘 RedPajama-Data-v2: an open dataset of 30 trillion tokens for training large language models - Together AI
➤ RedPajama-Data-v2: 30 trillion tokens of open data for training large language models
together.ai/blog/redpajama-dat
RedPajama-Data-v2 is a dataset of 30 trillion tokens drawn from 84 CommonCrawl dumps, covering 5 languages, and shipped with more than 40 pre-computed data quality annotations that can be used for further filtering and weighting. It is the largest public dataset released specifically for LLM training to date.
+ The dataset is very useful for training language models, providing a large amount of high-quality data.
+ A great resource, highly valuable for anyone researching and developing language models.
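The pre-computed quality annotations described above are meant for downstream filtering and weighting. A minimal sketch of that idea, assuming placeholder signal names ("word_count", "perplexity") rather than RedPajama-Data-v2's actual annotation keys:

```python
# Minimal sketch of filtering and weighting documents by pre-computed quality
# annotations. The field names "word_count" and "perplexity" are illustrative
# placeholders, not RedPajama-Data-v2's real quality-signal names.

def keep(doc, max_ppl=500.0, min_words=50):
    """Filter rule: drop very short or very high-perplexity documents."""
    return doc["word_count"] >= min_words and doc["perplexity"] <= max_ppl

def weight(doc):
    """Sampling weight: prefer lower-perplexity (more fluent) text."""
    return 1.0 / (1.0 + doc["perplexity"] / 100.0)

docs = [
    {"text": "...", "word_count": 820, "perplexity": 120.0},
    {"text": "...", "word_count": 12,  "perplexity": 90.0},    # too short
    {"text": "...", "word_count": 300, "perplexity": 2400.0},  # too noisy
]

filtered = [d for d in docs if keep(d)]
print(len(filtered))                  # 1
print(round(weight(filtered[0]), 3))  # 0.455
```

Because the annotations ship with the data, experiments with different thresholds or weighting schemes do not require recomputing anything over the 30 trillion tokens.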

Tero Keski-Valkama (@tero@rukii.net)
2023-05-06

Releasing 3B and 7B #RedPajama-#INCITE family of models including base, instruction-tuned & chat models — #TOGETHER

"The biggest takeaway is the demonstration that performant #LLMs can be built quickly by the open-source community. This work builds on top of our 1.2 trillion token RedPajama dataset, EleutherAI’s #Pythia training code, #FlashAttention from #Stanford and #Together, the #HELM benchmarks from Stanford #CRFM and generous support from #MILA, #EleutherAI & #LAION for compute time on the #Summit #supercomputer within the INCITE program award 'Scalable Foundation Models for Transferable Generalist AI'. We believe these kind of open collaborations, at larger scales, will be behind the best #AI systems of the future. "

together.xyz/blog/redpajama-mo

verwaltungslabor.digital (@vrwltngslbr@norden.social)
2023-05-04

@MoKaKi How nice it would be if open-source language models or German companies could also be taken into account, or even prioritized #AlephAlpha #DeepL #RedPajama

Erik Jonker (@ErikJonker)
2023-04-21

Positive that open-source LLMs and AI like StableLM and RedPajama are gaining traction. Really important as alternatives to the completely closed and non-transparent solutions from Microsoft, Google and OpenAI.
github.com/stability-AI/stable
together.xyz/blog/redpajama

2023-04-18

@survey I'm excited about Large Language Models and open source. This isn't the best example, but #RedPajama: news.ycombinator.com/item?id=3

2023-04-18

NEW #LLaMA Rebuilt From Scratch - FULL #OpenSource

youtube.com/watch?v=uF86vcwM6J

A video for everyone who is too lazy to read the announcement themselves (like me lol).

together.xyz/blog/redpajama

#AI #LLM #GPT4 #Together #RedPajama

Lambert Heller (@lambo@libranet.de)
2023-04-17

Very interesting claims from #RedPajama. It seems they are about to build a competitive LLM from scratch, with everything to train these models fully reproducibly, from open training data. If true, highly relevant for FAIR research on / with LLMs.

"The most capable foundation models today are closed behind commercial APIs, which limits research, customization, and their use with sensitive data. Fully open-source models hold the promise of removing these limitations, if the open community can close the quality gap between open and closed models. Recently, there has been much progress along this front. In many ways, AI is having its Linux moment. Stable Diffusion showed that open-source can not only rival the quality of commercial offerings like DALL-E but can also lead to incredible creativity from broad participation by communities around the world. A similar movement has now begun around large language models with the recent release of semi-open models like LLaMA, Alpaca, Vicuna, and Koala; as well as fully-open models like Pythia, OpenChatKit, Open Assistant and Dolly.

We are launching RedPajama, an effort to produce a reproducible, fully-open, leading language model. RedPajama is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute. RedPajama has three key components:

Pre-training data, which needs to be both high quality and have broad coverage

Base models, which are trained at scale on this data

Instruction tuning data and models, which improve the base model to make it usable and safe

Today, we are releasing the first component, pre-training data."

Source: www.together.xyz/blog/redpajam…
