HUGE dataset released for open source use

RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models — Together AI

30T tokens total, 20.5T of them in English, allegedly high quality. Can't wait to see people start putting it to use!
Related GitHub repo: https://github.com/togethercomputer/RedPajama-Data
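For anyone who wants to poke at it without committing to a multi-terabyte download, here's a minimal sketch using Hugging Face `datasets` in streaming mode. The HF repo id, the `sample` config name, and the `raw_content` field name are my assumptions from memory of the dataset card, so verify them against the GitHub README before relying on this:

```python
# Minimal sketch: stream a small slice of RedPajama-Data-v2 rather than
# downloading everything up front. Config name and field names below are
# assumptions -- check the dataset card / repo README for the exact ones.
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",  # assumed HF repo id
    name="sample",                          # assumed small sample config
    split="train",
    streaming=True,                         # iterate lazily, no full download
    trust_remote_code=True,                 # dataset uses a loading script
)

# Peek at the first few documents
for i, doc in enumerate(ds):
    print(doc["raw_content"][:200])         # field name assumed from dataset card
    if i >= 2:
        break
```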
Looks like the deduped dataset is about 5T tokens. Nothing to sneeze at, for sure.
I thought they claimed the deduped dataset was the 20.5T number; where did you see 5T? Either way, that would still be awesome, especially given the theory that model quality is mostly limited by data, and Llama 2 was trained on just 2T tokens. This could be huge.
Maybe I misread it, but this was the source of the 5T remark:
https://news.ycombinator.com/item?id=38077521#38080442