HUGE dataset released for open source use

together.ai
RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models — Together AI

30T tokens, 20.5T of them in English, and allegedly high quality. Can't wait to see people start putting it to use!
Related GitHub repo: https://github.com/togethercomputer/RedPajama-Data