RedPajama v2 Open Dataset with 30T Tokens for Training LLMs

together.ai
RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models — Together AI

There is a discussion on Hacker News, but feel free to comment here as well.