AI2 Releases Dolma: Largest Open-Source Text Dataset, Promoting Transparency in AI Research

Aug 22, 2023


The Allen Institute for Artificial Intelligence (AI2) has released Dolma, described at release as the largest open text dataset to date. The dataset contains 3 trillion tokens and will serve as the foundation for OLMo, AI2's open language model. By documenting the dataset in more depth than basic statistics, AI2 aims to promote transparency in AI research and development. The institute wants to counter the opacity of the datasets used by major companies such as OpenAI and Meta by disclosing how its data was obtained and processed.


AI2 presents the release as a milestone for open AI research and development. At 3 trillion tokens, Dolma is the corpus on which OLMo, AI2's open language model, will be built. The release responds to concerns about how little major companies disclose about their training data. AI2 believes that opening both the data and the model will streamline external research and improve accountability.

Main Points:

– The Dolma dataset released by AI2 contains 3 trillion tokens and was described at release as the largest open text dataset to date.
– The dataset incorporates various sources such as web pages, academic publications, books, encyclopedias, and code.
– Personal information was removed during preprocessing to protect privacy, and individuals can request the removal of their personal information from the dataset.
– AI2 aims to counter the opacity of datasets used by major companies by providing maximal transparency into how the data was obtained and processed.
– The release of Dolma is part of AI2’s efforts to build OLMo, an open natural language model, in a transparent and open manner.
– The openness of the dataset and model will empower researchers to build ethical alternatives and inspect Dolma’s contents.


The release of the Dolma dataset by AI2 marks a significant step towards transparency in AI research and development. With its 3 trillion tokens, the dataset is a valuable resource for developing large language models and generative AI tools. AI2's goal is to address concerns about the opacity of datasets used by major companies by disclosing its data sources and processing methods. By opening both the data and the model, AI2 hopes to streamline external research and improve accountability in AI development.