As far as we know, what does GPT-4's training data look like?
Problem
I was asking ChatGPT local history questions because I knew many of the answers and could test the bot. But a lot of the details were wrong, and while it can produce citations, it won't disclose its training data, i.e., I can't see its blind spots.
Solution
OpenAI does not fully disclose what the training set was. This is a highly competitive space, and they likely view these details as part of their competitive advantage. What is known about GPT-3 and GPT-4 is disclosed in papers from the authors.
The GPT-4 paper says:
"GPT-4 is a Transformer-style model [39] pre-trained to predict the next token in a document, using both publicly
available data (such as internet data) and data licensed from third-party providers."
GPT-4 Technical Report, OpenAI, arXiv:2303.08774.
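To make "predict the next token" concrete, here is a minimal sketch using the openly released GPT-2 via the Hugging Face transformers library. GPT-4's weights are not public, so GPT-2 stands in here; the objective (next-token prediction) is the same one the quote describes.

    # Sketch of next-token prediction, the pre-training objective described
    # in the GPT-4 quote above. Uses the openly released GPT-2, since
    # GPT-4's weights are not public.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    # The distribution over the *next* token comes from the last position.
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(next_token_probs, k=5)
    for prob, token_id in zip(top.values, top.indices):
        print(f"{tokenizer.decode([token_id.item()]):>10s}  p={prob:.3f}")

Everything the model "knows" about a fact like this comes from how often, and how consistently, it appeared in the training text, which is why the composition of that text matters.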
The paper on GPT-3 contains more details:
"Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset [RSR+19] constituting nearly a trillion words."
Language Models are Few-Shot Learners, Tom Brown et al., arXiv:2005.14165.
Table 2.2 goes on to list all of the datasets they used: a filtered Common Crawl, WebText2, two book corpora, and English Wikipedia.
To learn more about these training datasets, you can read up on Common Crawl, or follow the references in the GPT-3 paper.
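If you want to see what web-scraped training text actually looks like, one option is to stream a few records from C4, a cleaned Common Crawl snapshot available on the Hugging Face Hub. To be clear, this is an assumption for illustration: the GPT papers do not say C4 itself was used, but it is a representative, openly available Common Crawl derivative.

    # Stream a few records from C4 (a cleaned Common Crawl snapshot) to get
    # a feel for web-scraped training text. Note: C4 is a stand-in here;
    # the GPT papers do not state that C4 itself was used.
    from datasets import load_dataset

    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
    for i, record in enumerate(c4):
        print(record["url"])
        print(record["text"][:200], "...\n")
        if i == 2:
            break

Streaming avoids downloading the full corpus, which runs to hundreds of gigabytes; even a few records make it obvious how noisy and uneven raw web text is, and why a model trained on it has blind spots.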
Don't be surprised if ChatGPT gets some facts wrong; that is expected behavior given the current technology. This is known as the problem of hallucination, and it is an ongoing challenge to address.
Context
StackExchange Computer Science Q#159361, answer score: 3