Washington Post Analysis of the Content in Google’s ‘C4’ LLM Training Data Set

The Washington Post:

Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.

To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.)

Scroll to the bottom and they have a tool that lets you search for the ranking of a particular website. Daring Fireball is #24,293; Kottke.org is right behind at #25,310; Six Colors is #38,783; Stratechery is #57,283. MacRumors is way up at #761, and the MacRumors forums are even higher at #45; AppleInsider’s forums are at #211. The New York Times is #4, and The Washington Post itself is #11.
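The C4 corpus itself is public, if you’d rather poke at it directly. Here’s a minimal sketch, assuming the allenai/c4 mirror on Hugging Face and its `datasets` library, that streams a slice of the corpus and tallies documents per domain. (The Post ranked sites by their share of tokens, so a per-document tally is only a rough stand-in.)

```python
from collections import Counter
from urllib.parse import urlparse

from datasets import load_dataset  # pip install datasets

# Stream the public allenai/c4 mirror on Hugging Face (an assumption:
# that this mirror matches the snapshot the Post analyzed) rather than
# downloading the full multi-hundred-gigabyte corpus.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

domain_counts = Counter()
for i, record in enumerate(c4):
    if i >= 100_000:  # sample a slice; the full split has hundreds of millions of documents
        break
    # Each record carries "url", "text", and "timestamp" fields.
    domain = urlparse(record["url"]).netloc.lower()
    domain = domain.removeprefix("www.")  # requires Python 3.9+
    domain_counts[domain] += 1

# Per-document counts, as a rough approximation of the Post's
# token-share ranking.
for domain, n in domain_counts.most_common(20):
    print(f"{domain}\t{n}")
```

Keep in mind that a streamed prefix isn’t a uniform sample, so the counts skew toward whichever shards come first; the Post’s lookup tool remains the better way to check a specific site.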

Wednesday, 19 April 2023