Artificial intelligence (AI) models are trained on web pages that include biased and sometimes racist information, as well as copyrighted content. This is the main conclusion of an investigation by The Washington Post, which analyzed several data sets used to train AI.
Specifically, it focused on Google’s Colossal Clean Crawled Corpus (C4) data set, which contains content from 15 million websites and is used to train some high-profile AI models, such as Google’s T5 and Facebook’s LLaMA.
In collaboration with researchers at the Allen Institute for AI, The Washington Post categorized these web pages using Similarweb and found that about a third of them could not be classified because they did not appear on the internet.
After this filtering, the newspaper ranked the remaining 10 million websites by the number of tokens (fragments of text used to process information) that each one contributed to the data set.
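The ranking step described above can be sketched roughly as follows. This is only an illustration, not the method the newspaper or the Allen Institute used: the whitespace tokenizer is a stand-in for the subword tokenizers real pipelines rely on, and the function names and sample domains are hypothetical.

```python
from collections import Counter


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real subword tokenizer.
    return len(text.split())


def rank_sites_by_tokens(pages):
    """Sum token counts per domain and return (domain, tokens) pairs, largest first.

    `pages` is an iterable of (domain, text) tuples.
    """
    totals = Counter()
    for domain, text in pages:
        totals[domain] += count_tokens(text)
    return totals.most_common()


# Hypothetical sample data, not taken from C4.
sample_pages = [
    ("example-news.test", "markets rose sharply today"),
    ("example-blog.test", "hello world"),
    ("example-news.test", "rain expected tomorrow"),
]
ranking = rank_sites_by_tokens(sample_pages)
for domain, tokens in ranking:
    print(domain, tokens)
```

Summing per domain before sorting means a site with many short pages can outrank a site with a single long one.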
The newspaper found that most of these websites belonged to sectors such as business, industry, technology, news, art, entertainment, content creation, software development, science and health.
Confidential data and protected works
According to the investigation, some of these sites gave AI access to sensitive user data. This is the case with Kickstarter and Patreon, which expose artists’ ideas to the technology, raising concerns that it could turn their work into suggestions for users.
In this context, the newspaper recalled the ongoing copyright disputes over such works, including the class-action lawsuit that a group of artists filed last January against three digital art companies (Stability AI, DeviantArt and Midjourney), alleging copyright infringement in the creation of artworks with the Stable Diffusion tool.
Biased information
The newspaper has also warned that these AI models are trained with chatbots that share biased information, which could lead to the spread of prejudice, propaganda and misinformation without users being able to access the original source.
Researchers have also examined the religious content that the AI is trained on, determining that of the top 20 religious websites, 14 are Christian, two Jewish, one Muslim, one Jehovah’s Witness, and one Mormon.
As an example of the kind of information these pages offer, the newspaper points to the site of the Californian evangelical church Grace To You, which recently advised women to continue submitting to their abusive fathers and husbands and to avoid reporting them to the authorities.
Regarding Islam, The Washington Post also highlighted bias in some language models, citing a study published in Nature that found that ChatGPT completed the sentence “Two Muslims entered a…” with acts of violence in 66% of cases.
Previous filters
The newspaper also notes that Google heavily filtered the data before feeding it to the AI, removing duplicate text and profanity. It added that companies use high-quality data sets to fine-tune these models in order to protect users from unwanted content.
The filtering also removed content matching a blocklist, such as racial slurs or obscenities. However, it does not handle non-sexual LGBTQ content properly and sometimes lets pornographic content and Nazi symbols through.
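As a rough illustration of the two filtering steps mentioned above (removing duplicate text and dropping content that matches a blocklist), here is a minimal sketch. It is not Google’s actual pipeline: the function name, the normalization rule and the placeholder blocklist entry are all assumptions made for the example.

```python
import re


def filter_documents(docs, blocklist):
    """Drop exact duplicates (after normalization) and documents with blocklisted words."""
    seen = set()
    kept = []
    for doc in docs:
        # Assumption: lowercasing and collapsing whitespace is enough to detect duplicates.
        normalized = " ".join(doc.lower().split())
        if normalized in seen:
            continue
        seen.add(normalized)
        # Split into words so that punctuation does not hide a blocklisted term.
        words = set(re.findall(r"\w+", normalized))
        if words & blocklist:
            continue
        kept.append(doc)
    return kept


# Hypothetical blocklist and documents, for illustration only.
BLOCKLIST = {"badword"}
docs = [
    "A clean page.",
    "a  clean page.",
    "This has a badword in it.",
    "Another page.",
]
kept = filter_documents(docs, BLOCKLIST)
print(kept)
```

A word-level blocklist like this is blunt: it discards any page containing a listed term regardless of context, and it lets through anything that is not on the list.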
The Washington Post clarifies that Google’s C4 began with data collected in April 2019 in conjunction with the nonprofit organization CommonCrawl, which says it tries to prioritize the most important and reputable sites but does not try to avoid licensed or copyrighted content.