AI companies train language models on YouTube’s archive – making family-and-friends videos a privacy risk

Ryan McGrady, UMass Amherst and Ethan Zuckerman, UMass Amherst

27 June 2024 at 8:23 am

Your kid's silly video could be fodder for ChatGPT. Halfpoint/iStock via Getty Images

The promised artificial intelligence revolution requires data. Lots and lots of data. OpenAI and Google have begun using YouTube videos to train their text-based AI models. But what does the YouTube archive actually include?

Our team of digital media researchers at the University of Massachusetts Amherst collected and analyzed random samples of YouTube videos to learn more about that archive. We published an 85-page paper about that dataset and set up a website called TubeStats for researchers and journalists who need basic information about YouTube.

Now, we’re taking a closer look at some of our more surprising findings to better understand how these obscure videos might become part of powerful AI systems. We’ve found that many YouTube videos are meant for personal use or for small groups of people, and a significant proportion were created by children who appear to be under 13.

Bulk of the YouTube iceberg

Most people’s experience of YouTube is algorithmically curated: Up to 70% of the videos users watch are recommended by the site’s algorithms. Recommended videos are typically popular content such as influencer stunts, news clips, explainer videos, travel vlogs and video game reviews, while content that is not recommended languishes in obscurity.

Some YouTube content emulates popular creators or fits into established genres, but much of it is personal: family celebrations, selfies set to music, homework assignments, video game clips without context and kids dancing. The obscure side of YouTube – the vast majority of the estimated 14.8 billion videos created and uploaded to the platform – is poorly understood.

Illuminating this aspect of YouTube – and social media generally – is difficult because big tech companies have become increasingly hostile to researchers.

We’ve found that many videos on YouTube were never meant to be shared widely. We documented thousands of short, personal videos that have few views but high engagement – likes and comments – implying a small but highly engaged audience. These were clearly meant for a small audience of friends and family. Such social uses of YouTube contrast with videos that try to maximize their audience, suggesting another way to use YouTube: as a video-centered social network for small groups.

Other videos seem intended for a different kind of small, fixed audience: recorded classes from pandemic-era virtual instruction, school board meetings and work meetings. While not what most people think of as social uses, they likewise imply that their creators have a different expectation about the audience for the videos than creators of the kind of content people see in their recommendations.

Fuel for the AI machine

It was with this broader understanding that we read The New York Times exposé on how OpenAI and Google turned to YouTube in a race to find new troves of data to train their large language models. An archive of YouTube transcripts makes an extraordinary dataset for text-based models.

There is also speculation, fueled in part by an evasive answer from OpenAI’s chief technology officer Mira Murati, that the videos themselves could be used to train AI text-to-video models such as OpenAI’s Sora.

The New York Times story raised concerns about YouTube’s terms of service and, of course, the copyright issues that pervade much of the debate about AI. But there’s another problem: How could anyone know what an archive of more than 14 billion videos, uploaded by people all over the world, actually contains? It’s not entirely clear that Google knows or even could know if it wanted to.

Kids as content creators

We were surprised to find an unsettling number of videos featuring kids or apparently created by them. YouTube requires uploaders to be at least 13 years old, but we frequently saw children who appeared to be much younger than that, typically dancing, singing or playing video games.

In our preliminary research, our coders determined nearly a fifth of random videos with at least one person’s face visible likely included someone under 13. We didn’t take into account videos that were clearly shot with the consent of a parent or guardian.

Our current sample size of 250 is relatively small – we are working on coding a much larger sample – but the findings thus far are consistent with what we’ve seen in the past. We’re not aiming to scold Google. Age validation on the internet is infamously difficult and fraught, and we have no way of determining whether these videos were uploaded with the consent of a parent or guardian. But we want to underscore what is being ingested by these large companies’ AI models.

Small reach, big influence

It’s tempting to assume OpenAI is using highly produced influencer videos or TV newscasts posted to the platform to train its models, but previous research on large language model training data shows that the most popular content is not always the most influential in training AI models. A virtually unwatched conversation between three friends could have much more linguistic value in training a chatbot language model than a music video with millions of views.

Unfortunately, OpenAI and other AI companies are quite opaque about their training materials: They don’t specify what goes in and what doesn’t. Most of the time, researchers can infer problems with training data through biases in AI systems’ output. But when we do get a glimpse at training data, there’s often cause for concern. For example, Human Rights Watch released a report on June 10, 2024, that showed that a popular training dataset includes many photos of identifiable kids.

The history of big tech self-regulation is filled with moving goal posts. OpenAI in particular is notorious for asking for forgiveness rather than permission and has faced increasing criticism for putting profit over safety.

Concerns over the use of user-generated content for training AI models typically center on intellectual property, but there are also privacy issues. YouTube is a vast, unwieldy archive, impossible to fully review.

Models trained on a subset of professionally produced videos could conceivably be an AI company’s first training corpus. But without strong policies in place, any company that ingests more than the popular tip of the iceberg is likely including content that violates the Federal Trade Commission’s Children’s Online Privacy Protection Rule, which prevents companies from collecting data from children under 13 without notice.

With last year’s executive order on AI and at least one promising proposal on the table for comprehensive privacy legislation, there are signs that legal protections for user data in the U.S. might become more robust.

Have you unwittingly helped train ChatGPT?

The intentions of a YouTube uploader simply aren’t as consistent or predictable as those of someone publishing a book, writing an article for a magazine or displaying a painting in a gallery. But even if YouTube’s algorithm ignores your upload and it never gets more than a couple of views, it may be used to train models like ChatGPT and Gemini.

As far as AI is concerned, your family reunion video may be just as important as those uploaded by influencer giant MrBeast or CNN.

This article is republished from The Conversation, a nonprofit, independent news organization bringing you facts and trustworthy analysis to help you make sense of our complex world. It was written by: Ryan McGrady, UMass Amherst and Ethan Zuckerman, UMass Amherst

My work - and the work we refer to in this article - is supported by the MacArthur Foundation, the Ford Foundation, the Knight Foundation and the National Science Foundation. I am on the board of several nonprofit organizations, including Global Voices, but none are directly connected to politics.

Ryan McGrady does not work for, consult, own shares in or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.
