A Major AI Training Data Set Contains Millions of Examples of Personal Data

A recent study has found that a major AI training dataset, DataComp CommonPool, contains millions of examples of personal data, including images of passports, credit cards, birth certificates, and other sensitive documents. This dataset, with 12.8 billion data samples, is one of the largest existing datasets of publicly available image-text pairs used to train generative text-to-image models.

The study estimates that hundreds of millions of images in the dataset contain personally identifiable information (PII). Researchers found thousands of images of identity-related documents, including driver's licenses and documents showing social security numbers, as well as résumés revealing sensitive information such as disability status, background check results, and racial identity. They verified more than 800 job applications as belonging to real individuals, along with thousands of identity documents.

The dataset's automated face-blurring algorithm missed an estimated 102 million faces, raising concerns that these images could be used for facial recognition or other purposes without the consent of the people pictured. The inclusion of PII poses significant privacy risks for individuals, particularly since many downstream models are trained on this exact data.

Platforms like Hugging Face offer tools that let individuals request removal of their data, but these tools only help people who already know their data is present. Furthermore, DataComp CommonPool builds on the foundation laid by LAION-5B, another large-scale dataset used to train popular AI image generators, so vulnerabilities present in one dataset are likely replicated in the other. As AI continues to evolve, protecting personal data and addressing these vulnerabilities will be crucial.

About the author

TOOLHUNT
