    Why AI Is Training on Its Own Garbage (and How to Fix It)

By Awais · April 8, 2026 · 7 min read

If you've been around AI for a while, you're probably an LLM, agent, or chat user. But have you ever asked yourself how these tools will be trained in the near future? What happens if we have already used up the data we need to train models? Many analyses suggest that we are running out of high-quality, human-generated data.

New content does go up every day, but an increasing share of what gets added is itself AI-generated. So if you keep training on public web data, you are eventually training on the outputs of your own predecessors: the snake eating its tail. Researchers call this phenomenon Model Collapse, where AI models learn from the errors of their predecessors until the whole system degrades into nonsense.
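To build some intuition, here is a tiny, self-contained simulation of that feedback loop. It is purely illustrative: the "model" is just a Gaussian fit, and the assumption that each generation under-samples its own tails (modeled here as rejecting outputs beyond two standard deviations) is my stand-in for how generative models smooth out rare data.

```python
import random
import statistics

def fit(samples):
    # "Train" a model: estimate the mean and spread of the data.
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, std, n, rng):
    # "Publish" model outputs: the model under-represents its own tails,
    # so anything beyond 2 standard deviations is rejected.
    out = []
    while len(out) < n:
        x = rng.gauss(mean, std)
        if abs(x - mean) <= 2 * std:
            out.append(x)
    return out

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # "human-generated" data

spreads = []
for generation in range(8):
    mean, std = fit(data)
    spreads.append(std)
    data = generate(mean, std, 2000, rng)  # the next model trains on these

print([round(s, 3) for s in spreads])  # the spread shrinks every generation
```

After a handful of generations the estimated spread has collapsed well below the original: the rare, interesting tails of the human data are the first thing to disappear.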

    But what if I told you we aren’t actually running out of data? We’ve just been looking in the wrong place.

In this article, I am going to break down the key insights from a recent paper that proposes a way out.

The Web We Already Use and the Web That Matters

Most of us think of the web as a single source of information. In reality, there are at least two.

    There is the Surface Web: the indexed, public world like what we find on Reddit, Wikipedia, and news sites. This is what we’ve already scraped and overused for years to train the mainstream AI models of today. Then, there is what we call the Deep Web, and here I’m not talking about the “Dark Web” or anything illegal.

    The Deep Web is simply everything behind a login or a firewall. It refers to anything online that isn’t publicly indexed. It could be your hospital’s patient portal, your bank’s internal dashboard, enterprise document archives, private databases, and years of email sitting behind a login screen. Normal, boring, but incredibly valuable data.

Many studies suggest the Deep Web is orders of magnitude larger than the Surface Web. More importantly, its data is of substantially higher quality. Surface web content can be noisy, full of misinformation, and aggressively SEO-optimized; increasingly, it also contains content deliberately designed to mislead or poison AI models. Deep Web data, like medical records, verified financial documents, and other internal databases, tends to be clean, authenticated, and organized by people who care about its quality.

The problem? You can probably guess it: this data is private. You can't just extract a million medical records without causing a legal and ethical catastrophe.

    The PROPS Framework

    This is where a new framework called PROPS (Protected Pipelines) comes in. Introduced by Ari Juels (Cornell Tech), Farinaz Koushanfar (UCSD), and Laurence Moroney (former Google AI Lead), PROPS acts as a bridge between this sensitive data and the AI models that need it.

    The brilliance of PROPS is that it doesn’t ask you to “hand over” your data. Instead, it uses Privacy-Preserving Oracles. Think of an oracle as a “trusted middleman” that can look at your data, verify it’s real, and then tell the AI model what it needs to know without ever showing the model the raw information.

These ideas can sound almost magical, since they would solve many of the data-availability problems AI models face today. But how does this work exactly? Let's take the example of a medical company that wants to train a diagnostic tool on real health records. Under the PROPS framework:

    1. Permission: As a user, you log into your own health portal and authorize a specific use for your data.
    2. The Oracle: Think of the Oracle as a digital notary. It goes to your private portal (like your hospital database) to verify that your data is real. Instead of copying your files, it simply tells the AI system: “I have seen the original documents, and I testify they are authentic.” It provides proof of the truth without ever handing over the private data itself. Tools already exist for this, like DECO. It’s a protocol that lets users prove that they pulled a specific piece of data from a web server over a secure TLS channel.
    3. The Secure Enclave: This is a “black box” inside the computer’s hardware where the actual training happens. We put the AI model and your private data inside and “lock the door.” No human or developer can see what is happening inside. The AI “studies” the data and leaves with only the model weights. The raw data stays locked inside until the session is over.
    4. The Result: The model trains on the data inside that box. Only the updated “weights” (the learning) come out. The raw data is never seen by human eyes.
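The four steps above can be sketched in code. Everything here is hypothetical: the class names, the `oracle_verify` and `enclave_train` functions, and the "gradient step" are stand-ins I invented. A real deployment would use a protocol like DECO for the attestation and a hardware TEE for the training.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Attestation:
    """Step 2: the oracle's 'digital notary' statement. It commits to the
    data via a hash without revealing the data itself."""
    record_hash: str
    source: str
    authentic: bool

def oracle_verify(record: bytes, source: str) -> Attestation:
    # A real oracle (e.g. DECO) would prove the record came from this
    # source over TLS; here we simply simulate the attestation.
    return Attestation(hashlib.sha256(record).hexdigest(), source, True)

def enclave_train(weights: list[float], record: bytes,
                  att: Attestation) -> list[float]:
    """Steps 3-4: training runs 'inside the box'. Only updated weights
    leave; the raw record never does."""
    if att.record_hash != hashlib.sha256(record).hexdigest():
        raise ValueError("record does not match attestation")
    signal = len(record) % 7 / 10.0       # toy stand-in for a gradient step
    return [w + signal for w in weights]

# Step 1: the user authorizes one specific use of one record.
record = b"glucose=5.4;hba1c=38"          # stays private
att = oracle_verify(record, "hospital-portal.example")
new_weights = enclave_train([0.0, 0.0], record, att)
print(att.authentic, new_weights)         # only weights + proof come out
```

The key property to notice: the caller on the outside ends the session holding an attestation and updated weights, and nothing else.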

    The contributor knows exactly what they’re agreeing to, and they can be rewarded for participating in a way that’s calibrated to how valuable their specific data actually is. It’s a genuinely different relationship between data owners and AI systems.

    But why bother with this instead of Synthetic Data?

    Some might ask: “Why bother with this complex setup when we can just generate synthetic data?”

    The answer is that synthetic data is a diversity killer. By definition, synthetic data generation reinforces the middle of the bell curve. If you have a rare medical condition that affects only 0.01% of the population, a synthetic data generator will likely smooth you out as “noise.”

Models trained on synthetic data become progressively worse at serving outliers. PROPS solves this by creating a secure way for real people with rare conditions or unique backgrounds to opt in. It turns data sharing from a privacy risk into a data marketplace where valuable data gets the compensation it deserves.
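Here is a toy illustration of the "diversity killer" effect, under one assumption I'm making for the example: that a generator effectively drops classes below some probability floor (analogous to top-p truncation). The `min_prob` value and the class names are invented.

```python
from collections import Counter

def fit_and_sample(counts: Counter, n: int, min_prob: float = 0.02) -> Counter:
    # A toy "generator": fit class frequencies, then resample proportionally,
    # but drop anything below min_prob, mimicking how generative models
    # under-sample the tails of their training distribution.
    total = sum(counts.values())
    kept = {c: k / total for c, k in counts.items() if k / total >= min_prob}
    z = sum(kept.values())
    return Counter({c: round(n * p / z) for c, p in kept.items()})

# Real population: a rare condition at 1% prevalence.
real = Counter({"healthy": 9900, "rare_condition": 100})

synthetic = real
for generation in range(3):
    synthetic = fit_and_sample(synthetic, 10_000)

print(synthetic)  # the rare condition is gone after a single generation
```

One pass through the generator and the 1% class is below the floor, so every later generation is trained on a world where it never existed. That is exactly the population PROPS lets you reach directly.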

It’s not just about training: inference matters too

    Most discussions focus on training, but PROPS has an equally interesting application on the inference side.

For example, getting a loan today involves a lot of document submission: bank statements, pay stubs, and tax returns. In a PROPS-based system, the authors suggest using a Loan Decision Model (LDM):

    1. You authorize the LDM to talk directly to your bank.
    2. The bank confirms your balance via a privacy-preserving oracle.
    3. The LDM makes a decision.
    4. The result? The lender gets a verified “Yes” or “No” without ever touching your private documents. This eliminates the risk of data leaks and makes it nearly impossible for people to use fraudulent, photoshopped documents.
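The loan flow above can be sketched as follows. The `VerifiedClaim` structure, `bank_oracle`, and `loan_decision_model` are all hypothetical names; the point is only that claims, not documents, cross the trust boundary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VerifiedClaim:
    """What the oracle hands the lender: a yes/no answer to one predicate,
    never the underlying account data."""
    predicate: str
    holds: bool
    source: str

def bank_oracle(account: dict, field: str, threshold: int) -> VerifiedClaim:
    # The oracle evaluates the predicate against the bank's own records.
    return VerifiedClaim(f"{field} >= {threshold}",
                         account[field] >= threshold, "bank.example")

def loan_decision_model(claims: list[VerifiedClaim]) -> str:
    # The LDM decides from verified claims alone.
    return "approved" if all(c.holds for c in claims) else "declined"

# The account data stays at the bank; only claims cross the boundary.
account = {"balance": 12_000, "monthly_income": 4_500}
claims = [
    bank_oracle(account, "balance", 10_000),
    bank_oracle(account, "monthly_income", 3_000),
]
print(loan_decision_model(claims))  # -> approved, no documents shared
```

A forged PDF has no place in this flow: there is no document upload to forge, only a predicate the bank itself attests to.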

    What’s actually stopping this from happening in 2026?

    It simply comes down to scale and infrastructure.

    The most robust version of PROPS requires training to happen inside a hardware-backed secure enclave (like Intel SGX or NVIDIA’s H100 TEEs). These work well at a small scale, but getting them to work for the massive GPU clusters needed for frontier LLMs is still an open engineering problem. It requires massive clusters to work in perfect, encrypted sync.

    The researchers are clear: PROPS isn’t a finished product yet. It’s a persuasive proof-of-concept. However, a lighter-weight version is deployable today. Even without full hardware guarantees, you can build systems that give users meaningful assurance, which is already an improvement over asking someone to email you a PDF.

    My Own Final Thoughts

    PROPS isn’t really a “new” technology; it’s a new application of existing tools. Privacy-preserving oracles have been used in the blockchain and Web3 space (like Chainlink) for years. The insight here is recognizing that the same tools can solve the AI data crisis.

    The “data crisis” isn’t a lack of information; it’s a lack of trust. We have more than enough data to build the next generation of AI, but it’s locked behind the doors of the Deep Web. The snake doesn’t have to eat its tail; it just needs to find a better garden.

    👉 LinkedIn: Sabrine Bendimerad

    👉 Medium: https://medium.com/@sabrine.bendimerad1

    👉 Instagram: https://tinyurl.com/datailearn
