3 Key Risks for Data/AI Teams
The risk and regulatory landscape for AI is developing rapidly, particularly in the US. For data teams tasked with managing AI projects, these risks resemble the ones data companies have presented up to this point, but at a much larger scale. Web scraping has exploded into a constant cycle of content gathering and recycling, a feedback loop in which model-generated output degrades the quality of the very content being scraped. It's a strange time. Suddenly, new standards like llms.txt are competing with robots.txt, and billion-dollar tech companies are using BitTorrent and buying paper copies of books... In this post, we cover the top three issues in managing AI-specific risk for data teams, based on conversations with enterprise users, their vendors, and service providers.
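As a small, purely illustrative sketch of how crawler permissions are expressed today (the site URL and user agents below are hypothetical examples, not a recommendation), Python's standard-library robotparser can check whether a given AI crawler is allowed to fetch a page; llms.txt, by contrast, is still an emerging convention with no comparable enforcement mechanism.

```python
# Minimal sketch: check whether an AI crawler user agent is permitted by a
# site's robots.txt. The site URL and user agents below are illustrative.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # fetches and parses the file

for agent in ("GPTBot", "ClaudeBot", "*"):
    allowed = robots.can_fetch(agent, "https://example.com/research/report.html")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```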
1. PRODUCT QUALITY. While Glacier sees the world through a risk lens, we must acknowledge that the most significant risk remains the quality and performance of the popular AI tools. In other words, are these tools any good, and are they being adopted with enough scale and control to get better? Unfortunately, our experience shows that adoption is less rapid and less successful than much of the media reporting and speculation suggests. Enterprise users are more comfortable deploying AI first to speed up administrative work in compliance, reporting, communications, HR, and similar functions. But we are convinced the technology itself is improving rapidly and is here to stay, which is why the other two risks below matter.
Potential mitigations: There is no easy fix from a user perspective. Perhaps the best approach is to build a few recurring AI-related tasks into one's schedule so that progress can be evaluated continuously. Many firms are attempting to catalogue AI vendors proactively, ahead of any real commercial engagement.
2. DATA RIGHTS MANAGEMENT. Third-party data acquisition has shifted from a one-to-one process between buyer and seller (where a licensee negotiates for data collected from a discrete set of mostly fixed sources) to a more open-ended proposition. Research tools like Perplexity, Hebbia, enterprise ChatGPT, and even new browsers have made it nearly impossible to discern the underlying data sources, who owns them, or what rights attach to them. Unlike (mostly) static commercial data products, LLMs are engineered to grow through user IP and the endless incorporation of new sources into training. Conventional diligence, consisting of a dialogue between a user and a provider, may still satisfy certain legal requirements for regulated data users, but it will prove futile for practical purposes in many cases. Perhaps most significantly, data rights are a multi-directional problem: enterprise users remain largely unwilling to contribute their own data to model training, even as they want to use licensed data in new ways within their own environments.
Potential mitigations: A human layer of review remains essential. Models like ChatGPT, Mistral, and Claude have made strides in showing their sources through deep research features; however, we think data teams can also take steps to mitigate sourcing risk by using new tools to train models on-premises or via federated approaches (which can allow additional control over the elements of a training dataset). Data teams may eventually classify their own data according to sensitivity levels, with certain classes shielded entirely from any AI tools.
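A minimal sketch of what such a sensitivity gate might look like in practice (the tier names and datasets below are our own illustrative assumptions, not an established standard):

```python
# Minimal sketch of a sensitivity-tier gate for AI tools. Tier names and the
# datasets shown are illustrative assumptions, not an established standard.
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = 1      # may be sent to external AI tools
    INTERNAL = 2    # on-premises / federated models only
    RESTRICTED = 3  # shielded from any AI tool

AI_ELIGIBLE = {Sensitivity.PUBLIC, Sensitivity.INTERNAL}

datasets = {
    "vendor_price_feed": Sensitivity.PUBLIC,
    "research_notes": Sensitivity.INTERNAL,
    "client_positions": Sensitivity.RESTRICTED,
}

def can_use_with_ai(name: str) -> bool:
    """Return True if the dataset's tier permits use with any AI tool."""
    return datasets[name] in AI_ELIGIBLE

for name in datasets:
    print(name, "->", "eligible" if can_use_with_ai(name) else "shielded")
```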
3. COPYRIGHT. Recent US copyright decisions are a mixed bag for the major AI companies. Arguably these decisions track a reasonable application of existing law to the three distinct phases of data collection, processing, and distribution, with the first and third remaining the most controversial (whereas training itself has been found to be defensible). Data collection, though, is a multi-billion-dollar problem for companies like Anthropic, OpenAI, and Microsoft. Statutory damages under copyright law (which can reach $150,000 per work in cases of willful infringement) can erase billions in revenue when many thousands of works are copied.
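To make the scale concrete, a back-of-the-envelope calculation with purely illustrative inputs (not figures from any actual case):

```python
# Back-of-the-envelope statutory damages exposure. All inputs are illustrative.
works_copied = 10_000        # hypothetical number of infringed works
damages_per_work = 150_000   # statutory ceiling per work for willful infringement (USD)

exposure = works_copied * damages_per_work
print(f"Potential exposure: ${exposure:,}")  # -> Potential exposure: $1,500,000,000
```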
Potential mitigations: Services like Claude and ChatGPT have become better at suppressing and filtering outputs that reproduce copyrighted material. Users can also improve their own prompts. But the fundamental risk will remain with OpenAI and others until questions around data collection are resolved. (Many have pointed out that this is simply a massive licensing problem.)
Of course, there are other major risks to consider, such as cybersecurity threats from prompt injection attacks. We continue to believe that human review of data and AI tools across teams is essential to managing these risks.
Don D'Amico
Founder & CEO, Glacier Network
©2025 Glacier Network LLC d/b/a Glacier Risk (“Glacier”). This post has been prepared by Glacier for informational purposes and is not legal, tax, or investment advice. This post is not intended to create, and receipt of it does not constitute, a lawyer-client relationship. This post was written by Don D’Amico without the use of generative AI tools. Don is the Founder & CEO of Glacier, a data risk company providing services to users of external and alternative data. Visit www.glaciernetwork.co to learn more.