AI and extraction from the open scientific commons

open science

commons

Author

Zack Batist

Published

May 5, 2025

There’s a pretty clear tension between researchers’ obligations to publish their work (ideally, in a widely accessible manner) and the use of these materials to train AI. On the one hand, researchers produce knowledge for communal benefit, and on the other hand their work is being used for some pretty egregious purposes. A lot has already been written about this tension, but I’m not really satisfied with any explanations or recommendations I’ve read so far. Chief among my concerns is that an impasse is often contrived by framing it through technical, legal or procedural outlooks which take for granted several assumptions about the status quo. So I’ll jump in with my own take, as a scholar of scientific practice whose research is about the formation and management of information commons among scientists.

I don’t see this as a novel disruption imposed by the AI industry, but as a continuation of a broader social phenomenon. Specifically, I look at this in terms of shifting social norms with regard to the governance of knowledge commons. When researchers do their work, they produce, rely on, or otherwise participate in information commons. Moreover, their engagement with the commons is scaffolded by technical and administrative systems and by collaborative norms and expectations.

For instance, while citing sources, researchers will access prior work made available to them by their library or on the publisher’s website, they identify — based on norms established through their earlier education — when and how to cite sources, and if they fail to comply with these norms they are either corrected by reviewers, criticized for bad behaviour or formally sanctioned by their institutions on charges of plagiarism. Researchers expect to be able to access and cite any prior work produced in their field, and expect to be cited when their contributions are being used by others. Contributing to and extracting from the information commons of scholarly literature is therefore scaffolded by commitments to a collective enterprise, which constitute norms and expectations instilled through participation in a community of practice.

Moreover, it should be emphasized that commons of all kinds do not have to be, and usually are not, egalitarian. Communities devise norms that determine who can access them and in what ways. As such, the boundaries that effectively limit access are essentially social in nature, but are enforced through technical and legalistic mechanisms. In other words, access to a commons is deemed either acceptable or unacceptable based on the actors’ relationship to the commons, and more specifically whether they commit to the norms that govern access.

With regard to AI’s access to academic research outputs, a significant concern is that the commitments and values which govern access are heterogeneous and in flux, especially in light of the evolution of the open science movement. Although this is probably a simplification, I consider there to be two primary camps in open science. One is driven by the design of technical infrastructures and national policy mandates that facilitate and enforce information sharing. This camp holds a transactional vision of information sharing, and considers information as something that can be disembodied, recombined, and easily recontextualized. The other camp is more concerned with reforming science as a more inclusive humanistic enterprise. This means distributing material resources more equitably and removing barriers to participating as a scientist. Whereas the former considers the problems that science faces as technical and legalistic in nature, the latter recognizes the root social problems that underlie the mechanisms through which those problems are enforced.¹

The problem is that open infrastructures and policy mandates, which are largely devised by the technical camp, are incommensurate with how science actually works in practice, and impose new commitments for participating in the commons that are not valued by the communities that actually make these commons possible.² And to a greater extreme, they undermine the established norms that research communities have developed for themselves.³ Specifically, community norms, which previously served to establish the boundaries that govern who can access the commons and in what ways, have essentially been undermined through claims of universal access, or claims that there should be no boundaries whatsoever.

But this claim of removing boundaries is a happy lie we tell ourselves — even the most hardcore open bros expect their work to be cited, and often want their work to be used in a way that is commensurate with their intentions, and not misrepresented. These are very reasonable expectations, and most researchers will adhere to them due to their common upbringing within a community of practice that instilled these norms. But outside actors who are not familiar with this decorum, or who simply refuse to adhere to these rules, are acting out of line, at least in the minds of those who expect or who are trying to foster respectful forms of community engagement.

Determining who counts as an outsider or an insider does matter. However, these distinctions are slippery and shifting. Moreover, they are not really tied to an actor’s identity, but to the manner in which that actor engages with the community and its norms and values. It could be helpful to complicate a few common dichotomies in order to explore this further.

For example, we might ask when it is or is not acceptable to enforce copyright. Academics share PDFs of their papers all the time even when they have no legal right to do so, and at the same time react intensely when Meta engages in the same practices. However, these are not the same things: academic piracy is deemed good when it provides access to underserved community members, and bad when it is purely extractive.⁴ In other words, the reason why Aaron Swartz is lionized for his piracy is due to his service to the research community, whereas Meta is an outside and threatening actor in it for themselves and themselves alone. One is a community member who enhances the commons by contributing to it, the other is a notoriously predatory entity that extracts while providing little in return. Another more ambiguous example that lies between these extremes is the Internet Archive, whose work is informed by professional archival principles while also being driven by a tech-bro attitude regarding how to resolve fundamental questions concerning access to copyrighted work; in developing the National Emergency Library to grant access to reading material during the Covid-19 pandemic, they succumbed to a technical means of resolving an underlying social problem.

Another example is a scenario that happens from time to time, when computer scientists or physicists publish high-profile papers in Science or Nature that claim to “solve” archaeological problems using large and integrated datasets that they know nothing about. Any archaeologist who reads these works will intuitively reject them based on the fact that the authors take the data at face value, without understanding the complicated and storied histories of the datasets they rely on, which can only be truly understood through experience working as an archaeologist. And when archaeologists publish similar papers, with care and concern for evaluating the data as potentially mismatched with their intended analytical use-case, and while taking into account ethical and epistemological concerns, these works are considered more legitimate and are treated with kinder consideration. The difference is that in the latter scenario, archaeologists are engaging with the data in meaningful ways, and account for the decisions, actions and circumstances of the data’s creation, which they understand through their common upbringing as archaeologists. Computer scientists, on the other hand, view the data as neutral and abstract representations that can be mixed and matched with ease, which is appealing in that field. But any archaeologist knows that this is not true and that pretending that it is produces bad or wrong research outcomes.

Another emerging ambiguity is using AI locally and on-device, while respecting privacy, and while using it as a tool to address legitimate research questions. In other words, using AI within the bounds of established professional decorum, values and principles, as determined by a discipline or community of practice. Using AI for the sake of it, or without care for its impact on the work, is unacceptable due to the sense that this behaviour signals a disconnection with disciplinary norms and expectations for the sake of easy hype, which is typically celebrated by AI industry shills and marks you as serving their ends.

I’m not really sure how to end this post. It’s tempting to apply the label of extraction, but I don’t think I know enough about this emerging framework to make that connection. And I think that, at the same time, extraction is a bit of a misnomer, since it draws an insider/outsider dichotomy that I don’t believe is consistently warranted. We must also acknowledge that we did this, we put our data out there, we linked them, we made them FAIR.

But it does seem that the treatment of scholarly commons as materials that can be justifiably exploited is dependent on the notion of universal access. It brings to mind the idea of UNESCO World Heritage sites, which claim exotic built environments for the sake of all humanity, often at the expense of those who are made to live within and around them.⁵ When we make our data Findable, Accessible, Interoperable and Reusable, we imagine certain outcomes and use-cases, and we anticipate who we are making our data FAIR for. But it’s important to recognize that the infrastructures we rely on to share our data have distinct and overlapping goals. We need to ask what open science is for, and who it is for. In other words, we need to consider the scientific commons as a social commons, which includes expectations and boundaries.

Footnotes

1. This is evident in the mechanisms that each employs to try to resolve their target of concern: the former see publishing as the business of typesetting and copyright law, which they could hack to render moot using automated publishing workflows and by encouraging use of open licensing agreements; viewed as merely technical systems, these could be resolved through technical means. Whereas the latter tend to be concerned with experimental publishing and pushing the boundaries regarding what constitutes legitimate media for scholarly communication, collaboration, evaluation and review. One is shallow, the other deep.
2. As an aside, it’s frustrating that this is commonly framed as the culture not yet having “caught up” to the brave new world of total open science.
3. Whether or not this is warranted is a topic for another discussion, but the fact that this is happening is important for framing the concern over AI’s access to research outputs.
4. I wonder if this would have been framed differently if Meta decided to seed the libgen torrents or take a stand against restrictive copyright laws altogether, rather than claiming that their use-case is exceptional.
5. This also brings to mind how digitally-crafted reconstructions of the Palmyra gate have been shopped around by non-archaeologists as a form of digital colonialism.