14 Comments
Special interest stacks

Thank you for your post. I found it really interesting. Provenance is an area I'm really interested in, especially related to archival responsibilities to provenance.

In reading your piece, I found myself thinking about the idea that provenance is something fixed, linear, and definable in a narrow sense. Some thoughts below.

First, there’s the assumption that LLMs engage with concepts and ontology, performing semantic or ontological prediction rather than just text prediction. It feels like they're operating at the level of meaning, but in practice they are predicting patterns in language. That distinction matters, because we risk attributing conceptual depth where there may only be patterned continuation.

Second, we often forget that the texts LLMs are trained on are already saturated with semantic assumptions and ontological framings. They are not neutral. They carry embedded perspectives, institutional voices, inherited biases, and particular ways of constructing knowledge. So what the model produces is shaped through multiple layers of already-shaped discourse. That matters.

Third, provenance is an expansive concept, although it isn't always treated that way. It’s often assumed to be a fixed lineage that stops once a record is captured. But provenance is multifaceted, evolving, and ongoing. It continues over time through reuse, reinterpretation, regeneration, and circulation. It’s not simply about origin; it’s about how meaning is continuously shaped and reshaped.

You've talked about truth and knowledge, which are also relational. I find it helpful to distinguish between facts, knowledge, and truth. Facts concern whether something exists or can be verified. Knowledge is structured and evolving. Truth is partial and perspectival. In an LLM context, what seems especially important is clarity about what something is purported to be. Does it exist? Is it defensible? That’s different from assuming that knowledge somehow decays. From my perspective, knowledge doesn’t decay; it evolves.

I’m also cautious about assuming that citation-based, textual lineage is the only valid model of provenance. Different knowledge systems—oral histories, story, law, community custodianship—have long-standing and rigorous ways of establishing responsibility, evidence, and continuity. Citation is one epistemic model among many.

If we define provenance too narrowly, we risk flattening complex epistemological traditions into a single framework of traceable textual origin. What’s needed is a deeper philosophical engagement with how LLMs participate in evolving knowledge systems, how it signals authority, how context, content, and structure interact over time, and what responsibility means within those shifting systems.

The issue is less about fixity and more about defensibility: how we know what we know, under what framework, and with what accountability.

Jessica Talisman

The topic is expansive, and I only covered one aspect, or lens into, what is a very complex space, from the library science perspective. However, even an oral history project, to draw upon one of your examples, materializes in digital systems with a lineage, identifying itself as a primary resource and maintaining a chain of custody that credits the primary source and captures derivative works.

Having worked on one of the largest digital oral history projects in the late 1990s, and later having contributed to one of the largest ephemeral collections of art catalogs and flyers, I can say that these digital systems must architect their own versions of provenance and rights management.

Once these representations of knowledge enter digital ecosystems, the works must be protected to preserve the originals and capture changes through the natural evolution of knowledge.

The approaches and architectures may vary in representations, but the priorities and ethos are aligned. And none are perfect.

Tammy Lee

This really resonates.

Provenance is not an academic detail. It is the structure that lets information hold as knowledge. When that chain disappears, what looks like intelligence is often just context stripped away.

Your point about architecture is key. This is not about bad actors or better prompting. If origin and relationships are not preserved by design, decay is built in.

Thank you for naming this so clearly.

Jessica Talisman

Thank you, Tammy. I believe all foundational or frontier models have the same sort of architecture. I barely touched on the breadth of issues related to misinformation and disinformation but unreliable information has its own societal consequences.

Eugenio

Jessica, this piece articulates with remarkable precision a problem we've been working to solve architecturally in southern Brazil.

We're a small team building what we call the PATOS–Lector–PRISMA (PLP) infrastructure — a normative information architecture for pharmaceutical knowledge management. Your diagnosis that "where provenance ends, knowledge decays" maps almost exactly to the gap we identified: most existing systems (RAG pipelines, clinical decision support, conversational agents) collapse three epistemologically distinct operations into a single layer — document preservation, semantic interpretation, and contextual presentation. The result is precisely what you describe: provenance becomes irrecoverable.

Our response is a three-layer architecture:

- PATOS — a sovereign document preservation layer (17,000+ regulatory documents with explicit versioning and bidirectional traceability). No compression, no blending, no statistical distribution across parameters. The source remains the source.

- Lector — machine-assisted reading with explicit human curation, producing what we call Evidence Packs: typed assertions anchored to primary sources, with epistemic boundaries and curatorial decisions documented. This is where we operationalize your point about knowledge requiring "processing through verification systems" — the interpretive act itself becomes traceable.

- PRISMA — contextual presentation through what we call the RPDA framework (Regulatory, Prescription, Dispensing, Administration), which refracts the same informational core into distinct professional views governed by context, risk, and responsibility.

The key insight we share with your analysis: provenance is not metadata you append — it's an architectural property you either preserve structurally or lose irreversibly. What we add is that preserving provenance is necessary but insufficient; you also need the assertions derived from sources to be fully traceable to the interpretive act that produced them, the curator who validated them, and the epistemic boundaries that constrain them.
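
The Evidence Pack idea above can be sketched as a data structure. This is a hypothetical illustration only, not the actual PLP schema: every class, field, and identifier below is invented to show the principle that an assertion stays traceable to its source version, its curator, and its epistemic boundaries.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SourceRef:
    document_id: str   # stable identifier in the preservation layer
    version: str       # explicit version, never an unversioned blob
    locator: str       # section/page anchor within the document

@dataclass(frozen=True)
class Assertion:
    claim: str
    assertion_type: str             # e.g. "contraindication", "dosage"
    sources: tuple[SourceRef, ...]  # anchors back to primary sources
    curator: str                    # the human who validated the reading
    validated_on: date
    epistemic_boundary: str         # scope within which the claim holds

    def is_traceable(self) -> bool:
        # An assertion without an anchored source is not admissible.
        return len(self.sources) > 0

a = Assertion(
    claim="Drug X is contraindicated with Drug Y",
    assertion_type="contraindication",
    sources=(SourceRef("REG-2023-0001", "v3", "sec. 4.5"),),
    curator="curator-17",
    validated_on=date(2024, 6, 1),
    epistemic_boundary="adult patients; regulatory label, not trial data",
)
print(a.is_traceable())  # True
```

The design choice the sketch encodes: provenance is a required structural field of the assertion itself, not metadata appended afterward, so an untraceable claim simply cannot be represented as admissible.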

Your observation about "vibe citing" resonates deeply in the pharmaceutical domain — where a hallucinated drug interaction or a fabricated contraindication isn't an academic inconvenience but a patient safety risk.

We're currently preparing a paper formalizing this architecture (grounded in OAIS, Buckland's tripartite information, hermeneutic theory, speech act theory, and evidence-based medicine). Your earlier piece on context graphs vs. semantic layers was already part of our theoretical framework — this new one strengthens the provenance argument considerably.

From Porto Alegre, Brazil — glad to see the same structural diagnosis emerging independently across continents.

Jessica Talisman

This is fantastic work—thank you for sharing your work and findings.

I’d be interested in reading your paper, as a preprint or thereafter. I am collecting use cases and examples to feature here and in my book. Please do keep me posted and/or direct me towards the best way to follow your work!

Chris Despopoulos

A marvelous article, and I learned so much. I only have one nit to pick. You say:

"…the retrieval layer only tells you where the model looked, instead of where the model’s understanding came from. The parametric layer — the model’s trained weights, where the vast majority of its knowledge lives — is a provenance-free zone"

The model has no understanding. For the model there is no knowledge. We mistake plausible structures of tokens for understanding, or even information, only because these structures resemble artifacts of understanding or information. There's no substitute for the real thing. We must not lose the distinction, nor should we ignore the active role human understanding plays (and is appropriated) in granting these machines powers they don't have.

Thank you very much for the clarity you bring!

Jessica Talisman

Thank you, Chris. Agree. There is no machine understanding as you point out. Ultimately, humans own knowledge and create knowledge.

Aaron Tay

Great post and thanks for referencing me.

You wrote about RAG

"But the retrieval layer only tells you where the model looked, instead of where the model’s understanding came from. The parametric layer — the model’s trained weights, where the vast majority of its knowledge lives — is a provenance-free zone"

This is obviously correct, but how is this different from human understanding? If you asked me now when and how exactly I came to understand how RAG works, I would be hard pressed to tell you which of the many treatments of it I read finally made me understand it (likely my understanding deepened over time, which further complicates matters).

For most things, we also cannot trace exactly where our understanding came from.

Zane Hall

Seems like one more proof of technological singularity. Or, more likely, uncontrollability.

ZH

Joanna Miller

All of this should be intuitive and completely obvious. But alas....😳

"Provenance is a Tool of Power!" Print it on the t-shirts. You now have merch.

Jessica Talisman

💗 it is. It’s an information literacy deficit?

Joanna Miller

Yes. The era of informational illiteracy.

sadok

Your analysis of provenance as a structural condition of knowledge—rather than a mere archival technique—touches a deeper ontological fault line in current AI systems. The problem, as you frame it, is not simply that sources are lost, but that the generative architecture itself dissolves the relational fabric through which knowledge acquires legitimacy, continuity, and intelligibility.

In my own philosophical work on event and temporality, I approach a related question from a different angle: knowledge is not a static object that carries provenance as an external attribute, but an event emerging within a structured field of relations. When those relations are reduced to probabilistic patterning without preserving their formative dynamics, what decays is not only attribution, but the very ontological coherence of what counts as knowledge.

I believe that part of the solution may lie in rethinking knowledge infrastructures in explicitly event-relational terms—where temporal emergence, structural dependency, and generative linkage are modeled as intrinsic rather than supplementary features. Provenance, in such a framework, would not be appended metadata but a trace of the constitutive processes that bring knowledge into presence.

The challenge, then, may not be merely to reattach sources to outputs, but to redesign our systems so that the dynamics of formation remain structurally legible.