The Ontology Pipeline®
A Semantic Knowledge Management Framework
Introduction
Organized, structured, and semi-structured semantic knowledge systems are rapidly emerging as essential for achieving highly performant LLMs and AI systems. Many technologists either view semantic knowledge systems as overly simplistic, limited to text labels or annotations, or see them as too labor-intensive to justify investment. In response to this confusion, there has been an over-reliance on out-of-the-box public taxonomies, such as the Google Product Taxonomy, resulting in homogeneous knowledge ecosystems: a cookie-cutter approach that fails to meet the needs of a domain or business.
Failure to model and represent an organization’s unique aspects and attributes is evident in AI implementation failures and challenges in deriving context and meaning from knowledge management systems, such as knowledge graphs.
Indeed, a semantic knowledge system requires investment in both people and tools, but it does not need to be limited to simple annotations or labor-intensive human effort. Due to a lack of existing frameworks and workflows, the construction and maintenance of semantic knowledge management systems often function as a black box, making it difficult for businesses to determine the exact scale of investment over both the long and short term. Additionally, direct returns on investment can be challenging to measure, as the benefits of a semantic knowledge system tend to be secondary, reflected primarily in the success of RAG implementations, entity management systems, and information retrieval metrics.
By structuring knowledge management stages and phases with logical sequences and workflows, a formal semantic knowledge management program can be scoped, scaled, and positioned as both a product and unique domain of expertise. Establishing a robust and dynamic knowledge management program enables organizations to invest in this infrastructure with greater confidence, hiring people and leveraging machines to refine data and information systems while guiding the construction and maintenance of semantically rich ecosystems.
Enter The Librarians
The Library and Information Science domain provides principled, logical methods and frameworks for organizing and structuring data into information, which, when made accessible to all users (human and machine) becomes knowledge. Librarians have codified these methodologies and strategies to manage complex information ecosystems, enhancing accuracy and reliability in information retrieval. Having successfully leveraged machine learning (ML) and artificial intelligence (AI) as workflow and cataloging tools for over a decade, the library science domain offers repeatable methodologies for building scalable and extensible semantic knowledge management systems that drive accuracy and reliability in AI systems.

Taking a lesson from librarians, technology domains must prioritize knowledge management workflows and processes to support reliable and resilient semantic knowledge management systems, positioning these systems as valuable investments for businesses.
The Ontology Pipeline®
Derived from the workflows of librarians, the Ontology Pipeline® offers a systematic methodology for constructing semantic knowledge management systems. The pipeline consists of iterative building blocks, with each phase preparing for the next stage of building. This iterative process incorporates data cleaning and preparation tasks into the semantic engineering workflow. Starting with a controlled vocabulary, data and information are cleaned, structured and defined, supported by metadata standards that normalize entity-value pairs within the data ecosystem.
Subsequently, the data and information are prepared to be structured as a taxonomy, a hierarchy defined by parent–child relations.
The taxonomy serves as the foundation for constructing a thesaurus, where equivalence, hierarchical, and associative relations are encoded using a lightweight, upper ontology. The thesaurus prepares the base ontology structure necessary to model a more complex and dynamic ontology designed to support descriptive context and semantic reasoning. Finally, with all the building blocks in place, the combined effort results in a semantic RDF knowledge graph, consisting of the required semantic elements or layers of the graph.
The Ontology Pipeline® is designed to codify a framework for the architecture and construction of semantic knowledge management systems. This tangible workflow and process enable more accurate estimation of investment requirements, both human and machine. Without the Ontology Pipeline®, organizations struggle to understand the levels of effort and investment needed, let alone how to derive metrics to measure benefits, system performance and returns on investments.
When the business understands the components of a semantic system, leaders and stakeholders gain better visibility into its requirements. With clear goals and outcomes in sight, organizations can build and maintain semantic knowledge management systems with confidence, knowing these investments are crucial to supporting data infrastructure, data transformation and AI initiatives.
Controlled Vocabulary
A controlled vocabulary is the first building block in constructing semantic knowledge systems. It works in tandem with data cleaning and preparation tasks. As part of the vocabulary construction and refinement workflow, data must be deduplicated, merged, and defined to create a clean, disambiguated vocabulary fit for its purpose. Deduplication involves reconciling synonymous terms to clarify the vocabulary. Best practices include creating definitions for each term or concept in the controlled vocabulary list, ensuring all users and stakeholders share a common understanding for the chosen terms.
In this example from NASA’s controlled vocabulary index, Mission to Planet Earth has a definition. The indicator UF means “Use For”, and MTPE is the acronym for Mission to Planet Earth. This illustrates how duplicates, near duplicates, and synonyms can be resolved when architecting a controlled vocabulary. A controlled vocabulary is the first step toward creating unified domain vocabularies, fostering alignment across people and machine systems while supporting a shared and common understanding.
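The Use For pattern described above can be sketched in code. The following is a minimal, illustrative sketch (not NASA's actual vocabulary or tooling): each preferred term carries a definition and its UF variants, and a resolver maps any known variant back to the single preferred term.

```python
# A minimal sketch of a controlled vocabulary with "Use For" (UF)
# resolution. Terms and definitions are illustrative placeholders.
CONTROLLED_VOCABULARY = {
    "Mission to Planet Earth": {
        "definition": "NASA program studying Earth as an integrated system.",
        "use_for": ["MTPE", "Mission To Planet Earth Program"],
    },
}

# Build a lookup from every variant (and the preferred term itself)
# back to the single preferred term.
_lookup = {}
for preferred, entry in CONTROLLED_VOCABULARY.items():
    _lookup[preferred.lower()] = preferred
    for variant in entry["use_for"]:
        _lookup[variant.lower()] = preferred

def resolve(term: str):
    """Return the preferred term for any known variant, or None."""
    return _lookup.get(term.strip().lower())
```

Deduplication then becomes mechanical: any incoming term is normalized through `resolve`, so "MTPE" and "Mission to Planet Earth" can no longer drift apart as separate concepts.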
Metadata Standards
Now that we have controlled vocabularies, we develop metadata standards to encode the common understanding or “aboutness” of data and information. Metadata standards provide schema-based control for databases and information systems. Metadata elements define the fields necessary to describe data and information assets. These elements are categorized by their characteristics: STRUCTURAL (for machine readability), DESCRIPTIVE (for context), and ADMINISTRATIVE (for asset maintenance and lineage). While these are the most common types of metadata elements, practitioners and stakeholders have expanded these categories to include subject areas such as social or provenance metadata. Regardless of the types of metadata elements, each must be well-defined, with clear direction on what each element type is designed to handle.
With well-defined metadata standards, each metadata element is prepared to accept controlled vocabulary values. A metadata standard can support various workflows and data streams, often serving as the foundational architecture for entity value systems and concept models. The controlled vocabulary provides the controlled or allowable values to pair with metadata elements. By implementing a metadata standard, a natural framework emerges for entity reconciliation, vocabulary management, and schema-based validation matrices to enforce vocabulary control and semantics. Additionally, metadata standards offer valuable insight into extended vocabulary needs, helping to scope future requirements in subsequent Ontology Pipeline® steps.
In this example, the metadata element, TITLE, has the value of Mad Men Season 5: Plot Predictions. The metadata element TYPE specifies article, indicating the content type. By normalizing the metadata element and its expected allowable values, the metadata standard and controlled vocabulary work together to describe the asset, providing context and meaning.
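The pairing of metadata elements with allowable values can be sketched as a simple validation routine. This is an illustrative sketch, not a production schema language: the element names, categories, and allowed values are assumptions modeled on the example above.

```python
# A minimal sketch of schema-based validation: each metadata element
# declares its category and which values it accepts, so records can
# be checked before entering the system. Values are illustrative.
METADATA_STANDARD = {
    "TITLE": {"category": "DESCRIPTIVE", "allowed": None},  # free text
    "TYPE": {"category": "STRUCTURAL", "allowed": {"article", "video", "image"}},
    "CREATED_BY": {"category": "ADMINISTRATIVE", "allowed": None},
}

def validate(record: dict):
    """Return a list of violations; an empty list means the record is valid."""
    errors = []
    for element, value in record.items():
        spec = METADATA_STANDARD.get(element)
        if spec is None:
            errors.append(f"Unknown metadata element: {element}")
        elif spec["allowed"] is not None and value not in spec["allowed"]:
            errors.append(f"{element}={value!r} is not an allowed value")
    return errors
```

In practice the `allowed` sets would be drawn from the controlled vocabulary built in the previous step, which is exactly how the two building blocks reinforce each other.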
Taxonomy
Often, the creation of controlled vocabularies and metadata schemas is overlooked in the rush to build taxonomies. Without proper data hygiene— resolving synonyms and duplicate concepts, as we do when building controlled vocabularies and metadata schemas—a taxonomy can be difficult to build and quickly becomes unwieldy. A taxonomy takes the controlled vocabulary and transforms it into a hierarchical structure. This marks the beginning of creating relationships between concepts, from broad to narrow or parent–child relations, leading toward a more mature, ontology-rich system. These relationships serve as helpful classification structures for machine learning algorithms, can be used for front-end navigation, and assist in organizing assets. Additionally, taxonomies are often used in tagging and annotation systems.
Taxonomies are often built and maintained in spreadsheets. However, managing and scaling a taxonomy in spreadsheet format quickly becomes impractical. Additionally, a spreadsheet-based taxonomy often lacks the machine-readable, semantic encoding structure necessary to support a fully developed semantic knowledge management system.
To build a resilient taxonomy that is both human- and machine-readable and optimized for AI systems, it is best to invest in a semantic middleware tool, such as Graphwise or TopQuadrant. This allows an emerging taxonomy to be structured using an upper ontology like SKOS and aligned to standards that inform the middleware validation matrices, which check for structural integrity and help resolve issues such as recursive loops and relationship clashes. While seemingly innocuous, the introduction of faulty logic, even within a base structure like a taxonomy, can lead to the proliferation of errors throughout semantic knowledge systems. Therefore, it is recommended that a taxonomy be constructed according to ISO standards and ontology logic, with validation matrices in place, to ensure the soundness of the taxonomic structure.
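One structural check a validation matrix typically performs is loop detection in the parent–child hierarchy. The sketch below is illustrative (the concepts are placeholders, and real middleware does far more): it walks each concept up its parent chain and reports any recursive loop.

```python
# A minimal sketch of cycle detection over parent-child relations,
# one check a taxonomy validation matrix might run. The taxonomy
# content here is an illustrative placeholder.
def find_cycle(child_to_parent: dict):
    """Walk each concept up its parent chain; return a cycle if found."""
    for start in child_to_parent:
        seen = [start]
        node = start
        while node in child_to_parent:
            node = child_to_parent[node]
            if node in seen:
                return seen[seen.index(node):] + [node]
            seen.append(node)
    return None

taxonomy = {"Espresso": "Coffee", "Coffee": "Beverages"}
assert find_cycle(taxonomy) is None  # a sound hierarchy

# Introducing faulty logic (Beverages narrower than Espresso)
# creates a recursive loop, which the check catches.
taxonomy["Beverages"] = "Espresso"
assert find_cycle(taxonomy) is not None
```

A loop like this is exactly the kind of seemingly innocuous error that, left unchecked, proliferates through every layer built on top of the taxonomy.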
Taxonomy guidelines and validation usually include basic ontological reasoning based upon standards. These guidelines and standards include:
ISO 25964-1, Information and documentation — Thesauri and interoperability with other vocabularies — Part 1: Thesauri for information retrieval
ISO 25964-2, Information and documentation — Thesauri and interoperability with other vocabularies — Part 2: Interoperability with other vocabularies
In turn, these standards and guidelines align with The Ontology Pipeline®, documenting similar architectures and frameworks to support semantic logical reasoning, ontology construction, and validation matrices. Ultimately, during the taxonomy build and management phase of the pipeline, it is critical to understand the impact of design decisions. These decisions include:
How many levels should the taxonomy represent?
What is the determined level of granularity?
Will localization be enabled alongside the taxonomy?
Will the controlled vocabulary be deprecated or remain in use when the taxonomy is deployed?
How will new concepts and terms be integrated into the taxonomy?
Building and structuring the taxonomy with ontological reasoning, and according to standards and guidelines, will ensure that the taxonomy is prepared for the next phase of the Ontology Pipeline® workflow.
Thesaurus
While the thesaurus can often be interchangeable with a taxonomy in terms of the order of operations, I prefer to mature a taxonomy into a thesaurus by extending the ontologies used to structure the taxonomy. I always structure taxonomies to evolve into thesauri because it is the crucial first step toward building a semantic, ontology-based knowledge management system.
A thesaurus handles ambiguity by forming associative relationships between terms, going beyond the parent-child relationships within a hierarchy. A thesaurus further encodes these structures, often using a lightweight, mid-level ontology such as SKOS (the Simple Knowledge Organization System) or SKOS-XL. Much like a traditional thesaurus used to discover near terms or synonymous concepts in vocabulary management, an ontologically encoded thesaurus matures a controlled vocabulary, metadata standard, and taxonomy to support term resolution at scale, while enforcing the logical reasoning embedded within ontologies.
In this example, BT means broader term, SYN means synonym, and NT means narrower term. A thesaurus architected with ontologies and a machine-readable encoding structure supports interoperability and is highly useful for helping both machines and humans understand context and meaning. With context and meaning, a thesaurus marks the establishment of semantic reasoning and primitive knowledge management, now ready for the next step in the Ontology Pipeline®.
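The BT / NT / SYN relations above map naturally onto SKOS properties. The following is an illustrative sketch, with placeholder concepts, of how those relations might be encoded as triples and then queried:

```python
# A minimal sketch mapping thesaurus relations onto SKOS-style
# properties. The concepts are illustrative placeholders.
triples = [
    ("Coffee", "skos:broader", "Beverages"),   # BT: broader term
    ("Coffee", "skos:narrower", "Espresso"),   # NT: narrower term
    ("Coffee", "skos:altLabel", "Java"),       # SYN: synonym label
    ("Coffee", "skos:related", "Caffeine"),    # associative relation
]

def related_terms(concept: str, prop: str):
    """Return all objects linked to a concept by a given SKOS property."""
    return [o for s, p, o in triples if s == concept and p == prop]
```

Because these relations are encoded as machine-readable properties rather than spreadsheet columns, any SKOS-aware tool can traverse, validate, and exchange the thesaurus, which is what makes this stage interoperable.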
Ontology
Having followed the logical steps and completed the specific stages of the Ontology Pipeline®, data and information are now primed for the ontology build. With a basic ontological structure in place to describe vocabulary control, the taxonomy hierarchy, and the thesaurus, we can now add domain ontologies and standard, open ontologies to further expose direct and indirect relations, nuanced and descriptive contexts.
Ontologies enhance vocabularies by describing and contextualizing relationships between concepts and by introducing logical reasoning. By assigning classes, properties, relations, and attributes, ontologies establish rule bases that define how concepts behave in the wild, ensuring a level of coherence within complex information systems.
Machines love ontologies because of their high-fidelity disambiguation and description, which bring clarity to machine understanding for tasks such as information retrieval, entity management, concept discovery, and RAG implementations for AI systems.
If built without established controlled vocabularies, metadata standards, taxonomies and thesauri, ontology construction can be very difficult, as the underlying data and information structures lack integrity and may suffer from data quality issues. It is nearly impossible to build an ontology with messy, undefined vocabularies because introducing logic becomes extremely difficult when the underlying data itself is not logically structured. Building ontologies is like writing a story that defines domains, complex systems and the relationships between all the characters, places, things and concepts.
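The rule bases described above can be sketched in miniature. This is an illustrative sketch, not OWL or a real reasoner: classes and a property with a declared domain and range (which classes its subject and object must belong to), plus a check that flags statements violating those rules. All names are assumptions for the example.

```python
# A minimal sketch of ontology-style rules: a property declares a
# domain and range, and a rule check flags statements that violate
# them. Class and property names are illustrative assumptions.
classes = {"MadMen": "TVSeries", "MatthewWeiner": "Person"}

properties = {
    "createdBy": {"domain": "TVSeries", "range": "Person"},
}

def check(statement):
    """True when subject and object satisfy the property's domain/range."""
    s, p, o = statement
    spec = properties[p]
    return classes.get(s) == spec["domain"] and classes.get(o) == spec["range"]
```

A statement like a person being "createdBy" a TV series fails the check, which is the sense in which rule bases define how concepts are allowed to behave.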
Knowledge Graph
And finally, we arrive at knowledge graphs, the current Rosetta Stone of a semantic knowledge management system and the Ontology Pipeline’s® visualization layer. The last step in the Ontology Pipeline® involves synthesizing the four building blocks to culminate in a knowledge graph, which is essentially a knowledge management tool. Knowledge graphs are an assemblage of controlled vocabularies, metadata schemas, taxonomies, thesauri and ontologies. The Ontology Pipeline® naturally presents a layered knowledge graph, making it easier to troubleshoot broken logic while providing the control planes necessary to scale and extend the graph.
Knowledge graphs represent the synthesis of all stages of the knowledge modelling process, now visualized through clear, graphical representations. Because knowledge management can be complex, the visualization serves as an organizational language and communication interface, creating opportunities for teaching and learning across a domain and within the organization.
Knowledge graphs allow stakeholders, users, and contributors to interact with and query a semantic knowledge management system, demystifying the value proposition of semantics and surfacing knowledge as a first-class citizen. They utilize query languages like SPARQL, and validation languages like SHACL, enabling precise discovery of data and information across the graph. The only dependencies are the logical reasoning rules introduced by the encoded system. A knowledge graph is highly flexible, with its limitations defined by the complexity of the logical rule bases and data encoded within.
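The querying described above can be sketched with a tiny in-memory triple store and a SPARQL-like triple-pattern match, where `None` plays the role of a variable. This is an illustrative sketch with placeholder data; a production system would use an RDF store and real SPARQL.

```python
# A minimal sketch of querying a knowledge graph: a set of triples
# and a pattern matcher where None matches anything, echoing how a
# SPARQL triple pattern binds variables. Data is illustrative.
graph = {
    ("MadMen", "type", "TVSeries"),
    ("MadMen", "createdBy", "MatthewWeiner"),
    ("MatthewWeiner", "type", "Person"),
}

def query(s=None, p=None, o=None):
    """Return triples matching the pattern, sorted for stable output."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )
```

Asking "who created MadMen?" is `query(s="MadMen", p="createdBy")`; asking "what entities exist?" is `query(p="type")`. The answers are only as good as the triples, which is why every prior stage of the pipeline matters.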
The Ontology Pipeline®: A Semantic Knowledge Management Framework
By building a semantic knowledge management system using the logical steps outlined in the Ontology Pipeline® framework, organizations can document time, effort, and cost estimates to secure funding and organizational support for future iterations and maintenance. As many organizations struggle to justify investment in semantic knowledge management systems, a proven and repeatable framework enables them to project costs and develop metrics to prove value.
When implementing a semantic system, it’s crucial to highlight the ancillary and integrated benefits of the semantic build. By following the Ontology Pipeline’s® rigorous, iterative process of cleaning, preparation, reconciliation, modeling, testing, enrichment, enforcement, reporting, and measurement, these steps are naturally incorporated. Additionally, since large language models (LLMs) require clean, well-structured, semantically enriched data to provide accurate and reliable results, who better to ensure data quality than semantic engineers?