Making AI Behave: Using Knowledge Domains to Produce Useful, Trustworthy Results

Marjorie M.K. Hlava
Chief Scientist
Access Innovations, Inc.
mhlava@accessinn.com

Abstract
In today's highly charged atmosphere of anxiety and anticipation about AI, and especially LLMs,
one of the biggest concerns is how to ensure that it returns accurate results (meaning both true
and pertinent to its audience). This is particularly important to scholarly, scientific, and other
technical organizations, whose constituents are often in very specific domains, such as
medicine, engineering, history, biology, chemistry, etc. One extremely useful tool to incorporate in
an AI-based process in such cases is a comprehensive and well-structured knowledge domain
which is based on a controlled vocabulary.
The next Access Innovations webinar, coming up at noon Eastern on Tuesday, March 26, is
"MAKING AI BEHAVE: Using Knowledge Domains to Produce Useful, Trustworthy Results." It's
based on the extensive experience and history Access Innovations has in the development and
implementation of domain-specific thesauri, taxonomies, ontologies, and knowledge graphs, and
their use of them with AI. They have over 70 knowledge domains covered, which they employ in
sophisticated search, auto-tagging, and AI-based solutions for their clients. These are all
available for immediate deployment, so you don't have to start from scratch to develop the ability
to accurately tag your content to ensure proper and effective use by AI tools and systems.

Daily Deluge
• Google bans AI chatbot Gemini from answering election questions: ‘Try Google Search’ (Reuters, published March 12, 2024, 12:03 p.m. ET)
• Microsoft AI Research Introduces Generalized Instruction Tuning (called GLAN): A General and Scalable Artificial Intelligence Method for Instruction Tuning of Large Language Models (LLMs) (Tanya Malhotra, March 2, 2024)
• News Corp in ‘advanced’ talks with AI firms on deals to license content, CEO says (Thomas Barrabi, published Feb. 8, 2024, 2:17 p.m. ET)
• Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models (submitted 20 Feb 2024) https://arxiv.org/abs/2402.13064
Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei
• Google AI Introduces Croissant: A Metadata Format for Machine Learning-Ready Datasets (Dhanshree Shripad Shenwai, March 12, 2024)

Marjorie M.K. Hlava
• Expert in taxonomies, metadata, their application and data science.
• Her groundbreaking work has earned her numerous awards and 2 patents
with 21 claims granted
• Margie's standards work includes
• Dublin Core Z39.85
• DOI Syntax Z39.84
• CRediT Z39.104
• Thesaurus ANSI/NISO Z39.19 Thesauri and other controlled vocabularies
• many others
• Convener of ISO 25964, the international standard on controlled
vocabularies
• Founder, Chairman, Chief Scientist of Access Innovations, Inc.

“Large language models will not only mirror but magnify any problems with
the data sets, problems that many organizations may not realize they have.”
Amplifying hidden biases and gaps seems like a real danger.

What we will cover today
• Definitions
• Getting us to speak the same language
• Quick review of options
• Why Taxonomies with LLMs?
• Where do they fit?
• What are some available Knowledge Domains?
• Two Approaches
• Summary

What we will NOT cover
• Big topic
• Video
• Politics / Elections
• Recent sensations
• All the tool sets
• Regulatory actions
• Programming aspects
• Business cases
https://www.nature.com/articles/d41586-024-00661-0?utm_source=Live+Audience&utm_campaign=adeec3770a-briefing-dy-20240313&utm_medium=email&utm_term=0_b27a691814-adeec3770a-51734080

National Information Standards Organization
– Controlled vocabulary
• "a carefully selected list of words and phrases, which are used to tag
units of information (document, images, videos, etc.) in order to
describe their content. This list is carefully selected and managed by
experts in a particular subject domain or field.”
• Ensure consistency and precision in indexing and retrieval
• Domain specific

NISO – Thesaurus
• "a controlled and structured vocabulary in which concepts are
represented by terms, organized so that relationships between
concepts are made explicit, and preferred terms are
accompanied by lead-in entries for synonyms or quasi-synonyms.”
• Hierarchical, Equivalence (Synonyms) and Associative (Related)
• Structured
• Organizes concepts
• Facilitates access to information
• Provides standardized terminology and relationships between terms
(concepts), synonyms, or quasi-synonyms
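
The three relationship types named above can be sketched as a small term record. This is only an illustration, not the NISO data model: the class name `TermRecord`, its field names, and the sample terms are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    """One preferred term with its thesaurus relationships (illustrative only)."""
    preferred: str
    broader: list[str] = field(default_factory=list)    # hierarchical (BT)
    narrower: list[str] = field(default_factory=list)   # hierarchical (NT)
    related: list[str] = field(default_factory=list)    # associative (RT)
    used_for: list[str] = field(default_factory=list)   # equivalence (UF): synonyms, quasi-synonyms
    scope_note: str = ""


# Hypothetical sample record.
heart_attack = TermRecord(
    preferred="Myocardial infarction",
    broader=["Heart diseases"],
    related=["Coronary thrombosis"],
    used_for=["Heart attack", "Cardiac infarction"],
    scope_note="Use for death of heart muscle tissue caused by loss of blood supply.",
)


def lead_in_entries(record: TermRecord) -> dict[str, str]:
    """Map each non-preferred entry term (lowercased) to its preferred term."""
    return {entry.lower(): record.preferred for entry in record.used_for}


entries = lead_in_entries(heart_attack)
```

Here `entries` plays the role of the lead-in entries: a searcher who types "heart attack" is routed to the preferred term.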

NISO - Taxonomy
• "a structured, hierarchical representation
of concepts or terms within a specific
domain, organized to show relationships
between concepts or terms.”
• Formal frameworks for representing and
organizing knowledge
• Just the hierarchy
• But now often used interchangeably with
thesaurus and ontology
Image + https://www.thoughtworks.com/insights/blog/data-science-ontology

Radial graph and hierarchical display: both are taxonomy displays
https://www.hedden-information.com/taxonomies-vs-ontologies/

What are the steps to implement
taxonomy in generative AI? 1 of 2
• Define the Taxonomy Structure:
• Identify the key concepts, categories, and relationships relevant to the domain or problem the generative AI
system will address.
• Design a hierarchical taxonomy structure that organizes these concepts into categories and subcategories.
• Define relationships between categories to capture semantic connections.
• Collect and Preprocess Data:
• Gather a corpus of text data relevant to the domain or problem. This could include documents, articles, or
any other textual resources.
• Preprocess the text data to clean it, remove noise, and standardize formatting. This may involve tasks like
tokenization, stemming, and removing stop words.
• Annotate Data with Taxonomy Labels:
• Manually or semi-automatically annotate the text data with labels corresponding to the taxonomy
categories. This step involves mapping text excerpts or documents to the appropriate categories in the
taxonomy.
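
The first three steps can be sketched in a few lines. This is a toy under stated assumptions: the categories in `PARENT`, the cue words in `CUES`, and the two sample sentences are hypothetical, and keyword matching stands in for the manual or semi-automatic annotation the slide describes.

```python
import re

# Hypothetical two-level taxonomy: category -> parent category.
PARENT = {
    "Cardiology": "Medicine",
    "Oncology": "Medicine",
    "Medicine": None,
}

# Hypothetical cue words that map text to a category (stand-in for expert annotation).
CUES = {
    "Cardiology": {"heart", "cardiac", "arrhythmia"},
    "Oncology": {"tumor", "cancer", "lymphoma"},
}


def preprocess(text: str) -> list[str]:
    """Lowercase and tokenize; a fuller pipeline might also stem and drop stop words."""
    return re.findall(r"[a-z]+", text.lower())


def annotate(text: str) -> list[str]:
    """Return the matching categories plus all of their ancestors, sorted."""
    tokens = set(preprocess(text))
    labels = set()
    for category, cues in CUES.items():
        if tokens & cues:
            node = category
            while node is not None:  # walk up the hierarchy
                labels.add(node)
                node = PARENT[node]
    return sorted(labels)


corpus = ["Cardiac arrhythmia after surgery", "New lymphoma treatment"]
annotated = [(doc, annotate(doc)) for doc in corpus]
```

Each pair in `annotated` is a text with its taxonomy labels, which is the shape of data the training step expects.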

What are the steps to implement
taxonomy in generative AI? 2 of 2
• Train the Generative AI Model:
• Choose or develop a generative AI model suitable for the task at hand, such as a language model based on the transformer
architecture (e.g., GPT).
• Prepare the annotated data for training, ensuring that each input is associated with its corresponding taxonomy labels.
• Train the generative AI model on the annotated data, incorporating the taxonomy labels as part of the training process. This
allows the model to learn the relationships between textual inputs and taxonomy categories.
• Incorporate Taxonomy into Model Inference:
• After training, integrate the taxonomy structure into the generative AI model's inference process.
• When generating text or responses, use the taxonomy to guide the model's outputs. For example, you can constrain the
generation process to ensure that the generated text aligns with the taxonomy categories.
• Evaluate and Iterate:
• Evaluate the performance of the generative AI system using metrics relevant to the task, such as accuracy, coherence, and
relevance.
• Collect feedback from users or domain experts to identify areas for improvement.
• Iterate on the model and taxonomy design based on the evaluation results and feedback, making adjustments as
necessary to enhance performance.
• Deploy and Monitor:
• Deploy the generative AI system with taxonomy support in a production environment or as part of an application.
• Monitor the system's performance and user interactions, gathering data for further refinement and optimization.
• Collaboration between domain experts, data scientists, and AI engineers is crucial for success
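
One simple way to "use the taxonomy to guide the model's outputs" is to check proposed labels against the controlled list after generation. The sketch below assumes a stand-in function `fake_model` in place of a real generative model, and a hypothetical `ALLOWED` vocabulary; it shows the filtering idea only, not any particular product's method.

```python
# Hypothetical controlled list: lowercase form -> canonical taxonomy label.
ALLOWED = {"cardiology": "Cardiology", "oncology": "Oncology", "medicine": "Medicine"}


def fake_model(prompt: str) -> list[str]:
    """Stand-in for a generative model that proposes subject labels."""
    return ["cardiology", "Wellness tips", "MEDICINE"]


def constrained_labels(prompt: str) -> tuple[list[str], list[str]]:
    """Keep only labels that exist in the taxonomy; report the rest for review."""
    accepted, rejected = [], []
    for label in fake_model(prompt):
        canonical = ALLOWED.get(label.strip().lower())
        if canonical is not None:
            accepted.append(canonical)
        else:
            rejected.append(label)
    return accepted, rejected


accepted, rejected = constrained_labels("Label this article about heart disease.")
```

The rejected labels are worth logging: they are candidates for the evaluate-and-iterate step, either as new taxonomy terms or as evidence of model drift.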

Knowledge Domain
• Refers to a specific area or field of knowledge
• subject matter, concepts, theories, methodologies, and practices
• Cohesive and organized body of knowledge with a scope and
boundaries
• Vary widely in size and complexity
• Covid 57 terms
• JSTOR 57,000 terms
• Library of Congress – 208,000 terms
• quantum mechanics or medieval literature
• Established disciplines or sub-disciplines
• Each with theories, methods, and research traditions
• Frameworks for understanding and investigating phenomena within
specific areas
• Represent scholars, researchers, practitioners, SMEs contributing to
knowledge within those domains

Available =
already built
• Government resources
• Most agencies
• May need formatting
• NASA, DTIC, DOE, NAL, EPA,
NLM, etc.
• Sign up for updates
• License-able
• TaxoBank
• Access Innovations
• Others

Knowledge Domains
• Taxonomies, thesauri, or authority files
• Pre-Built
• Knowledge Domains
• full term records
• hierarchical, equivalence, and associative relationships, as well as scope notes
where appropriate.
• hierarchy only.
• NISO Z39.19 and ISO 25964 standards compliant
• Formats,
• 22 options
• Excel/CSV
• SKOS-2
• Etc.

General Purpose Taxonomies
Applied Science
Art
Behavioral Science
Biological Science
Business
Chemical – MAI Chem
Communications
Computer Science
COVID
Economics
Educational Curriculum
Geography
Health and Safety
Health Science
History
Information Science
Language Arts
Law
Linguistics
Literature and Drama
Mathematics
NewsThes
Nursing
Philosophy
Physical Education and Recreation
Physical Sciences
Political Science
Psychology
Religion
Science
Social Sciences

These products can be SKOS downloads
Astronomy
Clinical Drugs
DTIC – Defense Technical Information Center
Environment – GEMET
ERIC – Education Resource Information
Center
JSTOR
NASA
National Agricultural Library
Occupational Safety and Health
PLOS

CPT – Current Procedural Terminology
HCPCS – Healthcare Common Procedure
Coding System
ICD11 – International Classification of
Diseases
Kew Medicinal Plant Names (MPNS)
MeSH – Medical Subject Headings
Suspect Cell Lines
Taxogene – the Human Genome
These products are
available as SaaS

Knowledge Graph
• A knowledge graph
• structured representation of knowledge
• captures relationships between entities or concepts in a specific domain
• Nodes represent entities or concepts
• Edges represent relationships between these entities
• Using semantic technologies and linked data principles
• Integrate information from multiple sources
• Supports inference of new knowledge based on existing connections
• Enable context-aware information
• data and its relationships
• Gives precise querying and analysis
• Supports discovery of implicit connections and patterns within the data
• For organizing, navigating, and leveraging large volumes of interconnected
data
• Facilitate extraction of insights and the generation of new knowledge
Image = https://ahrefs.com/blog/google-knowledge-graph/
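
The "inference of new knowledge based on existing connections" bullet can be illustrated with a tiny graph. The edges below are hypothetical; the sketch follows `is_a` edges transitively to infer facts that no single edge states.

```python
from collections import defaultdict

# Hypothetical edges: (subject, relationship, object).
EDGES = [
    ("Aspirin", "is_a", "NSAID"),
    ("NSAID", "is_a", "Analgesic"),
    ("Analgesic", "is_a", "Drug"),
]

# Adjacency map: node -> set of direct parents.
graph = defaultdict(set)
for subject, relation, obj in EDGES:
    if relation == "is_a":
        graph[subject].add(obj)


def ancestors(node: str) -> set[str]:
    """Follow is_a edges transitively to infer every broader class of a node."""
    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


inferred = ancestors("Aspirin")
```

No edge says that aspirin is a drug, yet `inferred` contains "Drug" because the connection is implicit in the chain.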

Does a knowledge graph need a
controlled vocabulary?
• Consistency
• Interoperability
• Facilitates Search and Discovery
• Semantic Enrichment
• Domain Understanding

Knowledge Graphs with Generative AI
• Contextual Understanding:
• Provide a structured representation of relationships between entities and concepts
• Generates more relevant and contextually appropriate responses
• Content Generation:
• Source of structured data and information for generative AI systems
• Use in the training process, so the model learns from the structured relationships encoded in the graph and generates more coherent
and accurate outputs
• Ensure that the generated documents adhere to domain principles and conventions
• Entity Linking and Disambiguation:
• Identify and disambiguate entities mentioned in text
• Lets AI models accurately link mentions of entities to their corresponding entries in the graph
• Reduces ambiguity
• Improves the quality of generated outputs
• Personalization and Customization:
• Customize to specific domains or use cases
• Generate personalized outputs for user needs and preferences
• Provides more relevant and useful content

How to Link Knowledge Graph to Generative AI – 1 of 2
• Define Knowledge Graph Schema:
• Identify the entities, relationships, and properties
• Design a schema that captures both the structure and the semantics of the domain
• Acquire and Process Data:
• Gather data sources
• Preprocess the data to extract entities, relationships, and properties
• Convert them into a format suitable for loading into the knowledge graph.
• Build and Populate the Knowledge Graph:
• Use a graph database to create and populate the knowledge graph.
• Load the processed data into the knowledge graph,
• Ensure that entities are represented as nodes, relationships as edges, and
properties as attributes.

How to Link Knowledge Graph to Generative AI – 2 of 2
• Integrate Knowledge Graph with Generative AI:
• Querying the knowledge graph for relevant information
• Incorporating it as input during the model training process.
• Ensure the model can access and utilize the structured knowledge represented
in the graph.
• Training and Fine-Tuning:
• Train or fine-tune the generative AI model using the knowledge graph-enhanced
data.
• Supervised learning with labeled data
• Unsupervised learning to discover patterns and relationships within the data.
• Generate Outputs and Evaluate:
• Use generative AI system to generate outputs based on user queries or input
data.
• Evaluate the quality, relevance, and coherence of the generated outputs
• Iterate and Refine:
• Iterate on the implementation, incorporating feedback and making
improvements
• Continuously refine the knowledge graph and generative AI model
• Base refinements on user interactions, new data, and evolving requirements
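
The slide describes feeding graph knowledge into training. A lighter-weight variant, shown here as an assumption rather than as the method on the slide, is to query the graph at question time and place the retrieved facts in the prompt. The triples and function names are hypothetical.

```python
# Hypothetical triples standing in for a populated knowledge graph.
TRIPLES = [
    ("Mercury (element)", "has_symbol", "Hg"),
    ("Mercury (element)", "is_a", "Heavy metal"),
    ("Mercury (planet)", "orbits", "Sun"),
]


def facts_about(entity: str) -> list[str]:
    """Return every triple whose subject is the entity, as plain sentences."""
    return [f"{s} {p.replace('_', ' ')} {o}." for s, p, o in TRIPLES if s == entity]


def build_prompt(question: str, entity: str) -> str:
    """Prepend graph facts so the model can answer from supplied knowledge."""
    facts = "\n".join(facts_about(entity))
    return f"Known facts:\n{facts}\n\nQuestion: {question}"


prompt = build_prompt("Is mercury toxic?", "Mercury (element)")
```

Only facts about the chosen entity reach the prompt, so the planet never confuses a question about the element.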

Ontology – 1 of 2
• “a formal and explicit specification of a conceptualization defines the
terms, concepts, and relationships within a particular domain of
knowledge."
• Meaningless jumble
• Formal frameworks for representing and organizing knowledge
• Terms or Concepts:
• Represent the entities, classes, or categories within the domain of interest
• Each term (concept or entity) is defined with a precise meaning
• Relationships:
• Define the connections between terms in the ontology
• Relationships can represent various types of connections such as hierarchical (subclass),
part-whole, or associative relationships
• Axioms or Constraints:
• Rules or constraints for properties (behavior) of the terms and relationships in the ontology
• Axioms help ensure the consistency and coherence of the ontology

Ontology – 2 of 2
• Often uses RDF = Resource Description Framework (defines
itself by reference or inclusion)
• One thing (the subject a.k.a. resource) has a relationship (the
predicate a.k.a. edge) with another thing (the object a.k.a.
resource)
• Each thing is a resource, and each edge is a given relationship
(for example, "reports to" or "works for"), which is known as a predicate
• Machine-readable
• Facilitates interoperability, reasoning, and semantic
understanding across different systems and applications
• Connect things not strings
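
The subject, predicate, object pattern can be shown with plain tuples. This sketch does not use an RDF library; the `http://example.org/` namespace, the resource names, and the helper functions are assumptions for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical resources, identified by URIs rather than bare strings.
EX = "http://example.org/"
triples = [
    Triple(EX + "alice", EX + "reportsTo", EX + "bob"),
    Triple(EX + "alice", EX + "worksFor", EX + "acme"),
    Triple(EX + "bob", EX + "worksFor", EX + "acme"),
]


def objects(subject: str, predicate: str) -> list[str]:
    """Return every object linked to the subject by the predicate."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


def to_ntriples(t: Triple) -> str:
    """Serialize one triple in N-Triples style: <s> <p> <o> ."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


employer = objects(EX + "alice", EX + "worksFor")
line = to_ntriples(triples[0])
```

Because every resource is a URI, two systems that both refer to `http://example.org/acme` are talking about the same thing, which is the point of "things not strings."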

Tagging /Indexing
• The process of associating metadata or
descriptive keywords with digital content
• All content types
• Text based
• Identify Things
• People, places, objects, entities
• Identify concepts
• Keywords, descriptors, terms, subject headings,
classification systems and codes, thesaurus terms
• Provide consistent tagging and accurate and
comprehensive retrieval of content items
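
A minimal sketch of tagging text against a controlled vocabulary follows. The vocabulary and sentence are hypothetical, and whole-phrase matching is a deliberate simplification; rule-based or statistical taggers do considerably more.

```python
import re

# Hypothetical controlled vocabulary: entry term (lowercase) -> preferred term.
VOCABULARY = {
    "heart attack": "Myocardial infarction",
    "myocardial infarction": "Myocardial infarction",
    "aspirin": "Aspirin",
}


def tag(text: str) -> list[str]:
    """Return the preferred terms whose entry terms occur as whole phrases in the text."""
    lowered = text.lower()
    found = set()
    for entry, preferred in VOCABULARY.items():
        if re.search(r"\b" + re.escape(entry) + r"\b", lowered):
            found.add(preferred)
    return sorted(found)


tags = tag("Low-dose aspirin after a heart attack.")
```

The same preferred term is applied whether the author wrote "heart attack" or "myocardial infarction", which is what makes retrieval consistent.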

NLP, ML, and AI are not new
• Automation of human activity – around for over 100 years
• Mechanical automation – Jacquard Looms (1804)
• Herman Hollerith 1890 Census (punch cards)
• IBM – Thomas Watson Group – 1920s
• Sputnik 1957
• Space Race and Cold War
• 1964 – COSATI
• TEST = Thesaurus of Engineering, Scientific and Technical Terms
• Automated retrieval of documents
• Dialog, NASA Recon 1973

Basic algorithms
• Boole, George 1815 – 1864
• Boolean algebra is basic to the
design of digital computer
circuits.
• Bayes, Thomas 1701-1761
• Richard Price 1723 – 1791
• describes the probability of
an event, based on prior
knowledge of conditions that
might be related to the event
• Beyond reasonable doubt
Wikipedia
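
The rule credited above to Bayes and Price can be written out as follows, where A is the event and B is the related condition. The numbers in the worked example are invented for illustration.

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},
\qquad
P(B) = P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)
```

For example, with P(A) = 0.01, P(B | A) = 0.9, and P(B | not A) = 0.05, P(B) = 0.9 × 0.01 + 0.05 × 0.99 = 0.0585, so P(A | B) = 0.009 / 0.0585 ≈ 0.154. Even a strong signal leaves the event unlikely when the prior is small.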

AI Building blocks - NLP
• Symbolic
• 1950, Alan Turing published an article titled "Computing Machinery and
Intelligence"
• Jabberwacky chatbot, 1997
• Statistical
• 1990s machine translation – ERTRANS
• Rule based approaches
• 2000s World Wide Web – HTML
• Neural
• Vectors
• N-grams

“AI”+ GenAI
• Start with enriched content (tagged)
• Tell (feed to) GenAI
• GenAI puts new rules in the inference engine
• Search results get better
• Repeat, repeat, repeat

Understand the data
you are feeding a GenAI
• Identify cancerous skin lesions
in images
• 100% accurate!
https://sites.mitre.org/aifails/turning-lemons-into-lemon/

ChatGPT Static training
• Example of Generative AI – one of MANY
• ChatGPT's training data is two years old
• Took a lot of manpower to train
• Needs constant refreshing
• Retrain the model
• To keep current
• So what is the answer?
• How are the models getting trained / fine tuned?
• A special kind of NLP
• Use to enhance existing data sets

GLAN (Generalized Instruction Tuning)
• Breaks human knowledge into domains, sub-fields, and
disciplines
• Taxonomy is divided into subjects
• syllabus created for each subject (Branch)
• specific essential themes
• GLAN uses these ideas “to produce a variety of instructions that
closely resemble the design of the human educational system”
• Curriculum outline – Like NICEM Knowledge Domain
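
The flow from taxonomy to subject to syllabus to instruction can be sketched as nested loops. This is not the GLAN implementation: `draft_syllabus` is a stand-in for the LLM call that would write a syllabus, and the taxonomy and prompt template are hypothetical.

```python
# Hypothetical taxonomy: discipline -> subjects.
TAXONOMY = {"Mathematics": ["Algebra", "Geometry"]}


def draft_syllabus(subject: str) -> list[str]:
    """Stand-in for an LLM call that would list the key concepts of a subject."""
    canned = {
        "Algebra": ["linear equations", "polynomials"],
        "Geometry": ["triangles", "circles"],
    }
    return canned[subject]


def instruction_prompts(taxonomy: dict[str, list[str]]) -> list[str]:
    """Expand every discipline, subject, and concept into one instruction prompt."""
    prompts = []
    for discipline, subjects in taxonomy.items():
        for subject in subjects:
            for concept in draft_syllabus(subject):
                prompts.append(
                    f"Write a {discipline} exercise for the {subject} unit on {concept}."
                )
    return prompts


prompts = instruction_prompts(TAXONOMY)
```

Adding a subject to `TAXONOMY` produces new instructions without touching the rest, which mirrors the "add a new node" property described on the next slide.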

Flexible, scalable, and all-purpose
approach
• Produces instructions on an enormous scale
• Task-agnostic
• Spanning a wide range of disciplines
• The input taxonomy has been created with minimal human effort through LLM
prompting and verification
• Can add new fields or skills
• Adaptable, the dataset can be expanded and changed without having to start
from scratch
• Wide range of instructions covering every possible combination of human
knowledge and abilities
• Includes coding, logical reasoning, mathematical reasoning, academic tests, and
general instruction
• No need for task-specific training data for these particular tasks
• Add new domains or proficiencies by adding a new node to its taxonomy

HuixiangDou (Baseline)
• Problems with Chat systems using LLM
• Flooding of the system
• Irrelevant responses
• Lack of answer precision
• Answer
• Fine tuning the system
• Continuous updates
• Identifying the key points of problems
• Handling multiple target points simultaneously
• More focused approach to handling queries
• How?
• Keywords from the taxonomy
• Applied as an incoming filter
• Added to content responses
• Constant additions based on logs
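
The "incoming filter" and "constant additions based on logs" bullets can be sketched as a keyword gate that grows over time. This is an illustration of the idea only, not HuixiangDou's code; the keyword set and function names are assumptions.

```python
import re

# Hypothetical taxonomy keywords for an assistant's domain.
DOMAIN_KEYWORDS = {"taxonomy", "thesaurus", "ontology", "indexing"}


def matched_keywords(message: str) -> set[str]:
    """Return the domain keywords that appear in the message."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    return tokens & DOMAIN_KEYWORDS


def should_answer(message: str) -> bool:
    """Incoming filter: pass only messages that mention a domain keyword."""
    return bool(matched_keywords(message))


def learn_from_log(term: str) -> None:
    """Add a keyword that reviewers found in the logs of unanswered questions."""
    DOMAIN_KEYWORDS.add(term.lower())


before = should_answer("How do I export SKOS?")
learn_from_log("SKOS")
after = should_answer("How do I export SKOS?")
```

The same question is ignored before the log review and answered after it, which is how the filter stays current without retraining anything.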

HuixiangDou: A Domain-Specific Knowledge Assistant Powered by Large Language Models
https://www.marktechpost.com/2024/01/31/shanghai-ai-lab-presents-huixiangdou-a-domain-specific-knowledge-assistant-powered-by-large-language-models-llm/

Technology
• It is a tool, not the focus
• Might need shiny new piece of technology
• the technology is generally in the chorus
• not a main character
• Too many companies lead with technology and
• do not spend the time understanding their users or aligning their
strategy
• Any company that has 1000s of SharePoint or Teams sites where
people still can’t find the information they need knows this
• Most large corps have 5 search software systems
• On the shelf
• “Does not work”
• Because the data was not enriched

Organizational model
• Taxonomy and data modeling
• Essential component of this investment
• Data must be
• Well sourced
• Managed
• Maintained
• Essential for the AI
• Both ethical and performance reasons
• Ignore data quality at your peril
• It is hard work
• Does not fit two-week sprint
• Get executives to agree on strategy and structure model
• Without a coherent model, governance, data pipeline, and resourcing there
is no strategic value to an AI initiative

It’s the Data, Stupid!
• Data is their core asset
• Without the data the rest of the initiative is nothing
• It is the essential component of the strategy
• Do metadata enrichment
• SUBJECT metadata
• Use taxonomies, ontologies, and other models
• The large language models will not only mirror but magnify any
problems with the data sets, problems that many organizations may
not realize they have. (Gary Carlson - Factors)

Why a taxonomy?
• Matches your content
• Scales with the content as it increases
• Extensive synonymy – use any of the word term options
• The concept is the unit of thought
• Disambiguation
• Mercury
• Lead
• Built in feedback loops to keep current with content
• Prevents hallucinations
• Misunderstandings of multiple word meanings (Nonsensical output)
• Happens when the model is not trained on your content (Factual contradiction)
• Query goes against the rules of the system (Prompt contradiction)

Why tag / index at all?
• Disambiguation
• Search and retrieval is accurate
• Promote taxonomy-term-first searching
• In the inverted index, search controlled terms first
• Then go to full text if needed
• Use for consistency and integrity in search responses
• Recommendation engines using tag sets not vectors

Why Auto Tagging?
• Fast
• Sub-second versus 70 seconds per tag
• Able to add more tags quickly in the same sub-second time
• More depth
• Always goes to the most specific level of tagging
• No misspellings
• Consistency
• No editorial drift – people tend to use same tags over and over
• Do not need as many subject experts
• Replicable results – no black box

Dump in the data to the AI vortex

Approach A – Hybrid - Leverage Gen AI - 1
• Send question to ChatGPT
• Use autocomplete from the taxonomy
• Find more words and concepts
• Tailored to the specific domain or topic
• Extract the semantic context of the query
• MAI tags the query with taxonomy terms
• Send those concepts to the Generative AI system
• Read the answers from the Generative AI system
• Might submit more than once
• But answer will change each time

Approach A – Hybrid - Enrich the content -2
• Send the same query to your own content
• Use the same terms
• Answer will be consistent since it is on tagged actual text
• Keeps your data out of the LLM and secure
• Use the LLM to get a general answer
• Use your content to get the specific and reliable answer
• Combine the two to get a quick summary of the material

Bludgeon your data

Approach B
• Gather large amount of text
• License the Gen AI of your choice
• Integrate several different systems for optimal results
• OpenAI, Bard, Claude, DALL-E, MidJourney, etc.
• Convert data
• Load to large servers
• Train the data model
• Use series of refining questions
• Clarify the user’s intent interactively
• Convert the text query into a hybrid search query
• Summarize, classify, and display search results in different, easily distinguishable
categories (using an AI classification model), e.g., most relevant answer, most
recent answer, most trustworthy source, etc.

Taxonomy Priority (Semantic) Enrichment

Approach C – Taxonomy Priority – 1 of 2
• Organize and enrich first, then train the data
• Learn where the outliers are
• Allow for new input from outside resources over time
• Taxonomy Development:
• Use existing or create new
• Identify key concepts, categories, and relationships
• Hierarchical taxonomy structure
• Define relationships between different categories
• Represent the semantic connections between concepts
• Get a Taxonomy Tool to create and manage the taxonomy efficiently.
• Tools like Protégé, Data Harmony, or custom-built solutions can be used for this
purpose

Approach C – Taxonomy Priority – 2 of 2
• Data Preprocessing:
• Data Collection – gather documents
• Data Cleaning – remove noise, irrelevant information, and formatting
inconsistencies
• Entity Extraction – extract entities, concepts, and terms from the text
data and link
• Taxonomy Integration:
• Map extracted entities and concepts between text data and the
taxonomy structure
• Index the data using the taxonomy to enable efficient retrieval and
querying
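
Indexing with the taxonomy can be sketched as posting each document under its extracted terms and their broader terms. The hierarchy and the extraction output are hypothetical; the entity extraction step itself is assumed to have already run.

```python
from collections import defaultdict

# Hypothetical hierarchy: narrower term -> broader term.
BROADER = {"Cardiology": "Medicine", "Oncology": "Medicine"}

# Hypothetical output of entity extraction: document id -> taxonomy terms found.
EXTRACTED = {"doc1": ["Cardiology"], "doc2": ["Oncology"], "doc3": ["Medicine"]}

# Inverted index: taxonomy term -> set of document ids.
index = defaultdict(set)
for doc_id, terms in EXTRACTED.items():
    for term in terms:
        node = term
        while node is not None:  # post under the term and each of its ancestors
            index[node].add(doc_id)
            node = BROADER.get(node)


def retrieve(term: str) -> list[str]:
    """Return the ids of documents indexed under a taxonomy term."""
    return sorted(index.get(term, ()))


broad = retrieve("Medicine")
narrow = retrieve("Cardiology")
```

A query on the broad term picks up documents tagged only at the specific level, while a query on the specific term stays precise.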

How Can
Taxonomies Help
LLM?
• Understanding Input
• Content Organization
• Knowledge
Representation
• Query Expansion
• Quality Control

Can Taxonomies
make LLM
Behave?
• Guiding Decision-Making
• Enhancing Understanding
• Improving Consistency
• Facilitating Interpretability
• Supporting Compliance

Thank you for
your attention
Questions?
• Marjorie M.K. Hlava
• Chief Scientist
• Access Innovations, Inc.
• mhlava@accessinn.com
