AI Summarized Content

Language Models are Open Knowledge Graphs .. but are hard to mine! | Towards Data Science

This article walks through a research idea: using pre-trained transformer language models (like BERT/GPT) to help build knowledge graphs from text with less manual work. It explains the paper's MaMa (Match & Map) pipeline, then stress-tests it through hands-on experiments—showing what works, what breaks, and where humans still need to tune things. The big takeaway: language models can suggest "facts," but getting reliable graphs still depends heavily on anchor selection, entity linking, and threshold tuning.

1. Getting started: why this paper is exciting (and suspicious) 🙂

The author opens by diving into a recent research paper about building knowledge graphs from text using transformer-based language models. The paper they focus on is:

"Language Models are Open Knowledge Graphs"

What grabs the author immediately is the bold promise that knowledge graphs can be built from a pre-trained model "without human supervision."

"...the authors claim that the 'paper shows how to construct knowledge graphs (KGs) from pre-trained language models (e.g., BERT, GPT-2/3), without human supervision.'"

At the same time, the author admits they've learned to be cautious with claims like that, so they decide to test the pipeline component-by-component to see how "hands-free" it really is—especially on domain-specific text (healthcare).

To keep the rest of the article understandable, the author quickly grounds the basics: a knowledge graph (KG) is essentially a network of connected "things" (entities) and "links" (relations) that represent facts about the world. In plain terms: instead of storing knowledge as paragraphs of text, you store it as structured connections like:

(Robert De Niro) — diagnosed with → (prostate cancer)
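To make the "structured connections" idea concrete, here is a minimal sketch of a triple store in Python. The `Triple` and `TinyGraph` names are invented for illustration and are not taken from the article or the paper; the second fact is added only to show a second edge.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


class TinyGraph:
    """Minimal in-memory triple store indexed by head entity (illustrative only)."""

    def __init__(self) -> None:
        self._by_head = defaultdict(list)

    def add(self, head: str, relation: str, tail: str) -> None:
        self._by_head[head].append(Triple(head, relation, tail))

    def facts_about(self, head: str) -> list:
        # .get avoids creating empty entries for unknown entities
        return list(self._by_head.get(head, []))


graph = TinyGraph()
graph.add("Robert De Niro", "diagnosed with", "prostate cancer")
graph.add("Robert De Niro", "occupation", "actor")  # extra illustrative fact, not from the article

for fact in graph.facts_about("Robert De Niro"):
    print(f"({fact.head}) -[{fact.relation}]-> ({fact.tail})")
```

Real knowledge graphs add entity identifiers, typed relations, and provenance, but the head/relation/tail shape stays the same.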

2. The usual industry pipeline for building knowledge graphs from text

Before introducing the new LM-based approach, the article describes the standard "classic" pipeline researchers and companies often use.

In order, the steps typically are:

Parse and enrich the text with NLP metadata such as:
- POS tags (part-of-speech labels like noun/verb)
- Named entities (people, places, organizations, medical terms, etc.)
- Dependency parse trees (who depends on whom grammatically)
Mention detection: identify spans of text that might be entities (like "Robert De Niro" or "prostate cancer").
Entity linking: map each mention to a specific entity in the target knowledge graph (like a Wikidata or Wikipedia concept), resolving ambiguity.
Relation extraction: infer what relationship connects two entities (like "diagnosed with"), often using hand-built features or rules.

Finally, you output triples (subject, predicate, object) that can be added into a graph.
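The classic steps above can be sketched with nothing but dictionaries and regular expressions. This is a toy, assuming a hand-written gazetteer and a single relation pattern (both hypothetical); it stands in for the much larger rule sets a real team would maintain, which is exactly the cost the article goes on to describe.

```python
import re

# Hypothetical hand-built resources: the kind of artefact the classic
# pipeline needs specialists to write and maintain.
GAZETTEER = {
    "robert de niro": "Robert_De_Niro",
    "prostate cancer": "Prostate_cancer",
}
RELATION_PATTERNS = [
    (re.compile(r"\b(?:was|is) diagnosed with\b"), "diagnosed_with"),
]


def detect_and_link(text):
    """Mention detection plus entity linking via dictionary lookup.

    Returns (start, end, entity_id) tuples sorted by position.
    """
    lowered = text.lower()
    found = []
    for surface, entity_id in GAZETTEER.items():
        for match in re.finditer(re.escape(surface), lowered):
            found.append((match.start(), match.end(), entity_id))
    return sorted(found)


def extract_triples(text):
    """Relation extraction: match patterns in the text between adjacent mentions."""
    lowered = text.lower()
    mentions = detect_and_link(text)
    triples = []
    for (_, left_end, head), (right_start, _, tail) in zip(mentions, mentions[1:]):
        between = lowered[left_end:right_start]
        for pattern, relation in RELATION_PATTERNS:
            if pattern.search(between):
                triples.append((head, relation, tail))
    return triples


sentence = "Robert De Niro was diagnosed with prostate cancer."
print(extract_triples(sentence))
```

Any phrasing the pattern does not anticipate ("received a diagnosis of", "is battling") yields no triple, which illustrates why rule-based extraction tends to miss information as language varies.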

The author highlights the biggest downside: this process is slow, expensive, and very human-dependent.

"The biggest pain point with most of these approaches is the time and effort that they demand from technical specialists who could be data scientists, linguists and knowledge base developers..."

They point out that teams often need weeks or months to craft domain dictionaries, linguistic rules, and logic—and even then they may still miss information because real-world language varies so much.

3. The paper's idea: language models as "open knowledge graphs"

Next, the article transitions into the featured paper:

"Language Models are Open Knowledge Graphs" by Chenguang Wang, Xiao Liu, Dawn Song.

The author explains the paper's core hypothesis in a simple way:

Transformer language models (BERT/GPT-style) already absorb a lot of world knowledge during training.
That stored knowledge can be "mined" back out as structured facts.

The paper tests this idea on TAC KBP datasets and Wikidata, and reports promising results for KG population (filling a KG with facts).

3.1 The MaMa approach: Match & Map

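The proposed system has two key steps, Match and Map (MaMa): Match proposes candidate facts from the text using the language model, and Map links those candidates to entities and relations in a knowledge graph.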

The big shift from older pipelines is how candidate facts are found:

Instead of relying heavily on hand-built linguistic features, the method uses the model's attention mechanisms to infer candidate triples.

In simpler terms: attention weights are signals inside transformers that indicate which words "pay attention" to which other words. The paper tries to use those attention patterns to guess which words form a (head, relation, tail) fact.
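A rough way to picture this is a beam search over an attention matrix. The sketch below is loosely inspired by that idea and is not the paper's implementation: the attention values are made up, the tokens are pre-chunked by hand, and the scoring rule (sum of attention weights along a left-to-right path from head to tail) is a simplifying assumption.

```python
def beam_search_relations(tokens, attention, head_idx, tail_idx, beam_width=2):
    """Find high-scoring token paths from head to tail.

    attention[i][j] is a made-up weight for token i attending to token j.
    A path only moves left to right; its score is the sum of the attention
    weights along the hops it takes. Tokens visited strictly between head
    and tail form the candidate relation phrase.
    """
    beams = [(0.0, [head_idx])]  # (score so far, visited token indices)
    finished = []
    while beams:
        expanded = []
        for score, path in beams:
            current = path[-1]
            for nxt in range(current + 1, tail_idx + 1):
                # weight of the next token attending back to the current one
                candidate = (score + attention[nxt][current], path + [nxt])
                if nxt == tail_idx:
                    finished.append(candidate)
                else:
                    expanded.append(candidate)
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]  # keep only the best partial paths

    results = []
    for score, path in sorted(finished, key=lambda item: item[0], reverse=True):
        relation = " ".join(tokens[i] for i in path[1:-1])
        if relation:  # skip direct head-to-tail hops that carry no relation
            results.append((tokens[head_idx], relation, tokens[tail_idx], round(score, 3)))
    return results


tokens = ["De Niro", "was", "diagnosed", "with", "cancer"]
attention = [  # hypothetical weights, rows attend to columns
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.0, 0.0, 0.0],
    [0.05, 0.05, 0.3, 0.0, 0.0],
    [0.1, 0.05, 0.35, 0.2, 0.0],
]
results = beam_search_relations(tokens, attention, head_idx=0, tail_idx=4)
for head, relation, tail, score in results:
    print(head, "|", relation, "|", tail, "|", score)
```

Note that a plain sum favours longer paths, so any real system needs further constraints and filtering on top of the raw search; that is one reason the later threshold discussion matters.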

The pipeline then:

selects anchor terms (like noun chunks from spaCy),
generates candidate triples via a scoring/beam-search process,
and uses an entity linker (like REL) to map mentions into a real KG (like Wikidata).

The author notes an important reality check: even if the language model helps propose candidate facts, the full pipeline includes many additional moving pieces (spaCy, REL, Flair, thresholds, constraints, etc.), and those choices matter a lot.

4. The experiment setup: what the author tests

Now the article shifts from explaining the paper to actually trying it out.

The author selects:

one example that overlaps general + healthcare content (a Robert De Niro cancer-related passage),
and one example that is more explicitly healthcare-focused (like conjunctivitis).

They begin with the default implementation, then tweak pipeline steps and observe changes.

They also share code links, including the original repository and a fork where they added:

a Dash wrapper for visualization,
support for scispaCy (healthcare-oriented NLP),
and the ability to swap among Hugging Face transformer models.

5. Experiment results, step-by-step (what changes actually helped)

5.1 Out-of-the-box: very few facts extracted

The author first runs the default pipeline:

spaCy en_core_web_md for noun chunks (anchors)
bert-large-cased as the language model
REL for entity linking to a Wikidata-based KG
Input: a multi-sentence paragraph about Robert De Niro surviving prostate cancer

The result is underwhelming: it finds only one triple:

head: Robert_De_Niro
relation: diagnose
tail: Prostate_cancer
confidence: 0.16

So, yes—it found a real fact, but it missed most of the other meaningful content.

5.2 Adding Named Entities as anchors: more triples, but more noise too

Next, the author tweaks anchor selection.

Instead of using only noun chunks, they also include named entities from a Named Entity Recognizer (NER) (removing overlaps). This change produces many more triplets for the same De Niro passage.
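Combining the two anchor sources while removing overlaps can be sketched as a simple span merge. The priority rule here (named entities win, noun chunks fill the gaps) is an assumption for illustration; the summary does not say which source the author's fork prefers. The sentence and spans are made up rather than produced by spaCy.

```python
def find_span(text, surface):
    """Return a half-open (start, end, surface) character span for the first match."""
    start = text.index(surface)
    return (start, start + len(surface), surface)


def merge_anchors(named_entities, noun_chunks):
    """Combine two lists of (start, end, text) spans without overlaps.

    Named entities are considered first; a noun chunk is kept only if it
    does not overlap a span that has already been accepted.
    """
    accepted = []
    for span in list(named_entities) + list(noun_chunks):
        start, end, _ = span
        overlaps = any(start < a_end and a_start < end for a_start, a_end, _ in accepted)
        if not overlaps:
            accepted.append(span)
    return sorted(accepted)  # restore document order


text = "Robert De Niro survived prostate cancer at 60 years old."  # made-up sentence
noun_chunks = [find_span(text, "Robert De Niro"), find_span(text, "prostate cancer")]
named_entities = [find_span(text, "Robert De Niro"), find_span(text, "60 years old")]

anchors = merge_anchors(named_entities, noun_chunks)
print([surface for _, _, surface in anchors])
```

The merged list contains more anchors than either source alone, including phrases such as "60 years old" that are not real-world entities, so every extra anchor is also an extra chance for the entity linker to go wrong.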

However, the author also observes a classic tradeoff: more candidate mentions can create more entity linking errors. They give concrete examples where REL links common phrases to bizarre Wikipedia concepts, like mapping "many people" or "60 years old" to unrelated entities.

The author then tries the same anchor strategy on a healthcare paragraph about conjunctivitis ("pink eye"), but the default spaCy model still doesn't produce much.

So they switch to scispaCy, which is trained on scientific/biomedical text, and suddenly the extracted triplets increase significantly—though some are incorrect.

They explain two reasons why incorrect triplets appear:

The triplet filtering threshold is very low by default (0.005), so many weak relations get through.
Confidence depends on the language model's training exposure—general LMs may not "know" specialized medical terms well.

They also note that switching between different scispaCy NER models changes results a lot.

This leads to a major takeaway:

"Observation: It is clear that the same Language Model produced many more triplets when we leveraged a relevant named entity model to pick the anchor terms instead of just the noun chunks or using the default NERs from SpaCy. This shows that it is not easy to just plug and play, and it is necessary for a data scientist/NLP expert to pick the right method upfront to identify anchor text, which may need some experimenting or training."

In other words: anchor detection is not "free." It's a tuning decision that heavily shapes what the system can extract at all.

5.3 Choosing the right entity linker: general Wikipedia vs medical knowledge bases

Next, the article moves to entity linking, which is where mentions get mapped into actual KG nodes.

The author first uses REL, configured with a Wikipedia corpus year (wiki_2019). For common medical concepts like conjunctivitis, it performs reasonably well because those concepts exist in Wikipedia.

But then they test a more technical biomedical sentence:

"Phenylketonuria (PKU) and mild hyperphenylalaninemia (MHP) are allelic disorders caused by mutations in the gene encoding phenylalanine hydroxylase (PAH)."

Using REL, some entities get mapped incorrectly to irrelevant Wikipedia pages like "Mutant_(Marvel_Comics)"—obviously wrong for biomedical text.

When the author switches to the scispaCy entity linker, and maps concepts into a medical KG (they choose UMLS), the disambiguation becomes much more accurate.
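The effect of the linker's target inventory can be illustrated with a toy dictionary-based linker. Everything in this sketch is hypothetical: the candidate tables, the prior probabilities, and the `DOMAIN:` identifiers are invented, and neither REL nor the scispaCy linker works this simply. It only shows why the same mention can resolve differently depending on which knowledge base supplies the candidates.

```python
# Invented candidate tables: surface form -> [(entity id, prior probability)].
GENERAL_KB = {
    "mutations": [("Mutant_(Marvel_Comics)", 0.6), ("Mutation", 0.4)],
    "conjunctivitis": [("Conjunctivitis", 1.0)],
}
DOMAIN_KB = {
    "mutations": [("DOMAIN:Mutation", 1.0)],
    "conjunctivitis": [("DOMAIN:Conjunctivitis", 1.0)],
    "phenylketonuria": [("DOMAIN:Phenylketonuria", 1.0)],
}


def link(mention, kb, min_prior=0.5):
    """Pick the candidate with the highest prior, or None if nothing is confident."""
    candidates = kb.get(mention.lower(), [])
    if not candidates:
        return None
    entity_id, prior = max(candidates, key=lambda candidate: candidate[1])
    return entity_id if prior >= min_prior else None


for mention in ("mutations", "conjunctivitis", "Phenylketonuria"):
    print(mention, "->", link(mention, GENERAL_KB), "|", link(mention, DOMAIN_KB))
```

In this toy, the common concept links acceptably in both inventories, while the technical mentions are either mislinked or missing in the general one, which mirrors the pattern the author reports.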

They summarize this sharply:

"Observation: Having access to a relevant entity linker can save a lot of time and effort by identifying the right entities in the target KG. It plays a key role in being able to extract as much automatically, leaving less work for humans that would otherwise need to correct entities and relations"

So even if the LM produces decent candidate facts, bad linking can ruin the graph.

5.4 Thresholds and language models: why filtering changes the graph more than expected

Finally, the author explores two tightly connected knobs:

the language model choice
the triplet confidence threshold (used to filter which relations are kept)

They discuss the paper's comparison: BERT-like models often work better for relation extraction than GPT-style models because BERT is bidirectional (it learns using context on both sides).

For conjunctivitis text, the author compares:

bert-large-cased (general)
BlueBERT (domain-specific biomedical model, available through Hugging Face)

They test thresholds:

0.005
0.05
0.1

At the lowest threshold (0.005), both models produce almost identical graphs. But as the threshold increases, differences become clearer—especially in confidence levels.

One subtle but important surprise: raising the threshold doesn't just remove low-confidence triplets; it can actually change scores and alter which triplets come out because of how the beam search scoring behaves with fewer candidates.
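For contrast, the naive mental model the author says they started with treats the threshold as a filter applied after extraction. The sketch below implements only that baseline, with placeholder triples and scores; it does not reproduce the score shifts seen in the real pipeline, where the threshold interacts with the search itself.

```python
def filter_triples(triples, threshold):
    """Keep (head, relation, tail, confidence) tuples at or above the threshold."""
    return [triple for triple in triples if triple[3] >= threshold]


# Placeholder candidates with made-up confidences, not results from the article.
candidates = [
    ("Entity_A", "relation_x", "Entity_B", 0.16),
    ("Entity_C", "relation_y", "Entity_D", 0.07),
    ("Entity_E", "relation_z", "Entity_F", 0.008),
]

counts = []
for threshold in (0.005, 0.05, 0.1):
    kept = filter_triples(candidates, threshold)
    counts.append(len(kept))
    print(f"threshold={threshold}: {len(kept)} triples kept")
```

Under this model the surviving triples keep their original scores at every threshold. The author's experiments show the actual implementation does not behave this way, so thresholds need to be re-evaluated end to end rather than applied to a saved list of triples.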

They capture that learning here:

"Observation: ...The difference starts showing when we move to a higher threshold of 0.05. The domain specific model ... displayed more confidence overall in its triplets... Initially, I had assumed that the threshold would just remove some existing triplets and keep the rest as is, but that was a wrong assumption..."

And they end with the practical implication:

"Overall, there is a tradeoff between threshold and number of triplets, and it is up to the NLP expert to pick the right threshold, which does change the extracted triplets significantly."

So again: not plug-and-play—thresholding is a real modeling decision.

6. Conclusion: language models help, but the pipeline still needs expert tuning

The author closes on a balanced note. Yes, it's promising that language models can generate candidate facts, but the experiments show you still need careful pipeline design to get useful knowledge graphs out.

They emphasize that success depends on much more than "just run an LM," including:

Identifying the right anchor terms (noun chunks vs named entities, and domain-specific NERs)
Constraining relation types (the author notes this exists but wasn't deeply covered earlier)
Using a relevant entity linker (Wikipedia-based vs medical KG-based like UMLS)
Tuning thresholds that strongly affect which triples appear—and even their scores

Overall message: language models can act like "open knowledge graphs," but mining them into clean, correct triples still takes real engineering and domain-aware choices.

7. References and data sources

The article ends by listing the key resources used:

Original paper PDF and code repo
Hugging Face transformers
REL (paper + GitHub)
scispaCy (GitHub + NER article)
Dash/Plotly visualization resources
Data sources including conjunctivitis info pages, UMLS downloads, and a Kaggle medical transcriptions dataset

Summary completed: 5/18/2026, 4:27:39 PM
