
This article walks through a research idea: using pre-trained transformer language models (like BERT/GPT) to help build knowledge graphs from text with less manual work. It explains the paper's MaMa (Match & Map) pipeline, then stress-tests it through hands-on experiments—showing what works, what breaks, and where humans still need to tune things. The big takeaway: language models can suggest "facts," but getting reliable graphs still depends heavily on anchor selection, entity linking, and threshold tuning.
The author opens by diving into a recent research paper about building knowledge graphs from text using transformer-based language models. The paper they focus on is:
"Language Models are Open Knowledge Graphs"
What grabs the author immediately is the bold promise that knowledge graphs can be built from a pre-trained model "without human supervision."
"...the authors claim that the 'paper shows how to construct knowledge graphs (KGs) from pre-trained language models (e.g., BERT, GPT-2/3), without human supervision.'"
At the same time, the author admits they've learned to be cautious with claims like that, so they decide to test the pipeline component-by-component to see how "hands-free" it really is—especially on domain-specific text (healthcare).
To keep the rest of the article understandable, the author quickly grounds the basics: a knowledge graph (KG) is essentially a network of connected "things" (entities) and "links" (relations) that represent facts about the world. In plain terms: instead of storing knowledge as paragraphs of text, you store it as structured connections like:
Before introducing the new LM-based approach, the article describes the standard "classic" pipeline researchers and companies often use.
In order, it typically goes like this:
Parse and enrich the text with NLP metadata such as:
Mention detection: identify spans of text that might be entities (like "Robert De Niro" or "prostate cancer").
Entity linking: map each mention to a specific entity in the target knowledge graph (like a Wikidata or Wikipedia concept), resolving ambiguity.
Relation extraction: infer what relationship connects two entities (like "diagnosed with"), often using hand-built features or rules.
Finally, you output triples (subject, predicate, object) that can be added into a graph.
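As a minimal sketch of what that final output can look like in code, the triples can be held as plain tuples and indexed by subject. The `Triple` type, the `build_graph` helper, and the example facts are illustrative assumptions that reuse the mentions named above; they are not taken from the article's code.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def build_graph(triples):
    """Index triples by subject so each node lists its outgoing edges."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.predicate, t.obj))
    return dict(graph)


# Illustrative facts only (hypothetical pipeline output).
triples = [
    Triple("Robert De Niro", "diagnosed with", "prostate cancer"),
    Triple("Robert De Niro", "occupation", "actor"),
]
graph = build_graph(triples)
```

Here `graph["Robert De Niro"]` lists every (predicate, object) pair attached to that entity, which is the structured form the classic pipeline is trying to produce.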
The author highlights the biggest downside: this process is slow, expensive, and very human-dependent.
"The biggest pain point with most of these approaches is the time and effort that they demand from technical specialists who could be data scientists, linguists and knowledge base developers..."
They point out that teams often need weeks or months to craft domain dictionaries, linguistic rules, and logic—and even then they may still miss information because real-world language varies so much.
Next, the article transitions into the featured paper:
"Language Models are Open Knowledge Graphs" by Chenguang Wang, Xiao Liu, Dawn Song.
The author explains the paper's core hypothesis in a simple way:
The paper tests this idea on TAC KBP datasets and Wikidata, and reports promising results for KG population (filling a KG with facts).
The proposed system has two key steps: Match and Map (MaMa).
The big shift from older pipelines is how candidate facts are found:
In simpler terms: attention weights are signals inside transformers that indicate which words "pay attention" to which other words. The paper tries to use those attention patterns to guess which words form a (head, relation, tail) fact.
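To make the idea concrete, here is a toy sketch of searching an attention matrix for the words that connect a head to a tail. The matrix values are invented, and the scoring (sum of attention along a forward path, pruned with a small beam) is a simplifying assumption, not the paper's exact algorithm.

```python
def match_relation(tokens, attn, head, tail, beam_size=2):
    """Beam search from the head token to the tail token.

    attn[i][j] is a toy attention score linking token i to a later token j.
    Each step moves to a later token, and a path's score is the sum of the
    scores along it. Tokens between head and tail on the best finished
    path are read off as the relation.
    """
    beams = [(0.0, [head])]
    finished = []
    while beams:
        candidates = []
        for score, path in beams:
            last = path[-1]
            for nxt in range(last + 1, tail + 1):
                extended = (score + attn[last][nxt], path + [nxt])
                if nxt == tail:
                    finished.append(extended)
                else:
                    candidates.append(extended)
        # Keep only the highest-scoring partial paths.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    best_score, best_path = max(finished, key=lambda c: c[0])
    relation = " ".join(tokens[i] for i in best_path[1:-1])
    return tokens[head], relation, tokens[tail], best_score


# Hypothetical sentence and made-up attention scores.
tokens = ["Dylan", "is", "a", "songwriter"]
attn = [
    [0.0, 0.3, 0.1, 0.1],
    [0.0, 0.0, 0.1, 0.4],
    [0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0],
]
head_text, relation, tail_text, score = match_relation(tokens, attn, 0, 3)
```

With these made-up scores the best path runs through "is", so the candidate fact is (Dylan, is, songwriter). In a real run the matrix would come from the language model's attention layers rather than being written by hand.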
The pipeline then:
The author notes an important reality check: even if the language model helps propose candidate facts, the full pipeline includes many additional moving pieces (spaCy, REL, Flair, thresholds, constraints, etc.), and those choices matter a lot.
Now the article shifts from explaining the paper to actually trying it out.
The author selects:
They begin with the default implementation, then tweak pipeline steps and observe changes.
They also share code links, including the original repository and a fork where they added:
The author first runs the default pipeline:
The result is underwhelming: it finds only one triple:
So, yes—it found a real fact, but it missed most of the other meaningful content.
Next, the author tweaks anchor selection.
Instead of using only noun chunks, they also include named entities from a Named Entity Recognizer (NER), removing overlapping spans. This change produces many more triplets for the same De Niro passage.
However, the author also observes a classic tradeoff: more candidate mentions can create more entity linking errors. They give concrete examples where REL links common phrases to bizarre Wikipedia concepts, like mapping "many people" or "60 years old" to unrelated entities.
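The anchor merge described above (noun chunks plus named entities, with overlaps removed) can be sketched in plain Python. The span format, the `merge_anchors` name, and the rule that named entities win on overlap are assumptions for illustration; the article only says that overlaps are removed. The spans are hand-written stand-ins for parser output.

```python
def merge_anchors(noun_chunks, entities):
    """Combine two span lists, preferring named entities on overlap.

    Each span is a (start, end, text) tuple of token offsets, end exclusive.
    """
    kept = list(entities)
    for chunk in noun_chunks:
        # Two spans overlap when each one starts before the other ends.
        overlaps = any(chunk[0] < e[1] and e[0] < chunk[1] for e in kept)
        if not overlaps:
            kept.append(chunk)
    return sorted(kept)


# Hypothetical spans, not real spaCy output.
noun_chunks = [(0, 3, "Robert De Niro"), (6, 8, "prostate cancer")]
entities = [(0, 3, "Robert De Niro"), (9, 12, "60 years old")]
anchors = merge_anchors(noun_chunks, entities)
```

The merged list has three anchors instead of two. The extra one, "60 years old", is exactly the kind of mention that can later be linked to an unrelated entity. spaCy also provides `spacy.util.filter_spans`, which resolves overlaps by keeping the longest span; that is a different tie-breaking rule from the one assumed here.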
The author then tries the same anchor strategy on a healthcare paragraph about conjunctivitis ("pink eye"), but the default spaCy model still doesn't produce much.
So they switch to scispaCy, which is trained on scientific/biomedical text, and suddenly the extracted triplets increase significantly—though some are incorrect.
They explain two reasons why incorrect triplets appear:
They also note that switching between different scispaCy NER models changes results a lot.
This leads to a major takeaway:
"Observation: It is clear that the same Language Model produced many more triplets when we leveraged a relevant named entity model to pick the anchor terms instead of just the noun chunks or using the default NERs from SpaCy. This shows that it is not easy to just plug and play, and it is necessary for a data scientist/NLP expert to pick the right method upfront to identify anchor text, which may need some experimenting or training."
In other words: anchor detection is not "free." It's a tuning decision that heavily shapes what the system can extract at all.
Next, the article moves to entity linking, which is where mentions get mapped into actual KG nodes.
The author first uses REL, configured with the 2019 Wikipedia corpus (wiki_2019). For common medical concepts like conjunctivitis, it performs reasonably well because those concepts exist in Wikipedia.
But then they test a more technical biomedical sentence:
"Phenylketonuria (PKU) and mild hyperphenylalaninemia (MHP) are allelic disorders caused by mutations in the gene encoding phenylalanine hydroxylase (PAH)."
Using REL, some entities get mapped incorrectly to irrelevant Wikipedia pages like "Mutant_(Marvel_Comics)"—obviously wrong for biomedical text.
When the author switches to the scispaCy entity linker, and maps concepts into a medical KG (they choose UMLS), the disambiguation becomes much more accurate.
They summarize this sharply:
"Observation: Having access to a relevant entity linker can save a lot of time and effort by identifying the right entities in the target KG. It plays a key role in being able to extract as much automatically, leaving less work for humans that would otherwise need to correct entities and relations"
So even if the LM produces decent candidate facts, bad linking can ruin the graph.
Finally, the author explores two tightly connected knobs:
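They discuss the paper's comparison: BERT-like models often work better for relation extraction than GPT-style models because BERT is bidirectional (each token's representation draws on context from both its left and its right, whereas a GPT-style model only sees the tokens to the left).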
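The difference in what each token can see is easiest to picture as an attention mask. The sketch below builds the two masks for a three-token input; the `attention_mask` helper is a hypothetical name used only for illustration.

```python
def attention_mask(n_tokens, causal):
    """Return an n x n 0/1 matrix.

    mask[i][j] == 1 means token i may attend to token j.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


bert_like = attention_mask(3, causal=False)  # every token sees every token
gpt_like = attention_mask(3, causal=True)    # token i sees only tokens 0..i
```

In the causal mask the first token cannot attend to anything after it, so attention between a head entity early in the sentence and a tail entity later on is only available in one direction.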
For conjunctivitis text, the author compares:
They test thresholds:
At the lowest threshold (0.005), both models produce almost identical graphs. But as the threshold increases, differences become clearer—especially in confidence levels.
One subtle but important surprise: raising the threshold doesn't just remove low-confidence triplets; it can actually change scores and alter which triplets come out because of how the beam search scoring behaves with fewer candidates.
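The toy example below illustrates one way such an effect can arise. It is not a description of the repository's code: it assumes the threshold prunes individual search steps, and the attention values are invented so that the two thresholds discussed here (0.005 and 0.05) give different winners.

```python
def forward_paths(head, tail):
    """Yield every strictly increasing index path from head to tail."""
    middle = range(head + 1, tail)
    for mask in range(2 ** len(middle)):
        picked = [m for k, m in enumerate(middle) if (mask >> k) & 1]
        yield [head] + picked + [tail]


def best_path(attn, head, tail, step_threshold=0.0):
    """Best-scoring path whose every step has attention >= step_threshold."""
    best = None
    for path in forward_paths(head, tail):
        steps = [attn[a][b] for a, b in zip(path, path[1:])]
        if min(steps) < step_threshold:
            continue  # this candidate is pruned during the search
        score = sum(steps)
        if best is None or score > best[0]:
            best = (score, path)
    return best


# Made-up attention scores for a four-token sentence.
attn = [
    [0.0, 0.04, 0.2, 0.1],
    [0.0, 0.0, 0.01, 0.5],
    [0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0],
]
low = best_path(attn, 0, 3, step_threshold=0.005)
high = best_path(attn, 0, 3, step_threshold=0.05)
```

At 0.005 the winning path goes through token 1 with a score of about 0.54. At 0.05 the weak first step (0.04) is pruned, so the search settles on a different path through token 2 with a score of about 0.4. A filter applied only to final scores would have kept the first triplet unchanged, which is the assumption that turned out to be wrong.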
They capture that learning here:
"Observation: ...The difference starts showing when we move to a higher threshold of 0.05. The domain specific model ... displayed more confidence overall in its triplets... Initially, I had assumed that the threshold would just remove some existing triplets and keep the rest as is, but that was a wrong assumption..."
And they end with the practical implication:
"Overall, there is a tradeoff between threshold and number of triplets, and it is up to the NLP expert to pick the right threshold, which does change the extracted triplets significantly."
So again: not plug-and-play—thresholding is a real modeling decision.
The author closes on a balanced note. Yes, it's promising that language models can generate candidate facts, but the experiments show you still need careful pipeline design to get useful knowledge graphs out.
They emphasize that success depends on much more than "just run an LM," including:
Overall message: language models can act like "open knowledge graphs," but mining them into clean, correct triples still takes real engineering and domain-aware choices.
The article ends by listing the key resources used: