HiTiME project
1. HiTiME project description
Christian Roosendaal (christian.roosendaal@gmail.com)
Vyacheslav Tykhonov (vty@iisg.nl)
HiTiME system developers, IISH Amsterdam
2. HiTiME prototype data flow
Diagram components: Source data, CMS (Drupal, WordPress, …), Input DB, NER, Knowledge Base. Numbered steps 1 to 7 connect them in the original diagram.
NER training sets from IISH archives
Entity Recognize module
● Retrieve document tokens
● Send to NER by telnet
● If token is a recognized entity → store in DB
Processing module
● Check for new documents
● Split into words
● Store in DB
Meanings module
● Look for sequences of entities
● Replace with known composite entities
3. IISH systems integration
OCR application
● Scans, posters, archives
● Create training sets
LINKS
● Database with 8000+ professions
Evergreen library system
● Create training sets for authority records
● Improve MARC21 records
HiTiME application
- Persons
- Organizations
- Locations
- Dates
- Professions
PID service
Knowledge base
● Store entities
● Export data to e.g. RDF, OWL, XML
External applications
● BWSA
● Timeline
● Visual Mets
Search (search.iisg.nl)
● Improve metadata
● Extend functionality with new filters
Clio-Infrastructure
● Infrastructure to store data from different systems
● Connect dates and locations with datasets
● Find relevant documents in time/location domain
● Visualize trends relevant to documents
4. System design
● HiTiME core checks for new or updated documents in the input database
● The input database can be any type of database with timestamps
Flow: Input data → HiTiME core → KB
Input:
doc_id      last_modified   data
Document 1  12-13-12 12:04  “Petrus Alma is great...”
Document 2  12-13-12 11:37  “...”
Output:
doc_id      last_modified   data
Document 1  12-13-12 12:04  “<person>Petrus Alma</person> is great...”
Document 2  12-13-12 11:37  “...”
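The check for new or updated documents can be sketched as a timestamp query. This is a minimal illustration, not the HiTiME implementation: the table name `documents`, the function `fetch_changed_documents`, the use of SQLite, and the ISO-style timestamps (reading "12-13-12" as 13 December 2012) are all assumptions made for the example.

```python
import sqlite3


def fetch_changed_documents(conn, last_run):
    """Return rows whose last_modified is later than the previous run.

    Assumes ISO-formatted timestamps, so string comparison matches
    chronological order.
    """
    cur = conn.execute(
        "SELECT doc_id, last_modified, data FROM documents "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_run,),
    )
    return cur.fetchall()


# Hypothetical input database with the two documents from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents "
    "(doc_id INTEGER PRIMARY KEY, last_modified TEXT, data TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        (1, "2012-12-13 12:04", "Petrus Alma is great..."),
        (2, "2012-12-13 11:37", "..."),
    ],
)

# Only documents modified after the last run are picked up.
changed = fetch_changed_documents(conn, "2012-12-13 12:00")
```

Because only a timestamp column is needed, the same query shape should carry over to other input databases.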
5. Database design (1/2)
Example string: “Petrus Alma is great”
Split text into words and store words separately in table:
doc_id  word_id  word
0       0        Petrus
0       1        Alma
0       2        is
0       3        great
Store coordinates of each word in coordinate table:
doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        0
0       0            1         1        0
0       0            2         2        0
0       0            3         3        0
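The two tables above can be filled with a few lines of code. The sketch below is only an illustration: the table names `words` and `coordinates`, the helper names, and the crude split on "." and whitespace are assumptions, chosen to mirror the column layout shown on this slide.

```python
import sqlite3


def create_tables(conn):
    # Column names follow the slide; the table names are hypothetical.
    conn.execute(
        "CREATE TABLE words (doc_id INTEGER, word_id INTEGER, word TEXT)"
    )
    conn.execute(
        "CREATE TABLE coordinates (doc_id INTEGER, sentence_id INTEGER, "
        "position INTEGER, word_id INTEGER, meaning_flag INTEGER, "
        "identity_id INTEGER)"
    )


def store_document(conn, doc_id, text):
    """Split text into sentences and words and store each word once."""
    word_id = 0
    # Crude sentence split on '.'; a real tokenizer would be more reliable.
    sentences = [s for s in text.split(".") if s.strip()]
    for sentence_id, sentence in enumerate(sentences):
        for position, word in enumerate(sentence.split()):
            conn.execute(
                "INSERT INTO words VALUES (?, ?, ?)",
                (doc_id, word_id, word),
            )
            # meaning_flag starts at 0, identity_id is not yet known.
            conn.execute(
                "INSERT INTO coordinates VALUES (?, ?, ?, ?, 0, NULL)",
                (doc_id, sentence_id, position, word_id),
            )
            word_id += 1


conn = sqlite3.connect(":memory:")
create_tables(conn)
store_document(conn, 0, "Petrus Alma is great")
words = conn.execute(
    "SELECT doc_id, word_id, word FROM words ORDER BY word_id"
).fetchall()
coordinates = conn.execute(
    "SELECT * FROM coordinates ORDER BY word_id"
).fetchall()
```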
6. Database design (2/2)
Processing of text by NER.
Output of NER:
“Petrus” → PERS
“Alma” → PERS
“is” → 0
“great” → 0
Store in decision table:
word_id  NER   Frog  Heidel  UCTO  Decision
0        PERS                      PERS
1        PERS                      PERS
Update meaning_flag in coordinate table:
doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        1
0       0            1         1        1
0       0            2         2        0
0       0            3         3        0
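The step from NER output to the updated meaning_flag can be sketched as follows. The table name `decisions`, the function `record_ner_output`, and the convention that the tag "0" means "no entity" are assumptions for this example; as in the prototype, only the NER column is filled and the decision simply copies it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE coordinates (doc_id INTEGER, sentence_id INTEGER, "
    "position INTEGER, word_id INTEGER, meaning_flag INTEGER, "
    "identity_id INTEGER)"
)
conn.execute(
    "CREATE TABLE decisions (word_id INTEGER, ner TEXT, frog TEXT, "
    "heidel TEXT, ucto TEXT, decision TEXT)"
)
# Four words of "Petrus Alma is great", all with meaning_flag = 0.
conn.executemany(
    "INSERT INTO coordinates VALUES (0, 0, ?, ?, 0, NULL)",
    [(i, i) for i in range(4)],
)


def record_ner_output(conn, tagged):
    """Store recognized entities and flag their words.

    tagged is a list of (word_id, tag); the tag "0" means no entity.
    """
    for word_id, tag in tagged:
        if tag != "0":
            conn.execute(
                "INSERT INTO decisions (word_id, ner, decision) "
                "VALUES (?, ?, ?)",
                (word_id, tag, tag),
            )
    # Flag every word that has a decision.
    conn.execute(
        "UPDATE coordinates SET meaning_flag = 1 WHERE word_id IN "
        "(SELECT word_id FROM decisions WHERE decision IS NOT NULL)"
    )


record_ner_output(conn, [(0, "PERS"), (1, "PERS"), (2, "0"), (3, "0")])
flags = [
    row[0]
    for row in conn.execute(
        "SELECT meaning_flag FROM coordinates ORDER BY word_id"
    )
]
```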
7. Improvement: Integration of FROG, UCTO and HeidelTime
● The prototype uses only NER, plus crude methods to split raw text into sentences and words
● Splitting can be made more reliable with UCTO and FROG
● Time expressions are not recognized in the prototype → HeidelTime
8. Improvement: Disambiguation of recognized entities (1/2)
Word       NER  Frog  Heidel  ...  Decision
Amsterdam  LOC                     LOC
Amsterdam is a location. This seems right, but what if the text refers to the VOC ship “Amsterdam”?
9. Improvement: Disambiguation of recognized entities (2/2)
NER can be trained to improve accuracy. By using differently trained NER models we can build an Expert System:
Word       NER  Frog  Heidel  NER2  NER3  Decision
Amsterdam  LOC                SHIP  BAND  ?
The final decision can be based on the priorities of the trained models.
Our idea is to assign the lowest priorities to wide-scope models.
Ships
Amsterdam (VOC ship), an 18th century cargo ship
MS Amsterdam, a cruise ship owned and operated by Holland America Line
Music
Amsterdam (band), a pop band from the United Kingdom
"Amsterdam" (Jacques Brel song), a song by Jacques Brel
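A priority-based final decision could look like the sketch below. It is one possible reading of the idea, not the HiTiME design: the priority values in `MODEL_PRIORITY`, the function `decide`, and the assumption that NER2 is a ship-specific model and NER3 a music-specific model are all hypothetical.

```python
# Hypothetical priorities: a higher number wins. The general, wide-scope
# NER model gets the lowest priority; the narrow models rank above it.
MODEL_PRIORITY = {"NER": 1, "NER2": 3, "NER3": 2}


def decide(votes, priority=MODEL_PRIORITY):
    """Return the label of the highest-priority model that gave a label.

    votes maps a model name to its label, or to None when that model
    did not recognize the word. Returns None if no model gave a label.
    """
    labelled = {model: label for model, label in votes.items() if label}
    if not labelled:
        return None
    best_model = max(labelled, key=lambda model: priority[model])
    return labelled[best_model]


# All three models recognize "Amsterdam"; the narrow ship model wins.
decision_all = decide({"NER": "LOC", "NER2": "SHIP", "NER3": "BAND"})

# Only the general model recognizes the word; its label is used.
decision_general = decide({"NER": "LOC", "NER2": None, "NER3": None})
```

With these assumed priorities, a narrow model overrides the general one only when it actually recognizes the word, so ordinary mentions of the city still resolve to LOC.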
10. Improvement: “composite” entities (1/2)
In our prototype:
“Petrus Alma is great”
“Petrus” is recognized as a person; “Alma” is recognized as a separate person.
Should be:
“Petrus Alma is great”
“Petrus Alma” is recognized as one person.
11. Improvement: “composite” entities (2/2)
Possible solution: keep track of known entities in a separate entities table.
Search for sequences of recognized entities in coordinate table:
doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        1             0
0       0            1         1        1             0
0       0            2         2        0
0       0            3         3        0
Sequence found: “Petrus Alma”
Compare these sequences with entities in entities table:
identity_id  name              type
0            Petrus Alma       PERS
1            Aron van Dam      PERS
2            Frederik Feringa  PERS
Final decision about entity:
identity_id  name         type
0            Petrus Alma  PERS
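The lookup described on this slide can be sketched in a few lines. The function names `find_entity_sequences` and `resolve`, the in-memory `KNOWN_ENTITIES` dictionary standing in for the entities table, and the exact-match comparison are assumptions made for illustration.

```python
from itertools import groupby

# Stand-in for the entities table: name -> (identity_id, type).
KNOWN_ENTITIES = {
    "Petrus Alma": (0, "PERS"),
    "Aron van Dam": (1, "PERS"),
    "Frederik Feringa": (2, "PERS"),
}


def find_entity_sequences(rows):
    """Find runs of consecutive flagged words within one sentence.

    rows is a list of (sentence_id, position, word, meaning_flag),
    sorted by sentence_id and position.
    Returns a list of (sentence_id, start_position, words).
    """
    sequences = []
    by_sentence_and_flag = groupby(rows, key=lambda r: (r[0], r[3]))
    for (sentence_id, flag), group in by_sentence_and_flag:
        run = list(group)
        if flag == 1:
            sequences.append((sentence_id, run[0][1], [r[2] for r in run]))
    return sequences


def resolve(sequences, entities=KNOWN_ENTITIES):
    """Match each sequence against the known entities by exact name."""
    resolved = []
    for _sentence_id, _start, words in sequences:
        name = " ".join(words)
        if name in entities:
            identity_id, entity_type = entities[name]
            resolved.append((identity_id, name, entity_type))
    return resolved


rows = [
    (0, 0, "Petrus", 1),
    (0, 1, "Alma", 1),
    (0, 2, "is", 0),
    (0, 3, "great", 0),
]
sequences = find_entity_sequences(rows)
final_decision = resolve(sequences)
```

Exact matching is the simplest choice here; sequences that only partly match a known name would need extra handling that this sketch leaves out.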