Generating Metadata by
Machine
BEA 2015
Friday, May 29, 11:30-12:20
Room 1E10
Presenters
Moderator
• Pat Payton, Senior Manager Publisher Relations, Bowker
Speakers
• Randi Park, Publishing Officer, The World Bank
• Hassan Zaidi, Digital Publishing Officer, International Monetary Fund
• Jim Bryant, CEO, Trajectory Inc.
Terminology
• Automated or Machine Indexing
– Process of assigning index terms against a set
vocabulary or taxonomy without human intervention
– Full text or bibliographic records
– Multiple vocabularies/rule sets allow for complex text
analysis
• Optical Character Recognition (OCR)
– Machine conversion of an image to text
– PDF of book content
• Extensible Markup Language (XML)
– Set of rules for encoding documents
– Both machine readable and human readable
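As a rough illustration of machine indexing against a set vocabulary, the sketch below matches text against a small controlled vocabulary and returns the preferred terms. The vocabulary, the surface forms, and the function name index_text are hypothetical and chosen only for illustration; they are not taken from any of the presenters' systems.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> surface forms to look for.
VOCABULARY = {
    "Poverty Reduction": ["poverty reduction", "reduce poverty"],
    "Economic Growth": ["economic growth"],
    "Education": ["education", "schooling"],
}

def index_text(text, vocabulary=VOCABULARY):
    """Return the sorted preferred terms whose surface forms occur in the text."""
    lowered = text.lower()
    hits = set()
    for preferred, forms in vocabulary.items():
        for form in forms:
            # Word boundaries keep a form from matching inside a longer word.
            if re.search(r"\b" + re.escape(form) + r"\b", lowered):
                hits.add(preferred)
                break
    return sorted(hits)

sample = "The report links economic growth to efforts to reduce poverty."
print(index_text(sample))  # ['Economic Growth', 'Poverty Reduction']
```

A production indexer would add rule sets (for example, ignoring footnotes and reference lists) on top of simple matching like this.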
Experience with semantic
metadata creation
Randi Park
Rpark@worldbankgroup.org
WORLD BANK PUBLICATIONS
ABOUT THE WORLD BANK
• The World Bank Group is the world’s largest
source of funding and technical assistance for
developing countries.
• Through its five institutions, the Bank Group
partners with developing countries to reduce
poverty, increase economic growth, and
improve the quality of life.
• Comprises 188 member countries, with
offices in 120 countries around the world.
Our Twin Goals
End Extreme Poverty within a Generation &
Boost Shared Prosperity
Like other publishers in some respects, but...
• Publishing arm of a larger institution, with institutional
imperatives
• Open access
– Dissemination trumps revenue
• Research is performed by in-house economists and experts in
other fields, by development practitioners working on the ground,
and by external contributors.
• Our publishing outputs are meant to enrich the development
debate, inform policies, and support the development goals of our
client countries.
We are a “Knowledge Bank”
The World Bank is the largest source of development knowledge
Popular Annuals and Flagships
Two platforms: The World Bank eLibrary and the Open Knowledge Repository (OKR)
Mobile applications
Topics we cover = 29
• Plus 5 Regions, Countries and Keywords
Metadata strategy
Primary Purpose
• Supports user-centered
discovery in WB electronic
products
• Semantic fields often exposed
and browseable
• Complemented by full-text
search and filtering
• Book, chapter and article level
abstracts, topics, regions,
countries, keywords
• Books do not inherit chapter
semantics
Secondary Re-purpose
• Search and discovery services
• Aggregators
• Retail sales channels, both print
and electronic
Our experience with machine-generated
metadata
Set up
• Customized our enterprise system as much as was practical
Pros
• Reasonable solution when
there is a huge corpus
• Fast throughput
• Inexpensive to run after
labor-intensive setup
• PDF source for extraction of
topics, subtopics, countries,
regions, keywords
• XML output easily
transformed
Cons
• Set up effort/cost
• Inconsistent use of keyword
terms, depending on how
they were used in the text
– anti-corruption/anticorruption
– decision-making/decision making
– policy-making/policy making
• Abstracts must be written by
humans
• False hits due to footnotes,
references, names, etc.
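The keyword inconsistency listed above (hyphenated versus unhyphenated spellings) can be reduced with a normalization pass after extraction. The following is a minimal sketch, assuming a hypothetical pick-list of preferred spellings; the names PICK_LIST, variant_key, and normalize_keywords are illustrative and do not come from the World Bank's system.

```python
import re

# Hypothetical pick-list of preferred keyword spellings.
PICK_LIST = ["anticorruption", "decision making", "policy making"]

def variant_key(term):
    """Collapse case, hyphens, and whitespace so spelling variants compare equal."""
    return re.sub(r"[\s\-]+", "", term.lower())

# Map each collapsed key to its preferred spelling.
PREFERRED = {variant_key(term): term for term in PICK_LIST}

def normalize_keywords(extracted):
    """Replace known variants with the preferred spelling; drop duplicates, keep order."""
    result = []
    for term in extracted:
        preferred = PREFERRED.get(variant_key(term), term)
        if preferred not in result:
            result.append(preferred)
    return result

raw = ["anti-corruption", "Anticorruption", "decision-making", "policy making", "governance"]
print(normalize_keywords(raw))
# ['anticorruption', 'decision making', 'policy making', 'governance']
```

Terms that are not on the pick-list pass through unchanged, so a human reviewer can still decide whether to add them.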
Present workflow – human-generated
Pros
• Book and chapter level
including abstracts
• Able to manage keyword
vocabulary using pick-lists
with additions as needed
• More accurate: author
provides book-level draft, EP
team does sense check
• New rules and terms can be
added any time with little
setup
Cons
• Cost per book/chapter
• Capacity
• Inconsistencies between
legacy (edited
machine-generated) and newer content
to be addressed
• Single version of keywords
may not be ideal for all
channels (i.e., more keywords
for discovery services)
Future
• Interested in using technology to improve
discovery for direct users and in discovery
services
• Full text XML and ePub available for indexing
• Institutional need to implement new taxonomy
and full text search for over 200k documents
Introduction: IMF Publications
Objectives: Establish digital publishing program 2010-2011
• New IMF eLibrary
• Digital distribution
• Digital production
• New metadata management system
• Create metadata to a granular level (chapters and articles) ***
Digitization and Metadata Challenges
2010-2011
New Challenges – New Solutions
Manual vs. Machine
• Metadata quality
• Time factor
• Cost of labor comparison
Challenge: Cataloging to a granular level (keywords,
countries, topics and sub-topics)
Do the Math
IMF example:
• 12,000 titles containing 60,000 chapters/articles (assumes an
average of 5 per title),
• 15 minutes to catalog each chapter/article with keywords, etc.:
60,000 × 15 minutes = 15,000 hours,
• 15,000 hours / 40 hours per week = 375 weeks,
• 375 weeks / 52 ≈ 7 years of work for one cataloger.
If you pay just $30 per hour to a cataloger, the overall cost would be
$450,000 (15,000 hours × $30). Not to mention new content is being created daily.
Automation allows us to slash the time it takes to catalog our
content, saving us time and money.
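The arithmetic on this slide can be checked with a few lines of Python. The figures are the ones quoted above; the function name cataloging_estimate and its parameter names are illustrative only.

```python
def cataloging_estimate(titles=12_000, items_per_title=5, minutes_per_item=15,
                        hours_per_week=40, weeks_per_year=52, hourly_rate_usd=30):
    """Estimate manual cataloging effort and cost from the slide's assumptions."""
    items = titles * items_per_title          # chapters/articles to catalog
    hours = items * minutes_per_item / 60     # total cataloging hours
    weeks = hours / hours_per_week            # full-time weeks for one cataloger
    years = weeks / weeks_per_year            # full-time years for one cataloger
    cost = hours * hourly_rate_usd            # total labor cost in USD
    return {"items": items, "hours": hours, "weeks": weeks,
            "years": years, "cost_usd": cost}

estimate = cataloging_estimate()
print(estimate["items"])            # 60000
print(estimate["hours"])            # 15000.0
print(estimate["weeks"])            # 375.0
print(round(estimate["years"], 1))  # 7.2
print(estimate["cost_usd"])         # 450000.0
```

Changing minutes_per_item or hourly_rate_usd shows how sensitive the estimate is to the assumed cataloging speed and pay rate.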
Machine in Action
Results on eLibrary
Super keywords or
specific subjects
Browsing the IMF eLibrary
Browse by Topics
Simple Search - Type a word or phrase into the
search bar at the top of every page…
…or Advanced Search allows
multiple concepts and filters
Search within results to search
within publications using a single
word or phrase.
Select Content Type (Books and
Journals/Chapters and Articles),
Countries/Region, Topics,
Languages, or Date.
Type a word in the Starts with box
to go to the first title that begins
with the word.
Sort by Title, Date, Source or
Author.
Change the number of Items per
page.
Keywords
Read on screen
in HTML
Read on a
variety of
devices
Citation
tools
Click on a title from the results page to go to the publication
landing page.
Related documents
• New IMF eLibrary was delivered in March 2011
• Digital distribution: Distribute IMF contents to 35 channels
in various digital formats
• Digital production: Have an established workflow to
generate XML based contents, ePubs, Mobi and PDF ebooks
• New metadata management system: MetaLogic is a fully
functioning metadata management system
• Create metadata to a granular level (all chapters and
articles have individual ) ***
THIS INFORMATION IS PROVIDED IN CONFIDENCE AND MAY NOT BE DISCLOSED TO ANY
THIRD PARTY OR USED FOR ANY OTHER PURPOSE WITHOUT THE EXPRESS WRITTEN PERMISSION OF TRAJECTORY, INC.
Generating Metadata By Machine
BEA May 29, 2015 11:30 – 12:20
Natural Language Processing: Processing & Analysis
Natural language analysis tools process English language text input, transforming
each sentence into data that can be used for search and analysis.
• Identify the base forms of words.
• Identify parts of speech.
• Identify names of companies, people, places, etc.
• Describe the structure of sentences in terms of phrases and word dependencies.
• Indicate which noun phrases refer to the same entities.
Attributes/Entities that Characterize A Book
Sentiment: Analyzing the Words Within the Book
• “Outstanding” words (5): breathtaking, thrilled, superb
• “Wow” words (4): winning, stunning
• “Happy” words (3): praise, marvelous, impressive
• “Welcome” words (2): strengthen, rich, funky
• “Yes” words (1): validate, safe, adequate
• “No” words (-1): numb, provoke, pushy
• “Upset” words (-2): worthless, travesty, threaten
• “Terrible” words (-3): woeful, worsen, kill
• “Damned” words (-4): torture, fraud, (unmentionables)
• “Catastrophic” words (-5): hell, rape, (more unmentionables)
Each word is given a numeric value
based on its subjective meaning.
“Positive” words range on a positive
scale; “Negative” words range on a
negative scale.
Trajectory’s Analytics Engine uses
these values to compute the book’s
sentiment curve across sentence,
paragraph, chapter and entire book.
This sentiment “fingerprint” at an
aggregate level yields a unique
picture of the book.
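The word-level scoring described on this slide can be sketched in a few lines of Python. This is not Trajectory's Analytics Engine, whose internals are not described in the deck; it is a minimal illustration that assumes a simple sum of word values per sentence and a mean across sentences as the aggregate. The lexicon is a subset of the words shown on the slide, and the function names are hypothetical.

```python
import re

# Subset of the word values shown on the slide (word -> score from -5 to 5).
LEXICON = {
    "breathtaking": 5, "thrilled": 5, "superb": 5,
    "winning": 4, "stunning": 4,
    "praise": 3, "marvelous": 3, "impressive": 3,
    "strengthen": 2, "rich": 2, "funky": 2,
    "validate": 1, "safe": 1, "adequate": 1,
    "numb": -1, "provoke": -1, "pushy": -1,
    "worthless": -2, "travesty": -2, "threaten": -2,
    "woeful": -3, "worsen": -3, "kill": -3,
    "torture": -4, "fraud": -4,
}

def sentence_score(sentence):
    """Sum the lexicon values of the words in one sentence (unknown words score 0)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(LEXICON.get(word, 0) for word in words)

def sentiment_curve(text):
    """Return one score per sentence, in reading order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [sentence_score(s) for s in sentences]

def aggregate(scores):
    """Assumed aggregate: the mean sentence score (0.0 for empty input)."""
    return sum(scores) / len(scores) if scores else 0.0

chapter = ("The view was breathtaking and the food was superb. "
           "Then the fraud came to light. "
           "Conditions may worsen, but the town is safe.")
curve = sentiment_curve(chapter)
print(curve)  # [10, -4, -2]
print(round(aggregate(curve), 2))  # 1.33
```

Applying the same aggregation at paragraph, chapter, and book level would give the coarser views of the curve that the slide mentions.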
Trajectory Index
Keyword Analysis and Comparison
Keyword Translation into Local Languages
Recommendations
Thank You
BEA 2015 – Booth 1347
United States:
50 Doaks Lane
Marblehead, Massachusetts
01945 United States
info@trajectory.com
www.trajectory.com
China:
No. 3, 8 ChuangYe Road
Haidian District,
Beijing, China 100085
Q & A
