This document summarizes a presentation on generating metadata by machine from the BEA 2015 conference. It discusses the experiences of the World Bank, IMF, and Trajectory Inc. with using automated processes to generate metadata for books and other publications. The World Bank uses a combination of automated and manual metadata generation depending on the publication. The IMF was able to significantly reduce the time and costs required to generate metadata for over 60,000 publications by using automated systems. Trajectory demonstrated several natural language processing and text analysis techniques its systems use to automatically extract metadata such as keywords and entities, and to produce sentiment analysis and translations from documents.
NOVA Data Science Meetup 1/19/2017 - Presentation 2 - NOVA DATASCIENCE
This document provides an overview of statistical natural language processing (NLP). It begins with introducing the speaker, Mona Diab, and their research interests in NLP. It then discusses the growing amount of digital data being produced and the potential for machines to process and understand human language. However, language is complex with ambiguity, and good NLP solutions require both linguistic and machine learning knowledge. The document outlines some of the goals and challenges of NLP, including resolving ambiguity, and provides examples of NLP applications and techniques like probabilistic models built from language data.
The document discusses tools for analyzing social media discussions around climate change. It describes how natural language processing (NLP) can be used to understand opinions and debates, identify influential users, and analyze how campaigns spread on topics like climate change. However, NLP approaches face challenges with noisy language on social media. The document also provides an example analysis of the Earth Hour campaign on Twitter which found engagement was driven more by activities than climate issues.
Introduction to natural language processing (NLP) - Alia Hamwi
The document provides an introduction to natural language processing (NLP). It defines NLP as a field of artificial intelligence devoted to creating computers that can use natural language as input and output. Some key NLP applications mentioned include data analysis of user-generated content, conversational agents, translation, classification, information retrieval, and summarization. The document also discusses various linguistic levels of analysis like phonology, morphology, syntax, and semantics that involve ambiguity challenges. Common NLP tasks like part-of-speech tagging, named entity recognition, parsing, and information extraction are described. Finally, the document outlines the typical steps in an NLP pipeline including data collection, text cleaning, preprocessing, feature engineering, modeling and evaluation.
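The pipeline steps that summary enumerates (data collection, text cleaning, preprocessing, feature engineering) can be sketched in a few lines of Python; the stopword list and bag-of-words feature below are simplifying assumptions for illustration, not the presentation's actual pipeline.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real pipelines use much fuller lists.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is"}

def clean(text):
    """Text cleaning: lowercase and strip non-letter characters."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

def preprocess(text):
    """Cleaning + tokenization + stopword removal."""
    return [tok for tok in clean(text).split() if tok not in STOPWORDS]

def featurize(tokens):
    """Feature engineering: a bag-of-words count vector."""
    return Counter(tokens)

features = featurize(preprocess("The cat sat on the mat!"))
```

The resulting count vector is the kind of feature a downstream model in the evaluation step would consume.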
Natural Language Processing: L01 introduction - ananth
This presentation introduces the course Natural Language Processing (NLP) by enumerating a number of applications, course positioning, challenges presented by Natural Language text and emerging approaches to topics like word representation.
The document discusses the proposal of a "Ruta Mágica" (Magic Route) between the municipalities of Chignahuapan and Zacatlán in Puebla, both of which have been designated "Pueblos Mágicos" (Magic Towns). It briefly describes the main tourist attractions and history of each region, and argues that, if implemented correctly, the route would benefit both regions by capitalizing on their magic and mysticism.
This document discusses assumptions about skillful teaching and the learning environment. It covers four key points: 1) Skillful teaching helps students learn in different ways; 2) Skillful teachers adopt a critically reflective stance; 3) Teachers need awareness of how students experience learning; and 4) College students are adults. It also discusses understanding the learning environment, both formal and informal settings. Brookfield's techniques for understanding the classroom are presented, like the Critical Incident Questionnaire. The importance of being aware of both physical and virtual learning environment factors is emphasized.
This document describes different types of educational television and their characteristics. It distinguishes cultural, educational, and school television, explaining that educational television seeks to disseminate educational information without being part of the formal education system, while school television shares the objectives of the education system. It also highlights variables such as the viewer's age, the objectives and content to be transmitted, and the stages of producing educational television programs: pre-television, broadcast, and evaluation.
Christopher R. Forbes' resume summarizes his over 21 years of experience in construction project and program management across various industries including private, public, and utilities. He has held roles such as Construction Manager, Vice President of Business Development, Project Manager, Project Executive, and Program Director. His experience spans management of projects ranging from $2 million to $85 million. He possesses expertise in areas such as construction administration, contracts, commissioning, safety, scheduling, cost control, and ensuring compliance.
112 SMART CITY PIÙ SICURE CON LE NUOVE TECNOLOGIE E LE TELECAMERE DI RETE INT... - Cristian Randieri PhD
Several converging factors, including the growing need for urban security, the demand for more accessible public services and wider dissemination of information, and ever greater attention to energy savings and the environment, have pushed large metropolitan areas around the world to rethink how they manage the safety of every citizen according to the most modern Smart City principles.
Full article available at http://www.intellisystem.it/it/portfolio/ss-luglioagosto-2016
The key elements of being a successful entrepreneur include spotting business opportunities, having the capacity and will to lead a viable business venture, and acting quickly in the common interest and in control of the project.
Our journey to digital transformation by Working Out Loud continues with an event dedicated to the subject. Co-hosted with the IABC, this session will discuss how enterprise social networks (ESNs) like Jive and Yammer hold the key to successful employee engagement and advocacy.
The document presents an academic paper on Brazil's economic situation in 2016, covering topics in microeconomics, macroeconomics, and quantitative methods. It was produced by a group of business administration students as a partial requirement for grades in several courses. The paper contains four questions discussing the 2016 Brazilian economic crisis, the concepts of inflation, interest rates, and exchange rates, as well as descriptive measures in quantitative methods.
This was originally presented at BEA 2015. This presentation looks at the experiences of two publishers as they conducted machine indexing projects. It also shows the capabilities of machine indexing today.
An overview of some core concepts in natural language processing, some example (experimental for now!) use cases, and a brief survey of some tools I have explored.
This document provides a summary of the key topics covered in a lecture on natural language processing and information extraction:
1. Natural language processing involves understanding human language through text and speech analysis, as well as generating natural language responses. Some fundamental NLP tasks discussed include parsing, semantic analysis, and information extraction.
2. Information extraction involves segmenting text into entities and relationships, and then classifying and clustering these extracted elements to populate a structured database. Examples of information extraction applications to different types of text are described.
3. The challenges of ambiguity from lexical, syntactic, semantic and pragmatic sources are discussed as a major hurdle for natural language understanding systems to overcome. Different theories of semantic representation are also summarized.
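The segment-then-classify flow described in point 2 can be illustrated with a toy Python sketch; the regex segmenter and the tiny gazetteer below are invented for illustration, not taken from the lecture.

```python
import re

# Toy gazetteer for the classification step; the entries are illustrative.
GAZETTEER = {
    "World Bank": "ORG",
    "IMF": "ORG",
    "Mona Diab": "PERSON",
}

def segment_entities(text):
    """Segment: runs of capitalized words stand in for a trained segmenter."""
    return re.findall(r"\b[A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*", text)

def classify(candidate):
    """Classify: strip a leading determiner, then look up the gazetteer."""
    name = candidate.removeprefix("The ")
    return name, GAZETTEER.get(name, "UNK")

sentence = "The World Bank and the IMF asked Mona Diab to present."
entities = [classify(c) for c in segment_entities(sentence)]
```

The classified pairs are what would be loaded into the structured database the summary mentions; a production extractor would of course use statistical models rather than these rules.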
This document summarizes an online presentation about Eaagle tools for text mining. Eaagle provides software and services to help organizations analyze and categorize unstructured text data from sources like surveys, forums, social media, and customer comments. It can analyze thousands of responses within seconds without requiring predefined categories or taxonomies. The software automatically identifies relevant topics and words and provides visualizations and reports to help users discover insights from text data. Major clients include research firms, media companies, and large corporations who have reduced the time and costs of text analysis by 50% or more using Eaagle.
Agile Data Rationalization for Operational Intelligence - Inside Analysis
The Briefing Room with Eric Kavanagh and Phasic Systems
Live Webcast Mar. 26, 2013
The complexity of today's information architectures creates a wide range of challenges for executives trying to get a strategic view of their current operations. The data and context locked in operational systems often get diluted during the normalization processes of data warehousing and other types of analytic solutions. And the ultimate goal of seeing the big picture gets derailed by a basic inability to reconcile disparate organizational views of key information assets and rules.
Register for this episode of The Briefing Room to learn from Bloor Group CEO Eric Kavanagh, who will explain how a tightly controlled methodology can be combined with modern NoSQL technology to resolve both process and system complexities, thus enabling a much richer, more interconnected information landscape. Kavanagh will be briefed by Geoffrey Malafsky of Phasic Systems who will share his company's tested methodology for capturing and managing the business and process logic that run today's data-driven organizations. He'll demonstrate how a “don't say no” approach to entity definitions can dissolve previously intractable disagreements, opening the door to clear, verifiable operational intelligence.
Visit: http://www.insideanalysis.com
The content starts with why Text Analytics needs a special session on convincing the boss, followed by a role play summarizing current mistakes, a sample elevator pitch for your boss, and a proposed execution plan. The content is tailored for mid- to senior-level managers trying to convince leaders, executives, and department heads. It doesn't provide any technical details: methodologies, tools, vendors, or hardware investments.
This was presented at Text Analytics West Summit 2014 in San Francisco. Questions? Reach out to Ramkumar Ravichandran on LinkedIn.
Slides from Enterprise Search & Analytics Meetup @ Cisco Systems - http://www.meetup.com/Enterprise-Search-and-Analytics-Meetup/events/220742081/
Relevancy and Search Quality Analysis - By Mark David and Avi Rappoport
The Manifold Path to Search Quality
To achieve accurate search results, we must come to an understanding of the three pillars involved.
1. Understand your data
2. Understand your customers’ intent
3. Understand your search engine
The first path passes through Data Analysis and Text Processing.
The second passes through Query Processing, Log Analysis, and Result Presentation.
Everything learned from those explorations feeds into the final path of Relevancy Ranking.
Search quality is focused on end users finding what they want; technical relevance is sometimes irrelevant! Working with the short head (very frequent queries) has the highest return on investment for improving the search experience: tuning the results, for example, to emphasize recent documents or de-emphasize archive documents, detecting near-duplicates, exposing diverse results in ambiguous situations, using synonyms, and guiding search via best bets and auto-suggest. Long-tail analysis can reveal user intent by detecting patterns, discovering related terms, and identifying the most fruitful results through aggregated behavior. All of this feeds back into regression testing, which provides reliable metrics to evaluate the changes.
By merging these insights, you can improve the quality of the search overall, in a scalable and maintainable fashion.
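The short-head/long-tail split described above can be sketched directly from a raw query log; the `head_share` threshold and the sample log below are assumptions for illustration, not values from the talk.

```python
from collections import Counter

def split_head_tail(queries, head_share=0.5):
    """Split unique queries into the 'short head' covering head_share of
    total query volume and the remaining 'long tail'."""
    counts = Counter(queries)
    total = sum(counts.values())
    head, covered = [], 0
    for query, n in counts.most_common():
        if covered >= head_share * total:
            break
        head.append(query)
        covered += n
    tail = [q for q in counts if q not in head]
    return head, tail

# A made-up query log: "ipad" dominates volume, the rest form the tail.
log = ["ipad", "ipad", "ipad", "iphone", "iphone", "warranty claim form"]
head, tail = split_head_tail(log)
```

Tuning effort goes to the head queries first; the tail is mined for patterns and related terms as the summary describes.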
How to Use Artificial Intelligence by Microsoft Product Manager - Product School
The talk focused on the fundamentals of Product Management, leveraging the speaker's personal experiences in the AI field. It covered core Product Manager topics such as managing customer needs, business goals, and technology feasibility (the holy trinity of the Product Manager discipline), delved into data analyses, rapid experimentation, and execution, and finally explored the challenges of customer privacy, bias, and inclusivity in AI products.
This document discusses using R for Twitter data analytics. It outlines the basics of Twitter data analytics using R, including collecting real-time Twitter data, text mining techniques for Twitter data, and sentiment analysis. Some key steps involved are exploring the Twitter corpus, preprocessing the text by removing stopwords and stemming words, creating a document-term matrix, and calculating TF-IDF weights. Cosine similarity is used to measure similarity between text documents. The goal is to extract useful patterns and insights from large amounts of Twitter data in real-time.
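The summary describes an R workflow, but the document-term weighting and cosine-similarity steps it lists can be sketched language-agnostically. This minimal Python version (plain `tf * log(N/df)` weights, no smoothing) is an illustration, not the presentation's code; production work would use R's tm package or scikit-learn.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors with tf = count / doc length and idf = log(N / df)."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log(N / df[t])
                        for t, c in counts.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["nlp text mining", "nlp text analysis", "football scores today"]
vecs = tfidf_vectors(docs)
```

As expected, the two NLP-themed documents score closer to each other than either does to the unrelated one.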
This document summarizes a presentation about expanding the use of DITA (Darwin Information Typing Architecture) beyond technical publications. It discusses how organizations should focus on content strategy before implementing new technologies. The presentation examines the value of semantic markup for the enterprise and several non-traditional DITA projects. It also provides background on the presenter and their company, which helps organizations improve information usability through content strategy, architecture, transformation, and tools selection.
Beyond the Symbols: A 30-minute Overview of NLP - MENGSAYLOEM1
This presentation delves into the world of Natural Language Processing (NLP), exploring its goal to make human language understandable to machines. The complexities of language, such as ambiguity and complex structures, are highlighted as major challenges. The talk underscores the evolution of NLP through deep learning methodologies, leading to a new era defined by large-scale language models. However, obstacles like low-resource languages and ethical issues including bias and hallucination are acknowledged as enduring challenges in the field. Overall, the presentation provides a condensed, yet comprehensive view of NLP's accomplishments and ongoing hurdles.
This document discusses how Life Technologies innovated and optimized their translation processes through the use of cloud technology. It summarizes how they established a cross-functional leadership group to define translation policies and strategies, conducted research to identify translation needs and feasibility, and implemented a cloud-based system to standardize workflows and enable local teams and vendors. This allowed Life Technologies to gain efficiencies, improve processes, and better measure the return on investment of their translation activities.
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana... - Dr. Haxel Consult
Advances in text mining, analytics and machine learning are transforming our applications and enabling ever more powerful capabilities, yet most applications and platforms are designed to deal with a single (normalized) language. Hence, as our applications and platforms are increasingly required to ingest international content, the challenge becomes finding ways to normalize content to a single language without compromising quality. An extension of this question is how we define quality in this context and what, if any, by-products a localization effort can produce that may enhance the usefulness of the application.
This talk will, using patent searching as an example use case, review the challenges and possible solution approaches for handling localization effectively and will show what current emerging technology offers, what to expect and what not to expect and provide an introductory practical guide to handling localization in the context of data mining and analytics.
Explore the power of Natural Language Processing (NLP) and Data Science in uncovering valuable insights from Flipkart product reviews. This presentation delves into the methodology, tools, and techniques used to analyze customer sentiments, identify trends, and extract actionable intelligence from a vast sea of textual data. From understanding customer preferences to improving product offerings, discover how NLP Data Science is revolutionizing the way businesses leverage consumer feedback on Flipkart. Visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Automatic and rapid generation of massive knowledge repositories from data - SIKM
The document discusses the use of automated data-driven synthesis (DDS) to rapidly generate massive knowledge repositories directly from data. DDS uses a compiler-based approach to synthesize large quantities of high-quality, semantically linked content at low cost. This overcomes limitations of manual knowledge repository construction, including high costs, errors, and limited scale. DDS can synthesize a wide range of knowledge structures, from simple data records and reports to complex digital libraries and knowledge repositories.
Text analysis and Semantic Search with GATE - Diana Maynard
This document provides an outline for a tutorial on text analysis with GATE (General Architecture for Text Engineering). The tutorial covers topics such as natural language processing, information extraction, social media analysis, semantic search, semantic annotation, and example applications that use GATE like news analysis and patent analysis. It also discusses NLP components for text mining like entity recognition, relation extraction, event recognition, and summarization. Finally, it introduces GATE as an NLP toolkit, its main components, and its built-in information extraction system called ANNIE.
Similar to BEA 2015 Generating Metadata by Machine Final (20)
2. Presenters
Moderator
• Pat Payton, Senior Manager Publisher Relations, Bowker
Speakers
• Randi Park, Publishing Officer, The World Bank
• Hassan Zaidi, Digital Publishing Officer, International Monetary Fund
• Jim Bryant, CEO, Trajectory Inc.
3. Terminology
• Automated or Machine Indexing
– Process of assigning index terms against a set
vocabulary or taxonomy without human intervention
– Full text or bibliographic records
– Multiple vocabularies/rule sets allow for complex text
analysis
• Optical Character Recognition (OCR)
– Machine conversion of an image to text
– PDF of book content
• Extensible Markup Language (XML)
– Set of rules for encoding documents
– Both machine readable and human readable
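Automated indexing as defined above, assigning terms from a set vocabulary without human intervention, can be sketched minimally; the two-term vocabulary and its variant lists below are invented examples, not the Bank's taxonomy or rule sets.

```python
# Invented two-term controlled vocabulary with surface variants; a real
# deployment would use the institution's taxonomy and rule sets.
VOCABULARY = {
    "poverty reduction": ["poverty reduction", "reduce poverty", "reducing poverty"],
    "anti-corruption": ["anti-corruption", "anticorruption"],
}

def index_document(text, vocabulary=VOCABULARY):
    """Assign controlled-vocabulary terms whose variants occur in the text."""
    lowered = text.lower()
    return [term for term, variants in vocabulary.items()
            if any(v in lowered for v in variants)]

sample = "New anticorruption measures aim at reducing poverty in the region."
terms = index_document(sample)
```

Multiple vocabularies or rule sets, as the slide notes, would simply mean running several such term maps over the same full text or bibliographic record.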
5. ABOUT THE WORLD BANK
• The World Bank Group is the world’s largest
source of funding and technical assistance for
developing countries.
• Through its five institutions, the Bank Group
partners with developing countries to reduce
poverty, increase economic growth, and
improve the quality of life.
• Comprised of 188 member countries with
offices in 120 countries around the world.
Our Twin Goals
End Extreme Poverty within a Generation &
Boost Shared Prosperity
6. Like other publishers in some respects but...
• Publishing arm of a larger institution, with institutional
imperatives
• Open access
o Dissemination trumps revenue
• Research is performed by in-house economists and experts in
other fields, by development practitioners working on the ground,
and by external contributors.
• Our publishing outputs are meant to enrich the development
debate, inform policies, and support the development goals of our
client countries.
We are a “Knowledge Bank”
The World Bank is the largest source of development knowledge
12. Metadata strategy
Primary Purpose
• Supports user-centered
discovery in WB electronic
products
• Semantic fields often exposed
and browseable
• Complemented by full-text
search and filtering
• Book, chapter and article level
abstracts, topics, regions,
countries, keywords
• Books do not inherit chapter
semantics
Secondary Re-purpose
• Search and discovery services
• Aggregators
• Retail sales channels, both print
and electronic
13. Our experience with machine-generated metadata
Set up
• Customized our enterprise system as much as was practical
Pros
• Reasonable solution when
there is a huge corpus
• Fast throughput
• Inexpensive to run after labor-
intensive set up
• PDF source for extraction of
topics, subtopics, countries,
regions, keywords
• XML output easily
transformed
Cons
• Set up effort/cost
• Inconsistent use of keyword
terms, depending on how
they were used in the text
anti-corruption/anticorruption
decision-making/decision making
policy-making/policy making
• Abstracts must be written by
humans
• False hits due to footnotes,
references, names, etc.
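The keyword-inconsistency problem listed above (anti-corruption vs. anticorruption, decision-making vs. decision making) is usually handled by normalizing spelling variants before terms are indexed. A toy sketch of that normalization step (the function names are illustrative, not from any actual World Bank system):

```python
import re

def normalize_keyword(keyword):
    """Collapse hyphen/space spelling variants to one canonical key, so
    'anti-corruption' and 'anticorruption' index identically."""
    return re.sub(r"[\s\-]+", "", keyword.lower())

def canonical(keywords):
    """Group raw keywords by normalized form; keep the first spelling seen."""
    seen = {}
    for kw in keywords:
        seen.setdefault(normalize_keyword(kw), kw)
    return list(seen.values())

raw = ["anti-corruption", "anticorruption", "decision-making",
       "decision making", "policy making"]
print(canonical(raw))  # → ['anti-corruption', 'decision-making', 'policy making']
```

This removes duplicates from extracted term lists but does not decide which spelling is "correct"; that still requires an editorial choice, which is one reason the set-up effort is labor-intensive.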
14.
15. Present workflow – human generated
Pros
• Book and chapter level
including abstracts
• Able to manage keyword
vocabulary using pick-lists
with additions as needed
• More accurate, author
provides book level draft, EP
team does sense check
• New rules and terms can be
added any time with little set-
up
Cons
• Cost per book/chapter
• Capacity
• Inconsistencies between
legacy (edited machine-
generated) and newer content
to be addressed
• Single version of keywords
may not be ideal for all
channels (i.e., more keywords
for discovery services)
16. Future
• Interested in using technology to improve
discovery for direct users and in discovery
services
• Full text XML and ePub available for indexing
• Institutional need to implement new taxonomy
and full text search for over 200k documents
18. Introduction: IMF Publications
Objectives: Establish digital publishing program 2010-2011
• New IMF eLibrary
• Digital distribution
• Digital production
• New metadata management system
• Create metadata to a granular level (chapters and articles) ***
21. New Challenges – New Solutions
Manual vs. Machine
•Metadata quality
•Time factor
•Cost of labor comparison
Challenge: Cataloging to a granular level (keywords,
countries, topics and sub-topics)
22. New challenges – New solutions
Do the Math
IMF example:
• 12,000 titles containing 60,000 chapters/articles (assumes an
average of 5 per title),
• 15 minutes to catalog each chapter/article with keywords etc.,
• 60,000 × 15 minutes = 15,000 hours; 15,000 hours/40 (per week) hours = 375 weeks
• 375 weeks/52 = about 7 years of work for one cataloger.
If you pay just $30 per hour to a cataloger, the overall cost would be
$450,000. Not to mention new content is being created daily.
Automation allows us to slash the time it takes to catalog our
content, saving us time and money.
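The arithmetic on this slide can be reproduced directly:

```python
titles = 12_000
chapters = titles * 5                 # ~60,000 chapters/articles
minutes_each = 15                     # time to catalog one chapter/article
hours = chapters * minutes_each / 60  # 15,000 hours of cataloging
weeks = hours / 40                    # 375 work weeks
years = weeks / 52                    # about 7 years for one cataloger
cost = hours * 30                     # $450,000 at $30/hour
print(hours, weeks, round(years, 1), cost)
```

Note the 7-year figure assumes a single cataloger working on a frozen corpus; as the slide says, new content arrives daily, so the backlog would never actually clear without automation.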
30. Simple Search - Type a word or phrase into the
search bar at the top of every page…
…or Advanced Search allows
multiple concepts and filters
31. Search within results to search
within publications using a single
word or phrase.
Select Content Type (Books and
Journals/Chapters and Articles),
Countries/Region, Topics,
Languages, or Date.
Type a word in the Starts with box
to go to the first title that begins
with the word.
Sort by Title, Date, Source or
Author.
Change the number of Items per
page.
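The search-within-results and facet filters described above amount to progressively narrowing one result set. A minimal in-memory sketch, using hypothetical records (not actual IMF eLibrary data or code):

```python
# Hypothetical records mimicking eLibrary facets.
records = [
    {"title": "Fiscal Policy in Asia", "type": "Book", "country": "Japan", "year": 2010},
    {"title": "Article IV: Japan", "type": "Article", "country": "Japan", "year": 2011},
    {"title": "World Economic Outlook", "type": "Book", "country": None, "year": 2011},
]

def search(items, phrase=None, **filters):
    """Narrow a result set by phrase, then by facet filters (type, country, year)."""
    out = [r for r in items if phrase is None or phrase.lower() in r["title"].lower()]
    for field, value in filters.items():
        out = [r for r in out if r[field] == value]
    return out

hits = search(records, phrase="japan")   # simple search
hits = search(hits, type="Article")      # search within results + content-type facet
print([r["title"] for r in hits])        # → ['Article IV: Japan']
```

Because each call takes a result list as input, "search within results" is just re-running the same function over the previous hits, which is exactly the interaction the slide describes.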
Keywords
32. Read on screen
in HTML
Read on a
variety of
devices
Citation
tools
Click on a title from the results page to go to the publication
landing page.
37. • New IMF eLibrary was delivered in March 2011
• Digital distribution: Distribute IMF contents to 35 channels
in various digital formats
• Digital production: Have an established workflow to
generate XML based contents, ePubs, Mobi and PDF ebooks
• New metadata management system. MetaLogic is a full
functioning metadata management system
• Create metadata to a granular level (all chapters and
articles have individual metadata) ***
38. ™
THIS INFORMATION IS PROVIDED IN CONFIDENCE AND MAY NOT BE DISCLOSED TO ANY
THIRD PARTY OR USED FOR ANY OTHER PURPOSE WITHOUT THE EXPRESS WRITTEN PERMISSION OF TRAJECTORY, INC.
Generating Metadata By Machine
BEA May 29, 2015 11:30 – 12:20
39. Natural Language Processing: Processing & Analysis
Natural language analysis tools process English language text input, transforming
each sentence into data that can be used for search and analysis.
Identify the base forms of words.
Identify parts of speech.
Identify names of companies, people, places, etc.
Describe the structure of sentences in terms of phrases and word dependencies.
Indicate which noun phrases refer to the same entities.
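In practice these steps are performed by NLP toolkits (e.g. Stanford CoreNLP or spaCy, each of which covers lemmas, parts of speech, named entities, parse structure and coreference). As a self-contained toy illustration of just the "names of companies, people, places" step, the heuristic below pulls out non-sentence-initial runs of capitalized words; real entity recognizers are statistical models, not regexes:

```python
import re

def naive_entities(sentence):
    """Very rough stand-in for named-entity recognition: collect runs of
    capitalized words that do not start the sentence (toy heuristic only)."""
    matches = re.finditer(r"\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\b", sentence)
    return [m.group(1) for m in matches if m.start() != 0]

print(naive_entities("The author Jim Bryant met the World Bank team in Beijing."))
# → ['Jim Bryant', 'World Bank', 'Beijing']
```

The heuristic cannot classify entities (person vs. company vs. place) or resolve which noun phrases co-refer; those are the harder steps the slide lists, and they are what the full analysis tools provide.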
40. Attributes/Entities that Characterize A Book
41. Sentiment: Analyzing the Words Within the Book
Each word is given a numeric value based on its subjective meaning. "Positive" words range on a positive scale; "negative" words range on a negative scale:
• "Outstanding" words (5): breathtaking, thrilled, superb
• "Wow" words (4): winning, stunning
• "Happy" words (3): praise, marvelous, impressive
• "Welcome" words (2): strengthen, rich, funky
• "Yes" words (1): validate, safe, adequate
• "No" words (-1): numb, provoke, pushy
• "Upset" words (-2): worthless, travesty, threaten
• "Terrible" words (-3): woeful, worsen, kill
• "Damned" words (-4): torture, fraud, (unmentionables)
• "Catastrophic" words (-5): hell, rape, (more unmentionables)
Trajectory's Analytics Engine uses these values to compute the book's sentiment curve across sentence, paragraph, chapter and entire book. This sentiment "fingerprint" at an aggregate level yields a unique picture of the book.
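The sentence-level sentiment curve can be sketched with a toy version of the -5..+5 scale above; the lexicon here is a tiny illustrative sample, not Trajectory's actual word list or engine:

```python
# Toy lexicon on the -5..+5 scale (values illustrative).
LEXICON = {"superb": 5, "stunning": 4, "impressive": 3, "rich": 2,
           "safe": 1, "pushy": -1, "threaten": -2, "kill": -3,
           "fraud": -4, "hell": -5}

def sentence_score(sentence):
    """Sum lexicon values for the words in one sentence."""
    return sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in sentence.split())

def sentiment_curve(text):
    """Score each sentence in turn to trace the text's sentiment 'fingerprint'."""
    sentences = [s for s in text.split(".") if s.strip()]
    return [sentence_score(s) for s in sentences]

print(sentiment_curve("The view was superb and stunning. The fraud felt like hell."))
# → [9, -9]
```

Aggregating these per-sentence scores up to paragraphs, chapters and the whole book is what produces the fingerprint shown on the following slides.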
42. Sentiment: Analyzing the Words Within the Book
43. Sentiment: Analyzing the Words Within the Book
44. Trajectory Index
45. Keyword Analysis and Comparison
46. Keyword Translation into Local Languages
47. Recommendations
48. Thank You
2015 BEA – BOOTH 1347
United States:
50 Doaks Lane
Marblehead, Massachusetts
01945 United States
info@trajectory.com
www.trajectory.com
China:
No. 3, 8 ChuangYe Road
Haidian District,
Beijing, China 100085
49. Q & A
Generating Metadata by Machine
BEA 2015
Friday, May 29, 11:30-12:20
Room 1E10