- The document summarizes a talk on text analytics of 2 million documents. It discusses extracting keywords from large datasets efficiently using cloud computing resources and parallel processing. It provides examples of extracting keywords from a scientific paper dataset and compares results to human indexers. The talk outlines steps to estimate processing time, understand the data, and leverage cloud infrastructure to speed up keyword extraction at scale.
E-TEXT in E-FL : FOUR FLAVOURS
Dr. Przemysław Kaszubski : IFAConc - web-concordancing with EAP writing students
Mgr Joanna Jendryczka-Wierszycka : E-text annotation - why bother?
Dr. Michał Remiszewski : Towards competence mapping in language teaching/learning
Prof. Włodzimierz Sobkowiak : E-text in Second Life: reification of text?
[ http://ifa.amu.edu.pl/fa/node/1144 ]
[ http://ifa.amu.edu.pl/fa/node/1123 ]
Gender Classification of Blog Authors: With Feature Engineering and Deep Learning (Saurav Jha)
In this paper, we present two approaches to automatically classifying the gender of blog authors. The first is a manual feature-extraction system incorporating two novel feature classes, variable-length character sequence patterns and thirteen new word classes, along with an added class of surface features. The second is a first-ever application of a memory variant of Recurrent Neural Networks, i.e. Bidirectional Long Short Term Memory networks (BLSTMs), to this task. We report results on two blog data sets: the first is a well-explored set used by the previous state-of-the-art models, while the other is a 20 times larger corpus. For the first system, we use a vote over machine learning classifiers to obtain improved accuracy with respect to previous feature-mining systems on the former data set. Using our second approach, we show that the accuracy obtained with such deep LSTMs is comparable to the current state-of-the-art deep learning system for gender classification. Finally, we carry out a comparative study of the performance of both systems on the two data sets.
Integrating IT assets is a problem that all companies face. The challenge is not just in integrating the technologies; it is in selecting the right tools for the job. In this webinar, Ken Vollmer, Principal Analyst at Forrester Research, talks about the evolution of how companies approach integration and the factors that should be considered in selecting a tool. ( http://www.softwareag.com )
How do industry trends like cloud computing, DevOps, internet-of-things, mobility, and wearables impact application integration? This presentation looks at some considerations for integration architects.
Text mining is a technique that helps users find useful information in large collections of text documents on the web or in databases. Most popular text mining and classification methods have adopted term-based approaches; pattern-based methods describe user preferences. This review paper analyses how text mining works at three levels: the sentence level, the document level and the feature level. We review related prior work and discuss the problems that arise when text mining is done at the feature level. The paper presents a text-mining technique for compound sentences.
AI-SDV 2021: Jay ven Eman - implementation-of-new-technology-within-a-big-pha... (Dr. Haxel Consult)
Synonyms break search! How? Why is this important? What a synonym is and how it breaks search will be explained with real-world examples. AI-based solutions are proposed, and relevant standards are identified. How synonym solutions should be used for search is explained. Learn what you can do yourself. Tools help, but it doesn't have to be complicated or expensive. It is as straightforward as setting priorities!
Qualitative Data Analysis I: Text Analysis - a summary based on Chapter 17 of H. Russell Bernard’s Research Methods in Anthropology: Qualitative and Quantitative Approaches for a Report for Anthro 297: Seminar in Research Design and Methods under Dr. Francisco Datar, Department of Anthropology, College of Social Sciences and Philosophy, University of the Philippines Diliman
Are you interested in learning about text analysis but have little to no experience with programming languages or writing code? These two short courses will introduce you to multiple text analysis methods. We will examine real-world examples and engage in hands-on activities that don’t require running any code. These short courses are ideal for students and researchers in non-technical fields, faculty who would like to incorporate text analysis in their curriculum, or as a precursor to programming with text analysis tools.
A Gentle Introduction to Text Analysis: covers both qualitative and quantitative text analysis methods, bag-of-words techniques and classification.
SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence (Marina Santini)
Query logs are an important source of information for surmising users' intents. Although Karlgren (2010) points out that "There are several reasons to be cautious in drawing too far-reaching conclusions: we cannot say for sure what the users were after; [...]", some linguistic problems could be sorted out by applying more advanced text/content analytics, such as register/sublanguage identification and terminology classification (see Friberg Heppin, 2011). In this presentation, I will argue that query logs can be considered a digital textual genre, like emails, blogs, chats, tweets and so forth. All these genres contain unstructured information that, still today, is difficult to leverage satisfactorily. The hypothesis I would like to put forward in this workshop is that query logs might be easier to exploit for useful information and actionable intelligence than other digital genres.
A Novel Approach for Keyword Extraction in Learning Objects Using Text Mining (IJSRD)
Keyword extraction and concept finding in learning objects are important subjects in today's eLearning environment. Keywords are the subset of words that carry useful information about the content of a document, and keyword extraction is the process of obtaining the important keywords from documents. In the proposed system, a decision tree algorithm is used for feature selection together with the WordNet dictionary. WordNet is a lexical database of English used to compute similarity between candidate words; the words with the highest similarity are taken as keywords.
Text mining aims to discover new, previously unknown or hidden information by automatically extracting it from various written resources. Applying knowledge-discovery methods to unstructured text is known as Knowledge Discovery in Text or text data mining, also called text mining. Most techniques used in text mining are founded on the statistical study of a term, either a word or a phrase. Previous work has used a range of algorithms: the single-link algorithm and Self-Organizing Maps (SOM) introduce an approach for visualizing high-dimensional data and a very useful projection-based tool for processing textual data, while genetic and sequential algorithms provide multiscale representation of datasets and are fast to compute, with low CPU time, on the Isolet-reduced subsets in unsupervised feature selection. We propose a Vector Space Model with a concept-based analysis algorithm to improve text-clustering quality so that a better text-clustering result may be achieved; the proposed algorithm behaves well in terms of robustness and stability with respect to the formation of the neural network.
How Taxonomies and Facets Bring End Users Closer to Big Data (Peter Wren-Hilton)
Pingar researcher Dr Anna Divoli's presentation given at Text Analytics World Boston 2012. Content includes discussion of taxonomies and big data.
Mining Unstructured Data: Practical Applications, from the Strata O'Reilly Making Data Work Conference (Peter Wren-Hilton)
Alyona Medelyan (Pingar), Anna Divoli (Pingar)
presented at Strata O'Reilly Making Data Work Conference on March 1, 2012
The challenge of unstructured data is a top priority for organizations that are looking for ways to search, sort, analyze and extract knowledge from the masses of documents they store and create daily. Text mining uses knowledge-driven algorithms to make sense of documents much as a person would by reading them. Lately, text mining and analytics tools have become available via APIs, meaning that organizations can take immediate advantage of these tools. We discuss three examples of how such APIs were utilized to solve key business challenges.
Most organizations dream of a paperless office, but still generate and receive millions of print documents. Digitizing these documents and intelligently sharing them is a universal enterprise challenge. Major scanning providers offer solutions that analyze scanned and OCR'd documents and then store detected information in document management systems. This works well with pre-defined forms, but human interaction is required when scanning unstructured text. We describe a prototype built for the legal vertical that scans stacks of paper documents and categorizes them on the fly, generating meaningful metadata.
In the area of forensics, intelligence and security, manual monitoring of masses of unstructured data is not feasible. The ability to automatically identify people's names, addresses, credit card and bank account numbers and other entities is key. We will briefly describe a case study of how a major international financial institution is taking advantage of text mining APIs in order to comply with a recent legislative act.
In healthcare, although Electronic Health Records (EHRs) have become increasingly available over the past two decades, patient confidentiality and privacy concerns have been obstacles to utilizing the incredibly valuable information they contain for medical research. Several approaches to assigning unique encrypted identifiers to patient IDs have been reported, but each comes with drawbacks. For a number of medical studies, consistent uniform ID mapping is not necessary, and automated text sanitization can serve as a solution. We will demonstrate how sanitization has practical use in a medical study.
And read a full interview with Alyona and Anna at http://radar.oreilly.com/2012/02/unstructured-data-analysis-tools.html
Presented at Semantic Garage Meetup, San Francisco, 2011. Unstructured data comes at a high cost: $37,000 per year per person in information industries. By using tools to automatically add metadata, enterprises can improve search results, speed e-discovery and risk assessment, summarize content and extract entities from files. Unstructured and semi-structured data represent a large component of big data. By turning unstructured content into business intelligence, enterprises can speed time to information.
Pingar chief research officer Alyona Medelyan presents research conducted jointly with Anna Divoli at the Human Computer Information Retrieval workshop 2011.
Presentation that won the SharePoint Idol competition at the 2011 New Zealand SharePoint Conference. Demonstrates how the Pingar technology can automatically populate metadata fields in SharePoint document collections.
1. Text Analytics World, Boston, October 3-4, 2012
Text Analytics on 2 Million
Documents: A Case Study
Plus, An Introduction into Keyword Extraction
Alyona Medelyan
2. What are these books about?
“Because he could” by D. Morris, E. McGann
“Still stripping after 25 years” by E. Burns
“Glut” by A. Wright
Only metadata will tell…
3. What this talk will cover:
• Who am I & my relation to the topic
• What types of keyword extraction are out there
• How does keyword extraction work
• How accurate can keywords be
• How to analyze 2 million documents efficiently
4. My Background
@zelandiya | medelyan.com
2005-2009: PhD thesis on keyword extraction, "Human-competitive automatic topic indexing"
Maui: multi-purpose automatic topic indexing ( nzdl.org/kea/ , maui-indexer.googlecode.com )
2010: co-organized a keyword extraction competition, SemEval-2 Track 5, "Automatic keyphrase extraction from scientific articles"
2010-2012: leading the R&D of Pingar's text analytics API
Pingar API features: keyword & named entity extraction, summarization etc.
5. Findability is ensured with the help of metadata
[Diagram: a document and its metadata]
Easy to extract: title, file type & location, creation & modification date, authors, publisher
Difficult to extract: keywords & keyphrases, people & companies mentioned, suppliers & addresses mentioned
6. What can text analytics determine from text?
[Diagram: documents annotated with the kinds of metadata text analytics can determine: keywords (the focus of this presentation), tags, sentiment, genre, categories, taxonomy terms, and entities, e.g. names and biochemical entities recognized via patterns]
7. Types of keyword extraction (or topic indexing)
Controlled indexing (taxonomy terms):
• Subject headings in libraries
• general, with Library of Congress Subject Headings
• domain-specific, in PubMed with MeSH categories
Free indexing (keywords, tags):
• Keyphrases in academic publications
• Tags in folksonomies
• by authors on Technorati
• by users on Del.icio.us
8. Free indexing vs. controlled indexing

Free indexing            Controlled indexing
E.g. keywords, tags      E.g. LCSH, ACM, MeSH
Inconsistent             Restricted
No control               Centrally controlled
No semantics             Inflexible
Ad hoc                   Not always available
9. How keyword extraction works
Document → Candidates → Keywords
1. Extract phrases using the sliding window approach, ignoring stopwords:
"NEJM usually has the highest impact factor of the journals of clinical medicine."
→ NEJM
→ highest, highest impact, highest impact factor
→ impact, impact factor, …
Alternative approach:
a) Assign part-of-speech tags
b) Extract valid noun phrases (NPs)
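A minimal sketch of this sliding-window step, assuming a tiny illustrative stopword list (real systems use a full one) and a maximum window of three tokens:

```python
# Sliding-window candidate extraction: every n-gram up to MAX_WINDOW
# tokens that neither starts nor ends with a stopword is a candidate.
# The stopword list here is a tiny illustrative subset.
import re

STOPWORDS = {"usually", "has", "the", "of", "a", "an", "and", "in", "on"}
MAX_WINDOW = 3

def candidates(text):
    tokens = re.findall(r"[A-Za-z]+", text)
    phrases = set()
    for i in range(len(tokens)):
        for n in range(1, MAX_WINDOW + 1):
            window = tokens[i:i + n]
            if len(window) < n:
                break
            if window[0].lower() in STOPWORDS or window[-1].lower() in STOPWORDS:
                continue
            phrases.add(" ".join(window))
    return phrases

print(candidates("NEJM usually has the highest impact factor "
                 "of the journals of clinical medicine."))
```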
10. How keyword extraction works
Document → Candidates → Keywords
2. Normalize phrases (case folding, stemming etc.):
"NEJM usually has the highest impact factor of the journals of clinical medicine."

Candidate                Normalized            Vocabulary match
NEJM                     nejm                  New England J of Med
highest                  high                  -
highest impact factor    high impact factor    -
impact                   impact                -
impact factor            impact factor         Impact Factor
journals                 journal               Journal
journals of clinical     journal of clinic     -
clinical                 clinic                Clinic
clinical medicine        clinic medic          Medicine
medicine                 medic                 Medicine
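A sketch of the normalization step: case-fold, then stem each token. This uses NLTK's Porter stemmer for illustration; the shorter stems shown on the slide (clinic, medic) suggest a more aggressive stemmer such as Lovins, so exact outputs will differ:

```python
# Normalize a candidate phrase: lowercase it and stem each token.
# Stems will vary slightly from the slide depending on the stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalize(phrase):
    return " ".join(stemmer.stem(tok) for tok in phrase.lower().split())

for phrase in ["NEJM", "highest impact factor", "journals", "clinical medicine"]:
    print(f"{phrase:22s} -> {normalize(phrase)}")
```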
11. How keyword extraction works
Document → Candidates → Properties → Keywords
1. Frequency: number of occurrences (incl. synonyms)
2. Position: beginning/end of a document, title, headers
3. Phrase length: longer means more specific
4. Similarity: semantic relatedness to other candidates
5. Corpus statistics: how prominent the phrase is in this particular text, relative to the corpus
6. Popularity: how often people select this candidate
7. Part of speech pattern: some patterns are more common
…
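To make the first three properties concrete, here is an illustrative computation; the feature names are hypothetical, though systems such as Kea and Maui compute analogous signals:

```python
# Toy computation of three candidate properties: frequency,
# first-occurrence position (as a fraction of document length),
# and phrase length. Assumes the candidate occurs in the text.
def properties(candidate, text):
    text, cand = text.lower(), candidate.lower()
    return {
        "frequency": text.count(cand),
        "first_position": text.find(cand) / max(len(text), 1),
        "phrase_length": len(cand.split()),
    }

doc = "NEJM usually has the highest impact factor of the journals."
print(properties("impact factor", doc))
```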
12. How keyword extraction works
Document → Candidates → Properties → Scoring → Keywords

Heuristics: a formula that combines the most powerful features
• requires accurate crafting
• performs equally well (or less well) across various domains

Supervised machine learning: train a model from manually indexed documents
• requires training data
• performs really well on documents that are similar to the training data, but poorly on dissimilar ones
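A minimal sketch of the supervised route, assuming the toy features above and scikit-learn's Naive Bayes (Kea scores candidates with a similar Naive Bayes model):

```python
# Train a Naive Bayes scorer on candidates from manually indexed
# documents, then rank new candidates by the predicted probability
# of being a keyword. Feature rows: [frequency, first_position,
# phrase_length]; the numbers are made up for illustration.
from sklearn.naive_bayes import GaussianNB

X_train = [[5, 0.02, 2], [1, 0.90, 1], [3, 0.10, 3], [1, 0.75, 1]]
y_train = [1, 0, 1, 0]  # 1 = the phrase was chosen as a keyword

model = GaussianNB().fit(X_train, y_train)

X_new = [[4, 0.05, 2], [1, 0.60, 1]]
for features, p in zip(X_new, model.predict_proba(X_new)[:, 1]):
    print(features, round(p, 3))
```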
13. How accurate is keyword extraction?
• It’s subjective…
• But: the higher the indexing consistency, the better the search effectiveness (findability)

[Venn diagram]
A = set of keyphrases 1
B = set of keyphrases 2
C = set of keyphrases in common

Consistency (Rolling) = 2C / (A + B)
Consistency (Hopper) = C / (A + B - C)
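Both measures computed directly from two keyword sets (A, B and C in the formulas denote set sizes):

```python
# Inter-indexer consistency between two keyword sets.
def rolling(a, b):
    c = len(a & b)
    return 2 * c / (len(a) + len(b))

def hopper(a, b):  # this is the Jaccard index
    c = len(a & b)
    return c / (len(a) + len(b) - c)

indexer1 = {"obesity", "nutrition policies", "food consumption"}
indexer2 = {"obesity", "nutrition policies", "taxes"}
print(rolling(indexer1, indexer2))  # 0.667
print(hopper(indexer1, indexer2))   # 0.5
```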
14. Professional indexers' keywords*
[Term cloud of the Agrovoc terms assigned to the document, covering topics such as nutritional disorders, weight reduction, diet, nutrition policies, food consumption, food prices, taxation, urbanization and globalization]
* 6 professional FAO indexers assigned terms from the Agrovoc thesaurus to the same document, entitled "The global obesity problem"
15. Comparison of 2 indexers
[The same Agrovoc term cloud, with markers showing which terms Indexer 1 and Indexer 2 each assigned, and the Agrovoc relations between terms]
16. Comparison of 6 indexers & Kea
[The same term cloud, now marking the choices of all 6 indexers and of the Kea algorithm; Kea additionally suggested terms such as body weight, saturated fat, price fixing and controlled prices]
17. Comparison of CS students* & Maui
* 15 teams of 2 students each assigned keywords to the same document,
entitled “A safe, efficient regression test selection technique”
18. Human vs. algorithm consistency
6 Professional indexers vs. Kea on 30 agricultural documents & Agrovoc thesaurus
Method           Min   Avg   Max
Professionals     26    39    47
Kea               24    32    38

15 teams of 2 CS students vs. Maui on 20 CS documents & Wikipedia vocabulary

Method     Min   Avg   Max
Students    21    31    37
Maui        24    32    36

CiteULike taggers vs. Maui (each tagger had ≥ 2 co-taggers) & free indexing

                          With other taggers   With Maui
330 taggers & 180 docs           19                24
35 taggers & 140 docs            38                35
19. Text Analytics on 2 Million Documents:
A Case Study
In collaboration with Gene Golovchinsky ( fxpal.com/?p=gene )
20. The dataset
CiteSeer: 1.7 million scientific publications, 110 GB
Twitter: 490 million tweets per week, 84 GB
Wikipedia: 3.6 million articles, 13 GB
Britannica: 0.65 million articles, 0.3 GB
ICWSM 2011 (news, blogs, forums, etc.): 2.1 TB (compressed!)
Sources: slideshare.net/raffikrikorian/twitter-by-the-numbers , en.wikipedia.org/wiki/Wikipedia:Size_comparisons
21. The task
1. Extract all phrases that appear in search results
2. Weigh and suggest the best phrases for query refinement
(for Gene's collaborative search system, Querium)
22. Step 1: Get time estimates
A. Take a subset, e.g. 100 documents
B. Run on various machines / settings
C. Extrapolate to the entire dataset, e.g. 1.7M docs
Our example:
• Standard laptop 4 Core, 8GB RAM: 30 days
• Similar Rackspace VM: 46 days
• Threading reduces time: 24 days
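The extrapolation itself is simple arithmetic; a sketch with hypothetical sample timings:

```python
# Extrapolate total processing time from a 100-document sample.
# The sample timing below is made up; plug in your own measurement.
sample_docs = 100
sample_seconds = 150              # measured: 1.5 s per document
total_docs = 1_700_000

total_days = sample_seconds / sample_docs * total_docs / 86_400
print(f"estimated: {total_days:.0f} days")  # ~30 days on this machine
```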
23. Step 2: Look into your data
Understand the nature of your data: look at samples, compute statistics.
Speed up by removing anomalies & targeting the text analytics.
Our example:
• 30% of docs exceed 50KB (some ≈600KB)
• The most important phrases appear in the title, abstract, introduction and conclusions
• So: only process the first 30% and the last 20% of each document
This reduces the time by 57%!
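A sketch of that cropping rule (the 30%/20% split is the one chosen for this dataset):

```python
# Keep only the head (title, abstract, introduction) and tail
# (conclusions) of a document before extracting keywords.
def crop(text, head=0.30, tail=0.20):
    n = len(text)
    return text[: int(n * head)] + "\n...\n" + text[n - int(n * tail):]
```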
24. Validate: Can we crop our documents?
How many of the top N keywords from the original document were also found in the cropped document?*

Top N keywords   Found in cropped doc
10               91%
50               80%
100              75%
All              64%

[Side-by-side lists of the top 20 keywords from the original vs. the cropped document, largely overlapping: ontology, knowledge base, knowledge engineering, knowledge representation, Semantic Web, WordNet, predicate logic, artificial intelligence, semantic networks, ontology engineering, first-order logic, higher-order logic, conceptual graphs, lexicon, …]

* "Toward principles for the design of ontologies used for knowledge sharing", T. R. Gruber (1993)
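The validation metric is a straightforward overlap ratio; a sketch with placeholder keyword lists:

```python
# Fraction of the top-N keywords from the full document that also
# appear among the keywords extracted from the cropped document.
def overlap_at_n(full_keywords, cropped_keywords, n):
    top = full_keywords[:n]
    return sum(k in set(cropped_keywords) for k in top) / len(top)

full = ["ontology", "knowledge base", "knowledge", "Semantic Web"]
cropped = ["ontology", "knowledge base", "knowledge engineering"]
print(overlap_at_n(full, cropped, 2))  # 1.0: both top-2 keywords found
```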
25. Step 3: Go cloud
Don’t be afraid to bring out the big guns
• Large Elastic Compute instance
1000 docs x 4 threads = 30 min
• High-CPU Extra Large (8 virtual cores)
1000 docs x 24 threads = 6 min
Also: increase the number of machines
• 4 machines = 4 times faster,
i.e. 50 instead of 200 hours (or 1 weekend!)
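A sketch of the threading side of this; extract_keywords is a placeholder for the real per-document call (e.g. an API request, which is I/O-bound and therefore threads well):

```python
# Process documents concurrently; raise max_workers on bigger machines
# (the talk used 4 threads on a laptop and 24 on a High-CPU instance).
from concurrent.futures import ThreadPoolExecutor

def extract_keywords(doc):
    ...  # placeholder: call the keyword-extraction service for one doc

def process_all(docs, threads=24):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(extract_keywords, docs))
```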
26. How long would a human need to extract keywords from 1.7M docs?

Min per doc   Minutes     Hours    Days*    Years**
1             1,700,000   28,333   3,542    14
2             3,400,000   56,667   7,083    28
3             5,100,000   85,000   10,625   42

* taking into account 8h per working day
** assuming 250 working days per year (no holidays, no sick days)
(image: http://www.flickr.com/photos/mararie/2663711551/ )
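The arithmetic behind the first row:

```python
# 1 minute per document, scaled to the whole collection.
docs, min_per_doc = 1_700_000, 1
minutes = docs * min_per_doc            # 1,700,000 min
hours = minutes / 60                    # 28,333 h
days = hours / 8                        # 3,542 working days
years = days / 250                      # 14 working years
print(round(hours), round(days), round(years))
```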
27. Document → Candidates → Properties → Scoring → Keywords
To estimate quality, take a sample and compute inter-indexer consistency between several people.

CiteSeer: 1.7 million scientific publications, 110 GB
Can be done in a weekend:
1. Get time estimates
2. Look into your data
3. Go cloud
Don't do it manually!

Keyword extraction: medelyan.com/files/phd2009.pdf
CiteSeer study: pingar.com/technical-blog/
Pingar API: apidemo.pingar.com
Editor's Notes
Kea performs better than 8 of the best taggers.
Dev machine: 4-core CPU, 8 GB RAM; comparable Rackspace VM.
So among the top 10 keywords from the full document, 91% appear in the keywords from the cropped document (basically, 9 out of 10 are the same).