Efficient Practices for Large Scale Text Mining Process
Ivelina Nikolova, Senior NLP Engineer
March 2, 2017
In this webinar you will learn …
• About industry applications that maximize the Return on Investment (ROI) of your text mining process
• How to describe your text mining problem
• How to define the output of the text mining process
• How to select the appropriate text analysis techniques
• How to plan the prerequisites for a successful text mining solution
• DOs and DON'Ts in setting up a text mining process
Outline
• Business needs for text mining solutions
• Introduction to NLP and information extraction
• How to tailor your text analysis process
• Applications and demonstrations
Business needs for text mining solutions
Analyzing text to capture data from it supports:
– increased user engagement via content recommendations,
– a shortened research cycle via semantic search,
– regulatory compliance via smart indexing,
– better content management, etc.
Some of our customers
Text analysis
• Parses texts in order to extract machine-readable facts from them.
• Creates sets of structured or semi-structured data out of heaps of unstructured, heterogeneous documents.
• Relies on natural language processing techniques like:
  – automatic morphological analysis,
  – automated syntax analysis,
  – term weights and co-occurrence,
  – lexical semantics,
  and on more complex tasks like:
  – named entity recognition,
  – relation extraction, etc.
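To make these building blocks concrete, here is a minimal sketch using the open-source spaCy library – an illustrative choice, not necessarily the toolkit behind the solutions shown in this webinar:

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Ontotext is headquartered in Sofia, Bulgaria.")

    # Morphological and syntactic analysis: lemma, part of speech, dependency
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.dep_)

    # Named entity recognition built on top of the lower-level analysis
    for ent in doc.ents:
        print(ent.text, ent.label_)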
Semantic annotation/enrichment
• Inextricably tied to text analysis
• Links mentions in the text to knowledge base concepts
• Can be automatic, manual or semi-automatic
State of the art
• Named Entity Recognition
  – 60% F1 [OKE-challenge@ESWC2015]
  – 82.9% F1 [Leaman and Lu, 2016] in the biomedical domain
  – above 90% F1 for more specific tasks
Designing the text mining process
• Know your business problem
• Know your data
• Find appropriate samples
• Use common formats, or formats which can easily be transformed into them
• Get together domain experts, technical staff, NLP engineers and potential users
• Narrow the business problem down to an information extraction task
• Clearly define the annotation types
• Clearly define the annotation guidelines
• Apply the appropriate algorithm for information extraction (IE)
• Do iterations of evaluation and improvement
• Ensure continuous adaptation by curation and re-training
Clear problem definition
• Define your business problem clearly
  • specific smart search
  • content recommendation
  • content enrichment
  • content aggregation, etc.
  E.g. the system must do <A, B, C>
• Define the text analysis problem clearly
• Reduce the business problem to an information extraction problem
  Business problem: faceted search by Persons, Organizations, Locations
  Information extraction problem: extract mentions of Persons, Organizations and Locations and link them to the corresponding concepts in the knowledge base
Define the annotation types I
• Annotations – abstract descriptions of the mentions of concepts of interest
  Named entities: Person, Location, Organization
                  Disease, Symptom, Chemical
                  SpaceObject, SpaceCraft
  Relations: PersonHasRoleInOrganisation, Causation
Define the annotation types II
• Annotation types
  • Person, Organization, Location
  • Person, Organization, City
  • Person, Organization, City, Country
• Annotation features
  Location: string, GeoNames instance, latitude, longitude
  Chemical: string, InChI, SMILES, CAS
  PersonHasRoleInOrganization: person instance, role instance, organization instance, timestamp
• Example annotation
  string: the Gulf of Mexico
  startOffset: 71
  endOffset: 89
  type: Location
  inst: http://ontology.ontotext.com/resource/tsk7b61yf5ds
  links: [http://sws.geonames.org/3523271/
          http://dbpedia.org/resource/Gulf_of_Mexico]
  latitude: 25.368611
  longitude: -90.390556
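As a minimal sketch, an annotation like the one above might be represented in code as follows; the class and field names mirror the example and are illustrative, not Ontotext's internal format:

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Annotation:
        string: str        # surface form of the mention in the text
        start_offset: int  # character offset where the mention starts
        end_offset: int    # character offset where the mention ends
        type: str          # annotation type, e.g. "Location"
        inst: str          # URI of the concept in the knowledge base
        links: List[str] = field(default_factory=list)           # external IDs
        features: Dict[str, str] = field(default_factory=dict)   # e.g. latitude

    gulf = Annotation(
        string="the Gulf of Mexico",
        start_offset=71,
        end_offset=89,
        type="Location",
        inst="http://ontology.ontotext.com/resource/tsk7b61yf5ds",
        links=["http://sws.geonames.org/3523271/",
               "http://dbpedia.org/resource/Gulf_of_Mexico"],
        features={"latitude": "25.368611", "longitude": "-90.390556"},
    )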
Locations mentioned in Holocaust documents
Provide examples
• Realistic
• Demonstrating the desired output
• Both positive and negative
  • "It therefore increases insulin secretion and reduces POS[glucose] levels, especially postprandially."
  • "It acts by increasing POS[NEG[glucose]-induced insulin] release and by reducing glucagon secretion postprandially."
• A representative and balanced set of the types of problems
• In an appropriate/commonly used format – XML, HTML, TXT, CSV, DOC, PDF
Domain model and knowledge
• Domain model/ontology – describes the types of objects in the problem area and the relations between them
Data
• Data sources – proprietary data, public data, professional data
• Data cleanup
• Data formats
• Data stores
  • For metadata – GraphDB (http://ontotext.com/graphdb/) – see the sketch below
  • For content – MongoDB, MarkLogic, etc.
• Data modeling is an inevitable part of the process of semantic data enrichment
  • Start it as early as possible
  • Keep to the common data formats
  • Mistakes and underestimations are expensive because they influence the whole process of developing a text mining solution
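For illustration, here is a hedged sketch of pushing extracted metadata into GraphDB as RDF over its standard SPARQL/RDF4J HTTP endpoint; the host, repository name ("news") and vocabulary are hypothetical:

    import requests

    turtle = """
    @prefix ex: <http://example.com/ontology/> .
    <http://example.com/doc/42> ex:mentions <http://dbpedia.org/resource/Gulf_of_Mexico> .
    """

    resp = requests.post(
        "http://localhost:7200/repositories/news/statements",  # hypothetical repository
        data=turtle.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
    )
    resp.raise_for_status()  # GraphDB answers 204 No Content on success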
Gold standard
• Gold standard – annotated data of superior quality
• Annotation guidelines – used as guidance for manually annotating the documents
  POS[London] universities = universities located in London
  NEG[London] City Council
  NEG[London] Mayor
• Manual annotation tools – intuitive UI, visualization features, export formats
  • MANT – Ontotext's in-house tool
  • GATE – http://gate.ac.uk/ and https://gate.ac.uk/teamware/
  • brat – http://brat.nlplab.org/
• Annotation approach
  • Manual vs. semi-automatic
  • Domain experts vs. crowd annotation
    • E.g. Mechanical Turk – https://www.mturk.com/
  • Inter-annotator agreement (see the sketch after this list)
• Train:test ratio – 60:40, 70:30
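Inter-annotator agreement is commonly measured with Cohen's kappa. Below is a minimal sketch of that measure and of a 70:30 train:test split, using scikit-learn as an illustrative choice and toy data:

    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split

    # Labels assigned by two annotators to the same five mentions (toy data)
    annotator_a = ["Person", "Location", "Location", "Organization", "Person"]
    annotator_b = ["Person", "Location", "Organization", "Organization", "Person"]
    print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

    # 70:30 train:test split of the gold-standard documents
    documents = [f"doc{i}" for i in range(10)]
    train, test = train_test_split(documents, test_size=0.3, random_state=42)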
Text analysis approach
• Rule-based approach (see the sketch after this list)
  • a lower number of clear patterns which do not change over time, or change only slightly
  • high precision
  • appropriate for domains where it is important to know how the decision to extract a given annotation is taken – e.g. the biomedical domain
• Machine learning approach
  • a higher number of patterns which do change over time
  • requires annotated data
  • allows for retraining over time
• Neural network approach
  • deep neural networks – getting closer to AI
  • recent advances promise true natural language understanding via complex neural networks
  • great results in speech recognition, image recognition and machine translation; a breakthrough is expected in NLP
  • still unclear why and how it works, and thus difficult to optimize
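As a toy illustration of the rule-based approach, the sketch below compiles a small gazetteer into a regular expression that tags known city names; real rule-based systems (e.g. JAPE grammars in GATE) are far richer, and the gazetteer and text here are hypothetical:

    import re

    GAZETTEER = {"London": "City", "Sofia": "City", "Paris": "City"}
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, GAZETTEER)) + r")\b")

    text = "The workshop moves from London to Sofia next year."
    for m in pattern.finditer(text):
        # Emit (string, type, startOffset, endOffset), as in the annotation model
        print(m.group(1), GAZETTEER[m.group(1)], m.start(), m.end())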
NER pipeline
• Preprocessing
• Keyphrase extraction
• Gazetteer-based enrichment
• Named entity recognition and disambiguation
• Generic entity extraction
• Result consolidation
• Relation extraction
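A minimal sketch of how such a pipeline can be composed: each stage receives the text and the annotations accumulated so far and returns the enriched annotation list. The stage functions are stubs named after the steps above:

    # Each stage: (text, annotations so far) -> updated annotations
    def preprocess(text, anns): return anns           # tokens, sentences, POS
    def gazetteer_enrich(text, anns): return anns     # dictionary lookup
    def ner_disambiguate(text, anns): return anns     # NER + linking to the KB
    def consolidate(text, anns): return anns          # merge overlapping results

    PIPELINE = [preprocess, gazetteer_enrich, ner_disambiguate, consolidate]

    def run_pipeline(text):
        annotations = []
        for stage in PIPELINE:
            annotations = stage(text, annotations)
        return annotations

Keeping every stage behind the same interface makes it easy to swap, say, a gazetteer component for a machine learning one without touching the rest of the pipeline.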
Results curation / Error analysis
• Curation of results – domain experts manually assess the work of the text analysis components
• Testing interfaces
• Feedback
  • Select a representative set of documents to evaluate manually
  • Provide as full a description of the results and the component used as possible:
    – <pipeline version>
    – <input as sent for processing>
    – <description of the wrong behavior>
    – <description of the correct behavior>
• The earlier this happens, the sooner it triggers revision of the models and improvement of the annotations
Evaluation of the results
• Gold standard split into train:test
  • 70:30
  • 80:20
• Decide which task you want to evaluate
  • E.g. extraction at document level or inline annotation
• Evaluation metrics
  • Information extraction tasks – precision, recall, F-measure (see the sketch below)
  • Recommendations – A/B testing
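A minimal sketch of the extraction metrics, comparing predicted annotation spans against the gold standard (toy data); F-measure is the harmonic mean of precision and recall:

    def evaluate(gold, predicted):
        gold, predicted = set(gold), set(predicted)
        tp = len(gold & predicted)  # spans found with the correct type
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    gold = {(71, 89, "Location"), (0, 8, "Person")}
    pred = {(71, 89, "Location"), (10, 20, "Organization")}
    print(evaluate(gold, pred))  # (0.5, 0.5, 0.5)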
Continuous adaptation
Types of extracted information
• Document categorization
  • post, political news, sport news, etc.
• Topic extraction
  • important words and phrases in the text
• Named entity recognition
  • People, Organizations, Locations, Times, Amounts of money, etc.
• Keyterm assignment from predefined hierarchies
• Concept extraction
  • entities from a knowledge base
• Relation extraction
  • relations between types of entities
Applications
• TAG (http://tag.ontotext.com)
• NOW (http://now.ontotext.com)
• Patient Insights (http://patient.ontotext.com/) – contact todor.primov@ontotext.com for credentials
Take away messages – DOs
• Break down the clearly defined business problem into a clearly defined information extraction problem
• Combine efforts from business decision makers, domain experts, natural language processing experts and technical staff
• Treat data modeling as an inevitable part of the process and consider it as early as possible
• Create clear annotation guidelines based on real-world examples
• Start with an initial small set of balanced and representative documents
• Plan the evaluation of the results in advance
• Choose an appropriate manual annotation tool
• While annotating content, check how the quantity influences the performance
• Select the appropriate text analysis approach
• Plan iterations of curation by domain experts, followed by revision of the text analysis approach
• Plan the aspects of continuous adaptation – document quantity, timing, temporality of the information fed into the model
Take away messages – DON'Ts
The most common mistakes are caused by under- or overestimation of some phases of the text mining process:
• Underestimated effort for the training corpus – this may lead to a longer phase of determining the correct algorithms and training models.
• Underestimated effort for the evaluation corpus – this may lead to a solution which cannot be practically evaluated and thus cannot be formally delivered/released.
• Overestimating the value of the data in the text mining process – if you spend too much effort building your own vocabularies, you will most probably end up with the same text mining solution as if you had bought professionally prepared data.
• Underestimating the data ETL before starting a text mining solution – this may lead to a delay in the text mining solution, caused by a delayed training cycle.
• Over-expectations about dynamic data updates – it often turns out that when the solution is ready, it is more important to have a good process for dynamic update of the data than to have the updates instantly available.
• Intolerance towards extraction speed – this may lead to a faster solution which offers lower-quality results. If speed is not crucial, tolerate it.
• No readiness to implement changes in the workflow and the collected data – a good automated solution is not one that completely replaces the manual workflow, but one that brings higher value to your business. Be ready to slightly change your workflow, start collecting some new data, and aim for an automated solution focused on new benefits.
Thank you very much for your attention!
You are welcome to try our demos at http://ontotext.com
