Shivani Singh
NLP MODEL
(6-STEP EXPLANATION)
IDENTIFY THE DATA SOURCES
ACQUIRING THE DATA
How do you acquire content from the Internet? There are, fundamentally, four techniques.
Handling A Very, Very Large Number of
Sources…
If you need to acquire content from a large number of
data sources, you will likely need to develop your
own data acquisition and ingestion tools.
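A custom ingestion tool usually amounts to a fetch loop over a frontier of URLs with de-duplication. A minimal sketch, with the fetcher and link extractor injected as functions so the loop itself stays testable without network access (all names here are illustrative):

```python
from collections import deque

def crawl(seed_urls, fetch, link_extractor, max_pages=100):
    """Breadth-first acquisition loop with de-duplication."""
    seen = set(seed_urls)
    frontier = deque(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        content = fetch(url)
        if content is None:      # fetch failed; skip this URL
            continue
        pages[url] = content
        for link in link_extractor(content):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

In a real tool, `fetch` would also handle politeness delays, retries, and robots.txt, and the `seen` set would be persisted for incremental downloads.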
Cleansing and Formatting Content
● Determine the format (e.g. PDF, XML, HTML, etc.)
● Extract text content
● Identify and remove useless sections, such as common
headers, footers, and sidebars as well as legal or
commercial boilerplates
● Identify differences and changes
● Extract coded metadata
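The first step above, determining the format, can be sketched as a dispatch table that routes each file to a format-specific extractor by extension. The table and function names are hypothetical; real pipelines should sniff file content (extensions are often wrong), for instance with Apache Tika:

```python
import os

def detect_format(path):
    """Map a file extension to a coarse format label."""
    ext = os.path.splitext(path)[1].lower()
    return {".pdf": "pdf", ".xml": "xml",
            ".htm": "html", ".html": "html"}.get(ext, "text")

def extract_text(path, extractors):
    """Dispatch to a format-specific text extractor function."""
    return extractors[detect_format(path)](path)
```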
Approaches to Cleansing and Formatting Data from the Internet
Approach 1: Use screen scrapers and/or browser automation tools
Advantages: extracts metadata from complex structures
Disadvantages: does not work at large scale or with a wide variety of content, and typically requires software programming
Approach 2: Use text extractors such as Apache Tika or Oracle Outside In
Advantages: works on all types of files and formats
Disadvantages: does not extract much metadata (title, description, author) and may not extract content structure (headings,
paragraphs, tables, etc.)
Approach 3: Custom coding based on the format, such as an XML SAX parser, Beautiful Soup for HTML, or Aspose for other formats
Advantages: the most power and flexibility
Disadvantages: the most expensive to implement, since custom coding is required
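As a small taste of Approach 3, here is a custom HTML text extractor built on Python's standard-library `html.parser` (Beautiful Soup, mentioned above, is the more ergonomic choice in practice):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

This illustrates the trade-off named above: full control over what counts as content, at the cost of writing and maintaining the code yourself.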
Additional Tools
These additional tools can work in conjunction with the basic cleansing and extraction
methods above.
Common paragraph removal
● Identifies common, frequently occurring paragraphs so they can be automatically
removed
Structure mapping patterns
● These are large, structural patterns which are easy to describe. They are applied to
input documents to extract and map metadata.
● Patterns can be XML, HTML, or text patterns.
Optical Character Recognition (OCR)
● OCR systems extract text from images, so the text can be further processed by
machines.
● There are some open-source engines (e.g. Tesseract and OCRopus) as well as
some good commercial options (e.g. ABBYY and Aquaforest).
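The common paragraph removal tool above can be sketched by counting how often each paragraph occurs across the corpus and dropping the frequent ones. The 50% threshold is an arbitrary assumption for illustration:

```python
from collections import Counter

def remove_common_paragraphs(docs, threshold=0.5):
    """Drop paragraphs appearing in more than `threshold` of the
    documents -- a crude stand-in for boilerplate detection.
    `docs` is a list of documents, each a list of paragraphs."""
    counts = Counter(p for doc in docs for p in set(doc))
    cutoff = threshold * len(docs)
    return [[p for p in doc if counts[p] <= cutoff] for doc in docs]
```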
Natural Language Processing (NLP) Techniques for Extracting Information
STEP 1
Understand the Whole Document (Macro Understanding)
Once you have decided to embark on your NLP project, you may need a more holistic understanding of the document; this is a “macro
understanding.” It is useful for:
● Classifying / categorizing / organizing records
● Clustering records
● Extracting topics
● General sentiment analysis
● Record similarity, including finding similarities between different types of records (for example, job descriptions to résumés /
CVs)
● Keyword / keyphrase extraction
● Duplicate and near-duplicate detection
● Summarization / key sentence extraction
● Semantic search
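Several of the macro tasks above (record similarity, near-duplicate detection) reduce to comparing documents as term-frequency vectors. A bag-of-words sketch, leaving out the stemming and IDF weighting a real system would add:

```python
import math
import re
from collections import Counter

def tf_vector(text):
    """Lowercased bag-of-words term frequencies."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two term-frequency vectors (1.0 = identical direction)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Matching a job description against a résumé, or flagging near-duplicates above some similarity cutoff, are direct applications of this comparison.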
STEP 2
STEP 3
Extracting Facts, Entities, and Relationships (Micro Understanding)
Micro understanding is the extraction of individual entities, facts, or relationships from the text. It is useful for (from easiest to
hardest):
● Extracting acronyms and their definitions
● Extracting citation references to other documents
● Extracting key entities (people, companies, products, dollar amounts, locations, dates). Note that extracting “key” entities is not the
same as extracting “all” entities (some judgment is implied in deciding which entities are “key”)
● Extracting facts and metadata from full text when it’s not separately tagged in the web page
● Extracting entities with sentiment (e.g. positive sentiment towards a product or company)
● Identifying relationships such as business relationships, target / action / perpetrator, etc.
● Identifying compliance violations, statements which show possible violation of rules
● Extracting statements with attribution, for example, quotes from people (who said what)
● Extracting rules or requirements, such as contract terms, regulation requirements, etc.
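Some of the easier extractions above, such as dollar amounts, dates, and acronym definitions, can be sketched with hand-written regular expressions. These three patterns are illustrative assumptions, not production-grade extractors:

```python
import re

PATTERNS = {
    # e.g. "$2.5 million"
    "money": re.compile(r"\$\d[\d,]*(?:\.\d+)?(?:\s?(?:million|billion))?"),
    # e.g. "01/15/2020"
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    # e.g. "Natural Language Processing (NLP)" -> (definition, acronym)
    "acronym_def": re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)+) \(([A-Z]{2,})\)"),
}

def extract_entities(text):
    """Run every pattern over the text and collect the matches."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}
```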
STEP 4
Micro understanding requires syntactic analysis of the text: word order and word usage matter.
1. Top Down – determine each word’s part of speech, then parse the sentence into clauses, nouns, verbs, objects and subjects,
modifying adjectives and adverbs, etc., and traverse this structure to identify the structures of interest
● Advantages – can handle complex, never-before-seen structures and patterns
● Disadvantages – rules are hard to construct and brittle; often fails on variant input, and may still require substantial pattern
matching even after parsing
2. Bottom Up – create many patterns, match them against the text, and extract the necessary facts.
Patterns may be entered manually or computed using text mining.
● Advantages – patterns are easy to create, can be authored by business users without programming, are easy to debug and fix,
run fast, and map directly to the desired outputs
● Disadvantages – requires ongoing pattern maintenance; cannot match newly invented constructs
3. Statistical – similar to bottom-up, but matches patterns against a statistically weighted database of
patterns generated from tagged training data.
● Advantages – patterns are created automatically, with built-in statistical trade-offs
● Disadvantages – requires extensive training data (thousands of examples), needs periodic retraining for best
accuracy, cannot match newly invented constructs, and is harder to debug
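The bottom-up approach can be made concrete as a table of hand-entered patterns, each mapped to a relation, matched against the text to yield (subject, relation, object) facts. The rules below are hypothetical examples of the kind a business user might enter:

```python
import re

# Each rule pairs a surface pattern with the relation it expresses.
# Capitalized-word runs stand in for entity mentions.
ENTITY = r"[A-Z]\w*(?: [A-Z]\w*)*"
RULES = [
    (re.compile(rf"({ENTITY}) acquired ({ENTITY})"), "acquired"),
    (re.compile(rf"({ENTITY}) is headquartered in ({ENTITY})"), "located_in"),
]

def match_facts(text):
    """Return (subject, relation, object) triples for every rule match."""
    facts = []
    for pattern, relation in RULES:
        for subj, obj in pattern.findall(text):
            facts.append((subj, relation, obj))
    return facts
```

The advantages and disadvantages listed above show up directly: each rule is trivial to read, debug, and fix, but a construction no rule anticipates is simply missed.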
Service frameworks for NLP
● IBM Cognitive – statistical approach based on training data
● Google Cloud Natural Language API – top-down full-sentence diagramming system
● Amazon Lex – geared more towards human-interactive (human-in-the-loop) conversation
Some tricky things to watch out for
● Co-reference resolution - sentences often refer to objects introduced earlier.
- Pronoun reference: “She is 49 years old.”
- Partial reference: “Linda Nelson is a top accountant working in Hawaii. Linda is 49 years old.”
- Implied container reference: “The state of Maryland is a place of history. The capital, Annapolis, was founded in 1649.”
● Handling lists and repeated items
e.g. “The largest cities in Maryland are Baltimore, Columbia, Germantown, Silver Spring, and Waldorf.”
- Such lists often break NLP algorithms and may require special handling outside the standard
structures.
● Handling embedded structures
such as tables, markup, bulleted lists, headings, etc.
- Note that structure elements can also play havoc with NLP technologies.
- Make sure that NLP does not match sentences and patterns across structural boundaries. For example, from one
bullet point and into the next.
- Make sure that markup does not break NLP analysis where it shouldn’t. For example, embedded emphasis should
not cause undue problems.
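Two of the pitfalls above can be sketched with toy heuristics: resolving a sentence-initial pronoun to the most recently mentioned person, and expanding a conjunction list into one fact per item. Both are deliberately naive; real co-reference resolution and list handling need far more machinery:

```python
import re

NAME = re.compile(r"\b([A-Z][a-z]+) [A-Z][a-z]+\b")   # e.g. "Linda Nelson"
LEADING_PRONOUN = re.compile(r"^(?:She|He)\b")

def resolve_pronouns(sentences):
    """Replace a leading 'She'/'He' with the first name of the most
    recently mentioned two-word proper name."""
    last_first_name = None
    resolved = []
    for s in sentences:
        m = NAME.search(s)
        if m:
            last_first_name = m.group(1)
        elif last_first_name:
            s = LEADING_PRONOUN.sub(last_first_name, s)
        resolved.append(s)
    return resolved

def expand_list_sentence(sentence):
    """Split a '... are A, B, and C.' sentence into one fact per item."""
    m = re.match(r"(.+?\bare\b)\s+(.+)\.", sentence)
    if not m:
        return [sentence]
    stem, items = m.groups()
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", items)
    return [f"{stem} {p.strip()}" for p in parts if p.strip()]
```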
STEP 5: Maintain Provenance / Traceability
STEP 6: Human-Aided Processes
WORK WITH RESULTS
Data Has Been Cleansed and Processed. What's Next?
There are several places this information can go:
● A search engine - to enhance the full document (additional metadata fields for
additional facets or filters) and to support search-based visualization dashboards
(Kibana, Banana, Hue, or ZoomData, for example)
● A relational database - to be combined with other business data for visualization
and business analytics (Tableau, Pentaho, or others)
● A graph database - for complex relationship analysis
● A monitoring and alerting tool - for situations that need immediate attention
(e.g. compliance violations, trending negative sentiment, bad customer service
situations, etc.)
● Apache Spark - for further real-time analytics and machine learning
● A business rules engine / ESB / workflow - to send the output through further
manual and business processing. For example, to review the output for quality,
check for compliance violations, etc.
● Custom applications - for quality review and analysis, crowdsourcing review,
etc.
Quality Analysis
To do most quality analysis, you will need to check two things:
● Do you have everything? This is the “completeness” or “coverage”
check.
● Is what you have correct? This is the “accuracy” check.
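These two checks correspond to recall and precision. A small sketch that scores an extraction run against a gold-standard set:

```python
def coverage_and_accuracy(expected, extracted):
    """Completeness: how much of the expected set was captured (recall).
    Accuracy: how much of what was captured is correct (precision)."""
    expected, extracted = set(expected), set(extracted)
    hits = expected & extracted
    coverage = len(hits) / len(expected) if expected else 1.0
    accuracy = len(hits) / len(extracted) if extracted else 1.0
    return coverage, accuracy
```

The same two numbers apply at every stage in the goals below, from bulk download (did we get every document?) to entity extraction (is every extracted entity real?).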
Quality Analysis Goals
● Completeness of bulk download from the Internet
● Completeness of incremental download from the Internet
● Accuracy and completeness of tagged metadata extraction
● Accuracy and completeness of basic linguistic processing
● Accuracy and completeness of entity extraction
● Accuracy and completeness of categorization
● Accuracy and completeness of natural language processing
extraction