The document provides a 6-step explanation of natural language processing (NLP) techniques for extracting information from documents:
1. Identify and acquire data from various sources and handle large volumes of data.
2. Cleanse and format content by extracting text, removing useless sections, and extracting metadata.
3. Use NLP techniques like topic modeling, keyword extraction, and sentiment analysis to understand documents at a macro level.
4. Extract facts, entities, and relationships from text using syntactic analysis or pattern matching.
5. Maintain provenance and traceability of extracted information.
6. Involve human processes like quality analysis to check completeness and accuracy of extracted information.
Acquiring the Data
How do you acquire content from the Internet? There are, fundamentally, four techniques.
Handling a Very, Very Large Number of Sources
If you need to acquire content from a large number of data sources, you will likely need to develop your own data acquisition and ingestion tools.
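A minimal sketch of such a tool in Python, using the requests library; the seed URLs, output directory, and one-second politeness delay are illustrative assumptions, not part of the original material:

import hashlib
import pathlib
import time

import requests

SEED_URLS = [  # hypothetical sources
    "https://example.com/reports/2021.html",
    "https://example.com/reports/2022.html",
]
OUT_DIR = pathlib.Path("raw_content")
OUT_DIR.mkdir(exist_ok=True)

for url in SEED_URLS:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # fail loudly on broken sources
    # Name each file by a hash of its URL so re-crawls overwrite cleanly.
    name = hashlib.sha256(url.encode()).hexdigest()[:16]
    (OUT_DIR / (name + ".html")).write_bytes(resp.content)
    time.sleep(1)  # be polite to the source server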
Cleansing and Formatting Content
● Determine the format (e.g. PDF, XML, HTML, etc.)
● Extract text content
● Identify and remove useless sections, such as common headers, footers, and sidebars, as well as legal or commercial boilerplate
● Identify differences and changes
● Extract coded metadata
Approaches to Cleansing and Formatting Data from the Internet
Approach 1: Use screen scrapers and/or browser automation tools.
Advantages: extracts metadata from complex structures
Disadvantages: does not work at large scale or with a large variety of content, and typically requires software programming
Approach 2: Use text extractors such as Apache Tika or Oracle Outside In.
Advantages: works on all types of files and formats
Disadvantages: does not extract much metadata (title, description, author) and may not extract content structure (headings, paragraphs, tables, etc.)
Approach 3: Write custom code based on the format, such as an XML SAX parser, Beautiful Soup for HTML, or Aspose for other formats.
Advantages: the most power and flexibility
Disadvantages: the most expensive to implement, since it requires custom programming
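As a sketch of Approach 3 for HTML, the snippet below uses Beautiful Soup to drop boilerplate sections and pull out text plus simple metadata; which tags count as boilerplate is an assumption about the page layout:

from bs4 import BeautifulSoup

def cleanse_html(raw_html: str) -> dict:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Remove useless sections: headers, footers, sidebars, scripts, styles.
    for tag in soup(["header", "footer", "nav", "aside", "script", "style"]):
        tag.decompose()
    desc = soup.find("meta", attrs={"name": "description"})
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "description": desc.get("content", "") if desc else "",
        "text": soup.get_text(separator="\n", strip=True),
    }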
Additional Tools
These additional tools can work in conjunction with the basic cleansing and extraction methods above.
Common paragraph removal
● Identifies common, frequently occurring paragraphs so they can be removed automatically (a sketch follows)
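One hedged way to implement this in Python: count how often each normalized paragraph occurs across the corpus and drop any paragraph that appears in more than a threshold fraction of documents (the 50% threshold is an assumption):

from collections import Counter

def remove_common_paragraphs(docs, threshold=0.5):
    """docs: list of documents, each a list of paragraph strings."""
    counts = Counter()
    for paragraphs in docs:
        counts.update({p.strip().lower() for p in paragraphs})
    # A paragraph is "common" if it occurs in more than threshold of docs.
    common = {p for p, n in counts.items() if n / len(docs) > threshold}
    return [[p for p in paragraphs if p.strip().lower() not in common]
            for paragraphs in docs]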
Structure mapping patterns
● These are large, structural patterns which are easy to describe. They are applied to
input documents to extract and map metadata.
● Patterns can be XML, HTML, or text patterns.
Optical Character Recognition (OCR)
● OCR systems extract text from images, so the text can be further processed by
machines.
● There are some open source engines (e.g. Tesseract and OCRopus) as well as some good commercial options (e.g. ABBYY and AquaForest).
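A minimal OCR sketch using the open source Tesseract engine through the pytesseract wrapper (Tesseract itself must be installed separately; the image path is an assumption):

from PIL import Image
import pytesseract

# Extract machine-readable text from a scanned page image.
text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(text)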
Understand the Whole Document (Macro Understanding)
Once you have embarked on your NLP project, you may need a more holistic understanding of the document; this is “macro understanding.” It is useful for:
● Classifying / categorizing / organizing records
● Clustering records
● Extracting topics
● General sentiment analysis
● Record similarity, including finding similarities between different types of records (for example, job descriptions to résumés /
CVs)
● Keyword / keyphrase extraction
● Duplicate and near-duplicate detection
● Summarization / key sentence extraction
● Semantic search
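As one concrete macro-understanding task, here is a hedged sketch that clusters records by topical similarity using TF-IDF vectors and k-means from scikit-learn; the sample texts and the choice of two clusters are illustrative assumptions:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Quarterly revenue grew on strong cloud sales.",
    "The new phone ships with an improved camera.",
    "Cloud subscriptions drove another record quarter.",
    "Camera reviews praise the phone's low-light mode.",
]
# Vectorize each record, then group the records into two clusters.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for label, doc in zip(labels, docs):
    print(label, doc)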
Extracting Facts, Entities, and Relationships (Micro Understanding)
Micro understanding is the extraction of individual entities, facts, or relationships from the text. This is useful for (from easiest to hardest):
● Extracting acronyms and their definitions
● Extracting citation references to other documents
● Extracting key entities (people, companies, products, dollar amounts, locations, dates). Note that extracting “key” entities is not the same as extracting “all” entities (there is some discrimination implied in selecting which entities are “key”)
● Extracting facts and metadata from full text when it’s not separately tagged in the web page
● Extracting entities with sentiment (e.g. positive sentiment towards a product or company)
● Identifying relationships such as business relationships, target / action / perpetrator, etc.
● Identifying compliance violations, i.e., statements that show possible violations of rules
● Extracting statements with attribution, for example, quotes from people (who said what)
● Extracting rules or requirements, such as contract terms, regulation requirements, etc.
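A sketch of key-entity extraction with spaCy's pretrained English pipeline, one of several possible toolkits (install the model with python -m spacy download en_core_web_sm; the sample sentence is illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp. paid $2.5 million to settle the Maryland case on June 3, 2021.")
for ent in doc.ents:
    # Prints entities with labels such as ORG, MONEY, GPE, DATE.
    print(ent.text, ent.label_)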
Micro understanding must be done with syntactic analysis of the text. This means that word order and usage are important. There are three broad approaches:
1. Top Down – determine the part of speech of each word, then understand and diagram the sentence into clauses, nouns, verbs, objects and subjects, modifying adjectives and adverbs, etc., and finally traverse this structure to identify structures of interest
● Advantages – can handle complex, never-seen-before structures and patterns
● Disadvantages – hard to construct rules, brittle, often fails with variant input, may still require substantial pattern
matching even after parsing.
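A minimal sketch of the top-down idea, assuming spaCy's dependency parser: parse the sentence, then traverse the tree to pull out subject / verb / object triples. A real system needs far more rules than this:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corporation acquired the startup for ten million dollars.")
for token in doc:
    if token.pos_ == "VERB":
        # Collect the head nouns attached as subject and direct object.
        subjects = [w.text for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
        objects = [w.text for w in token.rights if w.dep_ == "dobj"]
        if subjects and objects:
            print(subjects, token.lemma_, objects)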
2. Bottom Up – create many patterns, match the patterns against the text, and extract the necessary facts (see the sketch after this list). Patterns may be entered manually or computed using text mining.
● Advantages – easy to create patterns, can be done by business users, does not require programming, easy to debug and fix, runs fast, matches directly to desired outputs
● Disadvantages – requires ongoing pattern maintenance, cannot match on newly invented constructs
3. Statistical – similar to bottom-up, but matches patterns against a statistically weighted database of patterns generated from tagged training data.
● Advantages – patterns are created automatically, built-in statistical trade-offs
● Disadvantages – requires generating extensive training data (thousands of examples), will need to be retrained periodically for best accuracy, cannot match on newly invented constructs, harder to debug
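To illustrate the bottom-up approach, here is a single hand-written pattern that extracts acronyms and their definitions of the form "Full Name (ACRO)"; a production pattern library would hold hundreds of such rules, and this rough regex is only a sketch:

import re

# Capture up to five capitalized words followed by an all-caps acronym
# in parentheses. Deliberately rough; real rules need more care.
PATTERN = re.compile(r"((?:[A-Z][a-z]+\s+){1,5})\(([A-Z]{2,})\)")

text = "The National Transportation Safety Board (NTSB) released its report."
for match in PATTERN.finditer(text):
    definition, acronym = match.group(1).strip(), match.group(2)
    print(acronym, "=", definition)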
Service frameworks for NLP
● IBM Cognitive – statistical approach based on training data
● Google Cloud Natural Language API – top-down full-sentence diagramming system
● Amazon Lex – geared more towards human-interactive (human-in-the-loop) conversations
Some tricky things to watch out for
● Co-reference resolution - sentences often refer to previous objects.
- Pronoun reference: “She is 49 years old.”
- Partial reference: “Linda Nelson is a top accountant working in Hawaii. Linda is 49 years old.”
- Implied container reference: “The state of Maryland is a place of history. The capital, Annapolis, was founded in 1649.”
● Handling lists and repeated items. For example:
“The largest cities in Maryland are Baltimore, Columbia, Germantown, Silver Spring, and Waldorf.”
- Such lists often break NLP algorithms and may require special handling that exists outside the standard structures.
● Handling embedded structures
such as tables, markup, bulleted lists, headings, etc.
- Note that structural elements can also play havoc with NLP technologies.
- Make sure that NLP does not match sentences and patterns across structural boundaries, for example from one bullet point into the next (see the sketch after this list).
- Make sure that markup does not break NLP analysis where it shouldn’t. For example, embedded emphasis should
not cause undue problems.
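One hedged way to respect those boundaries, sketched in Python: split an HTML fragment into one text unit per bullet and run NLP on each unit separately, so no sentence or pattern can span two bullets. The sample HTML is illustrative:

from bs4 import BeautifulSoup

html = """
<ul>
  <li>Revenue grew 12 percent in Maryland.</li>
  <li>The Annapolis office opened in 2019.</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
units = [li.get_text(strip=True) for li in soup.find_all("li")]
for unit in units:
    # Feed each bullet to the NLP stage on its own; patterns cannot
    # accidentally match from the end of one bullet into the next.
    print(unit)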
Work with Results
Data Has Been Cleansed and Processed. What's Next?
There are several places this information can go:
● A search engine - to enhance the full document (additional metadata fields for additional facets or filters) and to support search-based visualization dashboards (Kibana, Banana, Hue, or ZoomData, for example); see the indexing sketch after this list
● A relational database - to be combined with other business data for visualization
and business analytics (Tableau, Pentaho, or others)
● A graph database - for complex relationship analysis
● A monitoring and alerting tool - for situations that need immediate attention
(e.g. compliance violations, trending negative sentiment, bad customer service
situations, etc.)
● Apache Spark - for further real-time analytics and machine learning
● A business rules engine / ESB / workflow - to send the output through further
manual and business processing. For example, to review the output for quality,
check for compliance violations, etc.
● Custom applications - for quality review and analysis, crowdsourcing review,
etc.
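As a sketch of the first destination, the snippet below indexes extracted metadata into Elasticsearch using the official Python client (v8 API); the host, index name, and document fields are illustrative assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.index(
    index="extracted_docs",  # hypothetical index name
    id="doc-001",
    document={
        "title": "Quarterly report",
        "entities": ["Acme Corp.", "Maryland"],
        "sentiment": "negative",
    },
)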
Quality Analysis
To do most quality analysis, you will need to check two parameters:
● Do you have everything? This is the “completeness” or “coverage”
check.
● Is what you have correct? This is the “accuracy” check.
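A tiny sketch of both checks against a hand-labeled sample, treating completeness as recall and accuracy as precision; the expected and extracted sets are illustrative:

# Ground truth from a human-labeled sample vs. what the pipeline found.
expected = {"Acme Corp.", "Maryland", "$2.5 million", "June 3, 2021"}
extracted = {"Acme Corp.", "Maryland", "June 3, 2021", "Annapolis"}

completeness = len(expected & extracted) / len(expected)  # coverage / recall
accuracy = len(expected & extracted) / len(extracted)     # precision
print(f"completeness={completeness:.0%} accuracy={accuracy:.0%}")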
Quality Analysis Goals
● Completeness of bulk download from the Internet
● Completeness of incremental download from the Internet
● Accuracy and completeness of tagged metadata extraction
● Accuracy and completeness of basic linguistic processing
● Accuracy and completeness of entity extraction
● Accuracy and completeness of categorization
● Accuracy and completeness of natural language processing
extraction