Approaches to text analysis

Text Analysis on News Documents
Rohith Yeravothula
rohith@sigmoid.com
SDE, Sigmoid

Outline
• Introduction to Text Analytics
• Introduction to News Documents text analysis
• Introduction to our Architecture and its elements
• Introduction to Compute Pipeline
• Phase I Computation
• Phase II Computation
• Knowledge Graph
• Introduction to News-Explorer

Why Text Analytics?
• 80% of all data today is unstructured. This includes news articles, research reports, social media posts and enterprise system data.

• Text Analytics can help an organization derive potentially valuable business insights from text-based content such as Word documents, email, postings on social media streams like Facebook and Twitter, or customer responses posted on e-commerce sites like Flipkart and Amazon.

Applications of Text Analytics
• Sentiment Analysis
• Hot Topics or Key Trends
• Root cause analysis

Text Analytics
• Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into useful business intelligence.
• It includes structuring the input text (usually by parsing), deriving patterns within the structured data, and finally evaluating and interpreting the output.

News Documents text analysis
• We at Sigmoid are trying to build a knowledge graph by identifying high-quality information in the raw text of the news documents that we crawl and parse.

News Documents text analysis : Statistics
• Every day we crawl 20K new documents
• We crawl 35K archive news documents
• We have crawled 40M archive news documents
• Today this amounts to approximately 100 GB of text to process

News Documents text analysis
• Architecture

Architecture Element : MongoDB
• Crawled news documents are stored in MongoDB
• Documents are preprocessed: duplicates, empty documents, etc. are eliminated
• The raw text of each document is extracted and stored in Hadoop

Compute Pipeline : Preprocessing
• Raw text extracted from news documents gets preprocessed
• Preprocessing includes removing non-ASCII characters, punctuation marks, and symbols like '|','','{' etc.
• The whole text is split into sentences
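
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function name `preprocess`, the sentence-boundary rule, and the lowercasing step are assumptions. The sketch splits into sentences before stripping punctuation, because removing '.', '!' and '?' first would erase the sentence boundaries.

```python
import re
import string


def preprocess(raw_text):
    """Return a list of cleaned sentences from raw article text (hypothetical helper)."""
    # Drop every non-ASCII character.
    ascii_text = raw_text.encode("ascii", errors="ignore").decode("ascii")

    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", ascii_text)

    # Remove punctuation marks and symbols such as '|' and '{'.
    strip_table = str.maketrans("", "", string.punctuation)
    cleaned = []
    for sentence in sentences:
        # Assumption: lowercase and collapse whitespace so n-grams compare equal.
        words = sentence.translate(strip_table).lower().split()
        if words:
            cleaned.append(" ".join(words))
    return cleaned


print(preprocess("A {b} | c. D e!"))
```

A production pipeline would likely use a proper sentence tokenizer, since this simple rule breaks on abbreviations such as "Mr. Smith".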

Compute Pipeline : Main Algorithm
• The compute process identifies concepts and the relations between concepts
• Phase I : Parse the raw text and identify concepts
• Phase II : Parse the raw text again and identify relations between concepts

• A concept is a meaningful N-gram formed from an (N-1)-gram concept and a 1-gram, provided the combination makes sense
• Important factors that judge the meaning of a concept are
  • Confidence (Frequency)
  • Strength

Strength{Concept(N)} = Function{Confidence(Concept(N)),
Confidence(Concept(N-1))}
• A new N-gram concept is formed from an (N-1)-gram concept and a 1-gram
• Phase I identifies all the concepts
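
One way to picture Phase I is the sketch below. The slides do not define Function{...}, so the `strength` used here (the share of the (N-1)-gram's occurrences that continue into the N-gram) is purely an assumption, as are the names `find_concepts`, `min_confidence` and `min_strength`, the thresholds, and the choice to seed the process with frequent 1-grams.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def strength(conf_n, conf_prefix):
    # Hypothetical stand-in for Function{Confidence(Concept(N)), Confidence(Concept(N-1))}.
    return conf_n / conf_prefix


def find_concepts(sentences, max_n=3, min_confidence=2, min_strength=0.5):
    """Grow concepts one token at a time; returns {ngram tuple: confidence}."""
    tokenized = [s.split() for s in sentences]

    # Seed (assumption): frequent 1-grams count as concepts.
    counts = Counter(g for toks in tokenized for g in ngrams(toks, 1))
    previous = {g: c for g, c in counts.items() if c >= min_confidence}
    concepts = dict(previous)

    for n in range(2, max_n + 1):
        # Only count N-grams that extend an accepted (N-1)-gram concept.
        counts = Counter(
            g for toks in tokenized for g in ngrams(toks, n) if g[:-1] in previous
        )
        current = {
            g: c
            for g, c in counts.items()
            if c >= min_confidence and strength(c, previous[g[:-1]]) >= min_strength
        }
        if not current:
            break
        concepts.update(current)
        previous = current
    return concepts


sample = ["sachin tendulkar scored", "sachin tendulkar retired", "rahul scored"]
print(find_concepts(sample))
```

In this made-up sample, "sachin tendulkar" survives as a 2-gram concept because it occurs twice and every occurrence of "sachin" continues into it, while "tendulkar scored" is dropped for occurring only once.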

• Phase II : Identifying the relations between the concepts
• Two factors judge the meaning of a relation between two meaningful concepts
  • Confidence (Frequency)
  • Strength

• A new relation is formed when two identified, meaningful concepts appear in the same sentence
• A relation is a logical representation of two concepts that appear in a single sentence and somehow make sense appearing together

Strength[Relation(Concept1,Concept2)] =
Function{Confidence(Concept1,Concept2),
Confidence(Concept1),Confidence(Concept2)}

• Phase II computes relations from the concepts formed in Phase I
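
Phase II can be sketched along the same lines. The relation strength function is not specified in the slides, so the formula below (co-occurrence count divided by the count of the rarer concept) is an assumption chosen only for illustration. The names `find_relations` and `contains`, the thresholds, and the sample sentences are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations


def contains(sentence_tokens, concept_tokens):
    """True if the concept's tokens appear contiguously in the sentence."""
    n = len(concept_tokens)
    return any(
        sentence_tokens[i:i + n] == concept_tokens
        for i in range(len(sentence_tokens) - n + 1)
    )


def find_relations(sentences, concepts, min_confidence=2, min_strength=0.5):
    """Return {(concept1, concept2): {"confidence": ..., "strength": ...}}."""
    concept_tokens = {c: c.split() for c in concepts}
    concept_conf = Counter()  # sentences containing each concept
    pair_conf = Counter()     # sentences containing both concepts of a pair

    for sentence in sentences:
        tokens = sentence.split()
        present = sorted(c for c, toks in concept_tokens.items() if contains(tokens, toks))
        concept_conf.update(present)
        pair_conf.update(combinations(present, 2))

    relations = {}
    for (c1, c2), conf in pair_conf.items():
        # Hypothetical stand-in for Function{Confidence(C1,C2), Confidence(C1), Confidence(C2)}.
        strength = conf / min(concept_conf[c1], concept_conf[c2])
        if conf >= min_confidence and strength >= min_strength:
            relations[(c1, c2)] = {"confidence": conf, "strength": strength}
    return relations


sample = [
    "sachin tendulkar met rahul gandhi",
    "rahul gandhi praised sachin tendulkar",
    "sonia gandhi spoke",
]
print(find_relations(sample, ["sachin tendulkar", "rahul gandhi", "sonia gandhi"]))
```

Pairs are stored in sorted order so that (Concept1, Concept2) and (Concept2, Concept1) count as the same relation.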

Compute Pipeline : Knowledge Graph
• Concepts can be treated as vertices of the knowledge graph
• Relations can be treated as edges between two vertices (concepts)
• Graph storage is well suited to representing this information for fast retrieval

Concepts and Relations
• Concepts:
  • sachin tendulkar, params
  • sania mirza, params
  • barack obama, params
• Relations:
  • sachin tendulkar, rahul gandhi, params
  • sonia gandhi, rahul gandhi, params
  • sachin tendulkar, BCCI, params
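
As an illustration of how the sample concepts and relations listed above map onto vertices and edges, here is a small in-memory sketch. The `KnowledgeGraph` class and its method names are hypothetical, the edges are assumed to be undirected, and the `params` are left empty because the slides do not say what they contain.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Hypothetical in-memory graph: concepts are vertices, relations are edges."""

    def __init__(self):
        self.vertices = {}              # concept -> params
        self.edges = defaultdict(dict)  # concept -> {neighbour: params}

    def add_concept(self, concept, **params):
        self.vertices.setdefault(concept, {}).update(params)

    def add_relation(self, c1, c2, **params):
        # Make sure both endpoints exist as vertices.
        self.add_concept(c1)
        self.add_concept(c2)
        # Assumption: relations are undirected, so store both directions.
        self.edges[c1][c2] = params
        self.edges[c2][c1] = params

    def neighbours(self, concept):
        return sorted(self.edges.get(concept, {}))


graph = KnowledgeGraph()
for concept in ["sachin tendulkar", "sania mirza", "barack obama"]:
    graph.add_concept(concept)
for c1, c2 in [
    ("sachin tendulkar", "rahul gandhi"),
    ("sonia gandhi", "rahul gandhi"),
    ("sachin tendulkar", "BCCI"),
]:
    graph.add_relation(c1, c2)

print(graph.neighbours("sachin tendulkar"))
```

A real deployment would more likely persist this in a graph database; the dictionary-of-dictionaries layout is only meant to show the vertex and edge roles.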

Concepts and Relations : Sample sub graph

Advantages of Graph
• Recommendation
• Path Finding
• Traversal
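
To make these advantages concrete, the sketch below runs a breadth-first traversal to find a shortest path between two concepts, and uses two-hop neighbours as a naive recommendation. The adjacency dictionary reuses the sample relations from the earlier slide; the function names and the two-hop recommendation rule are assumptions, not the approach used in the actual system.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of concepts, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


def recommend(graph, concept):
    """Naive recommendation (assumption): concepts exactly two hops away."""
    direct = set(graph.get(concept, ()))
    two_hop = {n2 for n1 in direct for n2 in graph.get(n1, ())}
    return sorted(two_hop - direct - {concept})


# Undirected adjacency built from the sample relations.
graph = {
    "sachin tendulkar": {"rahul gandhi", "BCCI"},
    "rahul gandhi": {"sachin tendulkar", "sonia gandhi"},
    "sonia gandhi": {"rahul gandhi"},
    "BCCI": {"sachin tendulkar"},
}

print(shortest_path(graph, "sonia gandhi", "BCCI"))
print(recommend(graph, "sonia gandhi"))
```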

Introduction to News Explorer
• A personalized, customizable application for exploring a topic
• Gives insight into the concepts connected to the user's search
• Maintains search history for every user
• Recommends new documents based on user history
