Text Analysis
on News Documents
Rohith Yeravothula
rohith@sigmoid.com
SDE, Sigmoid
Outline
• Introduction to Text Analytics
• Introduction to News Documents text analysis
• Introduction to our Architecture and its elements
• Introduction to Compute Pipeline
• Phase I Computation
• Phase II Computation
• Knowledge Graph
• Introduction to News-Explorer
Why Text Analytics?
• 80% of all data today is unstructured. This includes news articles,
research reports, social media posts and enterprise system data.
• Text analytics can help an organization derive potentially valuable
business insights from text-based content such as Word documents,
email, postings on social media streams like Facebook and Twitter, and
customer responses posted on e-commerce sites like Flipkart and Amazon.
Applications of Text Analytics
• Sentiment analysis
• Hot topics or key trends
• Root cause analysis
Text Analytics
• Text analytics is the process of analyzing unstructured text, extracting
relevant information, and transforming it into useful business intelligence.
• It includes structuring the input text (usually by parsing),
deriving patterns within the structured data, and finally evaluating and
interpreting the output.
News Documents text analysis
• We at Sigmoid are trying to build a knowledge graph by identifying
high-quality information in the raw text of the news documents that we
crawl and parse.
News Documents text analysis : Statistics
• Every day we crawl 20K new documents
• We crawl 35K archive news documents
• We have crawled 40M archive news documents
• Today this amounts to approximately 100 GB of text to process
News Documents text analysis : Architecture
Architecture Element : MongoDB
• Crawled news documents are stored in MongoDB
• Documents are preprocessed to eliminate duplicates, empty
documents, etc.
• The raw text of each document is extracted and stored in Hadoop
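The duplicate and empty-document elimination described above might look roughly like the following sketch. It assumes each crawled document is a dict with a hypothetical `text` field, and it only catches exact duplicates after whitespace and case normalisation; the slides do not say how duplicates are actually detected.

```python
import hashlib

def drop_empty_and_duplicates(documents):
    """Yield documents whose text is non-empty and has not been seen before.

    ASSUMPTION: each document is a dict with a "text" field (hypothetical name).
    """
    seen = set()
    for doc in documents:
        text = (doc.get("text") or "").strip()
        if not text:
            continue  # skip empty documents
        # Fingerprint the whitespace-normalised, lower-cased text so that
        # trivially different copies of the same article collide.
        normalised = " ".join(text.lower().split())
        fingerprint = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if fingerprint in seen:
            continue  # skip duplicates
        seen.add(fingerprint)
        yield doc

# Toy input for illustration only.
docs = [
    {"_id": 1, "text": "Hello world"},
    {"_id": 2, "text": "  hello   WORLD "},
    {"_id": 3, "text": ""},
    {"_id": 4, "text": None},
    {"_id": 5, "text": "Other"},
]
kept_ids = [d["_id"] for d in drop_empty_and_duplicates(docs)]
```

Keeping only a set of fingerprints, rather than the texts themselves, keeps memory use modest when the collection is large.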
Compute Pipeline : Preprocessing
• Raw text extracted from news documents gets preprocessed
• Preprocessing includes:
  • removing non-ASCII characters
  • removing punctuation marks and symbols such as '|' and '{'
  • splitting the whole text into sentences
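A minimal sketch of these preprocessing steps, under a few assumptions that are not in the slides: sentences are taken to end at '.', '!' or '?' followed by whitespace, the text is lower-cased, and sentence splitting happens before punctuation is stripped (otherwise the sentence boundaries would be lost).

```python
import re
import string
from typing import List

# ASSUMPTION: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(raw_text: str) -> List[str]:
    """Return a list of cleaned, lower-cased sentences."""
    # 1. Drop non-ASCII characters.
    ascii_text = raw_text.encode("ascii", errors="ignore").decode("ascii")
    # 2. Split into sentences while the punctuation is still present.
    sentences = _SENTENCE_END.split(ascii_text)
    cleaned = []
    for sentence in sentences:
        # 3. Remove punctuation and symbols, then normalise whitespace.
        no_punct = sentence.translate(_PUNCT_TABLE)
        normalised = " ".join(no_punct.lower().split())
        if normalised:
            cleaned.append(normalised)
    return cleaned

sample = preprocess("News is crawled daily. Text | {symbols} are removed!")
```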
Compute Pipeline : Main Algorithm
• The compute process consists of identifying concepts and identifying the
relations between concepts
• Phase I : Parse the raw text and identify concepts
• Phase II : Parse the raw text again and identify relations between
concepts
• A concept is a meaningful N-gram formed from an (N-1)-gram concept
and a 1-gram, provided the combination makes sense
• Important factors in judging whether a concept is meaningful are:
  • Confidence (frequency)
  • Strength
Strength[Concept(N)] = Function{Confidence(Concept(N)),
Confidence(Concept(N-1))}
• A new N-gram concept is formed from an (N-1)-gram concept and a 1-gram
• Phase I identifies all the concepts
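Phase I can be sketched as level-by-level N-gram growth. The slides do not define Function, so this sketch assumes a strength equal to the N-gram's frequency divided by the frequency of its (N-1)-gram prefix. The thresholds `min_confidence` and `min_strength` and the limit `max_n` are hypothetical parameters, and only the (N-1)-gram prefix is required to be a concept already.

```python
from collections import Counter

def find_concepts(sentences, max_n=3, min_confidence=2, min_strength=0.5):
    """Grow N-gram concepts from (N-1)-gram concepts, one level at a time.

    Returns {ngram_tuple: (confidence, strength)}.
    """
    tokenised = [s.split() for s in sentences]
    # Level 1: every word that is frequent enough is a 1-gram concept.
    unigram_counts = Counter((w,) for words in tokenised for w in words)
    concepts = {g: (c, 1.0) for g, c in unigram_counts.items()
                if c >= min_confidence}
    previous = set(concepts)
    for n in range(2, max_n + 1):
        counts = Counter()
        for words in tokenised:
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # Only extend an existing (N-1)-gram concept by one word.
                if gram[:-1] in previous:
                    counts[gram] += 1
        current = set()
        for gram, confidence in counts.items():
            # ASSUMED strength function: the share of the (N-1)-gram's
            # occurrences that continue into this N-gram.
            strength = confidence / concepts[gram[:-1]][0]
            if confidence >= min_confidence and strength >= min_strength:
                concepts[gram] = (confidence, strength)
                current.add(gram)
        if not current:
            break  # nothing to extend at the next level
        previous = current
    return concepts

# Toy sentences for illustration only.
toy = [
    "knowledge graph stores concepts",
    "knowledge graph stores relations",
    "knowledge graph grows daily",
    "text analytics grows daily",
]
found = find_concepts(toy)
```

Under these assumptions, "knowledge graph" becomes a 2-gram concept and is then extended to "knowledge graph stores", while one-off combinations such as "graph grows" are dropped.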
• Phase II : Identify the relations between the concepts
• Two factors are used to judge whether a relation between two meaningful
concepts is itself meaningful:
  • Confidence (frequency)
  • Strength
• A new relation is formed when two meaningful concepts that were
identified in Phase I appear in the same sentence
• A relation is a logical representation of two concepts that appear in a
single sentence and make sense appearing together
Strength[Relation(Concept1, Concept2)] =
Function{Confidence(Concept1, Concept2),
Confidence(Concept1), Confidence(Concept2)}
• Phase II computes relations from the concepts formed in Phase I
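Phase II can be sketched as sentence-level co-occurrence counting over the Phase I concepts. Function is again unspecified in the slides, so this sketch assumes a strength equal to the co-occurrence count divided by the confidence of the rarer concept. The thresholds are hypothetical, and concepts are matched as whole-word substrings of already preprocessed sentences.

```python
from collections import Counter
from itertools import combinations

def find_relations(sentences, concepts, min_confidence=2, min_strength=0.5):
    """Relate concepts that appear together in a sentence.

    concepts: {concept_string: confidence}, as produced by Phase I.
    Returns {(concept1, concept2): (confidence, strength)}.
    """
    pair_counts = Counter()
    for sentence in sentences:
        padded = f" {sentence} "
        # Concepts present in this sentence, matched on word boundaries.
        present = sorted(c for c in concepts if f" {c} " in padded)
        for c1, c2 in combinations(present, 2):
            pair_counts[(c1, c2)] += 1
    relations = {}
    for (c1, c2), confidence in pair_counts.items():
        # ASSUMED strength function: co-occurrences relative to the
        # confidence of the rarer of the two concepts.
        strength = confidence / min(concepts[c1], concepts[c2])
        if confidence >= min_confidence and strength >= min_strength:
            relations[(c1, c2)] = (confidence, strength)
    return relations

# Toy sentences and concept confidences, for illustration only.
toy_sentences = [
    "acme corp signed with globex bank",
    "acme corp sued globex bank",
    "globex bank hired initech labs",
]
toy_concepts = {"acme corp": 2, "globex bank": 3, "initech labs": 1}
toy_relations = find_relations(toy_sentences, toy_concepts)
```

Pairs are stored in sorted order so that (A, B) and (B, A) count as the same relation.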
Compute Pipeline : Knowledge Graph
• Concepts can be treated as vertices of the knowledge graph
• Relations can be treated as edges between two vertices (concepts)
• Graph storage is a natural way to represent this information for fast
retrieval
Concepts and Relations
• Concepts:
  • sachin tendulkar, params
  • sania mirza, params
  • barack obama, params
• Relations:
  • sachin tendulkar, rahul gandhi, params
  • sonia gandhi, rahul gandhi, params
  • sachin tendulkar, BCCI, params
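As a rough illustration, the records above could be loaded into an in-memory adjacency structure. The `KnowledgeGraph` class and its method names are hypothetical, relations are assumed to be undirected, and `params` is left empty because the slides do not say what it contains.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Concepts as vertices, relations as undirected edges (assumed)."""

    def __init__(self):
        self.vertices = {}               # concept -> params
        self.edges = defaultdict(dict)   # concept -> {neighbour: params}

    def add_concept(self, name, params=None):
        self.vertices[name] = params or {}

    def add_relation(self, c1, c2, params=None):
        # Make sure both endpoints exist as vertices.
        for concept in (c1, c2):
            self.vertices.setdefault(concept, {})
        edge_params = params or {}
        # Store the edge in both directions, sharing one params object.
        self.edges[c1][c2] = edge_params
        self.edges[c2][c1] = edge_params

    def neighbours(self, name):
        return sorted(self.edges.get(name, {}))

graph = KnowledgeGraph()
for concept in ["sachin tendulkar", "sania mirza", "barack obama"]:
    graph.add_concept(concept)          # params unspecified in the source
for c1, c2 in [
    ("sachin tendulkar", "rahul gandhi"),
    ("sonia gandhi", "rahul gandhi"),
    ("sachin tendulkar", "BCCI"),
]:
    graph.add_relation(c1, c2)          # params unspecified in the source

linked = graph.neighbours("sachin tendulkar")
```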
Concepts and Relations : Sample sub graph
Advantages of Graph
• Recommendation
• Path finding
• Traversal
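Two of these operations can be sketched on a plain adjacency dict built from the sample relations. Breadth-first search stands in for path finding and traversal, and "concepts two hops away" stands in for recommendation. Both are illustrative choices, not necessarily what the system uses.

```python
from collections import deque

def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns a list of concepts, or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(adjacency[node]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

def recommend(adjacency, concept):
    """Concepts two hops away that are not already directly connected."""
    direct = adjacency.get(concept, set())
    two_hops = set()
    for neighbour in direct:
        two_hops |= adjacency.get(neighbour, set())
    return sorted(two_hops - direct - {concept})

# Adjacency built from the sample relations, treated as undirected.
adjacency = {
    "sachin tendulkar": {"rahul gandhi", "BCCI"},
    "rahul gandhi": {"sachin tendulkar", "sonia gandhi"},
    "sonia gandhi": {"rahul gandhi"},
    "BCCI": {"sachin tendulkar"},
}
path = shortest_path(adjacency, "BCCI", "sonia gandhi")
suggestions = recommend(adjacency, "sachin tendulkar")
```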
Introduction to News Explorer
• Personalized, customizable application for exploring a topic
• Gives insight into the concepts connected to the user's search
• Maintains a search history for every user
• Recommends new documents based on the user's history