News and Blog Analysis with Lydia
Charles Ward – Stony Brook University
Karthik Balaji, Levon Lloyd – General Sentiment
October 2nd, 2009
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Large-Scale News/Blog Analysis
  The Lydia news/blog analysis system performs a daily
  analysis of more than 1,000 English and foreign-language
  online newspapers, plus blogs and other text sources.
  We currently track tens of millions of named entities in
  the news and blogs, providing spatial, temporal,
  relational, and sentiment analysis.
  Customers track entities of interest using reports
  generated in our user interface.
Lydia Text Analysis Phases
  Lydia performs named entity recognition and analysis over large
  text corpora.
      Spidering: Lydia spiders and parses thousands of online news
      sources. We also ingest the social-media feed provided by
      Spinn3r.
      Named Entity Recognition: Lydia identifies and classifies
      occurrences of named entities (people, places, companies,
      etc.).
      Sentiment Analysis: Lydia assigns sentiment scores to
      identified entities using shallow NLP techniques.
      Entity Statistics Aggregation: Lydia digests marked-up text
      and produces usable entity statistics.
      Data Exploration: Aggregated entity statistics are made
      available through user interfaces and programming APIs for
      detailed exploration of the data.
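  The phases hand off marked-up text between them. As a purely
  hypothetical illustration of what a marked-up article fragment
  might look like (the tag names here are invented, not Lydia's
  actual schema):

      <sentence>
        <entity type="person">Michael Phelps</entity> won his
        eighth gold medal in <entity type="place">Beijing</entity>.
      </sentence>

  The sentiment and aggregation phases consume fragments like this
  to produce per-entity statistics.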
Lydia Architecture
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Frequency Time Series
  Michael Vick references (2004-2009)
  Mel Gibson references (2004-2009)
Sentiment Analysis
  Michael Phelps sentiment score (June 2008-Feb 2009)
  David Paterson sentiment score (Jan 2008-Jul 2009)
Comparative Analysis
  Peyton Manning vs. Eli Manning
Heatmaps
  Arnold Schwarzenegger
  Alabama
Ethnic Biases in News Coverage
  Frequency of coverage of entities with Hispanic names in the
  U.S. news, 2004-2008
  Percentage of population self-reporting as Hispanic in the 2000
  census. Courtesy of Wikipedia.
Ethnic Biases in News Coverage
  (a) African
  (b) Hispanic
  (c) East Asian
  (d) Indian
  (e) Eastern European
  (f) Muslim
Juxtaposition Analysis
   Top Juxtapositions for Barack Obama
   Juxtapositions between Barack Obama and John McCain
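   A juxtaposition is, at its core, a co-occurrence of two entities
   in the same sentence or article. A minimal sketch of the counting
   step (a real scoring scheme would typically also weight pairs for
   statistical significance; that is omitted here):

      import java.util.*;

      public class JuxtapositionCounter {
          // How often each unordered pair of entities co-occurs.
          private final Map<String, Integer> pairCounts = new HashMap<>();

          // entities: the named entities recognized in one sentence.
          public void addSentence(List<String> entities) {
              for (int i = 0; i < entities.size(); i++) {
                  for (int j = i + 1; j < entities.size(); j++) {
                      pairCounts.merge(key(entities.get(i), entities.get(j)),
                                       1, Integer::sum);
                  }
              }
          }

          public int count(String a, String b) {
              return pairCounts.getOrDefault(key(a, b), 0);
          }

          // Order the pair so (A,B) and (B,A) share one key.
          private static String key(String a, String b) {
              return a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
          }
      }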
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Hadoop in Lydia
  The legacy Perl NLP pipeline runs in parallel on
  Hadoop Streaming, generating articles with marked-up
  entities which are stored as compressed XML in
  HDFS.
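  A rough sketch of how one Perl stage runs under Hadoop Streaming
  (script and path names here are illustrative, not our actual ones):

      hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
          -input /dailies/raw_articles \
          -output /dailies/marked_up_articles \
          -mapper ner_markup.pl \
          -file ner_markup.pl \
          -numReduceTasks 0

  The -file flag ships the script to the worker nodes, and zero
  reduce tasks makes this a map-only markup pass.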

  To build or update Lydia entity statistics and indexes
  for a single text corpus, over 80 map-reduce jobs are
  necessary.

  We have developed a custom workflow management
  framework in Amazon EC2 to manage our data and
  processing.
Lydia Workflow Framework
 High-level concepts:
   A depository is a statistics dataset derived from a
   text corpus. It consists of artifacts.
     Stored as a directory structure in HDFS
   An artifact is a homogeneous dataset of a specific
   type.
     Examples:
        Key-value artifacts, e.g. entity name -> frequency time
        series
        Lucene index artifacts (entity and article indexes)
      Stored as a directory in HDFS containing several
      map-reduce job output subdirectories named by date range
      (we do updates at daily granularity).
Artifact Dependencies
 Most Lydia artifacts are derived from other artifacts,
 forming a dependency graph.
Artifact Storage

 Lydia artifacts are stored in HDFS inside the
   depository directory:
   /dailies (depository name)
     /EXACT_DUP_ARTICLES (artifact name)
        /2004_11_01-2009_03_31 (date range-named MR output)
           /part-00000
           ...
           /part-00017
        /2009_04_01-2009_04_02
           ...
         /2009_04_03
         ...
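  A hypothetical sketch of resolving an artifact's date-range
  directories with the HDFS API (the paths follow the layout above;
  the code itself is illustrative):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ArtifactDirs {
          // Lists the date-range-named MR output dirs of one artifact.
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              Path artifact = new Path("/dailies/EXACT_DUP_ARTICLES");
              for (FileStatus s : fs.listStatus(artifact)) {
                  if (s.isDir()) {    // each subdirectory is one date range
                      System.out.println(s.getPath().getName());
                  }
              }
          }
      }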
Job Input Selection
   Artifact updates are incrementally propagated through
   the dependency graph.
   Multiple date ranges (sometimes overlapping) typically
   exist for each artifact.
   Some small artifacts are fully rebuilt on every update.
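   A minimal sketch of the selection rule, assuming an update job
   consumes every upstream date range that ends after the newest
   data already incorporated downstream (the real framework also
   reconciles overlapping ranges):

      import java.util.*;

      public class InputSelector {
          // Range directory names look like "2009_04_01-2009_04_02";
          // lexicographic order on them matches chronological order.
          public static List<String> selectInputs(List<String> upstream,
                                                  String lastBuiltEnd) {
              List<String> inputs = new ArrayList<>();
              for (String range : upstream) {
                  String end = range.substring(range.indexOf('-') + 1);
                  if (end.compareTo(lastBuiltEnd) > 0) {
                      inputs.add(range);  // newer than the downstream state
                  }
              }
              Collections.sort(inputs);
              return inputs;
          }
      }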
Depository Build Scheduling
   The same tool is used for the initial depository build
   and for updating it with new data.

   Any set of target artifacts to build can be specified,
   much as in a makefile; prerequisites of the targets are
   identified automatically.

   Artifacts are built in the correct order according to
   dependencies.

   The build process runs as a sequence of Hadoop
   map-reduce jobs and occasional serial jobs.
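   The ordering is a topological sort of the dependency graph. A
   sketch, assuming the graph is acyclic (the dependency map and
   artifact names are hypothetical):

      import java.util.*;

      public class BuildScheduler {
          // deps maps each artifact to the artifacts it derives from.
          public static List<String> buildOrder(Set<String> targets,
                                                Map<String, List<String>> deps) {
              List<String> order = new ArrayList<>();
              Set<String> seen = new HashSet<>();
              for (String t : targets) visit(t, deps, seen, order);
              return order;   // prerequisites precede their dependents
          }

          private static void visit(String a, Map<String, List<String>> deps,
                                    Set<String> seen, List<String> order) {
              if (!seen.add(a)) return;   // already scheduled
              for (String d : deps.getOrDefault(a, Collections.emptyList()))
                  visit(d, deps, seen, order);
              order.add(a);               // emitted after its prerequisites
          }
      }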
Amazon EC2
  We run Hadoop on Amazon EC2.
   – Quickly scale capacity as requirements change.
   – 10 extra-large nodes for weekly data processing.
  Amazon S3 is our persistent data store.
  All our web services are hosted on dedicated Amazon
  nodes.
  S3 is not meeting our required level of service.
   – Moving to EBS.
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Depository Server
  Random access to the Lydia depository, e.g.:
    Monthly frequency time series of Barack Obama in all
    U.S. sources
    Top juxtapositions for Continental Airlines in February
    2009
    Sentiment time series for Michael Phelps in all U.S.
    sources
  Uses the mapfiles generated by map-reduce jobs.
  The server is not currently distributed (but different
  depositories can be placed on different machines).
  Provides a caching subsystem to reduce the
  number of HDFS accesses.
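  Each lookup can be answered straight from the map-reduce output.
  A minimal sketch using Hadoop's MapFile reader (the artifact name,
  part file, and key are assumptions for illustration; a real server
  must also pick the part file matching the key's partition):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.io.MapFile;
      import org.apache.hadoop.io.Text;

      public class DepositoryLookup {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              // One date range = one group of mapfiles; open one part.
              MapFile.Reader reader = new MapFile.Reader(fs,
                  "/dailies/ENTITY_FREQUENCY/2004_11_01-2009_03_31/part-00000",
                  conf);
              Text value = new Text();
              // get() fills in value and returns it, or null if absent.
              if (reader.get(new Text("Barack Obama"), value) != null) {
                  System.out.println(value);
              }
              reader.close();
          }
      }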
Artifact Date Range Merging

   The depository server combines results from
   multiple groups of mapfiles on the fly.
   (MR output = date range = mapfile group)
      This may cause performance problems and memory
      pressure (direct memory buffers).
     Solution: limit the number of covering date ranges
     to be O(log N) after N daily updates.
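   One standard way to get the O(log N) bound (a sketch of the
   geometric-merging idea, as in an LSM-tree or a binary counter;
   our exact policy may differ): whenever the newest range matches
   an existing range in size, merge them with a map-reduce job.

      import java.util.*;

      public class RangeMerger {
          // Sizes, in days, of the covering ranges, oldest first.
          private final Deque<Integer> ranges = new ArrayDeque<>();

          // Add one daily update, then merge adjacent equal-sized
          // ranges, like incrementing a binary counter. After N
          // updates at most O(log N) ranges remain.
          public void addDay() {
              int size = 1;
              while (!ranges.isEmpty() && ranges.peekLast() == size) {
                  size += ranges.removeLast();  // merge = one MR job
              }
              ranges.addLast(size);
          }

          public List<Integer> covering() {
              return new ArrayList<>(ranges);
          }
      }

   After 7 daily updates the covering ranges have sizes 4, 2, and 1,
   so a lookup touches only three mapfile groups instead of seven.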
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Conclusion

  Using Hadoop brought a great improvement (up to
  20x) in Lydia's performance and scalability.
  Lydia with Hadoop makes new types of automated
  analysis of web-scale content possible.
