News and Blog Analysis
      with Lydia
 Charles Ward – Stony Brook University
 Karthik Balaji, Levon Lloyd – General Sent...
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   ...
Large-Scale News/Blog Analysis
  The Lydia news/blog analysis system does a daily
  analysis of more than 1,000 English and for...
Lydia Text Analysis Phases
  Lydia performs named entity recognition and analysis over large
  text corpora.
      Spideri...
Lydia Architecture
Frequency Time Series
  Michael Vick references (2004-2009)




  Mel Gibson references (2004-2009)
Sentiment Analysis
  Michael Phelps sentiment score (June 2008-Feb
  2009)




  David Paterson sentiment score (Jan 2008-...
Comparative Analysis
  Peyton Manning vs. Eli Manning
Heatmaps




 Arnold Schwarzenegger   Alabama
Ethnic Biases in News Coverage




 Frequency of coverage of entities Percentage of population self-
    with Hispanic nam...
Ethnic Biases in News Coverage


  (a) African
  (b) Hispanic
  (c) East Asian
  (d) Indian
  (e) Eastern
  European
  (f)...
Juxtaposition Analysis
   Top Juxtapositions for Barack Obama




   Juxtapositions between Barack Obama and John McCain
Hadoop in Lydia
  The legacy Perl NLP pipeline runs in parallel on
  Hadoop Streaming, generating articles with marked-up
...
Lydia Workflow Framework
 High-level concepts:
   A depository is a statistics dataset derived from a
   text corpus. It c...
Artifact Dependencies
 Most Lydia artifacts are derived from other artifacts:
Artifact Storage

 Lydia artifacts are stored in HDFS inside the
   depository directory:
   /dailies (depository name)
  ...
Job Input Selection
   Artifact updates are incrementally propagated through
   the dependency graph:




   Multiple date...
Depository Build Scheduling
   The same tool is used for the initial depository build
   and for updating it with new data...
Amazon EC2
  We run Hadoop on Amazon EC2.
  – Quickly scale capacity as requirements change.
  10 extra large nodes for we...
Depository Server
  Random access to the Lydia depository, e.g.:
    Monthly frequency time series of Barack Obama in all
...
Artifact Date Range Merging

   The depository server combines results from
   multiple groups of mapfiles on the fly.
   ...
Conclusion

  Great improvement (up to 20x) in the
  Lydia system performance and
  scalability from using Hadoop.
  Lydia...
Hw09 Understanding Natural Language

Published in: Technology, News & Politics
  1. News and Blog Analysis with Lydia. Charles Ward – Stony Brook University; Karthik Balaji, Levon Lloyd – General Sentiment. October 2nd, 2009
  2. Outline: Lydia System Overview; News Analysis Examples; Data and Workflow Organization; Data Access Interface; Conclusion
  3. Large-Scale News/Blog Analysis: The Lydia news/blog analysis system performs a daily analysis of more than 1,000 English and foreign-language online newspapers, plus blogs and other text sources. We currently track tens of millions of named entities in the news and blogs, providing spatial, temporal, relational, and sentiment analysis. Customers track entities of interest using reports generated in our user interface.
  4. Lydia Text Analysis Phases: Lydia performs named entity recognition and analysis over large text corpora. Spidering: Lydia spiders and parses thousands of online news sources. We also handle the feed of social media provided by Spinn3r. Named Entity Recognition: Lydia identifies and classifies occurrences of named entities (people, places, companies, etc.). Sentiment Analysis: Lydia assigns sentiment scores to identified entities using shallow NLP techniques. Entity Statistics Aggregation: Lydia digests marked-up text and produces usable entity statistics. Data Exploration: Aggregated entity statistics are made available through user interfaces and programming APIs for detailed exploration of the data.
  5. Lydia Architecture
  6. Outline: Lydia System Overview; News Analysis Examples; Data and Workflow Organization; Data Access Interface; Conclusion
  7. Frequency Time Series: Michael Vick references (2004-2009); Mel Gibson references (2004-2009)
  8. Sentiment Analysis: Michael Phelps sentiment score (June 2008-Feb 2009); David Paterson sentiment score (Jan 2008-Jul 2009)
  9. Comparative Analysis: Peyton Manning vs. Eli Manning
  10. Heatmaps: Arnold Schwarzenegger; Alabama
  11. Ethnic Biases in News Coverage: Frequency of coverage of entities with Hispanic names in the news, 2004-2008. Percentage of population self-reporting as Hispanic in the 2000 U.S. census (courtesy of Wikipedia).
  12. Ethnic Biases in News Coverage: (a) African (b) Hispanic (c) East Asian (d) Indian (e) Eastern European (f) Muslim
  13. Juxtaposition Analysis: Top Juxtapositions for Barack Obama; Juxtapositions between Barack Obama and John McCain
  14. Outline: Lydia System Overview; News Analysis Examples; Data and Workflow Organization; Data Access Interface; Conclusion
  15. Hadoop in Lydia: The legacy Perl NLP pipeline runs in parallel on Hadoop Streaming, generating articles with marked-up entities, which are stored as compressed XML in HDFS. Building or updating Lydia entity statistics and indexes for a single text corpus requires over 80 map-reduce jobs. We have developed a custom workflow management framework in Amazon EC2 to manage our data and processing.
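
To make the Hadoop Streaming step on slide 15 concrete, here is a minimal Python sketch of a streaming-style map and reduce that counts entity mentions per day. It is not the Lydia pipeline (which the slide says is Perl). The `<ent type="...">` markup, the `date<TAB>text` record layout, and the `name|date` key format are all assumptions made for illustration, since the slides do not show the real XML schema. The only Hadoop Streaming convention relied on is that mappers and reducers exchange tab-separated key/value lines.

```python
import re
from collections import Counter

# Hypothetical entity markup; the slides do not show Lydia's real XML schema.
ENTITY_RE = re.compile(r'<ent type="[A-Z]+">(.*?)</ent>')


def map_line(line):
    """Turn one 'date<TAB>marked-up text' record into (key, 1) pairs."""
    day, _, text = line.rstrip("\n").partition("\t")
    return [(f"{name}|{day}", 1) for name in ENTITY_RE.findall(text)]


def format_pairs(pairs):
    """Render pairs as the key<TAB>value lines that Hadoop Streaming exchanges."""
    return [f"{key}\t{value}" for key, value in pairs]


def reduce_pairs(pairs):
    """Sum the counts per key, as a streaming reducer would."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

In a real streaming job the map function would be driven by lines read from standard input and the formatted pairs written to standard output; that wiring is omitted here to keep the sketch importable.
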
  16. Lydia Workflow Framework: High-level concepts: A depository is a statistics dataset derived from a text corpus. It consists of artifacts and is stored as a directory structure in HDFS. An artifact is a homogeneous dataset of a specific type. Examples: key-value artifacts (e.g. entity name -> frequency time series) and Lucene index artifacts (entity and article indexes). An artifact is stored as a directory in HDFS containing several map-reduce job output subdirectories named as date ranges (we do updates at a daily granularity).
  17. Artifact Dependencies: Most Lydia artifacts are derived from other artifacts.
  18. Artifact Storage: Lydia artifacts are stored in HDFS inside the depository directory:
        /dailies (depository name)
          /EXACT_DUP_ARTICLES (artifact name)
            /2004_11_01-2009_03_31 (date range-named MR output)
              /part-00000 ... /part-00017
            /2009_04_01-2009_04_02 ...
            /2009_04_03 ...
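
The date-range directory names on slide 18 can be turned into dates with a few lines of Python. This is a sketch under assumptions: the `YYYY_MM_DD-YYYY_MM_DD` pattern is taken from the slide, but treating a bare `YYYY_MM_DD` name as a one-day range is a guess (the slide elides what follows `2009_04_03`), and the helper names `parse_day`, `parse_range`, and `artifact_path` are hypothetical.

```python
from datetime import date, datetime


def parse_day(text):
    """Parse a 'YYYY_MM_DD' token into a date."""
    return datetime.strptime(text, "%Y_%m_%d").date()


def parse_range(dirname):
    """Parse a date-range directory name into an inclusive (start, end) pair."""
    start_text, sep, end_text = dirname.partition("-")
    start = parse_day(start_text)
    # Assumption: a bare 'YYYY_MM_DD' name denotes a one-day range.
    end = parse_day(end_text) if sep else start
    if end < start:
        raise ValueError(f"range ends before it starts: {dirname}")
    return start, end


def artifact_path(depository, artifact, dirname):
    """Join the three levels of the layout shown on the slide."""
    return f"/{depository}/{artifact}/{dirname}"
```

For example, `parse_range("2004_11_01-2009_03_31")` yields the pair of dates 2004-11-01 and 2009-03-31.
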
  19. Job Input Selection: Artifact updates are incrementally propagated through the dependency graph. Multiple date ranges (sometimes overlapping) typically exist for each artifact. Some small artifacts get fully rebuilt on every update.
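
Slide 19 does not say how inputs are chosen for an incremental update, so the following Python sketch shows only one plausible rule: process every input date range that contains at least one day not yet covered by the output artifact. The function names and the day-set approach are assumptions for illustration, not Lydia's actual selection logic.

```python
from datetime import date, timedelta


def days_in(start, end):
    """Return the set of days in the inclusive range [start, end]."""
    return {start + timedelta(days=i) for i in range((end - start).days + 1)}


def covered_days(ranges):
    """Union of the days covered by possibly overlapping ranges."""
    covered = set()
    for start, end in ranges:
        covered |= days_in(start, end)
    return covered


def select_inputs(input_ranges, output_ranges):
    """Pick input ranges holding at least one day missing from the output."""
    missing = covered_days(input_ranges) - covered_days(output_ranges)
    return [r for r in input_ranges if days_in(*r) & missing]
```

With overlapping input ranges this rule may reprocess some days that are already covered; a production scheduler would need to decide how to handle that case.
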
  20. Depository Build Scheduling: The same tool is used for the initial depository build and for updating it with new data. Any set of target artifacts to build can be specified, similarly to a makefile. Prerequisites of the targets are automatically identified. Artifacts are built in the correct order according to dependencies. The build process runs as a sequence of Hadoop map-reduce jobs and occasional serial jobs.
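
The makefile-like behavior on slide 20 (find the prerequisites of the requested targets, then build in dependency order) can be sketched with Python's standard `graphlib` module. This is an illustrative sketch, not the Lydia build tool; the dependency map format and all artifact names other than `EXACT_DUP_ARTICLES` are hypothetical.

```python
from graphlib import TopologicalSorter


def build_order(dependencies, targets):
    """Return the targets and their transitive prerequisites, prerequisites first.

    `dependencies` maps an artifact name to the artifacts it is derived from.
    """
    # Collect everything reachable from the requested targets.
    needed = set()
    stack = list(targets)
    while stack:
        artifact = stack.pop()
        if artifact not in needed:
            needed.add(artifact)
            stack.extend(dependencies.get(artifact, ()))
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    graph = {a: set(dependencies.get(a, ())) for a in needed}
    return list(TopologicalSorter(graph).static_order())
```

Artifacts that no requested target depends on are left out of the plan, which matches the slide's point that any set of targets can be specified. `TopologicalSorter` raises `graphlib.CycleError` if the dependency map contains a cycle.
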
  21. Amazon EC2: We run Hadoop on Amazon EC2, which lets us quickly scale capacity as requirements change. We use 10 extra-large nodes for weekly data processing. Amazon S3 is our persistent data store. All our web services are hosted on dedicated Amazon nodes. S3 is not meeting our required level of service, so we are moving to EBS.
  22. Outline: Lydia System Overview; News Analysis Examples; Data and Workflow Organization; Data Access Interface; Conclusion
  23. Depository Server: Provides random access to the Lydia depository, e.g.: monthly frequency time series of Barack Obama in all U.S. sources; top juxtapositions for Continental Airlines in February 2009; sentiment time series for Michael Phelps in all U.S. sources. It uses the mapfiles generated by map-reduce jobs. It is currently not distributed (but we can put different depositories on different machines). It provides a caching subsystem to reduce the number of HDFS accesses.
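
Slide 23 mentions a caching subsystem but gives no details. As one possible shape, here is a small least-recently-used cache in Python that only calls a slow loader (standing in for an HDFS mapfile lookup) on a miss. The LRU eviction policy, the class name `LRUCache`, and the `loader` callback are assumptions; the slides do not describe the real cache design.

```python
from collections import OrderedDict


class LRUCache:
    """Keep the most recently used values in memory; load the rest on demand."""

    def __init__(self, capacity, loader):
        self._capacity = capacity
        self._loader = loader          # stand-in for a slow HDFS lookup
        self._items = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self._items:
            # Mark the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        self.misses += 1
        value = self._loader(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            # Evict the least recently used entry.
            self._items.popitem(last=False)
        return value
```

The `misses` counter gives a rough measure of how many backing-store accesses the cache did not avoid.
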
  24. Artifact Date Range Merging: The depository server combines results from multiple groups of mapfiles on the fly (MR output = date range = mapfile group). This may result in performance problems and memory shortage (direct memory buffers). Solution: limit the number of covering date ranges to O(log N) after N daily updates.
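
Slide 24 states the O(log N) goal without giving the merging rule. One well-known scheme that achieves it, offered here as an assumption rather than as Lydia's method, works like a binary counter: each daily update adds a one-day range, and the two newest ranges are merged whenever they span the same number of days. After N updates the number of ranges equals the number of 1 bits in N, which is at most log2(N) + 1. Days are plain integers in this sketch, and `add_daily_update` is a hypothetical name.

```python
def add_daily_update(ranges, day):
    """Append a one-day range, then merge equal-length neighbours.

    `ranges` is a list of inclusive (start_day, end_day) pairs, oldest first.
    Returns a new list; the input list is not modified.
    """
    ranges = ranges + [(day, day)]
    while len(ranges) >= 2:
        (s1, e1), (s2, e2) = ranges[-2], ranges[-1]
        if e1 - s1 != e2 - s2:
            break
        # In a real system this merge would combine two mapfile groups.
        ranges[-2:] = [(s1, e2)]
    return ranges
```

The trade-off is that some days trigger a cascade of merges, so the cost of an update varies even though the number of ranges the server must consult stays small.
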
  25. Outline: Lydia System Overview; News Analysis Examples; Data and Workflow Organization; Data Access Interface; Conclusion
  26. Conclusion: Using Hadoop greatly improved the Lydia system's performance and scalability (up to 20x). Lydia with Hadoop makes new types of automated analysis of web-scale content possible.