Hw09 Understanding Natural Language


Published in: Technology, News & Politics

  1. News and Blog Analysis with Lydia. Charles Ward (Stony Brook University); Karthik Balaji and Levon Lloyd (General Sentiment). October 2nd, 2009.
  2. Outline: Lydia System Overview; News Analysis Examples; Data and Workflow Organization; Data Access Interface; Conclusion.
  3. Large-Scale News/Blog Analysis: The Lydia news/blog analysis system performs a daily analysis of more than 1,000 English and foreign-language online newspapers, plus blogs and other text sources. We currently track tens of millions of named entities in news and blogs, providing spatial, temporal, relational, and sentiment analysis. Customers track entities of interest using reports generated in our user interface.
  4. Lydia Text Analysis Phases: Lydia performs named entity recognition and analysis over large text corpora.
     • Spidering: Lydia spiders and parses thousands of online news sources. We also handle the feed of social media provided by Spinn3r.
     • Named Entity Recognition: Lydia identifies and classifies occurrences of named entities (people, places, companies, etc.).
     • Sentiment Analysis: Lydia assigns sentiment scores to identified entities using shallow NLP techniques.
     • Entity Statistics Aggregation: Lydia digests marked-up text and produces usable entity statistics.
     • Data Exploration: Aggregated entity statistics are made available through user interfaces and programming APIs for detailed exploration of the data.
  5. Lydia Architecture
  6. Outline: Lydia System Overview; News Analysis Examples; Data and Workflow Organization; Data Access Interface; Conclusion.
  7. Frequency Time Series: Michael Vick references (2004-2009); Mel Gibson references (2004-2009).
  8. Sentiment Analysis: Michael Phelps sentiment score (June 2008-Feb 2009); David Paterson sentiment score (Jan 2008-Jul 2009).
  9. Comparative Analysis: Peyton Manning vs. Eli Manning.
  10. Heatmaps: Arnold Schwarzenegger; Alabama.
  11. Ethnic Biases in News Coverage: (left) Frequency of coverage of entities with Hispanic names in the news, 2004-2008. (right) Percentage of population self-reporting as Hispanic in the 2000 U.S. census. Courtesy of Wikipedia.
  12. Ethnic Biases in News Coverage: (a) African; (b) Hispanic; (c) East Asian; (d) Indian; (e) Eastern European; (f) Muslim.
  13. Juxtaposition Analysis: Top juxtapositions for Barack Obama; juxtapositions between Barack Obama and John McCain.
  14. Outline: Lydia System Overview; News Analysis Examples; Data and Workflow Organization; Data Access Interface; Conclusion.
  15. Hadoop in Lydia: The legacy Perl NLP pipeline runs in parallel on Hadoop Streaming, generating articles with marked-up entities, which are stored as compressed XML in HDFS. Building or updating the Lydia entity statistics and indexes for a single text corpus takes over 80 map-reduce jobs. We have developed a custom workflow management framework on Amazon EC2 to manage our data and processing.
  16. Lydia Workflow Framework: High-level concepts:
     • A depository is a statistics dataset derived from a text corpus. It consists of artifacts and is stored as a directory structure in HDFS.
     • An artifact is a homogeneous dataset of a specific type. Examples: key-value artifacts (e.g. entity name -> frequency time series) and Lucene index artifacts (entity and article indexes).
     • An artifact is stored as a directory in HDFS containing several map-reduce job output subdirectories named as date ranges (we do updates at a daily granularity).
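
To make the "entity name -> frequency time series" key-value artifact concrete, here is a minimal in-memory sketch of a map step and a reduce step that could produce it. The article record layout (`date`, `entities`) and the function names are assumptions made for illustration; the real Lydia jobs run on Hadoop over marked-up XML and are not described in the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical article record: {"date": "YYYY-MM-DD", "entities": [names...]}
Article = Dict[str, object]


def map_article(article: Article) -> Iterator[Tuple[Tuple[str, str], int]]:
    """Map step: emit ((entity, date), 1) for every entity mention."""
    for entity in article["entities"]:
        yield (entity, article["date"]), 1


def reduce_counts(
    pairs: Iterable[Tuple[Tuple[str, str], int]]
) -> Dict[str, Dict[str, int]]:
    """Reduce step: sum counts per (entity, date), grouped by entity."""
    series: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for (entity, day), count in pairs:
        series[entity][day] += count
    # Convert nested defaultdicts to plain dicts for a clean result.
    return {entity: dict(days) for entity, days in series.items()}


def frequency_time_series(articles: List[Article]) -> Dict[str, Dict[str, int]]:
    """Run the map and reduce steps in memory over a list of articles."""
    pairs = (pair for article in articles for pair in map_article(article))
    return reduce_counts(pairs)
```

Each entity maps to a per-day count, which is the shape a daily frequency time series needs; in a real job the grouping by key would be done by the map-reduce shuffle rather than by a dictionary.
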
  17. Artifact Dependencies: Most Lydia artifacts are derived from other artifacts.
  18. Artifact Storage: Lydia artifacts are stored in HDFS inside the depository directory:
        /dailies                        (depository name)
          /EXACT_DUP_ARTICLES           (artifact name)
            /2004_11_01-2009_03_31      (date range-named MR output)
              /part-00000
              ...
              /part-00017
            /2009_04_01-2009_04_02
              ...
            /2009_04_03. . .
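
The date-range naming convention in this layout can be parsed mechanically. The sketch below assumes names are either `YYYY_MM_DD-YYYY_MM_DD` or a single `YYYY_MM_DD`; treating a single date as a one-day range is a guess, since the listing on the slide is truncated at that point. `parse_range` is a hypothetical helper, not part of Lydia.

```python
from datetime import date, datetime
from typing import Tuple

DATE_FORMAT = "%Y_%m_%d"


def parse_range(dirname: str) -> Tuple[date, date]:
    """Parse a date range-named output directory into (start, end), inclusive."""
    parts = dirname.split("-")
    if len(parts) == 1:
        # Assumption: a single date names a one-day range.
        parts = parts * 2
    if len(parts) != 2:
        raise ValueError(f"not a date range name: {dirname!r}")
    start, end = (datetime.strptime(p, DATE_FORMAT).date() for p in parts)
    if start > end:
        raise ValueError(f"range ends before it starts: {dirname!r}")
    return start, end
```

For example, `parse_range("2004_11_01-2009_03_31")` yields the pair of dates 2004-11-01 and 2009-03-31.
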
  19. Job Input Selection: Artifact updates are incrementally propagated through the dependency graph. Multiple date ranges (sometimes overlapping) typically exist for each artifact. Some small artifacts get fully rebuilt on every update.
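
The slides do not say how input date ranges are chosen when several, possibly overlapping, ranges exist. One plausible rule is a greedy interval cover: starting from the first day still needed, repeatedly take the available range that contains that day and extends furthest. The sketch below illustrates that rule only; `select_inputs` and the `DateRange` alias are hypothetical names.

```python
from datetime import date, timedelta
from typing import List, Tuple

DateRange = Tuple[date, date]  # inclusive (start, end)


def select_inputs(available: List[DateRange], target: DateRange) -> List[DateRange]:
    """Pick a small set of available ranges that together cover `target`."""
    need_from, need_to = target
    chosen: List[DateRange] = []
    while need_from <= need_to:
        # Ranges that contain the first day not yet covered.
        candidates = [r for r in available if r[0] <= need_from <= r[1]]
        if not candidates:
            raise ValueError(f"no input range covers {need_from}")
        best = max(candidates, key=lambda r: r[1])  # extends furthest
        chosen.append(best)
        need_from = best[1] + timedelta(days=1)
    return chosen
```

Because the chosen ranges may overlap, a consumer that sums counts would still need to avoid counting the overlapping days twice; that concern is outside this sketch.
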
  20. Depository Build Scheduling:
     • The same tool is used for the initial depository build and for updating it with new data.
     • Any set of target artifacts to build can be specified, similarly to a makefile.
     • Prerequisites of the targets are automatically identified.
     • Artifacts are built in the correct order according to dependencies.
     • The build process runs as a sequence of Hadoop map-reduce jobs and occasional serial jobs.
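
Makefile-style scheduling of this kind amounts to collecting the targets' transitive prerequisites and ordering them topologically. A minimal sketch using Python's standard `graphlib` follows. The dependency graph is invented for illustration: only `EXACT_DUP_ARTICLES` appears in the slides, and the other artifact names and `build_order` are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set

# Hypothetical graph: artifact -> artifacts it is derived from.
# Only EXACT_DUP_ARTICLES is named in the source; the rest are made up.
DEPENDS_ON: Dict[str, Set[str]] = {
    "ARTICLES": set(),
    "EXACT_DUP_ARTICLES": {"ARTICLES"},
    "ENTITY_FREQUENCIES": {"ARTICLES", "EXACT_DUP_ARTICLES"},
    "ENTITY_INDEX": {"ENTITY_FREQUENCIES"},
    "ARTICLE_INDEX": {"ARTICLES"},
}


def build_order(targets: Iterable[str], depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return the targets and their prerequisites, prerequisites first."""
    needed: Set[str] = set()
    stack = list(targets)
    while stack:  # collect transitive prerequisites
        artifact = stack.pop()
        if artifact not in needed:
            needed.add(artifact)
            stack.extend(depends_on[artifact])
    subgraph = {artifact: depends_on[artifact] for artifact in needed}
    # static_order() raises graphlib.CycleError if dependencies are cyclic.
    return list(TopologicalSorter(subgraph).static_order())
```

Requesting `ENTITY_INDEX` schedules only the four artifacts it depends on and leaves `ARTICLE_INDEX` out, in the same way `make` builds only what a target needs.
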
  21. Amazon EC2:
     • We run Hadoop on Amazon EC2, which lets us quickly scale capacity as requirements change.
     • 10 extra-large nodes handle weekly data processing.
     • Amazon S3 is our persistent data store.
     • All our web services are hosted on dedicated Amazon nodes.
     • S3 is not meeting our required level of service, so we are moving to EBS.
  22. Outline: Lydia System Overview; News Analysis Examples; Data and Workflow Organization; Data Access Interface; Conclusion.
  23. Depository Server: Provides random access to the Lydia depository, e.g.:
     • Monthly frequency time series of Barack Obama in all U.S. sources
     • Top juxtapositions for Continental Airlines in February 2009
     • Sentiment time series for Michael Phelps in all U.S. sources
     The server uses the mapfiles generated by map-reduce jobs. It is currently not distributed (but we can put different depositories on different machines). It provides a caching subsystem to reduce the number of HDFS accesses.
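
The slides do not describe the caching subsystem's policy. As one common possibility, the sketch below puts a least-recently-used (LRU) cache in front of a slow lookup function, so repeated queries for the same key skip the expensive read. `LRUCache` and the `fetch` callback are hypothetical stand-ins for a mapfile lookup in HDFS.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LRUCache:
    """Cache results of a slow lookup, evicting the least recently used key."""

    def __init__(self, fetch: Callable[[Hashable], Any], capacity: int = 1024):
        self._fetch = fetch        # slow lookup, e.g. a read from HDFS
        self._capacity = capacity
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.misses = 0            # number of times the slow lookup ran

    def get(self, key: Hashable) -> Any:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop least recently used
        return value
```

With a capacity sized to the working set of popular entities, most requests would be served from memory and only cold keys would reach the underlying store.
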
  24. Artifact Date Range Merging: The depository server combines results from multiple groups of mapfiles on the fly (MR output = date range = mapfile group). With many date ranges this can cause performance problems and memory shortages (direct memory buffers). Solution: limit the number of covering date ranges to O(log N) after N daily updates.
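
The slides do not state how the O(log N) bound is achieved. One well-known scheme that gives it is binary-counter merging: after each daily update, merge the two newest ranges whenever they span the same number of days. The sketch below models ranges as inclusive pairs of integer day indices; `add_daily_update` is a hypothetical name, and in a real system each merge would be a job that combines two map-reduce outputs.

```python
from typing import List, Tuple

DayRange = Tuple[int, int]  # inclusive (first_day, last_day) as day indices


def span(r: DayRange) -> int:
    """Number of days covered by a range."""
    return r[1] - r[0] + 1


def add_daily_update(ranges: List[DayRange], day: int) -> List[DayRange]:
    """Append one day, then merge equal-sized neighbours (newest first)."""
    ranges.append((day, day))
    while len(ranges) >= 2 and span(ranges[-1]) == span(ranges[-2]):
        newest = ranges.pop()
        older = ranges.pop()
        ranges.append((older[0], newest[1]))
    return ranges


if __name__ == "__main__":
    ranges: List[DayRange] = []
    for day in range(1, 8):
        add_daily_update(ranges, day)
    print(ranges)  # [(1, 4), (5, 6), (7, 7)]
```

Range sizes are distinct powers of two, so after N updates the number of ranges equals the number of 1-bits in N, which is at most log2(N) + 1. After 1000 daily updates, for instance, six ranges remain.
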
  25. Outline: Lydia System Overview; News Analysis Examples; Data and Workflow Organization; Data Access Interface; Conclusion.
  26. Conclusion: Using Hadoop greatly improved Lydia's performance and scalability (up to 20x). Lydia with Hadoop makes new types of automated analysis of web-scale content possible.