News and Blog Analysis with Lydia
Charles Ward – Stony Brook University
Karthik Balaji, Levon Lloyd – General Sentiment
October 2nd, 2009
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Large-Scale News/Blog Analysis
  The Lydia news/blog analysis system performs a daily
  analysis of more than 1,000 English and foreign-language
  online newspapers, plus blogs and other text sources.
  We currently track tens of millions of named entities in
  the news and blogs, providing spatial, temporal,
  relational, and sentiment analysis.
  Customers track entities of interest using reports
  generated in our user interface.
Lydia Text Analysis Phases
  Lydia performs named entity recognition and analysis over large
  text corpora.
      Spidering: Lydia spiders and parses thousands of online news
      sources. We also ingest the social-media feed provided by
      Spinn3r.
      Named Entity Recognition: Lydia identifies and classifies
      occurrences of named entities (people, places, companies,
      etc.).
      Sentiment Analysis: Lydia assigns sentiment scores to
      identified entities using shallow NLP techniques.
      Entity Statistics Aggregation: Lydia digests marked-up text
      and produces usable entity statistics.
      Data Exploration: Aggregated entity statistics are made
      available through user interfaces and programming APIs for
      detailed exploration of the data.
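  The phases hand off marked-up text between them. As a purely
  hypothetical illustration of what a marked-up article fragment
  might look like (the tag names here are invented, not Lydia's
  actual schema):

      <sentence>
        <entity type="person">Michael Phelps</entity> won his
        eighth gold medal in <entity type="place">Beijing</entity>.
      </sentence>

  The sentiment and aggregation phases consume fragments like this
  to produce per-entity statistics.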
Lydia Architecture
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Frequency Time Series
  Michael Vick references (2004-2009)
  Mel Gibson references (2004-2009)
Sentiment Analysis
  Michael Phelps sentiment score (June 2008-Feb 2009)
  David Paterson sentiment score (Jan 2008-Jul 2009)
Comparative Analysis
  Peyton Manning vs. Eli Manning
Heatmaps
  Arnold Schwarzenegger
  Alabama
Ethnic Biases in News Coverage
  Frequency of coverage of entities with Hispanic names in the
  U.S. news, 2004-2008
  Percentage of population self-reporting as Hispanic in the 2000
  census. Courtesy of Wikipedia.
Ethnic Biases in News Coverage
  (a) African
  (b) Hispanic
  (c) East Asian
  (d) Indian
  (e) Eastern European
  (f) Muslim
Juxtaposition Analysis
   Top Juxtapositions for Barack Obama
   Juxtapositions between Barack Obama and John McCain
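   A juxtaposition is, at its core, a co-occurrence of two entities
   in the same sentence or article. A minimal sketch of the counting
   step (a real scoring scheme would typically also weight pairs for
   statistical significance; that is omitted here):

      import java.util.*;

      public class JuxtapositionCounter {
          // How often each unordered pair of entities co-occurs.
          private final Map<String, Integer> pairCounts = new HashMap<>();

          // entities: the named entities recognized in one sentence.
          public void addSentence(List<String> entities) {
              for (int i = 0; i < entities.size(); i++) {
                  for (int j = i + 1; j < entities.size(); j++) {
                      pairCounts.merge(key(entities.get(i), entities.get(j)),
                                       1, Integer::sum);
                  }
              }
          }

          public int count(String a, String b) {
              return pairCounts.getOrDefault(key(a, b), 0);
          }

          // Order the pair so (A,B) and (B,A) share one key.
          private static String key(String a, String b) {
              return a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
          }
      }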
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Hadoop in Lydia
  The legacy Perl NLP pipeline runs in parallel on
  Hadoop Streaming, generating articles with marked-up
  entities which are stored as compressed XML in
  HDFS.
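  A rough sketch of how one Perl stage runs under Hadoop Streaming
  (script and path names here are illustrative, not our actual ones):

      hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
          -input /dailies/raw_articles \
          -output /dailies/marked_up_articles \
          -mapper ner_markup.pl \
          -file ner_markup.pl \
          -numReduceTasks 0

  The -file flag ships the script to the worker nodes, and zero
  reduce tasks makes this a map-only markup pass.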

  To build or update Lydia entity statistics and indexes
  for a single text corpus, over 80 map-reduce jobs are
  necessary.

  We have developed a custom workflow management
  framework in Amazon EC2 to manage our data and
  processing.
Lydia Workflow Framework
 High-level concepts:
   A depository is a statistics dataset derived from a
   text corpus. It consists of artifacts.
     Stored as a directory structure in HDFS
   An artifact is a homogeneous dataset of a specific
   type.
     Examples:
        Key-value artifacts, e.g. entity name -> frequency time
        series
        Lucene index artifacts (entity and article indexes)
      Stored as a directory in HDFS containing several
      map-reduce job output subdirectories named by date range
      (we do updates at daily granularity).
Artifact Dependencies
 Most Lydia artifacts are derived from other artifacts,
 forming a dependency graph.
Artifact Storage

 Lydia artifacts are stored in HDFS inside the
   depository directory:
   /dailies (depository name)
     /EXACT_DUP_ARTICLES (artifact name)
        /2004_11_01-2009_03_31 (date range-named MR output)
           /part-00000
           ...
           /part-00017
        /2009_04_01-2009_04_02
           ...
         /2009_04_03
         ...
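  A hypothetical sketch of resolving an artifact's date-range
  directories with the HDFS API (the paths follow the layout above;
  the code itself is illustrative):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ArtifactDirs {
          // Lists the date-range-named MR output dirs of one artifact.
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              Path artifact = new Path("/dailies/EXACT_DUP_ARTICLES");
              for (FileStatus s : fs.listStatus(artifact)) {
                  if (s.isDir()) {    // each subdirectory is one date range
                      System.out.println(s.getPath().getName());
                  }
              }
          }
      }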
Job Input Selection
   Artifact updates are incrementally propagated through
   the dependency graph.
   Multiple date ranges (sometimes overlapping) typically
   exist for each artifact.
   Some small artifacts are fully rebuilt on every update.
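   A minimal sketch of the selection rule, assuming an update job
   consumes every upstream date range that ends after the newest
   data already incorporated downstream (the real framework also
   reconciles overlapping ranges):

      import java.util.*;

      public class InputSelector {
          // Range directory names look like "2009_04_01-2009_04_02";
          // lexicographic order on them matches chronological order.
          public static List<String> selectInputs(List<String> upstream,
                                                  String lastBuiltEnd) {
              List<String> inputs = new ArrayList<>();
              for (String range : upstream) {
                  String end = range.substring(range.indexOf('-') + 1);
                  if (end.compareTo(lastBuiltEnd) > 0) {
                      inputs.add(range);  // newer than the downstream state
                  }
              }
              Collections.sort(inputs);
              return inputs;
          }
      }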
Depository Build Scheduling
   The same tool is used for the initial depository build
   and for updating it with new data.

   Any set of target artifacts to build can be specified,
   much as in a makefile; prerequisites of the targets are
   identified automatically.

   Artifacts are built in the correct order according to
   dependencies.

   The build process runs as a sequence of Hadoop
   map-reduce jobs and occasional serial jobs.
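   The ordering is a topological sort of the dependency graph. A
   sketch, assuming the graph is acyclic (the dependency map and
   artifact names are hypothetical):

      import java.util.*;

      public class BuildScheduler {
          // deps maps each artifact to the artifacts it derives from.
          public static List<String> buildOrder(Set<String> targets,
                                                Map<String, List<String>> deps) {
              List<String> order = new ArrayList<>();
              Set<String> seen = new HashSet<>();
              for (String t : targets) visit(t, deps, seen, order);
              return order;   // prerequisites precede their dependents
          }

          private static void visit(String a, Map<String, List<String>> deps,
                                    Set<String> seen, List<String> order) {
              if (!seen.add(a)) return;   // already scheduled
              for (String d : deps.getOrDefault(a, Collections.emptyList()))
                  visit(d, deps, seen, order);
              order.add(a);               // emitted after its prerequisites
          }
      }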
Amazon EC2
  We run Hadoop on Amazon EC2.
   – Quickly scale capacity as requirements change.
   – 10 extra-large nodes for weekly data processing.
  Amazon S3 is our persistent data store.
  All our web services are hosted on dedicated Amazon
  nodes.
  S3 is not meeting our required level of service.
   – Moving to EBS.
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Depository Server
  Random access to the Lydia depository, e.g.:
    Monthly frequency time series of Barack Obama in all
    U.S. sources
    Top juxtapositions for Continental Airlines in February
    2009
    Sentiment time series for Michael Phelps in all U.S.
    sources
  Uses the mapfiles generated by map-reduce jobs.
  The server is not currently distributed (but different
  depositories can be placed on different machines).
  Provides a caching subsystem to reduce the
  number of HDFS accesses.
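  Each lookup can be answered straight from the map-reduce output.
  A minimal sketch using Hadoop's MapFile reader (the artifact name,
  part file, and key are assumptions for illustration; a real server
  must also pick the part file matching the key's partition):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.io.MapFile;
      import org.apache.hadoop.io.Text;

      public class DepositoryLookup {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              // One date range = one group of mapfiles; open one part.
              MapFile.Reader reader = new MapFile.Reader(fs,
                  "/dailies/ENTITY_FREQUENCY/2004_11_01-2009_03_31/part-00000",
                  conf);
              Text value = new Text();
              // get() fills in value and returns it, or null if absent.
              if (reader.get(new Text("Barack Obama"), value) != null) {
                  System.out.println(value);
              }
              reader.close();
          }
      }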
Artifact Date Range Merging

   The depository server combines results from
   multiple groups of mapfiles on the fly.
   (MR output = date range = mapfile group)
      This may cause performance problems and memory
      pressure (direct memory buffers).
     Solution: limit the number of covering date ranges
     to be O(log N) after N daily updates.
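   One standard way to get the O(log N) bound (a sketch of the
   geometric-merging idea, as in an LSM-tree or a binary counter;
   our exact policy may differ): whenever the newest range matches
   an existing range in size, merge them with a map-reduce job.

      import java.util.*;

      public class RangeMerger {
          // Sizes, in days, of the covering ranges, oldest first.
          private final Deque<Integer> ranges = new ArrayDeque<>();

          // Add one daily update, then merge adjacent equal-sized
          // ranges, like incrementing a binary counter. After N
          // updates at most O(log N) ranges remain.
          public void addDay() {
              int size = 1;
              while (!ranges.isEmpty() && ranges.peekLast() == size) {
                  size += ranges.removeLast();  // merge = one MR job
              }
              ranges.addLast(size);
          }

          public List<Integer> covering() {
              return new ArrayList<>(ranges);
          }
      }

   After 7 daily updates the covering ranges have sizes 4, 2, and 1,
   so a lookup touches only three mapfile groups instead of seven.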
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Conclusion

  Using Hadoop brought a great improvement (up to
  20x) in Lydia's performance and scalability.
  Lydia with Hadoop makes new types of automated
  analysis of web-scale content possible.
