The document discusses analyzing unstructured data and extracting structured information from it. It provides examples of data along a continuum from completely unstructured (like images) to highly structured (like data in an Excel spreadsheet). It emphasizes that adding structure, such as defining attributes with simple data types, makes the data more useful for tasks like analysis and visualization. Finding the right structure to impose on unstructured data involves understanding what information you need from the data.
Slides from my lightning talk at the Boston Predictive Analytics Meetup hosted at Predictive Analytics World, Boston, October 1, 2012.
Full code and data are available on github: http://bit.ly/pawdata
DataTags: Sharing Privacy Sensitive Data, by Latanya Sweeney
The DataTags framework makes it easy for data producers to deposit, data publishers to store and distribute, and data users to access and use datasets containing confidential information, in a standardized and responsible way. The talk will first introduce the concepts and tools behind DataTags, and then focus on the user-facing component of the system, the Tagging Server (available today at datatags.org). We will conclude by describing how future versions of Dataverse will use DataTags to automatically handle sensitive datasets that can only be shared under certain restrictions.
Talk at a Data Journalism BootCamp organised by ICFJ, World Bank Group and African Media Initiative in New Delhi to a group of 60 journalists, coders and social sector folks. Other amazing sessions included those from Govind Ethiraj of IndiaSpend, Andrew from BBC, Parul from Google, Nasr from HacksHacker, Thej from DataMeet and David from Code for Africa. http://delhi.dbootcamp.org/
Imagine that you have to integrate and search data from 200 different sources, each of which uses a different structure. Your data may be incomplete, the same information may be represented in different ways by different sources, and it is often vague.
Natural Language Search with Knowledge Graphs (Haystack 2019), by Trey Grainger
To optimally interpret most natural language queries, it is necessary to understand the phrases, entities, commands, and relationships represented or implied within the search. Knowledge graphs serve as useful instantiations of ontologies which can help represent this kind of knowledge within a domain.
In this talk, we'll walk through techniques to build knowledge graphs automatically from your own domain-specific content, how you can update and edit the nodes and relationships, and how you can seamlessly integrate them into your search solution for enhanced query interpretation and semantic search. We'll have some fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "bbq near haystack" into
{
  "filter": ["doc_type:restaurant"],
  "query": {
    "boost": {
      "b": "recip(geodist(38.034780,-78.486790),1,1000,1000)",
      "query": "bbq OR barbeque OR barbecue"
    }
  }
}
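The query-rewriting step described above can be sketched in a few lines of Python. This is a toy illustration, not the talk's actual implementation: the gazetteers (`ENTITY_TYPES`, `SYNONYMS`, `PLACES`) are hard-coded stand-ins for lookups a real system would make against a knowledge graph or geocoder.

```python
import re

# Hypothetical gazetteers; a real system would back these with a knowledge graph.
ENTITY_TYPES = {"bbq": "restaurant"}            # assumed entity -> doc_type mapping
SYNONYMS = {"bbq": ["bbq", "barbeque", "barbecue"]}
PLACES = {"haystack": (38.034780, -78.486790)}  # assumed name -> lat/lon lookup

def interpret(query: str) -> dict:
    """Turn a query like 'bbq near haystack' into a structured search request."""
    m = re.match(r"(\w+)\s+near\s+(\w+)", query.lower())
    if not m:
        return {"query": query}  # fall back to the raw keyword query
    term, place = m.groups()
    lat, lon = PLACES[place]
    return {
        "filter": [f"doc_type:{ENTITY_TYPES[term]}"],
        "query": {
            "boost": {
                # boost documents near the resolved location
                "b": f"recip(geodist({lat:.6f},{lon:.6f}),1,1000,1000)",
                # expand the entity into its known synonyms
                "query": " OR ".join(SYNONYMS[term]),
            }
        },
    }
```

Calling `interpret("bbq near haystack")` produces a structured request of the shape shown above: the entity becomes a filter plus synonym expansion, and the place name becomes a geo-distance boost.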
We'll also specifically cover use of the Semantic Knowledge Graph, a particularly interesting knowledge graph implementation available within Apache Solr that can be auto-generated from your own domain-specific content and which provides highly-nuanced, contextual interpretation of all of the terms, phrases and entities within your domain. We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding within your search engine.
With the advent of Facebook’s Open Graph, HTML5 and Google’s Rich Snippets, the web has begun a rapid transformation to being understandable for computers. This understanding comes from data that is embedded in webpages and, perhaps more importantly, a new kind of hyperlink that connects concepts instead of documents. Information architects and interaction designers are needed now more than ever to make sense of all this data and to visualize it in new and interesting ways. In this presentation, you will learn how to take advantage of the Semantic Web’s foundational technology called Linked Data, which allows you to both produce and consume the data that is making up this new web.
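Producing Linked Data can be as simple as emitting JSON-LD, the structured markup behind Rich Snippets, and embedding it in a page. A minimal sketch follows; the schema.org vocabulary and DBpedia link are real, but the specific values are illustrative, not taken from the talk.

```python
import json

# Describe a *concept* (a place) rather than a document. The "sameAs" URL is
# the concept-level hyperlink the paragraph above describes: it links this
# record to the same entity in another dataset.
place = {
    "@context": "https://schema.org",
    "@type": "Place",
    "name": "Boston",
    "sameAs": "http://dbpedia.org/resource/Boston",  # link to a concept, not a page
}
jsonld = json.dumps(place, indent=2)
print(jsonld)
```

Wrapping this JSON in a `<script type="application/ld+json">` tag inside a web page is what lets crawlers consume it as data.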
Software engineering is inherently a collaborative venture, involving many stakeholders who coordinate their efforts to produce large software systems. While the importance of human aspects in software engineering was recognised as early as the 1970s, the emergence of open source software (late 1990s) and platforms such as Stack Overflow and GitHub (late 2000s) enabled the application of empirical methods to the study of human aspects of software engineering.
In the first part of the talk we present a selection of recent results pertaining to two main questions: who are software developers, and in what kinds of activities do they engage? The second part of the talk focuses on the tools and techniques that have been used to obtain the aforementioned results.
Construction of Authority Information for Personal Names Focused on the Forme...
Variant personal names exist for the same Japanese historical individuals, and when handling historical data it is desirable to control these. Furthermore, by grasping the position of the family an individual belongs to within a genealogy or organization, it is possible to estimate the individual's social position and the power he might command. At present there is no database providing such information in Japan, and there is a need to construct authority information for personal names structured in a standardized data description language. On this basis, the present study describes a project to construct authority information for the former Japanese noble families, which played a central role in the modernization of Japan, and for persons related to them, using topic maps.
Securing your Kubernetes cluster: a step-by-step guide to success!, by Katia Himeur
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
A tale of scale & speed: How the US Navy is enabling software delivery from l...
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GridMate - End to end testing is a critical piece to ensure quality and avoid...
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
The Art of the Pitch: WordPress Relationships and Sales, by Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that lead to closing the deal.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024, by Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 6, by Diana Gray
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, as a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024, by Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Communications Mining Series - Zero to Hero - Session 1, by Diana Gray
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features provide convenience and capability while sacrificing security. This best-practices guide outlines steps users can take to better protect personal devices and information.
5. 1056. Plaintiffs-Intervenors, Robert and Tasha Lambert, are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint, which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the class and subclasses as set forth in the schedules accompanying this complaint, which are incorporated herein by reference. 1058. Plaintiffs-Intervenors, Daniel and Nicole Smith, are citizens of Alabama and together own real property located at 766 Tabernacle Road, Monroeville, Alabama
http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
6. The same complaint excerpt, repeated.
http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
34. Continuum diagram: Images → Text Blob (unstructured → structured). The text blob is the complaint excerpt: "1056. Plaintiffs-Intervenors, Robert and Tasha Lambert, are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint, which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the..."
35. Continuum diagram: Images → Text Blob → Email (unstructured → structured).
36. Continuum diagram: Images → Text Blob → Email, with an email example:
Subject: Re: IRE conference in Boston
Date: June 1, 3:08 PM
From: jaimi@ire.org
37. Continuum diagram: Images → Text Blob → Email → Excel (unstructured → structured).
38. Continuum repeated: Images → Text Blob → Email → Excel.
39. Continuum, with a tweet example: "It's sunny in texas".
40. The tweet mapped to a table: Tweet = "It's sunny in texas", Weather = Sunny, Location = Texas.
41. The same table with the location resolved to coordinates: Tweet = "It's sunny in texas", Weather = Sunny, Location = (37.06, -95.67).
42. When: You have unstructured data
Ask: What structure do I need?
Find: Attributes with simple types
43. What Am I Talking About?
• Structured Data 101
• Structured data continuum
• More Examples
44. 2011 State of the Union
http://www.boston.com/news/politics/specials/obama_state_of_the_union_word_cloud/
46. Mr. Speaker, Mr. Vice President,
members of Congress,
distinguished guests, and fellow
Americans:
Tonight I want to begin by
congratulating the men and
women of the 112th Congress, as
well as your new Speaker, John
Boehner. And as we mark this
occasion, we're also mindful of
the empty chair in this chamber,
and we pray for the health of our
colleague -- and our friend --
Gabby Giffords.
It's no secret that those of us here
tonight have had our differences
over the last two years. The
debates have been contentious;
we have fought fiercely for our
beliefs. And that's a good thing.
47. Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans: ... (same speech text as slide 46, shown beside an extracted word column)
Word: Mr, Speaker, Vice, President, Members, Congress, Distinguished, Guests, Americans, People, Jobs, New, years
71. Structure = Super Valuable
When: You have unstructured data
Ask: What structure do I need?
Find: Attributes with simple types
72. Structure = Super Valuable (same summary)
tinyurl.com/iredatatipsheet
eugenewu@mit.edu
@sirrice
Editor's Notes
Hi, I'm Eugene Wu. I was asked to talk about unstructured data, and after some thought, I figured I'll..
Actually talk about structured data. In particular, I want you to walk away with three things: what structured data is and why you should care; how to think about structured data in contrast to unstructured data, specifically that data isn't just structured or unstructured; and finally a bunch of quick stories and visualizations of how authors went from unstructured to structured data. Let me start with an example before talking about what structure means.
Jeff Larson and Joaquin Sapien of ProPublica and Aaron Kessler of the Sarasota Herald-Tribune did a really nice data journalism piece on the impact of tainted drywall on home owners. A lot of homes were built using drywall from China that emitted foul odors, frequently caused mysterious electronics failures, and caused health problems in residents. They produced a really nice visualization of the counties affected by the tainted drywall, where darker blue means more tainted homes. Let's walk through how they went from unstructured data to this visualization.
They started with court documents from class action lawsuits and tax forms
And extracted the plain text. For example, this is a partial list of plaintiffs; there were about 2,000 in this document.
And they manually extracted the state and address information from the text.
They then geocoded the addresses to get latitude longitude information,
and finally the county that each house belongs in. Doing this process for nearly 7,000 addresses
reveals the number of tainted homes in each of the 150 counties. This table is imported into a visualization tool to construct…
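The tally behind that table can be sketched in a few lines. The addresses are taken from the complaint excerpt shown earlier, but the county labels are illustrative stand-ins for real geocoder output, not values from the actual analysis:

```python
from collections import Counter

# Each record pairs an address with the county a geocoding step returned
# for it. The last address and all county labels are invented for
# illustration; a real pipeline would geocode every address in the filing.
records = [
    ("541 Lynn Hurst Court, Montgomery, AL 36117", "Montgomery"),
    ("2105 Lane Avenue, Birmingham, AL 35217", "Jefferson"),
    ("766 Tabernacle Road, Monroeville, AL", "Monroe"),
    ("1000 Example Ave, Birmingham, AL", "Jefferson"),
]

# Tainted homes per county: the structured table the choropleth is built from.
counts = Counter(county for _addr, county in records)
```

The same `Counter` pattern scales from four rows to the 7,000 addresses mentioned above.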
The map that is shown on the ProPublica page. That was a fairly large number of steps.
If we take a quick look at their process, we can grossly simplify it down to the following steps: take text from the documents, specifically address information, and plot it on Google Maps.
And stepping back, to bring this into the context of this talk: they start with unstructured information, extract specific structured data, and visualize it.
What we’ll talk about in this talk is how to go from unstructured information to structured data.
But the first thing to do is to describe…
why the heck we care and what structured data is.
Who cares? Structured data makes your life easier in a number of ways. There's lots of software, like databases and pandas, to help you store and analyze structured data.
In a similar vein, practically all visualization tools expect your data in some kind of structured format.
It can easily take a long time to extract structured data from your documents. But now that you've got structured data about tainted homes in each county, it becomes easier to create mashups with other data. In contrast, there are not a lot of tools that work with unstructured data.
The canonical example of structured data is a table like this, which I'm sure you've seen either on the web in the wild, or on sites like Google Fusion Tables. What makes structured data... structured?
For practical purposes, think of structured data as a bunch of attributes. For example, each of these three columns is an attribute. Each attribute has a name and a data type.
Why are names important? Let's say you want to create that ProPublica map of each county.
If I just stored the data in a text file like the one on the right, Google Maps has no idea what it's trying to plot. I can't point a map at that text.
What I can say is "create a map and use county." Since the attribute has a name, the map can easily get the county names.
The data type embodies the "meaning" of the attribute. It says "what does this attribute represent?" The more specific you can be, the better.
If the data type is a number, then we can sort it, or take the sum or average. If we know it's a type of number (date/time), then we can use the hour or month data. Lat/lon can be plotted on a map. Non-numeric but still important are structured strings.
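As a minimal sketch of why types matter, here are hypothetical county rows (the numbers and dates are invented) where the number supports summing and the date exposes its month directly:

```python
from datetime import datetime

# Rows with typed attributes: a string, a number, and a date.
# County names, counts, and dates are all invented for illustration.
rows = [
    {"county": "Orange",  "tainted_homes": 12, "filed": datetime(2010, 3, 1)},
    {"county": "Broward", "tainted_homes": 30, "filed": datetime(2010, 4, 15)},
]

# Numbers can be summed or averaged...
total = sum(r["tainted_homes"] for r in rows)

# ...and dates expose parts like the month without any string parsing.
months = [r["filed"].month for r in rows]
```

None of these one-liners are possible if the same values live in an untyped text blob.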
Non-numeric but still important are structured strings. These are special because for any given thing like florida, there’s only one way to spell it.
This is important because something like Florida could be spelled in numerous ways, and the computer doesn't know how to reconcile the differences. If we wanted the total number of tainted walls in Florida, we would end up with a separate count for each spelling instead of one total.
Getting a program to extract Florida in a single unambiguous way is generally pretty hard, but it's important.
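One common approach, sketched here with an invented alias table, is to normalize each raw string to a canonical spelling before counting:

```python
import re

# Map known spelling variants to one canonical form. The alias table is
# illustrative, not exhaustive; a real pipeline would cover all states
# and their common abbreviations.
ALIASES = {"fl": "Florida", "fla": "Florida", "florida": "Florida"}

def canonical_state(raw):
    # Lowercase and strip everything but letters, then look up the alias.
    key = re.sub(r"[^a-z]", "", raw.lower())
    return ALIASES.get(key, raw.strip().title())

variants = ["Florida", "FLORIDA", " florida ", "Fla.", "FL"]
```

After normalization, grouping or counting by state treats every variant as the same value.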
Finally, they should be consistent, in the sense that each row in your table, or each document in your dataset, contains these attributes. Sometimes your structured data may not be in this kind of tabular format, but rather data attached to individual documents.
Hopefully I've convinced you that structured data is a good idea. Now I want to describe how structured data relates to unstructured data…
Specifically, that data isn't unstructured or structured. It all lies on a continuum. I want to give you examples that span this spectrum and what data we may want out of them.
The name of the game is moving towards the right,
Concretely, let’s say we have a bunch of tweets and we want to understand how the weather reported by the twitterverse differs across geographic areas.
We want to extract two pieces of structured data: Weather is a string containing "sunny", and Location is a string corresponding to the location. Or we could extract an even more specific data type
By using a geocoding app to turn the string "texas" into latitude/longitude coordinates.
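The tweet example might be sketched like this; the regex, the tiny gazetteer, and its coordinates are all illustrative assumptions standing in for a real geocoding service:

```python
import re

# A toy gazetteer standing in for a geocoder; the coordinates are
# illustrative only, not real geocoder output.
GAZETTEER = {"texas": (31.0, -100.0)}

def extract(tweet):
    # Pull a weather word and a place name out of tweets shaped like
    # "It's <weather> in <place>". Real tweets are much messier.
    m = re.search(r"it's (\w+) in (\w+)", tweet.lower())
    if m is None:
        return None
    weather, place = m.groups()
    return {"weather": weather, "location": place,
            "latlon": GAZETTEER.get(place)}

row = extract("It's sunny in texas")
```

The output is exactly the structured row from the slide: a weather string, a location string, and an optional lat/lon pair.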
I've summarized the process into something that helps calm my nerves: when I have unstructured data, I ask what structure I need. Is it dollars? Addresses? That helps target my search for finding attributes with simple types.
I figure it would be nice to end with more examples.
http://www.wordle.net/create
Last year, the Globe produced a word cloud of Obama's State of the Union speech.
An attribute that represents a single word in the speech. Perhaps with the punctuation removed
So we would start from the speech text and
Construct this single attribute table
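Building that single-attribute word table can be as simple as splitting on non-letter characters; this sketch uses the opening line of the speech quoted above:

```python
import re

speech = ("Mr. Speaker, Mr. Vice President, members of Congress, "
          "distinguished guests, and fellow Americans:")

# One row per word, punctuation stripped: the single-attribute table
# a word cloud is built from.
words = re.findall(r"[A-Za-z']+", speech)
```

Feeding the full speech through the same line yields the word table behind the Globe's visualization.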
Twitter released this graphic of the number of tweets per second referencing bin laden when he was captured earlier last year.
In this case tweets already contain the information we want – time.
Per capita availability of boneless, trimmed meat
We need to extract two pieces of info. Similar to the Iraq map, we need location information, but this time shapes of regions rather than single lat/lon coordinates. The nice part of this data is that it is often considered important, and can be found in a consistent location in the documents.
Another example is the Deadly Day in Baghdad visualization produced by Jacob Harris and others at the NYTimes, which depicts the distribution of deaths in Baghdad for a single day. The location of a circle is the lat/lon of where it happened; the size is how many people died.
This is an example of a WikiLeaks document the NYTimes had to work with.
KIA = killed in action. In this case, the NYTimes extracted the data by hand, and sometimes this may be necessary. But if the documents all looked like this (KIA at the top, WHERE:), it _may_ be possible to use pattern matching to extract this data.
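A sketch of what such pattern matching could look like, assuming an invented field layout rather than the actual wording of the leaked reports:

```python
import re

# A made-up report in a consistent template; the field names mimic the
# structure described above but are not the real document wording.
doc = """KIA
WHERE: 33.30 44.40
KILLED: 3"""

# One pattern per "FIELD: value" line turns the template into a dict.
fields = dict(re.findall(r"^(\w+):\s*(.+)$", doc, re.MULTILINE))
```

This only works because every report shares the template; one irregular document and the pattern silently misses fields, which is why hand extraction is sometimes unavoidable.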
Since much data about our lives is inexorably tied to where we live, we are often concerned with the regions in which we live. This visualization shows the number of households per 1,000 in regions throughout MA that have lived there for 3+ generations, as an indicator of commitment to the region.
We need to extract two pieces of info. Similar to the Iraq map, we need location information, but this time shapes of regions rather than single lat/lon coordinates.
In this case, we are starting with what looks like structured data, and further extracting information.
Person's name. Extracting this type of information is called entity extraction, where an entity may be a business name, famous person, etc. This is typically quite difficult, and requires an existing dataset of "important entities."
Finally, a popular analysis is to classify the unstructured documents, categorizing by topic or emotion. TwitInfo is a tool by marcua to analyze tweets about particular topics. One of its features is analyzing the sentiment of the tweets. Here are four example tweets from last year talking about the Christchurch earthquake. Blue = positive, red = negative. The pie chart shows that the tweets are overwhelmingly positive.
The structured data would then be happiness, and its type is a number between -1 and 1. There exist tools for specific types of analysis like sentiment or topic. However…
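For intuition only, here is a crude lexicon-based scorer clamped to [-1, 1]; real sentiment tools, including TwitInfo's, use far more sophisticated methods, and these word lists are invented:

```python
# Toy positive/negative word lists, invented for illustration.
POSITIVE = {"hope", "best", "safe", "love"}
NEGATIVE = {"terrible", "destroyed", "fear"}

def sentiment(text):
    # Count positive minus negative words, then clamp to [-1, 1] so the
    # output has the same shape as a real tool: one number per document.
    words = text.lower().split()
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return max(-1, min(1, score))
```

The point is the output shape: each unstructured document collapses to a single typed number you can average, chart, or slice.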
Be really careful with these types of automatic categorization tools
In all of the examples until that last one, what we've talked about amounted to pattern matching. This is really good: there are tons of tools that do a good job.
For example, the extracted sentiment of tweets about the New Zealand earthquake was really positive! This is surprising because earthquakes are generally considered not so good. It's because the tweets are all wishing the survivors the best, but these extractors don't understand that.
You can give your pile of documents to a thousand people who will extract the data you want quickly and cheaply. MTurk and CrowdFlower have more of an "anonymous workers" approach, where someone will do your work but you don't know who; oDesk is more like directly hiring a contractor. In both cases, you'll need to train the workers and deal with quality issues.
If you have a bunch of the same forms, handwritten or not, Captricity is a new startup that will take your forms, extract the parts you care about, and return a nice, structured table containing the data.
If you are looking for people or places, OpenCalais is a tool that automatically finds entities, e.g. that Mario Monti is Prime Minister of Italy.
But I’m going to give you a tip sheet later that also contains this and the other tools.
Just say the text!
Number of users, number of posts per day. Major posts that have been censored
Thankfully the journalism and media studies program
Shorter: Bo Xilai falls from power.
We extract information such as the ip address of the post, the post contents, the post date, the deletion date, the poster, and other information.
The most difficult is completely unstructured data. For example, handwritten letters, where we want the sender and recipient names.
Or a scanned typewritten letter, where we want company and date information.
Or text files like the ProPublica example, where we want state and address data.
A non-text example would be scanned forms. In this case, federal election contribution reports, where we want the committee name and donation amounts and dates.
Going towards the structured end, there is data that smells unstructured but actually contains some structured data. For example, a tweet I wrote about trends in the database community contains more than just the text.
In addition to the tweet text, which is unstructured, the Twitter API provides structured information: the timestamp of when the tweet was posted, my username, the number of retweets, etc. These are all valuable to analyze without needing to process the actual text.
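A sketch of reading those structured fields from a tweet payload; the JSON shape is trimmed and simplified (the real API returns many more fields and uses a different date format), and the values are invented:

```python
import json
from datetime import datetime

# A trimmed-down, invented tweet payload; it only illustrates that the
# metadata arrives already structured, separate from the free text.
payload = """{"text": "db trends: ...",
              "user": {"screen_name": "sirrice"},
              "retweet_count": 4,
              "created_at": "2012-06-12T00:16:00"}"""

tweet = json.loads(payload)
# The timestamp is a typed value we can slice by month, hour, etc.
posted = datetime.fromisoformat(tweet["created_at"])
```

Everything here is usable for analysis without ever parsing the tweet text itself.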
Similarly, emails contain structured data in the form of….
Subject, date, sender, and tons more information. Later, Sudheendra will describe his email analysis tool that extracts specific pieces of structured data and visualizes them.
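Python's standard email module can pull those headers out directly; the Subject/From values below come from the email example earlier in the deck, while the full Date header and the body line are filled in for illustration:

```python
from email import message_from_string

# A minimal raw message. Subject and From match the slide's example;
# the Date header and body are invented to make the message complete.
raw = """Subject: Re: IRE conference in Boston
From: jaimi@ire.org
Date: Fri, 1 Jun 2012 15:08:00 -0400

See you there!"""

msg = message_from_string(raw)
# Headers parse into named attributes without touching the body text.
subject, sender = msg["Subject"], msg["From"]
```

The structured headers fall out for free; only the body remains unstructured text.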
Working directly with unstructured data is really, really hard. Oftentimes this requires manually analyzing documents one by one.
I want to convince you that you can do a lot without messing too much with actual unstructured data.
Hello, my name is Eugene Wu. I'm actually a student right across the river at MIT. I study databases. This isn't part of my PhD, but what I'm interested in is how reporters are dealing with and analyzing data.
When I was asked to talk about analyzing unstructured data for stories, I had a hard time coming up with a talk. This is a fairly open-ended topic, and I could talk about data scraping, visualization, or extraction. The reason there are so many techniques is that dealing with unstructured data is very difficult, and computers are terrible at it.
I also didn't want to talk about a single tool, because tools are often used for specific types of data and analyses, and I was looking for something useful for a general audience. Then I thought, hey, I'm a database student, and we work with tables all the time!
The best ones are numerical data types. Computers are really, really good at processing numerical values. They can easily show you the sum or average, or look for trends. In fact, pretty much every visualization tool and analysis program will expect numerical data.
If you can specify the type of numeric data, even better. For example, with lat/lon you can plot it on a map.
Next are structured strings. These are words where the meaning is different if the values are different; that is, there's one way to say Florida: capitalized "Florida." This is important when you want to ask "what's the total number of addresses in Florida?"
Finally, there is random text. This is very akin to saying "this attribute is unstructured text." Computers are horrible with this type of data because it's so ambiguous.