Big Data/DIG
Domain-Specific Insight Graphs
Pedro Szekely
University of Southern California
www.isi.edu/~szekely
Connecting The Dots
Using the Web To Solve Hard Problems
Hard Problems
State of the Art
Our Solution
Impact
Hard Problems
Healthcare
Research investment
Human trafficking
…
Human Trafficking
Illegal drugs
Arms trafficking
Human trafficking
Illegal Industries
$32 billion
profit per year
14
Average Age of Entry To Prostitution in the US
$150,000
Pimp’s Profit Per Child Per Year
$45,000,000
Advertising Budget On the Web
Human Trafficking on the Web
Thousands of Web sites
Millions of pages
State of the Art
Google Finds “Dots”
Recipe
“Dot”
Nutrition
“Dot”
Google finds dots
User finds connections
System Objectives
1.  find all the dots
2.  find all the connections
Our Solution
1.  Downloads all relevant pages
2.  Extracts & cleans the data
3.  Discovers connections
4.  Builds unified database
5.  Creates query & analysis portal
1.  Go to Web site
2.  Download page
3.  Follow links
4.  Wait, then repeat
24/7
Web Crawling Software
2,000 Pages/Hour -- 50,000,000 pages Total
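The four crawl steps above can be sketched as a breadth-first loop. This is a minimal illustration, not the project's crawling software: `crawl`, `LinkParser`, `max_pages`, and `delay_seconds` are hypothetical names, and `fetch` is an assumed caller-supplied function that returns a page's HTML or None.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100, delay_seconds=1.0):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()          # 1. go to the Web site
        html = fetch(url)              # 2. download the page
        if html is not None:
            pages[url] = html
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:  # 3. follow links
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        time.sleep(delay_seconds)      # 4. wait, then repeat
    return pages
```

Passing `fetch` in keeps the loop testable without network access; a real deployment would also need robots.txt handling, retries, and persistent storage, none of which are shown here.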
Data Extraction
“YOU don't wanna miss out on ME :)
Perfect lil booty Green eyes Long
curly black hair Im a Irish,Armenian
and Filipino mixed princess :) ❤ Kim
❤ 7○7~7two7~7four77 ❤ HH 80
roses ❤ Hour 120 roses ❤ 15 mins
60 roses”
name: Kim
eye-color: green
hair-color: black
phone: 707-727-7477
rate: $60/15min
$80/30min
$120/60min
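The phone field in the example above is obfuscated with spelled-out digits and look-alike characters. The sketch below shows one way to undo that with hand-written rules. It is an illustrative stand-in, not the learned extractors the project builds; `normalize_phone`, `find_phone`, and both lookup tables are hypothetical and far from exhaustive.

```python
import re

# Spelled-out digits and look-alike characters (illustrative, not exhaustive).
WORD_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
LOOKALIKES = {"○": "0", "O": "0", "o": "0"}
WORD_PATTERN = re.compile("|".join(WORD_DIGITS), re.IGNORECASE)


def normalize_phone(token):
    """Return 'NNN-NNN-NNNN' if the token hides a 10-digit number, else None."""
    # Replace number words first, so the 'o' in 'two' is not turned into a zero.
    text = WORD_PATTERN.sub(lambda m: WORD_DIGITS[m.group(0).lower()], token)
    for char, digit in LOOKALIKES.items():
        text = text.replace(char, digit)
    digits = re.sub(r"[^0-9]", "", text)
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def find_phone(ad_text):
    """Return the first whitespace-separated token that normalizes to a phone."""
    for token in ad_text.split():
        phone = normalize_phone(token)
        if phone is not None:
            return phone
    return None
```

For the token `7○7~7two7~7four77` this yields `707-727-7477`, matching the extracted record above. Numbers split across several tokens are not handled.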
Crowd-Sourced Annotations
“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes
Long curly black hair Im a Irish,Armenian and Filipino mixed princess :)
Green: ○ eye color ○ hair color
black: ○ eye color ○ hair color
2 cents/sentence
Automatic Construction of Extractors
5,000 annotations
Machine Learning
Ready-to-use Extraction Software
$100, 1 day
Technology: Conditional Random Fields
Data Cleaning
Ad  Weight
1   130
2   480
3   133lbs
4   BBW
5   52 kg
6   110 pounds
Ad  Weight (kg)
1   59
2
3   60
4
5   52
6   50
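The before and after tables suggest a few cleaning rules: convert pounds to kilograms, treat a bare number as pounds, and leave the field empty when the value is non-numeric or implausible. The sketch below encodes those rules as an assumption drawn from the six rows shown; `clean_weight` and its `min_kg`/`max_kg` bounds are hypothetical, not the project's actual cleaning code.

```python
import re

LB_TO_KG = 0.45359237
WEIGHT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]*)\s*$")


def clean_weight(raw, min_kg=35, max_kg=150):
    """Return the weight in whole kilograms, or None if it cannot be trusted."""
    match = WEIGHT_RE.match(raw)
    if not match:
        return None                      # e.g. "BBW" is not a number
    value = float(match.group(1))
    unit = match.group(2).lower()
    if unit in ("kg", "kgs"):
        kg = value
    elif unit in ("", "lb", "lbs", "pound", "pounds"):
        kg = value * LB_TO_KG            # assumption: bare numbers are pounds
    else:
        return None                      # unknown unit
    kg = round(kg)
    # Assumed plausibility window; values outside it are left empty.
    return kg if min_kg <= kg <= max_kg else None


raw_weights = ["130", "480", "133lbs", "BBW", "52 kg", "110 pounds"]
cleaned = [clean_weight(w) for w in raw_weights]
```

With the six raw values from the slide, `cleaned` is `[59, None, 60, None, 52, 50]`, matching the cleaned table.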
Using Extracted Data to Connect the Dots
Mary: 222-0000
Lucy: 777-0000
Police Database
Bad Guy: 777-0000
Technology: Karma Information Integration Toolkit
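Once phone numbers are extracted and normalized, connecting an ad to an outside record is a join on the shared value. The sketch below uses a plain dictionary lookup rather than Karma, and reads the slide as Mary listing 222-0000 and Lucy listing 777-0000. `link_by_phone` and the record field names are hypothetical.

```python
from collections import defaultdict


def link_by_phone(ads, police_records):
    """Return (ad name, police label, phone) for every shared phone number."""
    by_phone = defaultdict(list)
    for record in police_records:
        by_phone[record["phone"]].append(record["label"])
    links = []
    for ad in ads:
        # .get avoids creating empty entries for phones with no police record.
        for label in by_phone.get(ad["phone"], []):
            links.append((ad["name"], label, ad["phone"]))
    return links


ads = [
    {"name": "Mary", "phone": "222-0000"},
    {"name": "Lucy", "phone": "777-0000"},
]
police = [{"label": "Bad Guy", "phone": "777-0000"}]
links = link_by_phone(ads, police)
```

Here `links` contains the single connection between Lucy's ad and the police record, through 777-0000.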
Using Text Similarity to Connect the Dots
E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S
L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
Technology: MinHash/LSH
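The three ads above differ only in the name and in spacing and capitalization tricks, which is the case MinHash with locality-sensitive hashing is designed for. The sketch below is a small pure-Python illustration of the technique, not the project's implementation. The normalization step, the shingle size `k=5`, the 128 hash functions, and the 32 bands are all assumptions chosen for the example, and the input text is assumed to hold at least `k` letters or digits.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Character k-grams after dropping case, spacing, and punctuation."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    return {cleaned[i:i + k] for i in range(len(cleaned) - k + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_hashes=128):
    """For each hash function, keep the minimum hash over the text's shingles."""
    grams = shingles(text)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidates(signatures, bands=32):
    """Pairs of document indexes whose signatures agree on at least one band."""
    rows = len(signatures[0]) // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))
    return candidates
```

Banding means only documents that land in the same bucket are compared, so near-duplicates can be found without scoring every pair of ads.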
Using Image Similarity to Connect the Dots
20 Million Images
Technology: Deep Learning
Create Unified Database
50 Million Ads
Technologies: Karma, Hadoop, Hive, Elasticsearch
20 Computers, 2 Hours
4 Billion Records
Impact
Deployed to Law Enforcement and NGOs
Organizations
University of Southern California
Columbia University
InferLink
NASA JPL
Next Century
Researchers
Pedro Szekely (PI)
Shih-Fu Chang
Tao Chen
Kevin Knight
Craig Knoblock
Daniel Marcu
Chris Mattmann
Steve Minton
Prem Natarajan
Andrew Philpot
Mike Tamayo
Engineers
Brian Amanatullah
Rachel Artiss
David Flynt
Dipsy Kapoor
Students
Jason Slepicka
Amandeep Singh
Chengye Yin
Subessware Karunamoorthy

Big Data/DIG: Domain-Specific Insight Graphs by Pedro Szekely of ISI/USC