Big Data/DIG
Domain-Specific Insight Graphs
Pedro Szekely
University of Southern California
www.isi.edu/~szekely
Connecting The Dots
Using the Web To Solve Hard Problems
Hard Problems
State of the Art
Our Solution
Impact
Hard Problems
Healthcare
Research investment
Human trafficking
…
Human Trafficking
Illegal drugs
Arms trafficking
Human trafficking
Illegal Industries
$32 billion
profit per year
14
Average Age of Entry To Prostitution in the US
$150,000
PIMP’s Profit Per Child Per Year
$45,000,000
Advertising Budget On the Web
Human Trafficking on the Web
Thousands of Web sites
Millions of pages
Hard Problems
State of the Art
Our Solution
Impact
Google Finds “DOTS”
Recipe
“Dot”
Nutrition
“Dot”
Google finds dots
User finds connections
System Objectives
1.  find all the dots
2.  find all the connections
Hard Problems
State of the Art
Our Solution
Impact
1.  Downloads all relevant pages
2.  Extracts & cleans the data
3.  Discovers connections
4.  Builds unified database
5.  Creates query & analysis portal
1.  Go to Web site
2.  Download page
3.  Follow links
4.  Wait, then repeat
24/7
Web Crawling Software
2,000 Pages/Hour -- 50,000,000 pages Total
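The four crawling steps above can be sketched as a small breadth-first loop. This is a minimal illustration, not DIG's crawler: the `crawl` function, the `LinkParser` class, the injected `fetch` callable, and the in-memory `SITE` are all hypothetical names chosen for the example. The default delay of 1.8 seconds is simply 3,600 seconds divided by the slide's 2,000 pages per hour, assuming that rate applies to a single crawl loop.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, delay_seconds=1.8, max_pages=100):
    """Breadth-first crawl of one site; returns {url: html}.

    `fetch` is any callable mapping a URL to HTML text (or None on
    failure); it is injected so the loop can run without a network.
    """
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()            # 1. go to the Web site
        html = fetch(url)                # 2. download the page
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:        # 3. follow links (same site only)
            link = urljoin(url, href)
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay_seconds)        # 4. wait, then repeat
    return pages


# Hypothetical three-page site held in memory for demonstration.
SITE = {
    "http://example.test/": '<a href="/a">a</a> <a href="http://other.test/x">x</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b">b</a>',
    "http://example.test/b": "",
}
pages = crawl("http://example.test/", SITE.get, delay_seconds=0)
print(sorted(pages))
```

Passing `fetch` as a parameter keeps the loop testable; a real deployment would supply a function that performs HTTP requests and honors each site's crawling rules.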
Data Extraction
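“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish,Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”
name: Kim
eye-color: green
hair-color: black
phone: 707-727-7477
rate: $60/15min $80/30min $120/60min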
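Slide 21 of the transcript shows the ad's phone number written as 7○7~7two7~7four77 and extracted as 707-727-7477. The deck builds its extractors with machine learning; the rule-based sketch below is a swapped-in, much simpler technique that only illustrates what de-obfuscating that one field involves. The function name `normalize_phone` and both lookup tables are assumptions made for this example.

```python
import re
from typing import Optional

# Spelled-out digits and look-alike symbols; an illustrative mapping,
# not the rule set of any real extractor.
WORD_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
LOOKALIKES = {"○": "0", "o": "0"}

_WORD_PATTERN = re.compile("|".join(WORD_DIGITS))


def normalize_phone(token: str) -> Optional[str]:
    """Turn an obfuscated 10-digit US phone token into NNN-NNN-NNNN.

    Returns None when the token does not reduce to exactly 10 digits.
    """
    text = token.lower()
    # Replace digit words first so the "o" in "two"/"four" is consumed
    # before the look-alike table maps a bare "o" to 0.
    text = _WORD_PATTERN.sub(lambda m: WORD_DIGITS[m.group()], text)
    for symbol, digit in LOOKALIKES.items():
        text = text.replace(symbol, digit)
    digits = re.sub(r"\D", "", text)
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


print(normalize_phone("7○7~7two7~7four77"))
```

The function is meant to run on a candidate phone token, not on a whole ad, since mapping every "o" to 0 would be wrong for ordinary words.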
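Crowd-Sourced Annotations
“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish,Armenian and Filipino mixed princess :)”
Green  O eye color  O hair color
black  O eye color  O hair color
2 cents/sentence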
Automatic Construction of Extractors
5,000 annotations
Machine Learning
Ready-to-use Extraction Software
$100, 1 day
Technology: Conditional Random Fields
Data Cleaning
Ad  Weight
1   130
2   480
3   133lbs
4   BBW
5   52 kg
6   110 pounds

Ad  Weight (kg)
1   59
2
3   60
4
5   52
6   50
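The before/after table suggests three cleaning rules: values without a unit are read as pounds, values in pounds are converted to kilograms, and text or implausible numbers are left empty. The sketch below reproduces the slide's six rows under those inferred rules. The function name `clean_weight` and the plausible range of 30 to 200 kg are assumptions made for this example, not values stated in the deck.

```python
import re
from typing import Optional

LB_TO_KG = 0.45359237            # exact definition of the pound in kg
PLAUSIBLE_KG = (30.0, 200.0)     # assumed sanity range, not from the slides


def clean_weight(raw: str) -> Optional[int]:
    """Normalize a free-text weight to whole kilograms, or None."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-z]*)\.?\s*", raw.lower())
    if not match:
        return None                       # e.g. "BBW" has no number
    value, unit = float(match.group(1)), match.group(2)
    if unit in ("kg", "kgs", "kilos"):
        kg = value
    elif unit in ("", "lb", "lbs", "pound", "pounds"):
        kg = value * LB_TO_KG             # bare numbers assumed to be pounds
    else:
        return None                       # unknown unit
    low, high = PLAUSIBLE_KG
    return round(kg) if low <= kg <= high else None


raw_weights = ["130", "480", "133lbs", "BBW", "52 kg", "110 pounds"]
print([clean_weight(w) for w in raw_weights])
# → [59, None, 60, None, 52, 50]
```

Returning `None` instead of guessing keeps bad values out of the unified database, which matches the empty cells for ads 2 and 4.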
Using Extracted Data to Connect the Dots
Mary Lucy
222-0000 777-0000
Police Database
Bad Guy: 777-0000
Technology: Karma Information Integration Toolkit
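Once phone numbers are extracted and cleaned, connecting an ad to an external record reduces to a join on the shared value. The sketch below uses plain Python dictionaries instead of the toolkit named on the slide. It reads the flattened diagram as Mary/222-0000 and Lucy/777-0000, which is an assumption about the slide layout, and `connect_by_phone` is a hypothetical helper.

```python
from collections import defaultdict

# Assumed pairing of names and numbers from the slide's diagram.
ads = [
    {"name": "Mary", "phone": "222-0000"},
    {"name": "Lucy", "phone": "777-0000"},
]
police_records = [{"person": "Bad Guy", "phone": "777-0000"}]


def connect_by_phone(ads, records):
    """Return (person, ad name, phone) for every shared phone number."""
    names_by_phone = defaultdict(list)
    for ad in ads:
        names_by_phone[ad["phone"]].append(ad["name"])
    return [
        (record["person"], name, record["phone"])
        for record in records
        for name in names_by_phone.get(record["phone"], [])
    ]


print(connect_by_phone(ads, police_records))
```

The join only works if both sides store phone numbers in the same normalized form, which is why extraction and cleaning come first in the pipeline.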
Using Text Similarity to Connect the Dots
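E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S
L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
Technology: MinHash/LSH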
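The transcript names MinHash/LSH as the technology for this step (slide 26). The sketch below covers only the MinHash half: it estimates the Jaccard similarity of two ads from short signatures, after stripping the spacing, casing, and punctuation tricks visible in the example ads. The shingle size, the number of hash functions, the SHA-1 based hashing, and all function names are choices made for this illustration, and the LSH banding step that makes the search scale is omitted.

```python
import hashlib
import re

NUM_HASHES = 128     # signature length (assumed)
SHINGLE_SIZE = 4     # characters per shingle (assumed)


def shingles(text, k=SHINGLE_SIZE):
    """Character k-grams of the text with case, spaces, and symbols removed."""
    normalized = re.sub(r"[^a-z0-9]", "", text.lower())
    return {normalized[i:i + k] for i in range(max(len(normalized) - k + 1, 1))}


def _hash(seed, shingle):
    """One member of a family of hash functions, selected by `seed`."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(text_a, text_b):
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)


AD_1 = ("E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. "
        "LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S")
AD_3 = ("L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. "
        "LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S")
UNRELATED = "Fresh tomato soup recipe with basil and cream"

similar = estimated_jaccard(minhash_signature(AD_1), minhash_signature(AD_3))
different = estimated_jaccard(minhash_signature(AD_1), minhash_signature(UNRELATED))
print(f"ad 1 vs ad 3: {similar:.2f}, ad 1 vs unrelated text: {different:.2f}")
```

Comparing fixed-length signatures instead of full shingle sets is what lets the approach extend to millions of ads once an LSH index groups likely matches.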
Using Image Similarity to Connect the Dots
20 Million Images Technology: Deep Learning
Create Unified Database
50 Million Ads
Technologies: Karma, Hadoop, Hive, Elasticsearch
20 Computers, 2 Hours
4 Billion Records
Hard Problems
State of the Art
Our Solution
Impact
Deployed to Law Enforcement and NGOs
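Organizations
University of Southern California
Columbia University
InferLink
NASA JPL
Next Century
Researchers
Pedro Szekely (PI)
Shih-Fu Chang
Tao Chen
Kevin Knight
Craig Knoblock
Daniel Marcu
Chris Mattmann
Steve Minton
Prem Natarajan
Andrew Philpot
Mike Tamayo
Engineers
Brian Amanatullah
Rachel Artiss
David Flynt
Dipsy Kapoor
Students
Jason Slepicka
Amandeep Singh
Chengye Yin
Subessware Karunamoorthy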
Big Data/DIG: Domain-Specific Insight Graphs by Pedro Szekely of ISI/USC

Domain-specific Insight Graph (DIG) is a technology that harvests and harmonizes millions of Web pages to extract key elements of knowledge (e.g., entities and relations). It integrates corporate databases with the extracted data across sources and modalities encoding implicit and purposefully obfuscated relationships. It offers a faceted content search interface and visualizations to support analysis.

Published in: Technology

  1. 1. Big Data/DIG Domain-Specific Insight Graphs Pedro Szekely University of Southern California www.isi.edu/~szekely
  2. 2. Connecting The Dots Using the Web To Solve Hard Problems
  3. 3. Hard Problems State of the Art Our Solution Impact
  4. 4. Hard Problems Healthcare Research investment Human trafficking …
  5. 5. Human Trafficking
  6. 6. Illegal drugs Arms trafficking Human trafficking Illegal Industries
  7. 7. $32 billion profit per year
  8. 8. 14 Average Age of Entry To Prostitution in the US
  9. 9. $150,000 PIMP’s Profit Per Child Per Year
  10. 10. $45,000,000 Advertising Budget On the Web
  11. 11. Human Trafficking on the Web Thousands of Web sites Millions of pages
  12. 12. Hard Problems State of the Art Our Solution Impact
  13. 13. Google Finds “DOTS”
  14. 14. Recipe “Dot”
  15. 15. Nutrition “Dot”
  16. 16. Google finds dots User finds connections
  17. 17. System Objectives 1.  find all the dots 2.  find all the connections
  18. 18. Hard Problems State of the Art Our Solution Impact
  19. 19. 1.  Downloads all relevant pages 2.  Extracts & cleans the data 3.  Discovers connections 4.  Builds unified database 5.  Creates query & analysis portal
  20. 20. 1.  Go to Web site 2.  Download page 3.  Follow links 4.  Wait, then repeat 24/7 Web Crawling Software 2,000 Pages/Hour -- 50,000,000 pages Total
  21. 21. Data Extraction “YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish,Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses” name: Kim eye-color: green hair-color: black phone: 707-727-7477 rate: $60/15min $80/30min $120/60min
  22. 22. Crowd-SourcED Annotations “YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish,Armenian and Filipino mixed princess :) Green O eye color O hair color black O eye color O hair color 2 cents/sentence
  23. 23. Automatic Construction of Extractors 5,000 annotations Machine Learning Ready-to-use Extraction Software $100, 1 day Technology: Conditional Random Fields
  24. 24. Data Cleaning AD Weight 1  130 2  480 3  133lbs 4  BBW 5  52 kg 6  110 pounds AD Weight (Kg) 1  59 2  3  60 4  5  52 6  50
  25. 25. Using Extracted Data to Connect the Dots Mary Lucy 222-0000 777-0000 Police Database Bad Guy: 777-0000 Technology: Karma Information Integration Toolkit
  26. 26. Using Text Similarity to Connect the Dots E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S Technology: MinHash/LSH
  27. 27. Using Image Similarity to Connect the Dots 20 Million Images Technology: Deep Learning
  28. 28. Create Unified Database 50 Million Ads Technologies: Karma, Hadoop, Hive, Elasticsearch 20 Computers, 2 Hours 4 Billion Records
  29. 29. Hard Problems State of the Art Our Solution Impact
  30. 30. Deployed to Law Enforcement and NGOs
  31. 31. Organizations: University of Southern California, Columbia University, InferLink, NASA JPL, Next Century. Researchers: Pedro Szekely (PI), Shih-Fu Chang, Tao Chen, Kevin Knight, Craig Knoblock, Daniel Marcu, Chris Mattmann, Steve Minton, Prem Natarajan, Andrew Philpot, Mike Tamayo. Engineers: Brian Amanatullah, Rachel Artiss, David Flynt, Dipsy Kapoor. Students: Jason Slepicka, Amandeep Singh, Chengye Yin, Subessware Karunamoorthy.
