Structured Data in Web Search

For the first time since the emergence of the Web, structured data is playing a key role in search engines and is therefore being collected via a concerted effort. Much of this data is being extracted from the Web, which contains vast quantities of structured data on a variety of domains, such as hobbies, products and reference data. Moreover, the Web provides a platform that encourages publishing more data sets from governments and other public organizations. The Web also supports new data management opportunities, such as effective crisis response, data journalism and crowd-sourcing data sets.

I will describe some of the efforts we are conducting at Google to collect structured data, filter the high-quality content, and serve it to our users. These efforts include providing Google Fusion Tables, a service for easily ingesting, visualizing and integrating data; mining the Web for high-quality HTML tables; and contributing these data assets to Google's other services.
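
As a rough illustration of what "mining the Web for high-quality HTML tables" can involve, the sketch below pulls every `<table>` out of a page and keeps only those that look like relational data rather than page layout. It uses only Python's standard `html.parser`. The `TableExtractor` class, the `looks_relational` function, and its thresholds are hypothetical names and heuristics chosen for this example; they are not the WebTables method, and the sketch assumes reasonably well-formed markup without nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows of cell strings (no nesting support)."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # table currently being read
        self._row = None    # row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = {"rows": [], "header_cells": 0}
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            if tag == "th":
                self._table["header_cells"] += 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so cells compare cleanly.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self._table["rows"].append(self._row)
            self._row, self._cell = None, None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table, self._row, self._cell = None, None, None


def looks_relational(table, min_rows=3, min_cols=2):
    """Assumed heuristic: enough rows and columns, a rectangular shape, and a header."""
    rows = table["rows"]
    if len(rows) < min_rows or len(rows[0]) < min_cols:
        return False
    if any(len(row) != len(rows[0]) for row in rows):
        return False
    return table["header_cells"] > 0


page = """
<table><tr><td>site navigation</td></tr></table>
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>Brazil</td><td>Brasilia</td></tr>
  <tr><td>France</td><td>Paris</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(page)
good_tables = [t["rows"] for t in parser.tables if looks_relational(t)]
print(good_tables)
```

The first table in the sample page is a one-cell layout table and is filtered out; only the country/capital table survives. A production system would likely need a much richer notion of quality than these three checks.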

Alon Halevy heads the Structured Data Management Research group at Google. Prior to that, he was a professor of Computer Science at the University of Washington in Seattle, where he founded the database group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space. In 2004, he founded Transformic, a company that created search engines for the deep web and was acquired by Google. Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). He received his Ph.D. in Computer Science from Stanford University in 1993 and his Bachelor's degree from the Hebrew University in Jerusalem. Halevy is also a coffee culturalist: he is the author of the book "The Infinite Emotions of Coffee" (2011) and a co-author of the book "Principles of Data Integration" (2012).

Transcript of "Structured Data in Web Search"

  1. 1. Structured Data on the Web. Alon Halevy, Google, May 23, 2014. Joint work with: Jayant Madhavan, Cong Yu, Fei Wu, Hongrae Lee, Warren Shen, Anish Das Sarma, Rahul Gupta, Boulos Harb, Zack Ives, Afshin Rostamizadeh, Sree Balakrishnan, Anno Langen, Steven Whang, Mohamed Yahya, and others
  2. 2. Structured Data in Search Results
  3. 3. Set Queries Chicago restaurants
  4. 4. Association Queries
  5. 5. Data in Movies!
  6. 6. The Knowledge Graph [diagram labels: Knowledge Graph; Brazil; Brasilia]
  7. 7. Query Reformulation [diagram labels: Knowledge Graph; Brazil; Brasilia; Brazil capital; What is the capital of Brazil; “Google, tell me the capital of brazil” → Brazil nuts → Culture of Brazil → “Google, will Brazil win the world cup?”]
  8. 8. Other Sources of Data [diagram labels: Knowledge Graph; Brazil; Brasilia; Brazil capital; "The population of Brasilia is 2207718 according to the GeoNames geographical database"; Tables; Text]
  9. 9. Answer Queries Directly from Web? [diagram labels: Brazil capital; "The population of Brasilia is 2207718 according to the GeoNames geographical database"; Tables; Text; Knowledge Graph; Brazil; Brasilia]
  10. 10. The Web vs. the Knowledge Graph
  11. 11. Tables, Tables [diagram labels: Brazil capital; "The population of Brasilia is 2207718 according to the GeoNames geographical database"; Tables; Text; Knowledge Graph; Brazil; Brasilia]
      • Fusion Tables: Enabling a broad range of users to create tabular content
      • WebTables: Finding good HTML tables on the Web
  12. 12.
      • City planning
      • Sustainability: water, coffee, …
      • Crisis response
      • Advancing public discourse (e.g., gun control)
      • Data philanthropy – corporations encouraged to contribute data to the good of society.
  13. 13. Background for Coffee Examples
  14. 14. Fusion Tables, google.com/fusiontables [SIGMOD 2010, SIGMOD 2012]
      • Goal: an easy-to-use database system that is integrated with the Web.
      • Key: support common workflows
        – Easy upload (CSV, KML, spreadsheets)
        – Sharing (even outside your company)
        – Visualizations front and center
        – Easy publishing
      • Goal 2: Fusion in the data cloud -- discover others’ data and combine with yours.
  15. 15. Coffee Producing Countries
  16. 16. Coffee Consumption Per Capita
  17. 17. Big Data for Regular People. Table Facts: English poverty rates: 32,000 wards with a total of 1.8 million vertices; colors indicate poverty levels. 2011 Rioting: 2100 incidents; colors indicate addresses of Rioting and Rioters. Best UK Internet Journalist; Knight-Batten Award for Innovations in Journalism.
  18. 18. Crowd Sourcing
  19. 19. Data Integration as Search
  20. 20. Join with Population Data: What is a City?
  21. 21. Big Data Integration. Table Facts: Texas Counties 2010 Census: 254 counties with 543000 vertices, colored based on various demographics. See SIGMOD 2012 paper for details on scaling map visualizations.
  22. 22. Crowdsourcing Cafes
  23. 23. HTML Tables
  24. 24. Search Engine for Data Sets research.google.com/tables [VLDB 2008, 2011, 2014]
  25. 25. Give Answers from Tables
  26. 26. It Better Be Right!
  27. 27. Answer with a Visualization
  28. 28. Long Term Goal: A Data-Guided Decision Engine
      • Support decision making:
        – Healthcare debate
        – Should I install solar in my house?
        – Which charity should I contribute to?
      • Show relevant data
        – Expose facets of the decision and enable drilldown
        – Show opposing views
      • Manually curated examples of decision engines:
        – Justfacts.com, followthemoney.com, decide.com
  29. 29. WebTables on google.com!
  30. 30. HTML Lists See Elmeleegy et al., VLDB 2009
  31. 31. Tree Search; Amish quilts; Parking tickets in India; Horses; The Deep Web [Madhavan et al., VLDB 2008]
  32. 32. Other Sources of Data
      • Spreadsheets
      • CSV files
      • Tables embedded in PDF
      • XML, RDF
      • Visualizations
      • Online databases (Fusion Tables, Tableau, …)
      Each source has its particularities, but most problems are common to all.
  33. 33. Non-Tabular Data in HTML
  34. 34. Vertical Tables
  35. 35. Data Optimized for Page Layout
  36. 36. Tabular Data Optimized for Site Layout See [Ling et al, IJCAI 2013] for stitching tables within a site.
  37. 37. Semantics Can Be Brittle
  38. 38. Semantics are in Text
  39. 39. The Big Challenge
      • Analyze natural language text as it pertains to structured data.
      • Different from (open) information extraction that builds databases entirely from text.
      • Good news: natural language parsing technology is now scalable.
  40. 40. First Step: Annotating Columns [Venetis et al., VLDB 2011]
  41. 41. Step 2: Understanding Relationships
  42. 42. Dictionary of Attributes
      • I want the list of all attributes that countries may have.
      • Freebase doesn’t have coffee production.
      • Is this an ontology?
        – Not quite! I want an ontology suited for search.
  43. 43. Biperpedia: [VLDB 2014] Ontology for Search Applications
  44. 44. Comparing to Freebase Coverage
  45. 45. Tower of Babel: Internet Style [diagram labels: "In 2013, the coffee production of El Salvador dropped by 20% due to the coffee rust disease."; Coffee production el salvador 2013; El Salvador exports coffee 2013; Knowledge Graph; Tables; Text]
  46. 46. Conclusions
      • This was a talk about Big Data:
        – Millions of people creating data sets
        – Billions of people seeing the data being impacted
      • Get out there and find your favorite application.
      • Dreams do come true:
        – At least as it pertains to structured data on the Web!
  47. 47. References
      • Fusion Tables: SIGMOD 2010, 2012
      • WebTables: VLDB 2008, 2009, 2011
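
Slide 40 names column annotation as the first step toward understanding a table. One simple way to picture the idea is to look up each cell in a dictionary of known "isA" facts and let the cells vote on a class label for the column. The sketch below does exactly that. The `ISA` dictionary, the `annotate_column` function, and the `min_support` threshold are hypothetical and made up for this example; majority voting is a simplification used here for illustration and should not be read as the method of the cited paper.

```python
from collections import Counter

# Hypothetical toy isA dictionary: normalized cell value -> candidate class labels.
ISA = {
    "brazil": {"country", "coffee producer"},
    "el salvador": {"country", "coffee producer"},
    "france": {"country"},
    "brasilia": {"city", "capital"},
    "chicago": {"city"},
}


def annotate_column(cells, isa=ISA, min_support=0.5):
    """Return the label supported by the most cells, or None if support is too low."""
    votes = Counter()
    for cell in cells:
        for label in isa.get(cell.strip().lower(), ()):
            votes[label] += 1
    if not votes:
        return None
    # Sort first so that ties are broken alphabetically and the result is deterministic.
    label, count = max(sorted(votes.items()), key=lambda item: item[1])
    return label if count / len(cells) >= min_support else None


print(annotate_column(["Brazil", "El Salvador", "France"]))
print(annotate_column(["Brasilia", "Chicago"]))
print(annotate_column(["unknown value", "another one"]))
```

The threshold keeps a single recognizable cell from labeling an otherwise unknown column. Real Web tables are far noisier than this toy input, which is why a plain vote is only a starting point.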