Exploring our world with freebase

I gave this talk on Oct 2 at the Semantic Technology and Business conference. In this talk I discuss how I process Freebase data with the open source Infovore framework, which processes Freebase and other RDF data quickly by using Hadoop, Map/Reduce, and Amazon Web Services.

Exploring our world with freebase Presentation Transcript

  • 1. Exploring Our World With Freebase Paul Houle paul@ontology2.com
  • 2. Generic Databases
  • 3. Where does the data come from? Copyright 2009 CC-BY by Richard Heaven. Robot Image Copyright 2007 CC-BY by Crispin Summers
  • 4. Google Knowledge Graph
  • 5. The Wikipedia Data Ecosystem
  • 6. API RDF Dereferencing Quad Dump Simple Topic Dump Type Tables
  • 7. MQL { "status": "200 OK", "code": "/api/status/ok", "result": { "type": "/music/artist", "name": "The Police", "album": [ "Outlandos d'Amour", "Reggatta de Blanc", "Zenyatta Mondatta", "Ghost in the Machine", "Synchronicity" ] } }
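
The response on slide 7 can be unpacked with nothing more than a JSON parser. The sketch below is illustrative only: the `query` dictionary is an assumption about what request would produce this response (MQL is query-by-example, so an empty list asks for all values of a property), and `albums_from_response` is a hypothetical helper, not part of any Freebase client.

```python
import json

# Response envelope copied from the slide.
response_text = """
{
  "status": "200 OK",
  "code": "/api/status/ok",
  "result": {
    "type": "/music/artist",
    "name": "The Police",
    "album": [
      "Outlandos d'Amour",
      "Reggatta de Blanc",
      "Zenyatta Mondatta",
      "Ghost in the Machine",
      "Synchronicity"
    ]
  }
}
"""

# Assumed query shape: fill in what you know, leave an empty list
# where the service should supply the matching values.
query = {"type": "/music/artist", "name": "The Police", "album": []}


def albums_from_response(text):
    """Return the album list from an MQL response envelope, or raise."""
    envelope = json.loads(text)
    if envelope.get("code") != "/api/status/ok":
        raise ValueError("MQL request failed: %s" % envelope.get("status"))
    return envelope["result"]["album"]


albums = albums_from_response(response_text)
```

The query and the result share the same shape, which is what makes MQL easy to read: the response is the query with the blanks filled in.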
  • 8. My path to the semantic web
  • 9. My path to the semantic web
  • 10. My path to the semantic web
  • 11. Infovore 1 Quad Dump Simple Topic Dump :BaseKB Pro :BaseKB Lite
  • 12. Spring 2012
  • 13. Fall 2012 Quad Dump Official RDF Dump Infovore 1.0 released as open source under Apache License
  • 14. 13+ million Invalid Facts Image cc-by from arj03
  • 15. Infovore 1.0 Quad Dump -> RDF Infovore 1.1 General RDF Cleanup & Filtering Millipede framework – Map/Reduce on a single computer
  • 16. Infovore 2
  • 17. What does Freebase cover?
  • 18. Is it a bibliographic database?
  • 19. Ahead of their time? Reading Room, Library of Congress
  • 20. MARC… in electronic form since 1969! First standard data format with variable length fields & I18N.
  • 21. Now everybody has a bibliographic database…
  • 22. Or, do documents annotate the world?
  • 23. Social Semantic Systems Linked Data User-Generated Content
  • 24. The dominant paradigm Triple store
  • 25. How to break your triple store http://gen5.info/q/2009/02/25/putting-freebase-in-a-star-schema/
  • 26. The RDF data warehouse ETL warehouse operations development science
  • 27. The RDF data warehouse II warehouse Operations tools Science Tools
  • 28. Latency: low is not low enough
  • 29. operations development science
  • 30. Tools Popular With :BaseKB Users (bar chart, horizontal axis 0 to 60): Freebase, DBpedia, any relational database, machine learning, Jena, Amazon Web Services, PHP, map/reduce frameworks (ex. Hadoop), MongoDB, Sesame, Virtuoso OpenLink, other NoSQL database, Solid State Drives (SSD), other cloud computing service, Neo4J, Ruby, Drupal, alternative JVM languages (ex. Scala or Clojure), other triple store, any key/value store (ex. JDBM or Berkeley DB), OWLIM, Allegrograph, 4store, Factual, dotNetRDF, Stardog, Kasabi/Talis Platform, Oracle Spatial RDF
  • 31. Map/Reduce Inputs Mappers Shuffle Sort Reducers Output
  • 32. RDF: Reduction on Subject :Goat :Bear :Alligator :Iguana :Dog :Elephant :Cat :Horse :Fox :Alligator :Dog :Goat :Bear :Elephant :Horse :Cat :Fox :Iguana
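
The map, shuffle/sort, and reduce stages from slides 31 and 32 can be imitated in a few lines on a single machine. This is a toy sketch, not Infovore or Hadoop code: the sample triples and the predicates `:eats` and `:hasLegs` are invented for illustration, and the reducer simply counts facts per subject.

```python
from collections import defaultdict


def map_triple(line):
    """Mapper: emit (subject, (predicate, object)) for one simple triple line."""
    subject, predicate, obj = line.split(None, 2)
    obj = obj.strip()
    if obj.endswith("."):
        obj = obj[:-1].rstrip()  # drop the statement terminator
    return subject, (predicate, obj)


def shuffle(pairs):
    """Shuffle and sort: bring all values for the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_subject(subject, facts):
    """Reducer: sees every fact about one subject at once."""
    return subject, len(facts)


# Invented sample data; subjects arrive in arbitrary order.
lines = [
    ":Goat :eats :Grass .",
    ":Bear :eats :Fish .",
    ":Goat :hasLegs 4 .",
    ":Dog :eats :Meat .",
    ":Bear :hasLegs 4 .",
]

grouped = shuffle(map_triple(line) for line in lines)
results = [reduce_subject(s, facts) for s, facts in grouped.items()]
```

Because the key is the subject, each reducer call receives the complete description of one topic, which is the property the rest of the pipeline relies on.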
  • 33. Jena Framework SDB Relational db-based Triple store TDB Native disk-based triple store Model In-memory triple store “We use Jena Models like PHP programmers use hashtables” -- Kendall Clark, Clark and Parsia
  • 34. Hadoop Physical Architecture Namenode Jobtracker Datanodes & Tasktrackers HDFS
  • 35. My development cluster – Namenode/JobTracker
  • 36. Hadoop tolerates Hardware failures
  • 37. My other computer is
  • 38. Amazon Elastic Map/Reduce Amazon S3 (Permanent Storage)
  • 39. “It’s harder to make up names for things than to invent them” - Tom Swift Fictional American Inventor
  • 40. Infovore modules bakemono haruhi centipede chopper
  • 41. Bakemono Super JAR
  • 42. Bakemono Super JAR Contains applications like freebaseRDFPrefilter pse3 ranSample sieve3 Named after Japanese word for “monsters”
  • 43. “Haruhi” (1) Japanese religious word for “Full of Spirit” ; (2) a very dominant person
  • 44. Unpacking the Freebase RDF Dump photograph Copyright 2010 Ian Munroe CC-BY SA
  • 45. Eliminate Bulk Up Front BIG DATA
  • 46. Eliminate Bulk Up Front DATA
  • 47. Inputs Mappers
  • 48. freebaseRDFPrefilter removes… Wasteful Facts: 120M+ copies of the “a” predicate; 60M+ access control predicates. Violent and Dangerous Facts: ns:common.topic ns:type.type.instance ?o . is repeated 30M times, and if you group on ?s and keep them in memory…
  • 49. … uneven bin distribution … 330 331 332 333 334 335 … …
  • 50. Prefiltering stops memory exhaustion before it happens!
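
The idea behind prefiltering can be sketched as a streaming filter that drops skew-inducing predicates before any grouping happens. This is a minimal illustration and not the actual freebaseRDFPrefilter code: `ns:type.type.instance` comes from slide 48, while `ns:example.access_control` is a made-up placeholder because the slides do not name the access control predicates.

```python
# Predicates to drop before the shuffle.
DROPPED_PREDICATES = {
    "ns:type.type.instance",       # named on slide 48
    "ns:example.access_control",   # hypothetical placeholder name
}


def prefilter(lines):
    """Yield only the lines whose predicate is not in the drop list.

    Works one line at a time, so nothing is held in memory.
    """
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1] in DROPPED_PREDICATES:
            continue
        yield line


sample = [
    "ns:common.topic ns:type.type.instance ns:m.010bs8 .",
    'ns:american_football.football_division rdfs:label "American football division"@en .',
    "ns:m.02qvftw rdf:type ns:business.employer .",
]

kept = list(prefilter(sample))
```

Dropping these facts in the mapper means no reducer ever receives the 30M-value group for ns:common.topic.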
  • 51. Parallel Super Eyeball “triples” valid triples junk Currently, 250,000 or so triples in Freebase are rejected by PSE3
  • 52. Parallel Super Eyeball 3
  • 53. Sieve3 literal facts (ex. ?s ?p 55. ) ?s :a ?p . ?s ?p ns:some_topic . ?s rdfs:label ?o .
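
One way to picture Sieve3 is as an ordered list of rules, where each triple goes to the bin of the first rule it matches. The sketch below is an assumption about the mechanism, not the real implementation: the rule order, the bin names, and the `is_literal` test are all chosen for illustration.

```python
def is_literal(obj):
    """Crude literal test: quoted strings and bare numbers."""
    return obj.startswith('"') or obj[:1].isdigit()


# First matching rule wins, so specific rules come before general ones.
RULES = [
    ("a", lambda s, p, o: p in ("a", "rdf:type")),
    ("label", lambda s, p, o: p == "rdfs:label"),
    ("links", lambda s, p, o: o.startswith("ns:")),
    ("literal", lambda s, p, o: is_literal(o)),
]


def sieve(triple):
    """Return the name of the bin a (subject, predicate, object) triple belongs in."""
    for name, rule in RULES:
        if rule(*triple):
            return name
    return "other"


bins = [
    sieve(("ns:m.02qvftw", "rdf:type", "ns:business.employer")),
    sieve(("ns:american_football.football_division", "rdfs:label",
           '"American football division"@en')),
    sieve(("ns:m.0100007", "ns:common.topic.notable_types", "ns:m.0kpv11")),
    sieve(("ns:m.example", "ns:example.count", "55")),  # hypothetical fact
]
```

Writing each bin to its own output is what produces the horizontal decomposition described on the following slides.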
  • 54. Horizontal Decomposition of Freebase
  • 55. a 5% description 18% key 11% keyNs 13% label 6% name 6% notability 0% nfp 0% text 8% web 6% links 20% other 7% percentage of gz compressed size
  • 56. a 16% description 1% key 9% keyNs 11% label 6% name 6% notability 2% nfp 2% text 0% web 5% links 32% other 10% percentage of facts
  • 57. a 15% description 7% key 8% keyNs 9% label 4% name 4% notability 2% nfp 1% text 3% web 6% links 30% other 11% percentage of uncompressed size
  • 58. rdf:type aka “a” 16% 15% 5% facts bytes compressed bytes ns:m.02qvftw rdf:type ns:business.employer .
  • 59. RDFS Inference :a :Actor ?
  • 60. RDFS Inference Jesse Plemons Todd
  • 61. :a :Actor . Jesse Plemons Todd implies
  • 62. Descriptions 1% facts 18% bytes 7% compressed
  • 63. Descriptions ns:m.010bfy ns:common.topic.description "Riverside \u00E9 uma cidade localizada no estado norte-americano de Texas, no Condado de Walker."@pt . ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en .
  • 64. Descriptions ns:m.010bfy ns:common.topic.description "Riverside \u00E9 uma cidade localizada no estado norte-americano de Texas, no Condado de Walker."@pt . ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en . This does not compute!
  • 65. Descriptions ns:m.010bfy ns:common.topic.description "Riverside \u00E9 uma cidade localizada no estado norte-americano de Texas, no Condado de Walker."@pt . ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en .
  • 66. Labels and Names ns:american_football.football_division rdfs:label "American football division"@en . ns:american_football.football_conference rdfs:label "Grupper inom amerikansk fotboll"@sv . ns:american_football.football_player ns:type.object.name "Football-Spieler"@de . ns:american_football.football_team ns:type.object.name "American football-team"@nl .
  • 67. Freebase Labels Are Not Unique
  • 68. DBpedia Labels are Unique
  • 69. https://github.com/paulhoule/infovore/wiki https://groups.google.com/forum/#!forum/infovore-basekb
  • 70. Keys in the Freebase dump • Most objects represented by mid identifiers
  • 71. Keys in the Freebase dump • Schema objects have friendly identifiers
  • 72. Keys in the Freebase dump
  • 73. Examples… ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en . ns:american_football.football_division rdfs:label "American football division"@en . Freebase always uses the same key in the ?s, ?p, and ?o fields, but...
  • 74. It wasn’t always this way … the old quad dump used mids in the subject field, but others in the destination field …
  • 75. Turtle0 Turtle1 Turtle2 Turtle3 Extract namespace graph Convert all identifiers to mids Extract type information from schema Convert to RDF types :BaseKB 2012
  • 76. Freebase Knows Many Keys ns:g.11vk55hmr ns:type.object.key "/base/dspl/us_census/population/place" . ns:m.010004m ns:type.object.key "/authority/musicbrainz/339a2897-9ba4-4820-a2a8-f234c22608a4" . ns:m.01003_ ns:type.object.key "/wikipedia/de/Krum_$0028Texas$0029" . ns:m.01010d ns:type.object.key "/wikipedia/en_id/135860" . ns:m.0100_b ns:type.object.key "/authority/gnis/1352653" . ns:m.0100l2 ns:type.object.key "/authority/hud/countyplace/4814101390" . ns:m.01031l ns:type.object.key "/en/chandler_texas" . ns:m.015g9m ns:type.object.key "/en/aliens_from_space" . ns:m.015gdl ns:type.object.key "/en/self-publishing" . ns:m.015gjr ns:type.object.key "/authority/nndb/231$002F000085973" . … and type.object.key spells them out …
  • 77. A directed acyclic graph /m/01 root /m/019s wikipedia /m/047w32v authority /m/0gt9 en /m/05x_rjr Geoff_Simmons /wikipedia/en/Geoff_Simmons = /authority/wikipedia/en/Geoff_Simmons
  • 78. key: namespace encodes the graph ns:m.010005 key:wikipedia.pt "Corinth_$0028Texas$0029" . ns:m.010005h key:authority.musicbrainz "ab0b82ce-d1be-4641-b0d1-838896a25887" .
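
The keys on slides 76 and 78 contain sequences such as `$0028` and `$002F`, which are a dollar sign followed by a four-digit hexadecimal character code. A small decoder makes the keys readable again. This is a sketch based on the examples shown in the slides; `decode_key` is a hypothetical helper name, not an Infovore function.

```python
import re

# "$" followed by exactly four hex digits stands for one character.
_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def decode_key(key):
    """Replace each $XXXX escape in a Freebase key with the character it encodes."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


corinth = decode_key("Corinth_$0028Texas$0029")
nndb = decode_key("231$002F000085973")
```

Here `corinth` becomes the Wikipedia page title with its parentheses restored, and `nndb` regains the slash that could not appear literally inside a single key.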
  • 79. Useful external keys
  • 80. Music
  • 81. http://www.freebase.com/authority/musicbrainz/e217a1e9-9ec8-4e88-aebc-7d6b720384c1
  • 82. Musical Composition … Recording “Recording appears on Album as track #”
  • 83. Functional Requirements For Bibliographic Records (FRBR)
  • 84. Nick Hexum Rap Rock 311 Omaha, NE Los Angeles, CA
  • 85. Unique data in DBpedia
  • 86. Wikipedia Categories
  • 87. Wikipedia Page Links
  • 88. “Smushing” dbpedia:Striated_Heron :linksTo dbpedia:Heron . dbpedia:Striated_Heron owl:sameAs ns:m.01v7dp . dbpedia:Heron owl:sameAs ns:m.01jgnh . ns:m.01v7dp :linksTo ns:m.01jgnh .
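
Smushing, as shown on slide 88, amounts to rewriting identifiers through the owl:sameAs mapping. The following is a minimal in-memory sketch using the triples from the slide; the function names are hypothetical, and a real run over DBpedia page links would do this as a join in Map/Reduce rather than with a dictionary.

```python
def build_same_as(triples):
    """Collect owl:sameAs statements into a DBpedia-to-Freebase lookup."""
    return {s: o for s, p, o in triples if p == "owl:sameAs"}


def smush(triples, same_as):
    """Rewrite subjects and objects to their Freebase identifiers."""
    out = []
    for s, p, o in triples:
        if p == "owl:sameAs":
            continue  # the mapping itself is not carried over
        out.append((same_as.get(s, s), p, same_as.get(o, o)))
    return out


triples = [
    ("dbpedia:Striated_Heron", ":linksTo", "dbpedia:Heron"),
    ("dbpedia:Striated_Heron", "owl:sameAs", "ns:m.01v7dp"),
    ("dbpedia:Heron", "owl:sameAs", "ns:m.01jgnh"),
]

smushed = smush(triples, build_same_as(triples))
```

The result is the last triple on the slide: the page link expressed entirely in Freebase mids.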
  • 89. Duck Types • ?a performed on music track ?b - ?a is a musician
  • 90. Duck Types • ?a employed ?b - ?a is an employer
  • 91. Duck Types • Book ?a was written about ?b – ?b is a book subject
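
The three duck-typing rules on slides 89 to 91 behave like RDFS domain and range inference: using a property tells you something about the type of its subject or its object. The sketch below assumes made-up predicate and type names (`:performedOn`, `:employed`, `:writtenAbout`, and so on); Freebase's real property identifiers differ.

```python
# Predicate -> type implied for the SUBJECT (like rdfs:domain).
DOMAIN_TYPES = {
    ":performedOn": ":Musician",   # hypothetical names throughout
    ":employed": ":Employer",
}

# Predicate -> type implied for the OBJECT (like rdfs:range).
RANGE_TYPES = {
    ":writtenAbout": ":BookSubject",
}


def infer_types(triples):
    """Return the set of type facts implied by property usage."""
    inferred = set()
    for s, p, o in triples:
        if p in DOMAIN_TYPES:
            inferred.add((s, "a", DOMAIN_TYPES[p]))
        if p in RANGE_TYPES:
            inferred.add((o, "a", RANGE_TYPES[p]))
    return inferred


facts = [
    (":artist1", ":performedOn", ":track1"),
    (":acme", ":employed", ":bob"),
    (":book1", ":writtenAbout", ":mars"),
]

types = infer_types(facts)
```

Note that the third rule types the object rather than the subject, matching the slide: the thing a book is written about is the book subject.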
  • 92. The Problem of Notability
  • 93. ns:m.0100007 ns:common.topic.notable_types ns:m.0kpv11. ns:m.01000_r ns:common.topic.notable_types ns:m.0kpv11. ns:m.01000dh ns:common.topic.notable_types ns:m.09jd9nh. ns:m.01000pp ns:common.topic.notable_types ns:m.09jd9nh. ns:m.01000px ns:common.topic.notable_types ns:m.0kpv11. ns:m.01000w ns:common.topic.notable_types ns:m.01m9. ns:m.01000yk ns:common.topic.notable_types ns:m.0kpv11. ns:m.010012t ns:common.topic.notable_types ns:m.0kpv11. ns:m.010014_ ns:common.topic.notable_types ns:m.09jd9nh. ns:m.010019c ns:common.topic.notable_types ns:m.09jd9nh.
  • 94. Analysis with Chopper and Pig
  • 95. Why APIs suck (Including SPARQL endpoints) • Provider can afford maximum $/query • If you need a more complex query you’ve got no option!
  • 96. :BaseKB Now YOU AWS S3
  • 97. Cluster creation made easy :BaseKB Now
  • 98. Pig Script – count common types
    $ pig
    grunt> run chopper/src/main/pig/lib/chopper.pig
    grunt> a = LOAD '/freebase/20130915/a/' USING com.ontology2.chopper.io.PrimitiveTripleInput();
    grunt> oNodes = FOREACH a GENERATE o;
    grunt> groupNodes = GROUP oNodes BY o;
    grunt> countedNodes = FOREACH groupNodes GENERATE group AS uri:chararray, COUNT(oNodes) AS cnt:long;
    grunt> sortedNodes = ORDER countedNodes BY cnt DESC;
    grunt> top100 = LIMIT sortedNodes 100;
    grunt> DUMP top100;
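
For readers who do not use Pig, the same group-count-sort pipeline can be tried on a small in-memory sample. This is only an illustration of the logic, with invented sample triples; it does not read the :BaseKB files and would not scale to them.

```python
from collections import Counter


def count_objects(triples, limit=100):
    """Count how often each object node appears, most frequent first."""
    counts = Counter(o for _s, _p, o in triples)
    return counts.most_common(limit)


# Invented sample of rdf:type facts.
sample = [
    ("ns:m.example1", "rdf:type", "ns:common.topic"),
    ("ns:m.example1", "rdf:type", "ns:people.person"),
    ("ns:m.example2", "rdf:type", "ns:common.topic"),
]

top = count_objects(sample)
```

`Counter.most_common` plays the role of the ORDER and LIMIT steps, while the generator expression corresponds to projecting out the `o` column.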
  • 99. Most frequent types
    (<http://rdf.basekb.com/ns/common.topic>,39030195)
    (<http://rdf.basekb.com/ns/common.notable_for>,18747254)
    (<http://rdf.basekb.com/ns/music.release_track>,13304261)
    (<http://rdf.basekb.com/ns/music.recording>,8902041)
    (<http://rdf.basekb.com/ns/music.single>,6297869)
    (<http://rdf.basekb.com/ns/common.document>,5580077)
    (<http://rdf.basekb.com/ns/media_common.cataloged_instance>,3030634)
    (<http://rdf.basekb.com/ns/book.book_edition>,2771323)
    (<http://rdf.basekb.com/ns/people.person>,2742157)
    (<http://rdf.basekb.com/ns/type.namespace>,2689781)
    (<http://rdf.basekb.com/ns/book.isbn>,2601099)
    (<http://rdf.basekb.com/ns/type.content>,2499648)
    (<http://rdf.basekb.com/ns/measurement_unit.dated_integer>,2466557)
  • 100. Compound Value Types and our 4D world
  • 101. The 13th most prevalent type
    (<http://rdf.basekb.com/ns/common.topic>,39030195)
    (<http://rdf.basekb.com/ns/common.notable_for>,18747254)
    (<http://rdf.basekb.com/ns/music.release_track>,13304261)
    (<http://rdf.basekb.com/ns/music.recording>,8902041)
    (<http://rdf.basekb.com/ns/music.single>,6297869)
    (<http://rdf.basekb.com/ns/common.document>,5580077)
    (<http://rdf.basekb.com/ns/media_common.cataloged_instance>,3030634)
    (<http://rdf.basekb.com/ns/book.book_edition>,2771323)
    (<http://rdf.basekb.com/ns/people.person>,2742157)
    (<http://rdf.basekb.com/ns/type.namespace>,2689781)
    (<http://rdf.basekb.com/ns/book.isbn>,2601099)
    (<http://rdf.basekb.com/ns/type.content>,2499648)
    (<http://rdf.basekb.com/ns/measurement_unit.dated_integer>,2466557)
  • 102. :Las_Vegas 945 1910 :US_Census_Bureau population number date source
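
The diagram on slide 102 shows a compound value type: the population is not a single literal but a small node that carries a number, a date, and a source. A rough sketch of how such a node expands into triples follows. The property names come from the labels on the slide, while the blank node id `_:cvt1` and the helper names are assumptions for illustration.

```python
from typing import NamedTuple


class DatedInteger(NamedTuple):
    """One measurement: a value, when it held, and who reported it."""
    number: int
    year: int
    source: str


def to_triples(subject, prop, node_id, value):
    """Expand one compound value into a linking triple plus its fields."""
    return [
        (subject, prop, node_id),
        (node_id, ":number", str(value.number)),
        (node_id, ":date", str(value.year)),
        (node_id, ":source", value.source),
    ]


census_1910 = DatedInteger(number=945, year=1910, source=":US_Census_Bureau")
cvt_triples = to_triples(":Las_Vegas", ":population", "_:cvt1", census_1910)
```

Each additional census year adds another node of the same shape, which is how one topic ends up with a whole time series of populations.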
  • 103. Population of Las Vegas, NV (chart, population by year): 1900: 25; 1910: 945; 1920: 2,304; 1930: 5,165; 1940: 8,422; 1950: 24,624; 1960: 64,405; 1970: 125,787; 1980: 164,674; 1990: 260,561; 1991: 284,931; 1992: 297,326; 1993: 312,634; 1994: 336,380; 1995: 354,559; 1996: 372,849; 1997: 391,074; 1998: 405,245; 1999: 418,658; 2000: 484,487; 2001: 498,638; 2002: 507,219; 2003: 516,723; 2004: 534,168; 2005: 544,806; 2006: 552,855; 2007: 559,892; 2008: 562,849; 2009: 567,641; 2010: 584,539; 2011: 589,317
  • 104. Vertical Divisions of Freebase Wikipedia Topics Movies and Television Travel and Lodging :BaseKB Lite
  • 105. Separating Blank Nodes
  • 106. Separating Blank Nodes
  • 107. Separating Blank Nodes
  • 108. Separating Blank Nodes
  • 109. :BaseKB Now • Created Weekly by automated process • Delivered to AMZN S3 • Accepted facts are 100% Valid RDF • Rejected facts collected for inspection • “Violent” predicates removed to fight skew • Horizontally divided for fast processing http://basekb.com/
  • 110. Infovore Software http://github.com/paulhoule/infovore/wiki