Infovore: An Open Source MapReduce Framework For Processing Graph Data

This talk describes Infovore, a tool that uses the Map/Reduce approach to clean up, filter, and combine RDF data sets, delivering purpose-built data sets for practical consumers of linked data.

Published in: Technology, Education

Transcript

  • 1. Infovore, an Open-Source Map/Reduce Framework For Processing Graph Data. Paul Houle, Ontology2
  • 2. 2+ billion facts, 20+ GB!
  • 3. the data your project needs
  • 4. Why handle complete data sets? (diagram labels: Quality, Perimeter, Infovore)
  • 5. RDF Tools vs. Invalid Triples (image cc-by from arj03)
  • 6. Scaling Limits of Triple Stores (diagram labels: multiple CPUs, Main Memory, Random-access bottleneck, Hard Drive or Flash Storage)
  • 7. Map/Reduce conserves memory! (image cc-by-sa from Anua22a)
  • 8. Partitioning Data: md5("http://dbpedia.org/resource/Tree") = b78f8f508982ceb4e8dd3510fac75f62 (diagram: partitions … 330 331 332 333 334 335 …)
  • 9. If you really try it… (diagram: partitions … 330 331 332 333 334 335 …)
  • 10. Preprocessing Freebase
      • Expand prefixes
      • Remove
          • fbase:type.type.instance
          • fbase:type.type.expected_by
          • rdfs:type w/ fbase:* subject
      • Reverse
          • fbase:type.permission.controls
          • fbase:dataworld_gardening_hint.replaced_by
      • Rewrite
          • fbase:type.object.type to rdfs:type
  • 11. Parallel Super Eyeball
  • 12. sort | uniq
      :Surgeon a :Occupation .
      :Surgeon rdfs:label "Surgeon"@en .
      :Surgeon :mustHave :Md .
      :Tree a :Plant .
      :Tree rdfs:label "Tree"@en .
      :Tree :has :Leaves .
      :Victory a :AbstractConcept .
      :Victory rdfs:label "Victory" .
      :Victory :emotionalTone :Positive .
  • 13. Huge scalability… (diagram labels: :Tree, :Victory, :Surgeon, Main memory)
  • 14. Pig, Hadoop and All That… (source: http://www.dbis.informatik.hu-berlin.de/forschung/projekte/query-optimization-in-rdf-databases.html)
  • 15. Monitoring for Quality Control (diagram labels: Operational Statistics (rdf); Preprocess, Partition, Clean, Sort, Classify, Filter)
  • 16. :basekb
  • 17. Parallel Loading into Triple Stores: OpenLink Virtuoso, 4x speedup (diagram: partitions … 330 331 332 333 334 335 …)
  • 18. :basekb lite (diagram labels: :Freebase, :Chosenfacts, :Rulebox, :Chosentopics)
  • 19. rdf diff
  • 20. See for yourself: https://github.com/paulhoule/infovore/wiki
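
Slide 8 hashes a subject URI with MD5 and shows numbered partitions. The slides do not say how the digest is turned into a partition number, so the following is only a minimal sketch under stated assumptions: the digest is reduced modulo a partition count, and `NUM_PARTITIONS`, `partition_for`, and `partition_triples` are hypothetical names, not Infovore's own API.

```python
import hashlib

# Assumption: the slides do not state how many partitions Infovore uses.
NUM_PARTITIONS = 1024


def partition_for(subject_uri: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a subject URI to a partition number via its MD5 digest."""
    digest = hashlib.md5(subject_uri.encode("utf-8")).hexdigest()
    # Assumption: reduce the 128-bit digest modulo the partition count.
    return int(digest, 16) % num_partitions


def partition_triples(triples, num_partitions: int = NUM_PARTITIONS):
    """Group (subject, predicate, object) triples by the subject's partition."""
    partitions = {}
    for s, p, o in triples:
        key = partition_for(s, num_partitions)
        partitions.setdefault(key, []).append((s, p, o))
    return partitions
```

Because the partition depends only on the subject, every triple about one subject lands in the same partition, which is what lets each partition be processed independently.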
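
The Freebase preprocessing rules on slide 10 (remove, reverse, rewrite) can be illustrated as a per-triple function. This is a sketch, not Infovore's implementation. It assumes triples are tuples of prefixed names written exactly as on the slide, that "reverse" means swapping subject and object while keeping the predicate, and that the `rdfs:type` removal applies to triples as they arrive rather than to rewritten ones. Prefix expansion is left out, and `preprocess` is a hypothetical name.

```python
# Predicates taken from slide 10; everything else here is illustrative.
REMOVE_PREDICATES = {
    "fbase:type.type.instance",
    "fbase:type.type.expected_by",
}
REVERSE_PREDICATES = {
    "fbase:type.permission.controls",
    "fbase:dataworld_gardening_hint.replaced_by",
}
REWRITE_PREDICATES = {
    "fbase:type.object.type": "rdfs:type",
}


def preprocess(triple):
    """Return the transformed triple, or None if the triple is dropped."""
    s, p, o = triple
    if p in REMOVE_PREDICATES:
        return None
    # Assumption: only incoming rdfs:type triples with an fbase: subject are dropped.
    if p == "rdfs:type" and s.startswith("fbase:"):
        return None
    if p in REVERSE_PREDICATES:
        # Assumption: reversing swaps subject and object, predicate unchanged.
        return (o, p, s)
    return (s, REWRITE_PREDICATES.get(p, p), o)
```

Each triple is handled on its own, so a step like this fits naturally in a map phase with no shared state.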
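
Slides 12 and 13 suggest that sorting triples brings all facts about one subject together, so duplicates can be dropped and only one subject's facts need to sit in main memory at a time. A minimal sketch of that idea, assuming one non-empty triple per line with the subject as the first whitespace-separated token; `uniq` and `group_by_subject` are hypothetical helper names.

```python
from itertools import groupby


def uniq(sorted_lines):
    """Yield each distinct line once; the input must already be sorted."""
    previous = None
    for line in sorted_lines:
        if line != previous:
            yield line
        previous = line


def group_by_subject(sorted_lines):
    """Yield (subject, triples) pairs, holding one subject's triples at a time."""
    deduplicated = uniq(sorted_lines)
    # Assumption: the subject is the first whitespace-separated token of each line.
    for subject, group in groupby(deduplicated, key=lambda line: line.split(None, 1)[0]):
        yield subject, list(group)
```

Both functions stream over their input, so memory use is bounded by the largest single subject rather than by the size of the data set.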