Data Big and Broad (Oxford, 2012)
Definitions, examples, and challenges in a world where data is available and plentiful

Published in: Technology, Education


  • 1. Data: Big and Broad
      Jim Hendler, Tetherless World Constellation
      Tetherless World Professor of Computer and Cognitive Science
      Head, Computer Science Department, Rensselaer Polytechnic Institute
      @jahendler (Twitter)
  • 2. Outline (if I stick to it)
      • What is big data?
      • How big is big?
      • What is big data on the Web?
      • What is Broad data?
      • Got an example?
      • What’s the problem?
      • What’s going on
  • 3. Useful Terms
      • Machine-readable Data
          – Information available in a form that is accessible and manipulable by computer
          – Accessible ≠ Manipulable
              • e.g., PDF documents can be read in and displayed, but the information in the document is not readily available without special tooling
      • Metadata
          – Information associated with (machine-readable) data that provides information about the data set
      • Workflow, Provenance, and lots of other terms
          – Useful sorts of metadata with respect to who created the data, when, how it was processed, etc.
      • Metadata and the other terms are most useful when they are machine-readable and openly available in commonly agreed-upon formats
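As a minimal sketch of the distinction on this slide, the Python below attaches a small metadata record to a dataset description and serializes it as JSON, one commonly agreed-upon machine-readable format. The field names (`title`, `creator`, `created`, `processing_steps`) and all values are hypothetical, chosen only to mirror the who/when/how questions above; they are not taken from the talk or from any particular metadata standard.

```python
import json

# Hypothetical metadata record: every field name and value is illustrative.
metadata = {
    "title": "Example rainfall measurements",
    "creator": "Example Agency",        # who created the data
    "created": "2012-01-15",            # when it was created
    "processing_steps": [               # how it was processed (provenance)
        "collected from field sensors",
        "averaged per month",
    ],
}

# Serializing to JSON makes the metadata manipulable by other programs,
# not merely displayable.
serialized = json.dumps(metadata, indent=2, sort_keys=True)

# A consumer can read it back and query individual fields.
restored = json.loads(serialized)
print(restored["creator"])
```

The same record embedded in a PDF would be accessible (a person could read it) but not readily manipulable, which is the gap the slide points at.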
  • 4. BIG Data is NOT the Web of Data
      • The term “Big Data” is widely used nowadays to refer to a whole bunch of machine-readable data in one accessible (to the researcher) place
          – 3 main contexts
              • The large data collections of “big science” projects
                  – in traditional data warehouse or database formats
              • The enterprise data of large, non-Web-based companies (IBM, TATA, etc.)
                  – Generally in multiple
              • The data holdings of a Google, Facebook or other large Web company
                  – Include large “unstructured” holdings
                  – Include “graph” data
  • 5. Tera, Peta, Zetta, yotta, yotta, yotta…
      • World Wide Web data is extremely large
      • Extremely well “funded”
          – e.g., Facebook
              • 25 terabytes of logged data per day; valuation $33B (US NIH budget ~$31B)
          – e.g., Google
              • In 2008 it was estimated at 20 petabytes per day (not including YouTube); current valuation $190B (about 1/3 the entire US DoD budget)
      • And really, really fascinating stuff
          – Data about people and their relationships
              • To each other
              • To products
              • To activities and actions
              • …
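To put the two daily volumes on one scale, the sketch below converts them to bytes using decimal SI prefixes. This is only a rough illustration: the slide does not say whether decimal or binary units were meant, and the two figures come from different years, so the ratio is indicative at best. The helper name `to_bytes` is made up for this example.

```python
# Decimal (SI) prefixes expressed as powers of ten bytes.
SI_PREFIXES = {
    "tera": 10**12,
    "peta": 10**15,
    "exa": 10**18,
    "zetta": 10**21,
    "yotta": 10**24,
}


def to_bytes(value, prefix):
    """Convert a quantity such as (25, 'tera') to a byte count."""
    return value * SI_PREFIXES[prefix]


# Figures quoted on the slide (assumed here to be decimal units).
facebook_per_day = to_bytes(25, "tera")
google_per_day = to_bytes(20, "peta")

# How many times larger the Google figure is than the Facebook figure.
ratio = google_per_day // facebook_per_day
print(ratio)
```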
  • 6. How BIG is Big?
  • 7. BIG Data
      Google uses their data in many ways
      Search => ads => user
  • 8. Big Data is becoming different on the Web
      • New Work
          – is moving away from traditional relational models
              • cf. NoSQL
          – Moving towards third party application and extension
              • cf. Mobile apps for local governments
          – Includes a focus on interoperability and exchange with “lightweight” semantics
              • Using ideas from the Semantic Web
                  – Search:
                  – Social Networking: OGP
  • 9. Which in part gives rise to BROAD data
      • 4th context: Broad Data
          – The huge amount of freely available, but widely varied, Open Data on the World Wide Web (Structured and Semi-structured)
              • Example: The extended Facebook OGP graph (the part outside Facebook’s datasets)
              • Example: The growing linked open data cloud of freely available RDF linked data
              • Example: Hundreds of thousands of datasets that are available on the Web free from governments around the world
  • 10. Example: adding “Breadth”
      April 2010
  • 11. Facebook’s Open Graph Protocol
      • Facebook now allows other sites to extend the graph
      • Open Graph Protocol uses RDFa to let web sites contain information about the things people “like”
          – og:title - The title of your object as it should appear within the graph, e.g., "The Rock".
          – og:type - The type of your object, e.g., "movie". Depending on the type you specify, other properties may also be required.
          – og:image - An image URL which should represent your object within the graph.
          – og:url - The canonical URL of your object that will be used as its permanent ID in the graph.
          – og:description - A one to two sentence description of your object.
          – og:site_name - If your object is part of a larger web site, the name which should be displayed for the overall site, e.g., "IMDb".
      • Not a traditional “ontology”
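The properties listed on this slide appear in a page as `<meta property="og:..." content="...">` tags. As a rough sketch of how a third party could read them, the Python below uses the standard library's `html.parser` to collect those pairs. The sample page is hypothetical: the titles come from the slide's own examples, while the `example.com` URLs and the description text are placeholders. The class name `OpenGraphParser` is likewise made up for this illustration and is not part of any Facebook tooling.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        if prop.startswith("og:") and "content" in attr_map:
            self.properties[prop] = attr_map["content"]


# Hypothetical page; URLs and description are placeholders.
PAGE = """
<html>
<head>
<title>The Rock</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:url" content="http://example.com/title/the-rock/" />
<meta property="og:description" content="A placeholder description." />
<meta property="og:site_name" content="IMDb" />
</head>
<body></body>
</html>
"""

parser = OpenGraphParser()
parser.feed(PAGE)
og = parser.properties
print(og["og:title"], og["og:type"])
```

Because the vocabulary is just a handful of flat key/value properties, any site can publish it and any crawler can consume it, which is the sense in which the slide calls it not a traditional “ontology”.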
  • 12. Big Data
      Facebook generates terabytes of data per day
      What could be learned from this?
  • 13. Creates a platform for SW-powered apps
  • 14. BROAD data challenges
      • For broad data the new challenges that emerge include
          – (Web-scale) data search
          – “Crowd-sourced” modeling
          – rapid (and potentially ad hoc) integration of datasets
          – visualization and analysis of only partially modeled datasets
          – policies for data use, reuse and combination
  • 15. Huh?
      “The more I work with data, the more I realize I need Semantics”
      Huh?
      The traditional database community has, umm, not always been the first to embrace semantics
      What is different here?
  • 16. Government Data Sharing
  • 17. The Web of Open Government Data is Growing
      • Analytics based on over 1,000,000 datasets from around the world can be seen at –
      • The examples that follow are from that page
          – Datasets: 1,028,054
          – Countries: 43
          – Catalogs: 192
          – Categories: 2460
          – Languages: 24
      2012 International Open Government Data Conference: Open Gov Data Tutorial, 9 July 2012
  • 18. International
  • 19.
  • 20. Many others…
      Important note: quantity is not really the most important issue
  • 21. Topics (Across All Catalogs)
  • 22. Topics (Across All Catalogs)
  • 23. Combining data from different data sharing sites
  • 24. Data Integration Problems
      A head-to-head comparison shows that burglaries in Avon and Somerset (UK) far exceed those in Los Angeles, California (one of the highest crime areas in the US)
  • 25. The problem is (likely) semantics
      Same or different?
      Do the terms mean the same? Are they collected in the same way? Are they processed differently? …
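One way to act on the questions above is to compare the metadata of two series before comparing their numbers. The sketch below does only that. Everything in it is hypothetical: the `SeriesMetadata` fields, the function `semantic_mismatches`, and the placeholder values, which do not describe the actual UK or US datasets from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesMetadata:
    """Hypothetical description of how one published statistic was produced."""
    term: str               # label used by the publisher
    definition: str         # what the publisher counts under that label
    collection_method: str  # how incidents were gathered
    unit: str               # e.g. raw count vs. rate per population


def semantic_mismatches(a, b):
    """Return the metadata fields on which two series disagree."""
    fields = ("definition", "collection_method", "unit")
    return [name for name in fields if getattr(a, name) != getattr(b, name)]


# Placeholder values only; they do not describe the real datasets.
uk = SeriesMetadata("burglary", "definition A", "police reports", "count per year")
us = SeriesMetadata("burglary", "definition B", "police reports", "count per year")

problems = semantic_mismatches(uk, us)
if problems:
    print("Not directly comparable; differing fields:", problems)
```

Both series use the same term, yet the check still flags them, which is the point of the slide: a shared label does not guarantee a shared meaning.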
  • 26. Example: Water
  • 27. Example: Water/Kenya
  • 28. Finding Data
      • World Bank: Africa
      • Africover: Agriculture
      • Kenya: Agricultural
      • US Crop
  • 29. 5 Star Data
  • 30. Broad Data “Integration” requires simple semantics
  • 31. Example: any Wikipedia topic!
  • 32. Arizona
  • 33. Arizona info (From the previous)
  • 34. USDA data turns out to be crucial
  • 35. Metadata is crucial for Broad Data
      • Metadata design is crucial to govt data sharing
          – Needed for search and federation in large data sharing efforts
      • International data sharing
          – W3C Govt Linked Data Working Group
          – Need for vocabularies within govt sectors
              • Esp. for cross-language use
          – How can we compare health (or legal, or social, or ….) data between countries like US, UK, India, Kenya (English) with Norway, China, France, etc.
          – How can we link local govts (in traditional languages, local dialects, etc.) w/national data
  • 36. Database metadata
  • 37. Dataset extension to (pending)
  • 38. Government Data in the linked open data cloud
      Government Data is currently over ½ the cloud in size (~17B triples), tens of thousands of links to other data (within and without)
  • 39. Research in Govt Data => Broad Data challenges
      • Trust
          – Government data is controversial, and potentially biased
              • How do we confirm or dispute?
      • Combination
          – When we combine data we need to keep the provenance of information (see trust)
              • How do we make policies explicit and sharable
      • Scaling
          – Our project has already converted 9.9B triples from only >2,000 of the 710,000 government databases we can identify (116 catalogs, 32 countries, 16 languages)
              • Cross-catalog
              • Cross-language
      • Versioning and updating
      • Archiving
      • Visualization
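The “Combination” point can be sketched in a few lines: when triples from several catalogs are merged, each statement keeps a record of which sources asserted it. The code below is a plain-Python illustration under stated assumptions, not the project's actual pipeline; the catalog names, the triples, and the function `merge_with_provenance` are all hypothetical.

```python
from collections import defaultdict


def merge_with_provenance(sources):
    """Merge (subject, predicate, object) triples, remembering their sources."""
    provenance = defaultdict(set)
    for source_name, triples in sources.items():
        for triple in triples:
            provenance[triple].add(source_name)
    return provenance


# Hypothetical catalogs and statements, for illustration only.
sources = {
    "catalog_a": [
        ("dataset:1", "covers", "Kenya"),
        ("dataset:1", "topic", "water"),
    ],
    "catalog_b": [
        ("dataset:1", "covers", "Kenya"),
        ("dataset:2", "topic", "agriculture"),
    ],
}

merged = merge_with_provenance(sources)

# Statements asserted by more than one source are candidates for "confirmed";
# single-source statements may need a closer look before they are trusted.
confirmed = {triple for triple, origin in merged.items() if len(origin) > 1}
print(sorted(confirmed))
```

Keeping the source set alongside each triple is what makes the “confirm or dispute” question under Trust answerable after the merge.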
  • 40. Big Data needs bigger ideas for visualization
      (Fox & Hendler, Science, 2/11/10)
  • 41. A new idea we’re playing with at RPI
      • Data as “exhibition”
          – Museums/Performing Arts have explored accessibility for real world artifacts, can we extend these to the data web?
      • Data via physical interaction
          – Using theatre techniques we can literally move a person through a data landscape, what new metaphors does this open up?
  • 42. Conclusions
      • Big data is going Broad
          – World Wide Web trend towards more and more varied data
              • In many domains
                  – E-commerce, Open Govt, many more (cf. Health/Medical care)
      • Broad data requires thinking outside the “Database” box
          – Including considering access
      • Broad data opens exciting possibilities for research and innovation
          – And I hope will help provide tools for making data more accessible