Broad Data (India 2015)

A 2015 update to the 2012 "Data Big and Broad" talk (http://www.slideshare.net/jahendler/data-big-and-broad-oxford-2012) that extends coverage and places more of it in the context of recent "big data" work.

Published in: Technology


  1. 1. Tetherless World Constellation Broad Data Jim Hendler Tetherless World Professor of Computer and Cognitive Science Director, The Rensselaer Institute for Data Exploration and Applications (IDEA) Rensselaer Polytechnic Institute http://www.cs.rpi.edu/~hendler @jahendler (Twitter)
  2. 2. Tetherless World Constellation This talk • What I’m not going to talk about much – The Semantic Web (per se) • http://www.slideshare.net/jahendler/semantic-web-the-inside-story – Social Machines • http://www.slideshare.net/jahendler/social-machines-oxford-hendler – My work with Watson and Cognitive Computing • http://www.slideshare.net/jahendler/watson-an-academics-perspective • http://www.slideshare.net/jahendler/watson-summer-review82013final • What I am going to present – The rest of the big data story…
  3. 3. Tetherless World Constellation Data is important! • Roughly every 50 years a new power source for the human race is found. Once upon a time it was chemical, then it was electrical, then nuclear, etc. • Information – so not just data, but data being used – is the new power source for our generation. http://www.slideshare.net/jahendler/the-science-of-data-science
  4. 4. The Rensselaer Institute for Data Exploration and Applications: Business Systems; Built and Natural Environments; Cyber-Resiliency; Policy, Ethics and Stewardship; Materials Informatics; Data-driven Physical/Life Sciences; Healthcare Analytics and Mobile Health; Social Network Analytics; Agents and Augmented Reality
  5. 5. Office of Research Developing a “Data Science” Research Agenda: Multiscale; Sparsity; Abductive; Agent-oriented
  6. 6. Tetherless World Constellation BIG Data • The term “Big Data” is widely used nowadays to refer to a whole bunch of machine-readable data in one accessible (to the researcher) place – 3 main contexts • The large data collections of “big science” projects – in traditional data warehouse or database formats • The enterprise data of large, non-Web-based companies (IBM, TATA, etc.) – Generally in multiple data formats, stores, warehouses, etc. • The data holdings of a Google, Facebook or other large Web company – Include large “unstructured” holdings – Include “graph” data
  7. 7. Tetherless World Constellation But wait, there’s more! • 4th context: Broad Data – The huge amount of freely available, but widely varied, Open Data on the World Wide Web (Structured and Semi-structured) • Example: The extended Facebook OGP graph (the part outside Facebook’s datasets) • Example: dbpedia, yago, wikidata, and other indexed information sources • Example: The growing linked open data cloud of freely available linked data from many domains • Example: millions of datasets made freely available on the Web by governments around the world
  8. 8. Tetherless World Constellation The V’s Volume Velocity
  9. 9. Tetherless World Constellation BROAD data challenges • For broad data the new challenges that emerge include – (Web-scale) data search – “Crowd-sourced” modeling and user testing – rapid (and potentially ad hoc) integration of datasets – visualization and analysis of only-partially modeled datasets – policies for data use, reuse and combination. • Which are an overlooked but critical part of the KDD world
  10. 10. Tetherless World Constellation KDD Pipeline – as usually presented Data Storage (Big Data Warehouse)
  11. 11. Tetherless World Constellation KDD Pipeline – in the real world • Data is increasingly being brought in from external sources, with mixed provenance, and increasingly outside the analyzers’ control. • At increasing rates and scales [Figure: many data sources (sensors, apps, social media, customer behaviors, Web, partners, open data) feeding multiple data stores through formatting, standards use, data cleansing, data bias analysis, …]
  12. 12. Tetherless World Constellation Tough data integration challenges Enterprise analytics Open Data Integration Hard problems!
  13. 13. Tetherless World Constellation DIVE into Data Discover Integrate Validate Explore Thinking outside the Database box
  14. 14. IDEA Discovery needs semantics How do you find the Data you need? The answer isn’t: Middle Eastern Terrorists for $800 …
  15. 15. IDEA Discovery – there’s a lot out there
  16. 16. IDEA Discovery challenge: keyword search won’t work World Bank: Africa US Data.gov: Crop Africover: Agriculture Kenya: Agricultural
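A minimal sketch of why keyword matching misses related datasets, reading the four labels on the slide above as the terms each catalog uses. The `CONCEPTS` map and both function names are hypothetical and hand-built for illustration; a real system would presumably draw on a shared vocabulary rather than a hard-coded dictionary.

```python
# Catalog labels are taken from the slide; everything else is illustrative.
CATALOG = {
    "World Bank": "Africa",
    "US Data.gov": "Crop",
    "Africover": "Agriculture",
    "Kenya": "Agricultural",
}

# Hypothetical hand-built mapping from surface terms to a shared concept.
CONCEPTS = {
    "crop": "farming",
    "agriculture": "farming",
    "agricultural": "farming",
}


def keyword_search(query):
    """Return the sources whose label contains the query string."""
    q = query.lower()
    return sorted(src for src, label in CATALOG.items() if q in label.lower())


def concept_search(query):
    """Return the sources whose label maps to the same concept as the query."""
    target = CONCEPTS.get(query.lower())
    if target is None:
        return []
    return sorted(
        src for src, label in CATALOG.items()
        if CONCEPTS.get(label.lower()) == target
    )


if __name__ == "__main__":
    print(keyword_search("agriculture"))
    print(concept_search("agriculture"))
```

The keyword search finds only the catalog that happens to use the exact string, while the concept lookup also reaches the catalogs that say "Crop" or "Agricultural".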
  17. 17. IDEA Integration challenge: need to understand the data • Person: RIN = 660125137; Address # = 1118; Address St = Pinehurst; Address zip = 12203; Course topic = CSCI; Course # = 4961 • Campus Personnel: RPI ID = 660125137; Name = Hendler • Campus Classes: CRN = 1118; Name = Intro to Physics • Same value, same thing? RIN 660125137 = RPI ID 660125137: YES. Address # 1118 = CRN 1118: NO!!!!
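The integration slide above can be sketched in code: joining on equal values alone links a street number to a course registration number, while a join that also checks what each field means keeps only the real match. The field values come from the slide; the `kind` labels and the function names are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    table: str
    name: str
    kind: str   # hypothetical semantic label, not part of the slide
    value: str


FIELDS = [
    Field("Person", "RIN", "rpi_person_id", "660125137"),
    Field("Person", "Address #", "street_number", "1118"),
    Field("Campus Personnel", "RPI ID", "rpi_person_id", "660125137"),
    Field("Campus Classes", "CRN", "course_registration_number", "1118"),
]


def value_matches(fields):
    """Pairs of fields from different tables that hold the same value."""
    return [
        (a, b)
        for i, a in enumerate(fields)
        for b in fields[i + 1:]
        if a.table != b.table and a.value == b.value
    ]


def semantic_matches(fields):
    """Value matches whose fields also describe the same kind of thing."""
    return [(a, b) for a, b in value_matches(fields) if a.kind == b.kind]


if __name__ == "__main__":
    for a, b in value_matches(FIELDS):
        verdict = "YES" if a.kind == b.kind else "NO"
        print(f"{a.table}.{a.name} = {b.table}.{b.name}: {verdict}")
```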
  18. 18. IDEA Semantic Web and Linked Data (UK) County Council Ordnance Survey Royal Mail IOGDC Open Data Tutorial 18
  19. 19. IDEA Distribution Statement http://logd.tw.rpi.edu Semantic Web and Linked Data (US examples)
  20. 20. IDEA Validation challenge: easy for humans Easy for us
  21. 21. IDEA But very hard for machines without people (or knowledge) Head to head comparison shows that burglaries in Avon and Somerset (UK) far exceed those in Los Angeles, California * one of the most dangerous places in the US vs. one of the safest in the UK * fails the “smell test”
  22. 22. IDEA Data + everything else you know Same or different? Do the terms mean the same? Are they collected in the same way? Are they processed differently? …
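One way to make the questions above checkable by a machine is to compare the descriptors of two datasets before comparing their numbers. This is only a sketch: the descriptor keys, the function name, and the placeholder values are all assumptions, since the talk does not give the actual definitions behind the burglary figures.

```python
def comparability_issues(meta_a, meta_b,
                         keys=("term_definition", "collection_method", "processing")):
    """List the descriptor keys on which two datasets disagree or are silent."""
    issues = []
    for key in keys:
        a, b = meta_a.get(key), meta_b.get(key)
        if a is None or b is None:
            issues.append(f"{key}: not documented")
        elif a != b:
            issues.append(f"{key}: {a!r} vs {b!r}")
    return issues


# Placeholder descriptors; the real definitions are not given in the talk.
SOURCE_A = {"term_definition": "definition A", "collection_method": "method A"}
SOURCE_B = {"term_definition": "definition B", "collection_method": "method A"}

if __name__ == "__main__":
    for issue in comparability_issues(SOURCE_A, SOURCE_B):
        print(issue)
```

An empty result does not prove the comparison is sound; it only means the documented descriptors do not contradict each other.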
  23. 23. Office of Research Exploration challenge: develop/test earlier in pipeline [Figure: data sources (sensors and apps, social media, customer behaviors, Web, partners, open data) feeding multiple data stores through formatting, standards use, data cleansing, data bias analysis, …, with an Explore stage marked] Can we develop mechanisms to rapidly develop/test hypotheses prior to entering the full analytics pipeline? Can human perceptual apparatus help?
  24. 24. Tetherless World Constellation Exploration challenge is to improve human/data interaction Were there really no fires in 1985?
  25. 25. Tetherless World Constellation How do we attack these challenges? DOH? DO! OR
  26. 26. Tetherless World Constellation Traditional Metadata • Traditionally metadata tries to be comprehensive – Example: ISO 19115 (GIS standard) • >400 elements • 14 “packages” • Dozens of UML models (not all consistent w/ each other) • After 50 years this still doesn’t work!
  27. 27. Tetherless World Constellation The alternative: Not your “father’s metadata” • Big Data on the web – is moving away from traditional relational models (cf. NoSQL) – is moving towards third-party application and extension (cf. JSON) – focuses on interoperability and exchange with “lightweight” semantics • Using ideas from the Semantic Web – Search: Schema.org – Social Networking: OGP
  28. 28. Tetherless World Constellation Semantic Web to Knowledge graph
  29. 29. Tetherless World Constellation Knowledge graph and schema.org
  30. 30. Tetherless World Constellation Google 2014 Google finds embedded metadata on >20% of its crawl – Guha, 2014
  31. 31. Tetherless World Constellation • The schema.org hierarchy and details are all available online – https://schema.org/docs/full.html
  32. 32. Tetherless World Constellation Schema.org/Dataset Human-readable database description (HTML)
  33. 33. Tetherless World Constellation Schema.org/Dataset Embedded metadata (RDFa)
  34. 34. Tetherless World Constellation Dataset extension to schema.org - April, 2013 Schema.org/Dataset – add this to your pages!
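To give a feel for what adding Schema.org/Dataset markup to a page involves, here is a small sketch that renders an HTML fragment with RDFa Lite attributes (`vocab`, `typeof`, `property`). The function name, the sample values, and the choice of just three properties (`name`, `description`, `url`) are assumptions for illustration; the slides' own markup is not reproduced in the transcript.

```python
from html import escape


def dataset_rdfa(name, description, url):
    """Render a minimal HTML fragment describing a schema.org Dataset in RDFa Lite."""
    return (
        '<div vocab="https://schema.org/" typeof="Dataset">\n'
        f'  <h2 property="name">{escape(name)}</h2>\n'
        f'  <p property="description">{escape(description)}</p>\n'
        f'  <a property="url" href="{escape(url, quote=True)}">Download</a>\n'
        '</div>'
    )


if __name__ == "__main__":
    # Hypothetical dataset; example.org is a placeholder domain.
    print(dataset_rdfa(
        "Crop yields & rainfall",
        "Illustrative dataset description.",
        "http://example.org/datasets/crops",
    ))
```

The same human-readable page carries the machine-readable description, which is the point of embedding metadata instead of publishing it separately.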
  35. 35. Tetherless World Constellation Schema.org/Dataset (Schema-labs, data search engine)
  36. 36. Tetherless World Constellation Distribution Statement Big Deal!
  37. 37. Tetherless World Constellation USA “Project Data” – metadata JSON Aimed at developers Based on DCAT
  38. 38. Tetherless World Constellation USA “Project Data” – metadata RDFa Embedded metadata for Search, Web Apps Based on Schema.org/Dataset
  39. 39. Tetherless World Constellation EU moving in similar direction ADMS
  40. 40. Tetherless World Constellation Not just Govt sector • IPTC rNews – Embedded format for online news publications
  41. 41. Tetherless World Constellation Not just Govt sector • Goodrelations – Embedded format for online products/catalogs
  42. 42. Tetherless World Constellation Not just Govt sector • Open Graph Protocol – Embedded format for Facebook relationships
  43. 43. Tetherless World Constellation OGP Use
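Open Graph Protocol metadata is embedded as `<meta property="og:..." content="...">` tags in a page's head, so a consumer can read it with very little machinery. The sketch below uses only Python's standard `html.parser`; the class name, function name, and sample page are made up for illustration.

```python
from html.parser import HTMLParser


class OGPParser(HTMLParser):
    """Collect Open Graph properties from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and "content" in attr:
            self.properties[prop] = attr["content"]


def extract_ogp(html):
    """Return a dict of og:* properties found in the given HTML text."""
    parser = OGPParser()
    parser.feed(html)
    return parser.properties


# Hypothetical page; example.org is a placeholder domain.
SAMPLE_PAGE = """
<html><head>
<title>Example dataset page</title>
<meta property="og:title" content="Example dataset">
<meta property="og:type" content="website">
<meta property="og:url" content="http://example.org/dataset">
<meta name="description" content="not an OGP property">
</head><body></body></html>
"""

if __name__ == "__main__":
    print(extract_ogp(SAMPLE_PAGE))
```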
  44. 44. Tetherless World Constellation Next steps It’s not enough just to describe the data elements… Table (Person, Date): Smith James, June 4; Jones Fred, May 17; O’Connell Frank, April 3; Chang Wu, February 21; Hoffman Bernd, December 9
  45. 45. Tetherless World Constellation Describing a dataset … requires a context Table (Person, Date), labeled “1976 Dates of Birth”: Smith James, June 4; Jones Fred, May 17; O’Connell Frank, April 3; Chang Wu, February 21; Hoffman Bernd, December 9
  46. 46. Tetherless World Constellation Describing a dataset … requires a context How do we capture more of this information? Table (Person, Date), labeled “1976 Cancer Mortality dates”: Smith James, June 4; Jones Fred, May 17; O’Connell Frank, April 3; Chang Wu, February 21; Hoffman Bernd, December 9
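The three slides above show the same rows under two different labels. A small sketch of the idea: if the context travels with the data as part of the dataset description, two tables with identical cells are no longer mistaken for the same dataset. The `Dataset` class and its field names are hypothetical; the rows and the two labels come from the slides.

```python
from dataclasses import dataclass

# Rows as they appear on the slides: (Person, Date).
ROWS = (
    ("Smith James", "June 4"),
    ("Jones Fred", "May 17"),
    ("O'Connell Frank", "April 3"),
    ("Chang Wu", "February 21"),
    ("Hoffman Bernd", "December 9"),
)


@dataclass(frozen=True)
class Dataset:
    columns: tuple
    rows: tuple
    year: int      # context: which year the dates fall in
    meaning: str   # context: what the dates record

    def describe(self, row_index):
        """Spell out one row together with its context."""
        person, date = self.rows[row_index]
        return f"{person}: {self.meaning}, {date} {self.year}"


births = Dataset(("Person", "Date"), ROWS, 1976, "date of birth")
deaths = Dataset(("Person", "Date"), ROWS, 1976, "cancer mortality date")

if __name__ == "__main__":
    print(births.describe(0))
    print(deaths.describe(0))
    print("same cells:", births.rows == deaths.rows)
    print("same dataset:", births == deaths)
```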
  47. 47. IDEA Scalable Data Integration (via metadata)
  48. 48. IDEA Semantic Linking
  49. 49. IDEA ARL Network-Science CTA [Figure: two sets of four panels (A, B, C, D) plotting Count against Time interval (# of days); one set has both axes running from 1 to 1000, the other spans -300 to 300 days; series: Mentorship first, Housing trust first] Algorithms designed in Y3 were tested against 220GB of data from the Everquest II game, looking for proxy measures of trust. Performance results on real data showed good correspondence with theoretical results (but 220GB = 1 month of our 2 yrs of data).
  50. 50. IDEA Scaling inference for discovery, integration & validation AI “rules on graphs” bring (limited) KR languages to supercomputing models Weaver (PhD 2013) showed power of BlueGene/Q for AI computations
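As a toy illustration of what "rules on graphs" means, the sketch below applies two RDFS-style rules (type propagation along `rdfs:subClassOf`, and transitivity of `rdfs:subClassOf`) to a set of triples until nothing new is derived. It is a serial, in-memory example with made-up `ex:` triples, and is not the parallel approach from the thesis work mentioned on the slide.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rdfs_closure(triples):
    """Forward-chain two RDFS-style rules to a fixed point.

    Rule 1: (x type C), (C subClassOf D)       -> (x type D)
    Rule 2: (C subClassOf D), (D subClassOf E) -> (C subClassOf E)
    """
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            if p not in (TYPE, SUBCLASS):
                continue
            for s2, p2, o2 in inferred:
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, p, o2))
        if new <= inferred:   # nothing new was derived
            return inferred
        inferred |= new


# Hypothetical example graph.
GRAPH = {
    ("ex:Hendler", TYPE, "ex:Professor"),
    ("ex:Professor", SUBCLASS, "ex:Faculty"),
    ("ex:Faculty", SUBCLASS, "ex:Person"),
}

if __name__ == "__main__":
    for triple in sorted(rdfs_closure(GRAPH) - GRAPH):
        print(triple)
```

Each pass scans all pairs of triples, which is fine for three facts and hopeless at Web scale; that gap is why scaling this kind of inference is a research topic.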
  51. 51. From visualization to exploration … Unfortunately, visualization too often becomes an end product of scientific analysis, rather than an exploration tool that scientists can use throughout the research life cycle. However, new database technologies, coupled with emerging Web-based technologies, may hold the key to lowering the cost of visualization generation and allow it to become a more integral part of the scientific process.
  52. 52. Tetherless World Constellation Conclusions • Our data challenge is becoming “Broad Data” – World Wide Web trend towards more and more varied data • In many domains – E-commerce, Open Govt, many more (cf. Health/Medical care) • Broad data requires – Modern, Web-oriented metadata – LINKING the metadata, not the data • Broad data requires thinking outside the “Database” box – DIVE: discover, integrate, validate and – especially: EXPLORE (early, often, rapidly)
  53. 53. Tetherless World Constellation Questions?
