Tetherless World Constellation




   Data: Big and Broad
             Jim Hendler
    Tetherless World Constellation
Tetherless World Professor of Computer and Cognitive Science
            Head, Computer Science Department

   Rensselaer Polytechnic Institute
   http://www.cs.rpi.edu/~hendler
         @jahendler (twitter)
Outline (if I stick to it)



• What is big data?
• How big is big?
• What is big data on the Web?
• What is Broad data?
• Got an example?
• What’s the problem?
• What’s going on?
Useful Terms

• Machine-readable Data
   – Information available in a form that is accessible and
     manipulable by computer
   – Accessible ≠ Manipulable
      • e.g., PDF documents can be read in and displayed, but the
        information in the document is not readily available without special
        tooling
• Metadata
   – Information associated with (machine-readable) data that
     provides information about the data set
• Workflow, Provenance, and lots of other terms
   – Useful sorts of metadata with respect to who created the data,
     when, how was it processed, etc.
• Metadata and the other terms above are most useful when they are
  machine-readable and openly available in commonly agreed-upon
  formats
BIG Data is NOT the Web of Data

• The term “Big Data” is widely used
  nowadays to refer to a whole bunch of
  machine-readable data in one accessible
  (to the researcher) place
   – 3 main contexts
    • The large data collections of “big science” projects
       – in traditional data warehouse or database formats
    • The enterprise data of large, non-Web-based
      companies (IBM, TATA, etc.)
       – Generally in multiple
    • The data holdings of a Google, Facebook or other
      large Web company
       – Include large “unstructured” holdings
       – Include “graph” data
Tera, Peta, Zetta
                                            yotta, yotta, yotta…


• World Wide Web data is extremely large
• Extremely well “funded”
  – e.g., Facebook
     • 25 terabytes of logged data per day; valuation $33B (US
       NIH budget ~ $31B)
  – e.g., Google
     • In 2008 it was estimated at 20 petabytes per day (not
       including YouTube); current valuation $190B (about 1/3
       the entire US DoD budget)

• And really, really fascinating stuff
  – Data about people and their relationships
     •   To each other
     •   To products
     •   To activities and actions
     •   …
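
To put the two daily volumes quoted on this slide on one scale, here is a rough back-of-the-envelope sketch. It assumes decimal (SI) prefixes, which the slide does not specify, and the helper name `to_bytes` is made up for illustration.

```python
# Decimal (SI) prefixes. Assumption: the slide's figures use powers of 1000.
SI_EXPONENT = {"tera": 12, "peta": 15, "exa": 18, "zetta": 21, "yotta": 24}


def to_bytes(value, prefix):
    """Convert a value with an SI prefix (e.g. 25, 'tera') to bytes."""
    return value * 10 ** SI_EXPONENT[prefix]


facebook_per_day = to_bytes(25, "tera")  # 25 TB/day, from the slide
google_per_day = to_bytes(20, "peta")    # 20 PB/day (2008 estimate), from the slide

# How many times larger is the Google estimate than the Facebook figure?
ratio = google_per_day / facebook_per_day
print(f"Google's 2008 estimate is about {ratio:.0f}x Facebook's daily log volume")

# Days needed to accumulate one zettabyte at the Google rate.
days_to_zettabyte = to_bytes(1, "zetta") / google_per_day
print(f"One zettabyte at that rate takes about {days_to_zettabyte:,.0f} days")
```

The point of the second figure is only to show how far apart "peta" and "zetta" are; it is not a forecast.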
How BIG is Big?

BIG Data





Google uses their data in many ways
         Search => ads => user
Big Data is becoming different on the Web



• New Work
  – is moving away from traditional relational
   models
     • cf. NoSQL
  – Moving towards third party application and
    extension
     • cf. Mobile apps for local governments
  – Includes a focus on interoperability and
    exchange with “lightweight” semantics
     • Using ideas from the Semantic Web
        – Search: Schema.org
        – Social Networking: OGP
Which in part gives rise to BROAD data



• 4th context: Broad Data
  – The huge amount of freely available, but widely varied,
    Open Data on the World Wide Web (Structured and
    Semi-structured)
     • Example: The extended Facebook OGP graph (the
       part outside Facebook’s datasets)
     • Example: The growing linked open data cloud of
       freely available RDF linked data
     • Example: Hundreds of thousands of datasets that are
       available on the Web free from governments around
       the world
Example: adding “Breadth”





                    April 2010
Facebook’s Open Graph Protocol


• Facebook now allows other sites to extend the graph
• Open Graph Protocol uses RDFa to let web sites contain
  information about the things people “like”
       og:title - The title of your object as it should appear within the graph, e.g., "The Rock".
       og:type - The type of your object, e.g., "movie". Depending on the type you specify, other
       properties may also be required.
       og:image - An image URL which should represent your object within the graph.
       og:url - The canonical URL of your object that will be used as its permanent ID in the graph
       og:description - A one to two sentence description of your object.
       og:site_name - If your object is part of a larger web site, the name which should be
       displayed for the overall site. e.g., "IMDb".




   – Not a traditional “ontology”
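
As a minimal sketch of how a third-party application might read the `og:` properties listed above, the following uses only Python's standard `html.parser`. The sample markup, the class name `OpenGraphParser`, and the `example.com` URLs are illustrative assumptions, not taken from a real page.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        # Keep only Open Graph properties that actually carry a value.
        if prop.startswith("og:") and attr.get("content") is not None:
            self.properties[prop] = attr["content"]


# Hypothetical page markup; the URLs are placeholders, not real resources.
SAMPLE_HTML = """
<html prefix="og: http://ogp.me/ns#">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://example.com/title/the-rock/" />
  <meta property="og:image" content="http://example.com/rock.jpg" />
  <meta property="og:site_name" content="IMDb" />
</head>
</html>
"""

parser = OpenGraphParser()
parser.feed(SAMPLE_HTML)
for name, value in sorted(parser.properties.items()):
    print(f"{name}: {value}")
```

Nothing here requires a shared ontology: the consumer only needs to agree on a handful of property names, which is what makes the semantics "lightweight".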
Big Data





Facebook generates terabytes of data per day
          What could be learned from this?
Creates a platform for Semantic Web (SW)-powered apps

BROAD data challenges



• For broad data the new challenges
  that emerge include
  – (Web-scale) data search
  – “Crowd-sourced” modeling
  – rapid (and potentially ad hoc)
    integration of datasets
  – visualization and analysis of only
    partially modeled datasets
  – policies for data use, reuse and
    combination.
Huh?



“The more I work with data, the more I
realize I need Semantics”

 Huh?

The traditional database community has,
umm, not always been the first to embrace
semantics

What is different here?
Government Data Sharing

The Web of Open
Government Data is Growing
• Analytics based on over 1,000,000 datasets
  from around the world can be seen at
   – http://logd.tw.rpi.edu/iogds_data_analytics
• The examples that follow are from that page
Datasets                 1,028,054
Countries                43
Catalogs                 192
Categories               2,460
Languages                24
          2012 International Open Government Data Conference—Open Gov Data Tutorial
9 July 2012                                                                           17
International




Many others…




                                                   Important note:
                                                   quantity is not really the most
                                                   important issue

Topics (Across All Catalogs)




Topics (Across All Catalogs)




Combining data from different data sharing sites

Data Integration Problems





Head-to-head comparison shows that
burglaries in Avon and Somerset (UK) far
exceed those in Los Angeles, California
(one of the highest crime areas in the US)
The problem is (likely) semantics





                                                        Same or
                                                        different?




Do the terms mean the same? Are they collected in the same way? Are
they processed differently? …
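
One way to read these questions: a figure is only comparable to another when the metadata describing the two agrees. The sketch below is a hedged illustration of that idea. The numbers are invented placeholders (not the statistics behind the slide), and the field names `definition` and `period` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrimeStat:
    """A reported count plus the metadata needed to interpret it."""
    region: str
    count: int        # placeholder value, NOT a real statistic
    population: int   # placeholder value, NOT a real statistic
    definition: str   # what the publisher means by the term
    period: str       # reporting period


def rate_per_100k(stat):
    """Normalize a raw count by population."""
    return stat.count * 100_000 / stat.population


def comparison_problems(a, b):
    """List the reasons two figures should not be compared head to head."""
    problems = []
    if a.definition != b.definition:
        problems.append(f"definitions differ: {a.definition!r} vs {b.definition!r}")
    if a.period != b.period:
        problems.append(f"periods differ: {a.period!r} vs {b.period!r}")
    return problems


region_a = CrimeStat("Region A", 9_000, 1_500_000, "burglary incl. attempted", "2011")
region_b = CrimeStat("Region B", 6_000, 3_000_000, "burglary, completed only", "2011")

problems = comparison_problems(region_a, region_b)
if problems:
    print("Do not compare directly:", "; ".join(problems))
else:
    print(rate_per_100k(region_a), rate_per_100k(region_b))
```

The check can only run if both publishers state what their term means in machine-readable form, which is the argument for simple shared semantics.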
Example: Water

Example: Water/Kenya

Finding Data





World Bank: Africa     Africover: Agriculture




 Kenya: Agricultural   US Data.gov: Crop
5 Star Data

Broad Data “Integration”
requires simple semantics
Example: any Wikipedia topic!

Arizona

Arizona info (From the previous)

USDA data turns out to be crucial

Metadata is crucial for Broad Data


• Metadata design is crucial to govt data
  sharing
  – Needed for search and federation in large data
    sharing efforts
• International data sharing
  – W3C Govt Linked Data Working Group
  – Need for vocabularies within govt sectors
     • Esp. for cross-language use
        – How can we compare health (or legal, or social, or …) data
          between countries like US, UK, India, Kenya (English) with
          Norway, China, France, etc.?
        – How can we link local govts (in traditional languages, local
          dialects, etc.) with national data?
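
To make "metadata in a commonly agreed format" concrete, here is a minimal sketch that builds a dataset description as JSON-LD. It assumes the property names of the present-day schema.org `Dataset` type, which may differ from the pending extension these slides refer to. The function name `dataset_metadata`, the sample record, and the `example.org` URLs are hypothetical.

```python
import json


def dataset_metadata(name, description, url, language, publisher,
                     keywords, download_url, fmt):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "inLanguage": language,  # lets a catalog filter or group by language
        "keywords": keywords,
        "publisher": {"@type": "Organization", "name": publisher},
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
            "encodingFormat": fmt,
        },
    }


# Hypothetical record; names and URLs are placeholders.
record = dataset_metadata(
    name="Crop production by county",
    description="Annual crop production figures (illustrative entry).",
    url="http://example.org/datasets/crop-production",
    language="en",
    publisher="Example Ministry of Agriculture",
    keywords=["agriculture", "crops"],
    download_url="http://example.org/datasets/crop-production.csv",
    fmt="text/csv",
)

print(json.dumps(record, indent=2))
```

A federated search across catalogs then needs to understand only this small shared vocabulary, not each publisher's internal schema.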
Database metadata

Dataset extension to schema.org (pending)

Government Data in the linked open data cloud





    Government Data is
    currently over ½ the cloud in
    size (~17B triples), 10s of
    thousands of links to other
    data (within and without)

http://linkeddata.org/
Research in Govt Data => Broad Data challenges


• Trust
   – Government data is controversial, and potentially biased
       • How do we confirm or dispute?
• Combination
   – When we combine data we need to keep the provenance of
     information (see trust)
       • How do we make policies explicit and sharable?
• Scaling
   – Our project has already converted 9.9B triples from just over
     2,000 of the 710,000 government databases we can identify
     (116 catalogs, 32 countries, 16 languages)
       • Cross-catalog
       • Cross-language
• Versioning and updating
• Archiving
• Visualization
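
The trust and combination points above can be illustrated with a small sketch: when datasets are merged, each value keeps a record of where it came from, so disagreements stay visible rather than being silently overwritten. The `Fact` structure, the catalog names, and all values are hypothetical placeholders, and this is one possible approach rather than the project's actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: int
    source: str  # catalog or dataset the value came from (provenance)


def combine(*datasets):
    """Merge datasets without discarding where each value came from."""
    merged = {}
    for dataset in datasets:
        for fact in dataset:
            merged.setdefault((fact.subject, fact.predicate), []).append(fact)
    return merged


def conflicts(merged):
    """Return the keys on which sources disagree, with the competing facts."""
    return {
        key: facts
        for key, facts in merged.items()
        if len({fact.value for fact in facts}) > 1
    }


# Placeholder values from two hypothetical catalogs.
catalog_a = [
    Fact("Region A", "population", 1_500_000, "catalog-a"),
    Fact("Region A", "schools", 120, "catalog-a"),
]
catalog_b = [
    Fact("Region A", "population", 1_480_000, "catalog-b"),
    Fact("Region A", "schools", 120, "catalog-b"),
]

merged = combine(catalog_a, catalog_b)
disputed = conflicts(merged)
for (subject, predicate), facts in disputed.items():
    claims = ", ".join(f"{f.value} ({f.source})" for f in facts)
    print(f"{subject} / {predicate}: {claims}")
```

Because the source travels with every value, a reader can confirm or dispute a figure by going back to the catalog that published it.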
Big Data needs bigger ideas
            for visualization




      (Fox & Hendler, Science, 2/11/11)
A new idea we’re playing with at RPI



• Data as “exhibition”
  – Museums and the performing arts have explored
    accessibility for real-world artifacts; can
    we extend these ideas to the data web?
• Data via physical
  interaction
  – Using theatre techniques
    we can literally move a
    person through a data landscape; what
    new metaphors does this open up?
Conclusions

• Big data is going Broad
  – World Wide Web trend towards more and more
    varied data
     • In many domains
        – E-commerce, Open Govt, many more (cf.
          Health/Medical care)

• Broad data requires thinking outside the
  “Database” box
  – Including considering access
• Broad data opens exciting possibilities for
  research and innovation
  – And I hope will help provide tools for making
    data more accessible

Data Big and Broad (Oxford, 2012)
