Broad Data

In this talk I compare "Broad" data, the idea of thousands of datas



Usage Rights

© All Rights Reserved

  • Oh, you know what Facebook will do in September 2012, eh? Btw, I often get the feeling that various Semantic Web/Linked Data etc. protagonists somehow doom the upper parts of the Semantic Web technology stack (whatever ;) ). However, they are also very important. Of course, could/should start with the easier parts, but propagate the impression the other parts are overly complex or not useful. I guess, appliers will notice when they will need those bits too ;)
  • thanks for catching that - obviously I meant data.gouv.fr - will fix in next version
  • For the french side, maybe you were talking about www.data.gouv.fr?
  • http://www.mkbergman.com/458/new-currents-in-the-deep-web/ http://academics.smcvt.edu/sburks/Definition_search_engine.htm

Broad Data Presentation Transcript

  • Broad Data
    • Jim Hendler
    • Tetherless World Constellation
    • Tetherless World Professor of Computer and Cognitive Science
    • Director, Information Technology and Web Science Program
    • Rensselaer Polytechnic Institute
    • http://www.cs.rpi.edu/~hendler
    • @jahendler (Twitter)
  • Outline (if I stick to it)
    • Big Data ≠ Broad Data
    • Broad Data problem
    • Broad Data Example
      • Open Government Data
    • Broad Data challenges
    • How can you make money off this stuff?
  • BIG Data
    • The term “Big Data” is widely used nowadays
      • 3 main contexts
        • The large data collections of “big science” projects
        • The data holdings of a Google, Facebook or other large Web company
        • The enterprise data of large, non-Web-based companies (IBM, TATA, etc.)
  • Big Data Challenge: Scaling
    • Most of the focus of (current) Big Data research is on scaling (traditional) database-related technologies
      • Schema Modeling
      • Data Warehousing
      • Datamining
      • Statistical analysis
      • Mathematical Analytics
  • How BIG is Big?
    • Science uses some extremely large databases and many of them are crucial to society
      • Petabytes of Data
    • World Wide Web data is also extremely large
      • With primary resources to explore it held by companies
        • e.g. Facebook
          • 25 Terabytes of logged data per day; valuation $100B?
        • e.g. Google
          • In 2008 it was estimated at 20 petabytes per day (not including YouTube); 2010 valuation >$190B
  • Big Data: Facebook generates terabytes of data per day. What could be learned from this?
  • BIG Data: Google uses their data in many ways (Search => ads => user)
  • Big Data is becoming different on the Web
    • New Work
      • is moving away from traditional relational models
        • cf. NoSQL
      • Moving towards third party application and extension
        • cf. Mobile apps for local governments
      • Includes a focus on interoperability and exchange with “lightweight” semantics
        • Using ideas from the Semantic Web
          • Search: Schema.org
          • Social Networking: OGP
  • BROAD data
    • 4th context: Broad Data
      • The huge amount of freely available, but widely varied, Open Data on the World Wide Web (Structured and Semi-structured)
        • Example: The extended Facebook OGP graph (the part outside Facebook’s datasets)
        • Example: The growing linked open data cloud of freely available RDF linked data
        • Example: More than 710,000 datasets that are available on the Web free from governments around the world
  • Example: adding “Breadth” April 2010
  • Facebook’s Open Graph Protocol
    • Facebook now allows other sites to extend the graph
    • Open Graph Protocol uses RDFa to let web sites contain information about the things people “like”
        • og:title - The title of your object as it should appear within the graph, e.g., "The Rock".
        • og:type - The type of your object, e.g., "movie". Depending on the type you specify, other properties may also be required.
        • og:image - An image URL which should represent your object within the graph.
        • og:url - The canonical URL of your object that will be used as its permanent ID in the graph
        • og:description - A one to two sentence description of your object.
        • og:site_name - If your object is part of a larger web site, the name which should be displayed for the overall site. e.g., "IMDb".
      • Not a traditional “ontology”
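The og: properties listed above are published as meta tags in a page's head, so a consumer can read them with very little machinery. The following is a minimal sketch using only Python's standard library, not Facebook's own implementation. The sample markup is hypothetical and reuses the example values from the slide ("The Rock", "movie", "IMDb"). The names OpenGraphParser and extract_open_graph are illustrative assumptions.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        if prop.startswith("og:") and "content" in attr_map:
            self.properties[prop] = attr_map["content"]


def extract_open_graph(html):
    """Return a dict mapping og: property names to their content values."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Hypothetical page markup, reusing the example values from the slide.
SAMPLE_PAGE = """
<html>
<head>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:site_name" content="IMDb" />
</head>
<body></body>
</html>
"""

og = extract_open_graph(SAMPLE_PAGE)
for name, value in og.items():
    print(name, "=", value)
```

The result is a flat set of property/value pairs about one object, with no class hierarchy or axioms behind it, which is one way to read the "not a traditional ontology" point.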
  • OGP use growing quickly
    • 15,178 sites of top 1,000,000 as of 3/3/11
    • In Sept 2012 Facebook announced extension of OGP for new uses
  • Goal: OGP-powered social (e-commerce) apps
  • Broad data (in Science)
    • The “Deep Web” in Science (cf. Fox 2011)
      • Data behind web services
      • Data behind query interfaces (databases or files)
    • Introduces a different curation problem
  • Broad Data Science (Fox & Hendler, Science, 2/11/10)
  • BROAD data challenges
    • For broad data the new challenges that emerge include
      • (Web-scale) data search
      • “Crowd-sourced” modeling
      • rapid (and potentially ad hoc) integration of datasets
      • visualization and analysis of only-partially modeled datasets
      • policies for data use, reuse and combination.
  • Example: Government Data on the Web
  • Government Data Sharing: “Year 1” January 1, 2009 “Openness will strengthen our democracy and promote efficiency and effectiveness in Government.” (President Obama) Putting Govt Data online - Data.gov.uk beta May 21, 2009 January 19, 2010 data.gov.uk online May 21, 2010 data.gov online data.gov relaunch with semantic web featured June 30, 2009 December 8, 2009 “Open Government Directive” released 2009 2010 … 57 Data Sets ~6000 Data Sets ~2000 Data Sets >305,000 Data Sets
  • Government Data Sharing: Year 2
  • Government Data Sharing: Year 3
    • 2012 so far:
      • http://www.gouv.fr released 300,000 French databases
      • US/India to release Open Government Platform
      • Kenya announces “Open Africa” project
  • Government Data in the linked open data cloud (http://linkeddata.org/)
    • Government Data is currently over ½ the cloud in size (~17B triples), with 10s of thousands of links to other data (within and without)
  • Important to the citizens: e.g. Education (Data.gov.uk, RPI NYS demos)
    • Government “Data” Mashups
  • Data.gov + epa.gov
  • Linking GDP of the US and China [Temporal Mashup]
    • GDP of China (Billion Chinese Yuan)
    • GDP of the US (Billion Dollar)
    • Sources: bea.gov + federalreserve.gov + stats.gov.cn
    • This mashup was built in less than 4 hours, including conversion of data, web interface, and visualization!
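The core of a temporal mashup like this one is aligning independently published series on a shared time key. Below is a minimal sketch of that step, assuming each source has already been converted to a year-to-value mapping. The function name join_by_year is hypothetical, and the numbers are placeholders for illustration only, not real GDP figures.

```python
def join_by_year(left, right):
    """Inner-join two {year: value} series on the years they share."""
    shared_years = sorted(set(left) & set(right))
    return [(year, left[year], right[year]) for year in shared_years]


# Placeholder values for illustration only, NOT real GDP statistics.
us_gdp_billion_usd = {2007: 1.0, 2008: 2.0, 2009: 3.0}
china_gdp_billion_cny = {2008: 20.0, 2009: 30.0, 2010: 40.0}

rows = join_by_year(us_gdp_billion_usd, china_gdp_billion_cny)
for year, us_value, china_value in rows:
    print(year, us_value, china_value)
```

The join keeps each series in its own unit, so comparing the two directly would also need an exchange-rate series, which this sketch does not handle.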
  • Linking to “context” important
    • Datasets: acres burned, and agency budgets
    • DBpedia: Wikipedia descriptions of major US fires
  • Integrate with Social media
  • Combining data from different data sharing sites
  • http://logd.tw.rpi.edu demos, tutorials, RDF-ized datasets, and more
  • Broad Data “Integration” requires simple semantics
  • Example: any Wikipedia topic!
  • Metadata is crucial for Broad Data
    • Metadata design is crucial to govt data sharing
      • Needed for search and federation in large data sharing efforts
    • International data sharing
      • W3C Govt Linked Data Working Group
      • Need for vocabularies within govt sectors
        • Esp. for cross-language use
          • How can we compare health (or legal, or social, or ...) data between countries like US, UK, India, Kenya (English) and Norway, China, France, etc.?
          • How can we link local govts (in traditional languages, local dialects, etc.) with national data?
  • International Open Government Data Search
  • Searching for data
    • Faceted browser with
      • Keyword search
      • Catalogs
      • Countries
      • Agencies
      • Categories
      • (in any order)
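Faceted browsing of this kind can be pictured as conjunctive filtering over catalog metadata, which is why the facets can be applied in any order. Here is a minimal sketch under that assumption; it is not the code behind the demo. The records, field names, and the search and facet_counts helpers are all hypothetical.

```python
from collections import Counter

# Hypothetical catalog records; the field names mirror the facets on the slide.
DATASETS = [
    {"title": "School enrollment", "catalog": "catalog-a", "country": "US",
     "agency": "agency-x", "category": "education"},
    {"title": "School inspections", "catalog": "catalog-b", "country": "UK",
     "agency": "agency-y", "category": "education"},
    {"title": "Air quality", "catalog": "catalog-a", "country": "US",
     "agency": "agency-z", "category": "environment"},
    {"title": "Hospital wait times", "catalog": "catalog-b", "country": "UK",
     "agency": "agency-y", "category": "health"},
]


def search(datasets, keyword=None, **facets):
    """Keep records whose title contains the keyword and whose fields
    match every requested facet value (all filters are ANDed)."""
    results = []
    for record in datasets:
        if keyword and keyword.lower() not in record["title"].lower():
            continue
        if all(record.get(name) == value for name, value in facets.items()):
            results.append(record)
    return results


def facet_counts(datasets, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(record[facet] for record in datasets)


schools = search(DATASETS, keyword="school")
print(facet_counts(schools, "country"))
print(search(DATASETS, country="US", category="education"))
```

Because every filter is an AND, narrowing by country and then category gives the same records as category and then country, and facet_counts can supply the per-value counts a browser shows next to each facet.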
  • Details and download… http://logd.tw.rpi.edu/demo/international_dataset_catalog_search
  • Research in Govt Data => Broad Data challenges
    • Trust
      • Government data is controversial, and potentially biased
        • How do we confirm or dispute?
    • Combination
      • When we combine data we need to keep the provenance of information (see trust)
        • How do we make policies explicit and sharable?
    • Scaling
      • Our project has already converted 9.9B triples from just over 2,000 of the 710,000 government databases we can identify (116 catalogs, 32 countries, 16 languages)
        • Cross-catalog
        • Cross-language
    • Versioning and updating
    • Archiving
    • Visualization
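The point under "Combination" about keeping provenance can be made concrete by attaching source names to every statement and carrying them through a merge. The following is a minimal sketch of that idea, not the project's actual pipeline. The Fact type, the load and combine helpers, the dataset names, and the values are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: object
    sources: tuple  # names of the datasets this statement came from


def load(records, source):
    """Tag each (subject, predicate, value) record with its source."""
    return [Fact(s, p, v, (source,)) for s, p, v in records]


def combine(*fact_lists):
    """Merge fact lists; identical statements pool their sources."""
    merged = {}
    for facts in fact_lists:
        for fact in facts:
            key = (fact.subject, fact.predicate, fact.value)
            seen = merged.get(key, ())
            new = tuple(s for s in fact.sources if s not in seen)
            merged[key] = seen + new
    return [Fact(s, p, v, sources) for (s, p, v), sources in merged.items()]


# Placeholder records and dataset names, for illustration only.
a = load([("region-1", "acres_burned", 100)], "dataset-a")
b = load([("region-1", "acres_burned", 100),
          ("region-1", "budget", 5)], "dataset-b")

combined = combine(a, b)
for fact in combined:
    print(fact.subject, fact.predicate, fact.value, fact.sources)
```

A statement found in two datasets ends up listing both sources, while a conflicting value stays a separate fact with its own source, so a downstream user can see who asserted what.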
  • Exploring new visualizations (data from http://littlesis.org)
  • Reaching beyond the government
  • Broad Data Goes Beyond the Govt http://linkeddata.org/
  • Broad Data Challenges
    • Finding and Using Broad Data is an emerging challenge
      • How do I find a dataset in the many out there that might be of use to me?
        • Cannot keyword search in data
      • How do I know what is in a large data store? In the cloud?
        • What is the coverage?
        • What is the access?
        • Who do I need to ask for what?
      • What are the rules about using it?
        • What can I combine it with?
        • How do downstream users know I’ve combined it?
  • Broad Data Market?
    • Significant and growing commercial interest…
      • Web: Google, Amazon, Travelocity…
      • Web 2.0: Facebook, Wikipedia, YouTube, Twitter…
      • Web 3.0: ?? (Broad Data goes here)
  • Research (and business) Opportunities
    • Broad Data is a great field for those looking for emerging opportunities
      • Tooling is needed
      • (Business) Models are just starting to emerge
      • Scalability Infrastructure is there
      • Massive Distribution (think mobile) is wide open to Web 3.0 innovation
    • Govt data gives us a place to cooperate (with public good) while exploring all of the above
  • Conclusions
    • Big data is going Broad
      • World Wide Web trend towards more and more varied data
        • In many domains
          • E-commerce, Open Govt, many more (cf. Health/Medical care)
    • Broad data requires thinking outside the “Database” box
    • Broad data opens exciting possibilities for research and innovation
      • Come play!