Scaling the (evolving) web data –at low cost-

Javier D. Fernández
QuWeDa 2017: Querying the Web of Data
Kosice, 29/05/2017

  1. Scaling the (evolving) web data –at low cost-
     Javier D. Fernández
     QuWeDa 2017: Querying the Web of Data
     Kosice, 29/05/2017
  2. A good recipe for a WS Keynote
     Ingredients (for approx. 20 persons):
     • A motivated speaker
     • Some knowledge in the area
     • An engaged audience
     • Slides (number at your convenience)
     Method:
     • Present yourself
     • Set the context, give an overall picture of the area
     • Touch some of the topics of the event
     • Focus the discussion - Sell your work
     • Devise future developments in the area
     • Mix everything with jokes
  3. About me:
     • Since 2015 @WU, Inst. for Information Business
     • Research interest: Semantic Web, Open Data, Big (Semantic) Data Management, Databases, Data Compression, Privacy and Security
     • https://www.wu.ac.at/en/infobiz/team/fernandez/
     • Madrid, Valladolid, Santiago, Rome, Vienna
     • Óscar Corcho, Pablo de la Fuente, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Maurizio Lenzerini, Axel Polleres
  4. A good recipe for a WS Keynote
     Ingredients (for approx. 20 persons):
     • A motivated speaker
     • Some knowledge in the area
     • An engaged audience
     • Slides (number at your convenience)
     Method:
     • Present yourself
     • Set the context, give an overall picture of the area
     • Touch some of the topics of the event
     • Focus the discussion - Sell your work
     • Devise future developments in the area
     • Mix everything with humour
  5. The Web of Data Ecosystem
  6. The Web of Data Ecosystem
     • First, we had better know what we can offer…
     • What is the Semantic Web/Web of Data/Linked Data?
     • Who are we? What have we done so far?
     • What haven’t we done so far?
     Linked Data, Semantic Web, Open Data, Big Data
  7. (Big Semantic Data: Linked Data vs. Big Data)
     Overlaps:
     • LD as a whole is big (38B-150B triples)
     • No rigid (e.g., relational) data model
     • Big Data technologies (e.g., Hadoop) are used to handle LD
     • LD can represent knowledge extracted from big unstructured data (especially to deal with variety)
     Key differences:
     • Individual linked data sets are typically not "big" per se (e.g., the English DBpedia dump (zip) is currently < 5 GB)
     • LD is structured and has a single data model (RDF); "big data lakes" are typically neither
     • Big Data is based on distributed data infrastructures within an organization (e.g., Hadoop clusters); LD creates a decentralized, globally distributed data infrastructure
  8. Let’s study the community…
     Survey practitioner needs, technological challenges, and open research questions on the use of Linked Data
     • Austrian FFG ICT of the Future project (exploratory study)
     • Consortium: IDC Austria, Technical University of Vienna, University of Economy Vienna, Semantic Web Company
     • Project ended in Dec 2016: https://www.linked-data.at/
     Requirements, Standards*, Literature research*
     * Special kudos to Sabrina Kirrane and Axel Polleres for the community analysis
  9. Interviews
     • 23 interviews
     • Domains: Consulting, Engineering, Environment, Finance and Insurance, Government, Healthcare, ICT, IT, Media, Pharmaceutical, Professional Services, Real Estate, Research, Startup, Tourism, Transports & Logistics
     • Roles: Business Intelligence, CEO, Chief Engineer, Data and Systems Architect, Data Scientist, Director Information Management, Enterprise Architect, Founder, General Secretary, Governance, Risk & Compliance Manager, Head of Communications and Media, Head of Development, Head of HR, Head of R&D, Innovation Manager, Information Architect, IT Project Manager, Management, Managing Director, Marketing Analyst, Principal System Analyst, Project Coordinator, Researcher, Technical Specialist
  10. Technologies in need…
      • Analytics
      • Computational linguistics & NLP
      • Concept tagging & annotation
      • Data integration
      • Data management
      • Dynamic data / streaming
      • Extraction, data mining, text mining, entity extraction
      • Logic, formal languages & reasoning
      • Human-Computer Interaction & visualization
      • Knowledge representation
      • Machine learning
      • Ontology/thesaurus/taxonomy management
      • Quality & Provenance
      • Recommendation
      • Robustness, scalability, optimization and performance
      • Searching, browsing & exploration
      • Security and privacy
      • System engineering
      We ended up with most areas of the SW
  11. Standards
  12. Standards Toolbox (incl. W3C member submissions)
  13. What can we offer? Community Analysis
      • Monitoring major SW community venues (2006-2015): ISWC (since 2006), ESWC (since 2006), SEMANTiCS (since 2007), JWS (since 2006), SWJ (since 2010)
      • 3 seminal papers
      Timeline: 2001 to 2016
  14. Topic Categorisation
  15. Topic Categorisation
      Interestingly, the same topics are “empty” in the standards
  16. Semantic Web/Linked Data over time…
      Subtopics: Expressing Meaning, Knowledge Representation, Ontologies, Agents
  17. Knowledge Representation & Reasoning
  18. Semantic Web/Linked Data over time…
      • Early adopters: MITRE, Chevron, British Telecom, Boeing, Ordnance Survey, Eli Lilly, Pfizer, Agfa, Food and Drug Administration, National Institutes of Health
      • Software adopters/products: Oracle, Adobe, Altova, OpenLink, TopQuadrant, Software AG, Aduna Software, Protégé, SAPHIRE
  19. LD Adopters - Companies
  20. LD Adopters - Companies
  21. LD Adopters - Companies
      Chart: conference sponsors that appear in papers, 2006-2015 (occurrences per company, axis from 0 to 1600): Google, Oracle, Yahoo, SAP, IEEE Intelligent Systems, Franz, Bing, Expert System, IBM Research, Poolparty
  22. To whom we can sell our technology
  23. Semantic Web/Linked Data over time…
      The authors claim that “early research has transitioned into these larger, more applied systems, today’s Semantic Web research is changing: It builds on the earlier foundations but it has generated a more diverse set of pursuits”.
  24. Big Semantic Data and applied systems
  25. Big Semantic Data and applied systems
  26. Other topics of the QuWeDa workshop
  27. A good recipe for a WS Keynote
      Ingredients (for approx. 20 persons):
      • A motivated speaker
      • Some knowledge in the area
      • An engaged audience
      • Slides (number at your convenience)
      Method:
      • Present yourself
      • Set the context, give an overall picture of the area
      • Touch some of the topics of the event
      • Focus the discussion - Sell your work
      • Devise future developments in the area
      • Mix everything with humour
  28. Motivation
      Publication, Exchange and Consumption of large RDF datasets:
      • Most RDF formats (N3, XML, Turtle) are text serializations, designed for human readability (not for machines)
      • Verbose = high costs to write/exchange/parse
      • A basic offline search = (decompress) + index the file + search
      Lightweight Binary RDF (HDT):
      • Highly compact serialization of RDF
      • Allows fast RDF retrieval in compressed space (without prior decompression)
      • Includes internal indexes to solve basic queries with a small (3%) memory footprint
      • Very fast on basic queries (triple patterns), 1.5x faster than Virtuoso and RDF3x
      • Complex queries (joins) on the same scale as current solutions (Virtuoso, RDF3x)
      DBpedia (431 M triples, ~ 63 GB): NT + gzip 5 GB; HDT 6.6 GB; HDT + gzip 2.7 GB
      rdfhdt.org
  29. The real motivation
  30. The real motivation
      “Oh man I’m hungry and I don’t even know if I will like whatever you are cooking”
      http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/
  31. The real motivation
      “Oh man I’m hungry and I don’t even know if I will like whatever you are cooking”
      consume
      http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/
  32. Applications
      • Compress and share ready-to-consume RDF datasets
      • Transfer large data between servers
      • Embedded Systems & Phones
      • Fast –low cost- SPARQL Query Engine
      • Via LDF
      • HDT-Jena
      • HDT-Cliopatra
  33. But what about Web-scale queries
      • E.g. retrieve all entities in LOD with the label “Tim Berners-Lee”
      Options:
      • Crawl and index LOD locally (-no-)
      • Follow-your-nose (where should I start?)
      • Federated querying (as good as the endpoints you query)
      • Use LOD Laundromat as a “good approximation” (still querying 650K datasets)
      select distinct ?x { ?x rdfs:label "Tim Berners-Lee" }
  34. LOD Laundromat (diagram): Linked Open Data; Dataset 1 N-Triples (zip) to Dataset 650K N-Triples (zip); SPARQL endpoint (metadata)
  35. But what about Web-scale queries
      LOD-a-lot - flashforward -
  36. But what about Web-scale queries
      But one could be really hungry: LOD-a-lot
      https://hwy55burgers.wordpress.com/tag/food-challenge/
  37. LOD Laundromat and LOD-a-lot (diagram): Linked Open Data; Dataset 1 N-Triples (zip) to Dataset 650K N-Triples (zip); SPARQL endpoint (metadata); LOD-a-lot, 28B triples
      Kudos: Javier D. Fernández, Wouter Beek, Miguel A. Martínez-Prieto, and Mario Arias
  38. LOD-a-lot (some numbers)
      Disk size:
      • HDT: 304 GB
      • HDT-FoQ (additional indexes): 133 GB
      Memory footprint (to query):
      • 15.7 GB of RAM (3% of the size)
      • 144 seconds loading time
      • 8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS
      LDF page resolution in milliseconds.
      305€
      (LOD-a-lot creation took 64 h & 170 GB RAM. HDT-FoQ took 8 h & 250 GB RAM)
  39. LOD-a-lot
      https://datahub.io/dataset/lod-a-lot
  40. LOD-a-lot (some use cases)
      • Query resolution at Web scale
      • Evaluation and Benchmarking
      • No excuse
      • RDF metrics and analytics
      subjects, predicates, objects
  41. LOD-a-lot (ACKs)
  42. A good recipe for a WS Keynote
      Ingredients (for approx. 20 persons):
      • A motivated speaker
      • Some knowledge in the area
      • An engaged audience
      • Slides (number at your convenience)
      Method:
      • Present yourself
      • Set the context, give an overall picture of the area
      • Touch some of the topics of the event
      • Focus the discussion - Sell your work
      • Devise future developments in the area
      • Mix everything with humour
  43. 1) Linked Open/Closed Data
      “Deep Semantic Web”
      Diagram: Linked Open Data Cloud, Linked Closed Data Cloud, dbpedia, and graphs G1a, G1b, G1c, G2a, G2b, G2c, G3a, G3b, G4a
  44. 1) Linked Open/Closed Data
  45. 1) Linked Open/Closed Data
      • A) Exchange: Encryption + HDT (hdtcrypt)
  46. 1) Linked Open/Closed Data
      • B) A secure LD Endpoint
      ESWC’17, THU 16:30-17:00: Self-Enforcing Access Control for Encrypted RDF. Javier D. Fernández, Sabrina Kirrane, Axel Polleres and Simon Steyskal
  47. 2) RDF evolution at Scale
      Figure: number of sources (10^0 to 10^6) against update rate (year, month, week, day, hour, minute, second), placing DBpedia, BTC, Dyldo, Internet of Things, and Virtual/Augmented Reality. LOD-a-lot: versions?
      Source: Andreas Harth, Stream Reasoning in Mixed Reality Applications, Stream Reasoning Workshop 2015
  48. 2) RDF evolution at Scale
      One of the fundamental problems in the Web of Data. Last few years:
      • Research projects: Managing the Evolution and Preservation of the Data Web (FP7), Preserving Linked Data (FP7)
      • Archives
      • Tools
      • Benchmarking: BEnchmark of RDF ARchives
  49. 3) Ontology-based Data Management
      Use case: DBpedia & SPARQL Update to maintain Wikipedia?
      Use mappings to update infoboxes and track pages that need updating.
  50. 3) Ontology-based Data Management
      Our approach to OBDM over curated sources:
      1. Ensure consistency in all cases, automatically resolve updates on a best-effort basis.
      2. Learn from existing data and from principled belief revision semantics.
         • E.g.: many football players with only one foaf:name in English DBpedia have both name and full name Infobox properties set.
      3. Record, extract and apply best / typical practices.
      Diagram: name, full_name, foaf:name. A minimal-change insert translation would only update one infobox property.
      ESWC’17, TUE 12:00-12:30: Updating Wikipedia via DBpedia Mappings and SPARQL. Albin Ahmeti, Javier D. Fernández, Axel Polleres and Vadim Savenkov
  51. A good recipe for a WS Keynote
      Ingredients (for approx. 20 persons):
      • A motivated speaker
      • Some knowledge in the area
      • An engaged audience
      • Slides (number at your convenience)
      Method:
      • Present yourself
      • Set the context, give an overall picture of the area
      • Touch some of the topics of the event
      • Focus the discussion - Sell your work
      • Devise future developments in the area
      • Mix everything with humour
  52. Thanks!
      • Big (Semantic) Data
      • Versions
      • Evolving Data
      • Encryption
      • Compression
      rdfhdt.org
      Dr. Javier D. Fernández
      Dept. of Information Systems & Operations, Institute for Information Business
      Welthandelsplatz 1, 1020 Vienna, Austria
      T +43-1-313 36-5241, F +43-1-313 36-739
      jfernand@wu.ac.at, www.ai.wu.ac.at
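
Slide 28 describes HDT as a compact serialization with internal indexes for triple patterns, and slide 33 gives the pattern `?x rdfs:label "Tim Berners-Lee"` as a Web-scale query. The Python sketch below illustrates only the general idea of replacing RDF terms with integer IDs and matching a triple pattern over the encoded triples. It is not the HDT format and not an rdfhdt.org API: the class `TinyTripleStore`, its methods, and the sample `example.org` triples are hypothetical names introduced for illustration.

```python
"""Toy dictionary-encoded triple store (illustration only, NOT the HDT format)."""
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class TinyTripleStore:
    """Hypothetical helper: maps each RDF term to an integer ID and stores ID triples."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}   # term -> ID
        self._terms: List[str] = []      # ID -> term
        self._triples: Set[Tuple[int, int, int]] = set()  # a set drops duplicates

    def _encode(self, term: str) -> int:
        # Assign the next free ID the first time a term is seen.
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((self._encode(s), self._encode(p), self._encode(o)))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching a pattern; None plays the role of a variable."""
        pattern: List[Optional[int]] = []
        for term in (s, p, o):
            if term is None:
                pattern.append(None)
            elif term in self._ids:
                pattern.append(self._ids[term])
            else:
                return  # a bound term that was never stored cannot match
        # Linear scan over integer triples; a real system would use indexes here.
        for triple in self._triples:
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                yield (self._terms[triple[0]],
                       self._terms[triple[1]],
                       self._terms[triple[2]])


RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

store = TinyTripleStore()
store.add("http://example.org/timbl", RDFS_LABEL, '"Tim Berners-Lee"')
store.add("http://example.org/timbl", RDFS_LABEL, '"Tim Berners-Lee"')  # duplicate
store.add("http://example.org/other", RDFS_LABEL, '"Someone Else"')

# Same spirit as: select distinct ?x { ?x rdfs:label "Tim Berners-Lee" }
subjects = sorted({s for s, _, _ in
                   store.match(p=RDFS_LABEL, o='"Tim Berners-Lee"')})
print(subjects)
```

Comparing small integers instead of long IRI strings is what keeps the matching loop cheap in this toy version. The sketch scans every triple, whereas the slides attribute HDT's speed to internal indexes over a compressed representation, which this example does not attempt to reproduce.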

Editor's Notes

  • After some years of pushing for the Web of Data, now should be the moment to look at the ecosystem and think about what we have done so far, and what we haven’t done so far
  • Outlines quite clearly what they thought back then the Semantic Web should be…
  • LEDS: Linked Enterprise Data Services
