Scaling the (evolving) web data
–at low cost-
Javier D. Fernández
QuWeDa 2017: Querying the Web of Data
Kosice, 29/05/2017
A good recipe for a WS Keynote
Ingredients (for approx. 20 persons)
• A motivated speaker
• Some knowledge in the area
• An engaged audience
• Slides (number at your convenience)
Method
• Present yourself
• Set the context, give an overall picture of the area
• Touch some of the topics of the event
• Focus the discussion - Sell your work
• Devise future developments in the area
• Mix everything with jokes
About me:
• Since 2015 @WU, Inst. for Information Business
• Research interests: Semantic Web, Open Data, Big (Semantic) Data Management, Databases, Data Compression, Privacy and Security
• https://www.wu.ac.at/en/infobiz/team/fernandez/
Madrid, Valladolid, Santiago, Rome: Óscar Corcho, Pablo de la Fuente, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Maurizio Lenzerini
Vienna: Axel Polleres
The Web of Data Ecosystem
• First, we had better know what we can offer…
• What is the Semantic Web/Web of Data/Linked Data?
• Who are we? What have we done so far?
• What haven’t we done so far?
[Figure: Linked Data, Semantic Web, Open Data, Big Data]
(Big Semantic Data: Linked Data vs. Big Data)
• Overlaps:
  • LD as a whole is big (38B-150B triples)
  • No rigid (e.g., relational) data model
  • Big Data technologies (e.g., Hadoop) are used to handle LD
  • LD can represent knowledge extracted from big unstructured data (especially to deal with variety)
• Key Differences:
  • Individual linked data sets are typically not "big" per se (e.g., the English DBpedia dump (zip) is currently < 5 GB)
  • LD is structured and has a single data model (RDF); "big data lakes" are typically neither
  • Big data is based on distributed data infrastructures within an organization (e.g., Hadoop clusters); LD creates a decentralized, globally distributed data infrastructure
Let’s study the community…
Survey practitioner needs, technological challenges, and open research questions on the use of Linked Data
• Austrian FFG ICT of the Future project (exploratory study)
• Consortium: IDC Austria, Technical University of Vienna, Vienna University of Economics and Business (WU), Semantic Web Company
• Project ended in Dec 2016: https://www.linked-data.at/
Requirements, Standards*, Literature research*
* Special kudos to Sabrina Kirrane and Axel Polleres for the community analysis
Interviews
• 23 interviews:
  • Domains: Consulting, Engineering, Environment, Finance and Insurance, Government, Healthcare, ICT, IT, Media, Pharmaceutical, Professional Services, Real Estate, Research, Startup, Tourism, Transport & Logistics
  • Roles: Business Intelligence, CEO, Chief Engineer, Data and Systems Architect, Data Scientist, Director Information Management, Enterprise Architect, Founder, General Secretary, Governance, Risk & Compliance Manager, Head of Communications and Media, Head of Development, Head of HR, Head of R&D, Innovation Manager, Information Architect, IT Project Manager, Management, Managing Director, Marketing Analyst, Principal System Analyst, Project Coordinator, Researcher, Technical Specialist
Technologies in need…
• Analytics
• Computational linguistics & NLP
• Concept tagging & annotation
• Data integration
• Data management
• Dynamic data / streaming
• Extraction, data mining, text mining, entity extraction
• Logic, formal languages & reasoning
• Human-Computer Interaction & visualization
• Knowledge representation
• Machine learning
• Ontology/thesaurus/taxonomy management
• Quality & Provenance
• Recommendation
• Robustness, scalability, optimization and performance
• Searching, browsing & exploration
• Security and privacy
• System engineering
We ended up with most areas of the SW
Standards
Standards Toolbox (incl. W3C member submissions)
What can we offer?
Community Analysis
• Monitoring the SW community's major venues (2006-2015):
  • ISWC (since 2006), ESWC (since 2006), SEMANTiCS (since 2007), JWS (since 2006), SWJ (since 2010)
• 3 seminal papers
[Figure: timeline, 2001 to 2016]
Topic Categorisation
Interestingly, the same “empty” topics as in the standards
Semantic Web/Linked Data over time…
Subtopics: Expressing Meaning, Knowledge Representation, Ontologies, Agents
Knowledge Representation & Reasoning
Semantic Web/Linked Data over time…
Early adopters: MITRE, Chevron, British Telecom, Boeing, Ordnance Survey, Eli Lilly, Pfizer, Agfa, Food and Drug Administration, National Institutes of Health
Software adopters/products: Oracle, Adobe, Altova, OpenLink, TopQuadrant, Software AG, Aduna Software, Protégé, SAPHIRE
LD Adopters - Companies
[Figure: bar chart “Conference Sponsors that appear in papers 2006-2015”; y-axis: Occurrences (0 to 1600); x-axis: Companies (Google, Oracle, Yahoo, SAP, IEEE Intelligent Systems, Franz, Bing, Expert System, IBM Research, Poolparty)]
To whom we can sell our technology
Semantic Web/Linked Data over
time…
The authors claim that “[as the] early research has transitioned into these larger, more applied systems, today’s Semantic Web research is changing: It builds on the earlier foundations but it has generated a more diverse set of pursuits”.
Big Semantic Data and applied
systems
Other topics of the QuWeDa
workshop
Motivation
• Publication, Exchange and Consumption of large RDF datasets
  • Most RDF formats (N3, RDF/XML, Turtle) are text serializations, designed for human readability (not for machines)
  • Verbose = high costs to write/exchange/parse
  • A basic offline search = (decompress) + index the file + search
• Lightweight Binary RDF (HDT)
  • Highly compact serialization of RDF
  • Allows fast RDF retrieval in compressed space (without prior decompression)
  • Includes internal indexes to solve basic queries with a small (3%) memory footprint
  • Very fast on basic queries (triple patterns): about 1.5x faster than Virtuoso and RDF-3X
  • Complex queries (joins) on the same scale as current solutions (Virtuoso, RDF-3X)
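To make the idea of a compact, queryable serialization more concrete, here is a minimal Python sketch. It is not the HDT format and not the rdfhdt.org API; the class `TinyStore` and its methods are hypothetical names used only for illustration. It shows two ingredients in the spirit of the points above: terms are replaced by integer IDs (a dictionary), and the ID triples are kept sorted so that a triple pattern with a bound subject can be located by binary search rather than a full scan.

```python
from bisect import bisect_left


class TinyStore:
    """Toy dictionary-encoded triple store (illustration only, not HDT)."""

    def __init__(self, triples):
        # Dictionary: every distinct term gets one integer ID.
        terms = sorted({t for triple in triples for t in triple})
        self.term_to_id = {term: i for i, term in enumerate(terms)}
        self.id_to_term = terms
        # Triples: stored once as sorted, de-duplicated ID tuples (SPO order).
        self.ids = sorted({tuple(self.term_to_id[t] for t in triple)
                           for triple in triples})

    def search(self, s=None, p=None, o=None):
        """Yield triples matching a pattern; None acts as a wildcard."""
        try:
            want = tuple(None if t is None else self.term_to_id[t]
                         for t in (s, p, o))
        except KeyError:
            return  # an unknown term cannot match anything
        start = 0
        if want[0] is not None:
            # Bound subject: jump to its first triple with binary search.
            start = bisect_left(self.ids, (want[0],))
        for triple in self.ids[start:]:
            if want[0] is not None and triple[0] != want[0]:
                break  # left the run of triples for this subject
            if all(w is None or w == v for w, v in zip(want, triple)):
                yield tuple(self.id_to_term[v] for v in triple)


# Made-up example data.
store = TinyStore([
    ("ex:tim", "rdfs:label", '"Tim Berners-Lee"'),
    ("ex:tim", "rdf:type", "foaf:Person"),
    ("ex:web", "ex:inventedBy", "ex:tim"),
])
labels = list(store.search(p="rdfs:label"))
about_tim = list(store.search(s="ex:tim"))
```

Patterns without a bound subject still scan every triple in this sketch; a real system would keep further indexes for those cases.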
DBpedia (431 M triples, ~63 GB):
• NT + gzip: 5 GB
• HDT: 6.6 GB
• HDT + gzip: 2.7 GB
rdfhdt.org
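As a quick worked check of the DBpedia figures, the short Python snippet below turns the sizes into compression ratios. The numbers are copied from the slide; the helper name `ratio` is only illustrative.

```python
# Sizes in GB, copied from the DBpedia comparison on the slide.
raw_gb = 63.0
sizes_gb = {"NT + gzip": 5.0, "HDT": 6.6, "HDT + gzip": 2.7}


def ratio(raw, compressed):
    """How many times smaller the compressed form is than the raw dump."""
    return raw / compressed


ratios = {name: round(ratio(raw_gb, gb), 1) for name, gb in sizes_gb.items()}
for name, r in ratios.items():
    print(f"{name}: {r}x smaller than the raw dump")
```

Plain HDT comes out slightly larger than the gzipped N-Triples file, but according to the slides it can be queried without prior decompression, and gzipping the HDT file gives the smallest of the three.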
The real motivation
[Image: http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/]
“Oh man, I’m hungry and I don’t even know if I will like whatever you are cooking”
consume
Applications
• Compress and share ready-to-consume RDF datasets
• Transfer large data between servers
• Embedded Systems & Phones
• Fast, low-cost SPARQL Query Engine
  • Via LDF
  • HDT-Jena
  • HDT-ClioPatria
But what about Web-scale queries?
• E.g., retrieve all entities in LOD with the label “Tim Berners-Lee”
• Options:
  • Crawl and index LOD locally (-no-)
  • Follow-your-nose (where should I start?)
  • Federated querying (as good as the endpoints you query)
  • Use LOD Laundromat as a “good approximation” (still querying 650K datasets)
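# All entities with the label "Tim Berners-Lee"
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?x WHERE {
  ?x rdfs:label "Tim Berners-Lee"
}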
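The options above differ mainly in where the triples live when the pattern is evaluated. As a rough sketch in Python (the dataset names and contents are made up, and plain lists stand in for files or endpoints), evaluating the label pattern over many separate datasets means one lookup per dataset plus a union of the answers:

```python
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical miniature "datasets"; real ones would be files or endpoints.
datasets = {
    "dataset-1": [("http://example.org/a/timbl", LABEL, "Tim Berners-Lee")],
    "dataset-2": [("http://example.org/b/TBL", LABEL, "Tim Berners-Lee"),
                  ("http://example.org/b/web", LABEL, "World Wide Web")],
    "dataset-3": [("http://example.org/a/timbl", LABEL, "Tim Berners-Lee")],
}


def entities_with_label(triples, label):
    """Evaluate the pattern (?x, rdfs:label, label) over one list of triples."""
    return {s for s, p, o in triples if p == LABEL and o == label}


def query_all(all_datasets, label):
    """Ask every dataset in turn and union the answers (like SELECT DISTINCT)."""
    found = set()
    for triples in all_datasets.values():
        found |= entities_with_label(triples, label)
    return sorted(found)


result = query_all(datasets, "Tim Berners-Lee")
```

With three toy datasets this is trivial; with 650K datasets the per-dataset step dominates, which is the cost that a single pre-merged collection is meant to avoid.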
[Figure: LOD Laundromat takes Linked Open Data and publishes 650K datasets (Dataset 1 … Dataset 650K), each as N-Triples (zip), plus a SPARQL endpoint for metadata]
LOD-a-lot
- flashforward -
But one could be really hungry
[Image: https://hwy55burgers.wordpress.com/tag/food-challenge/]
LOD-a-lot
[Figure: Linked Open Data → LOD Laundromat (SPARQL endpoint for metadata; Dataset 1 … Dataset 650K as N-Triples (zip)) → LOD-a-lot, 28B triples]
Kudos: Javier D. Fernández, Wouter Beek, Miguel A. Martínez-Prieto, and Mario Arias
LOD-a-lot (some numbers)
Disk size:
• HDT: 304 GB
• HDT-FoQ (additional indexes): 133 GB
Memory footprint (to query):
• 15.7 GB of RAM (3% of the size)
• 144 seconds loading time
• 8 cores (2.6 GHz), 32 GB RAM, SATA HDD, Ubuntu 14.04.5 LTS
LDF page resolution in milliseconds.
305€
(LOD-a-lot creation took 64 h and 170 GB of RAM; HDT-FoQ took 8 h and 250 GB of RAM)
LOD-a-lot
https://datahub.io/dataset/lod-a-lot
LOD-a-lot (some use cases)
• Query resolution at Web scale
• Evaluation and Benchmarking
• No excuse :-)
• RDF metrics and analytics
[Figure: subjects, predicates, objects]
LOD-a-lot (ACKs)
1) Linked Open/Closed Data
[Figure: Linked Open Data Cloud (dbpedia) and Linked Closed Data Cloud, with graphs G1a, G1b, G1c, G2a, G2b, G2c, G3a, G3b, G4a]
“Deep Semantic Web”
• A) Exchange: Encryption + HDT (hdtcrypt)
• B) A secure LD Endpoint
ESWC’17, THU 16:30-17:00: Self-Enforcing Access Control for Encrypted RDF. Javier D. Fernández, Sabrina Kirrane, Axel Polleres and Simon Steyskal
2) RDF evolution at Scale
[Figure, after Andreas Harth, “Stream Reasoning in Mixed Reality Applications”, Stream Reasoning Workshop 2015: number of sources (10^0 to 10^6) against update rate (year, month, week, day, hour, minute, second), locating DBpedia, BTC, Dyldo, Internet of Things, Virtual/Augmented Reality and LOD-a-lot, with the annotation “versions?”]
In the last few years:
• Research projects: Managing the Evolution and Preservation of the Data Web (FP7), Preserving Linked Data (FP7)
• Archives
• Tools
• Benchmarking: BEnchmark of RDF ARchives
One of the fundamental problems in the Web of Data
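One way to think about versions of evolving RDF data is to store only what changed between two snapshots. The following is a minimal sketch of such change-based versioning with Python sets. The function names and the example triples (including the population values) are made up for illustration, and the sketch is not tied to any of the projects or benchmarks named in the slides.

```python
def diff(old, new):
    """Return (added, deleted) triples between two versions."""
    return new - old, old - new


def apply_delta(version, added, deleted):
    """Rebuild the next version from the previous one and a delta."""
    return (version - deleted) | added


# Two hypothetical snapshots of the same small graph.
v1 = {("ex:vienna", "ex:population", "1840000"),
      ("ex:vienna", "ex:country", "ex:austria")}
v2 = {("ex:vienna", "ex:population", "1868000"),
      ("ex:vienna", "ex:country", "ex:austria")}

added, deleted = diff(v1, v2)
rebuilt = apply_delta(v1, added, deleted)
```

Storing `v1` plus the two one-triple sets is enough to reproduce `v2`. The trade-off is that reaching a late version requires replaying every delta before it.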
3) Ontology-based Data Management
Use case: DBpedia & SPARQL Update to maintain Wikipedia?
Use mappings to update infoboxes and track pages that need updating.
Our approach to OBDM over curated sources
1. Ensure consistency in all cases, automatically resolve updates on a best-effort basis.
2. Learn from existing data and from principled belief revision semantics.
   • E.g.: many football players with only one foaf:name in English DBpedia have both the name and full_name infobox properties set.
3. Record, extract and apply best / typical practices.
[Figure: the infobox properties name and full_name both map to foaf:name]
A minimal-change insert translation would only update one infobox property.
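The idea of a minimal-change translation can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the method of the paper cited here: the mapping table, the function `translate_insert`, the example names and the tie-breaking rule (fill the first mapped infobox property that is still empty) are all hypothetical.

```python
# Assumed mapping: several infobox properties map to one RDF property.
MAPPINGS = {"foaf:name": ["name", "full_name"]}


def translate_insert(infobox, rdf_property, value):
    """Translate 'insert rdf_property value' into at most one infobox update.

    Minimal change: touch exactly one infobox property, and none at all
    if the value is already present under a mapped property.
    """
    candidates = MAPPINGS[rdf_property]
    if any(infobox.get(prop) == value for prop in candidates):
        return {}  # nothing to do, the triple already holds
    for prop in candidates:
        if prop not in infobox:
            return {prop: value}  # fill the first empty mapped property
    # All mapped properties are taken: overwrite only the first one
    # (an arbitrary choice made for this sketch).
    return {candidates[0]: value}


infobox = {"name": "Alex Example"}
update = translate_insert(infobox, "foaf:name", "Alex Sample Example")
```

Here `update` touches only `full_name` and leaves the existing `name` value alone. A learned policy, as point 2 suggests, could replace the fixed tie-breaking rule.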
ESWC’17, TUE 12:00-12:30: Updating Wikipedia via DBpedia Mappings and SPARQL. Albin Ahmeti, Javier D. Fernández, Axel Polleres and Vadim Savenkov
Dept. of Information Systems & Operations
Institute for Information Business
Welthandelsplatz 1, 1020 Vienna, Austria
Dr. Javier D. Fernández
T +43-1-313 36-5241
F +43-1-313 36-739
jfernand@wu.ac.at
www.ai.wu.ac.at
Thanks!
• Big (Semantic) Data
• Versions
• Evolving Data
• Encryption
• Compression
rdfhdt.org

More Related Content

What's hot

Another RDF Encoding Form
Another RDF Encoding FormAnother RDF Encoding Form
Another RDF Encoding FormJakob .
 
Introduction To RDF and RDFS
Introduction To RDF and RDFSIntroduction To RDF and RDFS
Introduction To RDF and RDFSNilesh Wagmare
 
Virtuoso -- The Prometheus of RDF
Virtuoso -- The Prometheus of RDFVirtuoso -- The Prometheus of RDF
Virtuoso -- The Prometheus of RDFOpenLink Software
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogsandrea huang
 
The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedSören Auer
 
Applications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and ClassificationApplications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and Classificationshakimov
 
Verifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNetVerifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNetAlexandre Rademaker
 
Wi2015 - Clustering of Linked Open Data - the LODeX tool
Wi2015 - Clustering of Linked Open Data - the LODeX toolWi2015 - Clustering of Linked Open Data - the LODeX tool
Wi2015 - Clustering of Linked Open Data - the LODeX toolLaura Po
 
Linked Open Data Visualization
Linked Open Data VisualizationLinked Open Data Visualization
Linked Open Data VisualizationLaura Po
 
DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World." DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World." Avalon Media System
 
morph-LDP: An R2RML-based Linked Data Platform implementation
morph-LDP: An R2RML-based Linked Data Platform implementationmorph-LDP: An R2RML-based Linked Data Platform implementation
morph-LDP: An R2RML-based Linked Data Platform implementationNandana Mihindukulasooriya
 

What's hot (16)

NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
 
Another RDF Encoding Form
Another RDF Encoding FormAnother RDF Encoding Form
Another RDF Encoding Form
 
5 rdfs
5 rdfs5 rdfs
5 rdfs
 
Introduction To RDF and RDFS
Introduction To RDF and RDFSIntroduction To RDF and RDFS
Introduction To RDF and RDFS
 
Virtuoso -- The Prometheus of RDF
Virtuoso -- The Prometheus of RDFVirtuoso -- The Prometheus of RDF
Virtuoso -- The Prometheus of RDF
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs
 
The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge stripped
 
Applications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and ClassificationApplications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and Classification
 
Verifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNetVerifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNet
 
Wi2015 - Clustering of Linked Open Data - the LODeX tool
Wi2015 - Clustering of Linked Open Data - the LODeX toolWi2015 - Clustering of Linked Open Data - the LODeX tool
Wi2015 - Clustering of Linked Open Data - the LODeX tool
 
Linked Open Data Visualization
Linked Open Data VisualizationLinked Open Data Visualization
Linked Open Data Visualization
 
DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World." DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World."
 
morph-LDP: An R2RML-based Linked Data Platform implementation
morph-LDP: An R2RML-based Linked Data Platform implementationmorph-LDP: An R2RML-based Linked Data Platform implementation
morph-LDP: An R2RML-based Linked Data Platform implementation
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Fedora Migration Considerations
Fedora Migration ConsiderationsFedora Migration Considerations
Fedora Migration Considerations
 
LD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and toolsLD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and tools
 

Similar to Scaling the (evolving) web data –at low cost-

Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) SkillsOscar Corcho
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic WebIvan Herman
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021Gérard Dupont
 
Intro to Digitization Projects
Intro to Digitization ProjectsIntro to Digitization Projects
Intro to Digitization Projectszsrlibrary
 
Presentation on BigData by Swapnaja
Presentation on BigData by Swapnaja Presentation on BigData by Swapnaja
Presentation on BigData by Swapnaja Swapnaja Tandale
 
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseDebunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseStavros Papadopoulos
 
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...datascienceiqss
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)Stefan Dietze
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big DataLewis Crawford
 

Similar to Scaling the (evolving) web data –at low cost- (20)

Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) Skills
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
 
Hadoop
HadoopHadoop
Hadoop
 
Semantic Web in Action
Semantic Web in ActionSemantic Web in Action
Semantic Web in Action
 
Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of ...
Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of ...Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of ...
Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of ...
 
Intro to Digitization Projects
Intro to Digitization ProjectsIntro to Digitization Projects
Intro to Digitization Projects
 
Presentation on BigData by Swapnaja
Presentation on BigData by Swapnaja Presentation on BigData by Swapnaja
Presentation on BigData by Swapnaja
 
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseDebunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
 
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
The Future of LOD
The Future of LODThe Future of LOD
The Future of LOD
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
 

Recently uploaded

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 

Recently uploaded (20)

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 

Scaling the (evolving) web data –at low cost-

  • 1. Scaling the (evolving) web data –at low cost- Javier D. Fernández QuWeDa 2017: Querying the Web of Data Kosice, 29/05/2017
  • 2. A good recipe for a WS Keynote Ingredients (for approx. 20 persons)  A motivated speaker  Some knowledge in the area  An engaged audience  Slides (number at your convenience) Method  Present yourself  Set the context, give an overall picture of the area  Touch some of the topics of the event  Focus the discussion- Sell your work  Devise future developments in the area • Mix everything with jokes
  • 3. About me:  since 2015 @WU, Inst. for Information Business Research interest: Semantic Web, Open Data, Big (Semantic) Data Management, Databases, Data Compression, Privacy and Security  https://www.wu.ac.at/en/infobiz/team/fernandez/ MadridValladolid Santiago Rome 3 Óscar CorchoPablo de la Fuente Miguel A. Martínez-Prieto Claudio Gutiérrez Maurizio Lenzerini Vienna Axel Polleres
  • 4. A good recipe for a WS Keynote Ingredients (for approx. 20 persons)  A motivated speaker  Some knowledge in the area  An engaged audience  Slides (number at your convenience) Method  Present yourself  Set the context, give an overall picture of the area  Touch some of the topics of the event  Focus the discussion- Sell your work  Devise future developments in the area • Mix everything with humour
  • 5. The Web of Data Ecosystem
  • 6. The Web of Data Ecosystem  First, we had better know what we can offer…  What is the Semantic Web/Web of Data/Linked Data?  Who are we? What have we done so far?  What haven't we done so far? Linked Data, Semantic Web, Open Data, Big Data
  • 7. (Big Semantic Data: Linked Data vs. Big Data)  Overlaps:  LD as a whole is big (38B-150B triples)  No rigid (e.g., relational) data model  Big Data technologies (e.g., Hadoop) are used to handle LD  LD can represent knowledge extracted from big unstructured data (especially to deal with variety)  Key Differences:  Individual linked data sets are typically not "big" per se (e.g., English DBpedia dump (zip) currently < 5 GB)  LD is structured and has a single data model (RDF); "big data lakes" are typically neither  Big data is based on distributed data infrastructures within an organization (e.g., Hadoop clusters); LD creates a decentralized, globally distributed data infrastructure
  • 8. Let’s study the community… Survey practitioner needs, technological challenges, and open research questions on the use of Linked Data  Austrian FFG ICT of the Future project (exploratory study)  Consortium: IDC Austria, Technical University of Vienna, Vienna University of Economics and Business (WU), Semantic Web Company  Project ended in Dec 2016: https://www.linked-data.at/ Inputs: Standards*, Requirements, Literature research* * Special kudos to Sabrina Kirrane and Axel Polleres for the community analysis
  • 9. Interviews  23 interviews:  Domains  Consulting, Engineering, Environment, Finance and Insurance, Government, Healthcare, ICT, IT, Media, Pharmaceutical, Professional Services, Real Estate, Research, Startup, Tourism, Transports & Logistics  Roles  Business Intelligence, CEO, Chief Engineer, Data and Systems Architect, Data Scientist, Director Information Management, Enterprise Architect, Founder, General Secretary, Governance, Risk & Compliance Manager, Head of Communications and Media, Head of Development, Head of HR, Head of R&D, Innovation Manager, Information Architect, IT Project Manager, Management, Managing director, Marketing Analyst, Principal System Analyst, Project Coordinator, Researcher, Technical Specialist
  • 10. Technologies in need… Analytics; Computational linguistics & NLP; Concept tagging & annotation; Data integration; Data management; Dynamic data / streaming; Extraction, data mining, text mining, entity extraction; Logic, formal languages & reasoning; Human-Computer Interaction & visualization; Knowledge representation; Machine learning; Ontology/thesaurus/taxonomy management; Quality & Provenance; Recommendation; Robustness, scalability, optimization and performance; Searching, browsing & exploration; Security and privacy; System engineering. We ended up with most areas of the SW
  • 12. Standards Toolbox (incl. W3C member submissions)
  • 16. What can we offer? Community Analysis  Monitoring SW community major venues (2006-2015):  ISWC (since 2006), ESWC (since 2006), SEMANTiCS (since 2007), JWS (since 2006), SWJ (since 2010)  3 seminal papers [Timeline figure: 2001-2016]
  • 18. Topic Categorisation Interestingly, the same “empty” topics in standards
  • 19. Semantic Web/Linked Data over time… Subtopics: Expressing Meaning Knowledge Representation Ontologies Agents
  • 21. Semantic Web/Linked Data over time… Early adopters: MITRE, Chevron, British Telecom, Boeing, Ordnance Survey, Eli Lilly, Pfizer, Agfa, Food and Drug Administration, National Institutes of Health. Software adopters/products: Oracle, Adobe, Altova, OpenLink, TopQuadrant, Software AG, Aduna Software, Protégé, SAPHIRE
  • 22. LD Adopters - Companies
  • 23. LD Adopters - Companies
  • 24. LD Adopters - Companies [Bar chart: conference sponsors that appear in papers, 2006-2015. Axes: companies vs. occurrences (0-1600). Companies shown: Google, Oracle, Yahoo, SAP, IEEE Intelligent Systems, Franz, Bing, Expert System, IBM Research, PoolParty]
  • 25. To whom we can sell our technology
  • 26. Semantic Web/Linked Data over time… The authors claim that “early research has transitioned into these larger, more applied systems, today’s Semantic Web research is changing: It builds on the earlier foundations but it has generated a more diverse set of pursuits”.
  • 27. Big Semantic Data and applied systems
  • 28. Big Semantic Data and applied systems
  • 29. Other topics of the QuWeDa workshop
  • 30. A good recipe for a WS Keynote Ingredients (for approx. 20 persons)  A motivated speaker  Some knowledge in the area  An engaged audience  Slides (number at your convenience) Method  Present yourself  Set the context, give an overall picture of the area  Touch some of the topics of the event  Focus the discussion- Sell your work  Devise future developments in the area • Mix everything with humour
  • 31. Motivation  Publication, Exchange and Consumption of large RDF datasets  Most RDF formats (N3, RDF/XML, Turtle) are text serializations, designed for human readability (not for machines)  Verbose = High costs to write/exchange/parse  A basic offline search = (decompress) + index the file + search  Lightweight Binary RDF (HDT)  Highly compact serialization of RDF  Allows fast RDF retrieval in compressed space (without prior decompression)  Includes internal indexes to solve basic queries with a small (3%) memory footprint.  Very fast on basic queries (triple patterns), about 1.5x faster than Virtuoso and RDF-3X.  Complex queries (joins) on the same scale as current solutions (Virtuoso, RDF-3X).  Example, DBpedia (431 M triples): N-Triples ~63 GB; NT + gzip 5 GB; HDT 6.6 GB; HDT + gzip 2.7 GB. rdfhdt.org
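The idea on slide 31 of searching RDF "in compressed space" can be illustrated with a toy dictionary encoding. The sketch below is not the HDT format or its API: it assumes a single shared dictionary and a plain sorted list of ID triples, and the `ex:` terms and function names are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
IdTriple = Tuple[int, int, int]


def build_dictionary(triples: List[Triple]) -> Dict[str, int]:
    """Assign each distinct term a small integer ID (1-based, sorted order)."""
    terms = sorted({term for triple in triples for term in triple})
    return {term: i for i, term in enumerate(terms, start=1)}


def encode(triples: List[Triple], ids: Dict[str, int]) -> List[IdTriple]:
    """Replace every term by its ID and sort, so equal subjects are adjacent."""
    return sorted({(ids[s], ids[p], ids[o]) for s, p, o in triples})


def match(encoded: List[IdTriple], ids: Dict[str, int],
          s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Resolve one triple pattern on the ID triples; None is a wildcard."""
    pattern: List[Optional[int]] = []
    for term in (s, p, o):
        if term is None:
            pattern.append(None)
        elif term in ids:
            pattern.append(ids[term])
        else:
            return []  # a bound term that never occurs cannot match
    terms = {i: t for t, i in ids.items()}  # inverse dictionary for decoding
    return [
        (terms[a], terms[b], terms[c])
        for a, b, c in encoded
        if all(want is None or want == got
               for want, got in zip(pattern, (a, b, c)))
    ]


# Hypothetical sample data (not taken from the slides).
TRIPLES: List[Triple] = [
    ("ex:tim", "rdfs:label", '"Tim Berners-Lee"'),
    ("ex:tim", "rdf:type", "foaf:Person"),
    ("ex:www", "dc:creator", "ex:tim"),
]
ids = build_dictionary(TRIPLES)
encoded = encode(TRIPLES, ids)
print(match(encoded, ids, p="rdfs:label"))
```

Each long term is stored once and every triple shrinks to three integers; the pattern is compared on IDs and only the matches are decoded back to strings. This sketch scans linearly, whereas a real compact format would add indexes to avoid that.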
  • 33. The real motivation http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/ Oh man I’m hungry and I don’t even know if I will like whatever you are cooking
  • 34. The real motivation http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/ Oh man I’m hungry and I don’t even know if I will like whatever you are cooking consume
  • 35. Applications  Compress and share ready-to-consume RDF datasets  Transfer large data between servers  Embedded Systems & Phones  Fast –low cost- SPARQL Query Engine  Via LDF  HDT-Jena  HDT-Cliopatra
  • 36. But what about Web-scale queries?  E.g. retrieve all entities in LOD with the label “Tim Berners-Lee”  Options:  Crawl and index LOD locally (no)  Follow-your-nose (where should I start?)  Federated querying (only as good as the endpoints you query)  Use LOD Laundromat as a “good approximation” (still querying 650K datasets)  Query: select distinct ?x { ?x rdfs:label "Tim Berners-Lee" }
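To make the cost of the "still querying 650K datasets" option on slide 36 concrete, here is a minimal sketch of what the label query amounts to when the data stays split across many datasets: the same triple pattern is evaluated per dataset and the subjects are deduplicated, as `select distinct` does. The dataset names, the `example.org` URIs and the function name are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def entities_with_label(datasets: Dict[str, Iterable[Triple]],
                        label: str) -> List[str]:
    """Evaluate { ?x rdfs:label <label> } on every dataset, keep distinct ?x."""
    found: Set[str] = set()
    for _name, triples in datasets.items():   # one pass per dataset
        for s, p, o in triples:
            if p == RDFS_LABEL and o == label:
                found.add(s)                  # the set gives DISTINCT semantics
    return sorted(found)


# Hypothetical stand-ins for two of the many crawled datasets.
DATASETS: Dict[str, List[Triple]] = {
    "dataset-1": [
        ("http://example.org/a/TimBL", RDFS_LABEL, "Tim Berners-Lee"),
        ("http://example.org/a/WWW", RDFS_LABEL, "World Wide Web"),
    ],
    "dataset-2": [
        ("http://example.org/b/person/1", RDFS_LABEL, "Tim Berners-Lee"),
        ("http://example.org/a/TimBL", RDFS_LABEL, "Tim Berners-Lee"),
    ],
}
print(entities_with_label(DATASETS, "Tim Berners-Lee"))
```

The sketch compares literals as plain strings. In SPARQL, a language-tagged literal such as "Tim Berners-Lee"@en is a different term from the plain literal in the query, so a real query may need to account for language tags.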
  • 38. But what about Web-scale queries? LOD-a-lot - flashforward -
  • 39. But what about Web-scale queries? But one could be really hungry https://hwy55burgers.wordpress.com/tag/food-challenge/ LOD-a-lot
  • 40. LOD-a-lot [Diagram: Linked Open Data is collected by LOD Laundromat (Dataset 1 to Dataset 650K, each as N-Triples (zip), plus a SPARQL endpoint for metadata) and merged into LOD-a-lot, 28B triples] Kudos: Javier D. Fernandez, Wouter Beek, Miguel A. Martínez-Prieto, and Mario Arias
  • 41. LOD-a-lot (some numbers) Disk size:  HDT: 304 GB  HDT-FoQ (additional indexes): 133 GB Memory footprint (to query):  15.7 GB of RAM (3% of the size)  144 seconds loading time  8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS  LDF page resolution in milliseconds. 305€ (LOD-a-lot creation took 64 h & 170 GB RAM. HDT-FoQ took 8 h & 250 GB RAM)
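The disk sizes on slide 41 are easier to judge per triple. The short calculation below uses only figures from the slides (28B triples from slide 40, 304 GB and 133 GB from slide 41) and assumes that "GB" means decimal gigabytes, which the slides do not state.

```python
GB = 10**9  # assumption: decimal gigabytes


def bytes_per_triple(size_gb: float, triples: float) -> float:
    """Average on-disk bytes spent on each triple."""
    return size_gb * GB / triples


LOD_A_LOT_TRIPLES = 28e9  # "28B triples" (slide 40)
HDT_GB = 304              # HDT file (slide 41)
INDEX_GB = 133            # HDT-FoQ additional indexes (slide 41)

core = bytes_per_triple(HDT_GB, LOD_A_LOT_TRIPLES)
with_index = bytes_per_triple(HDT_GB + INDEX_GB, LOD_A_LOT_TRIPLES)

print(f"HDT only:    {core:.1f} bytes/triple")        # 10.9
print(f"HDT + index: {with_index:.1f} bytes/triple")  # 15.6
```

Under that assumption, the whole collection costs roughly 11 bytes per triple on disk, or about 16 bytes once the additional indexes are included.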
  • 43. LOD-a-lot (some use cases)  Query resolution at Web scale  Evaluation and Benchmarking  No excuse  RDF metrics and analytics [Figure panels: subjects, predicates, objects]
  • 45. A good recipe for a WS Keynote Ingredients (for approx. 20 persons)  A motivated speaker  Some knowledge in the area  An engaged audience  Slides (number at your convenience) Method  Present yourself  Set the context, give an overall picture of the area  Touch some of the topics of the event  Focus the discussion- Sell your work  Devise future developments in the area • Mix everything with humour
  • 46. 1) Linked Open/Closed Data: “Deep Semantic Web” [Diagram: Linked Open Data Cloud (dbpedia) and Linked Closed Data Cloud, with graphs G1a, G2a, G3a, G4a, G1b, G2b, G3b, G1c, G2c]
  • 48. 1) Linked Open/Closed Data  A) Exchange: Encryption + HDT (hdtcrypt)
  • 49. 1) Linked Open/Closed Data  B) A secure LD Endpoint ESWC’17, THU 16:30-17:00: Self-Enforcing Access Control for Encrypted RDF. Javier D. Fernández, Sabrina Kirrane, Axel Polleres and Simon Steyskal
  • 50. 2) RDF evolution at Scale [Figure from Andreas Harth, Stream Reasoning in Mixed Reality Applications, Stream Reasoning Workshop 2015. Axes: number of sources (10^0 to 10^6) vs. update rate (year, month, week, day, hour, minute, second). Plotted: DBpedia, BTC, Dyldo, Internet of Things, Virtual/Augmented Reality. Annotations: versions? LOD-a-lot]
  • 51. 2) RDF evolution at Scale: one of the fundamental problems in the Web of Data. Last few years:  Research projects: Managing the Evolution and Preservation of the Data Web (FP7), Preserving Linked Data (FP7)  Archives  Tools  Benchmarking: BEnchmark of RDF ARchives
  • 52. 3) Ontology-based Data Management. Use case: DBpedia & SPARQL Update to maintain Wikipedia? Use mappings to update infoboxes and track pages that need updating.
  • 53. 3) Ontology-based Data Management. Our approach to OBDM over curated sources: 1. Ensure consistency in all cases, automatically resolve updates on a best-effort basis. 2. Learn from existing data and from principled belief revision semantics.  E.g.: many football players with only one foaf:name in English DBpedia have both the name and full_name Infobox properties set (mapping: name, full_name → foaf:name). A minimal-change insert translation would only update one infobox property. 3. Record, extract and apply best / typical practices. ESWC’17, TUE 12:00-12:30: Updating Wikipedia via DBpedia Mappings and SPARQL. Albin Ahmeti, Javier D Fernández, Axel Polleres and Vadim Savenkov
  • 54. A good recipe for a WS Keynote Ingredients (for approx. 20 persons)  A motivated speaker  Some knowledge in the area  An engaged audience  Slides (number at your convenience) Method  Present yourself  Set the context, give an overall picture of the area  Touch some of the topics of the event  Focus the discussion- Sell your work  Devise future developments in the area • Mix everything with humour
  • 55. Dept. of Information Systems & Operations Institute for Information Business Welthandelsplatz 1, 1020 Vienna, Austria DR. Javier D. Fernández T +43-1-313 36-5241 F +43-1-313 36-739 jfernand@wu.ac.at www.ai.wu.ac.at Thanks!  Big (Semantic) Data  Versions  Evolving Data  Encryption  Compression rdfhdt.org

Editor's Notes

  1. After some years of pushing for the Web of Data, now is the moment to look at the ecosystem and think about what we have done so far and what we haven't
  2. Outlines quite clearly what they thought back then the Semantic Web should be…
  3. LEDS: Linked Enterprise Data Services