Open Data for Agriculture
Intro to Big Data
29/11/2013
Athens, Greece
Joint offering by

Supported by EU projects
Intro to Big Data

Antonis Koukourikos
NCSR “Demokritos”

Introduction to Big data


Introduction to Big Data and Semantic Web technologies for Big Data. It was presented at the intro course "Big Data in Agriculture": http://wiki.agroknow.gr/agroknow/index.php/Athens_Green_Hackathon_2013

Published in: Education

Transcript of "Introduction to Big data"

  1. Open Data for Agriculture
     Intro to Big Data
     29/11/2013
     Athens, Greece
     Joint offering by
     Supported by EU projects
  2. Intro to Big Data
     Antonis Koukourikos
     NCSR “Demokritos”
  3. Presentation Outline
     • What is Big Data?
     • Semantic Web Technologies
     • What Semantic Web brings into the picture
  4. Part 1: WHAT IS BIG DATA?
  5. Big Data Is…
     Data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it
  6. Big Data Sources
     • Biomedical Information
     • Sensor Data
     • Logs
     • E-mails
     • Satellite images
     • Audio and Video Streams
     • Social Networks
  7. Big Data Challenges – “The Three Vs”
     • Volume
     • Variety
     • Velocity
     …or is it 4…?
     • Veracity
     …or is it 6…?
     • Visualization
     • Value
  8. Big Data demand…
     • Storage
       – Impractical or impossible to use centralized storage
     • Distribution
     • Federation
       – Indexing is a problem in itself
     • Computational power
       – For discovering
       – For searching / retrieving
       – For joining
     • Human effort and expertise
       – Querying can become complex
       – Are you sure you exploit all this information?
  9. Part 2: SEMANTIC WEB TECHNOLOGIES
  10. The Syntactic and the Semantic Web
      • The World Wide Web represents information using natural language, graphics, multimedia...
        – Humans can process and combine this information easily
        – However, machines are ignorant!
      • The Semantic Web is a Web with a meaning
        – A web of data that is understandable by machines
  11. Semantic Web Technologies
      • Common formats for integration and combination of data drawn from diverse sources, whereas the original Web mainly concentrated on the interchange of documents.
      • For defining
        – RDFS http://www.w3.org/TR/rdf-schema/
        – OWL http://www.w3.org/TR/owl2-overview/
      • For describing
        – RDF http://www.w3.org/RDF/
      • For querying
        – SPARQL http://www.w3.org/TR/2013/REC-sparql11-query-20130321/
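As a rough illustration of how the "describing" and "querying" layers fit together, the sketch below stores RDF-style statements as (subject, predicate, object) tuples and evaluates a list of triple patterns over them, which is the core idea behind a SPARQL basic graph pattern. It uses only the Python standard library rather than a real RDF store, and the `ex:` names, the sample data, and the helper functions are hypothetical, not taken from the slides.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical example data: each statement is a (subject, predicate, object) triple.
TRIPLES: List[Triple] = [
    ("ex:field1", "rdf:type", "ex:Field"),
    ("ex:field1", "ex:hasCrop", "ex:Wheat"),
    ("ex:field2", "rdf:type", "ex:Field"),
    ("ex:field2", "ex:hasCrop", "ex:Maize"),
]


def match_pattern(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Evaluate the patterns one by one, joining on shared variables."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = match_pattern(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


# Roughly: SELECT ?f ?crop WHERE { ?f rdf:type ex:Field . ?f ex:hasCrop ?crop }
results = query(
    [("?f", "rdf:type", "ex:Field"), ("?f", "ex:hasCrop", "?crop")],
    TRIPLES,
)
for row in results:
    print(row)
```

A real deployment would use an RDF library or triple store and a SPARQL engine; the nested loops here only show what pattern matching over triples means.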
  12. What SW can do
      • Handle heterogeneity
      • Handle evolution / variability
      • Elicit inferred knowledge
      • Volume is still the challenge
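To make "elicit inferred knowledge" concrete, here is a minimal sketch of forward-chaining over two standard RDFS entailment rules: `rdfs:subClassOf` is transitive, and an instance of a subclass is also an instance of the superclass. The class names and the sample triple set are assumptions made for illustration; a production reasoner would cover many more rules and would not use this naive quadratic loop.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical example data.
TRIPLES: Set[Triple] = {
    ("ex:Wheat", "rdfs:subClassOf", "ex:Cereal"),
    ("ex:Cereal", "rdfs:subClassOf", "ex:Crop"),
    ("ex:sample1", "rdf:type", "ex:Wheat"),
}


def infer(triples: Set[Triple]) -> Set[Triple]:
    """Apply the two rules repeatedly until no new triple is produced."""
    inferred = set(triples)
    while True:
        new: Set[Triple] = set()
        for (sub, pred, sup) in inferred:
            if pred != "rdfs:subClassOf":
                continue
            for (s2, p2, o2) in inferred:
                # Transitivity: sub ⊑ sup and sup ⊑ o2 imply sub ⊑ o2.
                if p2 == "rdfs:subClassOf" and s2 == sup:
                    new.add((sub, "rdfs:subClassOf", o2))
                # Type propagation: s2 is a sub, so s2 is also a sup.
                if p2 == "rdf:type" and o2 == sub:
                    new.add((s2, "rdf:type", sup))
        if new <= inferred:
            return inferred
        inferred |= new


closure = infer(TRIPLES)
# The statement below was never asserted; it follows from the class hierarchy.
print(("ex:sample1", "rdf:type", "ex:Crop") in closure)
```

The fixed-point loop also hints at why volume remains the challenge: each pass scans the whole triple set, which does not scale without indexing or distribution.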
  13. Part 3: WHAT SEMANTIC WEB BRINGS IN THE BIG DATA PICTURE
  14. Moving Forward with “Old” Technologies
      [Diagram with two panels: “NOW (2012) CASE OF AGRICULTURAL INFRASTRUCTURES” and “2015 (AgINFRA) CASE OF AGRICULTURAL INFRASTRUCTURES”. Component labels: OAI-PMH Service Provider #1 … #n; Schema #1 … #n; HARVESTER; Aggregated XML Repository; INDEXER (AGRIS AP Schema, IEEE LOM Schema, DC Schema, ...); SPARQL endpoint (Data Source #1) … (Data Source #n); Common Schema; RDF Triple Store; SPARQL endpoint; Web Portals: Open AGRIS (FAO), AgLR/GLN (ARIADNE), Organic.Edunet (UAH), VOA3R (UAH), ... Callouts: “How Many? Is it feasible?” and “BigData Problem!”]
  15. What Semantic Web can bring into the picture
      • One Data Access Point for the entire Data Cloud
        – Enabling Service-Data level agreements with Data providers
      • Application-level Vocabularies / Thesauri / Ontologies
        – Enabling different application facets for different communities of users over the SAME data pool
      • Going beyond existing Distributed Triple Store Implementations
        – Link Heterogeneous but Semantically Connected Data
        – Index Extremely Large Information Volumes (Peta Sizes)
        – Improve Information Retrieval response
      • Data (+Metadata) physically stored in Data Provider
        – No need for harvesting
      • Vocabularies / Thesauri / Ontologies of Data Provider choice
        – No need for aligning according to common schemas
      [Diagram: SemaGrow architecture, from the client and the SemaGrow SPARQL endpoint down to the SPARQL endpoints of Data Sources #1 … #n; the component labels match those listed on slide 19.]
  16. The SemaGrow Solution
      • Use POWDER to mass-annotate large subspaces
        – Exploit naming convention regularities to compress the indexes used by the system
      • Partition triple patterns in the original query
      • Annotate each fragment with an ordered list of data sources most likely to contain relevant data
      • Distribute and transform the query fragments
      • Collect and align the results
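The steps above can be sketched roughly as follows. This is not the SemaGrow implementation: the in-memory `SOURCES`, the ranking heuristic (count of triples with the pattern's predicate, standing in for instance statistics), and all function names are assumptions chosen to keep the example small. Query transformation between schemas is left out entirely.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical data sources, each standing in for a remote SPARQL endpoint.
SOURCES: Dict[str, List[Triple]] = {
    "source_a": [
        ("ex:field1", "ex:hasCrop", "ex:Wheat"),
        ("ex:field2", "ex:hasCrop", "ex:Maize"),
    ],
    "source_b": [
        ("ex:Wheat", "ex:yield", "3.1"),
        ("ex:field3", "ex:hasCrop", "ex:Rice"),
    ],
}


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def rank_sources(pattern: Triple, sources: Dict[str, List[Triple]]) -> List[str]:
    """Order sources by how many triples use the pattern's predicate (assumed heuristic)."""
    predicate = pattern[1]
    if predicate.startswith("?"):
        return sorted(sources)  # no predicate to go on: consider every source
    counts = {name: sum(1 for t in triples if t[1] == predicate)
              for name, triples in sources.items()}
    return sorted((n for n, c in counts.items() if c > 0),
                  key=lambda n: (-counts[n], n))


def federated_query(patterns: List[Triple], sources: Dict[str, List[Triple]]) -> List[Binding]:
    bindings: List[Binding] = [{}]
    for pattern in patterns:                     # partition: one fragment per triple pattern
        ranked = rank_sources(pattern, sources)  # annotate with an ordered source list
        extended = []
        for binding in bindings:
            for name in ranked:                  # distribute the fragment to each candidate
                for triple in sources[name]:
                    candidate = match(pattern, triple, binding)
                    if candidate is not None:
                        extended.append(candidate)
        bindings = extended                      # collect and join the partial results
    return bindings


results = federated_query(
    [("?f", "ex:hasCrop", "?c"), ("?c", "ex:yield", "?y")],
    SOURCES,
)
print(results)
```

Ranking before dispatch is what lets a federated engine skip sources that cannot contribute, here `source_a` for the `ex:yield` fragment.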
  17. The POWDER W3C Recommendation
      • Exploits natural groupings of URIs to annotate all resources in a subset of the URI space
      • Regular expression based grouping
      • Allows properties and their values to be associated with an arbitrary number of subjects within a fully-defined semantic framework
      • POWDER Description Resources: http://www.w3.org/TR/powder-dr/
      • POWDER Formal Semantics: http://www.w3.org/TR/powder-formal/
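The idea of regular-expression-based grouping can be illustrated with a small sketch: one rule attaches a set of property values to every URI that matches a pattern, so resources do not have to be annotated one by one. This is only loosely modelled on POWDER; the `DescriptionResource` class, the `describe` function, and the example URIs and properties are hypothetical and do not reproduce the POWDER XML syntax or its formal semantics.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DescriptionResource:
    """One rule: every IRI matching `iri_regex` receives `descriptors`."""
    iri_regex: str
    descriptors: Dict[str, str]


# Hypothetical rules exploiting naming conventions in the URI space.
RESOURCES: List[DescriptionResource] = [
    DescriptionResource(r"^http://example\.org/soil/", {"ex:topic": "ex:SoilData"}),
    DescriptionResource(r"^http://example\.org/weather/", {"ex:topic": "ex:WeatherData"}),
]


def describe(iri: str, resources: List[DescriptionResource]) -> Dict[str, str]:
    """Collect the descriptors of every rule whose pattern matches the IRI."""
    merged: Dict[str, str] = {}
    for resource in resources:
        if re.search(resource.iri_regex, iri):
            merged.update(resource.descriptors)
    return merged


print(describe("http://example.org/soil/sample/42", RESOURCES))
print(describe("http://example.org/other", RESOURCES))
```

Two rules here cover an unbounded number of URIs, which is the compression effect that makes this style of annotation attractive for indexes over large URI spaces.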
  18. The SemaGrow Stack
      • Integrates the components in order to offer a single SPARQL endpoint that federates a number of heterogeneous data sources
      • Targets the federation of independently provided data sources
  19. SemaGrow Architecture
      [Diagram. Component labels: Client; SemaGrow SPARQL endpoint; Resource Discovery (Resource Selector, Query Pattern Discovery Service, Data Source(s) Selector); Query Decomposition (Query Decomposer); Query Manager; Query Transformation Service; Query Results Merger; Data Summaries Endpoint (POWDER Inference Layer, P-Store); Federated Endpoint Wrapper; SPARQL endpoint (Data Source #1) … SPARQL endpoint (Data Source #n). Data-flow labels: query patterns, equivalent patterns, Candidate Source(s) List, Instance Statistics, Load Info, Semantic Proximity, Reactivity parameters, Data Summaries, Schema Mappings, query fragments per source, transformed query, query requests #1 … #n, query results.]
  20. Use Cases (DLO): Heterogeneous Data Collections & Streams
      • Big data:
        – Sensor data: soil data, weather
        – GIS data: land usage, forest and natural resources management data
        – Historical data: crop yield, economic data
        – Forecasts: climate change models
      • Problem:
        – Combine heterogeneous sources to analyze past food production and forecast future trends
        – Cannot clone and translate: large scale, live data streams
        – Cannot immediately and directly effect radical re-design of all sensing and processing currently in place
      3rd Plenary & ESG Meeting 21/10/2013
  21. Use Cases (FAO): Reactive Data Analysis
      • Big data:
        – Document collections: past experiences, analysis and research results
        – Databases: climate conditions and crop yield observations, economic data (land and food prices)
      • Problem:
        – Retrieving complete and accurate information to compile reports
            • Raw data and reports, scientific publications, etc.
        – Wastes human resources that could analyze data and synthesize useful knowledge and advice for food production
            • Too much time spent cross-relating responses from different sources
        – Too many different organizations and processes rely on the different schemas to make re-design viable
        – Cloning is inefficient: large and constantly updated stores
  22. Use Cases (AK): Reactive Resource Discovery
      • Big data:
        – Multimedia content about agriculture and biodiversity
      • Problem:
        – Real-time retrieval of relevant content
        – Used to compile educational activities
        – Schema heterogeneity:
            • Different providers (Organic.Edunet, Europeana, VOA3R, etc.)
        – Too many different organizations and processes rely on the different schemas to make re-design viable
        – Cloning is inefficient: large and constantly updated stores
  23. Project Info
      • SemaGrow: Data intensive techniques to boost the real-time performance of global agricultural data infrastructures
      • FP7-ICT-2011.4.4 (Intelligent Information Management)

      No.  Name
      1    Universidad de Alcala
      2    NCSR “Demokritos”
      3    Universita Degli Studi di Roma Tor Vergata
      4    Semantic Web Company
      5    Institut Za Fiziku
      6    Stichting Dienst Landbouwkundig Onderzoek
      7    Food and Agriculture Organization of the UN
      8    Agroknow Technologies
  24. Thank you!
      Antonis Koukourikos
      NCSR “Demokritos”
      kukurik@iit.Demokritos.gr