ECSA 2013 (Cuesta)
  1. TOWARDS AN ARCHITECTURE FOR MANAGING BIG SEMANTIC DATA IN REAL-TIME. Carlos E. Cuesta, VorTIC3, URJC, Spain; Miguel A. Martínez-Prieto, UVa, Spain; Javier D. Fernández, UVa, Spain & UChile, Chile. Montpellier, France, 02/07/2013
  2. CONTENTS
     - Introduction
     - Problem Statement
     - Context: the RDF world
     - Proposal: SOLID Architecture
     - Unfolding in five Layers
     - SOLID in Practice
     - The RDF/HDT format
     - The SOLID/HDT Architecture
     - Conclusions & Future Work
  3. INTRODUCTION
     - Big Data has become an important topic
       - When the size of the data itself becomes part of the problem (Loukides)
     - Characterized by the "three Vs"
       - Volume: large amounts of data gathered and stored
         - The challenge is storage, but also computing
         - Volume is relative: it depends on the available resources
       - Velocity: different flows of data at different rates
       - Variety: the kind of structures within the data
         - Each source has its own semantics
         - Need for a logical model to allow data integration
     - An architecture for Big Data must consider all of these
  4. INTRODUCTION
     - One of the dimensions always becomes critical
       - E.g. storage in mobile applications, velocity in real-time applications (vs. batch processes)
     - We promote variety
       - The dataset's value increases when multiple sources are integrated, yielding more knowledge
       - This also influences velocity and volume
     - We choose a graph-based model
       - It allows managing higher levels of variety
       - Data can be linked and queried together
     - In practice, this means using RDF as the data model
       - The cornerstone of the "practical" Semantic Web
       - The basis of the emergent Web of Data
  5. PROBLEM STATEMENT
     - Most solutions for managing Big Data aim to maximize the volume dimension
       - Therefore they promote efficient storage
       - Datastores able to cope with large datasets
       - Indexing strategies to achieve high space efficiency
       - Datastores must be assumed to be stable, hence the assumed immutability property
     - But the volume of incoming data is also big
       - Datastores must be periodically updated and reindexed
       - This is very complex in a real-time context
         - Data must be received and integrated in real time
         - There is no time to process the flow of incoming data
  6. OUR PROPOSAL: SOLID ARCHITECTURE
     - We propose a specific architecture to manage real-time flows in this context
     - A multi-tiered architecture
       - It separates the consumption of Big Semantic Data...
       - ...from the complexities of real-time operation
     - Data must be kept compact
       - It is stored and indexed in a compressed way
       - The reason for the Data & Index Layers
     - It needs to cope efficiently with data updates
       - The reason for the Online Layer
     - It needs to query all of this together
       - The reason for the Service Layer
  7. CONTEXT: RDF
     - RDF: Resource Description Framework
     - Data is described as (subject, predicate, object) triples
     - An RDF dataset is a graph of knowledge
       - Entities are linked to values via labelled edges
     - Essential for Linked Open Data
       - Adopted in many different contexts
     - Simple integration: everything has a URI
     [Figure: "John" -> owns -> "Car"]
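The triple model on this slide can be sketched in a few lines of plain Python (a toy illustration, not part of the original slides; all names are hypothetical): a graph is a set of (subject, predicate, object) tuples, and a query is pattern matching with wildcards, much like a single SPARQL triple pattern.

```python
# Toy RDF-style graph: a set of (subject, predicate, object) triples.
# None acts as a wildcard, like a variable in a SPARQL triple pattern.

def match(graph, s=None, p=None, o=None):
    """Yield every triple compatible with the (s, p, o) pattern."""
    for triple in graph:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple

graph = {
    ("John", "owns", "Car"),
    ("Car", "type", "Vehicle"),
}

# "What does John own?" -- analogous to: SELECT ?o WHERE { :John :owns ?o }
owned = [o for (_, _, o) in match(graph, s="John", p="owns")]
# owned == ["Car"]
```

Real RDF uses URIs for subjects and predicates rather than bare strings, but the graph-of-triples structure is exactly this.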
  8. CONTEXT: RDF
     - The origin of the Web of Data
       - Two datasets can become connected by a single triple: <"Station #123", location, "Canal Street">
     - The web becomes data-centric
       - Every unit is a small piece of data
       - "Big Data's long tail"
     - But their integration in large contexts becomes complex: Big Semantic Data
       - A variety of sources can be easily integrated
     - RDF is not a serialization format
       - It describes what data is stored, not how this is done
  9. SOLID ARCHITECTURE
     [Figure: the five-layer SOLID architecture. New data enters the ONLINE LAYER; the DATA LAYER holds the Big Data datastore; the INDEX LAYER provides read access to it; the MERGE LAYER (batch, parallelizable processing) dumps online data into the datastore; the SERVICE LAYER joins query results across layers.]
  10. SOLID ARCHITECTURE
     [Figure: the same five-layer diagram, annotated with the standards at its interfaces: RDF as the data model and SPARQL as the query language.]
  11. SOLID ARCHITECTURE
     - Online Layer
       - Receives incoming new data
       - Deals with real-time needs
     - Data Layer
       - The core of the architecture
       - The main datastore: the Big Data repository
       - Stores compressed RDF
     - Index Layer
       - Provides an index for the Data Layer, making high-speed access possible
       - Most accesses to the repository are made through it
  12. SOLID ARCHITECTURE
     - Service Layer
       - The façade to the external user
       - Able to issue federated SPARQL queries against the separate datastores in different layers
       - Every query is distributed, and the different answers are joined
     - Merge Layer
       - Makes it possible to integrate the two datastores
       - Receives a dump of data from the Online Layer
       - Integrates it with the existing repository, producing a fresh copy of the Data Layer
       - Immutability properties are preserved
  13. SOLID IN PRACTICE
     - This abstract architecture is made possible by mapping it onto existing technology
       - In particular, the RDF/HDT binary format
     - Decisions must be taken, layer by layer, about how to actually implement it
       - Other alternatives would also be possible (and some of them are also being implemented)
     - Data-centric layers
       - Do not use a textual RDF representation: it is inefficient and prevents some potential uses
       - RDF/HDT is a binary format conceived specifically for serialization purposes
  14. SOLID IN PRACTICE
     - RDF/HDT format
       - Designed for machine processing
       - Takes about 15 times less space than equivalent formats
       - Uses compact (compressed) data structures
     - Data Layer
       - Big Semantic Data in RDF/HDT
       - Space savings and guaranteed immutability
       - Instant mapping to memory
       - Allows querying without decompressing
     - Index Layer
       - Implements the HDT/FoQ proposal
       - A lightweight index on top of the HDT binary format
       - Efficient SPARQL retrieval without decompressing
  15. SOLID IN PRACTICE
     - Online Layer
       - Copes with the incoming flow of real-time data
       - HDT is inadequate here (it is designed for read-only access)
       - Must resolve SPARQL efficiently
       - We choose a general-purpose NoSQL technology, still able to dump data in an RDF format
     - Service Layer
       - Resolves any potential queries
       - SPARQL is considered expressive enough
       - Queries are forwarded to the Online and Index Layers
       - Their results are retrieved and combined, using a (scalable) Pipe-Filter approach
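The federation idea described for the Service Layer, forwarding one pattern to several stores and combining the partial answers, can be sketched as follows (plain Python with hypothetical names; a real implementation would speak SPARQL to each store and stream results through pipe-and-filter stages):

```python
# Toy federated query: the same triple pattern is evaluated against every
# datastore (here, the Online and Data/Index Layers), and the partial
# answers are unioned, mimicking the join step in the Service Layer.

def pattern_filter(store, s=None, p=None, o=None):
    """Filter stage: yield the store's triples matching the pattern."""
    for (ts, tp, to) in store:
        if (s is None or s == ts) and (p is None or p == tp) and (o is None or o == to):
            yield (ts, tp, to)

def federated_query(stores, **pattern):
    """Fan the pattern out to every store and combine the answers."""
    results = set()
    for store in stores:
        results.update(pattern_filter(store, **pattern))
    return results

data_layer   = {("Station123", "location", "CanalStreet")}   # historical, compressed
online_layer = {("Station123", "status", "open")}            # fresh, real-time

answers = federated_query([data_layer, online_layer], s="Station123")
# One subject, answered from both layers at once.
```

The point of the sketch is the transparency: the client sees a single answer set even though the data is split between an immutable compressed store and a mutable real-time store.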
  16. SOLID IN PRACTICE
     - Merge Layer
       - Able to combine incoming data from the Online Layer with the existing datastore in the Data Layer
       - The data dump is merged into a copy of the datastore; then the fresh datastore replaces the previous one
       - A periodical process, which can also be manually triggered
     - Requires high-performance computation
       - In practice, this means a Map/Reduce approach
       - Raw RDF data from the Online Layer is converted, then ordered for internal merging
       - The cost depends on the size of the smaller store
       - It also triggers reindexing of the Index Layer
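The merge step described above can be illustrated as a sort-based merge of two triple sets, which is essentially what the order-then-merge Map/Reduce job does at scale. A minimal single-machine sketch with hypothetical names; note that the old store is never mutated in place, so the immutability property is preserved:

```python
import heapq

# Toy batch merge: sort the online dump, merge it with the (already sorted)
# datastore, and emit a fresh deduplicated copy that replaces the old one.

def merge_stores(datastore, dump):
    """Return a NEW sorted, duplicate-free list of triples."""
    merged = []
    for triple in heapq.merge(sorted(datastore), sorted(dump)):
        if not merged or merged[-1] != triple:   # skip duplicates
            merged.append(triple)
    return merged

old_store = [("a", "p", "1"), ("b", "p", "2")]   # existing Data Layer contents
new_dump  = [("b", "p", "2"), ("c", "p", "3")]   # dump from the Online Layer

fresh = merge_stores(old_store, new_dump)        # old_store is left untouched
# fresh == [("a", "p", "1"), ("b", "p", "2"), ("c", "p", "3")]
```

In the distributed version, the "sort" phase is the Map/Reduce shuffle and each reducer performs this linear merge over its key range; the new store then becomes the input for reindexing the Index Layer.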
  17. SOLID ARCHITECTURE IN PRACTICE
     [Figure: the five-layer diagram instantiated with concrete technologies: NoSQL in the ONLINE LAYER, RDF/HDT in the DATA and INDEX LAYERS, HADOOP in the MERGE LAYER (batch), and SPARQL plus Pipe-Filter in the SERVICE LAYER, over incoming Semantic Data.]
  18. CONCLUSIONS & FUTURE WORK
     - We propose SOLID as a generic architecture for managing Big Semantic Data
     - Our particular implementation relies on HDT
       - Also NoSQL for real-time incoming data: Cassandra, but not (yet) the only choice
       - Map/Reduce (Hadoop) for intensive processing
     - Highly effective in terms of space & time
       - Initial empirical results are very significant
       - We are currently developing an optimized prototype
     - Already working on variants of the architecture
       - A limited version for mobile devices, where the Merge Layer is not directly required
  19. THANKS FOR YOUR ATTENTION