A Higher-Order Data Flow Model for Heterogeneous Big Data

Simon Price and Peter Flach

Outline of this presentation
1. Introduction
2. JSONMatch
3. Data Flow Model
4. Example
5. Summary

2. JSONMatch

JSONMatch
JSON is the de facto data format for Web 2.0 and mobile apps.
JSON is the 'record' in many NoSQL databases.
JSONMatch compares the similarity of JSON documents.
Use case: interactive web applications for profiling and matching Big (Variety) Data.
http://jsonmatch.com

JSONMatch
• A web service for analyzing and integrating data from heterogeneous sources in these formats:
o JSON (default)
o CSV
o HTML
o RDF
o XML
o YAML
o Plain text
o Prolog terms
o Weka ARFF machine learning datasets

JSONMatch
• Stores and retrieves structured data (e.g. JSON documents) like a NoSQL database.
• Processes data using data flows defined dynamically in JSON using the REST API.
• Aims to produce results:
o quickly for small datasets
o eventually for larger datasets.

3. Data Flow Model

Data Model
• Each dataset is a relation, e.g. S.
• Each relation is a set of key-value pairs, e.g. s1, s2, ..., sn.
• Values can be unstructured, semi-structured or structured data.
• In JSONMatch: value = JSON document
[Figure: relation s drawn as a column of items s1, s2, ..., sn]
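
As a rough illustration of the data model above, a relation can be pictured as a mapping from keys to JSON-like values. The helper name make_relation and the key scheme (s1, s2, ...) are assumptions made for this sketch; they are not taken from JSONMatch.

```python
import json


def make_relation(name, values):
    """Build a relation as a mapping from generated keys to values.

    The key scheme (name + position) is an assumption for illustration.
    """
    return {f"{name}{i}": value for i, value in enumerate(values, start=1)}


# Values may be unstructured (text), semi-structured or structured.
s = make_relation("s", [
    "free text about rankings",
    ["Ad Feelders", "Rankings_and_Partial_Orders"],
    {"name": "Ad Feelders", "primary": "Rankings_and_Partial_Orders"},
])

# Each value can be serialised as a JSON document.
documents = {key: json.dumps(value) for key, value in s.items()}
```

A plain dictionary keeps the sketch short; it says nothing about how JSONMatch actually stores relations.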

Example Data Flow
[Figure: data flow over relations s, t, u, v and w, with Φ1 = filter, Φ2 = join and Φ3 = group]
w = Φ3(Φ1(s), Φ2(t))

Another Example Data Flow
[Figure: data flow over relations s, t, u, v, w, x, y and z, in which Φ1, Φ2 and Φ3 are split steps, Φ4 (map) appears three times and Φ5 is a reduce step]

Higher-Order Transformation
[Figure: input relations s, t, u, ... feed the transformation Φ(g)(h), which outputs relation v]
v = Φ(g)(h)(s, t, u, ...)
Function Φ transforms relations s,t,u,... into relation v.
Functions g and h are the higher-order parameters.
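
The curried form v = Φ(g)(h)(s, t, u, ...) can be sketched with nested Python functions. This is only a minimal sketch under stated assumptions: relations are dictionaries, g yields tuples of "current" items, h is an ordinary Python function rather than a template data item, and the names phi, g_map and the output keys v1, v2, ... are invented for illustration.

```python
def phi(g):
    """Return a function that takes h and then the input relations."""
    def with_template(h):
        def transform(*relations):
            # g selects the current items; h turns them into one output item.
            outputs = [h(*current) for current in g(*relations)]
            return {f"v{i}": item for i, item in enumerate(outputs, start=1)}
        return transform
    return with_template


def g_map(relation):
    """Assumed one-to-one generator: one tuple per item of a single relation."""
    for item in relation.values():
        yield (item,)


s = {"s1": {"n": 1}, "s2": {"n": 2}}
v = phi(g_map)(lambda item: {"double": item["n"] * 2})(s)
# v == {"v1": {"double": 2}, "v2": {"double": 4}}
```

Because g and h are supplied as parameters, the same phi can be reused with a different generator or a different template without changing its body.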

Generator Function (g)
• Choose one of three:
o Map
o Product
o Lambda

Generator Function (g=map)
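[Figure: three panels for g=map. (1) One input relation s with items s1..sn gives output relation v with items v1..vn. (2) Two input relations s (s1..sn) and t (t1..tn) give output relation v with items v1..vn. (3) Input relations s (s1..sn), t (t1..tm), u, ... give output relation v with items v1..vn, vn+1..vn+m, ...]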

Generator Function (g=product)
[Figure: input relations s (items s1, s2) and t (items t1, t2, t3) give output relation v with six items v1..v6]
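
The six output items in the figure match the size of a Cartesian product of a two-item and a three-item relation. A minimal sketch of such a generator, assuming relations are dictionaries and using the invented name g_product, could look like this:

```python
from itertools import product


def g_product(*relations):
    """Yield every combination of one item from each input relation."""
    return product(*(relation.values() for relation in relations))


s = {"s1": "a", "s2": "b"}
t = {"t1": 1, "t2": 2, "t3": 3}

pairs = list(g_product(s, t))
# 2 items x 3 items gives 6 combinations, one per output item v1..v6.
```

The order in which combinations are produced here is that of itertools.product; the slide does not specify an order.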

Generator Function (g=lambda)
[Figure: input relation s with items s1..sn gives output relation v with items v1..vm, vm+1..vm+o, ...]

Template Function (h)
• Template data item with embedded functions that are expanded by Φ to produce an output item.
• The embedded functions have access to the "current" items from the input relations, i.e. the items selected by g.
• The embedded functions use JSONPath expressions (i.e. simplified XPath for JSON) to access sub-parts of the input items:
o $.person.title
o $.person.paper[*].author[0].name
o $[0][3][1].foo
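
To make the three example paths concrete, here is a minimal sketch of an evaluator for a small JSONPath subset: the root $, .name steps, [n] indices and the [*] wildcard. It is not the evaluator used by JSONMatch and does not cover full JSONPath; the function name jsonpath and the sample documents are assumptions.

```python
import re

# One path step: .name, [index] or [*]
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]|\[(\*)\]")


def jsonpath(path, doc):
    """Return the list of values in doc matched by the path."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    matches, pos = [doc], 1
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:
            raise ValueError(f"unsupported syntax at offset {pos}")
        name, index, _star = step.groups()
        selected = []
        for node in matches:
            if name is not None:
                if isinstance(node, dict) and name in node:
                    selected.append(node[name])
            elif index is not None:
                if isinstance(node, list) and int(index) < len(node):
                    selected.append(node[int(index)])
            elif isinstance(node, list):       # wildcard over a list
                selected.extend(node)
            elif isinstance(node, dict):       # wildcard over an object
                selected.extend(node.values())
        matches, pos = selected, step.end()
    return matches


doc = {"person": {"title": "Dr", "paper": [
    {"author": [{"name": "A"}, {"name": "B"}]},
    {"author": [{"name": "C"}]},
]}}
titles = jsonpath("$.person.title", doc)
first_authors = jsonpath("$.person.paper[*].author[0].name", doc)
nested = jsonpath("$[0][3][1].foo", [[0, 1, 2, [None, {"foo": "bar"}]]])
```

Steps that do not match (a missing key, an index past the end) simply contribute no values, so a path with no match returns an empty list.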

Example JSONMatch template data item (h)
• One input relation S. Each item si is an array like this:
    [ "Ad Feelders",
      "http://dblp.uni-trier.de/pers/hd/f/Feelders:Ad.html.",
      "Rankings_and_Partial_Orders",
      "Active_Learning; Bioinformatics; ..." ]
• g=map and h is:
    { "name": "$.items[0][0]",
      "url": "$.items[0][1]",
      "text": ["jm:http_get", "$.items[0][1]"],
      "primary": "$.items[0][2]",
      "keywords": ["jm:split", ";", "$.items[0][3]"] }
• Output relation V has items vi like this:
    { "name": "Ad Feelders",
      "url": "http://dblp.uni-trier.de/...",
      "text": "<html><title>A. J. Feeld...</html>",
      "primary": "Rankings_and_Partial_Orders",
      "keywords": [ "Active_Learning", "Bioinformatics", ... ] }
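
The expansion of a template data item can be sketched as a recursive walk over the template. This is a guess at the mechanics, not JSONMatch's implementation: the functions resolve and expand are invented, paths are limited to the $.items[i][j] form used on the slide, jm:http_get is replaced by a stand-in that makes no network request, and the keyword string is cut to its two visible entries.

```python
import re


def resolve(path, context):
    """Resolve a path such as $.items[0][1] (name plus integer indices)."""
    parts = re.fullmatch(r"\$\.(\w+)((?:\[\d+\])*)", path)
    if parts is None:
        raise ValueError(f"unsupported path: {path}")
    value = context[parts.group(1)]
    for index in re.findall(r"\[(\d+)\]", parts.group(2)):
        value = value[int(index)]
    return value


def expand(template, context, functions):
    """Expand paths and embedded jm: function calls in a template."""
    if isinstance(template, str):
        return resolve(template, context) if template.startswith("$.") else template
    if isinstance(template, list):
        head = template[0] if template else None
        if isinstance(head, str) and head.startswith("jm:"):
            args = [expand(arg, context, functions) for arg in template[1:]]
            return functions[head](*args)
        return [expand(part, context, functions) for part in template]
    if isinstance(template, dict):
        return {key: expand(val, context, functions) for key, val in template.items()}
    return template


functions = {
    "jm:split": lambda sep, text: [p.strip() for p in text.split(sep) if p.strip()],
    "jm:http_get": lambda url: f"<stand-in for the page at {url}>",  # no real request
}
h = {"name": "$.items[0][0]",
     "url": "$.items[0][1]",
     "text": ["jm:http_get", "$.items[0][1]"],
     "primary": "$.items[0][2]",
     "keywords": ["jm:split", ";", "$.items[0][3]"]}
item = ["Ad Feelders",
        "http://dblp.uni-trier.de/pers/hd/f/Feelders:Ad.html.",
        "Rankings_and_Partial_Orders",
        "Active_Learning; Bioinformatics"]
output = expand(h, {"items": [item]}, functions)
```

Passing the embedded functions in as a dictionary keeps the walk itself free of side effects, which makes it easy to swap the stand-in for a real HTTP client.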

4. Example

SubSift
SubSift is a prototype application to support academic peer review.
SubSift matches submitted conference/journal papers to potential peer reviewers based on similarity to published works.
Website: http://subsift.ilrt.bris.ac.uk
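
The slide does not say how similarity is computed. As one possible reading, the sketch below ranks reviewers for a paper by TF-IDF weighted cosine similarity over whitespace tokens; the similarity measure, the function names and the toy texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to a dict of term -> TF-IDF weight."""
    tokens = {name: text.lower().split() for name, text in docs.items()}
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    total = len(docs)
    return {
        name: {term: (count / len(words)) * math.log(total / doc_freq[term])
               for term, count in Counter(words).items()}
        for name, words in tokens.items()
    }


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Toy stand-ins for reviewers' published works and a submitted paper.
reviewers = {"r1": "ranking partial orders",
             "r2": "protein sequence alignment"}
papers = {"p1": "ranking partial orders data"}

vectors = tfidf_vectors({**reviewers, **papers})
scores = {name: cosine(vectors["p1"], vectors[name]) for name in reviewers}
best = max(scores, key=scores.get)
```

Here the paper shares three terms with r1 and none with r2, so r1 is ranked first.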

Recreating SubSift in JSONMatch
• All the nice features of SubSift are preserved.
• JSONMatch implementation adds other advantages:
o Functionality defined by application as data flow at runtime.
o REST API much smaller and simpler because functionality defined in item template h.
o Does not require a separate web harvester robot.
o External web services can be embedded in data flow.
o Handles much larger numbers of reviewers and papers.

5. Summary

Higher-Order Data Flow Model
Concise formalism for Big Variety data flows specified dynamically from interactive web applications.
JSONMatch proof-of-concept implementation:
• For analyzing and integrating data from heterogeneous sources
• http://jsonmatch.com
Nice properties for analyzing data serially over extended periods of time without Big Data infrastructure.

Get in touch: http://simonprice.info
