A Higher-Order Data Flow Model for Heterogeneous Big Data
Simon Price and Peter Flach
Outline of this presentation
1. Introduction
2. JSONMatch
3. Data Flow Model
4. Example
5. Summary
2. JSONMatch
JSONMatch
JSON is the de facto data format for Web 2.0 and mobile apps.
JSON is the 'record' in many NoSQL databases.
JSONMatch compares the similarity of JSON documents.
Use case: interactive web applications for profiling and matching Big (Variety) Data.
http://jsonmatch.com
JSONMatch
• A web service for analyzing and integrating data from heterogeneous sources in these formats:
o JSON (default)
o CSV
o HTML
o RDF
o XML
o YAML
o Plain text
o Prolog terms
o Weka ARFF machine learning datasets
JSONMatch
• Stores and retrieves structured data (e.g. JSON documents) like a NoSQL database.
• Processes data using data flows defined dynamically in JSON using the REST API.
• Aims to produce results:
o quickly for small datasets
o eventually for larger datasets.
3. Data Flow Model
Data Model
• Each dataset is a relation. E.g. S
• Each relation is a set of key-value pairs. E.g. S1,S2,...,Sn
• Values can be 'unstructured', semi-structured or structured data.
• In JSONMatch: value = JSON document
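As a rough illustration of this data model, a relation can be pictured as a Python dict from keys to parsed JSON documents. The keys, the sample documents and the variable names below are made up for the sketch and are not taken from JSONMatch.

```python
import json

# Relation s: each key maps to one JSON document (parsed into a Python object).
# The three sample values are hypothetical and only show the range of shapes.
s = {
    "s1": json.loads('{"person": {"title": "Dr", "name": "A. Example"}}'),  # structured
    "s2": json.loads('["plain", "array", "item"]'),                         # semi-structured
    "s3": json.loads('"unstructured free text"'),                           # 'unstructured'
}

# Record which Python type each value was parsed into.
kinds = {key: type(value).__name__ for key, value in s.items()}
print(kinds)
```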
[Figure: relation s containing items s1, s2, ..., sn]
Example Data Flow
[Figure: data flow over relations s, t, u, v, w with transformations Φ1 (filter), Φ2 (join) and Φ3 (group)]
w = Φ3(Φ1(s), Φ2(t))
Another Example Data Flow
[Figure: data flow over relations s, t, u, v, w, x, y, z with transformations Φ1, Φ2, Φ3 (split), Φ4 (map, applied three times) and Φ5 (reduce)]
Higher-Order Transformation
[Figure: Φ(g)(h) takes input relations s, t, u, ... and produces output relation v]
v = Φ(g)(h)(s, t, u, ...)
Function Φ transforms relations s,t,u,... into relation v.
Functions g and h are the higher-order parameters.
Generator Function (g)
• Choose one of three:
o Map
o Product
o Lambda
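The curried form v = Φ(g)(h)(s, t, u, ...) can be sketched in Python as follows. This is an illustration under assumptions, not JSONMatch's implementation: the names `phi`, `gen_map` and `gen_product` are hypothetical, relations are modelled as dicts, h is a plain Python callable rather than a JSON template, map is read as pairing items position by position, and the lambda generator is left out.

```python
from itertools import product as cartesian

def gen_map(*relations):
    """g=map (assumed reading): take items position by position across relations."""
    return zip(*(r.values() for r in relations))

def gen_product(*relations):
    """g=product: every combination of one item from each relation."""
    return cartesian(*(r.values() for r in relations))

def phi(g):
    """Curried transformation: phi(g)(h)(s, t, ...) returns the output relation v."""
    def with_template(h):
        def transform(*relations):
            # g selects the "current" items; h turns them into one output item.
            outputs = (h(*current) for current in g(*relations))
            return {f"v{i}": item for i, item in enumerate(outputs, start=1)}
        return transform
    return with_template

# Hypothetical sample relations.
s = {"s1": {"name": "Ann"}, "s2": {"name": "Bob"}}
t = {"t1": {"topic": "ML"}, "t2": {"topic": "DB"}, "t3": {"topic": "IR"}}

def pair(a, b):
    """Template stand-in that combines one item from s with one from t."""
    return {"name": a["name"], "topic": b["topic"]}

upper = phi(gen_map)(lambda a: {"name": a["name"].upper()})(s)
pairs = phi(gen_product)(pair)(s, t)
print(len(upper), len(pairs))
```

With two items in s and three in t, the product generator yields six output items, which matches the item counts on the g=product slide.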
Generator Function (g=map)
[Figure: three panels illustrating g=map. Panel 1: items s1 ... sn map to v1 ... vn. Panel 2: items s1 ... sn and t1 ... tn map to v1 ... vn. Panel 3: items s1 ... sn and t1 ... tm map to v1 ... vn+m.]
Generator Function (g=product)
[Figure: g=product over s (items s1, s2) and t (items t1, t2, t3) produces v (items v1 ... v6)]
Generator Function (g=lambda)
[Figure: g=lambda over s (items s1 ... sn) produces v (items v1 ... vm, vm+1 ... vm+o, ...)]
Template Function (h)
• Template data item with embedded functions that are expanded by Φ to produce an output item.
• The embedded functions have access to the "current" items from the input relations, i.e. the items selected by g.
• The embedded functions use JSONPath expressions (i.e. simplified XPath for JSON) to access sub-parts of the input items:
o $.person.title
o $.person.paper[*].author[0].name
o $[0][3][1].foo
Example JSONMatch template data item (h)
• One input relation S. Each item si is an array like this:
[ "Ad Feelders",
"http://dblp.uni-trier.de/pers/hd/f/Feelders:Ad.html.",
"Rankings_and_Partial_Orders",
"Active_Learning; Bioinformatics; ..." ]
• g=map and h is:
{ "name": "$.items[0][0]",
"url": "$.items[0][1]",
"text": ["jm:http_get", "$.items[0][1]"],
"primary": "$.items[0][2]",
"keywords": ["jm:split", ";", "$.items[0][3]"] }
• Output relation V has items vi like this:
{ "name": "Ad Feelders",
"url": "http://dblp.uni-trier.de/...",
"text": "<html><title>A. J. Feeld...</html>",
"primary": "Rankings_and_Partial_Orders",
"keywords": [ "Active_Learning", "Bioinformatics", ... ] }
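The expansion of the template h above can be approximated with a small recursive function. This is a sketch under assumptions rather than JSONMatch's actual code: `resolve`, `expand` and the `functions` table are hypothetical names, only `.name` and `[n]` JSONPath steps are handled (no `[*]`), `$.items[0]` is read as the current item from the first input relation, `jm:http_get` is replaced by a stub that makes no network request, and `jm:split` is assumed to strip whitespace. The sample item keeps only the two keywords visible on the slide.

```python
import re

# Matches one path step: either .name or [n].
TOKEN = re.compile(r"\.(\w+)|\[(\d+)\]")

def resolve(path, root):
    """Resolve a simple JSONPath such as $.items[0][1] against root."""
    value = root
    for name, index in TOKEN.findall(path[1:]):  # path[1:] skips the leading "$"
        value = value[name] if name else value[int(index)]
    return value

def expand(template, root, functions):
    """Recursively expand a template item against the current input items."""
    if isinstance(template, str):
        return resolve(template, root) if template.startswith("$") else template
    if isinstance(template, list):
        # A list whose first element is "jm:<name>" is treated as a function call.
        if template and isinstance(template[0], str) and template[0].startswith("jm:"):
            args = [expand(arg, root, functions) for arg in template[1:]]
            return functions[template[0]](*args)
        return [expand(part, root, functions) for part in template]
    if isinstance(template, dict):
        return {key: expand(val, root, functions) for key, val in template.items()}
    return template

# Stand-ins for the embedded functions (assumed behaviour, no network access).
functions = {
    "jm:split": lambda sep, text: [part.strip() for part in text.split(sep)],
    "jm:http_get": lambda url: f"<html>stub for {url}</html>",
}

h = {
    "name": "$.items[0][0]",
    "url": "$.items[0][1]",
    "text": ["jm:http_get", "$.items[0][1]"],
    "primary": "$.items[0][2]",
    "keywords": ["jm:split", ";", "$.items[0][3]"],
}

s_i = [
    "Ad Feelders",
    "http://dblp.uni-trier.de/pers/hd/f/Feelders:Ad.html.",
    "Rankings_and_Partial_Orders",
    "Active_Learning; Bioinformatics",
]

v_i = expand(h, {"items": [s_i]}, functions)
print(v_i["keywords"])
```

Passing the embedded functions in as a table keeps the expander itself free of side effects, so a real HTTP client could be swapped in for the stub without touching `expand`.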
4. Example
SubSift
SubSift is a prototype application to support academic peer review.
SubSift matches submitted conference/journal papers to potential peer reviewers based on similarity to published works.
Website: http://subsift.ilrt.bris.ac.uk
Recreating SubSift in JSONMatch
• All the nice features of SubSift are preserved.
• JSONMatch implementation adds other advantages:
o Functionality defined by application as data flow at runtime.
o REST API much smaller and simpler because functionality defined in item template h.
o Does not require a separate web harvester robot.
o External web services can be embedded in data flow.
o Handles much larger numbers of reviewers and papers.
5. Summary
Higher-Order Data Flow Model
Concise formalism for Big Variety data flows specified dynamically from interactive web applications.
JSONMatch proof-of-concept implementation:
• For analyzing and integrating data from heterogeneous sources
• http://jsonmatch.com
Nice properties for analyzing data serially over extended periods of time without Big Data infrastructure.
Get in touch: http://simonprice.info
