SociaLite: High-level Query Language
for Big Data Analysis
Jiwon Seo, *Jongsoo Park, Jaeho Shin, Stephen Guo, and Monica S...
Problems in existing platforms
 Too difficult (low-level primitives)
 Inefficient (not network bound)
 Too many (sub) f...
SociaLite is a high-level query language
 Easy & efficient
 Compiled to distributed code
 1,000x Hadoop
 Hadoop compat...
 Concepts in SociaLite
 Distributed Tables
 Rules
 Python Integration (Jython & CPython)
 Analysis Algorithms
 Short...
 Primary data structure in SociaLite
 Column oriented storage
 <type>
 Primitive types
 Object types
 opts
 indexby...
Distributed In-Memory Tables
Foo(int x, int y).
1 9
1 10
2 5
Bar[int x](int y).
Foo(int x, (int y)).
9 7
1
2
9
1 2
3 4
9 7...
Table options
 indexby <column>
 sortby <column>
 multiset
Column options
 range
 (distributed) partition
Distributed...
Rules
Foo[a](c) :- Bar[a](b), Qux[b](c).
Rule head Rule body
Rules
Foo[a](c) :- Bar[a](b), Qux[b](c).
Bar: (1, 2), (1, 3), (8, 4), (8, 7), (9, 11)
Qux: (2, 9), (2, 10), (5, 4), (10, 7), (11, 9)
Foo: (1, 9), (1, 10), (9, 9)
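The join that this rule performs can be sketched in plain Python. This only illustrates what the rule means, not how SociaLite compiles or executes it. The function name `join_rule` is made up for this sketch, and the last two Qux rows, (10, 7) and (11, 9), are a best-effort reading of the slide data.

```python
from collections import defaultdict

# Rows as read from the slide; the last two Qux rows are a best-effort reading.
bar = [(1, 2), (1, 3), (8, 4), (8, 7), (9, 11)]
qux = [(2, 9), (2, 10), (5, 4), (10, 7), (11, 9)]


def join_rule(bar_rows, qux_rows):
    """Illustrative evaluation of Foo[a](c) :- Bar[a](b), Qux[b](c)."""
    # Index Qux by its first column so each Bar row needs one lookup.
    qux_index = defaultdict(list)
    for b, c in qux_rows:
        qux_index[b].append(c)

    foo = set()
    for a, b in bar_rows:
        # The shared variable b links a Bar row to matching Qux rows.
        for c in qux_index.get(b, ()):
            foo.add((a, c))
    return foo


foo = join_rule(bar, qux)
print(sorted(foo))  # [(1, 9), (1, 10), (9, 9)]
```

Only Bar rows whose second column matches a Qux key contribute to Foo, which is why (1, 3), (8, 4) and (8, 7) produce nothing.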
Distributed Join
Foo[a](c) :- Bar[a](b), Qux[b](c).
[Figure: the Bar row (1, 2) and the Qux row (2, 9), held on Machine 1 and Machine 2, are joined to produce the Foo row (1, 9)]
Distributed Join
Foo[a](c) :- Qux[b](c), Bar[a](b).
[Figure: the same Bar row (1, 2) and Qux row (2, 9) on Machine 1 and Machine 2, with Qux listed first in the rule body]
Parallel Evaluation
Foo[a](c) :- Bar[a](b), Qux[b](c).
[Figure: Bar is split into partitions a, b, c, and d on each of Machine 1 and Machine 2]
Parallel Evaluation
Foo[a](c) :- Bar[a](b), Qux[b](c).
Foo[a](c) :- Bar1a[a](b), Qux[b](c).
Foo[a](c) :- Bar1b[a](b), Qux[...
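One way to picture the per-partition rules is a sketch that splits Bar into pieces and evaluates the same join on each piece in its own thread. It makes several assumptions: the round-robin split, the thread pool, and the names `eval_partition` and `parallel_join` are illustrative and are not taken from SociaLite's runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def eval_partition(bar_part, qux_index):
    """Evaluate Foo[a](c) :- BarPart[a](b), Qux[b](c). for one Bar partition."""
    return {(a, c) for a, b in bar_part for c in qux_index.get(b, ())}


def parallel_join(bar_rows, qux_rows, num_partitions=4):
    # Index Qux once; every partition task only reads it.
    qux_index = {}
    for b, c in qux_rows:
        qux_index.setdefault(b, []).append(c)

    # Hypothetical partitioning: round-robin slices of the Bar rows.
    parts = [bar_rows[i::num_partitions] for i in range(num_partitions)]

    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = pool.map(eval_partition, parts, [qux_index] * num_partitions)
        # The head table is the union of what each partition derived.
        return set().union(*results)


bar = [(1, 2), (1, 3), (8, 4), (8, 7), (9, 11)]
qux = [(2, 9), (2, 10), (5, 4), (10, 7), (11, 9)]
print(sorted(parallel_join(bar, qux)))  # [(1, 9), (1, 10), (9, 9)]
```

Because each partition task only reads the shared index and returns its own set, the tasks need no locking in this sketch.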
Aggregation
Foo[a]($min(c)) :- Bar[a](b), Qux[b](c).
The $min aggregate function is applied to tuples in Foo
having the sa...
 Built-in aggregate functions
 min, max, sum, avg, argmin
 User-defined functions
 in Java or Python
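As a rough illustration of what `$min` in a rule head means, the sketch below keeps a single value per key while the join produces candidates. The data and the name `join_with_min` are hypothetical, and the code models only the semantics, not SociaLite's implementation.

```python
def join_with_min(bar_rows, qux_rows):
    """Illustrative evaluation of Foo[a]($min(c)) :- Bar[a](b), Qux[b](c)."""
    qux_index = {}
    for b, c in qux_rows:
        qux_index.setdefault(b, []).append(c)

    foo = {}  # key a -> smallest c seen so far
    for a, b in bar_rows:
        for c in qux_index.get(b, []):
            # Rows sharing the same first column collapse to their minimum.
            if a not in foo or c < foo[a]:
                foo[a] = c
    return foo


# Hypothetical sample data.
bar = [(1, 2), (1, 3), (8, 4)]
qux = [(2, 9), (2, 10), (3, 5), (4, 7)]
result = join_with_min(bar, qux)
print(result)  # {1: 5, 8: 7}
```

Without the aggregate, key 1 would carry three rows (9, 10 and 5); with `$min` only the smallest survives.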
Aggregation
 Head table also appears in rule body
Foo(a,c) :- Foo(a,b), Bar(b,c).
 Semantics
– rule executed repeatedly until no cha...
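The "repeat until nothing changes" semantics of a recursive rule can be sketched as a fixpoint loop. This is a naive evaluation written for clarity. The function name `evaluate_recursive_rule` is an assumption, and seeding Foo with the Bar rows is a choice made only to give the example a starting point.

```python
def evaluate_recursive_rule(foo_rows, bar_rows):
    """Illustrative evaluation of Foo(a,c) :- Foo(a,b), Bar(b,c). to a fixpoint."""
    foo = set(foo_rows)
    changed = True
    while changed:
        # Apply the rule once to everything currently in Foo.
        derived = {(a, c) for (a, b) in foo for (b2, c) in bar_rows if b == b2}
        # Stop when the rule derives nothing that Foo does not already hold.
        changed = not derived <= foo
        foo |= derived
    return foo


# Hypothetical chain 1 -> 2 -> 3 -> 4, with Foo seeded from Bar.
bar = {(1, 2), (2, 3), (3, 4)}
foo = evaluate_recursive_rule(bar, bar)
print(sorted(foo))  # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

The loop always terminates here because Foo can only grow and the set of possible pairs is finite.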
SociaLite: Datalog Extensions for Efficient Social Network Analysis, ICDE’13
Distributed SociaLite: A Datalog-Based Langua...
 SociaLite queries in Python code
 `Queries are quoted in backtick`
 Python  SociaLite
 Python functions, variables ...
Python Integration
print "This is Python code!"
# now we use SociaLite queries below
`Foo[int i](String s).
Foo[i](s) :- i...
 Graph algorithms
 Shortest Paths
 PageRank
 Data mining algorithms
 K-Means
 Logistic regression
Analysis Algorithms
 Shortest Path
Graph Algorithm
`Edge(int s, (int t, double len)) indexby s.
Path(int n, double dist) indexby n. `
`Path(t...
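The Path rule, which appears in full in the transcript as `Path(t, $min(d)) :- t=$SRC, d=0; :- Path(n, d1), Edge(n, t, d2), d=d1+d2.`, combines recursion with `$min`. A plain-Python sketch of that combination follows. It assumes non-negative edge lengths, and the name `shortest_paths` and the sample graph are hypothetical.

```python
import math


def shortest_paths(edges, src):
    """Illustrative fixpoint for the recursive Path rule with $min.

    edges maps a source node to a list of (target, length) pairs, mirroring
    Edge(int s, (int t, double len)) indexby s.
    """
    path = {src: 0.0}  # base case: t=$SRC, d=0
    frontier = {src}
    while frontier:
        next_frontier = set()
        for n in frontier:
            for t, length in edges.get(n, []):
                d = path[n] + length  # d = d1 + d2
                # $min keeps only the smallest distance per node.
                if d < path.get(t, math.inf):
                    path[t] = d
                    next_frontier.add(t)
        frontier = next_frontier  # only changed nodes are re-expanded
    return path


# Hypothetical sample graph.
edges = {1: [(2, 4.0), (3, 1.0)], 3: [(2, 2.0)], 2: [(4, 5.0)]}
dist = shortest_paths(edges, 1)
print(sorted(dist.items()))  # [(1, 0.0), (2, 3.0), (3, 1.0), (4, 8.0)]
```

Re-expanding only the nodes whose distance improved is a design choice of this sketch; it avoids re-deriving tuples that cannot change the minimum.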
 PageRank
Graph Algorithm
`Rank(n, 0, r) :- Node(n), r=1.0/$N.`
for t in range(30):
`Rank(pi, $t+1, $sum(r)) :- Node(pi),...
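The PageRank loop can be read as the following plain-Python sketch, based on the full rule in the transcript (a 0.15/N base term plus 0.85*r1/cnt summed over incoming edges). The function name `pagerank` and the sample graph are assumptions, and like the rule as written the sketch does nothing special for nodes without outgoing edges.

```python
def pagerank(nodes, edges, iterations=30, damping=0.85):
    """Illustrative version of the Rank rules; edges is a list of (pj, pi) pairs."""
    n_total = len(nodes)

    # EdgeCnt(pj, cnt): number of outgoing edges per node.
    out_count = {}
    for src, _dst in edges:
        out_count[src] = out_count.get(src, 0) + 1

    # Rank(n, 0, r) :- Node(n), r=1.0/$N.
    rank = {n: 1.0 / n_total for n in nodes}

    for _ in range(iterations):
        # First rule body: every node receives (1 - damping) / N.
        new_rank = {n: (1.0 - damping) / n_total for n in nodes}
        # Second rule body: $sum over incoming edges of damping * r1 / cnt.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_count[src]
        rank = new_rank
    return rank


# Hypothetical three-node cycle: every node ends up with the same rank.
ranks = pagerank([1, 2, 3], [(1, 2), (2, 3), (3, 1)])
print(ranks)
```

On the three-node cycle each rank stays at one third, which makes the example easy to check by hand.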
 K-Means
Data Mining Algorithm
for i in range(50):
`Center(cid, $avg(p)) :- Data(id, p), Cluster(id, $i, c),
cid=c.value....
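The two K-Means rules alternate between averaging the points of each cluster and reassigning each point to its nearest center. The sketch below mirrors that loop in plain Python under stated assumptions: `squared_distance` stands in for the `$getDiff` function, whose definition is not shown in the slides, and the initial assignment and sample points are hypothetical.

```python
def squared_distance(p, q):
    """Stand-in for $getDiff: squared Euclidean distance (an assumption)."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(data, initial_assignment, iterations=50):
    """data: id -> point tuple; initial_assignment: id -> cluster id."""
    assignment = dict(initial_assignment)
    centers = {}
    for _ in range(iterations):
        # Center(cid, $avg(p)): average the points assigned to each cluster.
        sums, counts = {}, {}
        for pid, p in data.items():
            cid = assignment[pid]
            acc = sums.setdefault(cid, [0.0] * len(p))
            for k, v in enumerate(p):
                acc[k] += v
            counts[cid] = counts.get(cid, 0) + 1
        centers = {cid: tuple(s / counts[cid] for s in acc)
                   for cid, acc in sums.items()}

        # Cluster(id, $i+1, $argmin(idx, d)): pick the nearest center.
        assignment = {
            pid: min(centers, key=lambda cid: squared_distance(p, centers[cid]))
            for pid, p in data.items()
        }
    return centers, assignment


# Hypothetical data: two obvious groups, deliberately mis-assigned at start.
data = {0: (0.0, 0.0), 1: (0.0, 1.0), 2: (10.0, 10.0), 3: (10.0, 11.0)}
centers, assignment = kmeans(data, {0: 0, 1: 1, 2: 0, 3: 1})
print(centers)     # {0: (0.0, 0.5), 1: (10.0, 10.5)}
print(assignment)  # {0: 0, 1: 0, 2: 1, 3: 1}
```

A cluster that loses all its points simply drops out of `centers` in this sketch; the slides do not say how SociaLite handles that case.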
 Logistic Regression
Data Mining Algorithm
for i in range(0, 100):
`Gradient($i, $sum(w)) :- Data(id, p), Weight($i, w1),...
 Single-thread performance
 Multi-thread performance (on 16-core machine)
 Distributed performance (up to 64 machines)
...
Single-thread
[Figure: bar chart for Shortest Paths, PageRank, Mutual Neighbors, Connected Components, Triangles, and Clustering Coefficients; y-axis from 0 to 3]
Opti...
Multi-thread
[Figure: Parallelization Speedup and Execution Time plotted against 1 to 16 on the x-axis ...]
Distributed
[Figure: Breadth First Search execution time (sec.), log scale, at 1, 4, 16, and 64 on the x-axis for Native, Combblas, Graphlab, Socialite, and Giraph ...]
SociaLite is
 Distributed query language
 Easy and efficient
 Integration with Python
 Algorithms in SociaLite (graph,...
jiwon @ stanford.edu
http://socialite.stanford.edu
Questions?
Two experimental front-ends
IPython
Gephi
GitHub data analysis
SociaLite + Gephi
Project/developer network
 Edge if de...
 Custom memory allocator (temporary table)
 Optimized serialization
 Direct ByteBuffer (network buffer)
 Multiple netw...
Inside Worker Node
Recv’er
Worker
worker
master
Sender
Network
Buffer Pool
System Overview standalone mode
Compiler
Python Integration (preprocessing)
Worker thread, Worker thread, Worker thread
Eval T...
System Overview distributed mode
Worker
Worker
Worker
Master
Distributed File System (HDFS)
 Table column can be
 Bloom filter
 Sketches
Approximation
Bloom Filter
 Probabilistic set data structure
 Elements represented as bits
 Cannot enumerate elements
 Quickly (appr...
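A minimal Bloom filter sketch makes these properties concrete: elements are stored only as bits, they cannot be listed back, and a membership test may report a false positive but never a false negative. The sizes, the use of SHA-256, and the class name `BloomFilter` are choices made for this illustration and say nothing about how SociaLite implements its Bloom-filter columns.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit array plus several hash positions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if every position is set; may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical usage: remember a set of friend ids in 128 bytes.
friends = BloomFilter()
for friend_id in [3, 17, 42]:
    friends.add(friend_id)
print(42 in friends)  # True
```

An added element always tests as present because all of its bits were set; an element that was never added usually tests as absent, but can collide with bits set by others.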
Analysis example
 Social Network (friendship)
 Each person’s friends-of-friends
 Count the # of people in startup
 Cal...
Approximation
Foaf(i, f) :- Friend(i, f).
Foaf(i, ff) :- Friend(i, f), Friend(f, ff).
StartupScore(i, $inc(1)) :- Foaf(i, f...
System Overview
[Figure: a Master and three Workers on top of a Distributed File System, with arrows labeled "query" and "compiled query"]
System Overview
[Figure: a Master and three Workers on top of a Distributed File System, with arrows labeled "idle msg"]
 Query compiler
 Parser
 Analyzer
 Code generator (Java source code)
 Bytecode compiler
 Task scheduler
 Worker thr...
SociaLite: High-level Query Language for Big Data Analysis

  1. 1. SociaLite: High-level Query Language for Big Data Analysis Jiwon Seo, *Jongsoo Park, Jaeho Shin, Stephen Guo, and Monica S. Lam STANFORD MOBISOCIAL RESEARCH GROUP * INTEL PARALLEL RESEARCH LAB
  2. 2. Problems in existing platforms  Too difficult (low-level primitives)  Inefficient (not network bound)  Too many (sub) frameworks  Graph analysis  Data mining (or machine learning)  Relational query Why Another Big Data Platform?
  3. 3. SociaLite is a high-level query language  Easy & efficient  Compiled to distributed code  1,000x Hadoop  Hadoop compatible  Python integration  Good for  Graph analysis  Data mining  Relational queries Introducing SociaLite
  4. 4.  Concepts in SociaLite  Distributed Tables  Rules  Python Integration (Jython & CPython)  Analysis Algorithms  Shortest Paths, PageRank  K-Means, Logistic Regression  Evaluation  Demo Outline
  5. 5.  Primary data structure in SociaLite  Column oriented storage  <type>  Primitive types  Object types  opts  indexby, sortby, … Distributed In-Memory Tables Table (<type> cx, …, (<type> cy,… (<type> cz…))) opts.
  6. 6. Distributed In-Memory Tables Foo(int x, int y). 1 9 1 10 2 5 Bar[int x](int y). Foo(int x, (int y)). 9 7 1 2 9 1 2 3 4 9 7 2 8 Machine 1 Machine 2 Bar[int x:0..10](int y). Machine 1 Machine 2 1 2 2 8 3 4 9 7 9 10 5 7
  7. 7. Table options  indexby <column>  sortby <column>  multiset Column options  range  (distributed) partition Distributed In-Memory Tables Foo(int x, int y) indexby x. Foo(int x, int y) sortby x. Foo(int x, int y) multiset. Foo(int x:0..100, int y). Foo[int x](int y).
  8. 8. Rules Foo[a](c) :- Bar[a](b), Qux[b](c). Rule head Rule body
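  9. 9. Rules Foo[a](c) :- Bar[a](b), Qux[b](c). Bar: (1, 2), (1, 3), (8, 4), (8, 7), (9, 11). Qux: (2, 9), (2, 10), (5, 4), (10, 7), (11, 9). Foo: no rows yet.
  10. 10. Rules Foo[a](c) :- Bar[a](b), Qux[b](c). Bar: (1, 2), (1, 3), (8, 4), (8, 7), (9, 11). Qux: (2, 9), (2, 10), (5, 4), (10, 7), (11, 9). Foo: (1, 9), (1, 10).
  11. 11. Rules Foo[a](c) :- Bar[a](b), Qux[b](c). Bar: (1, 2), (1, 3), (8, 4), (8, 7), (9, 11). Qux: (2, 9), (2, 10), (5, 4), (10, 7), (11, 9). Foo: (1, 9), (1, 10), (9, 9).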
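  12. 12. Distributed Join Foo[a](c) :- Bar[a](b), Qux[b](c). [Figure: the Bar row (1, 2) and the Qux row (2, 9), held on Machine 1 and Machine 2, are joined to produce the Foo row (1, 9)]
  13. 13. Distributed Join Foo[a](c) :- Qux[b](c), Bar[a](b). [Figure: the same Bar row (1, 2) and Qux row (2, 9) on Machine 1 and Machine 2, with Qux listed first in the rule body]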
  14. 14. Parallel Evaluation Foo[a](c) :- Bar[a](b), Qux[b](c). [Figure: Bar is split into partitions a, b, c, and d on each of Machine 1 and Machine 2]
  15. 15. Parallel Evaluation Foo[a](c) :- Bar[a](b), Qux[b](c). Foo[a](c) :- Bar1a[a](b), Qux[b](c). Foo[a](c) :- Bar1b[a](b), Qux[b](c). Foo[a](c) :- Bar1c[a](b), Qux[b](c). Foo[a](c) :- Bar1d[a](b), Qux[b](c). Foo[a](c) :- Bar2a[a](b), Qux[b](c). Foo[a](c) :- Bar2b[a](b), Qux[b](c). Foo[a](c) :- Bar2c[a](b), Qux[b](c). Foo[a](c) :- Bar2d[a](b), Qux[b](c). Machine 1 Machine 2
  16. 16. Aggregation Foo[a]($min(c)) :- Bar[a](b), Qux[b](c). The $min aggregate function is applied to tuples in Foo having the same first column value.
  17. 17.  Built-in aggregate functions  min, max, sum, avg, argmin  User-defined functions  in Java or Python Aggregation
  18. 18.  Head table also appears in rule body Foo(a,c) :- Foo(a,b), Bar(b,c).  Semantics – rule executed repeatedly until no changes to Foo Recursive Rules
  19. 19. SociaLite: Datalog Extensions for Efficient Social Network Analysis, ICDE’13 Distributed SociaLite: A Datalog-Based Language for Large-Scale Graph Analysis, VLDB’14 Recursive Rules `Edge(int s, (int t, double len)) indexby s. Path(int n, double dist) indexby n. ` `Path(t, $min(d)) :- t=$SRC, d=0; :- Path(n, d1), Edge(n, t, d2), d=d1+d2.` Shortest Path algorithm in recursion + aggregation
  20. 20.  SociaLite queries in Python code  `Queries are quoted in backtick`  Python  SociaLite  Python functions, variables are accessible in SociaLite queries  SociaLite tables are readable from Python Python Integration (Jython)
  21. 21. Python Integration print "This is Python code!"
  22. 22. Python Integration print "This is Python code!" # now we use SociaLite queries below `Foo[int i](String s). Foo[i](s) :- i=42, s="the answer".`
  23. 23. Python Integration print "This is Python code!" # now we use SociaLite queries below `Foo[int i](String s). Foo[i](s) :- i=42, s="the answer".` v="Python variable" `Foo[i](s) :- i=43, s=$v.`
  24. 24. Python Integration print "This is Python code!" # now we use SociaLite queries below `Foo[int i](String s). Foo[i](s) :- i=42, s="the answer".` v="Python variable" `Foo[i](s) :- i=43, s=$v.` @returns(str) def func(): return "Python func" `Foo[i](s) :- i=44, s=$func().`
  25. 25. Python Integration print "This is Python code!" # now we use SociaLite queries below `Foo[int i](String s). Foo[i](s) :- i=42, s="the answer".` v="Python variable" `Foo[i](s) :- i=43, s=$v.` @returns(str) def func(): return "Python func" `Foo[i](s) :- i=44, s=$func().` for i, s in `Foo[i](s)`: print i, s
  26. 26.  Graph algorithms  Shortest Paths  PageRank  Data mining algorithms  K-Means  Logistic regression Analysis Algorithms
  27. 27.  Shortest Path Graph Algorithm `Edge(int s, (int t, double len)) indexby s. Path(int n, double dist) indexby n. ` `Path(t, $min(d)) :- t=$SRC, d=0; :- Path(n, d1), Edge(n, t, d2), d=d1+d2.`
  28. 28.  PageRank Graph Algorithm `Rank(n, 0, r) :- Node(n), r=1.0/$N.` for t in range(30): `Rank(pi, $t+1, $sum(r)) :- Node(pi), r=0.15*1.0/$N; :- Rank(pj, $t, r1), Edge(pj, pi), EdgeCnt(pj, cnt), r=0.85*r1/cnt.`
  29. 29.  PageRank Graph Algorithm `Rank(n, 0, r) :- Node(n), r=1.0/$N.` for t in range(30): `Rank(pi, $t+1, $sum(r)) :- Node(pi), r=0.15*1.0/$N; :- Rank(pj, $t, r1), Edge(pj, pi), EdgeCnt(pj, cnt), r=0.85*r1/cnt.` d=damping factor (we used 0.85)
  30. 30.  K-Means Data Mining Algorithm for i in range(50): `Center(cid, $avg(p)) :- Data(id, p), Cluster(id, $i, c), cid=c.value.` `Cluster(id, $i+1, $argmin(idx, d)) :- Data(id, p), Center(idx, a), d=$getDiff(p, a).`
  31. 31.  Logistic Regression Data Mining Algorithm for i in range(0, 100): `Gradient($i, $sum(w)) :- Data(id, p), Weight($i, w1), dot=$dot(w1, p), y=$sigmoid(dot), w = $computeWeights(p, y).` `Weight($i+1, w) :- Weight($i, w1), Gradient($i, g), w=$vecSum(w1, g).`
  32. 32.  Single-thread performance  Multi-thread performance (on 16-core machine)  Distributed performance (up to 64 machines) Evaluation
  33. 33. Single-thread [Figure: Optimized Java vs SociaLite on Shortest Paths, PageRank, Mutual Neighbors, Connected Components, Triangles, and Clustering Coefficients; y-axis from 0 to 3] SociaLite is as fast as highly optimized Java, or ~30% slower than optimized C++
  34. 34. Multi-thread [Figure: six panels (PageRank, Mutual Neighbors, Connected Components, Triangle, Clustering Coefficients, Shortest Paths), each plotting execution time and parallelization speedup (time, speedup, ideal speedup) against the number of cores or threads, from 1 to 16]
  35. 35. Distributed [Figure: four log-scale panels (Breadth First Search, PageRank, Triangle, Collaborative Filtering) plotting execution time or time per iteration in seconds at 1, 4, 16, and 64 on the x-axis for Native, Combblas, Graphlab, Socialite, and Giraph]
  36. 36. SociaLite is  Distributed query language  Easy and efficient  Integration with Python  Algorithms in SociaLite (graph, data mining)  Competitive performance Summary
  37. 37. jiwon @ stanford.edu http://socialite.stanford.edu Questions?
  38. 38. Two experimental front-ends IPython Gephi GitHub data analysis SociaLite + Gephi Project/developer network  Edge if developer contributes to project Demo
  39. 39.  Custom memory allocator (temporary table)  Optimized serialization  Direct ByteBuffer (network buffer)  Multiple network channels among workers System Optimizations
  40. 40. Inside Worker Node Recv’er Worker worker master Sender Network Buffer Pool
  41. 41. System Overview standalone mode Compiler Python Integration (preprocessing) Worker thread, Worker thread, Worker thread Eval Task Builder
  42. 42. System Overview distributed mode Worker Worker Worker Master Distributed File System (HDFS)
  43. 43.  Table column can be  Bloom filter  Sketches Approximation
  44. 44. Bloom Filter  Probabilistic set data structure  Elements represented as bits  Cannot enumerate elements  Quickly (approximately) computes set membership  can have false positives, but not false negatives Approximation
  45. 45. Analysis example  Social Network (friendship)  Each person’s friends-of-friends  Count the # of people in startup  Call it a Startup Score Approximation
  46. 46. Approximation Foaf(i, f) :- Friend(i, f). Foaf(i, ff) :- Friend(i, f), Friend(f, ff). StartupScore(i, $inc(1)) :- Foaf(i, ff), WorkAt(ff, "Startup").
  47. 47. Approximation Foaf(i, f) :- Friend(i, f). Foaf(i, ff) :- Friend(i, f), Friend(f, ff). StartupScore(i, $inc(1)) :- Foaf(i, ff), WorkAt(ff, "Startup"). (2nd column of Foaf table is represented with a Bloom filter)
  48. 48. Approximation Foaf(i, f) :- Friend(i, f). Foaf(i, ff) :- Friend(i, f), Friend(f, ff). StartupScore(i, $inc(1)) :- Foaf(i, ff), WorkAt(ff, "Startup"). (2nd column of Foaf table is represented with a Bloom filter) Exact vs. approximation: Exec time (min): exact 28.9, approximation 19.4 (32.8% faster); Memory usage (GB): exact 26.0, approximation 3.0 (11.5% usage); Accuracy (<10% error): exact 100.0%, approximation 92.5%
  49. 49. System Overview Worker Worker Worker Master Distributed File System query compiled query compiled query
  50. 50. System Overview Worker Worker Worker Master Distributed File System idle msg idle msg
  51. 51.  Query compiler  Parser  Analyzer  Code generator (Java source code)  Bytecode compiler  Task scheduler  Worker threads  Network IO threads System Components Master Node Worker Node
