MLconf NYC Shan Shan Huang


  1. Smart database for next-generation applications
     LOGICBLOX - SIMPLIFYING YOUR DATA STACK
     MLConf NY, 2014.04.11
  2. AREN’T THERE ENOUGH DATABASES?
     ©2014. LogicBlox. All Rights Reserved.
  3. IN 2007 THE SMARTPHONE UNIFIED CONSUMER DEVICES
  4. IN 2007 THE SMARTPHONE UNIFIED CONSUMER DEVICES
     Is a similar revolution coming in databases?
  5. OUR MISSION
     ▪ Be the iPhone of databases
     ▪ “Hybrid Transaction/Analytical Processing”, Gartner, Jan. 2014
     ▪ One database to replace many specialized databases
       ▪ Transactional (e.g. Oracle, VoltDB, NuoDB)
       ▪ Analytical (e.g. Teradata, Redshift, Hadoop)
       ▪ Graphs
       ▪ Documents
       ▪ ...
     Footnote: for a certain class of applications
  7. SHOW ME
  8. FIRST THINGS FIRST
     ▪ Declarative query language
       ▪ Based on Datalog
     ▪ ACID transactions
       ▪ In fact… full serializability
     ▪ Built from scratch, not by stitching together different databases under the hood
  9. CLIQUES IN LOGIQL
     3-Clique (Triangle Queries):
         3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c).
     4-Clique:
         4cliques(a, b, c, d) <- edge(a, b), edge(a, c), edge(a, d), edge(b, c), edge(b, d), edge(c, d).
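For readers unfamiliar with Datalog-style rules, the 3-clique rule can be read as a three-way self-join on `edge`. The following is a minimal Python sketch of that reading, not LogicBlox code: the function name `three_cliques` and the sample edge list are hypothetical, and the nested-loop strategy is chosen for clarity rather than taken from the slides.

```python
def three_cliques(edges):
    """Return every (a, b, c) with edge(a, b), edge(a, c), edge(b, c)."""
    edge_set = set(edges)
    # Index successors so edge(a, c) can be enumerated for a fixed a.
    succ = {}
    for a, b in edge_set:
        succ.setdefault(a, set()).add(b)
    result = set()
    for a, b in edge_set:              # edge(a, b)
        for c in succ.get(a, ()):      # edge(a, c)
            if (b, c) in edge_set:     # edge(b, c)
                result.add((a, b, c))
    return result


# Hypothetical sample data: one triangle plus a dangling edge.
sample_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
triangles = three_cliques(sample_edges)
```

Like the LogiQL rule, the sketch treats edges as directed, so each triangle is reported once per orientation that satisfies all three atoms.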
  10. 3-CLIQUE IN LOGIQL vs. SQL
      SQL:
          SELECT DISTINCT v1.x AS x, v2.x AS y, v3.x AS w
          FROM edge AS v1, edge AS v2, edge AS v3
          WHERE v1.y = v2.x AND v2.y = v3.x
            AND EXISTS (SELECT 1 FROM edge AS vv1
                        WHERE vv1.x = v1.x AND vv1.y = v3.x);
      LogiQL:
          3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c).
  11. 3-CLIQUE IN LOGIQL vs. SPARQL
      SPARQL:
          sparql
          PREFIX g: <http://logicblox.com/graph>
          SELECT DISTINCT ?av ?bv ?cv
          FROM <$database>
          WHERE {
            ?a g:edge ?b .
            ?a g:edge ?c .
            ?b g:edge ?c .
            ?a g:value ?av .
            ?b g:value ?bv .
            ?c g:value ?cv .
            FILTER (xsd:int(?av) < xsd:int(?bv) and xsd:int(?bv) < xsd:int(?cv))
          };
      LogiQL:
          3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c).
  12. 3-CLIQUE IN LOGIQL vs. GRAPHLAB
      GraphLab - C++:
          class triangle_count :
              public graphlab::ivertex_program<graph_type, set_union_gather> {
           public:
            bool do_not_scatter;

            // Gather on all edges
            edge_dir_type gather_edges(icontext_type& context,
                                       const vertex_type& vertex) const {
              return graphlab::ALL_EDGES;
            }

            // Report the neighbor on the other end of the edge
            gather_type gather(icontext_type& context, const vertex_type& vertex,
                               edge_type& edge) const {
              set_union_gather gather;
              graphlab::vertex_id_type otherid =
                  edge.target().id() == vertex.id() ? edge.source().id()
                                                    : edge.target().id();
              size_t other_nbrs = (edge.target().id() == vertex.id())
                  ? (edge.source().num_in_edges() + edge.source().num_out_edges())
                  : (edge.target().num_in_edges() + edge.target().num_out_edges());
              size_t my_nbrs = vertex.num_in_edges() + vertex.num_out_edges();
              if (PER_VERTEX_COUNT || (other_nbrs > my_nbrs) ||
                  (other_nbrs == my_nbrs && otherid > vertex.id())) {
                gather.v = otherid;
              }
              return gather;
            }

            // Store the gathered neighbor set on the vertex
            void apply(icontext_type& context, vertex_type& vertex,
                       const gather_type& neighborhood) {
              do_not_scatter = false;
              if (neighborhood.vid_vec.size() == 0) {
                vertex.data().vid_set.clear();
                if (neighborhood.v != (graphlab::vertex_id_type(-1)))
                  vertex.data().vid_set.vid_vec.push_back(neighborhood.v);
              } else {
                vertex.data().vid_set.assign(neighborhood.vid_vec);
              }
              do_not_scatter = vertex.data().vid_set.size() == 0;
            }

            edge_dir_type scatter_edges(icontext_type& context,
                                        const vertex_type& vertex) const {
              if (do_not_scatter) return graphlab::NO_EDGES;
              else return graphlab::OUT_EDGES;
            }

            // Count common neighbors of the two endpoints of each edge
            void scatter(icontext_type& context, const vertex_type& vertex,
                         edge_type& edge) const {
              const vertex_data_type& srclist = edge.source().data();
              const vertex_data_type& targetlist = edge.target().data();
              if (targetlist.vid_set.size() < srclist.vid_set.size())
                edge.data() += count_set_intersect(targetlist.vid_set, srclist.vid_set);
              else
                edge.data() += count_set_intersect(srclist.vid_set, targetlist.vid_set);
            }
          };
      LogiQL:
          3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c).
  13. 4-CLIQUE - SYNTHETIC DATA
  14. 4-CLIQUE - REAL DATA
  15. SEMANTIC WEB - LUBM
  16. DATA WAREHOUSE - TPC-H
  17. A NON-TRIVIAL EXAMPLE: PAGERANK IN LOGIQL
          d[] = 0.85f.          // damping factor
          tolerance[] = 0.01f.  // stop when the change in pr is small enough
          pr[p] = 1.0f / node_count[] <- node(p), !pr[p] = _.  // initial pr
          pr[p] = (1.0f - d[]) + (d[] * sum[p]) <- abs[r - pr[p]] > tolerance[].
          pr[p] = pr[p] <- r = (1.0f - d[]) + (d[] * sum[p]), !(abs[r - pr[p]] > tolerance[]).
          pr[p] = pr[p] <- !sum[p] = _.
          sum[n] = t <- agg<< t = total(r) >> edge(p, n), r = pr[p] / out_count[p].
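The PageRank rules can be read as a fixed-point iteration: each page starts at 1 / node_count, and its rank is replaced by (1 - d) + d * sum only while the change exceeds the tolerance. The Python sketch below mirrors that reading under stated assumptions: the function name `pagerank`, the `max_rounds` safety cap, and the synchronous round-by-round update order are illustrative choices, not taken from the slides or from LogicBlox's evaluation strategy.

```python
def pagerank(nodes, edges, d=0.85, tolerance=0.01, max_rounds=100):
    """Iterate pr[n] = (1 - d) + d * sum(pr[p] / out_count[p]) over edges (p, n)."""
    out_count = {}
    incoming = {}
    for p, n in edges:
        out_count[p] = out_count.get(p, 0) + 1
        incoming.setdefault(n, []).append(p)

    # Initial rank, as in the rule for pages that have no rank yet.
    pr = {p: 1.0 / len(nodes) for p in nodes}

    for _ in range(max_rounds):  # max_rounds is an assumed safety cap
        new_pr = dict(pr)
        changed = False
        # Pages without incoming edges have no sum and keep their rank.
        for n, sources in incoming.items():
            total = sum(pr[p] / out_count[p] for p in sources)
            r = (1.0 - d) + d * total
            if abs(r - pr[n]) > tolerance:
                new_pr[n] = r
                changed = True
        pr = new_pr
        if not changed:
            break
    return pr


# Hypothetical two-page graph: x links to y.
ranks = pagerank(["x", "y"], [("x", "y")])
```

Note that the slide uses the unnormalized form (1 - d) rather than (1 - d) / N, so ranks produced this way do not sum to 1.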
  18. HOW DOES IT WORK
  19. ALGORITHMS FIRST
      “Computer science is now about systems. It hasn’t been about algorithms since the 1960’s.” -- Alan Kay #hlf13
      (tweeted by Computer Science @CompSciFact, Sep 28)
  20. PHILOSOPHY: BRAINS BEFORE BRAWN
      ▪ Algorithmic scalability
        ▪ New worst-case optimal join algorithm
        ▪ Incremental maintenance proportional to trace edit distance
        ▪ Adaptive domain decomposition for parallelization
      ▪ Data structures
        ▪ Compression close to info-theoretic limit in some cases
        ▪ I/O minimization, cache consciousness
        ▪ Persistent data structures: full serializability, branch & merge, auditability, scalable distribution
      ▪ Unified declarative programming model
        ▪ Optimizations through aggressive analysis
      ▪ Brute force
        ▪ In-memory when data fits
        ▪ Distribution across thousands of cores, and GPUs
  21. A SMART JOIN ALGORITHM - LFTJ
      ▪ “Leapfrog Triejoin: A Simple, Worst-Case Optimal Join Algorithm”, T. Veldhuizen, ICDT 2014
      ▪ Best Newcomer Award
  22. LFTJ INTUITION: CONSIDER MORE THAN PAIRS
      ▪ Widely adopted technique: pair-wise joins
      ▪ Suppose A, B, and C each have 1 million records distributed over 3 months
      ▪ Pair-wise join: best-case scenario, 0.5 million records as intermediate results
      ▪ LFTJ: no records materialized
      [Figure: relations A(x), B(x), C(x) over Jan, Feb, Mar]
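The intuition above can be illustrated with the single-variable building block of Leapfrog Triejoin: intersecting several sorted relations at once by repeatedly seeking each one forward to the largest key seen so far. The Python sketch below is a simplified illustration under stated assumptions (sorted, duplicate-free lists of keys; binary search via `bisect_left` standing in for a trie iterator's seek); the name `leapfrog_intersect` is hypothetical and this is not the LogicBlox implementation.

```python
from bisect import bisect_left


def leapfrog_intersect(*relations):
    """Intersect sorted, duplicate-free key lists without pairwise intermediates."""
    if not relations or any(len(r) == 0 for r in relations):
        return []
    k = len(relations)
    pos = [0] * k                       # current position in each relation
    result = []
    target = max(r[0] for r in relations)
    agree = 0                           # consecutive relations positioned at target
    i = 0
    while True:
        rel = relations[i]
        # Seek: jump to the first key >= target, never moving backwards.
        pos[i] = bisect_left(rel, target, pos[i])
        if pos[i] == len(rel):
            return result               # one relation is exhausted
        if rel[pos[i]] == target:
            agree += 1
        else:
            target = rel[pos[i]]        # overshoot: this key is the new target
            agree = 1
        if agree >= k:                  # every relation contains target
            result.append(target)
            pos[i] += 1
            if pos[i] == len(rel):
                return result
            target = rel[pos[i]]
            agree = 1
        i = (i + 1) % k


# Hypothetical keys: every pair of relations overlaps, but no key is in all three.
A = [1, 2, 3, 4]
B = [3, 4, 5, 6]
C = [1, 2, 5, 6]
common = leapfrog_intersect(A, B, C)
```

In this example a pair-wise plan would first materialize a two-element intermediate result such as A ∩ B before discovering that C eliminates it, whereas the loop above only advances cursors and stores nothing until a key is confirmed in all three inputs.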
  23. SMARTER INCREMENTAL VIEW MAINTENANCE
      ▪ “Incremental Maintenance for Leapfrog Triejoin”, T. Veldhuizen, 2013
        ▪ http://arxiv.org/abs/1303.5313
      ▪ Replaced our implementation of Count and DRed algorithms [Gupta+ 93]
      ▪ Guarantees that work is done proportional to the trace edit distance between the before and after
      ▪ Critical for allowing caching analytical views for performance, but still incorporating real-time updates
  24. INCREMENTALIZING 3-CLIQUE VIEW
      LogicBlox - Algebraic
          +3cliques(a, b, c) <- +edge(a, b), edge(a, c), edge(b, c).
          +3cliques(a, b, c) <- edge(a, b), +edge(a, c), edge(b, c).
          +3cliques(a, b, c) <- edge(a, b), edge(a, c), +edge(b, c).
      DRed - Syntactic
          3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c).
      [Figure: edge(a, b), edge(a, c), edge(b, c)]
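The three `+3cliques` rules on this slide say that a newly inserted edge can create a triangle only by playing the role of one of the three `edge` atoms. The Python sketch below applies that idea to a single inserted edge and checks it against full recomputation. It is an illustration of delta rules in general, under assumed names (`new_three_cliques`, `all_three_cliques`) and hypothetical data; it does not reproduce the trace-based algorithm from the paper cited on the previous slide.

```python
def all_three_cliques(edges):
    """Full recomputation, used only to check the incremental result."""
    return {(a, b, c)
            for (a, b) in edges
            for (a2, c) in edges if a2 == a
            if (b, c) in edges}


def new_three_cliques(edges, new_edge):
    """Triangles gained by inserting new_edge into the set `edges`."""
    if new_edge in edges:
        return set()
    after = edges | {new_edge}
    x, y = new_edge
    succ, pred = {}, {}
    for a, b in after:
        succ.setdefault(a, set()).add(b)
        pred.setdefault(b, set()).add(a)
    empty = set()
    delta = set()
    # +edge(a, b) with a = x, b = y: need edge(x, c) and edge(y, c).
    for c in succ.get(x, empty) & succ.get(y, empty):
        delta.add((x, y, c))
    # +edge(a, c) with a = x, c = y: need edge(x, b) and edge(b, y).
    for b in succ.get(x, empty) & pred.get(y, empty):
        delta.add((x, b, y))
    # +edge(b, c) with b = x, c = y: need edge(a, x) and edge(a, y).
    for a in pred.get(x, empty) & pred.get(y, empty):
        delta.add((a, x, y))
    return delta


# Hypothetical update: inserting (2, 3) closes the triangle (1, 2, 3).
before = {(1, 2), (1, 3)}
gained = new_three_cliques(before, (2, 3))
```

The work done is bounded by the neighborhoods of the two endpoints of the inserted edge, rather than by the size of the whole `edge` relation.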
  25. INCREMENTAL MAINTENANCE OF 4-CLIQUE
  26. A PARTICULAR USE CASE OF LB FOR GRAPHS
  27. SCREAMING FAST PROGRAM ANALYSIS
      ▪ Order of magnitude faster than prior art
      ▪ Program analysis is graph analysis
      ▪ “Strictly Declarative Specification of Sophisticated Points-to Analyses” (OOPSLA ’09)
      ▪ “Exception Analysis and Points-to Analysis - Better Together” (ISSTA ’09)
      ▪ “Pick Your Context Well - Understanding Object-Sensitivity” (POPL ’11)
      ▪ “Efficient and Effective Handling of Exceptions in Java Points-to Analysis” (CC ’13)
      ▪ “Hybrid Context Sensitivity for Points-to Analysis” (PLDI ’13)
      ▪ “Set-based Pre-processing for Points-to Analysis” (OOPSLA ’13)
  28. PROGRAM ANALYSIS IS ALL ABOUT GRAPH ANALYSIS
  29. COMPARED TO PRIOR ART: >10x
  30. ...AND THAT WAS ON PRIOR-ART LOGICBLOX
  31. RECAP
      ▪ LogicBlox: the iPhone of databases
        ▪ But perhaps the $10k camera of graph queries?
      ▪ Holy Grails
        ▪ Declarative query language: LogiQL
        ▪ ACID transactions
      ▪ Guiding Principle: Brains before Brawn
        ▪ Innovate on algorithms: LFTJ, incremental view maintenance, etc.
        ▪ Innovate on data structures
        ▪ Declarative language allows aggressive optimizations
        ▪ Brute force when necessary
  32. THANK YOU
