MLconf NYC Shan Shan Huang


Transcript

  • 1. Smart database for next-generation applications LOGICBLOX - SIMPLIFYING YOUR DATA STACK MLConf NY, 2014.04.11
  • 2. AREN’T THERE ENOUGH DATABASES? ©2014. LogicBlox. All Rights Reserved.
  • 3. IN 2007 THE SMARTPHONE UNIFIED CONSUMER DEVICES
  • 4. IN 2007 THE SMARTPHONE UNIFIED CONSUMER DEVICES. Is a similar revolution coming in databases?
  • 5. OUR MISSION ▪ Be the iPhone of databases ▪ “Hybrid Transaction Analytical Processing”, Gartner, Jan. 2014 ▪ One database to replace many specialized databases ▪ Transactional (e.g. Oracle, VoltDB, NuoDB) ▪ Analytical (e.g. Teradata, Redshift, Hadoop) ▪ Graphs ▪ Documents ▪ ... Footnote: for a certain class of applications
  • 7. SHOW ME
  • 8. FIRST THINGS FIRST ▪ Declarative query language ▪ Based on Datalog ▪ ACID transactions ▪ In fact… full serializability ▪ Built from scratch, not by stitching together different databases under the hood.
  • 9. CLIQUES IN LOGIQL ▪ 3 Clique - Triangle Queries: 3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c). ▪ 4 Clique: 4cliques(a, b, c, d) <- edge(a, b), edge(a, c), edge(a, d), edge(b, c), edge(b, d), edge(c, d).
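
To make the meaning of the 3-clique rule concrete, here is a minimal Python sketch of what it computes: every triple (a, b, c) for which edge(a, b), edge(a, c), and edge(b, c) all hold. The function name `three_cliques` and the sample edges are illustrative assumptions, not part of LogicBlox, and this naive loop says nothing about how LogicBlox actually evaluates the rule.

```python
def three_cliques(edges):
    """Return all (a, b, c) with edge(a, b), edge(a, c), edge(b, c) in `edges`.

    Hypothetical helper for illustration; edges are directed pairs, as in the rule.
    """
    edge_set = set(edges)
    # Index the successors of each node so edge(a, c) can be looked up by a.
    successors = {}
    for a, b in edge_set:
        successors.setdefault(a, set()).add(b)
    result = set()
    for a, b in edge_set:                  # edge(a, b)
        for c in successors.get(a, ()):    # edge(a, c)
            if (b, c) in edge_set:         # edge(b, c)
                result.add((a, b, c))
    return result


# Made-up sample data: one triangle 1-2-3 plus a dangling edge.
sample_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
triangles = three_cliques(sample_edges)
```

With the sample data, the only triple satisfying all three conditions is (1, 2, 3).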
  • 10. 3 CLIQUE IN LOGIQL VS. SQL ▪ SQL: SELECT DISTINCT v1.x AS x, v2.x AS y, v3.x AS w FROM edge AS v1, edge AS v2, edge AS v3 WHERE v1.y = v2.x AND v2.y = v3.x AND EXISTS( SELECT 1 FROM edge AS vv1 WHERE vv1.x = v1.x AND vv1.y = v3.x); ▪ LogiQL: 3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c).
  • 11. 3 CLIQUE IN LOGIQL VS. SPARQL ▪ SPARQL: sparql PREFIX g: <http://logicblox.com/graph> SELECT DISTINCT ?av ?bv ?cv FROM <$database> WHERE { ?a g:edge ?b . ?a g:edge ?c . ?b g:edge ?c . ?a g:value ?av . ?b g:value ?bv . ?c g:value ?cv . FILTER (xsd:int(?av) < xsd:int(?bv) and xsd:int(?bv) < xsd:int(?cv)) }; ▪ LogiQL: 3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c).
  • 12. 3-CLIQUE IN LOGIQL VS. GRAPHLAB ▪ GraphLab - C++: class triangle_count : public graphlab::ivertex_program<graph_type, set_union_gather> { public: bool do_not_scatter; /* Gather on all edges */ edge_dir_type gather_edges(icontext_type& context, const vertex_type& vertex) const { return graphlab::ALL_EDGES; } gather_type gather(icontext_type& context, const vertex_type& vertex, edge_type& edge) const { set_union_gather gather; graphlab::vertex_id_type otherid = edge.target().id() == vertex.id() ? edge.source().id() : edge.target().id(); size_t other_nbrs = (edge.target().id() == vertex.id()) ? (edge.source().num_in_edges() + edge.source().num_out_edges()) : (edge.target().num_in_edges() + edge.target().num_out_edges()); size_t my_nbrs = vertex.num_in_edges() + vertex.num_out_edges(); if (PER_VERTEX_COUNT || (other_nbrs > my_nbrs) || (other_nbrs == my_nbrs && otherid > vertex.id())) { gather.v = otherid; } return gather; } void apply(icontext_type& context, vertex_type& vertex, const gather_type& neighborhood) { do_not_scatter = false; if (neighborhood.vid_vec.size() == 0) { vertex.data().vid_set.clear(); if (neighborhood.v != (graphlab::vertex_id_type(-1))) vertex.data().vid_set.vid_vec.push_back(neighborhood.v); } else vertex.data().vid_set.assign(neighborhood.vid_vec); do_not_scatter = vertex.data().vid_set.size() == 0; } edge_dir_type scatter_edges(icontext_type& context, const vertex_type& vertex) const { if (do_not_scatter) return graphlab::NO_EDGES; else return graphlab::OUT_EDGES; } void scatter(icontext_type& context, const vertex_type& vertex, edge_type& edge) const { const vertex_data_type& srclist = edge.source().data(); const vertex_data_type& targetlist = edge.target().data(); if (targetlist.vid_set.size() < srclist.vid_set.size()) edge.data() += count_set_intersect(targetlist.vid_set, srclist.vid_set); else edge.data() += count_set_intersect(srclist.vid_set, targetlist.vid_set); ▪ LogiQL: 3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c).
  • 13. 4 CLIQUE - SYNTHETIC DATA
  • 14. 4 CLIQUE - REAL DATA
  • 15. SEMANTIC WEB - LUBM
  • 16. DATA WAREHOUSE - TPC-H
  • 17. A NON-TRIVIAL EXAMPLE: PAGERANK IN LOGIQL
    d[] = 0.85f. // damping factor
    tolerance[] = 0.01f. // stop once the change in pr is small enough
    pr[p] = 1.0f / node_count[] <- node(p), !pr[p] = _. // initial pr
    pr[p] = (1.0f - d[]) + (d[] * sum[p]) <- abs[r - pr[p]] > tolerance[].
    pr[p] = pr[p] <- r = (1.0f - d[]) + (d[] * sum[p]), !(abs[r - pr[p]] > tolerance[]).
    pr[p] = pr[p] <- !sum[p] = _.
    sum[n] = t <- agg<< t = total(r) >> edge(p, n), r = pr[p] / out_count[p].
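
For readers unfamiliar with LogiQL, the following Python sketch is one reading of the PageRank rules on this slide, written as an explicit loop. It is an interpretation under stated assumptions, not LogicBlox code: the function name `pagerank`, the `max_rounds` safety cap, and the round-by-round evaluation order are all assumptions, and the slide's rules as transcribed may be incomplete.

```python
def pagerank(nodes, edges, d=0.85, tolerance=0.01, max_rounds=100):
    """Iterate pr[n] = (1 - d) + d * sum[n] until no node moves by more than tolerance.

    sum[n] is the total of pr[p] / out_count[p] over all edges (p, n), as in the
    aggregation rule. `max_rounds` is an assumed safety cap, not on the slide.
    """
    out_count = {n: 0 for n in nodes}
    for p, _ in edges:
        out_count[p] += 1

    # Initial rank: 1 / node_count for every node.
    pr = {n: 1.0 / len(nodes) for n in nodes}

    for _ in range(max_rounds):
        # sum[n] exists only for nodes with at least one incoming edge.
        incoming = {}
        for p, n in edges:
            incoming[n] = incoming.get(n, 0.0) + pr[p] / out_count[p]

        changed = False
        new_pr = dict(pr)  # nodes without incoming edges keep their rank
        for n, s in incoming.items():
            r = (1.0 - d) + d * s
            if abs(r - pr[n]) > tolerance:
                new_pr[n] = r  # change is large enough: take the new value
                changed = True
            # otherwise the node keeps its current rank
        pr = new_pr
        if not changed:
            break
    return pr


# Made-up example: a three-node cycle, where all ranks stay equal by symmetry.
cycle_ranks = pagerank(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")])
```

Because each node stops updating once its change drops below the tolerance, the result is an approximation whose accuracy is bounded by `tolerance` and the damping factor.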
  • 18. HOW DOES IT WORK
  • 19. ALGORITHMS FIRST Computer Science @CompSciFact Sep 28 “Computer science is now about systems. It hasn’t been about algorithms since the 1960’s.” -- Alan Kay #hlf13
  • 20. PHILOSOPHY: BRAINS BEFORE BRAWN ▪ Algorithmic scalability ▪ New worst-case optimal join algorithm ▪ Incremental maintenance proportional to trace edit distance ▪ Adaptive domain decomposition for parallelization ▪ Data structures ▪ Compression close to info-theoretic limit in some cases ▪ I/O minimization, cache consciousness ▪ Persistent data structures: full serializability, branch & merge, auditability, scalable distribution ▪ Unified declarative programming model ▪ Optimizations through aggressive analysis ▪ Brute force ▪ In-memory when data fits ▪ Distribution across thousands of cores, and GPUs
  • 21. A SMART JOIN ALGORITHM - LFTJ ▪ “Leapfrog Triejoin: A Simple, Worst-Case Optimal Join Algorithm”, T. Veldhuizen, ICDT 2014 ▪ Best Newcomer Award
  • 22. LFTJ INTUITION: CONSIDER MORE THAN PAIRS ▪ Widely adopted technique: pair-wise joins ▪ Suppose A, B, and C each have 1 million records distributed over 3 months ▪ Pair-wise join: best case scenario, 0.5 million records as intermediate results ▪ LFTJ: no records materialized [Figure: relations A(x), B(x), C(x) across Jan, Feb, Mar]
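
The intuition on this slide can be sketched with the unary "leapfrog" intersection, the building block that Leapfrog Triejoin applies level by level. The Python below is a simplified illustration, not the LogicBlox implementation: it assumes each relation is a sorted, duplicate-free list, and the function name `leapfrog_intersect` and the sample data are made up (the exact distribution in the slide's figure is not recoverable from the transcript).

```python
from bisect import bisect_left


def leapfrog_intersect(*relations):
    """Intersect sorted, duplicate-free lists without materializing pairwise results."""
    if not relations or any(len(r) == 0 for r in relations):
        return []
    k = len(relations)
    pos = [0] * k                                  # one cursor per relation
    out = []
    candidate = max(r[0] for r in relations)       # largest first key
    agree = 0                                      # consecutive relations containing candidate
    i = 0
    while True:
        rel = relations[i]
        # Seek: move this cursor to the first key >= candidate.
        pos[i] = bisect_left(rel, candidate, pos[i])
        if pos[i] == len(rel):
            return out                             # one relation is exhausted
        if rel[pos[i]] == candidate:
            agree += 1
            if agree >= k:                         # every relation contains candidate
                out.append(candidate)
                pos[i] += 1
                if pos[i] == len(rel):
                    return out
                candidate = rel[pos[i]]
                agree = 1
        else:
            candidate = rel[pos[i]]                # leap over the gap
            agree = 1
        i = (i + 1) % k


# Made-up data: keys 0-9 stand for Jan, 10-19 for Feb, 20-29 for Mar.
A = list(range(0, 20))                             # Jan + Feb
B = list(range(10, 30))                            # Feb + Mar
C = list(range(0, 10)) + list(range(20, 30))       # Jan + Mar
pairwise_ab = sorted(set(A) & set(B))              # intermediate result of a pair-wise plan
three_way = leapfrog_intersect(A, B, C)
```

In this example a pair-wise plan first materializes the ten Feb keys shared by A and B, only to discard all of them against C, while the leapfrog version finds the empty result after a handful of seeks.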
  • 23. SMARTER INCREMENTAL VIEW MAINTENANCE ▪ “Incremental Maintenance for Leapfrog Triejoin”, T. Veldhuizen, 2013 ▪ http://arxiv.org/abs/1303.5313 ▪ Replaced our implementation of Count and DRed algorithms [Gupta+ 93] ▪ Guarantees that work is done proportional to the trace edit distance between the before and after ▪ Critical for caching analytical views for performance while still incorporating real-time updates
  • 24. INCREMENTALIZING 3 CLIQUE VIEW LogicBlox - Algebraic +3cliques(a, b, c) <- +edge(a, b), edge(a, c), edge(b, c). +3cliques(a, b, c) <- edge(a, b), +edge(a, c), edge(b, c). +3cliques(a, b, c) <- edge(a, b), edge(a, c), +edge(b, c). DRed - Syntactic 3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c). edge(a, b) edge(a, c) edge(b, c)
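
The three `+3cliques` delta rules on this slide can be read as: when one edge is inserted, try it in each of the three edge positions and join against the other two. The Python sketch below follows that reading for a single insertion. It is an illustration only, neither DRed nor the trace-based algorithm from the paper; the function names are hypothetical, and evaluating the non-delta atoms against the edge set after the insertion is an assumption.

```python
def three_cliques(edges):
    """Full recomputation, used here only to check the incremental result."""
    return {(a, b, c)
            for (a, b) in edges
            for (a2, c) in edges
            if a2 == a and (b, c) in edges}


def delta_three_cliques(edges, new_edge):
    """Triples gained by inserting `new_edge` into `edges` (a set of pairs)."""
    if new_edge in edges:
        return set()
    after = edges | {new_edge}
    x, y = new_edge
    delta = set()
    # +edge(a, b) with a = x, b = y: need edge(x, c) and edge(y, c).
    for (p, c) in after:
        if p == x and (y, c) in after:
            delta.add((x, y, c))
    # +edge(a, c) with a = x, c = y: need edge(x, b) and edge(b, y).
    for (p, b) in after:
        if p == x and (b, y) in after:
            delta.add((x, b, y))
    # +edge(b, c) with b = x, c = y: need edge(a, x) and edge(a, y).
    for (a, q) in after:
        if q == x and (a, y) in after:
            delta.add((a, x, y))
    return delta


# Made-up example: inserting (2, 3) closes the triangle 1-2-3.
before = {(1, 2), (1, 3)}
gained = delta_three_cliques(before, (2, 3))
```

This sketch still scans the whole edge set per rule; a real system would use indexes so that the work depends on the size of the change rather than the size of the data.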
  • 25. INCREMENTAL MAINTENANCE OF 4-CLIQUE
  • 26. A PARTICULAR USE CASE OF LB FOR GRAPHS
  • 27. SCREAMING FAST PROGRAM ANALYSIS ▪ An order of magnitude faster than prior art ▪ Program analysis is graph analysis ▪ “Strictly Declarative Specification of Sophisticated Points-to Analyses” (OOPSLA ’09) ▪ “Exception Analysis and Points-to Analysis - Better Together” (ISSTA ’09) ▪ “Pick Your Context Well - Understanding Object-Sensitivity” (POPL ’11) ▪ “Efficient and Effective Handling of Exceptions in Java Points-to Analysis” (CC ’13) ▪ “Hybrid Context Sensitivity for Points-to Analysis” (PLDI ’13) ▪ “Set-based Pre-processing for Points-to Analysis” (OOPSLA ’13)
  • 28. PROGRAM ANALYSIS IS ALL ABOUT GRAPH ANALYSIS
  • 29. COMPARED TO PRIOR ART: >10x
  • 30. ...AND THAT WAS ON PRIOR ART LOGICBLOX
  • 31. RECAP ▪ LogicBlox: the iPhone of databases ▪ But perhaps the $10k camera of graph queries? ▪ Holy Grails ▪ Declarative query language: LogiQL ▪ ACID transactions ▪ Guiding Principle: Brains before Brawn ▪ Innovate on algorithms: LFTJ, incremental view maintenance, etc. ▪ Innovate on data structures ▪ Declarative language allows aggressive optimizations ▪ Brute force when necessary
  • 32. THANK YOU