In-Database Predictive Analytics

Predictive analytics have long lived in the domain of statistical tools like R. Increasingly, however, as companies struggle to deal with exploding volumes of data not easily analyzed by small data tools, they are looking at ways of doing predictive analytics directly inside the primary data store.

This approach, called in-database predictive analytics, eliminates the need to sample data and run a separate ETL process into a statistical tool, which can decrease total cost, improve the quality of predictive models, and dramatically shorten development time. In this class, you will learn the pros and cons of in-database predictive analytics, see its main limitations, and survey the tools and technologies needed to head down this path.

Published in: Technology

  1. 1. In-Database Predictive Analytics John A. De Goes, @jdegoes, john@precog.com
  2. 2. Agenda • Introduction • Abusing SQL • Painful by Design • Database Extensions • MADlib • Other Approaches • Summary
  3. 3. Introduction In-Database Predictive Analytics In-database predictive analytics refers to the process of performing advanced predictive analytics directly inside the database.
  4. 4. Introduction Traditional Predictive Analytics R database SAS
  5. 5. Introduction R database SAS Data Bottleneck: Painful, Slow
  6. 6. Introduction What’s the answer?
  7. 7. Introduction Move the Code, not the Data! Advanced Analytics “MapReduce”
  8. 8. Abusing SQL Let’s Do K-Means in SQL!
  9. 9. Abusing SQL General Approach in RDBMS SQL Driver Database Feedback
  10. 10. Abusing SQL Our Initial Model Table model: d (number of dimensions), k (number of clusters), n (number of points), iteration (number of iterations), avg_q (variance)
  11. 11. Abusing SQL Our Initial Data Set Table Y: columns Y1, Y2, Y3, ...; n rows
  12. 12. Abusing SQL Projection & Numbering Y (Y1, Y2, Y3, ...) → YH (i, Y1, ..., Yd), with i = 1, 2, ..., n; n rows. INSERT INTO YH SELECT sum(1) over(rows unbounded preceding) AS i, Y1, Y2, ..., Yd FROM Y;
  13. 13. Abusing SQL Flattening YH (i, Y1, ..., Yd) → YV (i, l, val); n x d rows. INSERT INTO YV SELECT i, 1, Y1 FROM YH; ... INSERT INTO YV SELECT i, d, Yd FROM YH;
  14. 14. Abusing SQL Initializing k Cluster Centers YH (i, Y1, ..., Yd) → CH (j, Y1, ..., Yd); k rows. INSERT INTO CH SELECT 1, Y1, ..., Yd FROM YH SAMPLE 1; ... INSERT INTO CH SELECT k, Y1, ..., Yd FROM YH SAMPLE 1;
  15. 15. Abusing SQL Flattening CH (j, Y1, ..., Yd) → C (l, j, val); d x k rows. INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1; ... INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k;
  16. 16. Abusing SQL Computing Distances to Clusters YD (i, j, dist); n x k rows. INSERT INTO YD SELECT i, j, sum((YV.val - C.val)**2) FROM YV, C WHERE YV.l = C.l GROUP BY i, j;
  17. 17. Abusing SQL Computing Nearest Neighbors YNN (i, j) holds the nearest cluster j for each point i; n rows. INSERT INTO YNN SELECT YD.i, YD.j FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND WHERE YD.i = YMIND.i and YD.dist = YMIND.mindist;
  18. 18. Abusing SQL Count Points Per Cluster INSERT INTO W SELECT j, count(*) FROM YNN GROUP BY j; UPDATE W SET w = w/model.n;
  19. 19. Abusing SQL Compute New Centroids INSERT INTO C SELECT l, j, avg(YV.val) FROM YV, YNN WHERE YV.i = YNN.i GROUP BY l, j;
  20. 20. Abusing SQL Compute Variances INSERT INTO R SELECT C.l, C.j, avg((YV.val- C.val)**2) FROM C, YV, YNN WHERE YV.i = YNN.i and YV.l = C.l and YNN.j = C.j GROUP BY C.l, C.j;
  21. 21. Abusing SQL Update Model INSERT INTO R SELECT C.l, C.j, avg((YV.val- C.val)**2) FROM C, YV, YNN WHERE YV.i = YNN.i and YV.l = C.l and YNN.j = C.j GROUP BY C.l, C.j;
  22. 22. Abusing SQL Let’s not do that again!
  23. 23. Painful by Design Why are predictive analytics so hard to express in SQL?
  24. 24. Painful by Design #1: No Arrays Sets Tuples Arrays rows columns
  25. 25. Painful by Design #2: Relational Algebra Sucks Operators: Projection, Selection, Rename, Natural Join, Semijoin, Antijoin, Division (R ÷ S), Theta Join, Left outer join (R ⟕ S), Right outer join (R ⟖ S), Full outer join (R ⟗ S), Aggregation (G1, G2, ..., Gm g f1(A1), f2(A2), ..., fk(Ak) (r)). Iteration, Recursion, Multiple Dimensions
  26. 26. Database Extensions There’s GOT to be a better way!
  27. 27. Database Extensions C Extension
  28. 28. Database Extensions UDF (User-Defined Function), analogous to Map: map(a), op2(a, b). UDA (User-Defined Aggregate), analogous to Reduce: init(a), accum(a, b), merge(a, b), final(a)
  29. 29. MADlib MADlib is an open-source library for scalable in-database analytics. It is implemented using database extensions written in C, and is available for PostgreSQL and Greenplum.
  30. 30. MADlib 1. Download the binary Mac OS X http://www.madlib.net/files/madlib-0.6-Darwin.dmg Linux http://www.madlib.net/files/madlib-0.6-Linux.rpm
  31. 31. MADlib 2. Start the Installation Mac OS X Double-click on installer Linux yum install $MADLIB_PACKAGE --nogpgcheck
  32. 32. MADlib 3. Verify Locatability Greenplum source /path/to/greenplum/greenplum_path.sh PostgreSQL Make sure psql is in PATH
  33. 33. MADlib 4. Register MADlib Greenplum /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install PostgreSQL /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install
  34. 34. MADlib 5. Test Installation Greenplum /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install-check PostgreSQL /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install-check
  35. 35. MADlib Clustering in MADlib SELECT * FROM kmeans_random( rel_source, expr_point, k, [ fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned ] );
  36. 36. MADlib Ahhhhhh......
  37. 37. MADlib Our Way or the Highway Composability
  38. 38. Other Approaches RDBMS Isn’t the Only Game in Town!
  39. 39. Other Approaches 1. Embrace Coding • Hadoop Ecosystem • Mahout, Cascading/Scalding, Crunch/Scrunch, Pangool, Cascalog, and, of course, MapReduce • BDAS Ecosystem • Spark
  40. 40. Other Approaches 2. Reject RDBMS • Datalog + variants • In theory, ideal for many kinds of predictive analytics • Suffers from a lack of distributed, feature-complete implementations
  41. 41. Other Approaches 2. Reject RDBMS • Rasdaman / RASQL • Arrays but not analytics Community Editions http://www.rasdaman.org
  42. 42. Other Approaches 2. Reject RDBMS • MonetDB / SciQL • Array extension of SQL • Poor analytics Community Editions http://www.monetdb.org
  43. 43. Other Approaches 2. Reject RDBMS • SciDB / AFL (AQL) • Excellent analytics • Limited composability Community Editions http://www.scidb.org/forum/viewtopic.php?f=16&t=364/
  44. 44. Other Approaches 2. Reject RDBMS • Precog / Quirrel (simple “R for big data”) • Multidimensional, arrays + functions • Still immature Community Editions http://www.precog.com/editions/precog-for-mongodb (MongoDB) http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)
  45. 45. Summary • Increase performance, reduce friction by doing more inside the database • Not a panacea • Hard to do in SQL • Hard to do in C (but you may not have to: MADlib) • Pre-canned & brittle in most databases • Ultimately what’s needed is tech designed for advanced analytics
  46. 46. Q&A John A. De Goes, @jdegoes, john@precog.com
  47. 47. References • Programming the K-means Clustering Algorithm in SQL (Teradata, NCR)
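
The K-means steps on slides 16 to 19 (distances, nearest clusters, new centroids), together with the driver-and-feedback loop of slide 9, can be sketched end to end as follows. This is a minimal illustration, not the deck's or the referenced paper's implementation: it assumes SQLite through Python's standard sqlite3 module instead of the Teradata-style dialect on the slides, so `**2` becomes an explicit multiplication and `SAMPLE 1` is replaced by deterministically picking the first and last point. The toy data, the `step` and `assignments` helpers, and the tie-breaking `min(YD.j)` are assumptions added for the example.

```python
import sqlite3

# Toy data (made up for illustration): two well-separated groups in 2-D.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);   -- point i, dimension l
    CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);   -- dimension l, cluster j
    CREATE TABLE YD  (i INTEGER, j INTEGER, dist REAL);  -- squared distances
    CREATE TABLE YNN (i INTEGER, j INTEGER);             -- nearest cluster per point
""")

# Load the points directly in the flattened layout YV(i, l, val).
for i, p in enumerate(points, start=1):
    con.executemany("INSERT INTO YV VALUES (?, ?, ?)",
                    [(i, l, v) for l, v in enumerate(p, start=1)])

# Assumption: seed k = 2 centers from the first and last point
# (a deterministic stand-in for the slides' SAMPLE 1).
for j, i in enumerate([1, len(points)], start=1):
    con.execute("INSERT INTO C SELECT l, ?, val FROM YV WHERE i = ?", (j, i))


def step(con):
    """Run one K-means iteration entirely inside the database."""
    con.executescript("""
        DELETE FROM YD;
        INSERT INTO YD
            SELECT YV.i, C.j, sum((YV.val - C.val) * (YV.val - C.val))
            FROM YV, C WHERE YV.l = C.l GROUP BY YV.i, C.j;

        DELETE FROM YNN;
        INSERT INTO YNN
            SELECT YD.i, min(YD.j)   -- min() breaks ties between equidistant clusters
            FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
            WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist
            GROUP BY YD.i;

        -- Note: a cluster that loses all its points disappears from C here.
        DELETE FROM C;
        INSERT INTO C
            SELECT YV.l, YNN.j, avg(YV.val)
            FROM YV, YNN WHERE YV.i = YNN.i GROUP BY YV.l, YNN.j;
    """)


def assignments(con):
    return con.execute("SELECT i, j FROM YNN ORDER BY i").fetchall()


# Driver loop: the only "feedback" pulled out of the database is the assignment
# list, used to decide when to stop.
previous = None
for iteration in range(1, 21):
    step(con)
    current = assignments(con)
    if current == previous:
        break
    previous = current

centroids = con.execute("SELECT j, l, val FROM C ORDER BY j, l").fetchall()
print("iterations:", iteration)
print("assignments:", current)
print("centroids:", centroids)
```

Even in this small form, the control flow (the loop and the convergence test) lives outside SQL, which is the point slide 25 makes about iteration and recursion.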

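The user-defined aggregate shape on slide 28 (init, accum, merge, final) can be illustrated outside any particular database. The sketch below is plain Python, not a real extension API: the function names follow the slide, but their exact signatures, the choice of mean and variance as the aggregate, and the partitioned toy data are all assumptions. It shows why the merge step matters: each partition is folded independently, and the partial states are combined afterwards.

```python
from functools import reduce

# State is (count, sum, sum of squares): enough to derive mean and variance.


def init(a):
    """Create the aggregate state from the first value of a partition."""
    return (1, a, a * a)


def accum(state, b):
    """Fold one more value into an existing state."""
    n, s, ss = state
    return (n + 1, s + b, ss + b * b)


def merge(x, y):
    """Combine two partial states computed on different partitions."""
    return tuple(p + q for p, q in zip(x, y))


def final(state):
    """Turn the accumulated state into (mean, population variance)."""
    n, s, ss = state
    mean = s / n
    return mean, ss / n - mean * mean


def aggregate(partition):
    """Run init/accum over one non-empty partition."""
    first, *rest = partition
    return reduce(accum, rest, init(first))


# Made-up data; pretend each inner list lives on a different database segment.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
partials = [aggregate(p) for p in partitions]
mean, variance = final(reduce(merge, partials))

# The same values processed as a single partition give the same answer.
single_mean, single_variance = final(aggregate([v for p in partitions for v in p]))
print(mean, variance)
```

Because merge is associative, the partial states can be combined in any grouping, which is what lets a parallel database evaluate the aggregate on each segment before combining results.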