MAD Skills: New Analysis Practices for Big Data

This is a presentation based on the paper MAD Skills: New Analysis Practices for Big Data.
portal.acm.org/citation.cfm?id=1687576

MAD Skills: New Analysis Practices for Big Data

  1. MAD Skills: New Analysis Practices for Big Data
     Presented by: Christan Grant (cgrant@cise.ufl.edu)
     Thursday, August 25, 2011
  2. MAD Skills: New Analysis Practices for Big Data
     • Authors
       • Jeff Cohen (Greenplum)
       • Brian Dolan (Fox Audience Network)
       • Mark Dunlap (Evergreen Technologies)
       • Joseph M Hellerstein (UC Berkeley)
       • Caleb Welton (Greenplum)
     • Presented at the Very Large Data Bases (VLDB) Conference 2009 in Lyon, France
  3. “If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.” (Prof. Hal Varian, Chief Economist at Google)
  4. • Storage is cheap
       • The world's largest data warehouse of ~15 years ago can be stored on disks for about $2000.
  5. • Traditionally, a lot of time is spent building well-structured data warehouses.
       • Contains summaries of data
       • Main analytics location
       • Jealously guarded by IT
  8. • Data analysis is common culture
       • The culture is to collect and analyze data in different business units.
       • “In God we trust; all others must bring data”
     • Analytics are crucial: analysts need more access and more permissions
     • Hence M.A.D.
  9. • Magnetic: attract data and practitioners. The database must be painless and efficient.
     • Agile: analysts need to ingest and analyze on the fly.
     • Deep: sophisticated analytics at scale. Don’t force analysts to work on samples (where the long tail is lost).
       • Library of tools
  10. Outline
     • Background
     • FOX Audience Network
     • Being MAD
     • In-Database Operations
     • MAD DBMS
     • Conclusions
     • Questions
  11. Background
     • OLAP and Data Cubes
       • Data structure that provides descriptive statistics: simple summaries (sum, average, standard deviation)
       • We want inferential/inductive statistics: predictions, causality analysis, distributional comparison
  12. Background
     • Databases and Statistical Packages
       • Many analysts download data to use in Excel, SAS, Matlab, R, or their favorite programming language (FORTRAN?)
       • Use matrix/vector operations
       • Most of these stat packages require data to fit in RAM
       • Taking samples from the full data to fit into RAM results in loss of precision
       • External toolkits may also lack parallelism
  13. Background
     • MapReduce and Parallel Programming
       • Strengths of MR are scalable batch parallelism and fault tolerance.
       • There is work on putting ML algorithms in the MapReduce framework (Mahout)
       • MR is just a programming framework and works well with MAD
  14. Background
     • Data Mining and Analytics in the DB
       • Plenty of work in this area
       • Methods are typically implemented as a black box
       • Algorithms are not as flexible as in programmed packages
  15. FOX Audience Network
     • Served MySpace.com, IGN.com, Scout.com, etc.
     • About 150 million users
     • Ad network, bought by Rubicon in Nov 2010
  16. Fox Audience Network
     • 42 nodes (2 masters, 40 workers)
     • Sun X4500 Thumper
       • 48 500GB drives
       • 16GB RAM
     • ~5TB daily
     • One table with 1.5 trillion rows
     • Many different types of user workloads
     • Dynamic query ecosystem
  17. Fox Audience Network
     • Use Cases
     • Queries to the system are vastly different
     • Example queries:
       • How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle?
         • Expensive
       • How are these people similar to those that visited Nissan?
         • Open-ended; requires some statistics and the analyst to be in the loop.
  18. Being MAD
     • Analysts (Data Scientists) trump DBAs
       • They crave data
       • They can work with dirty data
       • They like all data (not samples)
       • They have a belt of tools
  19. Being MAD
     • Get data to the warehouse as soon as possible
     • Cleaning and integration should be done intelligently
  20. Being MAD
     • Three logical layers for data stores, plus sandboxes:
       • Reporting: specialized static aggregates
       • Production: aggregates for most users
       • Staging: raw tables and logs
       • Sandboxes: playground for analysts
     • The logical layers encourage creativity
  21. Being MAD
     • Data Scientists need power tools: Math + Stats + ML
     • The MADLib approach is to develop a hierarchy of mathematical concepts in the database.
  22. In-Database Operations
     • User Defined Functions (UDF)
       • SQL or PL/pgSQL
       • Internal (C)
       • Dynamically loaded (C, C++, Java, JavaScript, Perl, Python, R, Ruby, Scheme, etc.)
     • SELECT f1(city) FROM states WHERE f2(capital)
  23. In-Database Operations
     • Value abstraction levels
       • Scalar: 1.0
       • Vector: [0.3, 0.3, 0.4]
       • Matrix: [[0.3, 0.7], [0.6, 0.4]]
       • Function: f(·) → value
  24. In-Database Operations: Scalar Arithmetic
  25. In-Database Operations
     • Scalar Arithmetic
       • SELECT 5*4;
       • SELECT sqrt(64);
       • SELECT cos(-3.14159 * sqrt(2) / 2);
  26. In-Database Operations: Vector Arithmetic
  27. In-Database Operations
     • Vector
       • Array objects: float8[] (storing arrays in a column technically violates first normal form)
       • Array operations are implemented as UDFs.
  29. In-Database Operations: Matrix Arithmetic
  30. In-Database Operations
     • Matrix Representation #1
         CREATE TABLE B (
           row_number integer,
           vector numeric[]
         );
       http://www.postgresql.org/docs/current/static/arrays.html
  31. In-Database Operations
     • Matrix addition:
         SELECT A.row_number, A.vector + B.vector
         FROM A, B
         WHERE A.row_number = B.row_number
       This can be easily implemented as a primitive operator:
         SELECT A ** B FROM A, B
  32. In-Database Operations
     • Matrix and vector multiplication: Av
         SELECT 1, array_accum(row_number, vector*v) FROM A
       array_accum(x, v) is a custom function. It takes the value v and puts it in the row indexed by x.
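The row-vector representation above can be mimicked outside the database to check the arithmetic. The following is a minimal Python sketch, not the MADLib implementation: the dictionaries `A` and `B`, the vector `v`, and the helper names are hypothetical stand-ins for the tables and UDFs on the slides.

```python
# Hypothetical stand-ins for tables A and B in representation #1:
# each entry maps row_number -> that row's vector.
A = {1: [1.0, 2.0], 2: [3.0, 4.0]}
B = {1: [5.0, 6.0], 2: [7.0, 8.0]}
v = [1.0, 1.0]


def vector_add(x, y):
    """Element-wise sum, the role played by `A.vector + B.vector`."""
    return [a + b for a, b in zip(x, y)]


def dot(x, y):
    """Dot product, the role played by `vector * v`."""
    return sum(a * b for a, b in zip(x, y))


def matrix_add(left, right):
    """Join the two 'tables' on row_number and add the matching rows."""
    return {r: vector_add(left[r], right[r]) for r in left if r in right}


def matrix_vector(matrix, vector):
    """Analogue of array_accum: place each row's dot product at index row_number."""
    result = [0.0] * len(matrix)
    for row_number, row in matrix.items():
        result[row_number - 1] = dot(row, vector)
    return result


matrix_sum = matrix_add(A, B)
Av = matrix_vector(A, v)
```

Each row is processed independently, which is why these operations need only one pass over the data and parallelize naturally.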
  33. In-Database Operations
     • Matrix Representation #2
         CREATE TABLE A (
           row_number INTEGER,
           column_number INTEGER,
           value DECIMAL
         );
     • Great sparse representation!
       http://www.postgresql.org/docs/current/static/arrays.html
  34. In-Database Operations
     • Matrix multiplication (using representation #2):
         SELECT A.row_number, B.column_number, SUM(A.value * B.value)
         FROM A, B
         WHERE A.column_number = B.row_number
         GROUP BY A.row_number, B.column_number
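To see the join-and-aggregate multiplication produce a real matrix product, here is a small Python sketch that runs the same query through the standard-library `sqlite3` module. Using SQLite instead of PostgreSQL/Greenplum, the `REAL` column type, and the added `ORDER BY` are assumptions made only so the example is self-contained.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (an assumption;
# the slides target PostgreSQL/Greenplum).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (row_number INTEGER, column_number INTEGER, value REAL);
CREATE TABLE B (row_number INTEGER, column_number INTEGER, value REAL);
-- A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] as (row, column, value) triples
INSERT INTO A VALUES (1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4);
INSERT INTO B VALUES (1, 1, 5), (1, 2, 6), (2, 1, 7), (2, 2, 8);
""")

# Join on the shared inner index, then sum the products per output cell.
product = conn.execute("""
SELECT A.row_number, B.column_number, SUM(A.value * B.value)
FROM A, B
WHERE A.column_number = B.row_number
GROUP BY A.row_number, B.column_number
ORDER BY A.row_number, B.column_number
""").fetchall()
conn.close()

for row, column, value in product:
    print(row, column, value)
```

Zero entries never need to be stored, so the join only touches non-zero pairs; that is what makes this layout a good sparse representation.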
  35. In-Database Operations
     • These operations require one pass over the data and can be done in parallel.
     • What about matrix division and matrix inverse?
     • SQL is awkward here because it lacks loops for iteration.
  36. In-Database Operations
     • Document similarity for fraud detection
       • Check whether several advertisers are linking to pages with the same content.
       • If several advertisers have outbound links to similar documents, it is possible they are using a stolen credit card.
       • Especially if the outbound documents are malicious.
  37. In-Database Operations
     • For document similarity we use tf-idf, a score for a term $t$ in a given document $d$:
         $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \cdot \text{idf}_t$
         $\text{tf}_{t,d}$ = count of term $t$ in document $d$
         $\text{idf}_t = \log\left(|\text{docs}| \,/\, |\text{docs containing } t|\right)$
  38. In-Database Operations
     • Table docs contains triples (document, term, count)
     • The tf UDF in SQL takes a term t and a document d as parameters:
         SELECT sum(count) FROM docs
         WHERE term = t AND document = d
     • The idf UDF in SQL takes a term t as a parameter:
         SELECT log(
           (SELECT count(DISTINCT document) FROM docs)::float8
           / (SELECT count(DISTINCT document) FROM docs WHERE term = t))
  39. In-Database Operations
     • Each document has a vector of tf-idf scores; we can compare two documents using cosine similarity:
         $\cos\theta = \dfrac{v \cdot u}{\lVert v \rVert \, \lVert u \rVert}$
       A small $\theta$ means $v$ and $u$ are similar.
     • The SQL is straightforward using the vector operators
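The tf-idf and cosine-similarity definitions above can be exercised on a tiny corpus. This is a minimal Python sketch under stated assumptions: the `docs` triples are made-up data, the natural logarithm is used for idf, and the function names are illustrative rather than part of any library.

```python
import math
from collections import defaultdict

# Hypothetical (document, term, count) triples, mirroring the docs table.
docs = [
    ("d1", "cheap", 2), ("d1", "pills", 1),
    ("d2", "cheap", 2), ("d2", "pills", 1),
    ("d3", "pills", 1), ("d3", "news", 3),
]


def tfidf_vectors(triples):
    """Build one sparse tf-idf vector (term -> score) per document."""
    documents = {d for d, _, _ in triples}
    docs_with_term = defaultdict(set)
    for d, t, _ in triples:
        docs_with_term[t].add(d)
    vectors = defaultdict(dict)
    for d, t, count in triples:
        idf = math.log(len(documents) / len(docs_with_term[t]))
        vectors[d][t] = count * idf
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors(docs)
sim_12 = cosine(vectors["d1"], vectors["d2"])  # identical documents
sim_13 = cosine(vectors["d1"], vectors["d3"])  # share only a term found everywhere
```

Note that "pills" appears in every document, so its idf is zero and it contributes nothing to similarity; only the rarer terms separate the documents.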
  40. In-Database Operations
     • Ordinary Least Squares is used to fit a line to a collection of points.
     • This helps to describe seasonal trends; it's a first look at the data.
       Solve $y = X\beta$, where $X$ is an $n \times k$ matrix.
       Estimate $\beta$ using $\beta^{*} = (X^{T}X)^{-1}X^{T}y$
       We can parallelize the computation:
         Distribute: $A \leftarrow X^{T}X$, $b \leftarrow X^{T}y$
         Collect: $A$ and $b$
         Calculate: $A^{-1}$
         Multiply: $A^{-1}b$
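The distribute/collect steps above work because $X^{T}X$ and $X^{T}y$ are sums over rows, so each worker can compute partial sums on its own slice. The Python sketch below illustrates this for a line fit (k = 2, columns [1, x]); the `partitions` data and helper names are hypothetical, and the 2×2 inverse is written out by hand only to avoid external libraries.

```python
# Hypothetical partitions of (x, y) points, as if spread across worker segments.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]


def local_sums(points):
    """Distribute step: one worker's share of X^T X (2x2) and X^T y (length 2)."""
    n = sx = sxx = sy = sxy = 0.0
    for x, y in points:
        n += 1.0
        sx += x
        sxx += x * x
        sy += y
        sxy += x * y
    return [[n, sx], [sx, sxx]], [sy, sxy]


def combine(parts):
    """Collect step: add the partial matrices and vectors element-wise."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for A_i, b_i in parts:
        for i in range(2):
            b[i] += b_i[i]
            for j in range(2):
                A[i][j] += A_i[i][j]
    return A, b


def solve_2x2(A, b):
    """Calculate A^-1 explicitly, then multiply by b."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0.0:
        raise ValueError("X^T X is singular; the points do not determine a line")
    inv = [[A[1][1] / det, -A[0][1] / det],
           [-A[1][0] / det, A[0][0] / det]]
    return [inv[0][0] * b[0] + inv[0][1] * b[1],
            inv[1][0] * b[0] + inv[1][1] * b[1]]


A, b = combine(local_sums(p) for p in partitions)
intercept, slope = solve_2x2(A, b)
```

Only the small k × k matrix and k-vector travel to the master, regardless of how many rows each worker holds.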
  41. In-Database Operations: Functional Arithmetic
  42. In-Database Operations
     • A/B Testing (Multivariate testing)
       • Web companies show people different versions of their website to see which one is the best.
       • Ad companies compare different ad campaigns and may want to find the one with the higher click-through rate.
  43. In-Database Operations
     • Mann-Whitney U Test
       • This is a popular test to compare non-parametric data.
       • Non-parametric: a data set that does not fit a well-known distribution
       • Combine the data values, rank them, and find the average rank position of each sample's values.
       • $Z = \dfrac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}}$
  44. In-Database Operations
     • Mann-Whitney U Test
       • Given table T with (sample_id, value):
           CREATE VIEW R AS
           SELECT sample_id,
                  avg(value) AS sample_avg,
                  sum(rown) AS rank_sum,
                  count(*) AS sample_n,
                  sum(rown) - count(*) * (count(*) + 1) / 2 AS sample_u
           FROM (SELECT sample_id,
                        row_number() OVER (ORDER BY value DESC) AS rown,
                        value
                 FROM T) AS ordered
           GROUP BY sample_id
  46. In-Database Operations
     • Mann-Whitney U Test
       • With a large sample count we can get the z_score:
           SELECT r.sample_u, r.sample_avg, r.sample_n,
                  (r.sample_u - a.sum_u / 2)
                    / sqrt(a.sum_u * (a.sum_n + 1) / 12) AS z_score
           FROM R AS r,
                (SELECT sum(sample_u) AS sum_u, sum(sample_n) AS sum_n FROM R) AS a
           GROUP BY r.sample_u, r.sample_avg, r.sample_n, a.sum_n, a.sum_u
         This can be easily implemented as a stored procedure:
           SELECT man_whitney(value) FROM table
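The rank-sum arithmetic behind the view and the z-score query can be checked on two small samples. The Python sketch below is illustrative only: the function name is hypothetical, it ranks in ascending order (the SQL ranks descending, which simply swaps the two U values), and it assumes there are no tied values.

```python
import math


def mann_whitney_z(sample_a, sample_b):
    """Return (U_a, U_b, z) for two samples, assuming no tied values."""
    # Pool both samples and rank them 1..n by value.
    combined = sorted([(value, "a") for value in sample_a] +
                      [(value, "b") for value in sample_b])
    rank_sum = {"a": 0, "b": 0}
    for rank, (_, label) in enumerate(combined, start=1):
        rank_sum[label] += rank

    n1, n2 = len(sample_a), len(sample_b)
    # U = rank sum minus the smallest rank sum a sample of that size can have.
    u_a = rank_sum["a"] - n1 * (n1 + 1) / 2
    u_b = rank_sum["b"] - n2 * (n2 + 1) / 2
    # Normal approximation; in the SQL, sum_u = n1 * n2 and sum_n = n1 + n2.
    z = (u_a - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u_a, u_b, z


u_a, u_b, z = mann_whitney_z([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(u_a, u_b, round(z, 4))
```

The two U values always add up to n1 · n2, which is why the query can recover both n1 · n2 and n1 + n2 from sums over the view.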
  47. In-Database Operations
     • Resampling
       • Sampling several subsets instead of the whole population.
       • Bootstrapping: sample into buckets (with replacement), then combine the buckets
       • Jackknifing: leave some data out and sample, then combine the runs
  48. In-Database Operations
     • Bootstrapping
       • Pre-select which trials contain which rows
       • Suppose we have a population of N = 100 and subsamples of size M = 3:
           CREATE VIEW design AS
           SELECT a.trial_id, floor(100 * random()) AS row_id
           FROM generate_series(1, 10000) AS a (trial_id),
                generate_series(1, 3) AS b (subsample_id)
  49. In-Database Operations
     • Bootstrapping
       • Calculate the sampling distribution:
           CREATE VIEW trials AS
           SELECT d.trial_id, AVG(T.value) AS avg_value
           FROM design d, T
           WHERE d.row_id = T.row_id
           GROUP BY d.trial_id
  50. In-Database Operations
     • Bootstrapping
       • Final distribution parameters:
           SELECT AVG(avg_value), STDDEV(avg_value) FROM trials
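The three bootstrapping steps (design, trials, final parameters) map onto a few lines of Python. This is a sketch under assumptions: the population is the made-up values 0..99, the seed is fixed only for repeatability, and the function name is hypothetical.

```python
import random
import statistics


def bootstrap_means(values, trials, subsample_size, seed=0):
    """Draw `trials` subsamples with replacement and return each subsample's mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(trials):
        # "design" step: choose which rows belong to this trial
        draws = [rng.choice(values) for _ in range(subsample_size)]
        # "trials" step: average the chosen rows
        means.append(sum(draws) / subsample_size)
    return means


# Hypothetical population of N = 100 values, subsamples of M = 3, 10000 trials.
population = list(range(100))
trial_means = bootstrap_means(population, trials=10000, subsample_size=3)

# Final distribution parameters, as in SELECT AVG(avg_value), STDDEV(avg_value).
estimate = statistics.mean(trial_means)
spread = statistics.stdev(trial_means)
```

Because every trial is independent, the trials can be generated and averaged in parallel, which is what the cross join of the two `generate_series` calls expresses in SQL.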
  51. MAD DBMS
     • Loading/Unloading
       • Scatter/Gather Streaming: fast!
       • Extract-Transform-Load vs Extract-Load-Transform
       • Greenplum has MapReduce support!
  52. MAD DBMS
     • Storage
       • Tunable table types (flexibility):
         • external tables (e.g. files)
         • heap tables (frequent updates)
         • append-only tables (rare updates), compressible
         • column stores
       • All tables have a distribution policy
  53. MAD DBMS
     • Partitioning
       • Partition by range of values or by column values (list)
       • E.g. partition by timestamp: old data goes to compressed tables, new data goes to heap storage.
       • The query optimizer knows the partitioning scheme
       • This is useful for staging data
  54. MAD DBMS
     • MAD Programming
       • Use your favorite programming language extension
       • Programmers must think through how the code works without shared memory (data-parallel).
       • Map and Reduce functions can be written in Python, Perl, or R.
       • They run using the Scatter/Gather technique (any table type).
       • SQL is not necessary!
  55. MAD DBMS
     • Future
       • Need a better repository for the library of functions (Morpheus, SIGMOD06)
       • Automatically choose data and table layouts
       • Use online aggregation for faster answers to queries.
  59. Questions
     • Why would one be opposed to the M.A.D. paradigm?
     • How do MADLib and Greenplum fit into the NoSQL movement?
     • Is there a market for an AppStore for functions/utilities for MADLib?
  60. Conclusion
     • This paper describes the M.A.D. (Magnetic, Agile, Deep) mindset and the M.A.D. library.
     • It demonstrated how to perform analytics inside the DB using SQL and stored procedures.
     • It also described the Greenplum database, a general data-parallel database for simple and complex queries over flexible tables.
  61. Thank You!
  62. MADLib Installation (Ubuntu)
     • sudo apt-get install postgresql flex bison doxygen libboost1.42-* git graphviz libarmadillo-dev libarmadillo-doc
     • sudo apt-get install lapack++ blas++
     • git clone https://github.com/madlib/madlib.git
     • ./configure
     • cd build
     • make
     https://github.com/madlib/madlib/wiki/Building-MADlib-from-Source
  63. Matrix Multiplication Demo
         -- Create tables in the new matrix format
         CREATE TABLE A (
           row_number INTEGER,
           column_number INTEGER,
           value DECIMAL
         );
         INSERT INTO A VALUES
           (1, 1, 1),
           (1, 2, 2),
           (2, 1, 3),
           (2, 2, 4);
         CREATE TABLE B (
           row_number INTEGER,
           column_number INTEGER,
           value DECIMAL
         );
         INSERT INTO B VALUES
           (1, 1, 5),
           (1, 2, 6),
           (2, 1, 7),
           (2, 2, 8);
         -- Check to see that the values are multiplying correctly
         SELECT A.row_number, B.column_number,
                A.value::text || ' * ' || B.value::text
         FROM A, B
         WHERE A.column_number = B.row_number;
         -- Add the GROUP BY to see the sum work
         SELECT A.row_number, B.column_number, SUM(A.value * B.value)
         FROM A, B
         WHERE A.column_number = B.row_number
         GROUP BY A.row_number, B.column_number;
         DROP TABLE A;
         DROP TABLE B;