6. 6
Why Apache?
• Collaborate on software in open and productive ways
• A strong community is needed for innovation
• MADlib and HAWQ are complementary technologies
10. 10
Key Features of HAWQ
1. Advanced Analytics Performance
• Up to 30x SQL-on-Hadoop performance advantage
• Faster time to insight
• Massive MPP scalability to petabytes
Benefits: Near real-time latency, complex queries and advanced analytics at scale
11. 11
Key Features of HAWQ
2. 100% ANSI SQL Compliant
• ANSI SQL-92, -99, -2003
• All 99 TPC-DS queries tested, no modifications
• Plus, OLAP extensions
• Complete ACID integrity and reliability
Benefits: 100% SQL compliant
No risk to SQL applications
All native on HDP via HAWQ
12. 12
HAWQ Performance vs Impala
[Chart: per-query TPC-DS comparison, with regions labeled "HAWQ Faster" and "Impala Faster"]
HAWQ
• Faster on 46 of 62 TPC-DS queries completed*
• 4.55x faster on average (mean)
• 12 hrs faster in total
* Impala supported 74 of 99 queries; 12 crashed mid-run
13. 13
HAWQ vs Apache Hive w/Tez
[Chart: per-query TPC-DS comparison, with regions labeled "HAWQ Faster" and "Hive Faster"]
HAWQ
• Faster on 45 of 60 TPC-DS queries completed*
• 3.44x faster on average (mean)
• 9 hrs faster in total
* Hive supported 65 of 99 queries; 5 crashed mid-run
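The headline counts on the two benchmark slides can be cross-checked with a few lines of arithmetic. The sketch below assumes that "completed" means supported queries minus those that crashed mid-run, which is consistent with the quoted numbers (74 - 12 = 62 and 65 - 5 = 60). The dictionary layout and the `summarize` helper are illustrative names, not part of any benchmark kit.

```python
# Figures quoted on the two benchmark slides (TPC-DS defines 99 queries).
BENCHMARKS = {
    "Impala": {"supported": 74, "crashed": 12, "hawq_faster": 46},
    "Hive on Tez": {"supported": 65, "crashed": 5, "hawq_faster": 45},
}


def summarize(supported, crashed, hawq_faster):
    """Return (completed queries, fraction of completed queries where HAWQ was faster).

    Assumption: completed = supported - crashed.
    """
    completed = supported - crashed
    return completed, hawq_faster / completed


for engine, figures in BENCHMARKS.items():
    completed, share = summarize(**figures)
    print(f"vs {engine}: {completed} completed, HAWQ faster on {share:.0%}")
```

On these figures HAWQ was faster on roughly three quarters of the queries that the other engine completed. The mean speedups and total-hours savings cannot be recomputed from the slides, since per-query timings are not given.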
14. 14
Key Features of HAWQ
3. Integrated Machine Learning
• Advanced machine learning for big data
• Local, in-database operation
• Exceptional MPP/parallel performance
• Open source, Postgres-based
Benefits: Advanced, highly scalable machine learning, directly on data in Hadoop
15. 15
Key Features of HAWQ
4. Flexible Deployment
• HDP, PHD, other ODPi-derived distros
• Easily managed via Ambari
• On premises, in cloud, or PaaS
• HBase, Avro, Parquet and more
• Connectors to make HAWQ data available to other SQL query tools
Benefits: Flexibility
Accessibility
Portability
16. 16
Open Data Platform
A shared industry effort to advance the state of Apache Hadoop® and Big Data
technologies for the enterprise
22. 22
Example Use Cases
Smart/connected car
• PHD, HAWQ
• Ability to keep numerous data sets in Hadoop
• Generate new business models
• Predictive analytics
Network & Call Center Analysis
• PHD, HAWQ
• Store and maintain 2B records/day
• Analyze dropped and completed calls
• Analyze networks, care-center responsiveness
• 5X capacity of EDW at half the cost
Revenue Prediction
• PHD, HAWQ, GPDB
• Predict ad revenue to within 1%
• Transform into a data-driven company that builds close relationships with customers
Archive Analytics, Customer
Behavior Analytics
• PHD, HAWQ
• Mainframe alternative
• Archive analytics
• Customer behavior
profiling and analytics
24. 24
Apache MADlib (incubating)
Scalable, In-Database Machine Learning
• Open Source: https://github.com/apache/incubator-madlib
• Supports Greenplum DB, Apache HAWQ/HDB and PostgreSQL
• Downloads and Docs: http://madlib.incubator.apache.org/
25. 25
History
The MADlib project was initiated in 2011 by EMC/Greenplum architects and
Joe Hellerstein of the University of California, Berkeley.
• MAD stands for:
UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.
• lib stands for SQL library of:
• advanced (mathematical, statistical, machine learning)
• parallel & scalable in-database functions
26. 26
Functions
Predictive Modeling Library
Linear Systems
• Sparse and Dense Solvers
• Linear Algebra
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Robust Variance (Huber-White), Clustered Variance, Marginal Effects
Other Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Apriori)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Random Forest
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
• Naïve Bayes
• Support Vector Machines (SVM)
Descriptive Statistics
Sketch-Based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
Data Preparation
PMML Export
Conjugate Gradient
Inferential Statistics
Hypothesis Tests
Time Series
• ARIMA
Oct 2014
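To give a feel for the sketch-based estimators in the function list, here is a minimal Count-Min sketch in Python. It is an illustration of the general technique only, not MADlib's implementation (MADlib exposes its estimators as in-database SQL functions). The class name, default sizes and the choice of hash are assumptions made for this example. The `merge` method shows why sketches suit MPP systems: partial sketches built on separate data partitions combine by cell-wise addition.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter; estimates may overcount but never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One column per row, chosen by a hash salted with the row index.
        data = str(item).encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(self.rows[row][col] for row, col in self._cells(item))

    def merge(self, other):
        """Combine two sketches of identical shape by adding counters cell-wise."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


# Two partial sketches, as if built on two separate data partitions.
left, right = CountMinSketch(), CountMinSketch()
for word in ["hawq", "madlib", "hawq"]:
    left.add(word)
for word in ["hawq", "hive"]:
    right.add(word)

combined = left.merge(right)
print(combined.estimate("hawq"))  # at least 3; exact unless hash collisions occur
```

Memory use is fixed by `width * depth` regardless of how many distinct items are seen, which is the trade-off these estimators make: bounded space in exchange for a bounded overcount.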
27. 27
MADlib Advantages
Better parallelism
– Algorithms designed to leverage MPP and Hadoop architecture
Better scalability
– Algorithms scale as your data set scales
Better predictive accuracy
– Can use all data, not a sample
ASF open source (incubating)
– Available for customization and optimization
30. 30
Linear Regression on 10 Million Rows in Seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of
the VLDB Endowment 5.12 (2012): 1700-1711.
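One way to see why linear regression parallelizes well is that ordinary least squares only needs the sums X^T X and X^T y, and those can be accumulated independently on each data partition and then added together. The Python sketch below illustrates that idea with transition, merge and final steps in the style of a database aggregate. It is a toy illustration under those assumptions, not MADlib's code, and every function name here is hypothetical.

```python
def new_state(k):
    """Empty accumulator for k features: X^T X (k x k) and X^T y (length k)."""
    return {"xtx": [[0.0] * k for _ in range(k)], "xty": [0.0] * k}


def transition(state, x, y):
    """Fold one row (feature list x, response y) into the accumulator."""
    k = len(x)
    for i in range(k):
        state["xty"][i] += x[i] * y
        for j in range(k):
            state["xtx"][i][j] += x[i] * x[j]
    return state


def merge(a, b):
    """Combine accumulators built independently on two partitions."""
    k = len(a["xty"])
    return {
        "xtx": [[a["xtx"][i][j] + b["xtx"][i][j] for j in range(k)] for i in range(k)],
        "xty": [a["xty"][i] + b["xty"][i] for i in range(k)],
    }


def final(state):
    """Solve (X^T X) beta = X^T y by Gauss-Jordan elimination with partial pivoting."""
    k = len(state["xty"])
    aug = [row[:] + [rhs] for row, rhs in zip(state["xtx"], state["xty"])]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, k + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def fit(segments, k):
    """Accumulate each segment separately, merge the partial states, then solve."""
    total = new_state(k)
    for rows in segments:
        partial = new_state(k)
        for x, y in rows:
            transition(partial, x, y)
        total = merge(total, partial)
    return final(total)


# Two "segments", each holding part of the data for y = 1 + 2*x.
# The leading 1.0 in each feature list is the intercept term.
segment_a = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)]
segment_b = [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]

coefficients = fit([segment_a, segment_b], k=2)
print(coefficients)  # approximately [1.0, 2.0]
```

Each segment touches only its own rows, and the state passed between steps is k x k no matter how many rows there are, so the work scales with the number of partitions rather than requiring the data to be moved to one place.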
31. 31
Pivotal is very proud to deepen
our relationship with the ASF to
advance SQL-on-Hadoop and
machine learning technologies.
Please join us!