PIVOTAL NY – BIG DATA & ANALYTICS MEETUP
SQL and Machine Learning on HDFS using HAWQ

Prasad Radhakrishnan
Amey Banarse
Josh Plotkin

April 7th, 2015
Pivotal Labs, New York
Hadoop-HDFS vs. MPP Databases
Quick overview of Hadoop-HDFS

- Hadoop is a framework for storing and processing large volumes of data.
- HDFS is a fault-tolerant file system:
  - It assumes that hardware will fail.
  - It combines the cluster's local storage into a single namespace.
  - All data is replicated to multiple machines, and the system is infinitely (read: highly) scalable.
- A classic master-slave architecture.
Hadoop & HDFS Architecture – Simplistic View

[Diagram: Node 1 hosts the NameNode and Node 2 the Secondary NameNode, each alongside a Quorum Journal Manager, ZooKeeper and YARN; Nodes 3 through N each host a DataNode together with ZooKeeper and the YARN Resource Manager.]
Hadoop – HDFS: 2 sides of the same coin

[Side bar: the slide pictures the first US penny, 223 years old, which sold for $1.2 million.]

- How is data stored in HDFS?
- How is data accessed, and how does the framework support processing / computations?
Hadoop – HDFS: Storage side

- Everything is a file.
- A file is broken into small blocks.
- Multiple copies of each block are stored.
- If a node goes down, other nodes can provide the missing blocks.
- Schema-less: it is a distributed FS that will accept any kind of data (structured, semi-structured or unstructured).
Hadoop – HDFS: Processing side

- MapReduce is the most popular processing framework for access and computational needs.
- MR is slow; it is suited to batch workloads and to processing semi-structured and unstructured data.
- SQL support is minimal and immature.
Quick overview of MPP Databases

- MPPs predate Hadoop-HDFS.
- Most MPPs have a master-slave design and a shared-nothing architecture.
- Only a few MPPs are software-only; most run on custom hardware.
- They are known for the sheer horsepower they bring to processing and computing big data workloads.
Simplistic view of Greenplum MPP DB Architecture

[Diagram: Node 1 hosts the Master and Node 2 the Standby Master; Segment Hosts 1 through N each run multiple segment instances (Segment 1, Segment 2, … Segment N).]
MPP Databases: 2 sides of the same coin

- How is data stored in MPPs?
- How is data accessed, and how does the framework support processing / computations?
MPP Databases: Storage side of things

- MPPs, too, have a distributed FS, but only for table data; tables are database objects.
- A file needs to be loaded into a table, so advance knowledge of its schema is required.
- MPPs are fault-tolerant as well: they store redundant copies of data.
- Handling semi-structured and unstructured data is not their sweet spot and can get very challenging.
MPP Databases: Processing side

- MPPs are SQL engines.
- Known for their horsepower in handling and processing big data workloads.
- Extensive/full ANSI SQL support.
- Rich ETL and BI tooling for easy development and consumption.
- Some MPPs, like Greenplum, support machine learning use cases extensively.
The Winning Combination

- HDFS as the distributed FS: schema-less and fault-tolerant, with a huge Apache ecosystem of tools and products.
- An MPP SQL engine for data access, processing and ML workloads: it blends with and leverages BI and ETL tools, works with Hadoop file formats, and integrates with Apache HDFS tools and products.
HAWQ, the MPP SQL Engine for HDFS

- Derived from Greenplum MPP, short-cutting the many years of engineering effort that would go into building an MPP from scratch.
- Stores data in HDFS using the Parquet file format.
- Familiar SQL interface, 100% ANSI compliant: all kinds of joins, analytical functions, etc. (see the sketch below).
- Doesn't use MapReduce; no dependencies on Hive or the Hive metastore.
- Far faster than MR; not even a valid comparison.
- Enables in-database machine learning.
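For a flavor of the interface, a minimal sketch of ANSI SQL that HAWQ runs unchanged over HDFS-resident data (mart.customers and its region column are hypothetical here; mart.trans_fact appears later in the deck):

-- window function plus join, exactly as in a traditional warehouse
SELECT c.region,
       t.ssn,
       sum(t.amt) OVER (PARTITION BY c.region) AS region_spend
FROM mart.trans_fact t
JOIN mart.customers c USING (ssn);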
Pivotal HAWQ Architecture

[Diagram: the HAWQ Master (Node 1) and HAWQ Secondary Master (Node 2) sit alongside the NameNode (Node 3) and Secondary NameNode (Node 4), with Quorum Journal Manager and ZooKeeper instances on the master nodes; Nodes 5 through N each run HAWQ Segment(s) co-located with an HDFS DataNode; everything is joined by a high-speed interconnect.]
Pivotal HAWQ Architecture

- Master
  - Master and Standby Master hosts
  - The Master coordinates work with the Segment hosts
- Segment
  - One or more segment instances per host
  - Process queries in parallel
  - Dedicated CPU, memory and disk
- Interconnect
  - High-speed interconnect for continuous pipelining of data processing
Key components of HAWQ Master

[Diagram: the HAWQ Master and the HAWQ Standby Master each contain a Query Parser, Query Optimizer, Query Executor, Transaction Manager, Process Manager and Metadata Catalog; the standby's catalog is kept in sync via WAL replication.]
Key components of HAWQ segments

[Diagram: a HAWQ Segment runs a Query Executor with PXF and libhdfs3, plus a Segment Data Directory and a Spill Data Directory on the local filesystem, co-located with the HDFS DataNode.]
What differentiates HAWQ?
What differentiates HAWQ: Interconnect & Dynamic Pipelining

- Parallel data flow using the high-speed UDP interconnect
- No materialization of intermediate data
- MapReduce will always "write" to disk; HAWQ will "spill" to disk only when memory is insufficient
- Guarantees that queries will always complete

[Diagram: Segment Hosts, each running a HAWQ Segment with a Query Executor, PXF, local temp storage and a DataNode, exchange data over the interconnect via dynamic pipelining.]
What differentiates HAWQ: libhdfs3

- Pivotal rewrote libhdfs in C++, resulting in libhdfs3:
  - Exposes a C-style API, with no JVM/JNI overhead
  - Leverages protocol buffers to achieve greater performance
- libhdfs3 is what HAWQ uses to access HDFS
- Can benefit from short-circuit reads
- libhdfs3 has been open-sourced: https://github.com/PivotalRD/libhdfs3
What differentiates HAWQ: HDFS Truncate

- HDFS is designed as an append-only, POSIX-like FS
- ACID compliance: HDFS guarantees the atomicity of a truncate operation, i.e. it either succeeds or fails as a whole
- HDFS Truncate (HDFS-3107) was added to support rollbacks
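A minimal sketch of why truncate matters for transactions (the table names are hypothetical): an aborted load is undone by truncating the freshly appended bytes rather than rewriting whole files.

BEGIN;
INSERT INTO demo SELECT * FROM staging_demo;  -- appends new blocks in HDFS
ROLLBACK;  -- the append is undone; atomic HDFS truncate makes this safe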
What differentiates HAWQ: ORCA

- Multi-core query optimizer (codename ORCA)
- Cost-based optimizer
- Window and analytic functions
- As an example, 1.2 billion plan alternatives evaluated in 250 ms for TPC-DS Query #21
  - Partition elimination
  - Subquery unnesting
- Presented at the ACM SIGMOD 2014 conference
Pivotal Xtensions Framework (PXF)

- External table interface in HAWQ for reading data stored in the Hadoop ecosystem
- External tables can be used to:
  - Load data into HAWQ from Hadoop
  - Query Hadoop data without materializing it into HAWQ
- Enables loading and querying of data stored in:
  - HDFS text
  - Avro
  - HBase
  - Hive (Text, Sequence and RCFile formats)
PXF example to query HBase

Get data from an HBase table called 'sales'. In this example we are only interested in the rowkey, the qualifier 'saleid' inside column family 'cf1', and the qualifier 'comments' inside column family 'cf8':

CREATE EXTERNAL TABLE hbase_sales (
  recordkey bytea,
  "cf1:saleid" int,
  "cf8:comments" varchar
)
LOCATION ('pxf://10.76.72.26:50070/sales?Fragmenter=HBaseDataFragmenter&Accessor=HBaseAccessor&Resolver=HBaseResolver')
FORMAT 'custom' (formatter='gpxfwritable_import');
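Once defined, the external table can be queried like any other HAWQ table; a minimal sketch:

SELECT recordkey, "cf1:saleid", "cf8:comments"
FROM hbase_sales
WHERE "cf1:saleid" > 1000;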
Example Data Flow in HAWQ

[Diagram: data outside HDFS is landed as source data in HDFS, where a HAWQ external table makes it available for immediate ad-hoc queries and for loading or appending into a regular HAWQ table; alternatively, data is loaded or appended directly into a regular HAWQ table. Either path feeds BI tools for reporting & dashboards and in-database machine learning using MADlib, PL/Python and PL/R.]
HAWQ Demo

- HAWQ cluster topology
- HAWQ ingest
- HAWQ PXF
- Dashboards
HAWQ Demo Dataset

- 23 million rows of fake customer credit card spend data
- Time dimension
- Aggregate tables
MACHINE LEARNING ON HDFS USING HAWQ
Analytics on HDFS, Popular Notions

- Store it all in HDFS and query it all using HAWQ
  - Scalable storage and scalable compute
- The "gravity" of data changes everything
  - Move code, not data
  - Parallelize everything
  - Look at the entire data set, not just a sample
  - More data could mean simpler models
MADlib for Machine Learning

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.
UDF - PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}

- Allows users to write functions in the R, Python, Java, Perl, pgsql or C languages
- The interpreter/VM of language 'X' is installed on each node of the Greenplum Database cluster
- Python extensions such as NumPy, NLTK, scikit-learn and SciPy can be installed
- Data parallelism: PL/X piggybacks on the MPP architecture (a sketch follows)
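A minimal sketch of the idea (the function and its use of NumPy are illustrative, not from the deck): a PL/Python UDF executes on every segment and can call any package installed on the nodes.

-- hypothetical PL/Python UDF calling NumPy where the data lives
CREATE OR REPLACE FUNCTION np_log1p(x float8)
RETURNS float8
AS $$
    import numpy as np          -- requires NumPy on every node
    return float(np.log1p(x))   -- log(1 + x)
$$ LANGUAGE plpythonu;

-- applied in parallel across segments like any scalar function
SELECT ssn, np_log1p(amt) FROM mart.trans_fact LIMIT 5;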
UDF - PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}

[Diagram: the same HAWQ cluster topology shown earlier (master, secondary master, NameNodes, and HAWQ Segments co-located with DataNodes over the high-speed interconnect), illustrating that UDFs run on every segment.]
Data Analytics in SQL w/ HAWQ

- A data science workflow in SQL
- Re-training is often necessary to take the step from traditional BI to DS
- Many parts run in parallel:
  - Data manipulation
  - Training (e.g. linear regression, elastic net, SVM)
  - Model evaluation
- MADlib for a native SQL interface with simple syntax
- Extensible with PL/Python and PL/R: all your favorite packages
Workflow

[Diagram: Ask Questions → EDA/V → Manipulate Data → Split Datasets → Train/Tune Model → Score Models against test data → (model is good) → Make Decisions]
Exploratory Data Analysis/Viz

[Workflow diagram with the EDA/V step highlighted. Tools: SQL, MADlib, BI tools.]
Explore Data with MADlib

- Select a row
- Group by customer (plus age and gender) to see total spend; call the result 'summary_stats', and select a row
- Use the MADlib summary statistics function:

  SELECT madlib.summary('summary_stats', 'summary_stats_out', 'total_spent');

  - This creates a table 'summary_stats_out' that we can query (see the sketch below)
  - It can also group by:

  SELECT madlib.summary('summary_stats', 'summary_stats_out_grouped', 'total_spent', 'gender, age');
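The output table is queried like any other; a minimal sketch (the column names follow MADlib's summary() documentation, so treat them as version-dependent assumptions):

SELECT target_column, mean, variance, min, max
FROM summary_stats_out;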
Ask Questions

- Start with simple questions and simple (linear) models.
- Imagine we get new data with masked demographic data…
- What is the simplest possible question we can ask of this dataset?
  - Can we predict age from total amount spent?
- What needs to be done first?
Feature Engineering

[Workflow diagram with the Manipulate Data step highlighted. Tools: SQL, PL/R, PL/Python.]
Shaping the Dataset

We want the data to look something like:

     ssn     |   sum    | age
-------------+----------+-----
 621-95-5488 | 19435.72 |  30

CREATE OR REPLACE VIEW cust_age_totalspend AS
SELECT ssn, sum(amt), extract(years from age(NOW(), dob)) as age
FROM mart.trans_fact
GROUP BY ssn, extract(years from age(NOW(), dob));
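A quick sanity check of the view (output shape only; the values shown above are illustrative):

SELECT * FROM cust_age_totalspend LIMIT 5;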
Skipping for now…

[Workflow diagram with the Split Datasets step highlighted. Tools: SQL, PL/R, PL/Python, MADlib.]
Building Models

[Workflow diagram with the Train/Tune Model step highlighted. Tools: PL/R, PL/Python, MADlib.]
MADlib's Interface

- Inputs: tables
  - VIEW: cust_age_totalspend
- Outputs: tables that can be queried
  - MADlib generates these automatically
  - They can be piped into PyMADlib, pulled via the Tableau connector, or queried from an API
- Labels
- Features
- Parameters
- Parallel when possible

SELECT madlib.linregr_train( 'cust_age_totalspend',
                             'linear_model',
                             'age',
                             'ARRAY[1, sum]');
Scoring Chosen Model

[Workflow diagram with the Score Models step highlighted. Tools: PL/R, PL/Python, MADlib.]
Viewing Coefficients

As simple as querying a table:

SELECT * FROM linear_model;

Unnest the arrays:

SELECT unnest(ARRAY['intercept', 'sum']) as attribute,
       unnest(coef) as coefficient,
       unnest(std_err) as standard_error,
       unnest(t_stats) as t_stat,
       unnest(p_values) as pvalue
FROM linear_model;
Making/Evaluating Predictions

SELECT ssn, sum, age,
       -- calculate predictions: the array and m.coef are vectors
       madlib.linregr_predict(ARRAY[1, sum], m.coef) as predict,
       -- calculate residuals
       age - madlib.linregr_predict(ARRAY[1, sum], m.coef) as residual
FROM cust_age_totalspend, linear_model m;
Scoring (Mean Squared Error)

SELECT avg(residual^2) AS mse
FROM (SELECT age - madlib.linregr_predict(ARRAY[1, sum], m.coef) as residual
      FROM cust_age_totalspend, linear_model m) PREDICTIONS;
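MSE alone is hard to interpret; a hedged sketch of R² in the same style (not from the deck), computed against the variance of the label:

SELECT 1 - sum(residual^2) / sum((age - mean_age)^2) AS r_squared
FROM (SELECT age,
             avg(age) OVER () AS mean_age,
             age - madlib.linregr_predict(ARRAY[1, sum], m.coef) AS residual
      FROM cust_age_totalspend, linear_model m) s;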
Training, Testing, Cross-Validation Sets

[Workflow diagram with the Split Datasets step highlighted. Tools: SQL, PL/R, PL/Python, MADlib.]
SQL To Create 80/20 Test/Train

- MADlib has a built-in cross-validation function
- Stratified sampling with pure SQL:
  - Set a random seed in (-1, 1): SELECT setseed(0.444);
  - Partition the data by group
  - Sort by the random number
  - Assign a row count
  - Use a correlated subquery to select the lowest 80% of row counts per partition into the training set
  - The code…
SQL To Create 80/20 Test/Train

-- 1. perform the same aggregation step as before in creating the view
WITH assign_random AS (
  SELECT ssn, extract(years from age(NOW(), dob)) as age,
         sum(amt) as total_spent, random() as random
  FROM mart.trans_fact
  GROUP BY ssn, extract(years from age(NOW(), dob))
),
-- 2. add a row number per group (in this case, just age)
training_rownums AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY age ORDER BY random) AS rownum
  FROM assign_random
)
-- 3. take the lowest 80% of rownums per group as the training set, using
--    a correlated subquery (iterating over groups) because HAWQ is ANSI compliant
SELECT ssn, age, total_spent
FROM training_rownums tr1
WHERE rownum <= (SELECT MAX(rownum) * 0.8
                 FROM training_rownums tr2
                 WHERE tr1.age = tr2.age);
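A quick hedged check that the split behaves as intended (assumes the query above was materialized as the training_set table used on a later slide):

-- the training share per age group should be roughly 0.8
SELECT age,
       count(*)::float8 / total.n AS train_fraction
FROM training_set t
JOIN (SELECT age, count(*) AS n
      FROM cust_age_totalspend
      GROUP BY age) total USING (age)
GROUP BY age, total.n;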
Another question

[Workflow diagram: back to the Ask Questions step.]
Back Here Again…

[Workflow diagram with the Manipulate Data step highlighted. Tools: SQL, PL/R, PL/Python.]
Shaping the Dataset

Remember the initial dataset:

select ssn, gender, trans_date, category, round(amt, 2) as amount
from mart.trans_fact limit 5;

     ssn     | gender | trans_date |    category    | amount
-------------+--------+------------+----------------+--------
 587-27-8957 | M      | 2013-12-21 | health_fitness |  10.59
 587-27-8957 | M      | 2014-10-19 | shopping_pos   |   0.80
 587-27-8957 | M      | 2013-05-04 | entertainment  |   5.17
 587-27-8957 | M      | 2014-03-27 | utilities      |   9.59
 587-27-8957 | M      | 2013-03-30 | home           |  10.64

How can we shape the data?
- Transpose using case statements
- Aggregate and sum
- The query… (a sketch follows)
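The deck elides the query itself; a minimal sketch, under the assumption that the pivot produces the category columns used later in elastic_net_train (only a few categories shown, and the view name is hypothetical):

CREATE OR REPLACE VIEW cust_category_spend AS
SELECT ssn,
       extract(years from age(NOW(), dob)) AS age,
       sum(CASE WHEN category = 'food_dining' THEN amt ELSE 0 END) AS food_dining,
       sum(CASE WHEN category = 'utilities'   THEN amt ELSE 0 END) AS utilities,
       sum(CASE WHEN category = 'travel'      THEN amt ELSE 0 END) AS travel
       -- repeat one CASE per remaining category (grocery_net, home, pharmacy, …)
FROM mart.trans_fact
GROUP BY ssn, extract(years from age(NOW(), dob));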
Building Models

[Workflow diagram with the Train/Tune Model step highlighted. Tools: PL/R, PL/Python, MADlib.]
Learn Age Using Category Spending

DROP TABLE IF EXISTS elastic_output;
SELECT madlib.elastic_net_train( 'training_set',
                                 'elastic_output',
                                 'age',
                                 'array[food_dining, utilities, grocery_net, home,
                                        pharmacy, shopping_pos, kids_pets, personal_care,
                                        misc_pos, gas_transport, misc_net, health_fitness,
                                        shopping_net, travel]',
                                 <other parameters>);
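A note on <other parameters>: in MADlib, elastic_net_train typically also takes the regression family (e.g. 'gaussian'), the alpha parameter mixing L1 and L2 regularization, and the lambda penalty strength; the exact signature is version-dependent, so check the MADlib documentation.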
Scoring Chosen Model

[Workflow diagram with the Score Models step highlighted. Tools: PL/R, PL/Python, MADlib.]
Predictions on Test Set

SELECT ssn, predict, age - predict AS residual
FROM (
  SELECT test_set.*,
         madlib.elastic_net_gaussian_predict(
             m.coef_all,
             m.intercept,
             ARRAY[food_dining, utilities, grocery_net, home, pharmacy,
                   shopping_pos, kids_pets, personal_care, misc_pos,
                   gas_transport, misc_net, health_fitness, shopping_net, travel]
         ) AS predict
  FROM test_set, elastic_output m) s
ORDER BY ssn;
Score Results

SELECT avg(residual^2)
FROM (<previous slide>) RESIDUALS;
PL/Python

- PL/Python (or PL/R) can be used for data manipulation and for training models
- Less intuitive than MADlib
Shaping the Dataset - UDFs

DROP TYPE IF EXISTS cust_address CASCADE;
CREATE TYPE cust_address AS
(
  ssn     text,
  address text
);

CREATE OR REPLACE FUNCTION
  mart.test(ssn text, street text, city text, state text, zip numeric)
RETURNS cust_address
AS $$
  # concatenate the address fields into a single string
  return [ssn, street + ', ' + city + ', ' + state + ' ' + str(zip)]
$$ LANGUAGE plpythonu;

----------------------------
SELECT mart.test(ssn, street, city, state, zip)
FROM mart.trans_fact LIMIT 5;
Shaping the Dataset - UDFs

INSERT INTO addresses
SELECT DISTINCT (mart.test(ssn, street, city, state, zip)).ssn,
                (mart.test(ssn, street, city, state, zip)).address
FROM mart.trans_fact
LIMIT 5;

SELECT * FROM addresses;
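The deck assumes an addresses table already exists; a minimal sketch of a definition that would accept the insert above (the DISTRIBUTED BY clause is an assumption, in the usual HAWQ/Greenplum style):

CREATE TABLE addresses (
  ssn     text,
  address text
) DISTRIBUTED BY (ssn);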

More Related Content

What's hot

YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
How to manage Hortonworks HDB Resources with YARN
How to manage Hortonworks HDB Resources with YARNHow to manage Hortonworks HDB Resources with YARN
How to manage Hortonworks HDB Resources with YARNHortonworks
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase DataWorks Summit
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive DataWorks Summit
 
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)Data processing at the speed of 100 Gbps@Apache Crail (Incubating)
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)DataWorks Summit
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storageDataWorks Summit
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobileDataWorks Summit
 
GPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServiceGPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServicePivotalOpenSourceHub
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRPivotalOpenSourceHub
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...DataWorks Summit
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsDataWorks Summit
 
Exploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyExploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyDataWorks Summit
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNDataWorks Summit
 
Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges DataWorks Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit
 

What's hot (20)

YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
How to manage Hortonworks HDB Resources with YARN
How to manage Hortonworks HDB Resources with YARNHow to manage Hortonworks HDB Resources with YARN
How to manage Hortonworks HDB Resources with YARN
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
 
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)Data processing at the speed of 100 Gbps@Apache Crail (Incubating)
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)
 
Spark + HBase
Spark + HBase Spark + HBase
Spark + HBase
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storage
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China Mobile
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
GPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServiceGPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a Service
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
Exploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyExploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthy
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
 
Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Viewers also liked

データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」Masayuki Matsushita
 
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...VMware Tanzu
 
công ty dịch vụ giúp việc quận 3 ở hcm
công ty dịch vụ giúp việc quận 3 ở hcmcông ty dịch vụ giúp việc quận 3 ở hcm
công ty dịch vụ giúp việc quận 3 ở hcmtory860
 
Domino-A520i-continuous-inkjet-printer
Domino-A520i-continuous-inkjet-printerDomino-A520i-continuous-inkjet-printer
Domino-A520i-continuous-inkjet-printerMelanie Buckle
 
nhận làm dịch vụ giúp việc quận 6 tại hcm
nhận làm dịch vụ giúp việc quận 6 tại hcmnhận làm dịch vụ giúp việc quận 6 tại hcm
nhận làm dịch vụ giúp việc quận 6 tại hcmkraig109
 
โครงงานคอมพิวเตอร์
โครงงานคอมพิวเตอร์โครงงานคอมพิวเตอร์
โครงงานคอมพิวเตอร์nuttavorn kesornsiri
 
Do you search an office ?
Do you search an office ?Do you search an office ?
Do you search an office ?INOFFICE ROMA
 
Application Test Engineer
Application Test EngineerApplication Test Engineer
Application Test EngineerManoj Pal
 
Undang undang-nomor-7-tahun-1992-tentang-perbankan-sebagaimana-diubah-dengan-...
Undang undang-nomor-7-tahun-1992-tentang-perbankan-sebagaimana-diubah-dengan-...Undang undang-nomor-7-tahun-1992-tentang-perbankan-sebagaimana-diubah-dengan-...
Undang undang-nomor-7-tahun-1992-tentang-perbankan-sebagaimana-diubah-dengan-...Sutra Sutra
 
โครงงานคอมพิวเตอร์PDF
โครงงานคอมพิวเตอร์PDFโครงงานคอมพิวเตอร์PDF
โครงงานคอมพิวเตอร์PDFnuttavorn kesornsiri
 
Penalized maximum likelihood estimates of genetic covariance matrices wit...
  Penalized maximum likelihood  estimates of genetic covariance  matrices wit...  Penalized maximum likelihood  estimates of genetic covariance  matrices wit...
Penalized maximum likelihood estimates of genetic covariance matrices wit...prettygully
 

Viewers also liked (14)

データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
 
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
 
công ty dịch vụ giúp việc quận 3 ở hcm
công ty dịch vụ giúp việc quận 3 ở hcmcông ty dịch vụ giúp việc quận 3 ở hcm
công ty dịch vụ giúp việc quận 3 ở hcm
 
Domino-A520i-continuous-inkjet-printer
Domino-A520i-continuous-inkjet-printerDomino-A520i-continuous-inkjet-printer
Domino-A520i-continuous-inkjet-printer
 
nhận làm dịch vụ giúp việc quận 6 tại hcm
nhận làm dịch vụ giúp việc quận 6 tại hcmnhận làm dịch vụ giúp việc quận 6 tại hcm
nhận làm dịch vụ giúp việc quận 6 tại hcm
 
โครงงานคอมพิวเตอร์
โครงงานคอมพิวเตอร์โครงงานคอมพิวเตอร์
โครงงานคอมพิวเตอร์
 
Jib Cranes
Jib CranesJib Cranes
Jib Cranes
 
MaithiliN_Resume_March_2015
MaithiliN_Resume_March_2015MaithiliN_Resume_March_2015
MaithiliN_Resume_March_2015
 
Do you search an office ?
Do you search an office ?Do you search an office ?
Do you search an office ?
 
Application Test Engineer
Application Test EngineerApplication Test Engineer
Application Test Engineer
 
Scout Trail
Scout TrailScout Trail
Scout Trail
 
Undang undang-nomor-7-tahun-1992-tentang-perbankan-sebagaimana-diubah-dengan-...
Undang undang-nomor-7-tahun-1992-tentang-perbankan-sebagaimana-diubah-dengan-...Undang undang-nomor-7-tahun-1992-tentang-perbankan-sebagaimana-diubah-dengan-...
Undang undang-nomor-7-tahun-1992-tentang-perbankan-sebagaimana-diubah-dengan-...
 
โครงงานคอมพิวเตอร์PDF
โครงงานคอมพิวเตอร์PDFโครงงานคอมพิวเตอร์PDF
โครงงานคอมพิวเตอร์PDF
 
Penalized maximum likelihood estimates of genetic covariance matrices wit...
  Penalized maximum likelihood  estimates of genetic covariance  matrices wit...  Penalized maximum likelihood  estimates of genetic covariance  matrices wit...
Penalized maximum likelihood estimates of genetic covariance matrices wit...
 

Similar to SQL and Machine Learning on Hadoop using HAWQ

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases  SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases DataWorks Summit
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Frank Munz
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabsSiva Sankar
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: RevealedSachin Holla
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceHortonworks
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
 

Similar to SQL and Machine Learning on Hadoop using HAWQ (20)

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases  SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabs
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 

Recently uploaded

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

SQL and Machine Learning on Hadoop using HAWQ

  • 13. The Winning Combination
  - HDFS as the distributed FS: schema-less and fault-tolerant, with a huge Apache ecosystem of tools/products.
  - An MPP SQL engine on top for data access, processing and ML workloads.
  - Blends with and leverages BI and ETL tools.
  - Blends with Hadoop file formats and integrates with Apache HDFS tools/products.
  • 14. HAWQ, the MPP SQL Engine for HDFS
  - Derived from Greenplum MPP, shortcutting the many years of engineering effort that go into building an MPP from scratch.
  - Stores data in HDFS using the Parquet file format.
  - Familiar SQL interface, 100% ANSI compliant: all kinds of joins, analytical functions, etc.
  - Doesn't use MapReduce; no dependencies on Hive or the Hive metastore.
  - Far faster than MR; not even a valid comparison.
  - Enables in-database Machine Learning.
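  To make this concrete, a minimal sketch of the developer experience (the table and column names here are hypothetical; the Parquet storage options follow HAWQ/Greenplum DDL conventions):

      -- Hypothetical table: Parquet-backed, stored in HDFS, distributed across segments
      CREATE TABLE sales_fact (
          sale_id bigint,
          region  text,
          amount  numeric
      )
      WITH (APPENDONLY=true, ORIENTATION=parquet)
      DISTRIBUTED BY (sale_id);

      -- Plain ANSI SQL, including analytic functions, runs in parallel over HDFS data
      SELECT region,
             sum(amount) AS total,
             rank() OVER (ORDER BY sum(amount) DESC) AS region_rank
      FROM sales_fact
      GROUP BY region;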
  • 15. Pivotal HAWQ Architecture
  [Cluster diagram: HAWQ Master (Node 1) and HAWQ Secondary Master (Node 2); NameNode (Node 3) and Secondary NameNode (Node 4), each with a Quorum Journal Manager and ZooKeeper; HAWQ Segments co-located with HDFS DataNodes on Nodes 5 through N; all nodes joined by a high-speed interconnect.]
  • 16. Pivotal HAWQ Architecture
  - Master: Master and Standby Master hosts; the Master coordinates work with the Segment hosts.
  - Segment: one or more segment instances per host; processes queries in parallel with dedicated CPU, memory and disk.
  - Interconnect: high-speed interconnect for continuous pipelining of data processing.
  • 17. Key components of the HAWQ Master
  [Diagram: the HAWQ Master and Standby Master each contain a Query Parser, Query Optimizer, Query Executor, Transaction Manager, Process Manager and Metadata Catalog; the catalog is kept in sync via WAL replication.]
  • 18. Key components of HAWQ segments
  [Diagram: each HAWQ Segment hosts a Query Executor with libhdfs3 and PXF, co-located with an HDFS DataNode; the segment data directory and spill data directory sit on the local filesystem.]
  • 19. What differentiates HAWQ?
  • 20. What differentiates HAWQ: Interconnect & Dynamic Pipelining
  - Parallel data flow using the high-speed UDP interconnect.
  - No materialization of intermediate data.
  - MapReduce always writes intermediate results to disk; HAWQ spills to disk only when memory is insufficient.
  - Guarantees that queries will always complete.
  [Diagram: segment hosts, each with a Query Executor, PXF, DataNode and local temp storage, exchanging data over the interconnect via dynamic pipelining.]
  • 21. What differentiates HAWQ: libhdfs3
  - Pivotal rewrote libhdfs in C++, resulting in libhdfs3, a library with a C-compatible API.
  - Leverages protocol buffers to achieve greater performance.
  - libhdfs3 is used to access HDFS from HAWQ and can benefit from short-circuit reads.
  - Open sourced: https://github.com/PivotalRD/libhdfs3
  • 22. What differentiates HAWQ: HDFS Truncate
  - HDFS is designed as an append-only, POSIX-like FS.
  - ACID compliance: HDFS guarantees the atomicity of truncate operations, i.e. a truncate either fully succeeds or fails.
  - Truncate (HDFS-3107) was added to HDFS to support rollbacks.
  • 23. What differentiates HAWQ: ORCA
  - Multi-core, cost-based query optimizer (codename ORCA).
  - Window and analytic functions.
  - Partition elimination and subquery unnesting; as an example, 1.2 billion plan alternatives evaluated in 250 ms for TPC-DS Query #21.
  - Presented at the ACM SIGMOD 2014 conference.
  • 24. Pivotal eXtension Framework (PXF)
  - External table interface in HAWQ to read data stored in the Hadoop ecosystem.
  - External tables can be used to load data into HAWQ from Hadoop, or to query Hadoop data without materializing it into HAWQ.
  - Enables loading and querying of data stored in HDFS text, Avro, HBase, and Hive (Text, Sequence and RCFile formats).
  • 25. PXF example to query HBase
  Get data from an HBase table called 'sales'. In this example we are only interested in the rowkey, the qualifier 'saleid' inside column family 'cf1', and the qualifier 'comments' inside column family 'cf8':

      CREATE EXTERNAL TABLE hbase_sales (
          recordkey bytea,
          "cf1:saleid" int,
          "cf8:comments" varchar
      )
      LOCATION('pxf://10.76.72.26:50070/sales?Fragmenter=HBaseDataFragmenter&Accessor=HBaseAccessor&Resolver=HBaseResolver')
      FORMAT 'custom' (formatter='gpxfwritable_import');
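  Once defined, the external table behaves like any other HAWQ table; a short usage sketch (the predicate and column list are illustrative):

      -- Query HBase data in place, without materializing it into HAWQ
      SELECT recordkey, "cf1:saleid", "cf8:comments"
      FROM hbase_sales
      WHERE "cf1:saleid" > 1000;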
  • 26. Example Data Flow in HAWQ
  [Data flow diagram: source data lands in HDFS; HAWQ external tables give immediate ad-hoc access; data is loaded or appended into regular HAWQ tables, either via external tables or directly, including data from outside HDFS; downstream, BI tools provide reporting and dashboards, and in-database machine learning runs via MADlib, PL/Python and PL/R.]
  • 27. HAWQ Demo
  - HAWQ cluster topology
  - HAWQ ingest
  - HAWQ PXF
  - Dashboards
  • 28. HAWQ Demo DataSet
  - 23 million rows of synthetic customer credit-card spend data.
  - Time dimension.
  - Aggregate tables.
  • 29. Machine Learning on HDFS using HAWQ
  • 30. Analytics on HDFS, Popular Notions
  - Store it all in HDFS and query it all using HAWQ: scalable storage and scalable compute.
  - The "gravity" of data changes everything: move code, not data; parallelize everything; look at the entire data set and not just a sample; more data could mean simpler models.
  • 31. MADlib for Machine Learning
  - MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.
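  A quick way to confirm the library is present in a database (madlib.version() ships with the standard MADlib install, though its output format varies by release):

      -- Returns the MADlib version string for the current database
      SELECT madlib.version();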
  • 32. UDF - PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
  - Allows users to write functions in R, Python, Java, Perl, pgsql or C.
  - The interpreter/VM of language 'X' is installed on each node of the Greenplum Database cluster.
  - Python extensions like NumPy, NLTK, scikit-learn and SciPy can be installed.
  - Data parallelism: PL/X piggybacks on the MPP architecture; see the sketch below.
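  A minimal PL/Python sketch (the function name and inputs are hypothetical; assumes the plpythonu language and NumPy are installed on every segment host):

      -- Hypothetical UDF: executes on every segment, so rows are scored in parallel
      CREATE OR REPLACE FUNCTION zscore(val float8, mean float8, stddev float8)
      RETURNS float8 AS $$
          # Ordinary Python running inside the database
          import numpy as np
          if stddev == 0:
              return None
          return float(np.divide(val - mean, stddev))
      $$ LANGUAGE plpythonu;

      -- Applied like any SQL function, e.g.:
      -- SELECT ssn, zscore(amt, 50.0, 12.5) FROM mart.trans_fact;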
  • 33. UDF - PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
  [Same cluster diagram as slide 15: the language interpreter lives on every segment node, so UDFs execute in parallel across the cluster.]
  • 34. Data Analytics in SQL w/ HAWQ
  - A Data Science workflow in SQL; re-training is often necessary to take the step from traditional BI to DS.
  - Many parts run in parallel: data manipulation, training (e.g. linear regression, elastic net, SVM), model evaluation.
  - MADlib for a native SQL interface with simple syntax.
  - Extensible with PL/Python and PL/R: all your favorite packages.
  • 35. Workflow
  [Workflow diagram, stages: Ask Questions, EDA/V, Manipulate Data, Split Datasets, Train/Tune Model, Score Models against test data, Model is good, Make Decisions.]
  • 36. Exploratory Data Analysis/Viz
  [Workflow diagram, EDA/V stage highlighted; tools: SQL, MADlib, BI tools.]
  • 37. Explore Data with MADlib
  - Select a row.
  - Group by customer (+ age and gender) to see total spend; call it 'summary_stats' and select a row.
  - Use the MADlib summary statistics function, which creates a table 'summary_stats_out' we can query:

      select madlib.summary('summary_stats', 'summary_stats_out', 'total_spent');

  - Can also group by:

      select madlib.summary('summary_stats', 'summary_stats_out_grouped', 'total_spent', 'gender, age');
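  The output table is then inspected with ordinary SQL; a sketch (the column names follow the MADlib summary() documentation and may vary by version):

      -- Per-column descriptive statistics produced by madlib.summary()
      SELECT target_column, mean, variance, min, max
      FROM summary_stats_out;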
  • 38. Ask Questions
  - Start with simple questions and simple (linear) models.
  - Imagine we get new data with masked demographic data…
  - What is the simplest possible question we can ask of this dataset? Can we predict age by total amount spent?
  - What needs to be done first?
  • 39. Feature Engineering
  [Workflow diagram, Manipulate Data stage highlighted; tools: SQL, PL/R, PL/Python.]
  • 40–41. Shaping the Dataset
  We want the data to look something like:

      ssn         | sum      | age
      ------------+----------+-----
      621-95-5488 | 19435.72 | 30

      CREATE OR REPLACE VIEW cust_age_totalspend AS
      SELECT ssn,
             sum(amt),
             extract(years from age(NOW(), dob)) as age
      FROM mart.trans_fact
      GROUP BY ssn, extract(years from age(NOW(), dob));
  • 42. Skipping for now…
  [Workflow diagram; the Split Datasets stage is skipped for now. Tools: SQL, PL/R, PL/Python, MADlib.]
  • 43. Building Models
  [Workflow diagram, Train/Tune Model stage highlighted; tools: PL/R, PL/Python, MADlib.]
  • 44–48. MADlib's Interface
  - Inputs: tables (here, the VIEW cust_age_totalspend), plus labels, features and parameters.
  - Outputs: tables that can be queried. MADlib generates the output table automatically; it can be piped into PyMADlib, pulled via the Tableau connector, or queried from an API.
  - Parallel when possible.

      SELECT madlib.linregr_train(
          'cust_age_totalspend',
          'linear_model',
          'age',
          'ARRAY[1, sum]');
  • 49. Scoring Chosen Model
  [Workflow diagram, Score Models against test data stage highlighted; tools: PL/R, PL/Python, MADlib.]
  • 50. Viewing Coefficients
  As simple as querying a table:

      SELECT * FROM linear_model;

  Unnest the arrays:

      SELECT unnest(ARRAY['intercept', 'sum']) as attribute,
             unnest(coef) as coefficient,
             unnest(std_err) as standard_error,
             unnest(t_stats) as t_stat,
             unnest(p_values) as pvalue
      FROM linear_model;
  • 51–53. Making/Evaluating Predictions

      SELECT ssn, sum, age,
             -- calculate predictions: the array and m.coef are vectors
             madlib.linregr_predict(ARRAY[1, sum], m.coef) as predict,
             -- calculate residuals: the array and m.coef are vectors
             age - madlib.linregr_predict(ARRAY[1, sum], m.coef) as residual
      FROM cust_age_totalspend, linear_model m;
  • 54. Scoring (Mean Squared Error)

      SELECT avg(residual^2) AS mse
      FROM (
          SELECT age - madlib.linregr_predict(ARRAY[1, sum], m.coef) as residual
          FROM cust_age_totalspend, linear_model m
      ) PREDICTIONS;
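  As a side note, the model table written by linregr_train also carries goodness-of-fit columns (per the MADlib linear regression docs; exact columns may vary by version), so R-squared can be read directly:

      -- r2 is stored alongside coef, std_err, t_stats and p_values
      SELECT r2 FROM linear_model;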
  • 55. Training, Testing, Cross-Validation Sets
  [Workflow diagram, Split Datasets stage highlighted; tools: SQL, PL/R, PL/Python, MADlib.]
  • 56. SQL To Create 80/20 Test/Train
  - MADlib has a built-in cross-validation function.
  - Stratified sampling with pure SQL:
    - Set a random seed (any value in [-1, 1]): SELECT setseed(0.444);
    - Partition the data by group and sort by the random number.
    - Assign a row count.
    - Use a correlated subquery to select the lowest 80% of row counts by partition into the training set.
  - The code…
  • 57. SQL To Create 80/20 Test/Train

      -- 1. perform the same aggregation step as before in creating the view
      WITH assign_random AS (
          SELECT ssn,
                 extract(years from age(NOW(), dob)) as age,
                 sum(amt) as total_spent,
                 random() as random
          FROM mart.trans_fact
          GROUP BY ssn, extract(years from age(NOW(), dob))
      ),
      -- 2. add a row number per group type (in this case, just age)
      training_rownums AS (
          SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY age ORDER BY random) AS rownum
          FROM assign_random
      )
      -- 3. sample the lowest 80% of rownums as the training set by group, using
      --    a correlated subquery (iterating over groups) because HAWQ is ANSI compliant
      SELECT ssn, age, total_spent
      FROM training_rownums tr1
      WHERE rownum <= (
          SELECT MAX(rownum) * 0.8
          FROM training_rownums tr2
          WHERE tr1.age = tr2.age
      );
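  A quick sanity check of the split (a sketch; assumes the query above was materialized into a training_set table and its complement into test_set, as the later slides do):

      -- Row counts should land near the 80/20 target
      SELECT 'train' AS which, count(*) FROM training_set
      UNION ALL
      SELECT 'test', count(*) FROM test_set;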
  • 58. Another question
  [Workflow diagram, returning to the Ask Questions stage.]
  • 59. Back Here Again…
  [Workflow diagram, Manipulate Data stage highlighted; tools: SQL, PL/R, PL/Python.]
  • 60–62. Shaping the Dataset
  Remember the initial dataset:

      select ssn, gender, trans_date, category, round(amt, 2) as amount
      from mart.trans_fact limit 5;

      ssn         | gender | trans_date | category       | amount
      ------------+--------+------------+----------------+--------
      587-27-8957 | M      | 2013-12-21 | health_fitness |  10.59
      587-27-8957 | M      | 2014-10-19 | shopping_pos   |   0.80
      587-27-8957 | M      | 2013-05-04 | entertainment  |   5.17
      587-27-8957 | M      | 2014-03-27 | utilities      |   9.59
      587-27-8957 | M      | 2013-03-30 | home           |  10.64

  How can we shape the data? Transpose using case statements, then aggregate and sum. The query… (a sketch follows)
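  The deck leaves the full query to the demo; a minimal sketch of the CASE-statement pivot it describes might look like this (only three of the category columns are shown; the category names come from the elastic-net slide below, and the view name is hypothetical):

      -- Hypothetical pivot: one row per customer, one column per spend category
      CREATE OR REPLACE VIEW cust_category_spend AS
      SELECT ssn,
             extract(years from age(NOW(), dob)) as age,
             sum(CASE WHEN category = 'food_dining' THEN amt ELSE 0 END) as food_dining,
             sum(CASE WHEN category = 'utilities'   THEN amt ELSE 0 END) as utilities,
             sum(CASE WHEN category = 'home'        THEN amt ELSE 0 END) as home
             -- ... one CASE per remaining category ...
      FROM mart.trans_fact
      GROUP BY ssn, extract(years from age(NOW(), dob));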
  • 63. Building Models
  [Workflow diagram, Train/Tune Model stage highlighted; tools: PL/R, PL/Python, MADlib.]
  • 64. Learn Age Using Category Spending

      DROP TABLE IF EXISTS elastic_output;
      SELECT madlib.elastic_net_train(
          'training_set',
          'elastic_output',
          'age',
          'array[food_dining, utilities, grocery_net, home, pharmacy,
                 shopping_pos, kids_pets, personal_care, misc_pos,
                 gas_transport, misc_net, health_fitness, shopping_net, travel]',
          <other parameters>);
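  The slide elides the remaining arguments; per the MADlib elastic_net_train() documentation they select the regression family and the regularization settings, roughly like this (values are illustrative and the signature varies across MADlib versions):

      SELECT madlib.elastic_net_train(
          'training_set', 'elastic_output', 'age',
          '<feature array as above>',
          'gaussian',   -- regression family (linear regression)
          0.5,          -- alpha: mix between L1 and L2 penalties
          0.1);         -- lambda: regularization strength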
  • 65. Scoring Chosen Model
  [Workflow diagram, Score Models against test data stage highlighted; tools: PL/R, PL/Python, MADlib.]
  • 66–67. Predictions on Test Set

      SELECT ssn, predict, age - predict AS residual
      FROM (
          SELECT test_set.*,
                 madlib.elastic_net_gaussian_predict(
                     m.coef_all,
                     m.intercept,
                     ARRAY[food_dining, utilities, grocery_net, home, pharmacy,
                           shopping_pos, kids_pets, personal_care, misc_pos,
                           gas_transport, misc_net, health_fitness, shopping_net, travel]
                 ) AS predict
          FROM test_set, elastic_output m
      ) s
      ORDER BY ssn;
  • 68. Score Results

      SELECT avg(residual^2) FROM (<previous slide>) RESIDUALS;
  • 69. PL/Python
  - PL/Python (or PL/R) for data manipulation and training models.
  - Less intuitive than MADlib.
  • 70–74. Shaping the Dataset - UDFs

      DROP TYPE IF EXISTS cust_address CASCADE;
      CREATE TYPE cust_address AS (
          ssn text,
          address text
      );

      CREATE OR REPLACE FUNCTION mart.test(ssn text, street text, city text,
                                           state text, zip numeric)
      RETURNS cust_address AS $$
          # Assemble a single address string from the parts
          return [ssn, street + ', ' + city + ', ' + state + ' ' + str(zip)]
      $$ LANGUAGE plpythonu;

      ----------------------------
      SELECT mart.test(ssn, street, city, state, zip)
      FROM mart.trans_fact LIMIT 5;
  • 75. Shaping the Dataset - UDFs

      INSERT INTO addresses
      SELECT DISTINCT
          (mart.test(ssn, street, city, state, zip)).ssn,
          (mart.test(ssn, street, city, state, zip)).address
      FROM mart.trans_fact
      LIMIT 5;

      SELECT * FROM addresses;
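  Since mart.test() returns a composite type, the (expression).field syntax above is what unpacks it. A slightly tidier variant (a sketch over the same hypothetical objects) evaluates the function once in a subquery instead of twice:

      INSERT INTO addresses
      SELECT DISTINCT (t.addr).ssn, (t.addr).address
      FROM (
          SELECT mart.test(ssn, street, city, state, zip) AS addr
          FROM mart.trans_fact
          LIMIT 5
      ) t;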