Pivotal eXtension
Framework
Sameer Tiwari
Hadoop Storage Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech
Data Analysis Timeline
• ISAM files: COBOL/JCL
• RDBMS: SQL
• HDFS files: MapReduce/Hive
• HDFS files: MapReduce/Hive and SQL
Simplified View of Co-existence
[Diagram: SQL runs over RDBMS files on one side; MapReduce, Hive, and HBase run over HDFS files on the other. The gap between the two stacks is "The Great Divide".]
PXF addresses the divide.
Pivotal eXtension Framework (PXF)
• History
o Based on the external table functionality of RDBMSs
o Built at Pivotal by a small team in Israel
• Goals
o Single hop
o No materialization of data
o Fully parallel for high throughput
o Extensible
Motivation for building PXF
• Use the SQL engine's statistical/analytic functions (e.g. MADlib) on third-party data stores, such as:
o HBase data
o Hive data
o Native data on HDFS in a variety of formats
• Join in-database dimensions with other fact tables
• Fast ingest of data into the SQL engine's native format (insert into … select * from …)
Motivation for building PXF
• Enterprises love the cheap storage offered by HDFS and want to keep their data there
• M/R is very limiting
• Integrating with third-party systems, e.g. Accumulo
• Existing techniques involved copying data into HDFS, which is brittle and inefficient
High Level Flow
[Diagram: the SQL engine asks the NameNode "Where is the data for table foo?" and is answered "On DataNodes 1, 3 and 5"; it then reads from those DataNodes directly.]
- The protocol is HTTP
- Endpoints run on all DataNodes
Major components
• Fragmenter
o Get the locations of the fragments of a table
• Accessor
o Understand and read a fragment, return records
• Resolver
o Convert the records into the SQL engine's format
• Analyzer
o Provide source statistics to the query optimizer
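As a rough illustration of how these components cooperate on the read path, here is a hypothetical driver loop written against the IReadAccessor and IReadResolver interfaces shown later in this deck. It is a sketch, not actual PXF internals: the real service wires this together behind its HTTP endpoints, and emit() is a stand-in for handing tuples to the SQL engine.

public class ReadPathSketch {
  // Stream every record of one fragment through an Accessor, convert
  // it with a Resolver, and hand the typed fields to the engine.
  static void scanFragment(IReadAccessor accessor, IReadResolver resolver) throws Exception {
    if (!accessor.openForRead()) {
      return;                                 // fragment could not be opened
    }
    try {
      OneRow row;
      while ((row = accessor.readNextObject()) != null) {  // null marks end of fragment
        java.util.List<OneField> fields = resolver.getFields(row);
        emit(fields);
      }
    } finally {
      accessor.closeForRead();
    }
  }

  static void emit(java.util.List<OneField> fields) {
    /* stand-in for the SQL engine's tuple builder */
  }
}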
PXF Architecture
[Diagram, numbered request flow: PSQL submits "select * from external table foo location='pxf://namenode:50070/financedata'" to the HAWQ Master; the master asks the PXF Fragmenter endpoint (a container with endpoints on each Data Node) for the table's metadata and receives splits[..]; each HAWQ Segment then issues getSplit(0)-style requests to the PXF Accessor/Resolver endpoints and streams back PXFWritable records read from local HDFS. M/R, Pig, and Hive run alongside on the Pivotal HD (Hadoop) cluster, which also hosts ZooKeeper.]
Classes
• The four major components are defined as interfaces and base classes that can be extended, e.g. the Fragmenter:
/*
 * Class holding information about fragments (FragmentInfo)
 */
public class FragmentsOutput {
  public FragmentsOutput();
  public void addFragment(String sourceName, String[] replicas, byte[] metadata);
  public void addFragment(String sourceName, String[] replicas, byte[] metadata,
                          String userData);
  public List<FragmentInfo> getFragments();
}
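For illustration, a hypothetical custom Fragmenter body that fills a FragmentsOutput. The Fragmenter base class itself is not shown on this slide, and MyBlock/listMyBlocks() are stand-ins for whatever block or region metadata your source exposes.

public class MyStoreFragmenter /* extends Fragmenter: base class not shown on this slide */ {
  public FragmentsOutput getFragments() throws Exception {
    FragmentsOutput output = new FragmentsOutput();
    // One fragment per block of the source: its name, the hosts holding
    // replicas (for locality), and opaque metadata the Accessor needs later.
    for (MyBlock block : listMyBlocks()) {
      output.addFragment(block.name(), block.replicaHosts(), block.metadataBytes());
    }
    return output;
  }

  // MyBlock and listMyBlocks() are stand-ins for source metadata
  // (HDFS blocks, HBase regions, ...).
  interface MyBlock {
    String name();
    String[] replicaHosts();
    byte[] metadataBytes();
  }

  private java.util.List<MyBlock> listMyBlocks() {
    return java.util.Collections.emptyList();  // placeholder source listing
  }
}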
Accessor Interface
/* Internal interface that defines the access to data on the source
 * data store (e.g. a file on HDFS, a region of an HBase table, etc.).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IReadAccessor {
  public boolean openForRead() throws Exception;
  public OneRow readNextObject() throws Exception;
  public void closeForRead() throws Exception;
}

/*
 * An interface for writing data into a data store
 * (e.g. a sequence file on HDFS).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IWriteAccessor {
  public boolean openForWrite() throws Exception;
  public OneRow writeNextObject(OneRow onerow) throws Exception;
  public void closeForWrite() throws Exception;
}
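A minimal IReadAccessor sketch over a newline-delimited local file, just to show the life cycle; the file path is illustrative, and the OneRow(key, data) constructor is an assumption based on the built-in accessors.

public class SimpleLineAccessor implements IReadAccessor {
  private java.io.BufferedReader reader;
  private long lineNumber = 0;

  public boolean openForRead() throws Exception {
    // Illustrative local path; a real accessor would open the HDFS
    // fragment described by the metadata handed to the plugin.
    reader = new java.io.BufferedReader(new java.io.FileReader("/tmp/pxf-data"));
    return true;
  }

  public OneRow readNextObject() throws Exception {
    String line = reader.readLine();
    // Returning null tells the framework this fragment is exhausted.
    return (line == null) ? null : new OneRow(lineNumber++, line);  // OneRow(key, data): assumed
  }

  public void closeForRead() throws Exception {
    reader.close();
  }
}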
Resolver Interface
/*
 * Interface that defines the deserialization of one record brought from
 * the data Accessor. Every implementation of a deserialization method
 * (e.g. Writable, Avro, ...) must implement this interface.
 */
public interface IReadResolver {
  public List<OneField> getFields(OneRow row) throws Exception;
}

/*
 * Interface that defines the serialization of data read from the DB
 * into a OneRow object.
 * Every implementation of a serialization method
 * (e.g. Writable, Avro, ...) must implement this interface.
 */
public interface IWriteResolver {
  public OneRow setFields(DataInputStream inputStream) throws Exception;
}
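A matching IReadResolver sketch that splits the line produced by the accessor above into fields. The OneField constructor and row.getData() accessor are assumptions, and TEXT_OID stands in for the engine's text type code.

public class SimpleCsvResolver implements IReadResolver {
  private static final int TEXT_OID = 25;  // stand-in for the engine's text type code

  public java.util.List<OneField> getFields(OneRow row) throws Exception {
    // row.getData() (assumed accessor) holds the raw line from the Accessor.
    String line = (String) row.getData();
    java.util.List<OneField> fields = new java.util.ArrayList<OneField>();
    for (String token : line.split(",")) {
      fields.add(new OneField(TEXT_OID, token));  // OneField(type, value): assumed
    }
    return fields;
  }
}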
Analyzer Interface
/* Abstract class that defines getting statistics for ANALYZE.
 * GetEstimatedStats returns statistics for a given path
 * (block size, number of blocks, number of tuples (rows)).
 * Used when calling ANALYZE on a PXF external table, to get the
 * table's statistics for the optimizer to plan queries with.
 */
public abstract class Analyzer extends Plugin {
  public Analyzer(InputData metaData) {
    super(metaData);
  }

  /** path is a data source name (e.g. file, dir, wildcard, table name);
   * returns the data statistics in JSON format.
   *
   * NOTE: It is highly recommended to implement extremely fast logic
   * that returns *estimated* statistics. Scanning all the data for exact
   * statistics is considered bad practice.
   */
  public String GetEstimatedStats(String data) throws Exception {
    /* Return default values */
    return DataSourceStatsInfo.dataToJSON(new DataSourceStatsInfo());
  }
}
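A small hypothetical subclass showing the intended pattern: override GetEstimatedStats with something cheap (metadata lookups, not a scan) and serialize the result the same way the default does. The DataSourceStatsInfo population step is elided because its setters are not shown on the slide.

public class MyStoreAnalyzer extends Analyzer {
  public MyStoreAnalyzer(InputData metaData) {
    super(metaData);
  }

  @Override
  public String GetEstimatedStats(String data) throws Exception {
    DataSourceStatsInfo stats = new DataSourceStatsInfo();
    // Estimate block and tuple counts from source metadata only;
    // scanning the data here would slow ANALYZE down (see note above).
    // ... populate stats via its setters (not shown on the slide) ...
    return DataSourceStatsInfo.dataToJSON(stats);
  }
}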
Syntax - Long Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?
FRAGMENTER=com.pivotal.pxf.fragmenters.HdfsDataFragmenter&
ACCESSOR=com.pivotal.pxf.accessors.LineBreakAccessor&
RESOLVER=com.pivotal.pxf.resolvers.StringPassResolver&
ANALYZER=com.pivotal.pxf.analyzers.HdfsAnalyzer')
format 'TEXT' (delimiter = ',');
Say WHAT???
Syntax - Short Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?profile=HdfsTextSimple')
format 'TEXT' (delimiter = ',');
Whew!!
Built-in Profiles
• A number of profiles are built in, and more are being contributed
o HBase, Hive, HDFS Text, Avro, SequenceFiles, GemFireXD, Accumulo, Cassandra, JSON
o PXF will be open-sourced completely, for use with your favorite SQL engine
o But you can write your own connectors right now and use them with HAWQ
Predicate Pushdown
• SQL engines may push parts of the "WHERE" clause down to PXF
• e.g. "where id > 500 and id < 1000"
• PXF provides a FilterBuilder class (see the sketch below)
• Filters can be combined together
• Simple expression: "constant <OP> column"
• Complex expression: "object(s) <OP> object(s)"
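A hypothetical FilterBuilder callback, only to show the shape of the two cases listed above; the interface name and build() signature are assumptions, so check the PXF reference docs for the real FilterParser API.

public class MyFilterBuilder implements FilterParser.IFilterBuilder {  // interface name: assumed
  public Object build(FilterParser.Operation op, Object left, Object right) throws Exception {
    // Simple case: column <OP> constant, e.g. "id > 500".
    // Complex case: left and right are filters already built by earlier
    // calls, combined with a logical operator, e.g. "(...) AND (...)".
    return new MyFilter(op, left, right);
  }

  // MyFilter is a stand-in for a store-specific filter representation
  // (e.g. an HBase Scan filter or a pruned file list).
  static class MyFilter {
    final Object op, left, right;
    MyFilter(Object op, Object left, Object right) {
      this.op = op; this.left = left; this.right = right;
    }
  }
}

For "where id > 500 and id < 1000", the parser would invoke build() three times: once for each comparison and once to combine the two results.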
Demo
• Create a text file on HDFS
• Create a table using a SQL engine (HAWQ) on HDFS
• Create an external table using PXF
• Select from both tables separately
• Finally run a join across both tables
More info online...
• http://docs.gopivotal.com/pivotalhd/PXFInstallationandAdministration.html
• http://docs.gopivotal.com/pivotalhd/PXFExternalTableandAPIReference.html
Questions?
Pivotal eXtension
Framework
Sameer Tiwari
Hadoop Storage Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech
