Pivotal eXtension
Framework
Sameer Tiwari
Hadoop Storage Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech
Data Analysis Timeline
• ISAM files: COBOL/JCL
• RDBMS: SQL
• HDFS files: MapReduce/Hive
• HDFS files: MapReduce/Hive and SQL
Simplified View of Co-existence
[Diagram: SQL runs over RDBMS files on one side; MapReduce, Hive, and HBase run over HDFS files on the other. The gap between the two stacks is "The Great Divide".]
PXF addresses the divide.
Pivotal eXtension Framework (PXF)
• History
o Based on the external table functionality of RDBMSs
o Built at Pivotal by a small team in Israel
• Goals
o Single hop
o No materialization of data
o Fully parallel for high throughput
o Extensible
Motivation for building PXF
• Use the SQL engine's statistical/analytic functions (e.g. MADlib) on third-party data stores, such as:
o HBase data
o Hive data
o Native data on HDFS in a variety of formats
• Join in-database dimensions with other fact tables
• Fast ingest of data into the SQL engine's native format (insert into … select * from …)
Motivation for building PXF
• Enterprises love the cheap storage offered by HDFS and want to keep their data there
• M/R is very limiting
• Integrating with third-party systems, e.g. Accumulo
• Existing techniques involved copying data into HDFS, which is brittle and inefficient
High Level Flow
[Diagram: the SQL engine asks the NameNode "Where is the data for table foo?" and is answered "On DataNodes 1, 3 and 5"; it then reads from those DataNodes directly.]
- The protocol is HTTP
- Endpoints run on all DataNodes
Major components
• Fragmenter
o Get the locations of the fragments of a table
• Accessor
o Understand and read a fragment, return records
• Resolver
o Convert the records into the SQL engine's format
• Analyzer
o Provide source statistics to the query optimizer
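As a rough illustration of how these components cooperate on the read path, here is a hypothetical driver loop written against the IReadAccessor and IReadResolver interfaces shown later in this deck. It is a sketch, not actual PXF internals: the real service wires this together behind its HTTP endpoints, and emit() is a stand-in for handing tuples to the SQL engine.

public class ReadPathSketch {
  // Stream every record of one fragment through an Accessor, convert
  // it with a Resolver, and hand the typed fields to the engine.
  static void scanFragment(IReadAccessor accessor, IReadResolver resolver) throws Exception {
    if (!accessor.openForRead()) {
      return;                                 // fragment could not be opened
    }
    try {
      OneRow row;
      while ((row = accessor.readNextObject()) != null) {  // null marks end of fragment
        java.util.List<OneField> fields = resolver.getFields(row);
        emit(fields);
      }
    } finally {
      accessor.closeForRead();
    }
  }

  static void emit(java.util.List<OneField> fields) {
    /* stand-in for the SQL engine's tuple builder */
  }
}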
PXF Architecture
[Diagram, numbered request flow: PSQL submits "select * from external table foo location='pxf://namenode:50070/financedata'" to the HAWQ Master; the master asks the PXF Fragmenter endpoint (a container with endpoints on each Data Node) for the table's metadata and receives splits[..]; each HAWQ Segment then issues getSplit(0)-style requests to the PXF Accessor/Resolver endpoints and streams back PXFWritable records read from local HDFS. M/R, Pig, and Hive run alongside on the Pivotal HD (Hadoop) cluster, which also hosts ZooKeeper.]
Classes
• The four major components are defined as interfaces and base classes that can be extended, e.g. the Fragmenter:
/*
 * Class holding information about fragments (FragmentInfo)
 */
public class FragmentsOutput {
  public FragmentsOutput();
  public void addFragment(String sourceName, String[] replicas, byte[] metadata);
  public void addFragment(String sourceName, String[] replicas, byte[] metadata,
                          String userData);
  public List<FragmentInfo> getFragments();
}
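For illustration, a hypothetical custom Fragmenter body that fills a FragmentsOutput. The Fragmenter base class itself is not shown on this slide, and MyBlock/listMyBlocks() are stand-ins for whatever block or region metadata your source exposes.

public class MyStoreFragmenter /* extends Fragmenter: base class not shown on this slide */ {
  public FragmentsOutput getFragments() throws Exception {
    FragmentsOutput output = new FragmentsOutput();
    // One fragment per block of the source: its name, the hosts holding
    // replicas (for locality), and opaque metadata the Accessor needs later.
    for (MyBlock block : listMyBlocks()) {
      output.addFragment(block.name(), block.replicaHosts(), block.metadataBytes());
    }
    return output;
  }

  // MyBlock and listMyBlocks() are stand-ins for source metadata
  // (HDFS blocks, HBase regions, ...).
  interface MyBlock {
    String name();
    String[] replicaHosts();
    byte[] metadataBytes();
  }

  private java.util.List<MyBlock> listMyBlocks() {
    return java.util.Collections.emptyList();  // placeholder source listing
  }
}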
Accessor Interface
/* Internal interface that defines the access to data on the source
 * data store (e.g. a file on HDFS, a region of an HBase table, etc.).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IReadAccessor {
  public boolean openForRead() throws Exception;
  public OneRow readNextObject() throws Exception;
  public void closeForRead() throws Exception;
}

/*
 * An interface for writing data into a data store
 * (e.g. a sequence file on HDFS).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IWriteAccessor {
  public boolean openForWrite() throws Exception;
  public OneRow writeNextObject(OneRow onerow) throws Exception;
  public void closeForWrite() throws Exception;
}
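A minimal IReadAccessor sketch over a newline-delimited local file, just to show the life cycle; the file path is illustrative, and the OneRow(key, data) constructor is an assumption based on the built-in accessors.

public class SimpleLineAccessor implements IReadAccessor {
  private java.io.BufferedReader reader;
  private long lineNumber = 0;

  public boolean openForRead() throws Exception {
    // Illustrative local path; a real accessor would open the HDFS
    // fragment described by the metadata handed to the plugin.
    reader = new java.io.BufferedReader(new java.io.FileReader("/tmp/pxf-data"));
    return true;
  }

  public OneRow readNextObject() throws Exception {
    String line = reader.readLine();
    // Returning null tells the framework this fragment is exhausted.
    return (line == null) ? null : new OneRow(lineNumber++, line);  // OneRow(key, data): assumed
  }

  public void closeForRead() throws Exception {
    reader.close();
  }
}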
Resolver Interface
/*
 * Interface that defines the deserialization of one record brought from
 * the data Accessor. Every implementation of a deserialization method
 * (e.g. Writable, Avro, ...) must implement this interface.
 */
public interface IReadResolver {
  public List<OneField> getFields(OneRow row) throws Exception;
}

/*
 * Interface that defines the serialization of data read from the DB
 * into a OneRow object.
 * Every implementation of a serialization method
 * (e.g. Writable, Avro, ...) must implement this interface.
 */
public interface IWriteResolver {
  public OneRow setFields(DataInputStream inputStream) throws Exception;
}
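A matching IReadResolver sketch that splits the line produced by the accessor above into fields. The OneField constructor and row.getData() accessor are assumptions, and TEXT_OID stands in for the engine's text type code.

public class SimpleCsvResolver implements IReadResolver {
  private static final int TEXT_OID = 25;  // stand-in for the engine's text type code

  public java.util.List<OneField> getFields(OneRow row) throws Exception {
    // row.getData() (assumed accessor) holds the raw line from the Accessor.
    String line = (String) row.getData();
    java.util.List<OneField> fields = new java.util.ArrayList<OneField>();
    for (String token : line.split(",")) {
      fields.add(new OneField(TEXT_OID, token));  // OneField(type, value): assumed
    }
    return fields;
  }
}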
Analyzer Interface
/* Abstract class that defines getting statistics for ANALYZE.
 * GetEstimatedStats returns statistics for a given path
 * (block size, number of blocks, number of tuples (rows)).
 * Used when calling ANALYZE on a PXF external table, to get the
 * table's statistics for the optimizer to plan queries with.
 */
public abstract class Analyzer extends Plugin {
  public Analyzer(InputData metaData) {
    super(metaData);
  }

  /** path is a data source name (e.g. file, dir, wildcard, table name);
   * returns the data statistics in JSON format.
   *
   * NOTE: It is highly recommended to implement extremely fast logic
   * that returns *estimated* statistics. Scanning all the data for exact
   * statistics is considered bad practice.
   */
  public String GetEstimatedStats(String data) throws Exception {
    /* Return default values */
    return DataSourceStatsInfo.dataToJSON(new DataSourceStatsInfo());
  }
}
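A small hypothetical subclass showing the intended pattern: override GetEstimatedStats with something cheap (metadata lookups, not a scan) and serialize the result the same way the default does. The DataSourceStatsInfo population step is elided because its setters are not shown on the slide.

public class MyStoreAnalyzer extends Analyzer {
  public MyStoreAnalyzer(InputData metaData) {
    super(metaData);
  }

  @Override
  public String GetEstimatedStats(String data) throws Exception {
    DataSourceStatsInfo stats = new DataSourceStatsInfo();
    // Estimate block and tuple counts from source metadata only;
    // scanning the data here would slow ANALYZE down (see note above).
    // ... populate stats via its setters (not shown on the slide) ...
    return DataSourceStatsInfo.dataToJSON(stats);
  }
}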
Syntax - Long Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?
FRAGMENTER=com.pivotal.pxf.fragmenters.HdfsDataFragmenter&
ACCESSOR=com.pivotal.pxf.accessors.LineBreakAccessor&
RESOLVER=com.pivotal.pxf.resolvers.StringPassResolver&
ANALYZER=com.pivotal.pxf.analyzers.HdfsAnalyzer')
format 'TEXT' (delimiter = ',');
Say WHAT???
Syntax - Short Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?profile=HdfsTextSimple')
format 'TEXT' (delimiter = ',');
Whew!!
Built-in Profiles
• A number of profiles are built in, and more are being contributed
o HBase, Hive, HDFS Text, Avro, SequenceFiles, GemFireXD, Accumulo, Cassandra, JSON
o PXF will be open-sourced completely, for use with your favorite SQL engine
o But you can write your own connectors right now and use them with HAWQ
Predicate Pushdown
• SQL engines may push parts of the "WHERE" clause down to PXF
• e.g. "where id > 500 and id < 1000"
• PXF provides a FilterBuilder class (see the sketch below)
• Filters can be combined together
• Simple expression: "constant <OP> column"
• Complex expression: "object(s) <OP> object(s)"
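A hypothetical FilterBuilder callback, only to show the shape of the two cases listed above; the interface name and build() signature are assumptions, so check the PXF reference docs for the real FilterParser API.

public class MyFilterBuilder implements FilterParser.IFilterBuilder {  // interface name: assumed
  public Object build(FilterParser.Operation op, Object left, Object right) throws Exception {
    // Simple case: column <OP> constant, e.g. "id > 500".
    // Complex case: left and right are filters already built by earlier
    // calls, combined with a logical operator, e.g. "(...) AND (...)".
    return new MyFilter(op, left, right);
  }

  // MyFilter is a stand-in for a store-specific filter representation
  // (e.g. an HBase Scan filter or a pruned file list).
  static class MyFilter {
    final Object op, left, right;
    MyFilter(Object op, Object left, Object right) {
      this.op = op; this.left = left; this.right = right;
    }
  }
}

For "where id > 500 and id < 1000", the parser would invoke build() three times: once for each comparison and once to combine the two results.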
Demo
• Create a text file on HDFS
• Create a table using a SQL engine (HAWQ) on HDFS
• Create an external table using PXF
• Select from both tables separately
• Finally run a join across both tables
More info online...
• http://docs.gopivotal.com/pivotalhd/PXFInstallationandAdministration.html
• http://docs.gopivotal.com/pivotalhd/PXFExternalTableandAPIReference.html
Questions?
Pivotal eXtension
Framework
Sameer Tiwari
Hadoop Storage Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech
