Pivotal eXtension Framework

Sameer Tiwari
Hadoop Storage Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech

Data Analysis Timeline
• ISAM files: COBOL/JCL
• RDBMS: SQL
• HDFS files: Map Reduce/Hive

Data Analysis Timeline
• HDFS files: Map Reduce/Hive, SQL

Simplified View of Co-existence (diagram)
• On one side: SQL, running over an RDBMS and its files
• On the other: Map Reduce, Hive, and HBase, running over HDFS files on HDFS
• Between the two stacks sits "The Great Divide"

PXF addresses the divide.

Pivotal eXtension Framework (PXF)
• History
  o Based on the external table functionality of RDBMSs
  o Built at Pivotal by a small team in Israel
• Goals
  o Single hop
  o No materialization of data
  o Fully parallel for high throughput
  o Extensible

Motivation for building PXF
• Use the SQL engine's statistical/analytic functions (e.g. MADlib) on third-party data stores, e.g.
  o HBase data
  o Hive data
  o Native data on HDFS in a variety of formats
• Join in-database dimensions with other fact tables
• Fast ingest of data into SQL native format (insert into … select * from …)

Motivation for building PXF
• Enterprises love the cheap storage offered by HDFS and want to keep their data there
• M/R is very limiting
• Integration with third-party systems, e.g. Accumulo
• Existing techniques involved copying data to HDFS, which is brittle and inefficient

High Level Flow (diagram)
The SQL engine asks the Name Node, "Where is the data for table foo?" and learns it is on DataNodes 1, 3, and 5. The protocol is HTTP, and PXF end-points run on all data nodes (Data Node1 through Data Node5).

Major components
• Fragmenter
  o Get the locations of fragments for a table
• Accessor
  o Understand and read the fragment, return records
• Resolver
  o Convert the records into a SQL engine format
• Analyzer
  o Provide source stats to the query optimizer

PXF Architecture (diagram)
A PSQL query (select * from external table foo location="pxf://namenode:50070/financedata") arrives at the HAWQ Master, which asks the PXF Fragmenter for splits[..]. The Fragmenter runs in a container with HTTP end-points on each Data Node, alongside the PXF Accessor/Resolver, over local HDFS. Each HAWQ Segment then issues getSplit(0) to a Data Node and receives records back as PXFWritable. The same end-points also serve M/R, Pig, and Hive on native PHD.

Classes
• The four major components are defined as interfaces and base classes that can be extended, e.g. Fragmenter:

/*
 * Class holding information about fragments (FragmentInfo)
 */
public class FragmentsOutput {
    public FragmentsOutput();
    public void addFragment(String sourceName, String[] replicas,
                            byte[] metadata);
    public void addFragment(String sourceName, String[] replicas,
                            byte[] metadata, String userData);
    public List<FragmentInfo> getFragments();
}
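
For illustration, a custom Fragmenter might report a single fragment. This is a hedged sketch, not PXF's shipped code: the Fragmenter base class, its InputData constructor, and the GetFragments() override are assumed by analogy with the Analyzer example later in this deck, and the path and host names are made up.

// Hypothetical sketch of a minimal Fragmenter; base-class details are assumed.
public class SingleFileFragmenter extends Fragmenter {
    public SingleFileFragmenter(InputData metaData) {
        super(metaData);
    }

    @Override
    public FragmentsOutput GetFragments() throws Exception {
        FragmentsOutput fragments = new FragmentsOutput();
        // One fragment, replicated on two made-up hosts, no extra metadata.
        fragments.addFragment("/demo/pxf-data",
                              new String[] { "datanode1", "datanode3" },
                              new byte[0]);
        return fragments;
    }
}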

Accessor Interface

/* Internal interface that defines the access to data on the source
 * data store (e.g. a file on HDFS, a region of an HBase table, etc.).
 * All classes that implement actual access to such data sources must
 * respect this interface. */
public interface IReadAccessor {
    public boolean openForRead() throws Exception;
    public OneRow readNextObject() throws Exception;
    public void closeForRead() throws Exception;
}

/*
 * An interface for writing data into a data store
 * (e.g. a sequence file on HDFS).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IWriteAccessor {
    public boolean openForWrite() throws Exception;
    public OneRow writeNextObject(OneRow onerow) throws Exception;
    public void closeForWrite() throws Exception;
}
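
To make the read contract concrete, here is a toy IReadAccessor over an in-memory array. It is a sketch, not a PXF built-in; the OneRow(key, data) constructor is an assumption.

// Toy read accessor serving two hard-coded CSV lines; OneRow(key, data) is assumed.
public class InMemoryReadAccessor implements IReadAccessor {
    private final String[] lines = { "1,hello,2", "3,world,4" };
    private int current;

    public boolean openForRead() throws Exception {
        current = 0;               // rewind to the first record
        return true;
    }

    public OneRow readNextObject() throws Exception {
        if (current >= lines.length) {
            return null;           // end of this fragment
        }
        return new OneRow(current, lines[current++]);
    }

    public void closeForRead() throws Exception {
        // nothing to release for an in-memory source
    }
}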

Resolver Interface

/*
 * Interface that defines the deserialization of one record brought from
 * the data Accessor. Every implementation of a deserialization method
 * (e.g. Writable, Avro, ...) must implement this interface.
 */
public interface IReadResolver {
    public List<OneField> getFields(OneRow row) throws Exception;
}

/*
 * Interface that defines the serialization of data read from the DB
 * into a OneRow object. Every implementation of a serialization method
 * (e.g. Writable, Avro, ...) must implement this interface.
 */
public interface IWriteResolver {
    public OneRow setFields(DataInputStream inputStream) throws Exception;
}
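
A matching IReadResolver could split each CSV line from the accessor above into fields. Again a sketch: OneRow.getData(), the OneField(type, value) constructor, and the TEXT_TYPE code are all assumptions, since the slide only shows the interface.

import java.util.ArrayList;
import java.util.List;

// Hypothetical resolver turning one CSV line into a list of text fields.
public class CsvReadResolver implements IReadResolver {
    private static final int TEXT_TYPE = 25;   // stand-in type code for text

    public List<OneField> getFields(OneRow row) throws Exception {
        List<OneField> fields = new ArrayList<OneField>();
        // row.getData() is assumed to return the raw line produced by the accessor.
        for (String token : ((String) row.getData()).split(",")) {
            fields.add(new OneField(TEXT_TYPE, token));
        }
        return fields;
    }
}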

Analyzer Interface

/* Abstract class that defines getting statistics for ANALYZE.
 * GetEstimatedStats returns statistics for a given path
 * (block size, number of blocks, number of tuples (rows)).
 * Used when calling ANALYZE on a PXF external table, to get
 * the table's statistics that are used by the optimizer to plan queries.
 */
public abstract class Analyzer extends Plugin {
    public Analyzer(InputData metaData) {
        super(metaData);
    }

    /** path is a data source name (e.g. file, dir, wildcard, table name);
     * returns the data statistics in JSON format.
     *
     * NOTE: It is highly recommended to implement extremely fast logic
     * that returns *estimated* statistics. Scanning all the data for exact
     * statistics is considered bad practice.
     */
    public String GetEstimatedStats(String data) throws Exception {
        /* Return default values */
        return DataSourceStatsInfo.dataToJSON(new DataSourceStatsInfo());
    }
}

Syntax - Long Form

CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?
         FRAGMENTER=com.pivotal.pxf.fragmenters.HdfsDataFragmenter&
         ACCESSOR=com.pivotal.pxf.accessors.LineBreakAccessor&
         RESOLVER=com.pivotal.pxf.resolvers.StringPassResolver&
         ANALYZER=com.pivotal.pxf.analyzers.HdfsAnalyzer')
format 'TEXT' (delimiter = ',');

Say WHAT???

Syntax - Short Form

CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?profile=HdfsTextSimple')
format 'TEXT' (delimiter = ',');

Whew!!

Built-in Profiles
• A number of profiles are built in and more are being contributed
  o HBase, Hive, HDFS Text, Avro, SequenceFiles, GemFireXD, Accumulo, Cassandra, JSON
• PXF will be open-sourced completely, for use with your favorite SQL engine
• But you can write your own connectors right now and use them with HAWQ

Predicate Pushdown
• SQL engines may push parts of the "WHERE" clause down to PXF
• e.g. "where id > 500 and id < 1000"
• PXF provides a FilterBuilder class
• Filters can be combined together
• Simple expression: "constant <OP> column"
• Complex expression: "object(s) <OP> object(s)"
A sketch of the resulting filter tree follows below.
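
The slide does not show FilterBuilder itself, so the following is an illustrative, hand-rolled tree for the example predicate; every type here is a hypothetical stand-in for the "object(s) <OP> object(s)" shape described above, not PXF's actual API.

// Illustrative only: a hand-rolled filter tree for "id > 500 and id < 1000".
public class FilterTreeExample {
    static class ColumnRef { final int index;   ColumnRef(int i)   { index = i; } }
    static class Constant  { final Object val;  Constant(Object v) { val = v; } }
    static class Node {
        final Object left, right;   // ColumnRef, Constant, or nested Node
        final String op;            // ">", "<", "AND", ...
        Node(Object left, String op, Object right) {
            this.left = left; this.op = op; this.right = right;
        }
    }

    public static void main(String[] args) {
        // where id > 500 and id < 1000   (id is column 0)
        Node pushed = new Node(
                new Node(new ColumnRef(0), ">", new Constant(500)),
                "AND",
                new Node(new ColumnRef(0), "<", new Constant(1000)));
        System.out.println("root operator: " + pushed.op);   // prints AND
    }
}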

Demo
• Create a text file on HDFS
• Create a table using a SQL engine (HAWQ) on HDFS
• Create an external table using PXF
• Select from both tables separately
• Finally, run a join across both tables
The SQL sketch below walks through these steps.
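
A hedged sketch of what those steps might look like in SQL, using the short-form syntax from the earlier slide; the table names, file path, and join key are made up for illustration.

-- Hypothetical demo script; names and paths are made up.
-- (shell) hdfs dfs -put sales.csv /demo/pxf-data

-- A native HAWQ table:
CREATE TABLE sales_native (id integer, region text, amount integer);

-- An external table over the HDFS file via PXF (short-form syntax):
CREATE EXTERNAL TABLE sales_ext (id integer, region text, amount integer)
location('pxf://localhost:50070/demo/pxf-data?profile=HdfsTextSimple')
format 'TEXT' (delimiter = ',');

-- Select from both tables separately:
SELECT count(*) FROM sales_native;
SELECT count(*) FROM sales_ext;

-- Finally, a join across native and external data:
SELECT n.region, sum(e.amount)
FROM sales_native n
JOIN sales_ext e ON n.id = e.id
GROUP BY n.region;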

More info online...
• http://docs.gopivotal.com/pivotalhd/PXFInstallationandAdministration.html
• http://docs.gopivotal.com/pivotalhd/PXFExternalTableandAPIReference.html

Questions?

Pivotal eXtension Framework

Sameer Tiwari
Hadoop Storage Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech
