Shivram Mani (HAWQ UD)
PXF
Pivotal Extension Framework
Agenda
● Motivations
● PXF Introduction
● Architecture/Design
● HAWQ Bridge - Deep Dive
● PXF - Developer View
● Usage/Plugins
● What’s coming
Motivations: SQL on Hadoop
[Diagram: an RDBMS (ANSI SQL, cost-based optimizer, transactions, ...) linked by a "?" to the various formats and storage systems supported on HDFS. Answer: foreign tables!]
What is PXF?
PXF is an extension framework that facilitates access to external data
● Uniform tabular view to heterogeneous data sources
● Exploits parallelism for data access
● Pluggable framework for custom connectors
● Provides built-in connectors for accessing data in HDFS files, Hive/HBase tables, etc.
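PXF Communication
[Diagram: HAWQ segments read native tables through libhdfs3 (written in C) and external tables through the PXF REST API over HTTP, port 51200. The PXF webapp runs in Apache Tomcat and sits between the REST API and the Java API.]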
Deployment Architecture
[Diagram: the HAWQ master node is shown with the NN, pxf, and the HBase Master. DN1 to DN4 each run pxf and one HAWQ segment (seg1 to seg4); DN1 to DN3 also host HBase Region Servers 1 to 3.]
* PXF needs to be installed on all DNs
* PXF is recommended to be installed on the NN
PXF Components
Fragmenter   Splits dataset into partitions; returns locations of each partition
Accessor     Understands and reads/writes the fragment; returns records
Resolver     Converts records to a consumable format (data types)
Profile      Compact way to configure Fragmenter, Accessor, Resolver
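The division of labour between the components can be sketched in a few lines of Python. This is only an illustration of the roles described above, not PXF's actual Java API: the `Fragment` and `Profile` classes, the `toy_*` functions, and the in-memory `DATA` set are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple


@dataclass
class Fragment:
    source: str        # which partition of the dataset this is
    hosts: List[str]   # where that partition is located


@dataclass
class Profile:
    """Hypothetical bundle naming one fragmenter, accessor and resolver."""
    fragmenter: Callable[[str], List[Fragment]]
    accessor: Callable[[Fragment], Iterator[str]]
    resolver: Callable[[str], Tuple]


# Toy dataset standing in for external storage (assumption for the example).
DATA = {"part-0": ["1,apple", "2,pear"], "part-1": ["3,plum"]}


def toy_fragmenter(path: str) -> List[Fragment]:
    # Split the dataset into partitions and report their locations.
    return [Fragment(name, ["dn1"]) for name in sorted(DATA)]


def toy_accessor(fragment: Fragment) -> Iterator[str]:
    # Read one fragment and return its raw records.
    yield from DATA[fragment.source]


def toy_resolver(record: str) -> Tuple:
    # Convert a raw record into typed fields.
    ident, name = record.split(",")
    return (int(ident), name)


def read_all(profile: Profile, path: str) -> List[Tuple]:
    rows = []
    for fragment in profile.fragmenter(path):
        for record in profile.accessor(fragment):
            rows.append(profile.resolver(record))
    return rows


TOY_PROFILE = Profile(toy_fragmenter, toy_accessor, toy_resolver)
```

Swapping any one of the three callables changes the storage, the read logic, or the type conversion independently, which is the point of bundling them under a profile name.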
Architecture - Read Data Flow
[Diagram: read flow between the HAWQ master node (with NN and pxf) and the DataNodes (each with pxf and a HAWQ segment). Labels, in flow order:
select * from ext_table
getFragments() API call to pxf://<location>:<port>/<path> (Fragmenter)
Fragments (JSON) returned to the master
Assign fragments to segments
Query dispatched to segments 1, 2, 3, ... (Interconnect)
Read() REST call from each segment to pxf (Accessor, Resolver)
Records (stream) returned to the segments
Query result]
Read Data Flow - Take 2
1. Get Fragments (Partition Data)
2. Fragment Distribution
3. Reading Data
HAWQ Bridge - Deep Dive
Step 1 - Get Fragments
• Code location: https://github.com/apache/incubator-hawq/blob/master/src/backend/access/external/hd_work_mgr.c
• Called by optimizer (createplan.c)
• Gets fragments from PXF for the given location specified in the table,
using Fragmenter.
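As a rough sketch of what the master does with the fragmenter's answer, the Python below parses a JSON fragment list into plain tuples. The field names (`fragments`, `source`, `index`, `hosts`) and the sample payload are assumptions made for this example; the real response layout is defined by PXF and is not shown in these slides.

```python
import json
from typing import List, Tuple

# Hypothetical response body; the field names are illustrative only.
SAMPLE_RESPONSE = json.dumps({
    "fragments": [
        {"source": "/data/sales/part-0", "index": 0, "hosts": ["dn1", "dn2"]},
        {"source": "/data/sales/part-0", "index": 1, "hosts": ["dn1", "dn3"]},
        {"source": "/data/sales/part-1", "index": 0, "hosts": ["dn2", "dn3"]},
    ]
})


def parse_fragments(body: str) -> List[Tuple[str, int, Tuple[str, ...]]]:
    """Turn the JSON fragment list into (source, index, replica hosts) tuples."""
    doc = json.loads(body)
    return [
        (frag["source"], frag["index"], tuple(frag["hosts"]))
        for frag in doc["fragments"]
    ]


FRAGMENTS = parse_fragments(SAMPLE_RESPONSE)
```

The replica hosts are the part that matters for the next step, because they let the master place each fragment near a copy of its data.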
Step 2 - Fragments Distribution
• Code location: hd_work_mgr.c
• Returns a mapping of the fragments for each segment.
• Tries to maximize both parallelism and locality:
• Splits the load between all participating segments (determined by GUC).
• Assigns fragments to segments with a replica on the same host.
Step 2 - Fragments Distribution
[Diagram: HAWQ master with NN and HBase master; DN1 to DN4 each run pxf; HAWQ seg1 to seg3 and HBase region servers 1 to 3 sit on the DataNodes. Fragment replica hosts shown: (HBase1, HBase2), (HBase1, HBase3), (HBase1, HBase2), (HBase1, HBase3). Resulting assignment:]
seg1 - green-DN2
seg2 - yellow-DN2 + red-DN2
seg3 - orange-DN3
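The trade-off between parallelism and locality can be illustrated with a small greedy assignment in Python. This is a sketch under stated assumptions, not the algorithm in hd_work_mgr.c: the per-segment cap, the tie-breaking by segment name, and the `distribute` function itself are invented for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def distribute(
    fragments: Sequence[Tuple[str, List[str]]],
    segments: Dict[str, str],
) -> Dict[str, List[str]]:
    """Assign each fragment to one segment.

    fragments: (fragment id, hosts holding a replica) pairs.
    segments:  segment name -> host the segment runs on.
    """
    # Parallelism: no segment takes more than its even share of fragments.
    cap = math.ceil(len(fragments) / len(segments))
    assignment: Dict[str, List[str]] = {seg: [] for seg in segments}
    for frag_id, hosts in fragments:
        # Locality: prefer segments co-located with a replica and below the cap.
        local = [
            seg for seg, host in segments.items()
            if host in hosts and len(assignment[seg]) < cap
        ]
        pool = local or list(segments)
        # Pick the least-loaded candidate; ties are broken by segment name.
        target = min(pool, key=lambda seg: (len(assignment[seg]), seg))
        assignment[target].append(frag_id)
    return assignment


SEGMENTS = {"seg1": "dn1", "seg2": "dn2", "seg3": "dn3"}
FRAGS = [
    ("f0", ["dn1", "dn2"]),
    ("f1", ["dn1", "dn3"]),
    ("f2", ["dn1", "dn2"]),
    ("f3", ["dn4"]),  # no segment shares a host with this fragment
]
PLAN = distribute(FRAGS, SEGMENTS)
```

Fragments with no local segment (here `f3`) fall back to the least-loaded segment, so locality is a preference rather than a requirement.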
Step 3 - Reading Data
• Done using external protocol API.
• PXF code is under cdb-pg/src/backend/access/external/
• C REST API using enhanced libcurl: https://github.com/apache/incubator-hawq/blob/master/src/backend/access/external/libchurl.c
• Each segment calls PXF to get the data of each of its fragments, using Accessor & Resolver
• Data is returned from PXF as a stream (text/csv/binary)
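The segment-side read loop can be sketched as a Python generator that consumes each fragment's response lazily instead of buffering it. `FAKE_PXF` and `fetch_fragment` are hypothetical stand-ins for the HTTP call that HAWQ makes in C through libchurl; only the streaming shape is the point here, and CSV is assumed as the wire format.

```python
import csv
import io
from typing import Iterable, Iterator, List

# Hypothetical stand-in for the PXF service: fragment id -> CSV response body.
FAKE_PXF = {
    "f0": "1,apple\n2,pear\n",
    "f3": "3,plum\n",
}


def fetch_fragment(frag_id: str) -> Iterator[str]:
    """Stand-in for the REST read call; yields the response line by line."""
    yield from io.StringIO(FAKE_PXF[frag_id])


def read_segment(frag_ids: Iterable[str]) -> Iterator[List[str]]:
    """Stream the rows of every fragment assigned to one segment."""
    for frag_id in frag_ids:
        # csv.reader pulls lines on demand, so rows are produced as they arrive.
        for row in csv.reader(fetch_fragment(frag_id)):
            yield row


ROWS = list(read_segment(["f0", "f3"]))
```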
PXF Developer View
PXF Usage
Built-in plugins: HDFS, Hive, HBase, GemfireXD
Community plugins (https://bintray.com/big-data/maven/pxf-plugins/view): Cassandra, Accumulo, Solr, Redis, JDBC
CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name ( column_name data_type [, ...] )
LOCATION ('pxf://host[:port]/path-to-data?PROFILE=<profile-name> [&custom-option=value...]')
FORMAT '[TEXT | CSV | CUSTOM]'
(<formatting_properties>);
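To make the template concrete, the Python helper below fills it in for one table. The host `namenode`, the path, the table and column names, and the `delimiter=E','` formatting property are assumptions chosen for illustration; port 51200 and the `HdfsTextSimple` profile come from the slides.

```python
from typing import Sequence, Tuple
from urllib.parse import urlencode


def pxf_location(host: str, path: str, profile: str, port: int = 51200, **options: str) -> str:
    """Build the pxf:// URI used in the LOCATION clause."""
    query = urlencode({"PROFILE": profile, **options})
    return f"pxf://{host}:{port}/{path.lstrip('/')}?{query}"


def create_readable_table(
    name: str,
    columns: Sequence[Tuple[str, str]],
    location: str,
    fmt: str = "TEXT",
    fmt_props: str = "delimiter=E','",  # assumed formatting property
) -> str:
    """Render a CREATE READABLE EXTERNAL TABLE statement from the template."""
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    return (
        f"CREATE READABLE EXTERNAL TABLE {name} ({cols})\n"
        f"LOCATION ('{location}')\n"
        f"FORMAT '{fmt}' ({fmt_props});"
    )


LOCATION = pxf_location("namenode", "/data/sales.txt", "HdfsTextSimple")
STATEMENT = create_readable_table("sales_ext", [("id", "int"), ("item", "text")], LOCATION)
```

Extra keyword arguments to `pxf_location` become the `&custom-option=value` pairs of the template.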
Demo
https://github.com/shivzone/pxf_demo
PXF HDFS Plugin
Fragment - Splits (blocks)
● Support Read : multiple formats
● Support Write to Sequence Files
● Chunked Read Optimization
● Support for stats
Profile          Description
HdfsTextSimple   Read delimited single line records (plain text)
HdfsTextMulti    Read delimited multiline records (plain text)
Avro             Read Avro records
JSON             Supports simple/pretty printed JSON with field projection
ORC*             Supports ORC files with Column Projection & Filter Pushdown
PXF Hive Plugin
Fragment - Splits of the file stored in table
● Text based
● SequenceFile
● RCFile
● ORCFile
● Parquet
● Avro
➔ Complex types are converted to text
Profile    Description
Hive       Read all Hive tables (all types)
HiveRC     Hive tables stored in RC (serialized with ColumnarSerDe/LazyBinaryColumnarSerDe)
HiveText   Faster access for Hive tables stored as Text
HiveORC    Supports ORC files with Column Projection & Filter Pushdown
PXF HBase Plugin
Fragment - Regions
● Read Only. Uses Profile ‘Hbase’
● Filter push down to HBase scanner
○ (Operators: EQ, NE, LT, GT, LE, GE & AND)
● Direct Mapping
● Indirect Mapping
○ Lookup table - pxflookup
○ Maps attribute name to HBase <cf:qualifier>
(row key)   mapping
sales       id=cf1:saleid
sales       cmts=cf8:comments
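The difference between direct and indirect mapping can be sketched in Python. The sketch assumes each lookup row has the form `alias=cf:qualifier` keyed by table name, as in the `sales` rows above; `build_lookup` and `resolve_column` are hypothetical helpers, not part of the HBase plugin.

```python
from typing import Dict, Iterable, Tuple

# Rows as they would appear in the lookup table: (row key, mapping).
LOOKUP_ROWS = [("sales", "id=cf1:saleid"), ("sales", "cmts=cf8:comments")]


def build_lookup(rows: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, str]]:
    """Index the mappings as table -> {attribute name -> cf:qualifier}."""
    lookup: Dict[str, Dict[str, str]] = {}
    for table, mapping in rows:
        alias, column = mapping.split("=", 1)
        lookup.setdefault(table, {})[alias] = column
    return lookup


def resolve_column(table: str, attribute: str, lookup: Dict[str, Dict[str, str]]) -> str:
    """Return the HBase column behind an attribute of the external table."""
    if ":" in attribute:
        # Direct mapping: the attribute is already written as cf:qualifier.
        return attribute
    # Indirect mapping: translate the alias, or leave unknown names unchanged.
    return lookup.get(table, {}).get(attribute, attribute)


LOOKUP = build_lookup(LOOKUP_ROWS)
```

With the two rows above, an external table can expose short column names such as `id` and `cmts` while the scan still targets `cf1:saleid` and `cf8:comments`.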
Enterprise documentation
Wiki
PXF Javadoc
github.com/apache/incubator-hawq/tree/master/pxf
issues.apache.org/jira/browse/HAWQ Component = PXF
Contribution
Feature Areas                   Custom plugins (storage, formats), push down filters, custom applications
Documentation                   Wiki/Docs
Code / Review                   Github (Apache)
Join Discussion/Ask Questions   Apache DLs: dev@hawq.incubator.apache.org, user@hawq.incubator.apache.org
Github (Field)                  github.com/Pivotal-Field-Engineering/pxf-field
Thank you!

More Related Content

What's hot

Hadoop 20111117
Hadoop 20111117Hadoop 20111117
Hadoop 20111117
exsuns
 
Apache Hive
Apache HiveApache Hive
Apache Hive
Ajit Koti
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
DataWorks Summit
 
TriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan GatesTriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan Gates
trihug
 
In-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several timesIn-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several times
Aleksander Alekseev
 
Hive data migration (export/import)
Hive data migration (export/import)Hive data migration (export/import)
Hive data migration (export/import)
Bopyo Hong
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
Rohit Agrawal
 
HBase Incremental Backup
HBase Incremental BackupHBase Incremental Backup
HBase Incremental Backup
Lee neal
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Hadoop 20111215
Hadoop 20111215Hadoop 20111215
Hadoop 20111215
exsuns
 
HBase System Tables / Metadata Info
HBase System Tables / Metadata InfoHBase System Tables / Metadata Info
HBase System Tables / Metadata Info
wchevreuil
 
Lcna 2012-tutorial
Lcna 2012-tutorialLcna 2012-tutorial
Lcna 2012-tutorial
Gluster.org
 
Hypertable Berlin Buzzwords
Hypertable Berlin BuzzwordsHypertable Berlin Buzzwords
Hypertable Berlin Buzzwords
hypertable
 
Advanced HDF5 Features
Advanced HDF5 FeaturesAdvanced HDF5 Features
PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System Administrators
Command Prompt., Inc
 
Introduction to Hadoop Ecosystem
Introduction to Hadoop Ecosystem Introduction to Hadoop Ecosystem
Introduction to Hadoop Ecosystem
GetInData
 
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
Cloudera, Inc.
 
Andriy Podanenko.Drupal database api.DrupalCamp Kyiv 2011
Andriy Podanenko.Drupal database api.DrupalCamp Kyiv 2011Andriy Podanenko.Drupal database api.DrupalCamp Kyiv 2011
Andriy Podanenko.Drupal database api.DrupalCamp Kyiv 2011
camp_drupal_ua
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
Carl Steinbach
 
Perl for System Automation - 01 Advanced File Processing
Perl for System Automation - 01 Advanced File ProcessingPerl for System Automation - 01 Advanced File Processing
Perl for System Automation - 01 Advanced File Processing
Danairat Thanabodithammachari
 

What's hot (20)

Hadoop 20111117
Hadoop 20111117Hadoop 20111117
Hadoop 20111117
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
TriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan GatesTriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan Gates
 
In-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several timesIn-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several times
 
Hive data migration (export/import)
Hive data migration (export/import)Hive data migration (export/import)
Hive data migration (export/import)
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
 
HBase Incremental Backup
HBase Incremental BackupHBase Incremental Backup
HBase Incremental Backup
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Hadoop 20111215
Hadoop 20111215Hadoop 20111215
Hadoop 20111215
 
HBase System Tables / Metadata Info
HBase System Tables / Metadata InfoHBase System Tables / Metadata Info
HBase System Tables / Metadata Info
 
Lcna 2012-tutorial
Lcna 2012-tutorialLcna 2012-tutorial
Lcna 2012-tutorial
 
Hypertable Berlin Buzzwords
Hypertable Berlin BuzzwordsHypertable Berlin Buzzwords
Hypertable Berlin Buzzwords
 
Advanced HDF5 Features
Advanced HDF5 FeaturesAdvanced HDF5 Features
Advanced HDF5 Features
 
PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System Administrators
 
Introduction to Hadoop Ecosystem
Introduction to Hadoop Ecosystem Introduction to Hadoop Ecosystem
Introduction to Hadoop Ecosystem
 
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
 
Andriy Podanenko.Drupal database api.DrupalCamp Kyiv 2011
Andriy Podanenko.Drupal database api.DrupalCamp Kyiv 2011Andriy Podanenko.Drupal database api.DrupalCamp Kyiv 2011
Andriy Podanenko.Drupal database api.DrupalCamp Kyiv 2011
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Perl for System Automation - 01 Advanced File Processing
Perl for System Automation - 01 Advanced File ProcessingPerl for System Automation - 01 Advanced File Processing
Perl for System Automation - 01 Advanced File Processing
 

Viewers also liked

Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
Alexey Grishchenko
 
MPP vs Hadoop
MPP vs HadoopMPP vs Hadoop
MPP vs Hadoop
Alexey Grishchenko
 
gsoc_mentor for Shivram Mani
gsoc_mentor for Shivram Manigsoc_mentor for Shivram Mani
gsoc_mentor for Shivram Mani
Shivram Mani
 
データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
Masayuki Matsushita
 
Managing Apache HAWQ with Apache AMBARI
Managing Apache HAWQ with Apache AMBARIManaging Apache HAWQ with Apache AMBARI
Managing Apache HAWQ with Apache AMBARI
Mithun (Matt) Mathew
 
Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16
PivotalOpenSourceHub
 
Apache HAWQ : An Introduction
Apache HAWQ : An IntroductionApache HAWQ : An Introduction
Apache HAWQ : An Introduction
Sandeep Kunkunuru
 
HAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopHAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoop
BigData Research
 
Pivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchPivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ Launch
VMware Tanzu
 
Pivotal HAWQ - High Availability (2014)
Pivotal HAWQ - High Availability (2014)Pivotal HAWQ - High Availability (2014)
Pivotal HAWQ - High Availability (2014)
saravana krishnamurthy
 
Build & test Apache Hawq
Build & test Apache Hawq Build & test Apache Hawq
Build & test Apache Hawq
PivotalOpenSourceHub
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to Apache
PivotalOpenSourceHub
 
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQMassively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
InMobi Technology
 
Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1
seungdon Choi
 
SQL On Hadoop
SQL On HadoopSQL On Hadoop
SQL On Hadoop
Muhammad Ali
 
Pivotal HAWQ 소개
Pivotal HAWQ 소개Pivotal HAWQ 소개
Pivotal HAWQ 소개
Seungdon Choi
 
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
VMware Tanzu
 
Pivotal hawq internals
Pivotal hawq internalsPivotal hawq internals
Pivotal hawq internals
Alexey Grishchenko
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
Hortonworks
 
How to manage Hortonworks HDB Resources with YARN
How to manage Hortonworks HDB Resources with YARNHow to manage Hortonworks HDB Resources with YARN
How to manage Hortonworks HDB Resources with YARN
Hortonworks
 

Viewers also liked (20)

Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
 
MPP vs Hadoop
MPP vs HadoopMPP vs Hadoop
MPP vs Hadoop
 
gsoc_mentor for Shivram Mani
gsoc_mentor for Shivram Manigsoc_mentor for Shivram Mani
gsoc_mentor for Shivram Mani
 
データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
 
Managing Apache HAWQ with Apache AMBARI
Managing Apache HAWQ with Apache AMBARIManaging Apache HAWQ with Apache AMBARI
Managing Apache HAWQ with Apache AMBARI
 
Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16
 
Apache HAWQ : An Introduction
Apache HAWQ : An IntroductionApache HAWQ : An Introduction
Apache HAWQ : An Introduction
 
HAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopHAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoop
 
Pivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchPivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ Launch
 
Pivotal HAWQ - High Availability (2014)
Pivotal HAWQ - High Availability (2014)Pivotal HAWQ - High Availability (2014)
Pivotal HAWQ - High Availability (2014)
 
Build & test Apache Hawq
Build & test Apache Hawq Build & test Apache Hawq
Build & test Apache Hawq
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to Apache
 
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQMassively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
 
Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1
 
SQL On Hadoop
SQL On HadoopSQL On Hadoop
SQL On Hadoop
 
Pivotal HAWQ 소개
Pivotal HAWQ 소개Pivotal HAWQ 소개
Pivotal HAWQ 소개
 
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
 
Pivotal hawq internals
Pivotal hawq internalsPivotal hawq internals
Pivotal hawq internals
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
 
How to manage Hortonworks HDB Resources with YARN
How to manage Hortonworks HDB Resources with YARNHow to manage Hortonworks HDB Resources with YARN
How to manage Hortonworks HDB Resources with YARN
 

Similar to PXF HAWQ Unmanaged Data

Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
NoSQLmatters
 
Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...
VMware Tanzu
 
Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017
Cheng-Chun William Tu
 
Windows Azure HDInsight Service
Windows Azure HDInsight ServiceWindows Azure HDInsight Service
Windows Azure HDInsight Service
Neil Mackenzie
 
Remote secured storage
Remote secured storageRemote secured storage
Remote secured storage
Salo Shp
 
Tech-Spark: SQL Server on Linux
Tech-Spark: SQL Server on LinuxTech-Spark: SQL Server on Linux
Tech-Spark: SQL Server on Linux
Ralph Attard
 
Apache conbigdata2015 christiantzolov-federated sql on hadoop and beyond- lev...
Apache conbigdata2015 christiantzolov-federated sql on hadoop and beyond- lev...Apache conbigdata2015 christiantzolov-federated sql on hadoop and beyond- lev...
Apache conbigdata2015 christiantzolov-federated sql on hadoop and beyond- lev...
Christian Tzolov
 
EKAW - Publishing with Triple Pattern Fragments
EKAW - Publishing with Triple Pattern FragmentsEKAW - Publishing with Triple Pattern Fragments
EKAW - Publishing with Triple Pattern Fragments
Ruben Taelman
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
Chris Nauroth
 
HAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged DataHAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged Data
DataWorks Summit
 
2010-01-28 NSA Open Source User Group Meeting, Current & Future Linux on Syst...
2010-01-28 NSA Open Source User Group Meeting, Current & Future Linux on Syst...2010-01-28 NSA Open Source User Group Meeting, Current & Future Linux on Syst...
2010-01-28 NSA Open Source User Group Meeting, Current & Future Linux on Syst...
Shawn Wells
 
eBPF Basics
eBPF BasicseBPF Basics
eBPF Basics
Michael Kehoe
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
appaji intelhunt
 
2008-09-09 IBM Interaction Conference, Red Hat Update for System z
2008-09-09 IBM Interaction Conference, Red Hat Update for System z2008-09-09 IBM Interaction Conference, Red Hat Update for System z
2008-09-09 IBM Interaction Conference, Red Hat Update for System z
Shawn Wells
 
Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017
Alex Diachenko
 
Next-Gen DHCP
Next-Gen DHCPNext-Gen DHCP
Next-Gen DHCP
Andreas Taudte
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
DataWorks Summit/Hadoop Summit
 
TechWiseTV Open NX-OS Workshop
TechWiseTV  Open NX-OS WorkshopTechWiseTV  Open NX-OS Workshop
TechWiseTV Open NX-OS Workshop
Robb Boyd
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
Edureka!
 
Ceph Day Bring Ceph To Enterprise
Ceph Day Bring Ceph To EnterpriseCeph Day Bring Ceph To Enterprise
Ceph Day Bring Ceph To Enterprise
Alex Lau
 

Similar to PXF HAWQ Unmanaged Data (20)

Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
 
Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...
 
Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017
 
Windows Azure HDInsight Service
Windows Azure HDInsight ServiceWindows Azure HDInsight Service
Windows Azure HDInsight Service
 
Remote secured storage
Remote secured storageRemote secured storage
Remote secured storage
 
Tech-Spark: SQL Server on Linux
Tech-Spark: SQL Server on LinuxTech-Spark: SQL Server on Linux
Tech-Spark: SQL Server on Linux
 
Apache conbigdata2015 christiantzolov-federated sql on hadoop and beyond- lev...
Apache conbigdata2015 christiantzolov-federated sql on hadoop and beyond- lev...Apache conbigdata2015 christiantzolov-federated sql on hadoop and beyond- lev...
Apache conbigdata2015 christiantzolov-federated sql on hadoop and beyond- lev...
 
EKAW - Publishing with Triple Pattern Fragments
EKAW - Publishing with Triple Pattern FragmentsEKAW - Publishing with Triple Pattern Fragments
EKAW - Publishing with Triple Pattern Fragments
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
 
HAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged DataHAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged Data
 
2010-01-28 NSA Open Source User Group Meeting, Current & Future Linux on Syst...
2010-01-28 NSA Open Source User Group Meeting, Current & Future Linux on Syst...2010-01-28 NSA Open Source User Group Meeting, Current & Future Linux on Syst...
2010-01-28 NSA Open Source User Group Meeting, Current & Future Linux on Syst...
 
eBPF Basics
eBPF BasicseBPF Basics
eBPF Basics
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
2008-09-09 IBM Interaction Conference, Red Hat Update for System z
2008-09-09 IBM Interaction Conference, Red Hat Update for System z2008-09-09 IBM Interaction Conference, Red Hat Update for System z
2008-09-09 IBM Interaction Conference, Red Hat Update for System z
 
Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017
 
Next-Gen DHCP
Next-Gen DHCPNext-Gen DHCP
Next-Gen DHCP
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
 
TechWiseTV Open NX-OS Workshop
TechWiseTV  Open NX-OS WorkshopTechWiseTV  Open NX-OS Workshop
TechWiseTV Open NX-OS Workshop
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Ceph Day Bring Ceph To Enterprise
Ceph Day Bring Ceph To EnterpriseCeph Day Bring Ceph To Enterprise
Ceph Day Bring Ceph To Enterprise
 

Recently uploaded

Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Zilliz
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 


PXF HAWQ Unmanaged Data

  • 1. Shivram Mani (HAWQ UD) PXF Pivotal Extension Framework
  • 2. Agenda ● Motivations ● PXF Introduction ● Architecture/Design ● HAWQ Bridge - Deep Dive ● PXF - Developer View ● Usage/Plugins ● What’s coming
  • 3. Motivations: SQL on Hadoop RDBMS ? various formats, storages supported on HDFS ● ANSI SQL ● Cost based optimizer ● Transactions ● ... Foreign Tables!
  • 4. What is PXF? PXF is an extension framework that facilitates access to external data ● Uniform tabular view to heterogeneous data sources ● Exploits parallelism for data access ● Pluggable framework for custom connectors ● Provides built-in connectors for accessing data in HDFS files, Hive/HBase tables, etc.
  • 5. PXF Communication Apache Tomcat PXF Webapp REST API libhdfs3 (written in C) segments External Tables Native Tables HTTP, port: 51200 Java API
  • 6. Deployment Architecture HAWQ Master Node NN pxf HBase Master DN4 pxf HAWQ seg4 DN1 pxf HAWQ seg1 HBase Region Server1 DN2 pxf HAWQ seg2 HBase Region Server2 DN3 pxf HAWQ seg3 HBase Region Server3 * PXF needs to be installed on all DN * PXF is recommended to be installed on NN
  • 7. PXF Components ● Fragmenter: splits the dataset into partitions and returns the location of each partition ● Accessor: understands and reads/writes a fragment, and returns records ● Resolver: converts records to a consumable format (data types) ● Profile: a compact way to configure a Fragmenter, Accessor, and Resolver
  • 8. Architecture - Read Data Flow (diagram; nodes: HAWQ Master Node, NN with pxf, DN1 with pxf and HAWQ seg1). Query: select * from ext_table0. Arrow labels, in the order they appear: getFragments() API on pxf://<location>:<port>/<path>; Fragments (JSON); Assign Fragments to Segments; Query dispatched to Segment 1,2,3… (Interconnect); Read() REST; records; Records (stream); query result. PXF components shown: Fragmenter, Accessor, Resolver.
  • 9. Read Data Flow - Take 2
  • 10. HAWQ Bridge - Deep Dive: 1. Get Fragments (Partition Data) 2. Fragment Distribution 3. Reading Data
  • 11. Step 1 - Get Fragments • Code location: https://github.com/apache/incubator-hawq/blob/master/src/backend/access/external/hd_work_mgr.c • Called by optimizer (createplan.c) • Gets fragments from PXF for the given location specified in the table, using Fragmenter.
  • 12. Step 2 - Fragments Distribution • Code location: hd_work_mgr.c • Returns a mapping of fragments to segments. • Tries to maximize both parallelism and locality: • Splits the load between all participating segments (the number is determined by a GUC). • Assigns fragments to segments that have a replica on the same host.
  • 13. Step 2 - Fragments Distribution (diagram; nodes: DN1, DN2, DN3, DN4, HAWQ master, NN, pxf pxf pxf pxf pxf, HAWQ seg1, HAWQ seg2, HAWQ seg3, HBase master, HBase region server1, HBase region server2, HBase region server3; labels: HBase1, HBase2 / HBase1, HBase3 / HBase1, HBase2 / HBase1, HBase3). Assignment shown: seg1 - green-DN2; seg2 - yellow-DN2 + red-DN2; seg3 - orange-DN3.
  • 14. Step 3 - Reading Data • Done using external protocol API. • PXF code is under cdb-pg/src/backend/access/external/ • C REST API using enhanced libcurl: https://github.com/apache/incubator-hawq/blob/master/src/backend/access/external/libchurl.c • Each segment calls PXF to get each of its fragments’ data, using Accessor & Resolver • Data returned as stream (text/csv/binary) from PXF
  • 16. PXF Usage ● Built-in plugins: HDFS, Hive, HBase, GemfireXD ● Community plugins (https://bintray.com/big-data/maven/pxf-plugins/view): Cassandra, Accumulo, Solr, Redis, Jdbc ● Syntax: CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name ( column_name data_type [, ...] ) LOCATION ('pxf://host[:port]/path-to-data?PROFILE=<profile-name>[&custom-option=value...]') FORMAT '[TEXT | CSV | CUSTOM]' (<formatting_properties>);
  • 18. PXF HDFS Plugin ● Fragment - Splits (blocks) ● Support Read: multiple formats ● Support Write to Sequence Files ● Chunked Read Optimization ● Support for stats ● Profiles (name: description): HdfsTextSimple: read delimited single line records (plain text); HdfsTextMulti: read delimited multiline records (plain text); Avro: read Avro records; JSON: supports simple/pretty printed JSON with field projection; ORC*: supports ORC files with Column Projection & Filter Pushdown
  • 19. PXF Hive Plugin ● Fragment - Splits of the file stored in table ● Supported storage formats: Text based, SequenceFile, RCFile, ORCFile, Parquet, Avro ➔ Complex types are converted to text ● Profiles (name: description): Hive: read all Hive tables (all types); HiveRC: Hive tables stored in RC (serialized with ColumnarSerDe/LazyBinaryColumnarSerDe); HiveText: faster access for Hive tables stored as Text; HiveORC: supports ORC files with Column Projection & Filter Pushdown
  • 20. PXF HBase Plugin ● Fragment - Regions ● Read Only. Uses Profile ‘Hbase’ ● Filter push down to HBase scanner ○ (Operators: EQ, NE, LT, GT, LE, GE & AND) ● Direct Mapping ● Indirect Mapping ○ Lookup table - pxflookup ○ Maps attribute name to HBase <cf:qualifier> ○ Example rows, (row key) and mapping: sales, id=cf1:saleid; sales, cmts=cf8:comments
  • 21. Enterprise documentation ● Wiki ● PXF Javadoc ● github.com/apache/incubator-hawq/tree/master/pxf ● issues.apache.org/jira/browse/HAWQ (Component = PXF) ● Contribution Feature Areas: Custom Plugins (storage, formats), Push Down Filters, Custom Applications ● Documentation: Wiki/Docs ● Code / Review: Github (Apache) ● Join Discussion/Ask Questions: Apache DLs dev@hawq.incubator.apache.org, user@hawq.incubator.apache.org ● Github (Field): github.com/Pivotal-Field-Engineering/pxf-field
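
The fragment distribution goal from slide 12 (spread load across segments while preferring a segment on a host that holds a replica) can be illustrated with a small greedy sketch. This is not the algorithm implemented in hd_work_mgr.c; the function name `assign_fragments`, the tuple layouts, and the host and fragment names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def assign_fragments(fragments, segments):
    """Greedy, locality-aware assignment (illustrative only).

    fragments: list of (fragment_id, [hosts holding a replica])
    segments:  list of (segment_id, host the segment runs on)
    Returns a dict mapping segment_id -> list of fragment_ids.
    """
    assignment = {seg_id: [] for seg_id, _ in segments}

    # Index segments by host so local candidates can be found quickly.
    segs_by_host = defaultdict(list)
    for seg_id, host in segments:
        segs_by_host[host].append(seg_id)

    for frag_id, replica_hosts in fragments:
        # Locality: prefer segments running on a host that has a replica.
        local = [s for h in replica_hosts for s in segs_by_host.get(h, [])]
        # No local segment: fall back to every segment (remote read).
        candidates = local or list(assignment)
        # Parallelism: pick the least-loaded candidate (first one on ties).
        target = min(candidates, key=lambda s: len(assignment[s]))
        assignment[target].append(frag_id)

    return assignment


# Hypothetical cluster: three segments, four fragments with replica hosts.
example_segments = [("seg1", "DN1"), ("seg2", "DN2"), ("seg3", "DN3")]
example_fragments = [
    ("f1", ["DN1", "DN2"]),
    ("f2", ["DN1", "DN3"]),
    ("f3", ["DN1", "DN2"]),
    ("f4", ["DN4"]),  # no segment on DN4, so this one is read remotely
]
print(assign_fragments(example_fragments, example_segments))
```

A greedy pass like this keeps every fragment local whenever some segment shares a host with a replica, and only falls back to load balancing alone when no such segment exists. It does not guarantee an optimal balance across segments.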