A NEW PLATFORM FOR A NEW ERA
SK Krishnamurthy
2© Copyright 2013 Pivotal. All rights reserved.
Agenda
 HAWQ failover and HA now
 HAWQ HA upcoming release
 What’s new in PHD 1.1
 Pivotal Command Center new features
 Discuss roadmap in conjunction with AMEX requirements
 Open discussion: SAW, PHD 1.1 upgrade, …
3© Copyright 2013 Pivotal. All rights reserved.
HAWQ - Availability
Nov 25, 2013
4© Copyright 2013 Pivotal. All rights reserved.
Deployment Model – Sample HAWQ Cluster
[Diagram: HAWQ Primary Master (PM), HAWQ Standby Master (SM), HDFS Primary NameNode (PNN) and Secondary NameNode (SNN), plus segment hosts, each running a DataNode (DN) and multiple HAWQ segment servers (SS).]
5© Copyright 2013 Pivotal. All rights reserved.
HAWQ Master Fails
Action | Availability | Notes
HAWQ Cluster | Yes (with downtime) | HAWQ cluster available. How do clients connect to the SM? Manual process to connect to the standby master, similar to GPDB.
Current “SELECT” queries | Aborted | Users need to restart the query.
Current Transaction | Aborted | Dirty data & temp files will be removed.
New “SELECT” & transaction | Yes | SM will continue to process queries.
6© Copyright 2013 Pivotal. All rights reserved.
HAWQ Master Fails
• Execution coordinator resides on master
• Distributed transaction master resides on master
• Log copied up to last committed transaction
• Run gpactivatestandby on secondary master
• Either VIP or DNS hostname change to re-route client connections
7© Copyright 2013 Pivotal. All rights reserved.
HAWQ Master & Standby Master Fail

Action | Availability | Notes
HAWQ Cluster | Un-Available | Cluster is considered to be down.
Current “SELECT” queries | Aborted | Can’t restart the query.
Current Transaction | Aborted | Dirty data & temp files will be removed.
New “SELECT” & Transaction query | Not possible |
8© Copyright 2013 Pivotal. All rights reserved.
HAWQ Master & Standby Master Fail
• Configure RAID 10 for HAWQ master so primary segment data directory is never lost
9© Copyright 2013 Pivotal. All rights reserved.
PNN Fails
Action | Availability | Notes
HAWQ Cluster | Yes (with downtime) | Metadata queries can be carried out, but no other queries. No DDL or DML.
Current “SELECT” queries | Aborted | Users need to restart the query.
Current Transaction | Aborted | After the PNN is up, dirty data & temp files will be removed.
New “SELECT” & Transaction query | Not possible |
• PHD 1.1:
• (Option 1) Manually bring up the PNN. HAWQ cannot switch to the secondary name node.
• (Option 2) The HDFS admin changes the FQDN or IP address of the secondary NN to that of the PNN.
• The HAWQ master keeps trying to connect to the PNN; when it finds one, the cluster becomes operational.
• PHD 1.1.1 (Dec ’13)
• QA-verified testing of the above 2 options.
10© Copyright 2013 Pivotal. All rights reserved.
PNN Fails
• Normal HDFS failover process
• Change DNS name of secondary NN to the current NN
• Namenode service will be supported in PHD 1.2 (February)
11© Copyright 2013 Pivotal. All rights reserved.
PNN & Secondary NN Fail
Action | Availability | Notes
HAWQ Cluster | No | Metadata queries can be carried out, but no other queries. No DDL or DML.
Current “SELECT” queries | Aborted | Users need to restart the query.
Current Transaction | Aborted | After the PNN is up, dirty data & temp files will be removed.
New “SELECT” & Transaction query | Not possible |
12© Copyright 2013 Pivotal. All rights reserved.
PNN & Secondary NN Fail
• No split information
• No transactions
13© Copyright 2013 Pivotal. All rights reserved.
Secondary NN Fail
Action | Availability | Notes
HAWQ Cluster | Yes | Fully available
Current “SELECT” queries | Yes |
Current Transaction | Yes |
New “SELECT” & Transaction query | Yes |
14© Copyright 2013 Pivotal. All rights reserved.
A Segment Fails
Action | Availability | Notes
HAWQ Cluster | Yes | HAWQ Cluster available.
Current “SELECT” queries | Aborted | Users need to restart the query.
Current Transaction | Aborted | Dirty data & temp files will be removed.
New “SELECT” & Transaction query | Yes | Remaining segments will handle the query.
15© Copyright 2013 Pivotal. All rights reserved.
A Segment Fails
• Segment QEs (Query Executors) are killed
• HAWQ does not materialize intermediate results
• Local actions by QEs are not committed
• Segment QEs are started by other segments in subsequent queries
• QE substitution is random
• A future release will add an option to materialize work files
16© Copyright 2013 Pivotal. All rights reserved.
Multiple Segments Fail

Action | Availability | Notes
HAWQ Cluster | Yes | HAWQ Cluster available.
Current “SELECT” queries | Aborted | Users need to restart the query.
Current Transaction | Aborted | Dirty data & temp files will be removed.
New “SELECT” & Transaction query | Yes | Remaining segments will handle the query.
17© Copyright 2013 Pivotal. All rights reserved.
DN Fails
Action | Availability | Notes
HAWQ Cluster | Yes | HAWQ Cluster available.
Current “SELECT” queries | Yes | SS will automatically connect to a remote DN in the middle of the currently executing query.
Current Transaction | Yes | Transaction will finish successfully.
New “SELECT” & Transaction query | Yes |
• PHD 1.1:
• No impact. SS will continue to work with a remote DN.
• Loss of data locality might introduce a slight performance impact. On a 10G network the impact is measured at around 10% for large queries; simple queries might experience a 50% performance impact.
18© Copyright 2013 Pivotal. All rights reserved.
DN Fails
• libhdfs falls back to reading from another HDFS replica
• Short-term performance loss until the NN marks the DN as dead
19© Copyright 2013 Pivotal. All rights reserved.
Segment Host Dies
Action | Availability | Notes
HAWQ Cluster | Yes | HAWQ Cluster available.
Current “SELECT” queries | Aborted | Users need to restart the query.
Current Transaction | Aborted | Dirty data & temp files will be removed.
New “SELECT” & Transaction query | Yes | Remaining segments will handle the query.
20© Copyright 2013 Pivotal. All rights reserved.
Single Disk Failure in DN
 JBOD
– If tempdata is not on the failed disk, there is no impact on the cluster or queries.
– If tempdata is configured to be on the failed disk:
▪ Small queries will run, but large queries with too much temporary data will be impacted.
▪ Transactions will be aborted; new transactions will continue if multiple disks are configured to contain tempdata.
 RAID 5
– No impact.
– Possible performance loss.
 RAID 10
– No Impact & no performance loss.
21© Copyright 2013 Pivotal. All rights reserved.
HAWQ HA on roadmap
 Automatic Namenode HA supported on PHD now
 Automatic Namenode HA (name service) supported by HAWQ in
February release
 PXF to also support NN service
 No interruption in query execution during NN failure
 HAWQ HA unchanged
22© Copyright 2013 Pivotal. All rights reserved.
What’s New in
Pivotal HD 1.1
November 7th, 2013
23© Copyright 2013 Pivotal. All rights reserved.
Key Themes of PivotalHD 1.1 Release
 Leverage more data, in real time, more easily to gain
competitive advantage
 Richer services and tools to create broader set of
applications
 Deeper, streamlined administrative capabilities for enterprise
deployments
24© Copyright 2013 Pivotal. All rights reserved.
Pivotal HD Architecture
[Architecture diagram: Pivotal HD Enterprise]
• Apache components: HDFS, HBase, Pig, Hive, Mahout, MapReduce, Sqoop, Flume, YARN (resource management & workflow), Zookeeper, Oozie, Vaidya
• Pivotal components: Pivotal Command Center (configure, deploy, monitor, manage), Data Loader, Spring, Unified Storage Service, Xtension Framework, Hadoop Virtualization Extension
• HAWQ – Advanced Database Services: ANSI SQL + Analytics, Dynamic Pipelining, Query Optimizer, Catalog Services, MADlib Algorithms
• GemFire XD – Real-Time Database Services: ANSI SQL + In-Memory, Distributed In-memory Store, Query, Transactions, Ingestion, Processing, Hadoop Driver – Parallel with Compaction
25© Copyright 2013 Pivotal. All rights reserved.
GemFire XD : Delivers
Enterprise real-time data processing platform for SLA-critical applications; enables users to rapidly and reliably analyze & react to high volumes of events while leveraging 10s of TBs of in-memory reference data.

Cloud Scale Real-Time Platform
• Very low & predictable latencies at high & variable loads
• 10s of TBs in-memory (Memscale)
• Multi-tiered caching
• Efficient in-memory M-R

Optimized for Real-Time Analytics
• Real-time event processing
• Continuous querying
• SQL based queries
• Support for structured and semi-structured* data
• Java stored procedures
• Deep Spring Data integration
• Native support for JSON and Objects (Java, C++, C#)*

Seamless Pivotal HD Integration
• Scale to HDFS with policy-driven in-memory data retention
• Online and offline querying of HDFS data
• ETL-less bi-directional integration with other Pivotal HD services

Enterprise-Class Reliability
• JTA distributed transactions
• HA through in-memory redundancy
• Reliable event propagation
• Active-active deployments across WAN

* EA / Not in 1.0
26© Copyright 2013 Pivotal. All rights reserved.
What’s New in Pivotal HD 1.1

Feature | Benefit
Command Center:
  Install Wizard | Faster, easier set up and configuration of HD cluster
  Start/Stop Services | Point/click control of multiple services through a central interface
HAWQ:
  UDF (partial) – C, PL/pgsql; pgcrypto, orafce | Enable richer data processing and analytics functionality leveraging existing SQL skill sets
  Kerberos Support | Tightly integrated security with HDFS
  PXF: Writable HDFS Table Support | Easily export HAWQ data to HDFS for external consumption
  HAWQ Input Format Reader | Directly leverage HAWQ data in MapReduce, Pig and Hive
  Diagnostic Tools | Lower administration costs
  Improved Query Planner | “Orca” enabled to provide more efficient query plans
27© Copyright 2013 Pivotal. All rights reserved.
What’s New in Pivotal HD 1.1

Feature | Benefit
Install/Config (ICM) CLI:
  Add/Remove Services | Faster, easier set up and administration of services (e.g. HBase, GemFire XD, etc.)
  Upgrade | Streamlined, low-risk upgrade from 1.0.1 to 1.1
Apache Hadoop Components:
  Hadoop to 2.0.5 and select 2.0.6 patches | Greater stability and lower risk based on critical defect fixes incorporated
  Oozie 3.3.2 | Orchestrate data processing (e.g. MR, Pig) job pipelines with dependencies
  Hive 11 (incl. HCatalog and Hiveserver2) | Significant improvements in functionality, scalability and security
  HBase 0.94.8 | Enables snapshots of tables without overhead to the Region Servers
  RHEL 6.4 Certification | Enhanced performance optimizations and security improvements
28© Copyright 2013 Pivotal. All rights reserved.
What’s New in Pivotal HD 1.1

Feature | Benefit
Platform and Security:
  Kerberos Support – HDFS, HAWQ, Unified Storage Service (PXF to be supported in Dec 2013) | Tighter governance, risk and compliance
  JRE 1.7.0.15 support | Supported platform; JRE 1.6 is end of life
  RHEL 6.4 (FIPS) certification | Federal standard for cryptography modules
  pgcrypto for HAWQ | Flexible and robust encryption of sensitive data
Tools:
  Unified Storage Service: CDH4 as a data source | Stream data from CDH4
  Data Loader – Push Stream API, Spring XD front end for Twitter | Integration support for wider variety of data sources
29© Copyright 2013 Pivotal. All rights reserved.
Command Center Cluster Deployment Wizard
• Performs “Host
Verification” to determine
host eligibility to be added
to cluster
30© Copyright 2013 Pivotal. All rights reserved.
Command Center Cluster Deployment Wizard
• Easily Add Eligible Nodes to
Roles
• Basic Validation of Layout
• Checkbox Add/Remove
Services
• Ability to Download
Configuration Locally
Recorded Demo can be found -> Here
31© Copyright 2013 Pivotal. All rights reserved.
Orca - Improved Optimizer
 Pluggable architecture, allowing faster innovation and quicker iteration on
quality improvements
 Subset of improved functionality:
• Parity with Planner
• Improved join-ordering
• Join-Aggregate re-ordering
• Sub-query de-correlation
• Optimal sort-orders
• Full integration of data (re-)distribution
• Contradiction detection
• Elimination of redundant joins
• Smarter Partition scan
• Star-join optimization
• Skew aware
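A hedged sketch of how the two planners can be compared, assuming the GUC is named "optimizer" as in later HAWQ/Greenplum releases (the exact setting in PHD 1.1 may differ, and the table is hypothetical):

SET optimizer = on;    -- route planning through Orca for this session
EXPLAIN SELECT region, count(*) FROM sales GROUP BY region;    -- inspect the Orca-generated plan
SET optimizer = off;   -- fall back to the legacy planner for comparison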
32© Copyright 2013 Pivotal. All rights reserved.
What’s new in PXF
 Profiles
 Writable external tables
 Hive partition pruning, HBase filtration
 Additional connectors & CSV support
 Complete extensibility
 Roadmap
– Security & authentication
– Multi-FS support & other distributions via OS
– Stand-alone service
33© Copyright 2013 Pivotal. All rights reserved.
Why Pivotal HD?
 Big Data + Fast Data
 The first enterprise grade platform that provides OLAP
and OLTP with HDFS as the common data substrate
 Enables closed loop analytics, real-time event
processing and high speed data ingest
34© Copyright 2013 Pivotal. All rights reserved.
HAWQ Format Reader

[Diagram: a Java program (e.g. a MapReduce job) uses the HAWQ Reader (jar file) to read HAWQ-format files from HDFS]
1. A request is made for where the files for a specific “table” exist.
2. The location of those files is returned.
3. HDFS files in HAWQ format are streamed to the Reader.

Recorded Demo can be found -> Here
35© Copyright 2013 Pivotal. All rights reserved.
Oozie now Included and Supported with PHD
 Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
 Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
 Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time
(frequency) and data availability.
 Oozie is integrated with the rest of the Hadoop stack supporting several types of
Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce,
Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java
programs and shell scripts).
 Oozie is a scalable, reliable and extensible system.
36© Copyright 2013 Pivotal. All rights reserved.
Matrix of what is supported via Install method
37© Copyright 2013 Pivotal. All rights reserved.
Security Dashboard (items in bold tested; rest are scheduled)

Component | Supports secure cluster | Supports Kerberos for authentication | Supports LDAP for authentication
HDFS | Yes | Yes | Linux OS supports
MapReduce/Pig | Yes | N/A |
Hive | Yes (standalone mode) | N/A |
Hiveserver | No | No |
Hiveserver2 | Yes | Yes | Yes
HBase | Yes | Yes | Yes
HAWQ* | Yes | Yes | Yes
GemFire XD | Yes | Yes | Yes

* Except PXF; scheduled for Dec (PHD 1.1.1 release)
38© Copyright 2013 Pivotal. All rights reserved.
Vaidya
39© Copyright 2013 Pivotal. All rights reserved.
Roadmap
Open Discussion
Nov 25, 2013
40© Copyright 2013 Pivotal. All rights reserved.
Roadmap – Action Items
 Error tables released in PHD 1.2 (February)
– Current workaround
 PCC new features?!
 SAW integration
 PHD 1.1 upgrade planning
41© Copyright 2013 Pivotal. All rights reserved.
Appendix
Nov 25, 2013
42© Copyright 2013 Pivotal. All rights reserved.
HAWQ
Nov 25, 2013
43© Copyright 2013 Pivotal. All rights reserved.
History
 HAWQ 1.0 (March release)
– True SQL Engine in Hadoop
▪ SQL 92, 99 & 2003 OLAP extensions
▪ JDBC/ODBC
– Basic SQL functionalities
▪ DDL and DML
– High availability feature
– Transaction support
 HAWQ 1.1 (June release)
– JBOD support feature
 HAWQ 1.1.1 (August release)
– HDFS access layer read fault tolerance support
– HAWQ diagnosis tool
– ORCA enabled
 HAWQ 1.1.2 (September release)
– HAWQ MR Inputformat for AO tables
– HDFS access layer write fault tolerance support
– HDFS 2.0.5 support
 HAWQ 1.1.3 (Oct release)
– HAWQ Kerberos support
– HAWQ on secure HDFS
– UDF
 HAWQ 1.1.4 (Dec release)
– Gptoolkit
– UDF enhancement
– Manual failover for HDFS HA
 HAWQ 1.2 (Feb release)
– Parquet storage support
– HAWQ MR Inputformat
– Automatic failover for HDFS HA
– …
44© Copyright 2013 Pivotal. All rights reserved.
[Diagram: HAWQ & HDFS master servers (planning & dispatch) connect over the network interconnect to segment servers (query execution), which sit on top of storage (HDFS, HBase, …).]
45© Copyright 2013 Pivotal. All rights reserved.
[Diagram: the master host issues metadata operations to the Namenode; segment hosts each run several segments alongside a Datanode; segments read/write blocks that HDFS replicates across Rack1 and Rack2, and communicate over the GPDB interconnect.]
46© Copyright 2013 Pivotal. All rights reserved.
Query execution flow
47© Copyright 2013 Pivotal. All rights reserved.
Parallel Query Optimizer
• Converts SQL into a physical execution plan
– Cost-based optimization looks for the most efficient plan
– Physical plan contains scans, joins, sorts, aggregations, etc.
– Global planning avoids sub-optimal ‘SQL pushing’ to segments
– Directly inserts ‘motion’ nodes for inter-segment communication
• ‘Motion’ nodes for efficient non-local join processing
(Assume table A is distributed across all segments – i.e. each has AK)
– Broadcast Motion (N:N)
• Every segment sends AK to all other segments
– Redistribute Motion (N:N)
• Every segment rehashes AK (by join column) and redistributes each row
– Gather Motion (N:1)
• Every segment sends its AK to a single node (usually the master)
48© Copyright 2013 Pivotal. All rights reserved.
Example of Parallel Query Optimization
select
c_custkey, c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal, n_name, c_address, c_phone, c_comment
from
customer, orders, lineitem, nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date '1994-08-01'
and o_orderdate < date '1994-08-01'
+ interval '3 month'
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey, c_name, c_acctbal,
c_phone, n_name, c_address, c_comment
order by
revenue desc
Gather Motion 4:1 (slice 3)
  Sort
    HashAggregate
      HashJoin
        Redistribute Motion 4:4 (slice 1)
          HashJoin
            Seq Scan on lineitem
            Hash
              Seq Scan on orders
        Hash
          HashJoin
            Seq Scan on customer
            Hash
              Broadcast Motion 4:4 (slice 2)
                Seq Scan on nation
49© Copyright 2013 Pivotal. All rights reserved.
Interconnect
• UDP based
• Flow control
50© Copyright 2013 Pivotal. All rights reserved.
Metadata dispatch
• Metadata dispatch
• Stateless segments
– Read only metadata on segment
51© Copyright 2013 Pivotal. All rights reserved.
Transaction
 Full transaction support for tables on HDFS
– When a load transaction is aborted, some garbage data is left at the end of the file. On HDFS-like systems, data cannot be truncated or overwritten.
 Methods to process the partial data to support transactions:
– Option 1: Load data into a separate HDFS file. Unlimited number of files.
– Option 2: Use metadata to record the boundary of garbage data, and implement a kind of vacuum mechanism.
– Option 3: Implement HDFS truncation.
 HDFS truncate was added to support transactions
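A minimal, hypothetical illustration of the load-abort case above (table names are made up): an aborted load leaves unreferenced bytes at the end of the HDFS file, which the options above bound or reclaim.

BEGIN;
INSERT INTO sales SELECT * FROM staging_sales;   -- hypothetical load into an append-only table on HDFS
ROLLBACK;   -- load aborted: rows are not visible; trailing garbage in the HDFS file is handled as described above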
52© Copyright 2013 Pivotal. All rights reserved.
Transaction
 Snapshot isolation
 Simplified Transaction Model Support
– Simplified two phase commit
53© Copyright 2013 Pivotal. All rights reserved.
Transaction support
• Methods to process the partial data to support transactions:
– Option 1: Load data into a separate HDFS file. Unlimited number of files.
– Option 2: Use metadata to record the boundary of garbage data, and implement a kind of vacuum mechanism.
– Option 3: Implement HDFS truncation.
54© Copyright 2013 Pivotal. All rights reserved.
Pluggable storage
• Read Optimized/Append only storage
• Column store
– Compressions: quicklz, zlib, RLE
– Partitioned tables hit HDFS limitation
• Parquet
– Open source format
– PAX like column store
– Snappy, gzip
• MR Input/Output format
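As a rough sketch of the append-only, column-oriented storage options above, using the table syntax familiar from Greenplum-derived databases (table and column names are illustrative; option names should be checked against the target HAWQ release):

-- Append-only, column-oriented table with quicklz compression (illustrative names)
CREATE TABLE sales_ao_col (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
WITH (appendonly=true, orientation=column, compresstype=quicklz, compresslevel=1)
DISTRIBUTED BY (sale_id);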
55© Copyright 2013 Pivotal. All rights reserved.
HDFS C client: why
• libhdfs (Current HDFS c client) is based on JNI. It is difficult to make
HAWQ support a large number of concurrent queries.
• Example:
– 4 segments on each segment hosts
– 50 concurrent queries
– each query has 16 QE processes that do scan
– there will be about 800 processes that start 800 JVMs to access HDFS.
– If each JVM uses 500MB memory, the JVMs will consume 800 * 500M =
400G memory.
– Thus naïve usage of libhdfs is not suitable for HAWQ. Currently we have three options to solve this problem.
56© Copyright 2013 Pivotal. All rights reserved.
HDFS client: three options
• Option 1: use HDFS FUSE. HDFS FUSE introduces some
performance overhead. And the scalability is not verified yet.
• Option 2 (libhdfs2): implement a webhdfs-based C client. webhdfs is based on HTTP, which also introduces some cost; performance should be benchmarked. The webhdfs-based method has several benefits, such as ease of implementation and low maintenance cost.
• Option 3 (libhdfs3): implement a C RPC interface that directly communicates with the NameNode and DataNode. Requires many changes whenever the RPC protocol changes.
57© Copyright 2013 Pivotal. All rights reserved.
PXF
Nov 25, 2013
58© Copyright 2013 Pivotal. All rights reserved.
PXF is...
A fast extensible framework
connecting Hawq to a data
store of choice that exposes a
parallel API
59© Copyright 2013 Pivotal. All rights reserved.
Hawq External Tables
• gpfdist
– remote delimited text (or csv) files.
• file
– text files on segment filesystem.
• execute
– script execution and produced data
• pxf
– text and binary data from available pxf connectors (mostly HD based).
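For context, a minimal readable external table over the gpfdist protocol listed above might look like the sketch below (host, port, path and columns are placeholders, not values from this deck):

CREATE EXTERNAL TABLE ext_events (event_id int, event_time timestamp, payload text)
LOCATION ('gpfdist://etl-host:8081/events/*.csv')
FORMAT 'CSV' (HEADER);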
60© Copyright 2013 Pivotal. All rights reserved.
Steps
• Step 1: GRANT ON PROTOCOL pxf
• Step 2: Define a PXF table
– Pick built-in plugins right for the job
– Specify data source of choice
– Map remote data fields to Hawq db attributes (plugin
dependent)
• Step 3: Query the PXF table.
– Directly
– Or copy to a Hawq table first
CREATE EXTERNAL TABLE foo(<col list>)
LOCATION (‘pxf://<host:port>/<data source>?<plugin options>’)
FORMAT ‘<type>’ (<params>)
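Putting the three steps together, a hedged end-to-end example could look like the following; the role, host:port, path, columns and profile are placeholders rather than values from the deck:

-- Step 1: allow a (hypothetical) role to read through the pxf protocol
GRANT SELECT ON PROTOCOL pxf TO analyst;

-- Step 2: define a PXF external table over delimited text on HDFS
CREATE EXTERNAL TABLE ext_sales (sale_id int, region text, amount numeric)
LOCATION ('pxf://<host:port>/data/sales?profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Step 3: query it directly, or copy it into a HAWQ table first
SELECT region, sum(amount) FROM ext_sales GROUP BY region;
CREATE TABLE sales AS SELECT * FROM ext_sales;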
61© Copyright 2013 Pivotal. All rights reserved.
62© Copyright 2013 Pivotal. All rights reserved.
63© Copyright 2013 Pivotal. All rights reserved.
64© Copyright 2013 Pivotal. All rights reserved.
New Features
Main additions since PHD1.0
65© Copyright 2013 Pivotal. All rights reserved.
User Experience
66© Copyright 2013 Pivotal. All rights reserved.
User Experience
• Improved/Informative error messages.
• Profiles
LOCATION(‘pxf://<host:port>/sales?fragmenter=HiveFragmenter&accessor=HiveAccessor&resolver=HiveResolver’)

LOCATION(‘pxf://<host:port>/sales?profile=Hive’)
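A hedged example of the profile shorthand inside a full table definition; the Hive table name, column list and custom formatter are assumptions, not values from the slide:

CREATE EXTERNAL TABLE hive_sales (product text, qty int, price numeric)
LOCATION ('pxf://<host:port>/sales?profile=Hive')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');

SELECT product, sum(qty * price) FROM hive_sales GROUP BY product;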
67© Copyright 2013 Pivotal. All rights reserved.
profiles.xml
<profile>
<name>HBase</name>
<description>Used for connecting to an HBase data store engine</description>
<plugins>
<fragmenter>HBaseDataFragmenter</fragmenter>
<accessor>HBaseAccessor</accessor>
<resolver>HBaseResolver</resolver>
<myidentifier>MyValue</myidentifier>
</plugins>
</profile>
68© Copyright 2013 Pivotal. All rights reserved.
profiles.xml
<profile>
<name>HdfsTextSimple</name>
<description>Used when reading delimited single line records from plain text files on HDFS
</description>
<plugins>
<fragmenter>HdfsDataFragmenter</fragmenter>
<accessor>LineBreakAccessor</accessor>
<resolver>StringPassResolver</resolver>
<analyzer>HdfsAnalyzer</analyzer> <!-- soon to be added -->
</plugins>
</profile>
69© Copyright 2013 Pivotal. All rights reserved.
profiles.xml
<profile>
<name>MyCustomProfile</name>
<description>Used with a new set of plugins I wrote</description>
<plugins>
<fragmenter>MyFragmenter</fragmenter>
<accessor>MyAccessor</accessor>
<resolver>MyResolver</resolver>
<analyzer>MyAnalyzer</analyzer>
</plugins>
</profile>
Add your own profiles
70© Copyright 2013 Pivotal. All rights reserved.
Export to HDFS
71© Copyright 2013 Pivotal. All rights reserved.
Writable PXF
• gphdfs-like functionality
– but extensible…
– currently supports text, csv, SequenceFile
– supports various Hadoop compression codecs

CREATE WRITABLE EXTERNAL TABLE ...
LOCATION(‘pxf://<host:port>/sales?profile=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT ‘text’ (delimiter ‘,’);

You can create a new profile “HdfsTextSimpleGZipped” that includes compression_codec:
LOCATION(‘pxf://<host:port>/sales?profile=HdfsTextSimpleGZipped')
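A minimal export sketch based on the snippet above; the source table "sales" and the target HDFS path are hypothetical:

-- Export a HAWQ table to gzip-compressed delimited text on HDFS
CREATE WRITABLE EXTERNAL TABLE sales_export (LIKE sales)
LOCATION ('pxf://<host:port>/exports/sales?profile=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'TEXT' (DELIMITER ',');

INSERT INTO sales_export SELECT * FROM sales;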
72© Copyright 2013 Pivotal. All rights reserved.
New Connectors
73© Copyright 2013 Pivotal. All rights reserved.
New Connectors
• GemFire XD (Released. GA February)
• JSON (On github. GA February (r+w))
• Accumulo (On github. GA version being coded by Clearedge. GA February)
• Cassandra (On github. Alpha)
None of them was written by the PXF Dev team… a testament to extensibility.
74© Copyright 2013 Pivotal. All rights reserved.
Feature Summary
★ HBase (w/filter pushdown)
★ Hive (w/partition exclusion. various storage file types)
★ HDFS Files: read (delimited text, csv, Sequence, Avro)
★ HDFS Files: write (delimited text, csv, Sequence, various compression
codecs and options)
★ GemFireXD, JSON format, Cassandra, Accumulo (currently Beta)
★ Stats collection
★ Automatic data locality optimizations
★ Extensibility!
75© Copyright 2013 Pivotal. All rights reserved.
Coming Up Very Soon...
★ Isilon Integration
★ Kerberized HDFS Support
★ Namenode High Availability
76© Copyright 2013 Pivotal. All rights reserved.
Limitations
• Local metadata of external data
– Will be made more transparent when UCS exists.
• Authentication and Authorization of external systems
– Will be made simpler when centralized user mgmt exists.
• Currently supporting local PHD only
• Error tables not yet supported
• Sharing space with Name/DataNode
77© Copyright 2013 Pivotal. All rights reserved.
Writing a plugin
steps and guidelines
78© Copyright 2013 Pivotal. All rights reserved.
Main Steps
1. Verify P-HD running and PXF installed
a. SingleCluster, AllInAll, SingleNode VM
2. Implement the PXF plugin API for your connector
(Java)
a. Use the PXF API doc as a reference
3. Compile your connector classes and add them to the
hadoop classpath on all nodes
4. Restart PHD (won’t be necessary in the future)
5. Add a profile (optional)
79© Copyright 2013 Pivotal. All rights reserved.
Plugins
• Fragmenter – returns a list of source data fragments and their locations
• Accessor – accesses a given list of fragments, reads them and returns records
• Resolver – deserializes each record according to a given schema or technique
• Analyzer – returns statistics about the source data
80© Copyright 2013 Pivotal. All rights reserved.
Thanks!
Nov 25, 2013
More Related Content

What's hot

(SPOT305) Event-Driven Computing on Change Logs in AWS | AWS re:Invent 2014
(SPOT305) Event-Driven Computing on Change Logs in AWS | AWS re:Invent 2014(SPOT305) Event-Driven Computing on Change Logs in AWS | AWS re:Invent 2014
(SPOT305) Event-Driven Computing on Change Logs in AWS | AWS re:Invent 2014Amazon Web Services
 
Scaling paypal workloads with oracle rac ss
Scaling paypal workloads with oracle rac ssScaling paypal workloads with oracle rac ss
Scaling paypal workloads with oracle rac ssAnil Nair
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload managementBiju Nair
 
Oracle RAC 12c Practical Performance Management and Tuning OOW13 [CON8825]
Oracle RAC 12c Practical Performance Management and Tuning OOW13 [CON8825]Oracle RAC 12c Practical Performance Management and Tuning OOW13 [CON8825]
Oracle RAC 12c Practical Performance Management and Tuning OOW13 [CON8825]Markus Michalewicz
 
Christo Kutrovsky - Maximize Data Warehouse Performance with Parallel Queries
Christo Kutrovsky - Maximize Data Warehouse Performance with Parallel QueriesChristo Kutrovsky - Maximize Data Warehouse Performance with Parallel Queries
Christo Kutrovsky - Maximize Data Warehouse Performance with Parallel QueriesChristo Kutrovsky
 
A5 oracle exadata-the game changer for online transaction processing data w...
A5   oracle exadata-the game changer for online transaction processing data w...A5   oracle exadata-the game changer for online transaction processing data w...
A5 oracle exadata-the game changer for online transaction processing data w...Dr. Wilfred Lin (Ph.D.)
 
Real-Time Query for Data Guard
Real-Time Query for Data Guard Real-Time Query for Data Guard
Real-Time Query for Data Guard Uwe Hesse
 
Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceBiju Nair
 
Fllow con 2014
Fllow con 2014 Fllow con 2014
Fllow con 2014 gbgruver
 
RMOUG2016 - Resource Management (the critical piece of the consolidation puzzle)
RMOUG2016 - Resource Management (the critical piece of the consolidation puzzle)RMOUG2016 - Resource Management (the critical piece of the consolidation puzzle)
RMOUG2016 - Resource Management (the critical piece of the consolidation puzzle)Kristofferson A
 
Oracle Replication with DBvisit
Oracle Replication with DBvisitOracle Replication with DBvisit
Oracle Replication with DBvisitAnton An
 
NoCOUG_201411_Patel_Managing_a_Large_OLTP_Database
NoCOUG_201411_Patel_Managing_a_Large_OLTP_DatabaseNoCOUG_201411_Patel_Managing_a_Large_OLTP_Database
NoCOUG_201411_Patel_Managing_a_Large_OLTP_DatabaseParesh Patel
 

What's hot (13)

(SPOT305) Event-Driven Computing on Change Logs in AWS | AWS re:Invent 2014
(SPOT305) Event-Driven Computing on Change Logs in AWS | AWS re:Invent 2014(SPOT305) Event-Driven Computing on Change Logs in AWS | AWS re:Invent 2014
(SPOT305) Event-Driven Computing on Change Logs in AWS | AWS re:Invent 2014
 
Scaling paypal workloads with oracle rac ss
Scaling paypal workloads with oracle rac ssScaling paypal workloads with oracle rac ss
Scaling paypal workloads with oracle rac ss
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload management
 
Oracle RAC 12c Practical Performance Management and Tuning OOW13 [CON8825]
Oracle RAC 12c Practical Performance Management and Tuning OOW13 [CON8825]Oracle RAC 12c Practical Performance Management and Tuning OOW13 [CON8825]
Oracle RAC 12c Practical Performance Management and Tuning OOW13 [CON8825]
 
Cloud DWH deep dive
Cloud DWH deep diveCloud DWH deep dive
Cloud DWH deep dive
 
Christo Kutrovsky - Maximize Data Warehouse Performance with Parallel Queries
Christo Kutrovsky - Maximize Data Warehouse Performance with Parallel QueriesChristo Kutrovsky - Maximize Data Warehouse Performance with Parallel Queries
Christo Kutrovsky - Maximize Data Warehouse Performance with Parallel Queries
 
A5 oracle exadata-the game changer for online transaction processing data w...
A5   oracle exadata-the game changer for online transaction processing data w...A5   oracle exadata-the game changer for online transaction processing data w...
A5 oracle exadata-the game changer for online transaction processing data w...
 
Real-Time Query for Data Guard
Real-Time Query for Data Guard Real-Time Query for Data Guard
Real-Time Query for Data Guard
 
Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve Performace
 
Fllow con 2014
Fllow con 2014 Fllow con 2014
Fllow con 2014
 
RMOUG2016 - Resource Management (the critical piece of the consolidation puzzle)
RMOUG2016 - Resource Management (the critical piece of the consolidation puzzle)RMOUG2016 - Resource Management (the critical piece of the consolidation puzzle)
RMOUG2016 - Resource Management (the critical piece of the consolidation puzzle)
 
Oracle Replication with DBvisit
Oracle Replication with DBvisitOracle Replication with DBvisit
Oracle Replication with DBvisit
 
NoCOUG_201411_Patel_Managing_a_Large_OLTP_Database
NoCOUG_201411_Patel_Managing_a_Large_OLTP_DatabaseNoCOUG_201411_Patel_Managing_a_Large_OLTP_Database
NoCOUG_201411_Patel_Managing_a_Large_OLTP_Database
 

Viewers also liked

Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems
Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC SystemsBig Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems
Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC SystemsFujio Turner
 
基于Spring batch的大数据量并行处理
基于Spring batch的大数据量并行处理基于Spring batch的大数据量并行处理
基于Spring batch的大数据量并行处理Jacky Chi
 
Spring Batch Workshop (advanced)
Spring Batch Workshop (advanced)Spring Batch Workshop (advanced)
Spring Batch Workshop (advanced)lyonjug
 
Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352Armel Nene
 
gsoc_mentor for Shivram Mani
gsoc_mentor for Shivram Manigsoc_mentor for Shivram Mani
gsoc_mentor for Shivram ManiShivram Mani
 
PXF HAWQ Unmanaged Data
PXF HAWQ Unmanaged DataPXF HAWQ Unmanaged Data
PXF HAWQ Unmanaged DataShivram Mani
 
Hawq Hcatalog Integration
Hawq Hcatalog IntegrationHawq Hcatalog Integration
Hawq Hcatalog IntegrationShivram Mani
 
Managing Apache HAWQ with Apache AMBARI
Managing Apache HAWQ with Apache AMBARIManaging Apache HAWQ with Apache AMBARI
Managing Apache HAWQ with Apache AMBARIMithun (Matt) Mathew
 
Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 PivotalOpenSourceHub
 
Apache HAWQ : An Introduction
Apache HAWQ : An IntroductionApache HAWQ : An Introduction
Apache HAWQ : An IntroductionSandeep Kunkunuru
 
HAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopHAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopBigData Research
 
Pivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchPivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchVMware Tanzu
 
S2GX 2012 - Introduction to Spring Integration and Spring Batch
S2GX 2012 - Introduction to Spring Integration and Spring BatchS2GX 2012 - Introduction to Spring Integration and Spring Batch
S2GX 2012 - Introduction to Spring Integration and Spring BatchGunnar Hillert
 
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQMassively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQInMobi Technology
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApachePivotalOpenSourceHub
 
Parallel batch processing with spring batch slideshare
Parallel batch processing with spring batch   slideshareParallel batch processing with spring batch   slideshare
Parallel batch processing with spring batch slideshareMorten Andersen-Gott
 
Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1seungdon Choi
 
Ahea Team Spring batch
Ahea Team Spring batchAhea Team Spring batch
Ahea Team Spring batchSunghyun Roh
 

Viewers also liked (20)

Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems
Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC SystemsBig Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems
Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems
 
基于Spring batch的大数据量并行处理
基于Spring batch的大数据量并行处理基于Spring batch的大数据量并行处理
基于Spring batch的大数据量并行处理
 
Spring Batch Workshop (advanced)
Spring Batch Workshop (advanced)Spring Batch Workshop (advanced)
Spring Batch Workshop (advanced)
 
Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352
 
gsoc_mentor for Shivram Mani
gsoc_mentor for Shivram Manigsoc_mentor for Shivram Mani
gsoc_mentor for Shivram Mani
 
PXF BDAM 2016
PXF BDAM 2016PXF BDAM 2016
PXF BDAM 2016
 
PXF HAWQ Unmanaged Data
PXF HAWQ Unmanaged DataPXF HAWQ Unmanaged Data
PXF HAWQ Unmanaged Data
 
Hawq Hcatalog Integration
Hawq Hcatalog IntegrationHawq Hcatalog Integration
Hawq Hcatalog Integration
 
Managing Apache HAWQ with Apache AMBARI
Managing Apache HAWQ with Apache AMBARIManaging Apache HAWQ with Apache AMBARI
Managing Apache HAWQ with Apache AMBARI
 
Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16
 
Apache HAWQ : An Introduction
Apache HAWQ : An IntroductionApache HAWQ : An Introduction
Apache HAWQ : An Introduction
 
HAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopHAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoop
 
Pivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchPivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ Launch
 
S2GX 2012 - Introduction to Spring Integration and Spring Batch
S2GX 2012 - Introduction to Spring Integration and Spring BatchS2GX 2012 - Introduction to Spring Integration and Spring Batch
S2GX 2012 - Introduction to Spring Integration and Spring Batch
 
Build & test Apache Hawq
Build & test Apache Hawq Build & test Apache Hawq
Build & test Apache Hawq
 
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQMassively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to Apache
 
Parallel batch processing with spring batch slideshare
Parallel batch processing with spring batch   slideshareParallel batch processing with spring batch   slideshare
Parallel batch processing with spring batch slideshare
 
Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1
 
Ahea Team Spring batch
Ahea Team Spring batchAhea Team Spring batch
Ahea Team Spring batch
 

Similar to Pivotal HAWQ - High Availability (2014)

Live traffic capture and replay in cassandra 4.0
Live traffic capture and replay in cassandra 4.0Live traffic capture and replay in cassandra 4.0
Live traffic capture and replay in cassandra 4.0Vinay Kumar Chella
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and HadoopDataWorks Summit
 
Performance Whack-a-Mole Tutorial (pgCon 2009)
Performance Whack-a-Mole Tutorial (pgCon 2009) Performance Whack-a-Mole Tutorial (pgCon 2009)
Performance Whack-a-Mole Tutorial (pgCon 2009) PostgreSQL Experts, Inc.
 
M|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change MethodsM|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change MethodsMariaDB plc
 
Operating and supporting HBase Clusters
Operating and supporting HBase ClustersOperating and supporting HBase Clusters
Operating and supporting HBase Clustersenissoz
 
Operating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsOperating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsDataWorks Summit/Hadoop Summit
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Toronto-Oracle-Users-Group
 
A DBA’s guide to using TSA
A DBA’s guide to using TSAA DBA’s guide to using TSA
A DBA’s guide to using TSAFrederik Engelen
 
Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?Ludovico Caldara
 
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA SolutionsNagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA SolutionsNagios
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Inside Microsoft's FPGA-Based Configurable Cloud
Inside Microsoft's FPGA-Based Configurable CloudInside Microsoft's FPGA-Based Configurable Cloud
Inside Microsoft's FPGA-Based Configurable Cloudinside-BigData.com
 
Redundancy for Big Hadoop Clusters is hard - Stuart Pook
Redundancy for Big Hadoop Clusters is hard  - Stuart PookRedundancy for Big Hadoop Clusters is hard  - Stuart Pook
Redundancy for Big Hadoop Clusters is hard - Stuart PookEvention
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and HadoopMichael Zhang
 
20141011 my sql clusterv01pptx
20141011 my sql clusterv01pptx20141011 my sql clusterv01pptx
20141011 my sql clusterv01pptxIvan Ma
 
Jboss World 2011 Infinispan
Jboss World 2011 InfinispanJboss World 2011 Infinispan
Jboss World 2011 Infinispancbo_
 
Critical Attributes for a High-Performance, Low-Latency Database
Critical Attributes for a High-Performance, Low-Latency DatabaseCritical Attributes for a High-Performance, Low-Latency Database
Critical Attributes for a High-Performance, Low-Latency DatabaseScyllaDB
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databasesjbellis
 

Similar to Pivotal HAWQ - High Availability (2014) (20)

Live traffic capture and replay in cassandra 4.0
Live traffic capture and replay in cassandra 4.0Live traffic capture and replay in cassandra 4.0
Live traffic capture and replay in cassandra 4.0
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and Hadoop
 
Performance Whack-a-Mole Tutorial (pgCon 2009)
Performance Whack-a-Mole Tutorial (pgCon 2009) Performance Whack-a-Mole Tutorial (pgCon 2009)
Performance Whack-a-Mole Tutorial (pgCon 2009)
 
M|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change MethodsM|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change Methods
 
Operating and supporting HBase Clusters
Operating and supporting HBase ClustersOperating and supporting HBase Clusters
Operating and supporting HBase Clusters
 
Operating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsOperating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and Improvements
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
 
A DBA’s guide to using TSA
A DBA’s guide to using TSAA DBA’s guide to using TSA
A DBA’s guide to using TSA
 
Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?Oracle Drivers configuration for High Availability, is it a developer's job?
Oracle Drivers configuration for High Availability, is it a developer's job?
 
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA SolutionsNagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
nZDM.ppt
nZDM.pptnZDM.ppt
nZDM.ppt
 
Inside Microsoft's FPGA-Based Configurable Cloud
Inside Microsoft's FPGA-Based Configurable CloudInside Microsoft's FPGA-Based Configurable Cloud
Inside Microsoft's FPGA-Based Configurable Cloud
 
Redundancy for Big Hadoop Clusters is hard - Stuart Pook
Redundancy for Big Hadoop Clusters is hard  - Stuart PookRedundancy for Big Hadoop Clusters is hard  - Stuart Pook
Redundancy for Big Hadoop Clusters is hard - Stuart Pook
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and Hadoop
 
20141011 my sql clusterv01pptx
20141011 my sql clusterv01pptx20141011 my sql clusterv01pptx
20141011 my sql clusterv01pptx
 
Jboss World 2011 Infinispan
Jboss World 2011 InfinispanJboss World 2011 Infinispan
Jboss World 2011 Infinispan
 
Critical Attributes for a High-Performance, Low-Latency Database
Critical Attributes for a High-Performance, Low-Latency DatabaseCritical Attributes for a High-Performance, Low-Latency Database
Critical Attributes for a High-Performance, Low-Latency Database
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
 

Recently uploaded

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 


Pivotal HAWQ - High Availability (2014)

  • 10. 10© Copyright 2013 Pivotal. All rights reserved. PNN Fails HAWQ PM HAWQ SM PNN SNN DN SS SS SS SS SS SS DN SS SS SS SS SS SS DN SS SS SS SS SS SS • PHD 1.1: • (option 1)Manually bring up PNN. HAWQ cannot switch to secondary name node. • (option 2)HDFS admin should change the FQDN or IP address of secondary NN to the PNN. • HAWQ master keeps on trying to connect PNN and when it finds one, the cluster becomes operational. • PHD 1.1.1 (Dec,13) • QA verified testing of above 2 options. • Normal HDFS failover process • Change DNS name of secondary NN to the current NN • Namenode service will be supported in PHD 1.2 (February)
  • 11. 11© Copyright 2013 Pivotal. All rights reserved. PNN & Secondary NN Fail HAWQ PM HAWQ SM PNN SNN Action Availability Notes HAWQ Cluster No Metadata queries can be carried out, but no other queries. No DDL or DML. Current “SELECT” queries Aborted Users need to restart the query. Current Transaction Aborted After the PNN is up, dirty data & temp files will be removed. New “SELECT” & Transaction query Not possible DN SS SS SS SS SS SS DN SS SS SS SS SS SS DN SS SS SS SS SS SS • PHD 1.1: • (option 1) Manually bring up PNN. HAWQ cannot switch to the secondary name node. • (option 2) HDFS admin should change the FQDN or IP address of the secondary NN to the PNN. • HAWQ master keeps trying to connect to the PNN and, once it finds one, the cluster becomes operational. • PHD 1.1.1 (Dec '13) • QA-verified testing of the above 2 options.
  • 12. 12© Copyright 2013 Pivotal. All rights reserved. PNN & Secondary NN Fail HAWQ PM HAWQ SM PNN SNN DN SS SS SS SS SS SS DN SS SS SS SS SS SS DN SS SS SS SS SS SS • No split information • No transactions
  • 13. 13© Copyright 2013 Pivotal. All rights reserved. Secondary NN Fails HAWQ PM HAWQ SM PNN SNN Action Availability Notes HAWQ Cluster Yes Fully available Current “SELECT” queries Yes Current Transaction Yes New “SELECT” & Transaction query Yes DN SS SS SS SS SS SS DN SS SS SS SS SS SS DN SS SS SS SS SS SS
  • 14. 14© Copyright 2013 Pivotal. All rights reserved. A Segment Fails HAWQ PM HAWQ SM PNN SNN Action Availability Notes HAWQ Cluster Yes HAWQ Cluster available. Current “SELECT” queries Aborted Users need to restart the query. Current Transaction Aborted Dirty data & temp files will be removed. New “SELECT” & Transaction query Yes Remaining segments will handle the query. DN SS SS SS SS SS SS DN SS SS SS SS SS SS DN SS SS SS SS SS SS
  • 15. 15© Copyright 2013 Pivotal. All rights reserved. A Segment Fails HAWQ PM HAWQ SM PNN SNN DN SS SS SS SS SS SS DN SS SS SS SS SS SS DN SS SS SS SS SS SS • Segment QEs (Query Executors) are killed • HAWQ does not materialize intermediate results • Local actions by a QE are not committed • Segment QEs are started by other segments in subsequent queries • QE substitution is random • Future release for option to materialize work files
  • 16. 16© Copyright 2013 Pivotal. All rights reserved. Multiple Segments Fail HAWQ PM HAWQ SM PNN SNN Action Availability Notes HAWQ Cluster Yes HAWQ Cluster available. Current “SELECT” queries Aborted Users need to restart the query. Current Transaction Aborted Dirty data & temp files will be removed. New “SELECT” & Transaction query Yes Remaining segments will handle the query. DN SS SS SS SS SS SS DN SS SS SS SS SS SS DN SS SS SS SS SS SS
  • 17. 17© Copyright 2013 Pivotal. All rights reserved. DN Fails HAWQ PM HAWQ SM PNN SNN DN SS SS SS SS SS SS DN SS SS SS SS SS SS DN SS SS SS SS SS SS Action Availability Notes HAWQ Cluster Yes HAWQ Cluster available. Current “SELECT” queries Yes SS will automatically connect to a remote DN in the middle of the currently executing query. Current Transaction Yes Transaction will finish successfully. New “SELECT” & Transaction query Yes • PHD 1.1: • No impact. SS will continue to work with a remote DN • Loss of data locality might introduce a slight performance impact. In a 10G network the performance impact is measured to be around 10% for large queries. Simple queries might experience a 50% performance impact.
  • 18. 18© Copyright 2013 Pivotal. All rights reserved. DN Fails HAWQ PM HAWQ SM PNN SNN DN SS SS SS SS SS SS DN SS SS SS SS SS SS DN SS SS SS SS SS SS • PHD 1.1: • No impact. SS will continue to work with a remote DN • Loss of data locality might introduce a slight performance impact. In a 10G network the performance impact is measured to be around 10% for large queries. Simple queries might experience a 50% performance impact. • libhdfs falls back to reading from an HDFS replica • Short-term performance loss until the NN marks the DN as dead
  • 19. 19© Copyright 2013 Pivotal. All rights reserved. Segment Host Dies HAWQ PM HAWQ SM PNN SNN DN SS SS SS SS SS SS DN SS SS SS SS SS SS DN SS SS SS SS SS SS Action Availability Notes HAWQ Cluster Yes HAWQ Cluster available. Current “SELECT” queries Aborted Users need to restart the query. Current Transaction Aborted Dirty data & temp files will be removed. New “SELECT” & Transaction query Yes Remaining segments will handle the query.
  • 20. 20© Copyright 2013 Pivotal. All rights reserved. Single Disk Failure in DN  JBOD – If tempdata is not on the failed disk, there is no impact on the cluster or query. – If tempdata is configured to be on the failed disk: ▪ Small queries will run, but large queries with too much temporary data will be impacted. ▪ Transactions will be aborted, and new transactions will continue if multiple disks are configured to contain tempdata.  RAID 5 – No impact. – Possible performance loss.  RAID 10 – No impact & no performance loss.
  • 21. 21© Copyright 2013 Pivotal. All rights reserved. HAWQ HA on roadmap  Automatic Namenode HA supported on PHD now  Automatic Namenode HA (name service) supported by HAWQ in February release  PXF to also support NN service  No interruption in query execution during NN failure  HAWQ HA unchanged
  • 22. 22© Copyright 2013 Pivotal. All rights reserved. 22© Copyright 2013 Pivotal. All rights reserved. What’s New in Pivotal HD 1.1 November 7th, 2013
  • 23. 23© Copyright 2013 Pivotal. All rights reserved. Key Themes of Pivotal HD 1.1 Release  Leverage more data, in real time, more easily to gain competitive advantage  Richer services and tools to create a broader set of applications  Deeper, streamlined administrative capabilities for enterprise deployments
  • 24. 24© Copyright 2013 Pivotal. All rights reserved. Pivotal HD Architecture HDFS HBase Pig, Hive, Mahout Map Reduce Sqoop Flume Resource Management & Workflow Yarn Zookeeper Apache Pivotal Command Center Configure, Deploy, Monitor, Manage Data Loader Pivotal HD Enterprise Spring Unified Storage Service Xtension Framework Catalog Services Query Optimizer Dynamic Pipelining ANSI SQL + Analytics HAWQ – Advanced Database Services Hadoop Virtualization Extension Distributed In-memory Store Query Transactions Ingestion Processing Hadoop Driver – Parallel with Compaction ANSI SQL + In-Memory GemFire XD – Real-Time Database Services MADlib Algorithms Oozie Vaidya
  • 25. 25© Copyright 2013 Pivotal. All rights reserved. GemFire XD : Delivers an enterprise real-time data processing platform for SLA-critical applications; enables users to rapidly and reliably analyze & react to high volumes of events while leveraging 10s of TBs of in-memory reference data. Cloud Scale Real-Time Platform Seamless Pivotal HD Integration Optimized for Real-Time Analytics • Very low & predictable latencies at high & variable loads • 10s of TBs in-memory (Memscale) • Multi-tiered caching • Efficient in-memory M-R • Real-time event processing • Continuous querying • SQL based queries • Support structured and semi-structured* data • Java stored procedures • Deep Spring Data integration • Native support for JSON and Objects (Java, C++, C#)* • Scale to HDFS with policy driven in-memory data retention • Online and offline querying of HDFS data • ETL-less bi-directional integration with other Pivotal HD services Enterprise-Class Reliability • JTA distributed transactions • HA through in-memory redundancy • Reliable event propagation • Active-active deployments across WAN * EA / Not in 1.0
  • 26. 26© Copyright 2013 Pivotal. All rights reserved. Feature Benefit Command Center: Install Wizard Faster, easier set up and configuration of HD cluster Start/Stop Services Point/click control of multiple services through a central interface HAWQ UDF (Partial) - C, PL/pgsql - pgcrypto, orafce Enable richer data processing and analytics functionality leveraging existing SQL skill sets Kerberos Support Tightly integrated security with HDFS PXF: Writable HDFS Table Support Easily export HAWQ data to HDFS for external consumption HAWQ Input Format Reader Directly leverage HAWQ data in MapReduce, Pig and Hive Diagnostic Tools Lower administration costs Improved Query Planner “Orca” Enabled to provide more efficient query plans What’s New in Pivotal HD 1.1
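  The HAWQ UDF row above (C and PL/pgSQL support, plus pgcrypto and orafce) is easiest to picture with a small example. The sketch below is illustrative only; the function name is invented, and the lineitem columns are borrowed from the TPC-H-style query used later in this deck.
    -- Minimal PL/pgSQL UDF sketch (hypothetical function name)
    CREATE OR REPLACE FUNCTION discounted_price(price numeric, discount numeric)
    RETURNS numeric AS $$
    BEGIN
        RETURN price * (1 - discount);
    END;
    $$ LANGUAGE plpgsql;

    SELECT discounted_price(l_extendedprice, l_discount) FROM lineitem LIMIT 10;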
  • 27. 27© Copyright 2013 Pivotal. All rights reserved. Feature Benefit Install/Config (ICM) CLI Add/Remove Services Faster, easier set up and administration of services (e.g. Hbase, GemfireXD etc) Upgrade Streamlined, low risk upgrade from 1.0.1 to 1.1 Apache Hadoop Components Hadoop to 2.0.5 and select 2.0.6 patches Greater stability and lower risk based on critical defect fixes incorporated Oozie 3.3.2 Orchestrate data processing (e.g. MR, Pig) job pipelines with dependencies Hive 11 (incl. HCatalog and Hiveserver2) Significant improvements in functionality, scalability and security. Hbase 0.94.8 Enables snapshots of tables without overhead to the Region Servers RHEL 6.4 Certification Enhanced performance optimizations and security improvements What’s New in Pivotal HD 1.1
  • 28. 28© Copyright 2013 Pivotal. All rights reserved. Feature Benefit Platform and Security Kerberos Support - HDFS - HAWQ - Unified Storage Service - PXF to be supported in Dec 2013 Tighter governance, risk and compliance JRE 1.7.0.15 support Supported platform. JRE 1.6 is end of life. RHEL 6.4 (FIPS) certification Federal standard for cryptography modules Pgcrypto for HAWQ Flexible and robust encryption of sensitive data Tools Unified Storage Service: CDH4 as a data source Stream data from CDH4 Data Loader - Push Stream API - Spring XD front end for Twitter Integration support for wider variety of data sources What’s New in Pivotal HD 1.1
  • 29. 29© Copyright 2013 Pivotal. All rights reserved. Command Center Cluster Deployment Wizard • Performs “Host Verification” to determine host eligibility to be added to cluster
  • 30. 30© Copyright 2013 Pivotal. All rights reserved. Command Center Cluster Deployment Wizard • Easily Add Eligible Nodes to Roles • Basic Validation of Layout • Checkbox Add/Remove Services • Ability to Download Configuration Locally Recorded Demo can be found -> Here
  • 31. 31© Copyright 2013 Pivotal. All rights reserved. Orca - Improved Optimizer  Pluggable architecture, allowing faster innovation and quicker iteration on quality improvements  Subset of improved functionality: • Parity with Planner • Improved join-ordering • Join-Aggregate re-ordering • Sub-query de-correlation • Optimal sort-orders • Full integration of data (re-)distribution • Contradiction detection • Elimination of redundant joins • Smarter Partition scan • Star-join optimization • Skew aware
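  As a rough illustration of how the improved optimizer can be exercised, the sketch below assumes the session-level "optimizer" setting HAWQ exposes for Orca; check the release notes for the exact GUC name and default before relying on it.
    -- Sketch: comparing plans with and without Orca (GUC name assumed)
    SET optimizer = on;    -- route planning through Orca
    EXPLAIN SELECT c_nationkey, count(*) FROM customer GROUP BY c_nationkey;
    SET optimizer = off;   -- fall back to the legacy planner and compare the plan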
  • 32. 32© Copyright 2013 Pivotal. All rights reserved. What’s new in PXF  Profiles  Writable external tables  Hive partition pruning, HBase filtration  Additional connectors & CSV support  Complete extensibility  Roadmap – Security & authentication – Multi-FS support & other distributions via OS – Stand-alone service
  • 33. 33© Copyright 2013 Pivotal. All rights reserved. Why Pivotal HD?  Big Data + Fast Data  The first enterprise grade platform that provides OLAP and OLTP with HDFS as the common data substrate  Enables closed loop analytics, real-time event processing and high speed data ingest
  • 34. 34© Copyright 2013 Pivotal. All rights reserved. Hawq Format Reader Java Program (i.e. MapReduce Job) HDFS Hawq Hawq Reader (Jar file) 1. Request is made for where the files for a specific “Table” exist 2. Location of the files is returned 3. HDFS files with Hawq format are streamed to the Reader Recorded Demo can be found -> Here
  • 35. 35© Copyright 2013 Pivotal. All rights reserved. Oozie now Included and Supported with PHD  Oozie is a workflow scheduler system to manage Apache Hadoop jobs.  Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.  Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.  Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).  Oozie is a scalable, reliable and extensible system.
  • 36. 36© Copyright 2013 Pivotal. All rights reserved. Matrix of what is supported via Install method
  • 37. 37© Copyright 2013 Pivotal. All rights reserved. Security Dashboard (items in bold tested; rest are scheduled) Support secure cluster Supports Kerberos for Authentication Support LDAP for Authentication HDFS Yes Yes Linux OS supports MapReduce/Pig Yes N/A Hive Yes (standalone mode) N/A Hiveserver No No Hiveserver2 Yes Yes Yes Hbase Yes Yes Yes HAWQ* Yes Yes Yes GemfireXD Yes Yes Yes * Except PXF; scheduled for Dec (PHD 1.1.1 release)
  • 38. 38© Copyright 2013 Pivotal. All rights reserved. Vaidya
  • 39. 39© Copyright 2013 Pivotal. All rights reserved. 39© Copyright 2013 Pivotal. All rights reserved. Roadmap Open Discussion Nov 25, 2013
  • 40. 40© Copyright 2013 Pivotal. All rights reserved. Roadmap – Action Items  Error tables released in PHD 1.2 (February) – Current workaround  PCC new features?!  SAW integration  PHD 1.1 upgrade planning
  • 41. 41© Copyright 2013 Pivotal. All rights reserved. 41© Copyright 2013 Pivotal. All rights reserved. Appendix Nov 25, 2013
  • 42. 42© Copyright 2013 Pivotal. All rights reserved. 42© Copyright 2013 Pivotal. All rights reserved. HAWQ Nov 25, 2013
  • 43. 43© Copyright 2013 Pivotal. All rights reserved. History  HAWQ 1.0 (March release) – True SQL Engine in Hadoop ▪ SQL 92, 99 & 2003 OLAP extensions ▪ JDBC/ODBC – Basic SQL functionalities ▪ DDL and DML – High availability feature – Transaction support  HAWQ 1.1 (June release) – JBOD support feature  HAWQ 1.1.1 (August release) – HDFS access layer read fault tolerance support – HAWQ diagnosis tool – ORCA enabled  HAWQ 1.1.2 (September release) – HAWQ MR Inputformat for AO tables – HDFS access layer write fault tolerance support – HDFS 2.0.5 support  HAWQ 1.1.3 (Oct release) – HAWQ Kerberos support – HAWQ on secure HDFS – UDF  HAWQ 1.1.4 (Dec release) – Gptoolkit – UDF enhancement – Manual failover for HDFS HA  HAWQ 1.2 (Feb release) – Parquet storage support – HAWQ MR Inputformat – Automatic failover for HDFS HA – …
  • 44. 44© Copyright 2013 Pivotal. All rights reserved. HAWQ & HDFS [Diagram: Master Servers handle planning & dispatch, Segment Servers handle query execution, and storage is HDFS, HBase, …, all connected over the Network Interconnect]
  • 45. 45© Copyright 2013 Pivotal. All rights reserved. [Diagram: HAWQ on HDFS – the master host issues metadata ops to the Namenode; segment hosts run HAWQ segments co-located with HDFS Datanodes across Rack1 and Rack2; segments read/write blocks and communicate over the GPDB interconnect; block replication spans racks]
  • 46. 46© Copyright 2013 Pivotal. All rights reserved. Query execution flow
  • 47. 47© Copyright 2013 Pivotal. All rights reserved. Parallel Query Optimizer • Converts SQL into a physical execution plan – Cost-based optimization looks for the most efficient plan – Physical plan contains scans, joins, sorts, aggregations, etc. – Global planning avoids sub-optimal ‘SQL pushing’ to segments – Directly inserts ‘motion’ nodes for inter-segment communication • ‘Motion’ nodes for efficient non-local join processing (Assume table A is distributed across all segments – i.e. each has AK) – Broadcast Motion (N:N) • Every segment sends AK to all other segments – Redistribute Motion (N:N) • Every segment rehashes AK (by join column) and redistributes each row – Gather Motion (N:1) • Every segment sends its AK to a single node (usually the master)
  • 48. 48© Copyright 2013 Pivotal. All rights reserved. Example of Parallel Query Optimization
    select c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue,
           c_acctbal, n_name, c_address, c_phone, c_comment
    from customer, orders, lineitem, nation
    where c_custkey = o_custkey
      and l_orderkey = o_orderkey
      and o_orderdate >= date '1994-08-01'
      and o_orderdate < date '1994-08-01' + interval '3 month'
      and l_returnflag = 'R'
      and c_nationkey = n_nationkey
    group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
    order by revenue desc
  Plan operators as listed on the slide: Gather Motion 4:1 (slice 3), Sort, HashAggregate, HashJoin, Redistribute Motion 4:4 (slice 1), HashJoin, Seq Scan on lineitem, Hash, Seq Scan on orders, Hash, HashJoin, Seq Scan on customer, Hash, Broadcast Motion 4:4 (slice 2), Seq Scan on nation
  • 49. 49© Copyright 2013 Pivotal. All rights reserved. Interconnect • UDP based • Flow control
  • 50. 50© Copyright 2013 Pivotal. All rights reserved. Metadata dispatch • Metadata dispatch • Stateless segments – Read only metadata on segment
  • 51. 51© Copyright 2013 Pivotal. All rights reserved. Transaction  Full transaction support for tables on HDFS – When a load transaction is aborted, there will be some garbage data left at the end of the file. For HDFS-like systems, data cannot be truncated or overwritten.  Methods to process the partial data to support transactions: – Option 1: Load data into a separate HDFS file. Unlimited number of files. – Option 2: Use metadata to record the boundary of garbage data, and implement a kind of vacuum mechanism. – Option 3: Implement HDFS truncation.  HDFS truncate is added to support transactions
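  To make the garbage-data problem concrete, the hedged sketch below shows an aborted load; the table name and file path are placeholders. Rows appended before the rollback remain at the end of the HDFS file until the truncate/metadata mechanism reclaims them.
    BEGIN;
    COPY sales FROM '/data/sales_2013.csv' CSV;   -- hypothetical table and path
    ROLLBACK;  -- rows already appended become garbage at the end of the HDFS file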
  • 52. 52© Copyright 2013 Pivotal. All rights reserved. Transaction  Snapshot isolation  Simplified Transaction Model Support – Simplified two phase commit
  • 53. 53© Copyright 2013 Pivotal. All rights reserved. Transaction support • Methods to process the partial data to support transactions: – Option 1: Load data into a separate HDFS file. Unlimited number of files. – Option 2: Use metadata to record the boundary of garbage data, and implement a kind of vacuum mechanism. – Option 3: Implement HDFS truncation.
  • 54. 54© Copyright 2013 Pivotal. All rights reserved. Pluggable storage • Read Optimized/Append only storage • Column store – Compressions: quicklz, zlib, RLE – Partitioned tables hit HDFS limitation • Parquet – Open source format – PAX like column store – Snappy, gzip • MR Input/Output format
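  A hedged sketch of the storage options above; the table definitions are invented, and the exact option spellings may vary by release (Parquet arrives with HAWQ 1.2 per the history slide).
    -- Append-only, column-oriented table with quicklz compression
    CREATE TABLE sales_col (id bigint, amount numeric, region text)
      WITH (appendonly=true, orientation=column, compresstype=quicklz);

    -- Parquet-backed table with snappy compression (HAWQ 1.2)
    CREATE TABLE sales_parquet (id bigint, amount numeric, region text)
      WITH (appendonly=true, orientation=parquet, compresstype=snappy);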
  • 55. 55© Copyright 2013 Pivotal. All rights reserved. HDFS C client: why • libhdfs (Current HDFS c client) is based on JNI. It is difficult to make HAWQ support a large number of concurrent queries. • Example: – 4 segments on each segment hosts – 50 concurrent queries – each query has 16 QE processes that do scan – there will be about 800 processes that start 800 JVMs to access HDFS. – If each JVM uses 500MB memory, the JVMs will consume 800 * 500M = 400G memory. – Thus naïve usage of libhdfs is not suitable for HAWQ. Currently we have three options to solve this problem
  • 56. 56© Copyright 2013 Pivotal. All rights reserved. HDFS client: three options • Option 1: use HDFS FUSE. HDFS FUSE introduces some performance overhead. And the scalability is not verified yet. • Option 2 (libhdfs2): implement a webhdfs based C client. webhdfs is based on HTTP. It also introduces some costs. Performance should be benchmarked. Webhdfs based method has several benefits, such as ease of implementation and low maintenance cost. • Option 3 (libhdfs3): implement a C RPC interface that directly communicates with NameNode and DataNode. Many changes are required when the RPC protocol is changed.
  • 57. 57© Copyright 2013 Pivotal. All rights reserved. 57© Copyright 2013 Pivotal. All rights reserved. PXF Nov 25, 2013
  • 58. 58© Copyright 2013 Pivotal. All rights reserved. PXF is... A fast extensible framework connecting Hawq to a data store of choice that exposes a parallel API
  • 59. 59© Copyright 2013 Pivotal. All rights reserved. Hawq External Tables • gpfdist – remote delimited text (or csv) files. • file – text files on segment filesystem. • execute – script execution and produced data • pxf – text and binary data from available pxf connectors (mostly HD based).
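  For comparison with the pxf protocol, a minimal gpfdist readable external table might look like the sketch below; the host, port, and file pattern are placeholders.
    CREATE EXTERNAL TABLE ext_sales (id int, amount numeric, region text)
      LOCATION ('gpfdist://etl-host:8081/sales*.csv')
      FORMAT 'csv' (DELIMITER ',');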
  • 60. 60© Copyright 2013 Pivotal. All rights reserved. Steps • Step 1: GRANT ON PROTOCOL pxf • Step 2: Define a PXF table – Pick built-in plugins right for the job – Specify data source of choice – Map remote data fields to Hawq db attributes (plugin dependent) • Step 3: Query the PXF table. – Directly – Or copy to a Hawq table first CREATE EXTERNAL TABLE foo(<col list>) LOCATION (‘pxf://<host:port>/<data source>?<plugin options>’) FORMAT ‘<type>’(<params>)
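  Putting the three steps together, a worked sketch follows; the role name, host:port, path, and columns are placeholders, and the GRANT form is the assumed spelling of the "GRANT ON PROTOCOL pxf" step named above.
    GRANT SELECT ON PROTOCOL pxf TO analyst;                     -- Step 1

    CREATE EXTERNAL TABLE pxf_sales (id int, amount numeric)     -- Step 2
      LOCATION ('pxf://<host:port>/sales/2013?profile=HdfsTextSimple')
      FORMAT 'text' (DELIMITER ',');

    SELECT count(*) FROM pxf_sales;                              -- Step 3: query directly
    CREATE TABLE sales_local AS SELECT * FROM pxf_sales;         -- ...or copy into a HAWQ table first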
  • 61. 61© Copyright 2013 Pivotal. All rights reserved.
  • 62. 62© Copyright 2013 Pivotal. All rights reserved.
  • 63. 63© Copyright 2013 Pivotal. All rights reserved.
  • 64. 64© Copyright 2013 Pivotal. All rights reserved. 64© Copyright 2013 Pivotal. All rights reserved. New Features Main additions since PHD1.0
  • 65. 65© Copyright 2013 Pivotal. All rights reserved. 65© Copyright 2013 Pivotal. All rights reserved. User Experience
  • 66. 66© Copyright 2013 Pivotal. All rights reserved. User Experience • Improved/Informative error messages. • Profiles LOCATION(‘pxf://<host:port>/sales?fragmenter=HiveFragmenter&accessor=HiveAccessor&resolver=HiveResolver’) LOCATION(‘pxf://<host:port>/sales?profile=Hive’)
  • 67. 67© Copyright 2013 Pivotal. All rights reserved. profiles.xml <profile> <name>HBase</name> <description>Used for connecting to an HBase data store engine</description> <plugins> <fragmenter>HBaseDataFragmenter</fragmenter> <accessor>HBaseAccessor</accessor> <resolver>HBaseResolver</resolver> <myidentifier>MyValue</myidentifier> </plugins> </profile>
  • 68. 68© Copyright 2013 Pivotal. All rights reserved. profiles.xml <profile> <name>HdfsTextSimple</name> <description>Used when reading delimited single line records from plain text files on HDFS </description> <plugins> <fragmenter>HdfsDataFragmenter</fragmenter> <accessor>LineBreakAccessor</accessor> <resolver>StringPassResolver</resolver> <analyzer>HdfsAnalyzer</analyzer> <!-- soon to be added --> </plugins> </profile>
  • 69. 69© Copyright 2013 Pivotal. All rights reserved. profiles.xml <profile> <name>MyCustomProfile</name> <description>Used with a new set of plugins I wrote</description> <plugins> <fragmenter>MyFragmenter</fragmenter> <accessor>MyAccessor</accessor> <resolver>MyResolver</resolver> <analyzer>MyAnalyzer</analyzer> </plugins> </profile> Add your own profiles
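  Once a profile such as MyCustomProfile is registered in profiles.xml, an external table can reference it by name. This is a hedged sketch: the host:port, path, and columns are placeholders, and the 'custom' format with the pxfwritable_import formatter is the form typically paired with custom resolvers.
    CREATE EXTERNAL TABLE my_data (k text, v text)
      LOCATION ('pxf://<host:port>/my/source/path?profile=MyCustomProfile')
      FORMAT 'custom' (formatter='pxfwritable_import');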
  • 70. 70© Copyright 2013 Pivotal. All rights reserved. 70© Copyright 2013 Pivotal. All rights reserved. Export to HDFS
  • 71. 71© Copyright 2013 Pivotal. All rights reserved. Writable PXF • gphdfs-like functionality – but extensible… – currently supports text, csv, SequenceFile – supports various hadoop compression Codecs CREATE WRITABLE EXTERNAL TABLE ... LOCATION(‘pxf://<host:port>/sales?profile=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec') FORMAT ‘text’(delimiter ‘,’); can create a new profile “HdfsTextSimpleGZipped” that includes compression_codec LOCATION(‘pxf://<host:port>/sales?profile=HdfsTextSimpleGZipped')
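  As a usage note, a writable PXF table like the one defined above is populated with INSERT once created; a minimal sketch with placeholder table and column names:
    INSERT INTO pxf_sales_out
      SELECT id, amount, region FROM sales WHERE sale_date >= date '2013-01-01';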
  • 72. 72© Copyright 2013 Pivotal. All rights reserved. 72© Copyright 2013 Pivotal. All rights reserved. New Connectors
  • 73. 73© Copyright 2013 Pivotal. All rights reserved. New Connectors • GemFire XD (Released. GA February) • JSON (On github. GA February (r+w)) • Accumulo (On github. GA version being coded by Clearedge. GA February) • Cassandra (On github. Alpha) None of them were written by the PXF Dev team… a testament to extensibility.
  • 74. 74© Copyright 2013 Pivotal. All rights reserved. Feature Summary ★ HBase (w/filter pushdown) ★ Hive (w/partition exclusion. various storage file types) ★ HDFS Files: read (delimited text, csv, Sequence, Avro) ★ HDFS Files: write (delimited text, csv, Sequence, various compression codecs and options) ★ GemFireXD, JSON format, Cassandra, Accumulo (currently Beta) ★ Stats collection ★ Automatic data locality optimizations ★ Extensibility!
  • 75. 75© Copyright 2013 Pivotal. All rights reserved. Coming Up Very Soon... ★ Isilon Integration ★ Kerberized HDFS Support ★ Namenode High Availability
  • 76. 76© Copyright 2013 Pivotal. All rights reserved. Limitations • Local metadata of external data – Will be made more transparent when UCS exists. • Authentication and Authorization of external systems – Will be made simpler when centralized user mgmt exists. • Currently supporting local PHD only • Error tables not yet supported • Sharing space with Name/DataNode
  • 77. 77© Copyright 2013 Pivotal. All rights reserved. 77© Copyright 2013 Pivotal. All rights reserved. Writing a plugin steps and guidelines
  • 78. 78© Copyright 2013 Pivotal. All rights reserved. Main Steps 1. Verify P-HD running and PXF installed a. SingleCluster, AllInAll, SingleNode VM 2. Implement the PXF plugin API for your connector (Java) a. Use the PXF API doc as a reference 3. Compile your connector classes and add them to the hadoop classpath on all nodes 4. Restart PHD (won’t be necessary in the future) 5. Add a profile (optional)
  • 79. 79© Copyright 2013 Pivotal. All rights reserved. Plugins • Fragmenter – returns a list of source data fragments and their location • Accessor – accesses a given list of fragments, reads them and returns records • Resolver – deserializes each record according to a given schema or technique • Analyzer – returns statistics about the source data
  • 80. 80© Copyright 2013 Pivotal. All rights reserved. 80© Copyright 2013 Pivotal. All rights reserved. Thanks! Nov 25, 2013