
Unified Framework for Big Data FDW



With the introduction of Foreign Data Wrappers in Postgres 9.1, access to distributed systems such as HDFS, HBase and Hive, with their multiple data formats, became feasible.
However, the existing FDW implementations for Big Data systems, such as HDFS or Hive, lack a few key features and don't share a common framework.

The talk introduces PXF, an open source project which provides a unified, extensible framework for accessing any distributed system data source. PXF is currently used by Apache HAWQ's external tables via a REST API and is in the process of being integrated with other SQL engines. Existing plugins support loading and querying of data stored in HDFS, HBase and Hive, across a wide range of data formats such as Text, Avro, Sequence, Hive RCFile, ORC and Parquet. The pluggable framework makes it very convenient to add new custom plugins. It also supports advanced statistics and filter pushdown.

With the integration of PXF into Postgres FDW, we can achieve a single unified pluggable framework to read and write any distributed system data source.

Published in: Data & Analytics
Unified Framework for Big Data FDW

  1. Shivram Mani (Pivotal): Unified Framework for Big Data Foreign Data Wrappers @ FOSDEM PGDay 2016
  2. Agenda ● Introduction to Hadoop Ecosystem ● Why Postgres SQL on Hadoop ● Current state of SQL on Hadoop (FDW/Big Data wrappers) ● PXF - Design & Architecture ● Demo ● Benefits of using PXF with FDW ● Q&A
  3. Agenda ➢ Introduction to Hadoop Ecosystem ● Why Postgres SQL on Hadoop ● Current state of SQL on Hadoop (FDW/Big Data wrappers) ● PXF - Design & Architecture ● Demo ● Benefits of using PXF with FDW ● Q&A
  4. What is Hadoop/Big Data? Apache Hadoop is an open source framework for distributed processing of large data sets across clusters of computers.
  ● Commodity hardware
  ● Scale out
  ● Fault tolerance
  ● Support for multiple file formats
  [Stack diagram: Hadoop Distributed File System (HDFS) as the clustered file system; MapReduce for distributed data processing; HBase, Hive and Pig as top-level abstractions; ETL tools, BI tools and RDBMS as top-level interfaces]
  5. Agenda ● Introduction to Hadoop Ecosystem ➢ Why Postgres SQL on Hadoop ● Current state of SQL on Hadoop (FDW/Big Data wrappers) ● PXF ● Demo ● Benefits of using PXF with FDW ● Q&A
  6. Motivations: SQL on Hadoop. HDFS supports various formats and storages; an RDBMS brings:
  ● ANSI SQL
  ● Cost-based optimizer
  ● Transactions
  ● Indexes
  How to bridge the two? Foreign Tables!
  7. Agenda ● Introduction to Hadoop Ecosystem ● Why Postgres SQL on Hadoop ➢ Current state of SQL on External Hadoop - FDW/Big Data wrappers ● PXF - Design & Architecture ● Demo ● Benefits of using PXF with FDW ● Q&A
  8. Foreign Data Wrappers (FDW) Foreign tables and foreign data wrappers are the Postgres way to read external data.
  1. Create the FDW (compiled C functions in the handler)
  2. Declare the extension (FDW)
  3. Create a server that uses the wrapper
  4. Create a table that uses the server
  CREATE FOREIGN DATA WRAPPER hadoop_fdw HANDLER hadoop_fdw_handler NO VALIDATOR;
  CREATE EXTENSION hadoop_fdw;
  CREATE SERVER hadoop_server FOREIGN DATA WRAPPER hadoop_fdw OPTIONS (address '127.0.0.1', port '10000');
  CREATE FOREIGN TABLE retail_history (name text, price double precision) SERVER hadoop_server OPTIONS (table 'example.retail_history');
  9. Foreign Data Wrappers - Implementation Creating a new foreign data wrapper consists of implementing the FDW API as C-language functions. Scanning a foreign table requires implementing the following callbacks:
  ● GetForeignRelSize - estimate the relation size
  ● GetForeignPaths - get access paths for the foreign data
  ● GetForeignPlan - create a plan from the chosen foreign path
  ● BeginForeignScan - start the scan: open connections, etc.
  ● IterateForeignScan - perform the scan and return tuples
  ● EndForeignScan - end the scan: close connections, etc.
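The scan lifecycle above can be sketched in Python (a stand-in for the real C API; the class and data below are hypothetical, only the callback names come from the slide):

```python
# Sketch of the FDW scan lifecycle. The real API is a set of C functions
# registered by the wrapper's handler; this mirrors the callback flow only.

class ForeignScan:
    """Models BeginForeignScan / IterateForeignScan / EndForeignScan."""

    def __init__(self, rows):
        self.rows = rows          # stand-in for the remote data source
        self.cursor = None

    def begin(self):              # BeginForeignScan: open connections, etc.
        self.cursor = iter(self.rows)

    def iterate(self):            # IterateForeignScan: one tuple, or None at end
        return next(self.cursor, None)

    def end(self):                # EndForeignScan: close connections, etc.
        self.cursor = None

scan = ForeignScan([("apple", 1.0), ("pear", 2.0)])
scan.begin()
tuples = []
while (t := scan.iterate()) is not None:
    tuples.append(t)
scan.end()
```

The executor drives exactly this loop: it calls IterateForeignScan repeatedly until the wrapper signals the end of the scan, then cleans up via EndForeignScan.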
  10. Big Data Wrappers (Multicorn, BigSQL, EnterpriseDB)
  1. Create a Hive table corresponding to the HDFS file/HBase table
  2. Create the extension, server & foreign table schema with the necessary options
  3. Query the foreign table
  4. The query connects to HiveServer via a Thrift client
  5. The Hive server executes MapReduce jobs
  6. Results are mapped to a Postgres table
  11. Big Data Wrapper - Communication [Diagram: FDW ↔ libthrift ↔ HiveServer/MetaStore]
  12. Agenda ● Introduction to Hadoop Ecosystem ● Why Postgres SQL on Hadoop ● Current state of SQL on Hadoop - FDW/Big Data wrappers ➢ PXF - Design & Architecture ● Demo ● Benefits of using PXF with FDW ● Q&A
  13. HAWQ Extension Framework - PXF
  ● HAWQ is an MPP SQL engine on HDFS (evolved from Greenplum Database)
  ● PXF is an extensible framework that allows HAWQ to query external data
  ● PXF includes built-in connectors for accessing data in HDFS files, Hive & HBase tables
  ● Users can create custom connectors to other parallel data stores or processing engines
  14. PXF - Communication [Diagram: HAWQ segments access native tables via libhdfs3 (written in C); external tables go over HTTP (port 51200) to the REST API of the PXF webapp running in Apache Tomcat, which talks to the underlying stores through their Java APIs]
  15. Architecture - Deployment [Diagram: HAWQ master on the NameNode; each DataNode runs a PXF agent alongside its HAWQ segment and HBase RegionServer]
  * PXF needs to be installed on all DataNodes
  * PXF is recommended to be installed on the NameNode
  16. Design - Components (PXF)
  ● Fragmenter - gets the locations of fragments for an external table; implicitly provides stats to the query optimizer
  ● Accessor - understands and reads/writes the fragment, returns records
  ● Resolver - converts records to a HAWQ-consumable format (data types)
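The three components above form a pipeline: fragments out of the Fragmenter, raw records out of the Accessor, typed tuples out of the Resolver. A minimal Python sketch (hypothetical stand-ins for PXF's Java plugin interfaces; the sample CSV data is invented):

```python
# Sketch of the Fragmenter -> Accessor -> Resolver pipeline.

class Fragmenter:
    """Returns the fragments (splits) of an external data source."""
    def get_fragments(self, source):
        # In PXF a fragment would map to e.g. an HDFS block; here: 2-line chunks.
        return [source[i:i + 2] for i in range(0, len(source), 2)]

class Accessor:
    """Reads raw records out of a single fragment."""
    def read(self, fragment):
        yield from fragment

class Resolver:
    """Converts a raw record into typed fields the engine can consume."""
    def resolve(self, record):
        name, price = record.split(",")
        return (name, float(price))

source = ["apple,1.0", "pear,2.0", "plum,0.5"]
rows = [Resolver().resolve(r)
        for frag in Fragmenter().get_fragments(source)
        for r in Accessor().read(frag)]
```

Because each stage is a separate interface, a new data source only needs its own Fragmenter/Accessor/Resolver trio; the rest of the engine is untouched.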
  17. DDL Comparison
  FDW:
  CREATE EXTENSION hadoop_fdw;
  CREATE SERVER hadoop_server FOREIGN DATA WRAPPER hadoop_fdw OPTIONS (address '127.0.0.1', port '10000');
  CREATE FOREIGN TABLE retail_history (name text, price double precision) SERVER hadoop_server OPTIONS (table 'example.retail_history');
  PXF:
  CREATE PROTOCOL pxf;
  CREATE EXTERNAL TABLE retail_history (name text, price double precision) LOCATION('pxf://127.0.0.1:51200/example.retail_history?PROFILE=Hive') FORMAT 'CUSTOM' (formatter='pxfwritable_import');
  * Corresponding statements in the two DDLs perform a similar action (color-coded on the slide)
  18. Architecture - Data Flow: Query (HDFS)
  1. select * from ext_table, with location pxf://<namenode>:<port>/path/to/data
  2. HAWQ master calls getFragments() on PXF (REST)
  3. Fragments are returned as JSON
  4. Master builds the split mapping (fragment -> segment)
  5. Query dispatched to segments 1, 2, 3... (interconnect)
  6. Each segment issues Read() to its local PXF agent (REST)
  7. PXF (Fragmenter/Accessor/Resolver) returns records
  8. Query result records stream back to the master
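Step 4 above, the split mapping, can be sketched as follows (hypothetical logic and names; a plausible policy is to prefer a segment co-located with the fragment's host, falling back to round-robin):

```python
# Sketch of fragment -> segment split mapping (step 4 of the data flow).

def map_fragments(fragments, segments):
    """fragments: list of (fragment_id, host); segments: list of (segment_id, host)."""
    by_host = {}
    for seg_id, host in segments:
        by_host.setdefault(host, []).append(seg_id)
    mapping, rr = {}, 0
    for frag_id, host in fragments:
        local = by_host.get(host)
        if local:
            mapping[frag_id] = local[0]                          # co-located segment
        else:
            mapping[frag_id] = segments[rr % len(segments)][0]   # round-robin fallback
            rr += 1
    return mapping

m = map_fragments([("f1", "dn1"), ("f2", "dn2"), ("f3", "dn9")],
                  [("seg1", "dn1"), ("seg2", "dn2"), ("seg3", "dn3")])
```

Co-locating scans with data is what lets each segment read its fragments from the local DataNode instead of pulling them across the network.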
  19. PXF Plugins, Profiles
  • Built-in with HAWQ (profiles)
  • HDFS: HDFSTextSimple (R/W), HDFSTextMulti (R), Avro (R)
  • Hive (R): Hive, HiveRC, HiveText
  • HBase (R): HBase
  • Community (https://bintray.com/big-data/maven/pxf-plugins/view): JSON (HAWQ-178), Cassandra, Accumulo, ...
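A profile is essentially a named bundle of the Fragmenter/Accessor/Resolver classes for one source. An illustrative Python registry (the class names below are illustrative and the dict-based lookup is an assumption, not how PXF configures profiles on the server):

```python
# Sketch of a profile registry: profile name -> plugin classes.

PROFILES = {
    "HdfsTextSimple": {"fragmenter": "HdfsDataFragmenter",
                       "accessor":   "LineBreakAccessor",
                       "resolver":   "StringPassResolver"},
    "Hive":           {"fragmenter": "HiveDataFragmenter",
                       "accessor":   "HiveAccessor",
                       "resolver":   "HiveResolver"},
}

def plugins_for(profile):
    """Resolve a PROFILE=... option to its three plugin classes."""
    try:
        return PROFILES[profile]
    except KeyError:
        raise ValueError(f"unknown profile: {profile}")
```

This is what makes the `PROFILE=Hive` option in the external table's LOCATION clause sufficient: the server expands it into the full plugin configuration.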
  20. Agenda ● Introduction to Hadoop Ecosystem ● Why Postgres SQL on Hadoop ● Current state of SQL on Hadoop - FDW/Big Data wrappers ● PXF - Design & Architecture ➢ Demo ● Benefits of using PXF with FDW ● Q&A
  21. Demo https://github.com/shivzone/pxf_demo
  22. PXF as Big Data Wrapper Abstraction
  ● Implement FDW callback functions that will interact with PXF
  ● Use the enhanced libcurl library - libchurl
  [Diagram: FDW → HTTP (port 51200) → REST API → PXF webapp in Apache Tomcat → Java APIs of the underlying stores]
  23. Agenda ● Introduction to Hadoop Ecosystem ● Why Postgres SQL on Hadoop ● Current state of SQL on Hadoop - FDW/Big Data wrappers ● PXF - Design & Architecture ● Demo ➢ Benefits of using PXF with FDW ● Q&A
  24. Benefits of using PXF with FDW
  ● FDW isolated from the underlying Hadoop ecosystem APIs
  ● Direct access to HDFS data
  ● Access Hive data without the overhead of the underlying execution framework
  ● Access HBase data without a mapped Hive table
  ● Supports single-node & parallel execution
  ● Extensibility/ease of building extensions
  ● Support for multiple versions of underlying distributions
  ● Built-in filter pushdown and support for stats
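Filter pushdown, the last benefit above, means the wrapper ships a serialized predicate to the external system so rows are filtered during the scan rather than after transfer. A minimal sketch (the wire format and function names are invented for illustration; they are not PXF's actual filter encoding):

```python
# Sketch of filter pushdown: encode a predicate, apply it on the "remote" side.
import operator

OPS = {"=": "eq", "<": "lt", ">": "gt"}
FNS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

def serialize_filter(column, op, value):
    """Encode e.g. ('price', '<', 10) as the string 'price:lt:10'."""
    return f"{column}:{OPS[op]}:{value}"

def apply_filter(rows, column, op, value):
    """What the remote scanner does with the pushed-down predicate."""
    fn = FNS[op]
    return [r for r in rows if fn(r[column], value)]

rows = [{"name": "apple", "price": 1.0}, {"name": "pear", "price": 12.0}]
f = serialize_filter("price", "<", 10)
cheap = apply_filter(rows, "price", "<", 10.0)
```

Only the matching rows cross the wire, which is the whole point of pushing the filter down.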
  25. Resources
  ● GitHub: https://github.com/apache/incubator-hawq/tree/master/pxf
  ● Documentation: http://hawq.docs.pivotal.io/docs-hawq/topics/PivotalExtensionFrameworkPXF.html
  ● Wiki: https://cwiki.apache.org/confluence/display/HAWQ/PXF
  26. Q & A
