PostgreSQL Federation

As more alternative data stores come into use, using and reporting on the data scattered across them becomes increasingly difficult. PostgreSQL has a feature called Foreign Data Wrappers that allows external data sources to be queried from PostgreSQL as if they were standard tables. Using Foreign Data Wrappers, users can create a report that joins data residing in Oracle, Hadoop and MongoDB in a single query.

In this talk, we'll discuss how to set up a Foreign Data Wrapper for various data sources and the pros and cons of using them. We'll also discuss the growing ecosystem of Foreign Data Wrappers and a little about how to write one.

Published in: Technology

  1. 1. Federated PostgreSQL
  2. 2. Who Am I? ● Jim Mlodgenski – jimm@openscg.com – @jim_mlodgenski ● Co-organizer of – NYC PUG (www.nycpug.org) – Philly PUG (www.phlpug.org) ● CTO, OpenSCG – www.openscg.com
  3. 3. http://nyc.pgconf.us
  4. 4. What is a federated database? “A federated database system is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralized. ... There is no actual data integration in the constituent disparate databases as a result of data federation.” -Wikipedia
  5. 5. How does PostgreSQL do it? ● Uses Foreign Data Wrappers (FDW) ● Used with SQL/MED – New ANSI SQL 2003 Extension – Management of External Data – Standard way of handling remote objects in SQL databases ● Wrappers used by SQL/MED to access remote data sources
  6. 6. Types of Foreign Data Wrappers ● SQL ● NoSQL ● File ● Miscellaneous ● PostgreSQL
  7. 7. SQL Wrappers ● Oracle ● SQLite ● MySQL ● JDBC ● Informix ● ODBC ● Firebird
  8. 8. SQL Wrappers

          CREATE SERVER oracle_server
            FOREIGN DATA WRAPPER oracle_fdw
            OPTIONS (dbserver 'ORACLE_DBNAME');

          CREATE USER MAPPING FOR CURRENT_USER
            SERVER oracle_server
            OPTIONS (user 'scott', password 'tiger');

          CREATE FOREIGN TABLE fdw_test (
            userid   numeric,
            username text,
            email    text
          )
          SERVER oracle_server
          OPTIONS (schema 'scott', table 'fdw_test');

          postgres=# select * from fdw_test;
           userid | username |      email
          --------+----------+------------------
                1 | scott    | scott@oracle.com
          (1 row)
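The SQL wrapper slide above follows a three-statement pattern that recurs for most wrappers in this talk: a server, a user mapping, and a foreign table. As a rough sketch of that pattern, the Python below renders the three statements from plain data. The helper names (`sql_literal`, `options_clause`, `fdw_setup_sql`) are hypothetical and not part of any wrapper; identifiers are assumed to be trusted and are inserted unquoted, and only option values are quoted.

```python
from typing import Mapping, Sequence, Tuple


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def options_clause(options: Mapping[str, str]) -> str:
    """Render " OPTIONS (key 'value', ...)", or an empty string if there are none."""
    if not options:
        return ""
    body = ", ".join(f"{key} {sql_literal(val)}" for key, val in options.items())
    return f" OPTIONS ({body})"


def fdw_setup_sql(
    server: str,
    wrapper: str,
    server_options: Mapping[str, str],
    user_options: Mapping[str, str],
    table: str,
    columns: Sequence[Tuple[str, str]],
    table_options: Mapping[str, str],
) -> list:
    """Return the three DDL statements in the order they have to be run.

    Assumption: identifiers are trusted and inserted as-is; only option
    values go through sql_literal().
    """
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return [
        f"CREATE SERVER {server} FOREIGN DATA WRAPPER {wrapper}"
        f"{options_clause(server_options)};",
        f"CREATE USER MAPPING FOR CURRENT_USER SERVER {server}"
        f"{options_clause(user_options)};",
        f"CREATE FOREIGN TABLE {table} ({column_list}) SERVER {server}"
        f"{options_clause(table_options)};",
    ]


# Values taken from the Oracle example on the slide.
statements = fdw_setup_sql(
    server="oracle_server",
    wrapper="oracle_fdw",
    server_options={"dbserver": "ORACLE_DBNAME"},
    user_options={"user": "scott", "password": "tiger"},
    table="fdw_test",
    columns=[("userid", "numeric"), ("username", "text"), ("email", "text")],
    table_options={"schema": "scott", "table": "fdw_test"},
)
for statement in statements:
    print(statement)
```

The same function covers wrappers that take no user-mapping options by passing an empty mapping, in which case the OPTIONS clause is simply omitted.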
  9. 9. NoSQL Wrappers ● MongoDB ● Redis ● CouchDB ● Neo4j ● MonetDB ● Tycoon
  10. 10. NoSQL Wrappers

          CREATE SERVER mongo_server
            FOREIGN DATA WRAPPER mongo_fdw
            OPTIONS (address '192.168.122.47', port '27017');

          CREATE FOREIGN TABLE databases (
            _id  NAME,
            name TEXT
          )
          SERVER mongo_server
          OPTIONS (database 'mydb', collection 'pgData');

          test=# select * from databases ;
                     _id            |    name
          --------------------------+------------
           52fd49bfba3ae4ea54afc459 | mongo
           52fd49bfba3ae4ea54afc45a | postgresql
           52fd49bfba3ae4ea54afc45b | oracle
           52fd49bfba3ae4ea54afc45c | mysql
           52fd49bfba3ae4ea54afc45d | redis
           52fd49bfba3ae4ea54afc45e | db2
          (6 rows)
  11. 11. File Wrappers ● Delimited files ● Fixed length files ● JSON files
  12. 12. File Wrappers

          CREATE SERVER pg_load FOREIGN DATA WRAPPER file_fdw;

          CREATE FOREIGN TABLE leads (
            first_name   text,
            last_name    text,
            company_name text,
            address      text,
            city         text,
            county       text,
            state        text,
            zip          text,
            phone1       text,
            phone2       text,
            email        text,
            web          text
          )
          SERVER pg_load
          OPTIONS (filename '/tmp/us-500.csv', format 'csv', header 'TRUE');

          test=# select first_name || ' ' || last_name as full_name, email from leads limit 3;
               full_name     |             email
          -------------------+-------------------------------
           James Butt        | jbutt@gmail.com
           Josephine Darakjy | josephine_darakjy@darakjy.org
           Art Venere        | art@venere.org
          (3 rows)
  13. 13. Miscellaneous Wrappers ● Hadoop ● LDAP ● S3 ● WWW ● PG-Strom
  14. 14. Hadoop Wrapper

          CREATE SERVER hive_server
            FOREIGN DATA WRAPPER hive_fdw
            OPTIONS (address '127.0.0.1', port '10000');

          CREATE USER MAPPING FOR PUBLIC SERVER hive_server;

          CREATE FOREIGN TABLE order_line (
            ol_w_id        integer,
            ol_d_id        integer,
            ol_o_id        integer,
            ol_number      integer,
            ol_i_id        integer,
            ol_delivery_d  timestamp,
            ol_amount      decimal(6,2),
            ol_supply_w_id integer,
            ol_quantity    decimal(2,0),
            ol_dist_info   varchar(24)
          )
          SERVER hive_server
          OPTIONS (table 'order_line');

          INSERT INTO item_sale_month
          SELECT ol_i_id as i_id,
                 EXTRACT(YEAR FROM ol_delivery_d) as year,
                 EXTRACT(MONTH FROM ol_delivery_d) as month,
                 sum(ol_amount) as amount
            FROM order_line
           GROUP BY 1, 2, 3;
  15. 15. Hadoop Wrapper ● Hadoop foreign tables can also be writable

          CREATE FOREIGN TABLE audit (
            audit_id bigint,
            event_d  timestamp,
            "table"  varchar,
            action   varchar,
            "user"   varchar
          )
          SERVER hive_server
          OPTIONS (table 'audit', flume_port '44444');

          INSERT INTO audit
          VALUES (nextval('audit_id_seq'), now(), 'users', 'SELECT', 'scott');
  16. 16. Hadoop Wrapper ● It also works with HBase tables

          CREATE FOREIGN TABLE hive_hbase_table (
            key   varchar,
            value varchar
          )
          SERVER localhive
          OPTIONS (table 'hbase_table',
                   hbase_address 'localhost',
                   hbase_port '9090',
                   hbase_mapping ':key,cf:val');

          INSERT INTO hive_hbase_table VALUES ('key1', 'value1');
          INSERT INTO hive_hbase_table VALUES ('key2', 'value2');
          UPDATE hive_hbase_table SET value = 'update' WHERE key = 'key2';
          DELETE FROM hive_hbase_table WHERE key='key1';
          SELECT * from hive_hbase_table;
  17. 17. WWW Wrapper

          CREATE SERVER www_fdw_server_google_search
            FOREIGN DATA WRAPPER www_fdw
            OPTIONS (uri 'https://ajax.googleapis.com/ajax/services/search/web?v=1.0');

          CREATE USER MAPPING FOR current_user SERVER www_fdw_server_google_search;

          CREATE FOREIGN TABLE www_fdw_google_search (
            q                  text,
            GsearchResultClass text,
            unescapedUrl       text,
            url                text,
            visibleUrl         text,
            cacheUrl           text,
            title              text,
            titleNoFormatting  text,
            content            text
          )
          SERVER www_fdw_server_google_search;

          select url,substring(title,1,25)||'...',substring(content,1,25)||'...'
            from www_fdw_google_search where q='postgresql fdw';
                                       url                             |           ?column?           |           ?column?
          -------------------------------------------------------------+------------------------------+------------------------------
           http://wiki.postgresql.org/wiki/Foreign_data_wrappers       | Foreign data wrappers - <... | Jan 24, 2014 <b>...</b> 1...
           http://www.postgresql.org/docs/9.3/static/postgres-fdw.html | <b>PostgreSQL</b>: Docume... | F.31.1. <b>FDW</b> Option...
           http://www.postgresql.org/docs/9.3/static/fdwhandler.html   | <b>PostgreSQL</b>: Docume... | Foreign Data Wrapper Call...
           http://www.craigkerstiens.com/2013/08/05/a-look-at-FDWs/    | A look at Foreign Data Wr... | Aug 5, 2013 <b>...</b> An...
          (4 rows)
  18. 18. PostgreSQL Wrapper ● The most functional FDW by far ● Replaces much of the functionality of dblink ● Shipped as a contrib module
  19. 19. PostgreSQL Wrapper

          CREATE SERVER postgres_server
            FOREIGN DATA WRAPPER postgres_fdw
            OPTIONS (host 'localhost', port '5432', dbname 'test2');

          CREATE USER MAPPING FOR PUBLIC SERVER postgres_server;

          CREATE FOREIGN TABLE bird_strikes (
            aircraft_type varchar, airport varchar, altitude varchar,
            aircraft_model varchar, num_wildlife_struck varchar,
            impact_to_flight varchar, effect varchar, location varchar,
            flight_num varchar, flight_date timestamp, record_id int,
            indicated_damage varchar, freeform_en_route varchar,
            num_engines varchar, airline varchar, origin_state varchar,
            phase_of_flight varchar, precipitation varchar,
            wildlife_collected boolean, wildlife_sent_to_smithsonian boolean,
            remarks varchar, reported_date timestamp, wildlife_size varchar,
            sky_conditions varchar, wildlife_species varchar,
            when_time_hhmm varchar, time_of_day varchar, pilot_warned varchar,
            cost_out_of_service varchar, cost_other varchar, cost_repair varchar,
            cost_total varchar, miles_from_airport varchar,
            feet_above_ground varchar, num_human_fatalities integer,
            num_injured integer, speed_knots varchar
          )
          SERVER postgres_server
          OPTIONS (table_name 'bird_strikes');
  20. 20. PostgreSQL Wrapper ● Only requests columns that are needed

          test=# explain verbose select airport, flight_date from bird_strikes;
                                            QUERY PLAN
          ------------------------------------------------------------------------------
           Foreign Scan on public.bird_strikes  (cost=100.00..148.40 rows=1280 width=40)
             Output: airport, flight_date
             Remote SQL: SELECT airport, flight_date FROM public.bird_strikes
          (3 rows)
  21. 21. PostgreSQL Wrapper ● Sends a WHERE clause

          test=# explain verbose select airport, flight_date from bird_strikes where flight_date > '2011-01-01';
                                            QUERY PLAN
          ------------------------------------------------------------------------------
           Foreign Scan on public.bird_strikes  (cost=100.00..134.54 rows=427 width=40)
             Output: airport, flight_date
             Remote SQL: SELECT airport, flight_date FROM public.bird_strikes WHERE ((flight_date > '2011-01-01 00:00:00'::timestamp without time zone))
          (3 rows)
  22. 22. PostgreSQL Wrapper ● Sends built-in immutable functions

          test=# explain verbose select airport, flight_date from bird_strikes where flight_date > '2011-01-01' and length(airport) < 10;
                                            QUERY PLAN
          ------------------------------------------------------------------------------
           Foreign Scan on public.bird_strikes  (cost=100.00..135.24 rows=142 width=40)
             Output: airport, flight_date
             Remote SQL: SELECT airport, flight_date FROM public.bird_strikes WHERE ((flight_date > '2011-01-01 00:00:00'::timestamp without time zone)) AND ((length(airport) < 10))
          (3 rows)
  23. 23. PostgreSQL Wrapper ● Writable (INSERT, UPDATE, DELETE)

          test=# explain verbose update bird_strikes set airport = 'Unknown' where record_id = 313339;
                                            QUERY PLAN
          ------------------------------------------------------------------------------
           Update on public.bird_strikes  (cost=100.00..111.05 rows=1 width=964)
             Remote SQL: UPDATE public.bird_strikes SET airport = $2 WHERE ctid = $1
             ->  Foreign Scan on public.bird_strikes  (cost=100.00..111.05 rows=1 width=964)
                   Output: aircraft_type, 'Unknown'::character varying, altitude, aircraft_model, num_wildlife_struck, impact_to_flight, effect, location, flight_num, flight_date, record_id, indicated_damage, freeform_en_route, num_engines, airline, origin_state, phase_of_flight, precipitation, wildlife_collected, wildlife_sent_to_smithsonian, remarks, reported_date, wildlife_size, sky_conditions, wildlife_species, when_time_hhmm, time_of_day, pilot_warned, cost_out_of_service, cost_other, cost_repair, cost_total, miles_from_airport, feet_above_ground, num_human_fatalities, num_injured, speed_knots, ctid
                   Remote SQL: SELECT aircraft_type, altitude, aircraft_model, num_wildlife_struck, impact_to_flight, effect, location, flight_num, flight_date, record_id, indicated_damage, freeform_en_route, num_engines, airline, origin_state, phase_of_flight, precipitation, wildlife_collected, wildlife_sent_to_smithsonian, remarks, reported_date, wildlife_size, sky_conditions, wildlife_species, when_time_hhmm, time_of_day, pilot_warned, cost_out_of_service, cost_other, cost_repair, cost_total, miles_from_airport, feet_above_ground, num_human_fatalities, num_injured, speed_knots, ctid FROM public.bird_strikes WHERE ((record_id = 313339)) FOR UPDATE
          (5 rows)
  24. 24. PostgreSQL Wrapper ● Writes are transactional

          test=# select airport from bird_strikes where record_id = 313339;
           airport
          ---------
           Unknown
          (1 row)

          test=# BEGIN;
          BEGIN
          test=# update bird_strikes set airport = 'UNKNOWN' where record_id = 313339;
          UPDATE 1
          test=# ROLLBACK;
          ROLLBACK
          test=# select airport from bird_strikes where record_id = 313339;
           airport
          ---------
           Unknown
          (1 row)
  25. 25. Limitations ● Aggregates are not pushed down

          test=# explain verbose select count(*) from bird_strikes;
                                            QUERY PLAN
          ------------------------------------------------------------------------------
           Aggregate  (cost=220.92..220.93 rows=1 width=0)
             Output: count(*)
             ->  Foreign Scan on public.bird_strikes  (cost=100.00..212.39 rows=3413 width=0)
                   Output: aircraft_type, airport, altitude, aircraft_model, num_wildlife_struck, impact_to_flight, effect, location, flight_num, flight_date, record_id, indicated_damage, freeform_en_route, num_engines, airline, origin_state, phase_of_flight, precipitation, wildlife_collected, wildlife_sent_to_smithsonian, remarks, reported_date, wildlife_size, sky_conditions, wildlife_species, when_time_hhmm, time_of_day, pilot_warned, cost_out_of_service, cost_other, cost_repair, cost_total, miles_from_airport, feet_above_ground, num_human_fatalities, num_injured, speed_knots
                   Remote SQL: SELECT NULL FROM public.bird_strikes
          (5 rows)
  26. 26. Limitations ● ORDER BY, GROUP BY, LIMIT not pushed down

          test=# explain verbose select flight_num from bird_strikes order by flight_date limit 5;
                                            QUERY PLAN
          ------------------------------------------------------------------------------
           Limit  (cost=169.66..169.67 rows=5 width=40)
             Output: flight_num, flight_date
             ->  Sort  (cost=169.66..172.86 rows=1280 width=40)
                   Output: flight_num, flight_date
                   Sort Key: bird_strikes.flight_date
                   ->  Foreign Scan on public.bird_strikes  (cost=100.00..148.40 rows=1280 width=40)
                         Output: flight_num, flight_date
                         Remote SQL: SELECT flight_num, flight_date FROM public.bird_strikes
          (8 rows)
  27. 27. Limitations ● Joins not pushed down

          test=# explain verbose select s.name, b.flight_date
          test-# from bird_strikes b, state_code s
          test-# where b.location = s.abbreviation and flight_date > '2011-01-01';
                                            QUERY PLAN
          ------------------------------------------------------------------------------
           Hash Join  (cost=239.88..349.95 rows=1986 width=40)
             Output: s.name, b.flight_date
             Hash Cond: ((s.abbreviation)::text = (b.location)::text)
             ->  Foreign Scan on public.state_code s  (cost=100.00..137.90 rows=930 width=64)
                   Output: s.id, s.name, s.abbreviation, s.country, s.type, s.sort, s.status, s.occupied, s.notes, s.fips_state, s.assoc_press, s.standard_federal_region, s.census_region, s.census_region_name, s.census_division, s.census_devision_name, s.circuit_court
                   Remote SQL: SELECT name, abbreviation FROM public.state_code
             ->  Hash  (cost=134.54..134.54 rows=427 width=40)
                   Output: b.flight_date, b.location
                   ->  Foreign Scan on public.bird_strikes b  (cost=100.00..134.54 rows=427 width=40)
                         Output: b.flight_date, b.location
                         Remote SQL: SELECT location, flight_date FROM public.bird_strikes WHERE ((flight_date > '2011-01-01 00:00:00'::timestamp without time zone))
          (11 rows)
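The practical cost of the pushdown limitations above is the number of rows that have to cross the network before the local server can sort, limit, aggregate, or join them. The sketch below illustrates that for the ORDER BY ... LIMIT 5 case. It is an analogy only: an in-memory SQLite table stands in for the remote PostgreSQL server, the rows are generated, and nothing here reflects how postgres_fdw is implemented. A `flight_num` tie-breaker is added to the sort (not present in the slide's query) so both variants return the same rows deterministically.

```python
import sqlite3


def make_remote(n_rows: int = 1000) -> sqlite3.Connection:
    """Build an in-memory table that plays the role of the remote server."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bird_strikes (flight_num TEXT, flight_date TEXT)")
    conn.executemany(
        "INSERT INTO bird_strikes VALUES (?, ?)",
        [
            (f"F{i:04d}", f"2011-{i % 12 + 1:02d}-{i % 28 + 1:02d}")
            for i in range(n_rows)
        ],
    )
    return conn


remote = make_remote()

# Shape of the plan on the slide: an unfiltered remote scan, followed by
# Sort and Limit executed on the local side.
shipped = remote.execute(
    "SELECT flight_num, flight_date FROM bird_strikes"
).fetchall()
local_top5 = sorted(shipped, key=lambda row: (row[1], row[0]))[:5]

# Hypothetical pushed-down variant: the remote side sorts and limits,
# so only the final rows are transferred.
pushed_top5 = remote.execute(
    "SELECT flight_num, flight_date FROM bird_strikes "
    "ORDER BY flight_date, flight_num LIMIT 5"
).fetchall()

print(f"local sort and limit: {len(shipped)} rows transferred")
print(f"pushed down:          {len(pushed_top5)} rows transferred")
```

Both variants produce the same five rows; the difference is only in how much data is shipped to get there, which is why a selective WHERE clause (which is pushed down) matters so much on large foreign tables.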
  28. 28. Limitations (Gotcha) ● Sometimes the foreign tables don't act like tables

          test=# SELECT l.*, w.lat, w.lng FROM leads l, www_fdw_geocoder_google w WHERE w.address = l.address || ',' || l.city || ',' || l.state;
           first_name | last_name | company_name | address | city | county | state | zip | phone1 | phone2 | email | web | lat | lng
          ------------+-----------+--------------+---------+------+--------+-------+-----+--------+--------+-------+-----+-----+-----
          (0 rows)
  29. 29. Limitations (Gotcha)

                                            QUERY PLAN
          ------------------------------------------------------------------------------
           Merge Join  (cost=187.47..215.47 rows=1000 width=448)
             Output: l.first_name, l.last_name, l.company_name, l.address, l.city, l.county, l.state, l.zip, l.phone1, l.phone2, l.email, l.web, w.lat, w.lng
             Merge Cond: ((((((l.address || ','::text) || l.city) || ','::text) || l.state)) = w.address)
             ->  Sort  (cost=37.64..38.14 rows=200 width=384)
                   Output: l.first_name, l.last_name, l.company_name, l.address, l.city, l.county, l.state, l.zip, l.phone1, l.phone2, l.email, l.web, (((((l.address || ','::text) || l.city) || ','::text) || l.state))
                   Sort Key: (((((l.address || ','::text) || l.city) || ','::text) || l.state))
                   ->  Foreign Scan on public.leads l  (cost=0.00..30.00 rows=200 width=384)
                         Output: l.first_name, l.last_name, l.company_name, l.address, l.city, l.county, l.state, l.zip, l.phone1, l.phone2, l.email, l.web, ((((l.address || ','::text) || l.city) || ','::text) || l.state)
                         Foreign File: /tmp/us-500.csv
                         Foreign File Size: 81485
             ->  Sort  (cost=149.83..152.33 rows=1000 width=96)
                   Output: w.lat, w.lng, w.address
                   Sort Key: w.address
                   ->  Foreign Scan on public.www_fdw_geocoder_google w  (cost=0.00..100.00 rows=1000 width=96)
                         Output: w.lat, w.lng, w.address
                         WWW API: Request
          (16 rows)
  30. 30. Limitations (Gotcha)

          CREATE OR REPLACE FUNCTION google_geocode(
              OUT first_name text, OUT last_name text, OUT company_name text,
              OUT address text, OUT city text, OUT county text, OUT state text,
              OUT zip text, OUT phone1 text, OUT phone2 text, OUT email text,
              OUT web text, OUT lat text, OUT lng text)
          RETURNS SETOF RECORD AS $$
          DECLARE
              r     record;
              f_adr text;
              l_lat text;
              l_lng text;
          BEGIN
              FOR r IN SELECT * FROM leads LOOP
                  f_adr := r.address || ',' || r.city || ',' || r.state;
                  EXECUTE 'SELECT lat, lng FROM www_fdw_geocoder_google WHERE address = $1'
                     INTO l_lat, l_lng
                    USING f_adr;
                  SELECT r.first_name, r.last_name, r.company_name, r.address,
                         r.city, r.county, r.state, r.zip, r.phone1, r.phone2,
                         r.email, r.web, l_lat, l_lng
                    INTO first_name, last_name, company_name, address,
                         city, county, state, zip, phone1, phone2,
                         email, web, lat, lng;
                  RETURN NEXT;
              END LOOP;
          END
          $$ LANGUAGE plpgsql;
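One reading of the gotcha above is that a web-API-backed foreign table behaves like a lookup function: it can only answer when it is handed a concrete address. In the merge join plan each side is scanned independently, so the geocoder scan receives no address and returns nothing, whereas the plpgsql loop supplies one address per call. The Python sketch below models that behavior. Everything in it is a stand-in: `geocoder_scan`, the sample leads, and the coordinates are invented for illustration and do not come from www_fdw or any real API.

```python
from typing import Dict, List, Optional, Tuple

# Invented placeholder data; a real wrapper would issue an HTTP request here.
FAKE_GEOCODER: Dict[str, Tuple[str, str]] = {
    "1 Main St,Springfield,IL": ("39.80", "-89.64"),
    "2 Oak Ave,Albany,NY": ("42.65", "-73.75"),
}

LEADS: List[Dict[str, str]] = [
    {"first_name": "Ann", "address": "1 Main St", "city": "Springfield", "state": "IL"},
    {"first_name": "Bob", "address": "2 Oak Ave", "city": "Albany", "state": "NY"},
]


def geocoder_scan(address: Optional[str] = None) -> List[Dict[str, str]]:
    """Scan the lookup-style table; with no address there is nothing to ask for."""
    if address is None or address not in FAKE_GEOCODER:
        return []
    lat, lng = FAKE_GEOCODER[address]
    return [{"address": address, "lat": lat, "lng": lng}]


def full_address(lead: Dict[str, str]) -> str:
    """Build the same 'address,city,state' key the SQL join condition uses."""
    return f"{lead['address']},{lead['city']},{lead['state']}"


def join_as_planned() -> List[Dict[str, str]]:
    """Scan both sides independently, then match rows, as the merge join does."""
    geocoded = geocoder_scan()  # the join condition never reaches the scan
    return [
        {**lead, "lat": geo["lat"], "lng": geo["lng"]}
        for lead in LEADS
        for geo in geocoded
        if geo["address"] == full_address(lead)
    ]


def join_per_row() -> List[Dict[str, str]]:
    """Look up one lead at a time, as the plpgsql loop does."""
    results = []
    for lead in LEADS:
        for geo in geocoder_scan(address=full_address(lead)):
            results.append({**lead, "lat": geo["lat"], "lng": geo["lng"]})
    return results


print(len(join_as_planned()), "rows from the join as planned")
print(len(join_per_row()), "rows from the per-row lookup")
```

The per-row version trades one set-based query for one request per lead, which is the price of forcing the address into every scan.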
  31. 31. Writing a new FDW ● Might not need to write one if there is an HTTP interface ● Use the Blackhole FDW as a template – https://bitbucket.org/adunstan/blackhole_fdw
  32. 32. Writing a new FDW

          Datum
          blackhole_fdw_handler(PG_FUNCTION_ARGS)
          {
              ...
              /* these are required */
              fdwroutine->GetForeignRelSize = blackholeGetForeignRelSize;
              fdwroutine->GetForeignPaths = blackholeGetForeignPaths;
              fdwroutine->GetForeignPlan = blackholeGetForeignPlan;
              fdwroutine->BeginForeignScan = blackholeBeginForeignScan;
              fdwroutine->IterateForeignScan = blackholeIterateForeignScan;
              fdwroutine->ReScanForeignScan = blackholeReScanForeignScan;
              fdwroutine->EndForeignScan = blackholeEndForeignScan;

              /* remainder are optional - use NULL if not required */
              /* support for insert / update / delete */
              fdwroutine->AddForeignUpdateTargets = blackholeAddForeignUpdateTargets;
              fdwroutine->PlanForeignModify = blackholePlanForeignModify;
              fdwroutine->BeginForeignModify = blackholeBeginForeignModify;
              fdwroutine->ExecForeignInsert = blackholeExecForeignInsert;
              fdwroutine->ExecForeignUpdate = blackholeExecForeignUpdate;
              fdwroutine->ExecForeignDelete = blackholeExecForeignDelete;
              fdwroutine->EndForeignModify = blackholeEndForeignModify;

              /* support for EXPLAIN */
              fdwroutine->ExplainForeignScan = blackholeExplainForeignScan;
              fdwroutine->ExplainForeignModify = blackholeExplainForeignModify;

              /* support for ANALYSE */
              fdwroutine->AnalyzeForeignTable = blackholeAnalyzeForeignTable;

              PG_RETURN_POINTER(fdwroutine);
          }
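The required scan callbacks in the handler above form a simple lifecycle: begin the scan, fetch rows one at a time until none are left, optionally restart, then clean up. As a loose illustration of that lifecycle only, here is a small Python model. It is not the C API: `ListScan` and `run_scan` are hypothetical names, rows are plain tuples rather than tuple slots, and `None` stands in for the "no more rows" signal.

```python
from typing import Iterable, List, Optional


class ListScan:
    """Toy scan over an in-memory list, shaped like the FDW scan callbacks."""

    def __init__(self, rows: Iterable[tuple]):
        self._source = list(rows)
        self._position = 0
        self.open = False

    def begin(self) -> None:
        """Role of BeginForeignScan: set up per-scan state."""
        self._position = 0
        self.open = True

    def iterate(self) -> Optional[tuple]:
        """Role of IterateForeignScan: next row, or None when exhausted."""
        if self._position >= len(self._source):
            return None
        row = self._source[self._position]
        self._position += 1
        return row

    def rescan(self) -> None:
        """Role of ReScanForeignScan: restart from the first row."""
        self._position = 0

    def end(self) -> None:
        """Role of EndForeignScan: release per-scan state."""
        self.open = False


def run_scan(scan: ListScan) -> List[tuple]:
    """Drive a scan the way a caller would: begin, iterate to the end, end."""
    scan.begin()
    rows = []
    try:
        while True:
            row = scan.iterate()
            if row is None:
                break
            rows.append(row)
    finally:
        scan.end()
    return rows


# A "blackhole" source has no rows at all, so the scan ends immediately.
print(run_scan(ListScan([])))
print(run_scan(ListScan([(1, "a"), (2, "b")])))
```

A scan over an empty source is the Python analogue of the blackhole template: every callback exists, but iteration finishes on the first call, which is what makes it a convenient starting skeleton.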
  33. 33. Future ● Even more Wrappers ● Check Constraints on Foreign Tables – Allows partitioning ● Joins – Custom Scan API – Probably will not be the way to do this, but progress is being made
  34. 34. Questions? jimm@openscg.com @jim_mlodgenski
