Complex stories about Sqooping PostgreSQL data

Presentation slide for Sqoop User Meetup 10/28/2013
(Strata + Hadoop World NYC 2013)

NTT DATA Corporation
OSS Professional Services
Masatake Iwasaki

Transcript

  • 1. Presentation slide for Sqoop User Meetup (Strata + Hadoop World NYC 2013)
    Complex stories about Sqooping PostgreSQL data
    10/28/2013, NTT DATA Corporation, Masatake Iwasaki
    Copyright © 2013 NTT DATA Corporation
  • 2. Introduction
  • 3. About Me
    Masatake Iwasaki: Software Engineer @ NTT DATA
      NTT (Nippon Telegraph and Telephone Corporation): telecommunications
      NTT DATA: systems integrator
    Developed:
      Ludia: full-text search index for PostgreSQL using Senna
    Authored:
      "A Complete Primer for Hadoop" (no official English title)
    Patches for Sqoop:
      SQOOP-390: PostgreSQL connector for direct export with pg_bulkload
      SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM
      SQOOP-1155: Sqoop 2 documentation for connector development
  • 4. Why PostgreSQL?
    Enterprise-ready from earlier versions, compared to MySQL
    Active community in Japan
    NTT DATA commits itself to its development
  • 5. Sqooping PostgreSQL data
  • 6. Working around PostgreSQL
    Direct connector for the PostgreSQL loader:
      SQOOP-390: PostgreSQL connector for direct export with pg_bulkload
    Yet another direct connector via PostgreSQL JDBC:
      SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM
    Supporting complex data types:
      SQOOP-1149: Support Custom Postgres Types
  • 7. Direct connector for the PostgreSQL loader
    SQOOP-390: PostgreSQL connector for direct export with pg_bulkload
    pg_bulkload:
      Data loader for PostgreSQL
      Server-side plug-in library and client-side command
      Provides filtering and transformation of data
      http://pgbulkload.projects.pgfoundry.org/
  • 8. SQOOP-390: PostgreSQL connector for direct export with pg_bulkload
    [Diagram: each HDFS file split is read by a mapper, which runs pg_bulkload
    as an external process to load a per-mapper staging table (tmp1, tmp2, tmp3)
    in PostgreSQL; a single reducer then merges the staging tables into the
    destination table.]
    A staging table per mapper is a must due to table-level locks:
      CREATE TABLE tmp3 (LIKE dest INCLUDING CONSTRAINTS)
    Merge step in the reducer:
      BEGIN
      INSERT INTO dest ( SELECT * FROM tmp1 )
      DROP TABLE tmp1
      INSERT INTO dest ( SELECT * FROM tmp2 )
      DROP TABLE tmp2
      INSERT INTO dest ( SELECT * FROM tmp3 )
      DROP TABLE tmp3
      COMMIT
  • 9. Direct connector for the PostgreSQL loader
    Pros:
      Fast, by short-circuiting server functionality
      Flexible: can filter out error records
    Cons:
      Not so fast: the bottleneck is not on the client side but on the DB side,
      and the built-in COPY functionality is fast enough
      Not general: pg_bulkload supports only export
      Requires setup on all slave nodes and the client node
      May require recovery on failure
  • 10. Yet another direct connector via PostgreSQL JDBC
    PostgreSQL provides a custom SQL command for data import/export:
      COPY table_name [ ( column_name [, ...] ) ]
          FROM { 'filename' | STDIN }
          [ [ WITH ] ( option [, ...] ) ]
      COPY { table_name [ ( column_name [, ...] ) ] | ( query ) }
          TO { 'filename' | STDOUT }
          [ [ WITH ] ( option [, ...] ) ]
      where option can be one of:
          FORMAT format_name
          OIDS [ boolean ]
          DELIMITER 'delimiter_character'
          NULL 'null_string'
          HEADER [ boolean ]
          QUOTE 'quote_character'
          ESCAPE 'escape_character'
          FORCE_QUOTE { ( column_name [, ...] ) | * }
          FORCE_NOT_NULL ( column_name [, ...] )
          ENCODING 'encoding_name'
    and a JDBC API for it: org.postgresql.copy.*
  • 11. SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM
    [Diagram: each HDFS file split is read by a mapper, which streams rows
    through the PostgreSQL JDBC driver into a staging table in PostgreSQL; the
    staging table is then moved into the destination table.]
    Uses a custom SQL command via the JDBC API, only available in PostgreSQL:
      COPY FROM STDIN WITH ...
  • 12. SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM
      // Requires a PostgreSQL-specific interface:
      import org.postgresql.copy.CopyManager;
      import org.postgresql.copy.CopyIn;
      ...
      protected void setup(Context context) {
        ...
        dbConf = new DBConfiguration(conf);
        CopyManager cm = null;
        ...

      public void map(LongWritable key, Writable value, Context context) {
        ...
        if (value instanceof Text) {
          line.append(System.getProperty("line.separator"));
        }
        try {
          // Just feeding lines of text
          byte[] data = line.toString().getBytes("UTF-8");
          copyin.writeToCopy(data, 0, data.length);
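The mapper above streams raw lines of text into COPY, so each record must already be in COPY's default text format: tab-separated fields, \N for NULL, and backslash escapes for control characters. A minimal self-contained sketch of that formatting (class and method names are illustrative, not Sqoop's actual code):

```java
// Illustrative sketch of PostgreSQL COPY text-format encoding.
// Names are ours, not taken from Sqoop.
public class CopyTextFormat {

    // Escape one field; null becomes \N, COPY's text-format null marker.
    public static String escapeField(String field) {
        if (field == null) {
            return "\\N";
        }
        return field.replace("\\", "\\\\")
                    .replace("\t", "\\t")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    // Join escaped fields with tabs, the default COPY text delimiter.
    public static String formatRow(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            sb.append(escapeField(fields[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatRow("1", "a\tb", null));
    }
}
```

Feeding such rows line by line to the `writeToCopy` call shown above is, in essence, what the connector's mapper does.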
  • 13. Yet another direct connector via PostgreSQL JDBC
    Pros:
      Fast enough
      Easy to use: the JDBC driver jar is distributed automatically by the MR framework
    Cons:
      Depends on a non-generic JDBC interface:
        possible licensing issue (PostgreSQL is OK, it's BSD licensed)
        build-time requirement (the PostgreSQL JDBC driver is available in the Maven repo)
          <dependency org="org.postgresql" name="postgresql"
                      rev="${postgresql.version}" conf="common->default" />
      An error record causes rollback of the whole transaction
      Still difficult to implement a custom connector for IMPORT,
      because of the code generation part
  • 14. Supporting complex data types
    PostgreSQL supports a lot of complex data types:
      Geometric types: points, line segments, boxes, paths, polygons, circles
      Network address types: inet, cidr, macaddr
      XML type
      JSON type
    Supporting complex data types:
      SQOOP-1149: Support Custom Postgres Types (not me)
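When such a value crosses JDBC, it often survives only as its text representation, so a connector that wants structured data has to parse PostgreSQL's output format itself. As an illustration, the point type prints as (x,y); a hedged sketch of parsing it (illustrative code, not taken from SQOOP-1149):

```java
// Illustrative sketch: parse PostgreSQL's text representation of the
// "point" geometric type, e.g. "(1.5,2.5)". Not from SQOOP-1149.
public class PgPointParser {

    // Returns {x, y} for a point literal, or throws on malformed input.
    public static double[] parsePoint(String text) {
        String trimmed = text.trim();
        if (!trimmed.startsWith("(") || !trimmed.endsWith(")")) {
            throw new IllegalArgumentException("not a point literal: " + text);
        }
        String[] parts = trimmed.substring(1, trimmed.length() - 1).split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("not a point literal: " + text);
        }
        return new double[] { Double.parseDouble(parts[0].trim()),
                              Double.parseDouble(parts[1].trim()) };
    }

    public static void main(String[] args) {
        double[] p = parsePoint("(1.5,2.5)");
        System.out.println(p[0] + " " + p[1]);
    }
}
```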
  • 15. Constraints on JDBC data types in the Sqoop framework
      protected Map<String, Integer> getColumnTypesForRawQuery(String stmt) {
        ...
        results = execute(stmt);
        ...
        ResultSetMetaData metadata = results.getMetaData();
        for (int i = 1; i < cols + 1; i++) {
          int typeId = metadata.getColumnType(i);
          // returns java.sql.Types.OTHER for types not mappable to basic
          // Java data types => losing type information
        ...

      public String toJavaType(int sqlType) {
        // Mappings taken from:
        // http://java.sun.com/j2se/1.3/docs/guide/jdbc/getstart/mapping.html
        if (sqlType == Types.INTEGER) {
          return "Integer";
        } else if (sqlType == Types.VARCHAR) {
          return "String";
        ....
        } else {  // <- reaches here
          // TODO(aaron): Support DISTINCT, ARRAY, STRUCT, REF, JAVA_OBJECT.
          // Return null indicating database-specific manager should return a
          // java data type if it can find one for any nonstandard type.
          return null;
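The fallback above can be condensed into a small runnable sketch: standard JDBC types map to Java class names, while java.sql.Types.OTHER (what the driver reports for point, inet, and the like) falls through to null, so the type information is lost unless a database-specific manager supplies its own mapping. This is a reduced illustration, not Sqoop's full mapping table:

```java
import java.sql.Types;

// Condensed illustration of the mapping shown above: standard JDBC
// types map to Java class names; nonstandard types (reported as
// java.sql.Types.OTHER) fall through to null, losing type information.
public class JdbcTypeMapping {

    public static String toJavaType(int sqlType) {
        if (sqlType == Types.INTEGER) {
            return "Integer";
        } else if (sqlType == Types.VARCHAR) {
            return "String";
        } else {
            // PostgreSQL's complex types (point, inet, json, ...) reach here;
            // a database-specific manager must step in, if it can.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toJavaType(Types.INTEGER));
        System.out.println(toJavaType(Types.OTHER));
    }
}
```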
  • 16. Sqoop 1 Summary
    Pros:
      Simple standalone MapReduce driver
      Easy to understand for MR application developers,
      except for the ORM (SqoopRecord) code generation part
      Variety of connectors
      Lots of information available
    Cons:
      Complex command line and inconsistent options:
      the meaning of options varies across connectors
      Not modular enough
      Dependency on the JDBC data model
      Security
  • 17. Sqooping PostgreSQL Data 2
  • 18. Sqoop 2
    Everything is rewritten
    Works on the server side
    More modular
    Not compatible with Sqoop 1 at all
    (Almost) only a generic connector
    A black box compared to Sqoop 1; needs more documentation:
      SQOOP-1155: Sqoop 2 documentation for connector development
        Internal of Sqoop2 MapReduce Job
        ++++++++++++++++++++++++++++++++
        ...
        - OutputFormat invokes Loader's load method (via SqoopOutputFor ..
        todo: sequence diagram like figure.
  • 19. Sqoop 2: Initialization phase of an IMPORT job
    [Sequence diagram: SqoopInputFormat.getSplits calls
    Partitioner.getPartitions (implement this); the returned Partition objects
    are wrapped into SqoopSplit objects.]
  • 20. Sqoop 2: Map phase of an IMPORT job
    [Sequence diagram: SqoopMapper.run creates a MapDataWriter and invokes
    Extractor.extract (implement this); the extractor reads from the DB and
    calls write*, which converts records to the Sqoop internal data format
    (Data) and passes them to context.write.]
  • 21. Sqoop 2: Reduce phase of an EXPORT job
    [Sequence diagram: SqoopNullOutputFormat creates a
    SqoopOutputFormatLoadExecutor, which hands out a SqoopRecordWriter via
    getRecordWriter and starts a ConsumerThread that runs Loader.load
    (implement this); the reducer writes records into a shared Data object
    (setContent), and the loader reads them back (getContent) and writes them
    into the DB.]
  • 22. Summary
  • 23. My interests in popularizing Sqoop 2
    Complex data type support in Sqoop 2
    A bridge to use Sqoop 1 connectors on Sqoop 2
    A bridge to use Sqoop 2 connectors from the Sqoop 1 CLI
  • 24. Copyright © 2013 NTT DATA Corporation
