• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
New Data Transfer Tools for Hadoop: Sqoop 2
 

New Data Transfer Tools for Hadoop: Sqoop 2

on

  • 9,349 views

 

Statistics

Views

Total Views
9,349
Views on SlideShare
8,115
Embed Views
1,234

Actions

Likes
25
Downloads
0
Comments
0

9 Embeds 1,234

http://www.scoop.it 1039
http://eventifier.co 102
http://eventifier.com 55
https://twitter.com 24
http://hochul.net 7
http://webcache.googleusercontent.com 3
http://www.eventifier.co 2
http://translate.googleusercontent.com 1
http://twitter.com 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    New Data Transfer Tools for Hadoop: Sqoop 2 New Data Transfer Tools for Hadoop: Sqoop 2 Presentation Transcript

    • A  New  GeneraAon  of  Data  Transfer     Tools  for  Hadoop:  Sqoop  2   Bilung  Lee  (blee  at  cloudera  dot  com)   Kathleen  Ting  (kathleen  at  cloudera  dot  com)     Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Who  Are  We?  •  Bilung  Lee   –  Apache  Sqoop  CommiQer   –  So=ware  Engineer,  Cloudera    •  Kathleen  Ting   –  Apache  Sqoop  CommiQer   –  Support  Manager,  Cloudera   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   2   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • What  is  Sqoop?  •  Bulk  data  transfer  tool   –  Import/Export  from/to  relaAonal  databases,   enterprise  data  warehouses,  and  NoSQL  systems   –  Populate  tables  in  HDFS,  Hive,  and  HBase   –  Integrate  with  Oozie  as  an  acAon   –  Support  plugins  via  connector  based  architecture   May ‘09 March ‘10 August ‘11 April ‘12 First version Moved to Moved to Apache(HADOOP-5815) GitHub Apache Top Level Project Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   3   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  1  Architecture   Document Enterprise Based Data Systems Warehouse Relational Databasecommand Hadoop Map Task Sqoop HDFS/HBase/ Hive Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   4   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  1  Challenges  •  CrypAc,  contextual  command  line  arguments  •  Tight  coupling  between  data  transfer  and   output  format    •  Security  concerns  with  openly  shared   credenAals  •  Not  easy  to  manage  installaAon/configuraAon  •  Connectors  are  forced  to  follow  JDBC  model   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   5   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2  Architecture   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   6   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2  Themes  •  Ease  of  Use  •  Ease  of  Extension  •  Security   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   7   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2  Themes  •  Ease  of  Use  •  Ease  of  Extension  •  Security   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   8   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Ease  of  Use  Sqoop  1   Sqoop  2  Client-­‐only  Architecture   Client/Server  Architecture  CLI  based   CLI  +  Web  based  Client  access  to  Hive,  HBase   Server  access  to  Hive,  HBase  Oozie  and  Sqoop  Aghtly  coupled   Oozie  finds  REST  API   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   9   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  1:  Client-­‐side  Tool  •  Client-­‐side  installaAon  +  configuraAon   –  Connectors  are  installed/configured  locally   –  Local  requires  root  privileges   –  JDBC  drivers  are  needed  locally     –  Database  connecAvity  is  needed  locally   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   10   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2:  Sqoop  as  a  Service  •  Server-­‐side  installaAon  +  configuraAon   –  Connectors  are  installed/configured  in  one  place   –  Managed  by  administrator  and  run  by  operator   –  JDBC  drivers  are  needed  in  one  place   –  Database  connecAvity  is  needed  on  the  server   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   11   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Client  Interface  •  Sqoop  1  client  interface:   –  Command  line  interface  (CLI)  based   –  Can  be  automated  via  scripAng  •  Sqoop  2  client  interface:   –  CLI  based  (in  either  interacAve  or  script  mode)   –  Web  based  (remotely  accessible)   –  REST  API  is  exposed  for  external  tool  integraAon   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   12   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  1:  Service  Level  IntegraAon  •  Hive,  HBase   –  Require  local  installaAon  •  Oozie   –  von  Neumann(esque)  integraAon:     •  Package  Sqoop  as  an  acAon   •  Then  run  Sqoop  from  node  machines,  causing  one  MR   job  to  be  dependent  on  another  MR  job   •  Error-­‐prone,  difficult  to  debug   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   13   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2:  Service  Level  IntegraAon  •  Hive,  HBase   –  Server-­‐side  integraAon  •  Oozie   –  REST  API  integraAon   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   14   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Ease  of  Use  Sqoop  1   Sqoop  2  Client-­‐only  Architecture   Client/Server  Architecture  CLI  based   CLI  +  Web  based  Client  access  to  Hive,  HBase   Server  access  to  Hive,  HBase  Oozie  and  Sqoop  Aghtly  coupled   Oozie  finds  REST  API   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   15   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2  Themes  •  Ease  of  Use  •  Ease  of  Extension  •  Security   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   16   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Ease  of  Extension  Sqoop  1   Sqoop  2  Connector  forced  to  follow  JDBC  model   Connector  given  free  rein  Connectors  must  implement  funcAonality   Connectors  benefit  from  common   framework  of  funcAonality  Connector  selecAon  is  implicit   Connector  selecAon  is  explicit     Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   17   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  1:  ImplemenAng  Connectors  •  Connectors  are  forced  to  follow  JDBC  model   –  Connectors  are  limited/required  to  use  common   JDBC  vocabulary  (URL,  database,  table,  etc)  •  Connectors  must  implement  all  Sqoop   funcAonality  they  want  to  support   –  New  funcAonality  may  not  be  available  for   previously  implemented  connectors   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   18   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2:  ImplemenAng  Connectors  •  Connectors  are  not  restricted  to  JDBC  model   –  Connectors  can  define  own  domain  •  Common  funcAonality  are  abstracted  out  of   connectors   –  Connectors  are  only  responsible  for  data  transfer   –  Common  Reduce  phase  implements  data   transformaAon  and  system  integraAon   –  Connectors  can  benefit  from  future  development   of  common  funcAonality   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   19   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Different  OpAons,  Different  Results  Which  is  running  MySQL?  $ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST$ sqoop import --connect jdbc:mysql://localhost/db --driver com.mysql.jdbc.Driver --username foo --table TEST  •  Different  opAons  may  lead  to  unpredictable   results   –  Sqoop  2  requires  explicit  selecAon  of  a  connector,   thus  disambiguaAng  the  process   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   20   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  1:  Using  Connectors  •  Choice  of  connector  is  implicit   –  In  a  simple  case,  based  on  the  URL  in  -­‐-­‐connect   string  to  access  the  database   –  SpecificaAon  of  different  opAons  can  lead  to   different  connector  selecAon   –  Error-­‐prone  but  good  for  power  users   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   21   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  1:  Using  Connectors  •  Require  knowledge  of  database  idiosyncrasies     –  e.g.  Couchbase  does  not  need  to  specify  a  table   name,  which  is  required,  causing  -­‐-­‐table  to  get   overloaded  as  backfill  or  dump  operaAon   –  e.g.  -­‐-­‐null-­‐string  representaAon  is  not  supported   by  all  connectors  •  FuncAonality  is  limited  to  what  the  implicitly   chosen  connector  supports   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   22   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2:  Using  Connectors  •  Users  make  explicit  connector  choice   –  Less  error-­‐prone,  more  predictable  •  Users  need  not  be  aware  of  the  funcAonality   of  all  connectors   –  Couchbase  users  need  not  care  that  other   connectors  use  tables   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   23   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2:  Using  Connectors  •  Common  funcAonality  is  available  to  all   connectors   –  Connectors  need  not  worry  about  common   downstream  funcAonality,  such  as  transformaAon   into  various  formats  and  integraAon  with  other   systems   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   24   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Ease  of  Extension  Sqoop  1   Sqoop  2  Connector  forced  to  follow  JDBC  model   Connector  given  free  rein  Connectors  must  implement  funcAonality   Connectors  benefit  from  common   framework  of  funcAonality  Connector  selecAon  is  implicit   Connector  selecAon  is  explicit     Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   25   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2  Themes  •  Ease  of  Use  •  Ease  of  Extension  •  Security   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   26   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Security  Sqoop  1   Sqoop  2  Support  only  for  Hadoop  security   Support  for  Hadoop  security  and  role-­‐ based  access  control  to  external  systems  High  risk  of  abusing  access  to  external   Reduced  risk  of  abusing  access  to  external  systems   systems  No  resource  management  policy   Resource  management  policy   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   27   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  1:  Security  •  Inherit/Propagate  Kerberos  principal  for  the   jobs  it  launches  •  Access  to  files  on  HDFS  can  be  controlled  via   HDFS  security  •  Limited  support  (user/password)  for  secure   access  to  external  systems   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   28   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2:  Security  •  Inherit/Propagate  Kerberos  principal  for  the   jobs  it  launches  •  Access  to  files  on  HDFS  can  be  controlled  via   HDFS  security  •  Support  for  secure  access  to  external  systems   via  role-­‐based  access  to  connecAon  objects   –  Administrators  create/edit/delete  connecAons   –  Operators  use  connecAons   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   29   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  1:  External  System  Access  •  Every  invocaAon  requires  necessary   credenAals  to  access  external  systems  (e.g.   relaAonal  database)   –  Workaround:  create  a  user  with  limited  access  in   lieu  of  giving  out  password   •  Does  not  scale   •  Permission  granularity  is  hard  to  obtain  •  Hard  to  prevent  misuse  once  credenAals  are   given   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   30   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2:  External  System  Access  •  ConnecAons  are  enabled  as  first-­‐class  objects   –  ConnecAons  encompass  credenAals   –  ConnecAons  are  created  once  and  then  used   many  Ames  for  various  import/export  jobs   –  ConnecAons  are  created  by  administrator  and   used  by  operator   •  Safeguard  credenAal  access  from  end  users  •  ConnecAons  can  be  restricted  in  scope  based   on  operaAon  (import/export)   –  Operators  cannot  abuse  credenAals   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   31   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  1:  Resource  Management  •  No  explicit  resource  management  policy   –  Users  specify  the  number  of  map  jobs  to  run   –  Cannot  throQle  load  on  external  systems   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   32   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Sqoop  2:  Resource  Management  •  ConnecAons  allow  specificaAon  of  resource   management  policy     –  Administrators  can  limit  the  total  number  of   physical  connecAons  open  at  one  Ame   –  ConnecAons  can  also  be  disabled   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   33   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Security  Sqoop  1   Sqoop  2  Support  only  for  Hadoop  security   Support  for  Hadoop  security  and  role-­‐ based  access  control  to  external  systems  High  risk  of  abusing  access  to  external   Reduced  risk  of  abusing  access  to  external  systems   systems  No  resource  management  policy   Resource  management  policy   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   34   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Demo  Screenshots     Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   35   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Demo  Screenshots     Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   36   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Demo  Screenshots     Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   37   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Demo  Screenshots     Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   38   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Demo  Screenshots     Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   39   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Takeaway    Sqoop  2  Highights:   –  Ease  of  Use:  Sqoop  as  a  Service   –  Ease  of  Extension:  Connectors  benefit  from   shared  funcAonality   –  Security:  ConnecAons  as  first-­‐class  objects  and   role-­‐based  security   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   40   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Current  Status:  work-­‐in-­‐progress  •  Sqoop2  Development:    hQp://issues.apache.org/jira/browse/SQOOP-­‐365  •  Sqoop2  Blog  Post:    hQp://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop  •  Sqoop2  Design:    hQp://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   41   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Current  Status:  work-­‐in-­‐progress  •  Sqoop2  Quickstart:    hQp://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Quickstart  •  Sqoop2  Resource  Layout:    hQp://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+-­‐+Resource+Layout  •  Sqoop2  Feature  Requests:    hQp://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Feature+Requests   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   42   Copyright  2012  The  Apache  So=ware  FoundaAon  
    • Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   43  Copyright  2012  The  Apache  So=ware  FoundaAon  
    • ReceptionReception takes place in the Community Showcase (Hall 2) Page 44
    • Sqoop  &  Flume  Meetups  Tonight   6:30pm  (Sqoop)   7:45pm  (Flume)   at   Hilton  San  Jose     (San  Carlos  Conf  Rm)   Hadoop  Summit  2012.  6/13/12  Apache  Sqoop   45   Copyright  2012  The  Apache  So=ware  FoundaAon