
Sqooping 50 Million Rows a Day from MySQL

Sqoop Meetup in NYC 11/7/11 - creating a single data source for analysts to mine.



  1. Sqooping 50 Million Rows a Day from MySQL - Eric Hernandez, Database Administrator
  2. [Diagram] App servers write into Parent_YearMonth_Merge and its ten child tables, Child_YearMonth_0 through Child_YearMonth_9: +50 million new rows a day, +1.5 billion rows a month.
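
The deck shows only table names here, but the Parent_..._Merge naming plus the per-table schema on slide 7 (ID BIGINT auto-increment, miscellaneous columns, Date_Created TIMESTAMP) suggest a MySQL MERGE table over ten identical MyISAM children. A minimal sketch of what that layout might look like; the database name, payload column, and INSERT_METHOD are illustrative guesses, not details from the deck:

    # Hypothetical DDL for one month's tables, assuming a MERGE parent over ten
    # identical MyISAM children; only the table names and the ID/Date_Created
    # columns come from the slides. Add connection options as needed.
    mysql -e "
      CREATE TABLE events.Child_201108_0 (
        ID           BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        payload      VARCHAR(255),
        Date_Created TIMESTAMP DEFAULT CURRENT_TIMESTAMP
      ) ENGINE=MyISAM;
      -- ... Child_201108_1 through Child_201108_9 are created the same way ...
      CREATE TABLE events.Parent_201108_Merge (
        ID           BIGINT NOT NULL AUTO_INCREMENT,
        payload      VARCHAR(255),
        Date_Created TIMESTAMP,
        INDEX (ID)
      ) ENGINE=MERGE UNION=(Child_201108_0, Child_201108_1, Child_201108_9) INSERT_METHOD=LAST; -- list all ten children in practice
    "
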
  3. 3-Month Rotational Life Cycle. [Diagram] Each month's tables start on the MySQL active writer instance (current month, one month ago) and are then rotated to the MySQL archive long-term storage instance (two months ago, three months ago, and so on).
  4. Problem: data analysts have to pull data from two different sources, the MySQL active writer instance (current month, one month ago) and the MySQL archive long-term storage instance (two months ago, three months ago, and so on). One of the goals of our project is to create a single data source for analysts to mine.
  5. Data analysts with Hadoop only have to pull from one data source. [Diagram] The MySQL active writer instance (current month, one month ago) feeds a Hadoop cluster running Hive, which holds all data, current to the last 24 hours.
  6. Attempt 1.0: Sqooping in data from MySQL. Sqoop the entire table into Hive every day at 00:30: Parent_201108_Merge (Child_201108_0 through Child_201108_9) into a Hive table on a 9-node Hadoop cluster with 4 TB of available storage.
     2011-08-01: 5 million rows per table, 2 minutes Sqoop time per table, 20 minutes total time, 50 million rows into the Hive table.
     2011-08-02: 10 million rows per table, 4 minutes Sqoop time per table, 40 minutes total time, 100 million rows into the Hive table.
     2011-08-10: 50 million rows per table, 20 minutes Sqoop time per table, 200 minutes total time, 500 million rows into the Hive table.
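
What the Attempt 1.0 nightly job looked like is not spelled out beyond the description above, so the following is a rough sketch only: a plain full re-import of each child table into one Hive table. The JDBC URL, credentials, Hive table name, mapper count, and the TRUNCATE step are placeholders and assumptions, not details from the deck.

    # Attempt 1.0 (sketch): rebuild the Hive table from scratch every night
    # and re-import each child table in full.
    hive -e "TRUNCATE TABLE event_logs;"   # assumption: start each run from an empty table
    for i in $(seq 0 9); do
      sqoop import \
        --connect jdbc:mysql://mysql-writer.example.com/events \
        --username sqoop --password "${MYSQL_PASSWORD}" \
        --table Child_201108_${i} \
        --hive-import --hive-table event_logs \
        --num-mappers 4
    done
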
  7. Attempt 2.0: Incremental Sqoop of data from MySQL. Child_YearMonth schema: ID BIGINT AUTO_INCREMENT, miscellaneous columns, Date_Created TIMESTAMP; Parent_201108_Merge fronts Child_201108_0 through Child_201108_9.
     sqoop import --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'"
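
Only the --where fragment above appears on the slide; a fuller sketch of how the incremental pull could be wired up, reusing the same placeholder connection details as in the Attempt 1.0 sketch:

    # Attempt 2.0 (sketch): pull only yesterday's rows from each child table,
    # keyed on the Date_Created column from the schema above.
    DATE=$(date +%Y-%m-%d -d "1 day ago")
    for i in $(seq 0 9); do
      sqoop import \
        --connect jdbc:mysql://mysql-writer.example.com/events \
        --username sqoop --password "${MYSQL_PASSWORD}" \
        --table Child_201108_${i} \
        --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" \
        --hive-import --hive-table event_logs \
        --num-mappers 4
    done
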
  8. Attempt 2.0: Incremental Sqoop of data from MySQL. Sqoop Parent_201108_Merge (Child_201108_0 through Child_201108_9) with a where clause covering the last 24 hours of Date_Created into the Hive table on the 9-node Hadoop cluster (4 TB available storage).
     2011-08-01: 5 million rows per table, 2 minutes Sqoop time per table, 10 minutes total time, 50 million rows into the Hive table.
     2011-08-02: 5 million rows per table, 2 minutes Sqoop time per table, 10 minutes total time, 50 million rows into the Hive table.
     2011-08-10: 5 million rows per table, 2 minutes Sqoop time per table, 10 minutes total time, 50 million rows into the Hive table.
     Consistent run times for Sqoop jobs achieved.
  9. After our 2.0 incremental process we had achieved consistent run times; however, two new problems surfaced:
     1) Each day 10 new parts would be added to the Hive table, which caused 10 more map tasks per Hive query.
     2) Space consumption on the Hadoop cluster.
  10. Too many parts and map tasks per query. [Diagram] Each daily Sqoop of Child_201108_0 through Child_201108_9 appends ten more part files to the Hive table (Part-0 through Part-29 after three days).
      For 3 days of data, 30 map tasks must be processed for any Hive query.
      For 30 days of data, 300 map tasks must be processed for any Hive query.
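
A quick way to see the build-up on the cluster is to count the part files under the table's warehouse directory, since a full-table Hive query schedules at least one map task per part; the path below assumes the default Hive warehouse location and the illustrative table name used in the sketches above.

    # Sqoop writes files named part-m-NNNNN; each one becomes a map task for a full scan.
    hadoop fs -ls /user/hive/warehouse/event_logs | grep -c 'part-'
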
  11. [Diagram] Each day, Child_201108_0 through Child_201108_9 are sqooped into a single date partition of the Hive table (dt=2011-08-01, dt=2011-08-02, dt=2011-08-03), each partition holding parts 0 through 9.
      To Sqoop 10 tables into one partition, I chose to dynamically create a partition based on date and Sqoop the data into the partition directory with an append:

      # Set date to yesterday
      DATE=`date +%Y-%m-%d -d "1 day ago"`

      # Table and partition directories
      TABLE_DIR=/user/hive/warehouse/${TABLE}
      PARTITION_DIR=${TABLE_DIR}/${DATE}

      # Create partition
      echo "ALTER TABLE ${TABLE} ADD IF NOT EXISTS PARTITION (dt='${DATE}') location '${PARTITION_DIR}'; exit;" | /usr/bin/hive

      # Sqoop in event_logs
      sqoop import --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" --target-dir "${PARTITION_DIR}" --append
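
The script above assumes ${TABLE} already exists in Hive as a table partitioned on dt; a minimal sketch of that one-time DDL, with illustrative column names and Sqoop's default comma delimiter:

    # One-time setup (assumed, not shown in the deck): a dt-partitioned Hive table
    # that reads Sqoop's default comma-delimited text output.
    hive -e "
      CREATE TABLE event_logs (
        id           BIGINT,
        payload      STRING,
        date_created STRING
      )
      PARTITIONED BY (dt STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    "
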
  12. [Diagram] The same Sqoop flow, now landing each day's ten part files in that day's partition (dt=2011-08-01, dt=2011-08-02, dt=2011-08-03).
      As a result of sqooping into Hive partitions, only a minimal number of map tasks have to be processed:
      1 day = 10 map tasks
      2 days = 20 map tasks
      ...
      30 days = 300 map tasks
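
With the dt partitions in place, any query that filters on dt is pruned to that day's ten part files instead of scanning every part in the table; for example, against the illustrative table above:

    # Partition pruning: this scans only the 10 parts under dt=2011-08-01,
    # so roughly 10 map tasks regardless of how many days are stored.
    hive -e "SELECT COUNT(*) FROM event_logs WHERE dt = '2011-08-01';"
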
  13. Space consumption. 1 month of data = 30 GB; with replication factor 3, 1 replica = 30 GB and 3 replicas = 90 GB in HDFS. 1 year of data at 3 replicas = 1.08 TB in HDFS.
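
The arithmetic behind the slide, plus a hedged way to check usage on a live cluster (hadoop fs -du reports pre-replication bytes, so multiply by the replication factor yourself; the path is the same assumed warehouse directory as above):

    # 30 GB/month raw x 3 replicas = 90 GB/month in HDFS
    # 90 GB/month x 12 months      = 1.08 TB/year in HDFS
    hadoop fs -du /user/hive/warehouse/event_logs
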
  14. Sqooping with Snappy: sqoop import --compression-codec org.apache.hadoop.io.compress.SnappyCodec -z
      With 5:1 Snappy compression, 1 month of data (30 GB raw) takes 1 replica = 6 GB and 3 replicas = 18 GB in HDFS; 1 year of data at 3 replicas = 216 GB in HDFS.
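
Only the -z and --compression-codec flags come from the slide; combined with the partition-directory import from slide 11 and the same placeholder connection details, one child table's compressed pull might look like:

    # Incremental import into the day's partition directory, compressed with Snappy
    # (-z turns compression on, --compression-codec selects the codec).
    sqoop import \
      --connect jdbc:mysql://mysql-writer.example.com/events \
      --username sqoop --password "${MYSQL_PASSWORD}" \
      --table Child_201108_0 \
      --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" \
      --target-dir "${PARTITION_DIR}" --append \
      -z --compression-codec org.apache.hadoop.io.compress.SnappyCodec
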
  15. Summary:
      1) Develop some kind of incremental import when sqooping in large active tables. If you do not, your Sqoop jobs will take longer and longer as the data in the RDBMS grows.
      2) Limit the number of parts stored in HDFS; each part translates into a time-consuming map task, so use partitioning if possible.
      3) Compress data in HDFS. You will save space, since the replication factor makes multiple copies of your data, and you may also benefit in processing: your Map/Reduce jobs have less data to transfer, so Hadoop becomes less I/O bound.
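
Slide 6 mentions the job running every day at 00:30; tying the three summary points together, the nightly import could be driven by a single cron entry that calls a wrapper script around the incremental, partitioned, Snappy-compressed Sqoop steps sketched above (the script name and log path here are hypothetical):

    # m h dom mon dow  command
    30 0 * * * /usr/local/bin/daily_sqoop_import.sh >> /var/log/daily_sqoop_import.log 2>&1
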
  16. ?
