
Big Migrations: Moving elephant herds by Carlos Izquierdo


Published in: Technology


  1. Big Migrations: Moving elephant herds
  2. Motivation
     ● Everybody wants to jump into Big Data
     ● Everybody wants their new setup to be cheap
       – The cloud is an excellent option for this
     ● These environments generally start as a PoC
       – They should be re-implemented
       – Sometimes they are not
  3. Motivation
     ● You may need to move your Hadoop cluster:
       – To reduce costs
       – To gain performance
       – Because of corporate policy
       – For legal reasons
     ● But moving big data volumes is a problem!
       – Example: 20 TB at 100 MB/s (a saturated gigabit link) takes about 2 ½ days
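As a sanity check on such estimates, here is a minimal sketch (plain shell + awk, no Hadoop needed). Note how sensitive the result is to throughput: 20 TB takes roughly 2.3 days at 100 MB/s, but over three weeks at a single 10 MB/s stream.

```shell
# Estimate transfer time in days: (TB * 1e6 MB) / (MB/s) / 86400 s.
estimate_days() {
  awk -v tb="$1" -v mbps="$2" 'BEGIN { printf "%.1f", tb * 1e6 / mbps / 86400 }'
}

estimate_days 20 100   # -> 2.3  (saturated gigabit link)
echo
estimate_days 20 10    # -> 23.1 (a single 10 MB/s stream)
echo
```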
  4. Initial idea
     ● Set up a second cluster in the new environment
     ● The new cluster is initially empty
     ● We need to populate it
  5. Classic UNIX methods
     ● Well-known file transfer technologies:
       – (s)FTP
       – rsync
       – NFS + cp
     ● You need to set up a staging area
     ● This acts as an intermediate space between Hadoop and the classic UNIX world
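A staging-area transfer might look like the sketch below. Every path and host name is a placeholder; the commands are built as strings so the plan can be reviewed before anything is actually run against the cluster.

```shell
# Classic staging-area copy, as three reviewable steps (placeholders only).
staging_plan() {
  src_hdfs="$1"; stage="$2"; remote="$3"
  echo "hdfs dfs -get ${src_hdfs} ${stage}/"       # old HDFS -> staging disk
  echo "rsync -a ${stage}/ ${remote}:${stage}/"    # staging -> new site
  echo "hdfs dfs -put ${stage}/* ${src_hdfs}"      # staging -> new HDFS
}

staging_plan /data/warehouse /mnt/staging newedge.example.com
```

Note that the staging disk must hold the full dataset, which is exactly the disadvantage the next slide points out.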
  6. Classic UNIX methods (diagram)
  7. Classic UNIX methods
     ● Disadvantages:
       – Needs a big staging area
       – Transfer times are slow
       – Single nodes act as bottlenecks
       – Metadata needs to be copied separately
       – Everything must be stopped during the copy to avoid data loss
       – Total downtime: several hours or days (don't even try if your data is bigger)
  8. Using Amazon S3
     ● AWS S3 storage is also an option for staging
     ● Cheaper than VM disks
     ● Available almost everywhere
     ● An access key is needed
       – Create a user with only S3 permissions
     ● Transfer is done using distcp
       – (We'll see more about this later)
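The two distcp hops via S3 could be sketched as below. The bucket, NameNode hosts, and paths are placeholders; `fs.s3a.access.key` / `fs.s3a.secret.key` are the standard s3a properties, though a credentials file or instance profile is preferable to passing keys on the command line.

```shell
# Two-hop migration through S3, printed for review (placeholders only).
s3_plan() {
  nn="$1"; bucket="$2"; path="$3"
  common="-Dfs.s3a.access.key=\$AWS_KEY -Dfs.s3a.secret.key=\$AWS_SECRET"
  echo "hadoop distcp ${common} hdfs://${nn}${path} s3a://${bucket}${path}"  # old cluster -> S3
  echo "hadoop distcp ${common} s3a://${bucket}${path} hdfs://new-nn${path}" # S3 -> new cluster
}

s3_plan old-nn:8020 migration-bucket /data/warehouse
```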
  9. Using Amazon S3 (diagram)
  10. Distcp
     ● Distcp copies data between two Hadoop clusters
     ● No staging area needed (Hadoop native)
     ● High throughput
     ● Metadata needs to be copied separately
     ● Clusters need to be connected
       – Via VPN for the hdfs protocol
       – NAT can be used with webhdfs
     ● Kerberos complicates matters
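A direct cluster-to-cluster copy might be sketched as follows (host names and paths are placeholders; 50070 is the default Hadoop 2 webhdfs port). `-update` lets an interrupted copy resume, and `-p` preserves file attributes such as ownership and permissions.

```shell
# Direct distcp between clusters, printed for review (placeholders only).
# hdfs:// needs full connectivity (VPN) and matching RPC versions;
# webhdfs:// is HTTP-based, so it tolerates NAT and version skew.
distcp_plan() {
  echo "hadoop distcp -update -p hdfs://old-nn:8020$1 hdfs://new-nn:8020$1"
  echo "# ...or, through NAT / across versions:"
  echo "hadoop distcp -update -p webhdfs://old-nn:50070$1 hdfs://new-nn:8020$1"
}

distcp_plan /data/warehouse
```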
  11. Distcp (diagram)
  12. Remote cluster access
     ● As a side note, remote filesystems can also be used outside distcp
     ● For example, as the LOCATION for Hive tables
     ● While we're at it...
     ● We can transform data
       – For example, convert files to Parquet
     ● Is this the right time?
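A hypothetical example of that idea: point an external Hive table at the remote cluster's filesystem, converting to Parquet in the same pass. The table, columns, host, and path are all made up for illustration.

```shell
# Hive DDL pointing a table at the *remote* cluster (placeholders only);
# the INSERT both moves the data and rewrites it as Parquet.
remote_table_ddl() {
cat <<'SQL'
CREATE EXTERNAL TABLE events_remote (id BIGINT, payload STRING)
  STORED AS PARQUET
  LOCATION 'hdfs://new-nn:8020/data/warehouse/events';

INSERT OVERWRITE TABLE events_remote SELECT id, payload FROM events;
SQL
}

remote_table_ddl
```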
  13. Extending Hadoop
     ● Do like the caterpillar!
     ● We want to step onto the new platform while the old one continues working
  14. Requirements
     ● Install servers in the new platform
       – Enough to hold ALL the data
       – Same OS + config as the original platform
       – Config management tools are helpful for this
     ● Set up connectivity
       – A VPN (private networking) is needed
     ● Rack-aware configuration: new nodes need to be on a new rack
     ● System times and time zones should be consistent
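The rack-aware part can be done with a topology script (the subnets below are examples). Hadoop invokes the script configured in `net.topology.script.file.name` with one or more IPs as arguments and reads back one rack name per IP; putting the whole new site on its own "rack" makes HDFS keep a full replica per site.

```shell
# Sketch of a rack-awareness topology script (example subnets).
rack_of() {
  case "$1" in
    10.0.*)  echo /oldsite/rack1 ;;   # original platform
    10.99.*) echo /newsite/rack1 ;;   # new (cloud) platform
    *)       echo /default-rack ;;
  esac
}

for ip in "$@"; do rack_of "$ip"; done
```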
  15. Requirements (diagram)
  16. Starting the copy
     ● New nodes will have a DataNode role
     ● No computing yet (YARN, Impala, etc.)
     ● DataNode roles will be stopped at first
     ● When started:
       – If there is only one rack in the original platform, the copy process will begin immediately
       – If there is more than one rack in the original, manual intervention will be required
  17. Starting the copy (diagram)
  18. Starting the copy (diagram)
  19. Starting the copy (diagram)
  20. Starting the copy (diagram)
  21. Transfer speed
     ● Two parameters affect the data transfer speed:
       – dfs.datanode.balance.bandwidthPerSec
       –
     ● No jobs are launched on the new nodes
       – Data flow is almost exclusively the copy
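`dfs.datanode.balance.bandwidthPerSec` takes bytes per second; the running value can also be changed cluster-wide without a DataNode restart via `hdfs dfsadmin -setBalancerBandwidth`. A sketch, using 100 MB/s as an example value to tune to what the inter-site link can spare:

```shell
# Raise the per-DataNode replication bandwidth cap to 100 MB/s.
# Printed rather than executed so it can be reviewed first.
bandwidth_bytes=$(( 100 * 1024 * 1024 ))
echo "hdfs dfsadmin -setBalancerBandwidth ${bandwidth_bytes}"
```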
  22. Transfer speed (diagram)
  23. Moving master roles
     ● When possible, take advantage of HA:
       – ZooKeeper (just add two)
       – NameNode
       – ResourceManager
     ● Others need to be migrated manually:
       – The Hive metastore DB needs to be copied
       – Having a DNS name for the DB helps
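For a MySQL-backed metastore, the manual migration might look like the sketch below; host and database names are placeholders. With a DNS alias for the DB, as the slide suggests, clients need no reconfiguration after the switch.

```shell
# Copy the Hive metastore DB to the new site, printed for review
# (placeholder hosts; 'metastore' is an assumed database name).
metastore_plan() {
  echo "mysqldump -h old-db --single-transaction metastore > metastore.sql"
  echo "mysql -h new-db metastore < metastore.sql"
  echo "# then repoint the metastore DNS alias at new-db"
}

metastore_plan
```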
  24. Moving master roles (diagram)
  25. Moving data I/O
     ● Once the data is copied (fully, or most of it), new computation roles will be deployed:
       – NodeManager
       – Impalad
     ● Roles will be stopped at first
     ● Auxiliary nodes (front-end, app nodes, etc.) need to be deployed in the new platform
     ● A planned intervention (at a low-usage time) needs to take place
  26. Moving data I/O (diagram)
  27. During the intervention
     ● The cluster is stopped
     ● If necessary, client configuration is redeployed
     ● Services are started and tested in this order:
       – ZooKeeper
       – HDFS
       – YARN (only on the new platform)
       – Impala (only on the new platform)
     ● Auxiliary services in the new platform are tested
     ● Green light? Change the DNS for the entry points
  28. Final picture (diagram)
  29. Conclusions and afterthoughts
     ● Minimal downtime, similar to non-Hadoop planned maintenance
     ● Data and service are never at risk
     ● Hadoop tools are used to solve a Hadoop problem
     ● No user impact: no change in data or access
     ● Kerberos is not an issue (same realm + KDC)
  30. Thank you! Carlos Izquierdo