
Big Migrations: Moving elephant herds by Carlos Izquierdo


Talk page: https://www.bigdataspain.org/2016/program/fri-big-migrations-moving-elephant-herds.html

Video: https://www.youtube.com/watch?v=oLLHfMJ_aXA&list=PL6O3g23-p8Tr5eqnIIPdBD_8eE5JBDBik&index=48&t=8s



  1. Big Migrations: Moving elephant herds
  2. Motivation
     ● Everybody wants to jump into Big Data
     ● Everybody wants their new setup to be cheap
       – Cloud is an excellent option for this
     ● These environments generally start as a PoC
       – They should be re-implemented
       – Sometimes they are not
  3. Motivation
     ● You may need to move your Hadoop cluster
       – You want to reduce costs
       – You need more performance
       – Because of corporate policy
       – For legal reasons
     ● But moving big data volumes is a problem!
       – Example: 20 TB at 100 MB/s ≈ 200,000 seconds, roughly 2 ½ days
  4. Initial idea
     ● Set up a second cluster in the new environment
     ● The new cluster is initially empty
     ● We need to populate it
  5. Classic UNIX methods
     ● Well-known file transfer technologies:
       – (s)FTP
       – rsync
       – NFS + cp
     ● You need to set up a staging area
     ● This acts as an intermediate space between Hadoop and the classic UNIX world
       (see the sketch after slide 7)
  6. Classic UNIX methods (diagram)
  7. Classic UNIX methods
     ● Disadvantages:
       – Needs a big staging area
       – Transfer times are slow
       – Single nodes act as bottlenecks
       – Metadata needs to be copied separately
       – Everything must be stopped during the copy to avoid data loss
       – Total downtime: several hours or days (don't even try if your data is bigger)
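
     A minimal sketch of what the staging-area approach looks like in practice; the
     paths, hostnames and rsync options are illustrative assumptions, not taken from
     the talk:

        # On an edge node of the old cluster: export from HDFS to local staging
        hdfs dfs -get /data/events /staging/events

        # Ship the staging copy to the new environment (hostname is a placeholder)
        rsync -av /staging/events/ new-edge.example.com:/staging/events/

        # On an edge node of the new cluster: load the files back into HDFS
        hdfs dfs -put /staging/events /data/events

     Every byte crosses a single edge node twice, which is exactly the bottleneck the
     slide calls out.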
  8. Using Amazon S3
     ● AWS S3 storage is also an option for staging
     ● Cheaper than VM disks
     ● Available almost everywhere
     ● An access key is needed
       – Create a user with S3 permissions only
     ● Transfer is done using distcp
       – (We'll see more about this later)
  9. Using Amazon S3 (diagram)
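
     A hedged illustration of the S3 staging route, assuming the s3a connector; the
     bucket name and key values are placeholders (in practice the keys would normally
     live in core-site.xml rather than on the command line):

        # Push from the old cluster into an S3 staging bucket
        hadoop distcp \
          -Dfs.s3a.access.key=AKIA... \
          -Dfs.s3a.secret.key=... \
          /data/events s3a://migration-staging/events

        # Later, pull from S3 into the new cluster
        hadoop distcp \
          -Dfs.s3a.access.key=AKIA... \
          -Dfs.s3a.secret.key=... \
          s3a://migration-staging/events /data/events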
  10. Distcp
      ● Distcp copies data between two Hadoop clusters
      ● No staging area needed (Hadoop native)
      ● High throughput
      ● Metadata needs to be copied separately
      ● Clusters need to be connected
        – Via VPN for the hdfs protocol
        – NAT can be used with webhdfs
      ● Kerberos complicates matters
  11. Distcp (diagram)
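
      A hedged example of the direct cluster-to-cluster copy; hostnames and paths are
      placeholders, and the ports are the Hadoop 2 defaults (8020 for the NameNode
      RPC, 50070 for webhdfs):

         # Over the VPN, using the native hdfs protocol (run on the destination cluster)
         hadoop distcp -update -p hdfs://old-nn.example.com:8020/data /data

         # Through NAT, webhdfs (plain HTTP) can be used instead of the RPC port
         hadoop distcp -update webhdfs://old-nn.example.com:50070/data /data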
  12. Remote cluster access
      ● As a side note, remote filesystems can also be used outside distcp
      ● For example, as the LOCATION for Hive tables
      ● While we're at it...
      ● We can transform data
        – For example, convert files to Parquet
      ● Is this the right time?
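
      A sketch of that idea, with an invented table schema and hostname: an external
      table whose data still lives on the old cluster, plus a Parquet copy materialised
      on the new one.

         hive -e "
           CREATE EXTERNAL TABLE events_remote (id BIGINT, payload STRING)
           LOCATION 'hdfs://old-nn.example.com:8020/data/events';

           CREATE TABLE events_parquet STORED AS PARQUET
           AS SELECT * FROM events_remote;
         "

      Whether to pay that transformation cost in the middle of a migration is the
      slide's "Is this the right time?" question.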
  13. Extending Hadoop
      ● Do like the caterpillar!
      ● We want to step onto the new platform while the old one keeps working
  14. Requirements
      ● Install servers in the new platform
        – Enough to hold ALL the data
        – Same OS + configuration as the original platform
        – Configuration management tools are helpful for this
      ● Set up connectivity
        – A VPN (private networking) is needed
      ● Rack-aware configuration: new nodes need to be on a new rack
      ● System times and time zones should be consistent
  15. Requirements (diagram)
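
      In plain Hadoop, rack awareness is usually driven by a topology script referenced
      from net.topology.script.file.name in core-site.xml (cluster managers expose the
      same thing in their UI). A minimal sketch, with invented hostname patterns and
      rack names:

         #!/bin/bash
         # Print one rack per argument; old nodes keep their rack, new-site nodes are
         # reported as a separate rack so HDFS places replicas on both sites.
         for host in "$@"; do
           case "$host" in
             old-node*) echo "/old-site/rack1" ;;
             new-node*) echo "/new-site/rack1" ;;
             *)         echo "/default-rack"   ;;
           esac
         done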
  16. Starting the copy
      ● New nodes will have a DataNode role
      ● No computing yet (YARN, Impala, etc.)
      ● DataNode roles will be stopped at first
      ● When started:
        – If the original platform has only one rack, the copy process begins immediately
        – If the original has more than one rack, manual intervention is required
          (see the sketch after slide 20)
  17-20. Starting the copy (diagram sequence)
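
      Standard HDFS admin tools are enough to watch the copy progress; the setrep trick
      at the end is one possible form of the "manual intervention" mentioned for
      multi-rack originals, not necessarily the one used in the talk (the path and
      replication factors are placeholders):

         hdfs dfsadmin -report        # per-DataNode capacity and usage, old vs. new nodes
         hdfs fsck / | tail -n 20     # summary, including under-replicated block counts

         # Possible manual nudge: temporarily raise the replication factor so extra
         # copies land on the new rack, then lower it back to the original value.
         hdfs dfs -setrep -w 4 /data
         hdfs dfs -setrep 3 /data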
  21. Transfer speed
      ● Two parameters affect the data transfer speed:
        – dfs.datanode.balance.bandwidthPerSec
        – dfs.namenode.replication.work.multiplier.per.iteration
      ● No jobs are launched on the new nodes
        – Data flow is almost exclusively the copy
  22. Transfer speed (diagram)
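
      Hedged example values for those two knobs (the numbers are illustrative, not from
      the talk):

         # Raise the per-DataNode replication/balancing bandwidth to 100 MB/s at runtime
         # (this updates dfs.datanode.balance.bandwidthPerSec without a restart)
         hdfs dfsadmin -setBalancerBandwidth 104857600

         # In hdfs-site.xml, let the NameNode hand out more replication work per
         # heartbeat iteration (default is 2; changing it normally needs a NameNode restart):
         #   <property>
         #     <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
         #     <value>10</value>
         #   </property>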
  23. Moving master roles
      ● When possible, take advantage of HA:
        – ZooKeeper (just add two)
        – NameNode
        – ResourceManager
      ● Others need to be migrated manually:
        – The Hive metastore DB needs to be copied
        – Having a DNS name for the DB helps
  24. Moving master roles (diagram)
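
      For the Hive metastore, a hedged sketch assuming a MySQL-backed metastore
      (database, user and host names are invented); the DNS alias is what lets the
      Hive configuration stay untouched:

         # Dump the metastore database on the old site...
         mysqldump --single-transaction -u hive -p metastore > metastore.sql

         # ...load it on the new database host...
         mysql -h new-db.example.com -u hive -p metastore < metastore.sql

         # ...then repoint the DNS alias (e.g. metastore-db.example.com) at the new
         # host, so javax.jdo.option.ConnectionURL in hive-site.xml does not change.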
  25. Moving data I/O
      ● Once the data is copied (fully or mostly), new computation roles will be deployed:
        – NodeManager
        – Impalad
      ● These roles will be stopped at first
      ● Auxiliary nodes (front-end, app nodes, etc.) need to be deployed in the new platform
      ● A planned intervention (at a low-usage time) needs to take place
  26. Moving data I/O (diagram)
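
      A hedged pre-cutover check that the data really does have replicas on the new
      rack before scheduling the intervention (the path and rack name reuse the
      placeholders from the topology-script sketch above):

         # fsck with -racks annotates each replica location with its rack
         hdfs fsck /data -files -blocks -locations -racks > fsck-report.txt

         # Rough count of replicas already sitting on the new site's rack
         grep -c "/new-site/rack1" fsck-report.txt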
  27. During the intervention
      ● The cluster is stopped
      ● If necessary, client configuration is redeployed
      ● Services are started and tested in this order:
        – ZooKeeper
        – HDFS
        – YARN (only on the new platform)
        – Impala (only on the new platform)
      ● Auxiliary services on the new platform are tested
      ● Green light? Change the DNS for the entry points
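
      A minimal smoke-test sequence matching that start order; hostnames are
      placeholders, and zookeeper-client is the Cloudera packaging of the upstream
      zkCli.sh:

         zookeeper-client -server new-zk1.example.com:2181 ls /    # ZooKeeper answers
         hdfs dfsadmin -safemode get                               # NameNode out of safe mode?
         hdfs dfs -ls /data                                        # data visible on the new platform
         yarn node -list                                           # NodeManagers registered
         impala-shell -i new-impalad.example.com -q "SELECT 1"     # Impala daemons up

         # Green light? Point the entry-point DNS records (HiveServer2, Impala,
         # edge services) at the new platform.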
  28. Final picture (diagram)
  29. Conclusions and afterthoughts
      ● Minimal downtime, similar to non-Hadoop planned works
      ● Data and service are never at risk
      ● Hadoop tools are used to solve a Hadoop problem
      ● No user impact: no change in data or access
      ● Kerberos is not an issue (same REALM + KDC)
  30. Thank you!
      Carlos Izquierdo
      cizquierdo@datatons.com
      www.datatons.com
