The growing popularity of Hadoop has led to an increasing number of clusters worldwide, often multiple within the same organization. However, to leverage this computing capability, a cluster must first be primed with data. Frequently, this entails uploading existing client repositories into a remote cluster. Such a move can be challenging for the following reasons:
* size: the volume of data to be transferred can be very large. Typically, enterprises do not consider adopting Big Data technologies until they are actively experiencing pain because their current system cannot handle the existing volume. By that point, their data has usually grown to significant levels and, consequently, is much more difficult to move.
* networks: if the target cluster is remote, one option is to move data over wide area networks. This presents hurdles in terms of limited available throughput, bandwidth and security, and transferring large volumes this way can be very time consuming. A special case arises when the source and destination clusters reside in the same data center but belong to different organizations; this scenario requires a different set of specialized skills to set up a network architecture that allows data to flow.
* lack of domain knowledge & tools: little is widely understood about the various approaches for bulk data uploads to a Hadoop cluster. In addition, commonly used data transfer tools such as scp, ftp and rsync do not directly interface with HDFS, and direct alternatives are scarce. While there are tools to facilitate cluster-to-cluster copies, using them across organizations and multiple Hadoop versions is challenging.
* security: data is particularly vulnerable during transit. Being able to safely transport high volume data across organizational boundaries and networks demands thorough understanding of security protocols and practices.
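As a brief illustration of the tooling gap above, the commands below sketch how an upload typically has to go through Hadoop's own CLI rather than scp or rsync, and how DistCp can bridge clusters; all hostnames, ports and paths here are hypothetical placeholders, not a prescription.

```shell
# scp/rsync cannot write to HDFS directly; a local repository is
# instead pushed through the Hadoop filesystem CLI:
hadoop fs -put /data/repo /user/ingest/repo

# Cluster-to-cluster copy with DistCp. Reading the source over the
# version-independent webhdfs:// scheme is one common way to cope
# with source and target clusters running different Hadoop releases:
hadoop distcp webhdfs://source-nn:50070/user/ingest/repo \
              hdfs://target-nn:8020/user/ingest/repo
```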
In this talk, we present a number of techniques and best practices for uploading large quantities of data to a remote Hadoop cluster. Our presentation draws on real-world experience moving such data on behalf of various clients. Topics covered will include DistCp, S3, physical disks, Flume and Kafka.