Tips from Hadoop experts for beginners
1. Getting started with Hadoop?
Tips from Hadoop Professionals to help kick start your career
2. “I would like to share my experience with you.
1. I think practice is more important than theory, so do
a quick start, e.g. with the Cloudera QuickStart VM.
2. Start with the basics of installing and configuring
Hadoop using the command line. Once you are familiar
with it, you can move to a GUI such as Ambari or
Cloudera Manager.”
Jin Zhan
Square Enix - Senior Engineer
Japan
3. “Here are some tips, based on things people should
know but that I have seen them get wrong. You
probably know them already - and there are more
than two!
1. You must increase ulimits
http://blog.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/
Mark H. Butler
Software Engineer at
Pataniqa Ltd
Preston, United Kingdom
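The ulimit advice above can be sanity-checked locally with Python's standard `resource` module (Unix only); the persistent fix still belongs in `/etc/security/limits.conf`. The 16384 threshold below is an illustrative figure for a DataNode, not a number taken from the linked post:

```python
import resource

# Read the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft} hard={hard}")

# Hadoop DataNodes keep many files and sockets open, so the default
# soft limit of 1024 on many distributions is far too low.
if soft < 16384:
    print("warning: nofile soft limit may be too low for a DataNode")
```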
4. 2. Installing a NoSQL database? Use the YCSB
benchmark to check it is working correctly
https://github.com/brianfrankcooper/YCSB/wiki
5. 3. Consider using compression (although there are tradeoffs!)
http://comphadoop.weebly.com/
http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of
http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2
http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/
http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/
https://github.com/twitter/hadoop-lzo
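The linked posts benchmark LZO, Snappy, and zlib, none of which ship with Python's standard library; the speed-versus-ratio tradeoff they discuss can still be illustrated with stdlib `zlib` at different compression levels (the sample payload and chosen levels are arbitrary):

```python
import time
import zlib

data = b"hadoop " * 100000  # repetitive sample payload, ~700 KB

for level in (1, 6, 9):
    t0 = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - t0
    ratio = len(data) / len(compressed)
    print(f"level={level} ratio={ratio:.1f}x time={elapsed * 1000:.2f} ms")
```

Higher levels trade CPU time for a smaller output, which is exactly the tradeoff to weigh for an I/O-bound Hadoop job.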
6. 4. Don't install a Hadoop cluster manually - there are many
technologies to automate it, e.g. Puppet, Chef, Ansible, Vagrant
http://blog.godatadriven.com/bare-metal-hadoop-provisioning-ansible-cobbler.html
http://chimpler.wordpress.com/2013/01/20/deploying-hadoop-on-ec2-with-whirr/
http://java.dzone.com/articles/setting-hadoop-virtual-cluster
http://www.diversit.eu/2012/05/setting-up-hadoop-cluster-using-puppet.html
http://www.rpark.com/2013/02/using-chef-to-build-out-hadoop-cluster.html
7. 5. Java and Scala are great, but don't overlook Python -
it's handy for prototyping one-off MapReduce jobs, as
you do not need a cluster to test.
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
Hope that helps!”
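The streaming mapper and reducer from the linked tutorial can be mimicked with plain functions, which is exactly why no cluster is needed to test the logic. A minimal word-count sketch (function names are illustrative):

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    # Emit (word, 1) for every word, as a Hadoop Streaming mapper
    # would write key/value pairs to stdout.
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(pairs):
    # Hadoop sorts mapper output by key before the reduce phase;
    # locally we sort ourselves, then sum the counts per word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    sample = ["hello hadoop", "hello world"]
    print(dict(reducer(mapper(sample))))
    # {'hadoop': 1, 'hello': 2, 'world': 1}
```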
8. “Technically speaking, MapReduce is the base, and Map
= Select and Reduce = Group By. So if you know what
you want and how you want to summarize it, then
Hadoop is meant for you.”
Piyush Jindal
Software Engineer at Target
Bengaluru, Karnataka, India
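That SQL analogy can be made concrete: the map step projects each record to a key/value pair (SELECT), and the reduce step aggregates per key (GROUP BY ... SUM). A sketch with made-up order records:

```python
from collections import defaultdict

orders = [
    {"city": "Bengaluru", "amount": 120},
    {"city": "Tokyo", "amount": 80},
    {"city": "Bengaluru", "amount": 50},
]

# Map ~ SELECT city, amount: project each record to a key/value pair.
mapped = [(o["city"], o["amount"]) for o in orders]

# Reduce ~ GROUP BY city with SUM(amount): aggregate values per key.
totals = defaultdict(int)
for city, amount in mapped:
    totals[city] += amount

print(dict(totals))
# {'Bengaluru': 170, 'Tokyo': 80}
```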
9. “Tips:
1. Good knowledge of data structures and the insight to
analyze data is a must.
2. Core Java and the Collections framework are a must.
3. SQL and PL/SQL knowledge for solving complex
scenarios will help a lot.
These are the stepping stones to approaching a problem
in Big Data and providing a solution as well.”
Somanath Nanda
Cloudera Certified Developer for Hadoop
Cognizant Technology Solutions
Bengaluru, Karnataka, India
10. “1. Audit your data to identify what might be useful but
unexploited. 2. Study new technologies; they are
moving rapidly.”
Merv Adrian
Vice President at Gartner
San Francisco Bay Area
11. “Some good examples in this whitepaper (note:
registration required):
http://www.mongodb.com/lp/big-data”
Mat Keep
Principal Product Marketing Manager at MongoDB Inc.
Hawkinge, Kent, United Kingdom
12. “Here are some tips, in no specific order.
1. The best value from Hadoop comes from the combination of
software and hardware designed for your specific needs.
2. The hardware configuration of your cluster is very important. If
your workload is I/O bound then disk specs matter most; if it is CPU
bound then faster CPUs are better; and if the application is memory
bound then servers with more memory are needed.
Mohit Saxena
Vice President - Technology, Founder at InMobi - A Global
Mobile Ad Network
Bengaluru Area, India
13. 3. Network connectivity between nodes is extremely important: at
least a 1-gigabit NIC per node is a must in a Hadoop cluster, so that
inter-node communication isn't a bottleneck; it can be a
huge drag.
4. Plan the size of storage and disk controllers according to the
reads per second you want to achieve from each server.
5. Ganglia is a fairly good monitoring tool for Hadoop, and it can
point out bottlenecks.”
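Tip 4 above is back-of-the-envelope arithmetic. A hedged sizing sketch, where the replication factor is HDFS's default of 3 but the overhead and per-node disk figures are purely illustrative, not recommendations:

```python
import math


def nodes_needed(raw_tb, replication=3, overhead=1.25, disk_tb_per_node=24):
    """Estimate the DataNode count for a given raw data size.

    replication: HDFS default replication factor (3).
    overhead: headroom for intermediate/temp data (illustrative 25%).
    disk_tb_per_node: usable disk per server (illustrative).
    """
    total_tb = raw_tb * replication * overhead
    # Round up to whole nodes.
    return math.ceil(total_tb / disk_tb_per_node)


print(nodes_needed(100))  # 100 TB raw -> 375 TB stored -> 16
```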
14. For more information on the best Hadoop courses for your career,
check out the link below:
http://www.dezyre.com/Big-Data-and-Hadoop/19