2. Hadoop Modules
HDFS (Hadoop Distributed File System): Stores the actual data across a
distributed file system
Hadoop MapReduce: Module for parallel processing of large data sets, part of
Hadoop since its first version
Hadoop YARN: Job scheduling and cluster resource management framework
Introduced in Hadoop 2.0; in Hadoop 1.0, scheduling was handled within
MapReduce
Hadoop Common: Common utilities supporting other Hadoop modules, e.g.,
administration
Others: Zookeeper, Oozie, Hue, Sqoop, Flume, Kafka, Hive, Pig, Spark etc.
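To make the MapReduce module's "parallel processing of large data sets" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example. It does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster the framework would run many map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = [
        "Hadoop stores data in HDFS",
        "YARN schedules jobs",
        "Hadoop runs MapReduce jobs",
    ]
    print(word_count(sample))
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase only on its own key, both steps can be spread across machines; that independence is what the module exploits.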
3. Main Hadoop Distributions
Cloudera
Founded in 2008 by a group of engineers from Yahoo, Google, and Facebook
Largest user base so far
Core distribution based on Apache Hadoop
Also offers the proprietary Cloudera Management Suite to automate installation, plus other services for ease of use
Hortonworks
Founded in 2011
Only vendor with a completely open source distribution
Innovations such as YARN
MapR
Standard open source edition comes with a number of restrictions
Users are eventually led toward vendor distributions
Replaces HDFS with its own proprietary file system, MapRFS, which adds enterprise-grade features and ease of
use
5. Cloudera – Cloudera Manager
Run Cloudera Manager to enable visual administration at any scale
sudo ~/cloudera-manager --force
Cloudera Manager menu
The following options will be available to enable
Provides visual dashboard for all modules:
Hosts, flume, hbase, hdfs, hive, hue, impala, ks-indexer, oozie, sentry, solr, spark, sqoop, yarn,
zookeeper, ...
Can start, stop, restart, or rolling-restart services with a click
Health and configuration issues are flagged, and recent commands are logged
60 day free trial if you want to check it out
Menu bar: Home, Clusters, Hosts, Diagnostics, Audit, Charts, Administration
6. Cloudera Products
Cloudera Express
Free download combining CDH with Cloudera Manager
Provides robust cluster management capabilities like automated deployment, centralized
administration, monitoring, and diagnostic tools
Cloudera Enterprise
In addition to CDH, provides advanced system management and data management tools
Includes dedicated support from Cloudera
Cloudera Director
Includes Cloudera Enterprise functionality and extends the enterprise data hub architecture to
the cloud
8. Hortonworks Pearls
Only distribution that can run without a VM (Virtual Machine) on Windows
Open source; you will not be led to purchase eventually
Similar to Cloudera
Both have been enterprise-ready distributions for a while
Both have established communities to consult
Differences
Hortonworks is open source; Cloudera offers a 60-day free trial
Both work on Windows, but Hortonworks runs natively; a Windows-based cluster can be
deployed on Windows Azure using the HDInsight service
Cloudera has Cloudera Manager, Impala (SQL query interface), and Cloudera Search.
Hortonworks has Ambari, Stinger, and Apache Solr, respectively
9. MapR
MapR differs from the other two in using its own proprietary file system, MapRFS
mapr.com
10. MapR Details
Standard open source edition comes with a number of restrictions
Vendor distributions aim to cover these issues (so you will have to move to
a vendor distribution over time)
Through a partnership with Canonical (creator of Ubuntu), MapR is offered as a
default component of the Ubuntu operating system, starting with the MapR M3 Edition
Up to the M3 Edition, MapR is free, but the free version lacks some proprietary features
such as JobTracker HA, NameNode HA, NFS-HA, mirroring, and snapshots
MapR M5 Edition and later are not free but provide 24/7 support under an annual
subscription model
11. Three Distributions
MapR
If you can afford it and do not mind an approach that differs from Apache Hadoop,
consider MapR
Provides a complete stack
Cloudera
Based on open source Apache Hadoop with proprietary tools
Like MapR, provides both free and paid distributions, the latter with extra features
and support
Hortonworks
Only commercial vendor to provide a completely open source Hadoop
Hortonworks intentionally has not developed proprietary software and uses
open source tools like Ambari, Stinger, and Solr
12. So Which One?
Your goal should be to figure out the best choice for your
business; there is not a single right choice
Good news: all provide free versions, so you can try them out
If you do, I suggest checking our benchmarking efforts and
developing your own tests
All three offer consulting, training, and technical
assistance
Consider added value according to your customer base
13. What to pay attention to?
If you are looking at existing benchmarking studies, you need to note
that
You need to understand the experiment setting and parameters more than
the results
Performance can be altered by using different data sets, different cluster
sizes, different numbers of virtual machines, etc.
Your typical workload can be very different from the ones used in the study
Try to use your own workload as the basis for your analysis
You should stress test your results
Do you expect to have extreme workloads?
How critical is it if you do?
How much slowdown, approximation, etc. can you tolerate?
Can you generate realistic samples?
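The advice above (use your own workload, vary the data size, and decide how much slowdown you can tolerate) can be sketched as a small timing harness. This is only an illustrative sketch: sample_workload is a hypothetical stand-in that you would replace with a call that submits your real job to the cluster under test, and the sizes and the max_slowdown threshold are assumptions to be set from your own requirements.

```python
import random
import statistics
import time


def sample_workload(records):
    """Hypothetical stand-in workload: sort the records and sum the smallest tenth."""
    ordered = sorted(records)
    return sum(ordered[: len(ordered) // 10 + 1])


def benchmark(workload, sizes, repeats=5, seed=0):
    """Time the workload at several data sizes; report median and worst run in seconds."""
    rng = random.Random(seed)  # fixed seed so the generated data is reproducible
    results = {}
    for size in sizes:
        data = [rng.random() for _ in range(size)]
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            workload(list(data))  # fresh copy so every run sees the same input
            timings.append(time.perf_counter() - start)
        results[size] = {
            "median_s": statistics.median(timings),
            "worst_s": max(timings),
        }
    return results


def within_tolerance(results, baseline_size, stress_size, max_slowdown):
    """Check that the worst stress run is at most max_slowdown times the baseline median."""
    baseline = results[baseline_size]["median_s"]
    return results[stress_size]["worst_s"] <= max_slowdown * baseline


if __name__ == "__main__":
    results = benchmark(sample_workload, sizes=[10_000, 100_000, 1_000_000])
    for size, stats in results.items():
        print(size, stats)
    print("tolerable:", within_tolerance(results, 10_000, 1_000_000, max_slowdown=200))
```

Recording the sizes, repeat count, and seed alongside the timings keeps the experiment setting visible, which is what matters when comparing your numbers with a published study.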