Hadoop Platforms - Introduction, Importance, Providers

11/2/2016
Introduction
 Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was named after a toy elephant that belonged to Cutting's son.
It was originally developed to support distributed processing for the Nutch search engine project.
 Hadoop is an open-source software framework for storing data and running applications on clusters. It provides massive
storage for any kind of data, substantial processing power and the ability to handle a very large number of concurrent tasks.
 Hadoop is a highly scalable analytics platform and can process multiple petabytes of data spread across hundreds or
thousands of physical storage servers or nodes.
 It provides:
 Redundant, fault-tolerant data storage
 Parallel computation framework
 Job Coordination
 Hadoop is a solution for managing Big Data: a framework for running data management applications on a
large cluster built from commodity hardware.
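The idea of redundant, fault-tolerant storage spread across nodes can be pictured with a small sketch. This is not HDFS code: the function names, the round-robin placement rule, the 128 MB block size and the replication factor of 3 are all assumptions chosen for illustration. It only shows why a file stays readable when one node is lost.

```python
BLOCK_SIZE_MB = 128  # assumed block size, for illustration only
REPLICATION = 3      # assumed number of copies kept of each block


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Split a file into blocks and assign each block to `replication` distinct nodes.

    Uses a simple round-robin rule (hypothetical; real systems use smarter placement).
    Returns {block_index: [node, ...]}.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    placement = {}
    for block in range(n_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block still has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(file_size_mb=500, nodes=nodes)
print(placement)
print(readable_after_failure(placement, "node2"))
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block.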

Importance of Hadoop
 Ability to store and process huge amounts of any kind of data, quickly.
 Computing power- Hadoop's distributed computing model processes big data
faster.
 Fault tolerance- Data and application processing are protected against hardware
failure. If a node goes down, jobs are automatically redirected to other nodes to
make sure the distributed computing does not fail.
 Flexibility- Both structured and unstructured data can be stored
without pre-processing.
 Low cost- The open-source framework is free and uses commodity hardware to
store large quantities of data.
 Scalability- Nodes can be added as and when needed, and the maintenance cost is
low.
http://www.sas.com/content/sascom/en_us/insights/big-data/hadoop/_jcr_content/par/styledcontainer_8bf1/par/styledcontainer_a643/par/textimage_ea05/image.img.png/1468851612191.png
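The fault-tolerance point can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions, not Hadoop's actual scheduling logic: `run_job`, the round-robin assignment and the node names are hypothetical. It shows tasks being redirected away from a failed node so that the job as a whole still completes.

```python
def run_job(tasks, nodes, failed):
    """Assign each task to a node, redirecting tasks whose node is down.

    tasks  : list of task names
    nodes  : list of node names in the cluster
    failed : set of node names that are currently down
    Returns {task: node}.
    """
    healthy = [n for n in nodes if n not in failed]
    if not healthy:
        raise RuntimeError("no healthy nodes left")
    assignment = {}
    for i, task in enumerate(tasks):
        preferred = nodes[i % len(nodes)]  # simple round-robin first choice
        if preferred in failed:
            # Redirect the task to a surviving node instead of failing the job.
            preferred = healthy[i % len(healthy)]
        assignment[task] = preferred
    return assignment


result = run_job(["t0", "t1", "t2", "t3"], ["a", "b", "c"], failed={"b"})
print(result)
```

Every task ends up on a healthy node, so the loss of node `b` does not stop the job.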

Hadoop Core Components
Hadoop is a system for large-scale data processing.
It has two main components:
1. HDFS – Hadoop Distributed File System (Storage)
 Distributed across “nodes”
 Natively redundant
 Self-healing, high-bandwidth clustered storage
 NameNode tracks where the data is located
2. MapReduce (Processing)
 Splits a task across processors “near” the data and assembles the results
 JobTracker manages the TaskTrackers
http://cdn.edureka.co/blog/wp-content/uploads/2014/08/hadoop1componenets.png
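The MapReduce model of splitting a task and assembling the results can be sketched in plain Python with the classic word-count example. This is an illustrative, single-process sketch and does not use the Hadoop API; the function names `map_phase`, `shuffle` and `reduce_phase` are assumptions. In a real cluster the map calls would run on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped independently, which is what allows parallelism.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

The map and reduce steps never share state, which is the property that lets a framework spread them across many machines.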

Top 5 Hadoop Platform Providers
 Hadoop, a software framework that provides the tools needed to
carry out Big Data analysis, is widely used across industries.
 Although it is open source and designed to be user-friendly, in its “raw”
state it still needs considerable specialist knowledge to set up
and run.
 “Hadoop-as-a-Service” has emerged in recent times: the whole
installation takes place within the vendor's own
cloud, with customers paying a subscription to access the
services.
 The top 5 Hadoop platform providers are:
 IBM
 Amazon Web Services
 Hortonworks
 Cloudera
 MapR
https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAclAAAAJDZmZTQwODVlLTAwZGQtNGI3Ny05OTlhLTUzMTEyYTNmMTllMg.jpg

1. IBM
 IBM has deep roots in the computing industry. Its BigInsights package
adds its proprietary analytics and visualization algorithms to the core
Hadoop infrastructure.
 IBM Open Platform with Apache Hadoop offers:
 Native support for rolling upgrades for Hadoop services
 Support for long-running applications within YARN for enhanced
reliability & security
 Heterogeneous storage in HDFS, covering in-memory and SSD in addition to
HDD
 The Spark in-memory distributed compute engine, which gives large performance increases over MapReduce and simplifies
the developer experience through Java, Python & Scala
 Apache Hadoop projects included: HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format,
Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, OpenJDK, Knox, Slider
https://www-01.ibm.com/software/in/data/images/bd-platform.jpg

2. Amazon Web Services
 Amazon is a frontrunner in offering Hadoop as part of its cloud services
package.
 Amazon Web Services (AWS) is a hosted solution integrating
Hadoop with Amazon’s Elastic Compute Cloud (EC2) and Simple Storage
Service (S3) cloud-based data processing and storage services.
 AWS offers a broad set of global compute, storage, database,
analytics, application, and deployment services that help
organizations move faster, lower IT costs, and scale applications.
 AWS is trusted by the largest enterprises and the hottest start-ups
to power a wide variety of workloads including web and
mobile applications, data processing and warehousing, storage,
archive, and many others.
 Big Data on AWS introduces you to cloud-based big data solutions such as Amazon Elastic MapReduce (EMR),
Amazon Redshift, Amazon Kinesis and the rest of the AWS big data platform.
http://www.strategism.org/wp-content/uploads/2015/06/amazon-800x600.jpg

3. Hortonworks
 Hortonworks is one of the few vendors that offer 100% open source
Hadoop technology without any proprietary components.
 Hortonworks was also the first to integrate support for Apache
HCatalog, which creates “metadata” – data about data –
simplifying the process of sharing your data across other
layers of service such as Apache Hive or Pig.
 HDP (Hortonworks Data Platform) is the
enterprise-ready open source Apache™
Hadoop® distribution based on a centralized architecture
(YARN).
 HDP addresses the complete needs of data-at-rest, powers real-time customer applications and delivers
robust analytics that accelerate decision making and innovation.
 Hortonworks focuses on data: data-in-motion, data-at-rest, and Modern Data Applications. Its Connected
Data Platforms help customers create actionable intelligence to transform their businesses.
http://hortonworks.com/wp-content/uploads/2014/03/11.png

4. Cloudera
 Cloudera is the most popular provider and has the largest number of installations running.
 Cloudera contributed Impala, which brings real-time, massively parallel
processing of Big Data to Hadoop.
 Cloudera's open-source Apache Hadoop distribution, CDH (Cloudera
Distribution Including Apache Hadoop), targets enterprise-class
deployments of that technology.
 Cloudera says that more than 50% of its engineering output is donated
upstream to the various Apache-licensed open source projects (Apache
Hive, Apache Avro, Apache HBase, and so on) that combine to form the
Hadoop platform.
 Cloudera is a sponsor of the Apache Software Foundation.
http://blog.cloudera.com/wp-content/uploads/2013/06/search.png

5. MapR
 MapR takes a different approach in some areas, such as native support for
UNIX file systems rather than HDFS.
 MapR Technologies is spearheading development of the Apache
Drill project, which provides advanced tools for interactive real-time
querying of big datasets.
 MapR describes its Converged Data Platform as the industry’s only
platform to integrate the power of Hadoop and Spark
with global event streaming, real-time database capabilities, and
enterprise storage.
 The MapR Hadoop distribution replaces HDFS with its proprietary
file system, MapR-FS, which is designed to provide more efficient
management of data, reliability and ease of use.
 The MapR Converged Data Platform supports big data storage
and processing through the Apache collection of Hadoop
products, as well as its added-value components.
http://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2016/03/Mapr_Zeta_4-1.png

References
1. http://www.sas.com/en_us/insights/big-data/hadoop.html#hadoopimportance
2. http://www.ironsystems.com/products/hadoop-platforms-overview
3. http://www.slideshare.net/billonahill/intro-to-hadoop-14125097/32-Hadoop_provides_Redundant_faulttolerant_data
4. http://www.computerweekly.com/feature/Big-data-storage-Hadoop-storage-basics
5. https://www.linkedin.com/pulse/big-data-top-10-commercial-hadoop-platforms-bernard-marr
6. http://data-informed.com/10-top-commercial-hadoop-platforms/
7. http://www.cloudera.com/partners/solutions/amazon-web-services.html
8. http://hortonworks.com/products/data-center/hdp/
9. http://www-03.ibm.com/software/products/en/ibm-open-platform-with-apache-hadoop
10. https://en.wikipedia.org/wiki/Cloudera
11. https://www.mapr.com/
12. http://searchdatamanagement.techtarget.com/feature/Inside-the-MapR-Hadoop-distribution-for-managing-big-data
13. http://www.ironnetworks.com/
14. http://www.ironsystems.com/
15. http://shop.ironnetworks.com/
