Cloudera Hadoop 

as your Data Lake

Introduction to BigData and Hadoop for beginners
David Yahalom, CTO

NAYA Technologies
davidy@naya-tech.co.il

www.naya-tech.com



2015, All Rights Reserved
NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812

All Rights Reserved. Do not Distribute.
About NAYA Technologies
Global leader in Data Platform consulting and managed services.
Established in 2009, NAYA is a leading provider of Data Platform managed services
with emphasis on planning, deploying, and managing business critical database
systems for large enterprises and leading startups.
Our company provides everything from data platform architecture design through
implementation to 24/7/365 support for mission-critical systems.
NAYA is one of the fastest-growing consultancies in the market, with teams that provide
clients with the peace of mind they need when it comes to their critical data and
database systems.
NAYA delivers the most professional, respected and experienced consultants in the
industry. The company uses multi-national consulting teams that can manage projects
consistently across time zones.
BigData as a “Game Changer”

• What is BigData?
• What makes BigData different?

As data becomes increasingly complex and difficult to manage and analyze,
organizations are looking for new solutions that span beyond the scope of the
traditional RDBMS.
It used to be very simple! A decade ago, everything was running on top of relational
databases - Realtime, Analytics, BI, OLTP, OLAP, batch …

Back then, data sets were much smaller and usually well structured (native to a
relational database), so a single database paradigm - the relational database -
was a great match for all data requirements, supporting all major use cases.
Things aren’t so simple anymore. In the past few years, the nature of our data has
changed: datasets have become larger and more complex, and the rate of data flow
both in and out of our databases has increased tremendously. This change brought
about a new way to think about database platforms.
The traditional role of the relational database as the single and unified platform for
all types of data is no more. The market has evolved to embrace a more specialized
approach where different database technologies are used to store and process
different sets of data.
The Challenge of “BigData”

• How do I know if I have a “BigData problem”?



The changing nature of data can be discussed in terms of Volume, Velocity and
Variety. These are the differentiating factors which help separate classic data use
cases from next-generation ones. These “three Vs” are the business challenges
which force organizations to look beyond the traditional RDBMS as the sole data
platform.
Volume
• Collecting and analyzing more data helps make more educated business
decisions. We want to store all data that has or might have business value.
Don’t throw anything away as you never know when a piece of data will be
valuable for your organization.
• Flexibility in the ability to store data is also extremely important.
Organizations require solutions that can scale easily. You might have only 1
Terabyte of data today but that may increase to 10 Terabytes in a few years
and your data architecture must seamlessly and easily support scalability
without “throw away” architectures.
Velocity
• The rate of data collected, data which is flowing into our applications and
systems, is increasing dramatically. Thousands, tens of thousands or even
hundreds of thousands of critical business events are generated by our
applications and systems every second. These business events are meaningful
to us and have to be stored, cataloged and analyzed.
• Rapid data ingestion isn’t the only challenge: users are demanding realtime
access to analytics based on up-to-date data. No longer can we provide users
with reports based on yesterday’s data. No longer can we rely on periodic nightly
ETL jobs. Data needs to be fresh and immediately available to users for
analytics as it is being generated.
Variety
• Traditional data sets used to be strictly structured, either natively or after
an ETL created the structure – ETLs which are slow, non-scalable, difficult to change
and prone to errors and failures. Nowadays, applications need to store
different types of data, some structured and some unstructured: data
generated from social networks, sensors, application logs, user interactions,
geo-spatial data, etc. This data is much more complex and has to be
made accessible for processing and analysis alongside more traditional data
models.
• In addition, different applications with different data structures and use cases
can benefit from different processing frameworks / paradigms. Some datasets
require batch processing (such as recommendation engines) while other
datasets rely on realtime analytics (such as fraud detection). Flexibility in data
access APIs – a “best of breed” approach – benefits users by making complex
data easily accessible to everyone in the organization.


Enter the world of NoSQL databases

• What are NoSQL databases and how do they relate to BigData?
• How are NoSQL databases different compared to traditional SQL-based databases?
The solution to the challenges we described? The next generation of NoSQL databases -
databases which try to address the “Volume, Velocity, Variety” challenges by thinking
outside the box.
Remember, relational databases are optimized for storing structured data, are
difficult to scale and rely on SQL for data retrieval. They are optimized for some use
cases, but not all.
NoSQL databases, on the other hand, are designed to store and process large amounts
of data (Volume, Velocity), scale easily (Volume), handle complex data (Variety) and
provide immediate access (Velocity) to fresh data.
Relational databases:
• Structured: data is stored in tables. Tables have data types, primary keys and
constraints.
• Transactional: data can be inserted and manipulated in “grouped units” =
transactions. We can commit and rollback.
• Versatile but limited: traditional relational databases can do OLTP, OLAP, DWH
and batch but are generally not specialized.
• Examples: Oracle, SQL Server, DB2, MySQL, PostgreSQL.
• Do not easily scale-out: traditional relational databases usually rely on a single
database instance; scale-out requires manual sharding, complex application-
level Data Access Layers (DALs) or expensive and specialized hardware.
• Well-known and easy to work with: everyone knows the RDBMS and SQL.

NoSQL databases:
• Non-structured or semi-structured data model: NoSQL databases usually
provide a flexible, often schema-less data model, support un/semi-structured
data and allow rapid data model changes. Some NoSQL databases provide native
JSON support; others provide a BigTable-type data model.
• Extremely Scalable: designed to be scalable from the ground up; usually
deployed in a cluster architecture to achieve easy and rapid scalability.
• Usually specialized: Specific NoSQL database technologies are designed for
specific use cases. 

High-volume operational processing? HBase. Advanced analytics? Hadoop.
• Examples: Hadoop, HBase, MongoDB, CouchBase, Cassandra, etc.
• Variety of data retrieval and development APIs: each NoSQL database has its
own unique query API and query language. Some even support SQL; some do
not.
BigData as a “One Liner” 

Generating value from large datasets that cannot be analyzed using traditional
technologies.
Hadoop as your Data Lake
• How does Hadoop fit the BigData picture?
Apache Hadoop is an open source data platform facilitating a fundamentally new way
of storing and processing data. Instead of relying on expensive, proprietary hardware
and different systems to store and process different types of data, Hadoop allows for
centralized, distributed parallel processing of huge amounts of data across
inexpensive, industry-standard servers.
Hadoop can become your organization’s centralized master location for all raw data,
structured or unstructured, and thus become a central “Data Lake” to which all other
databases, data silos and applications can connect and retrieve data.
Hadoop doesn’t just store your data: all data in Hadoop can be easily accessed using
multiple frameworks and APIs.
Data can be ingested into Hadoop without pre-processing or the need for complex ETL.
You can just load the data as-is in near realtime. This minimizes any processing
overhead when storing raw data and does not require changing the way your data
looks so that it can fit a particular target schema. Changes to the raw data do not
mandate changing the data model during data ingestion. The data model is usually
created during queries (reads) and not during data load (writes).
Hadoop provides a “store everything now and decide how to access later” approach.



The “store everything now and decide how to process later”
architecture


- All required raw data is ingested in near realtime into a Hadoop cluster from
both unstructured and structured sources.
- Once loaded into Hadoop, all of your data is immediately accessible for all the
different use cases in your organization.
With Hadoop, no data is “too big” or “too complex”.
[Diagram: all valuable data, both raw and processed, from unstructured and relational sources, is loaded into Hadoop with NO ETL; once stored in Hadoop, data can be accessed anytime and queried using a variety of APIs and data access frameworks for both batch and realtime processing.]
Cloudera Hadoop
• What is Cloudera Hadoop and how is it different from plain “Hadoop”?
• What is the difference between Cloudera Express and Enterprise?
Hadoop is an open-source platform; Cloudera provides a pre-packaged, tested and
enhanced open-source distribution of the Hadoop platform. The relation between
Cloudera Hadoop and “vanilla Hadoop” can be thought of as similar to the relation
between Red Hat Linux and “vanilla Linux”.
Cloudera is one of the leading innovators in the Hadoop space and one of the largest
contributors to the open-source Apache Hadoop ecosystem.
Cloudera packages the Hadoop source code in a special distribution which includes
enhanced Cloudera-developed Hadoop capabilities (such as Impala for interactive
SQL-based analytics), graphical web user interfaces for cluster management and
development (Cloudera Manager / HUE), as well as important Hadoop bug fixes and
24x7 support.


Cloudera Hadoop is available in both Express and Enterprise editions.
• Cloudera Express is the free-to-use version of Cloudera Hadoop; it supports
unlimited cluster sizes and runs all the Apache Hadoop features without any
limitations. Cloudera Express includes the Cloudera Manager web UI.
• Cloudera Enterprise includes support directly from Cloudera and some cluster
management enhancements such as rolling upgrades, SNMP alerts, etc.








In addition to the core Hadoop components - HDFS & YARN, which we will discuss later -
Cloudera Hadoop (both Express and Enterprise) also includes multiple supplementary
open-source Hadoop ecosystem components which come bundled as part of the
Cloudera Hadoop installation. The Hadoop ecosystem components complement each
other and allow Hadoop to reach its full potential.

Components such as HBase (online near-realtime key/value access for “operational”
database use cases), Impala (interactive SQL-based analytics on top of Hadoop
data), Spark (in-memory analytics and stream data processing) and more.





The Hadoop Architecture
• What does a Hadoop cluster look like?
At a high level, Hadoop uses a Master/Slave architecture where the master
nodes are responsible for providing cluster-wide services (such as resource scheduling
and coordination, or storing metadata for data which resides in Hadoop) and the slave
nodes are responsible for actual data storage and processing. Both master and
slave nodes are highly available: more than one master node can be brought online for
failover purposes, and multiple slave nodes will always be online due to the distributed
nature of Hadoop.
The core of Hadoop is made of two components which provide scalable & highly
available data storage and fast & flexible data retrieval.
• HDFS – Hadoop’s distributed filesystem. The core Hadoop component that is
responsible for storing data in a highly available way.
• YARN – Hadoop’s job scheduling and data access resource management
framework allowing fast, parallel processing of data stored in HDFS.
Both HDFS and YARN are deployed in a Master/Slave architecture:



The HDFS master node is responsible for handling file system metadata while the slave
nodes store the actual business data.



The YARN master node is responsible for cross-cluster resource scheduling and job
execution while the slave nodes are responsible for actually executing user queries
and jobs.
These two core components work together seamlessly to provide:
• High Availability of your data - Hadoop provides an internal distributed
storage architecture that allows for protection against multiple kinds of data
loss, from single-block corruption to complete server or rack failures. Automatic
re-balancing of the Hadoop cluster is done in the background to ensure
constant availability for your data and sustained workloads.
• Scalability - Hadoop clusters can scale virtually without limits. 

Adding new Hadoop nodes to an existing cluster can be done online without any
downtime or interruption of existing workloads. 

Because each Hadoop “worker node” in the cluster is a server equipped with its
own processor cores and hard drives, adding new nodes to your Hadoop cluster
adds both storage capacity as well as computation capacity.

When scaling Hadoop, you are not just expanding your data storage capability
but also increasing your data processing power.



This method of scaling can be considered a paradigm shift compared to the
traditional database model where scaling the storage does not also increase
data retrieval performance - so you end up with the capacity to store more
data but without the capacity to quickly query it.
• Data Model Flexibility - Hadoop can handle any and all types of data. The
underlying Hadoop HDFS filesystem allows for storing any type of structured or
unstructured data. During data load, Hadoop is agnostic to the data model and
can store JSON documents, CSVs, Tab Delimited Files, unstructured text files,
XML files, Binary files – you name it!

No need for expensive ETL or data pre-processing during data load.
With Hadoop you can first load your data and later decide on how to query it
and what is the data model. This is also known as “schema on read”.



This approach decouples the application data model (schema, data types,
access patterns) from data storage and is considered an essential requirement
for a scalable and flexible next generation database. “Store everything and
decide how to query it later”.



In addition to flexible data models, Hadoop also provides flexible data access
with a “pluggable” architecture allowing for multiple query APIs and data
processing frameworks on top of the same dataset.
Hadoop HDFS
• How is data stored in Hadoop? Tables? Files?
The first component of the Core Hadoop architecture is a fault tolerant and self-
healing distributed file system designed to turn a cluster of industry standard servers
into a massively scalable pool of storage.
Developed specifically for large-scale data processing workloads where scalability,
flexibility and throughput are critical, HDFS accepts data in any format regardless of
schema, optimizes for high bandwidth streaming, and scales to proven deployments of
100PB and beyond.
Scale-Out Architecture - add servers to increase storage capacity.

High Availability - serve mission-critical workflows and applications.

Fault Tolerance - automatically and seamlessly recover from failures without affecting
data availability.

Load Balancing - place data intelligently across cluster nodes for maximum efficiency
and utilization.

Tunable Replication - multiple copies of each piece of data provide failure protection
and computational performance.

Security - optional LDAP integration.



As the name suggests, HDFS is the Hadoop Distributed File System. As such, HDFS
behaves in a similar way to traditional Linux/Unix filesystems. At its lowest level,
Hadoop stores data as files which are made of individual blocks.
During data ingestion into Hadoop, HDFS stripes your loaded files across all nodes in
the cluster, with replication for fault tolerance. A file loaded into Hadoop is
split into multiple individual blocks which are spread across the entire cluster.
Each block is stored more than once, on more than one server.

The replication factor (number of block copies) and block size are configurable on a
per-file basis.
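
For example, the replication factor of an existing file can be changed with “-setrep”, and a block size can be requested at write time with a “-D” configuration override. A minimal sketch via the “hadoop fs” command line (introduced below), wrapped in Python; the paths are hypothetical and the block-size property name (dfs.blocksize) is per Hadoop 2.x:

import subprocess

# Hedged sketch: per-file replication and block size, hypothetical paths.
# "-setrep -w 2" sets the replication factor of an existing HDFS file and
# waits for re-replication to finish; "-D dfs.blocksize=..." requests a
# 256 MB block size for the file being written (property name per Hadoop 2.x).
subprocess.run(["hadoop", "fs", "-setrep", "-w", "2",
                "/user/hadoop/hdfs_dir/hdfs_file.txt"], check=True)
subprocess.run(["hadoop", "fs", "-D", "dfs.blocksize=268435456",
                "-put", "big_local_file.bin", "/user/hadoop/hdfs_dir/"], check=True)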
Working with HDFS, at its lowest level, is simple. You can either access the HDFS file
browser using the HUE web interface:
Using the Hadoop HUE WebUI to browse HDFS
Or you can use the Hadoop command line tool (“hadoop”) that is part of the Hadoop
client. Specifying the “fs” argument allows for interacting with the filesystem of a
remote Hadoop cluster:
Some more very basic examples:

# create a directory on HDFS
hadoop fs -mkdir /user/hadoop/dir1

# list the contents of an HDFS directory
hadoop fs -ls /user/hadoop/dir1

# recursively remove an HDFS directory
hadoop fs -rm -r /user/hadoop/dir1

# copy a file from the local filesystem into HDFS
hadoop fs -put /path_to_local_dir/local_file.txt /user/hadoop/hdfs_dir/hdfs_file.txt
Note that the paths shown in the examples above are HDFS paths, not paths on the
local file system of the machine where the “hadoop fs” command line is executed.
It’s important to note that while writing / reading files from HDFS is the lowest-level
access to HDFS, end-users (developers, data analysts) working with Hadoop rely on
several other Hadoop data access frameworks which allow queries and data processing
on top of HDFS-stored data without having to directly interact with the Hadoop
filesystem. Frameworks such as Cloudera Impala or HIVE allow end-users to write SQL
queries on top of data stored in HDFS.
A SQL query used to directly access and visualize data from HDFS using the Hadoop HUE web UI.
Bottom line - HDFS is the Hadoop filesystem, the low-level data storage layer. Users
can interact with HDFS using both the HUE WebUI and the Hadoop command line.
Using these tools you can treat HDFS as if it is a regular (but distributed) filesystem -
create directories, write files, read files, delete files, etc… 



All data in Hadoop, at its lowest level, is just files on HDFS. 

“Structure” (semantics, tables, records, fields) is created when accessing data, not
when writing it.
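
To make this concrete, here is a minimal sketch of the load side of schema-on-read (file and folder names are hypothetical): raw data is simply copied into HDFS as-is, and structure is layered on later, for example by the Impala CREATE EXTERNAL TABLE statement shown further below.

import subprocess

# Hedged sketch: load raw data into HDFS with no schema declared up front.
# Folder and file names are hypothetical.
subprocess.run(["hadoop", "fs", "-mkdir", "-p", "/some_hdfs_folder"], check=True)
subprocess.run(["hadoop", "fs", "-put", "local_events.csv",
                "/some_hdfs_folder/"], check=True)
# Tables, columns and types are defined only at read time, e.g. by an
# Impala or HIVE external table whose LOCATION points at /some_hdfs_folder/.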
Hadoop YARN
• Once data is stored in Hadoop, how can we coordinate access to it?
The second component of the core Hadoop architecture is the data processing,
resource management and scheduling framework called YARN. 



Different workloads (realtime and batch) can co-exist on your Hadoop cluster. YARN
facilitates scheduling, resource management and application/query-level execution
failure protection for all types of Hadoop workloads.
If Hadoop HDFS takes care of data storage, YARN takes care of managing data
retrieval.
With YARN, data processing workloads are executed at the same location where the
data is stored, rather than relying on moving data from a dedicated storage tier to a
database tier.
Data storage and computation coexist on the same physical nodes in the cluster.
Workloads running in Hadoop under YARN can process exceedingly large amounts of
data without being affected by traditional bottlenecks like network bandwidth by
taking advantage of this data proximity.
Scale-out architecture - adding servers to your Hadoop cluster increases both
processing power and storage capacity.



Security & authentication – YARN works with HDFS security to make sure that only
approved users can operate against the data in Hadoop.



Resource manager and job scheduling – YARN employs data locality and manages
cluster resources intelligently to determine optimal locations (nodes) across the
cluster for data processing while allowing both long-running (batch) and short-running
(realtime) applications to co-exist and access the same datasets.



Flexibility – YARN allows for various data processing APIs and query frameworks to
work on the same data at the same time. Some of the Hadoop data processing
frameworks running under YARN are optimized for batch analytics while others
provide near realtime in-memory event processing thus providing a “best of breed”
approach for accessing your data based on your use cases.



Resiliency & high availability – YARN runs as a distributed architecture across your
Hadoop cluster, ensuring that if a submitted job or query fails, it can independently
and automatically restart and resume processing. No user intervention is required.
When data stored in Hadoop is accessed, data processing is distributed across all
nodes in the cluster. Distributed data sets are pieced together automatically, providing
parallel reads and processing to construct the final output.
Bottom line: YARN, by itself, isn’t a “query engine” or a data processing framework in
Hadoop. It’s a cluster resource manager and coordinator that allows various data
processing and query engines (discussed later) to access data stored in Hadoop and
“play nice” with one another - that is, share cluster resources (CPU cores, memory,
etc.).
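
For a concrete touchpoint, the standard YARN command line exposes this coordination layer. A minimal sketch (assuming the Hadoop client tools are on the PATH) that asks the ResourceManager which applications it is tracking:

import subprocess

# Hedged sketch: list applications known to the YARN ResourceManager.
# "yarn application -list" is part of the standard YARN CLI; the output
# format varies between Hadoop versions.
subprocess.run(["yarn", "application", "-list"], check=True)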
Hadoop Query APIs and Data Processing Frameworks
• What do end-users (data analysts, developers, etc.) actually use to query and
process data in Hadoop?
Unlike most traditional relational databases, which only offer SQL-based access to
data, Hadoop provides a variety of APIs, each optimized for specific use cases.
While SQL has its benefits in simplicity and very short development cycles, it is
limited when taxed with more complex computation or analytical workloads.
Continuing Hadoop’s “one size does not fit all” approach, multiple different
“pluggable” APIs and processing languages are available, each specifically designed to
address an individual use case with custom-tailored performance and flexibility.
Identify your data processing use case and then select the best optimized framework
for the job.
Some of these modern frameworks for retrieving and processing data stored in Hadoop
are:
Cloudera Impala (Interactive SQL) – high-performance interactive access
to data via SQL.

Impala provides SQL-based data retrieval in Hadoop with latencies on the order of seconds.

Impala is a fully integrated, state-of-the-art analytic Hadoop database engine
specifically designed to leverage the flexibility and scalability strengths of Hadoop -
combining the familiar SQL language and multi-user performance of a traditional
analytic database with the performance and scalability of Hadoop. Impala workloads
are not converted to Map/Reduce when executed and access Hadoop data directly. 



Example Impala create table statement + query:

CREATE EXTERNAL TABLE tab2
(
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/some_hdfs_folder/';  -- note: this is the location of a folder on HDFS

-- the query runs on top of the previously created table, accessing HDFS data
SELECT tab2.*
FROM tab2,
  (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
   FROM tab2, tab1
   WHERE tab1.id = tab2.id
   GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;


HIVE (Batch SQL) – batch-optimized SQL interface

HIVE allows non-developers to access data directly from Hadoop using the SQL
language while providing batch processing optimizations.
HIVE automatically converts SQL code to Map/Reduce programs and pushes them onto
the Hadoop cluster. Because HIVE leverages Map/Reduce, it is suited for batch
processing and provides the same performance, reliability and scalability which are
core strengths of Map/Reduce on Hadoop.
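
For the same flavor of example as the other frameworks, here is a minimal, hedged sketch of submitting a batch HiveQL query from a script. The table and column names reuse the hypothetical tab2 from the Impala example above (Impala and HIVE share the same metastore), and “hive -e” executes a query string from the shell:

import subprocess

# Hedged sketch: submit a batch HiveQL aggregation through the Hive CLI.
# "hive -e" executes the query string; tab2 is the hypothetical table
# created in the Impala example above.
query = "SELECT col_1, COUNT(*) AS cnt FROM tab2 GROUP BY col_1;"
subprocess.run(["hive", "-e", query], check=True)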
Spark (In-memory “fast” processing) – next generation memory-
optimized data processing engine 

Spark is an extremely fast, memory-optimized, general-purpose processing engine,
considered the next-generation data processing framework for Hadoop.

Spark exposes a functional data processing API and supports development in Python,
Java and Scala. Spark is designed for both batch processing workloads as well as
streaming workloads (using Spark Streaming), interactive queries, and machine
learning.
Example Spark word count application in Python:

# Assumes a SparkContext named sc, e.g. as provided by the pyspark shell
# Open a CSV file on Hadoop HDFS
text_file = sc.textFile("hdfs://raw_data/my_raw_datafile.csv")

# Count the words in the file
counts = (text_file.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
Map/Reduce (BATCH) – distributed batch processing framework

MapReduce is the original core of processing in Hadoop. Map/Reduce is a
programming paradigm which allows data processing to scale massively
across hundreds or thousands of servers in a Hadoop cluster.
With MapReduce and Hadoop, compute workloads are executed in the same location
as the data. Data storage and computation coexist on the same physical nodes in the
Hadoop cluster. MapReduce processes exceedingly large amounts of data without
being affected by traditional bottlenecks like network bandwidth by taking advantage
of this data proximity.
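
MapReduce jobs are traditionally written in Java, but the Hadoop Streaming utility that ships with Hadoop lets any executable act as the mapper and reducer. A minimal word-count sketch in Python (file names hypothetical), submitted with the usual “hadoop jar hadoop-streaming.jar -input … -output … -mapper mapper.py -reducer reducer.py” invocation:

# mapper.py - hedged sketch of a Hadoop Streaming mapper (word count).
# Streaming feeds raw input lines on stdin; we emit tab-separated key/value pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

# reducer.py - the matching Streaming reducer.
# Hadoop sorts mapper output by key, so all counts for one word arrive contiguously.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))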
Apache Mahout (Machine Learning) – scalable machine learning
framework.

Mahout provides core algorithms for clustering, classification and collaborative
filtering, implemented on top of the scalable, distributed Hadoop platform,
such as:

- Recommendation mining: takes user behavior and from that tries to find
items users might like.
- Clustering: takes data (such as documents) and groups it into topically
related groups.
- Classification: learns from existing categorized data what data of a specific
category looks like, and is able to assign unlabeled documents to the correct
category.
Note that the frameworks detailed above are just some of the most popular Hadoop
frameworks which run under YARN. Many more exist and are being developed.
NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812

All Rights Reserved. Do not Distribute.

More Related Content

What's hot

How to create a workflow
How to create a workflow How to create a workflow
How to create a workflow Atlassian
 
Evolution of the SAP User Experience and Technology Stack
Evolution of the SAP User Experience and Technology StackEvolution of the SAP User Experience and Technology Stack
Evolution of the SAP User Experience and Technology StackVictor Ionescu
 
SAP HCM Consultant
SAP HCM ConsultantSAP HCM Consultant
SAP HCM ConsultantIT LearnMore
 
Oracle
OracleOracle
Oraclensah
 
대학교 취업성공프로젝트 찬스 제안서
대학교 취업성공프로젝트 찬스 제안서대학교 취업성공프로젝트 찬스 제안서
대학교 취업성공프로젝트 찬스 제안서the Learning & Company
 
Grafana is not enough: DIY user interfaces for Prometheus
Grafana is not enough: DIY user interfaces for PrometheusGrafana is not enough: DIY user interfaces for Prometheus
Grafana is not enough: DIY user interfaces for PrometheusWeaveworks
 
Salesforce point of License 20200819
Salesforce point of License 20200819Salesforce point of License 20200819
Salesforce point of License 20200819Hiroki Iida
 
SAP HANA Implementation A Complete Guide.pdf
SAP HANA Implementation A Complete Guide.pdfSAP HANA Implementation A Complete Guide.pdf
SAP HANA Implementation A Complete Guide.pdfZoe Gilbert
 
【●●株式会社 御中】提案資料 2015.04.11
【●●株式会社 御中】提案資料 2015.04.11【●●株式会社 御中】提案資料 2015.04.11
【●●株式会社 御中】提案資料 2015.04.11Naomichi Sawamura
 
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例Amazon Web Services Japan
 
Salesforceでの大規模データの取り扱い
Salesforceでの大規模データの取り扱いSalesforceでの大規模データの取り扱い
Salesforceでの大規模データの取り扱いSalesforce Developers Japan
 

What's hot (14)

How to create a workflow
How to create a workflow How to create a workflow
How to create a workflow
 
Evolution of the SAP User Experience and Technology Stack
Evolution of the SAP User Experience and Technology StackEvolution of the SAP User Experience and Technology Stack
Evolution of the SAP User Experience and Technology Stack
 
K smart lighthouse factory(acs)
K smart lighthouse factory(acs)K smart lighthouse factory(acs)
K smart lighthouse factory(acs)
 
Business workflow
Business workflowBusiness workflow
Business workflow
 
Camunda BPM 7.12 Release Webinar
Camunda BPM 7.12 Release WebinarCamunda BPM 7.12 Release Webinar
Camunda BPM 7.12 Release Webinar
 
SAP HCM Consultant
SAP HCM ConsultantSAP HCM Consultant
SAP HCM Consultant
 
Oracle
OracleOracle
Oracle
 
대학교 취업성공프로젝트 찬스 제안서
대학교 취업성공프로젝트 찬스 제안서대학교 취업성공프로젝트 찬스 제안서
대학교 취업성공프로젝트 찬스 제안서
 
Grafana is not enough: DIY user interfaces for Prometheus
Grafana is not enough: DIY user interfaces for PrometheusGrafana is not enough: DIY user interfaces for Prometheus
Grafana is not enough: DIY user interfaces for Prometheus
 
Salesforce point of License 20200819
Salesforce point of License 20200819Salesforce point of License 20200819
Salesforce point of License 20200819
 
SAP HANA Implementation A Complete Guide.pdf
SAP HANA Implementation A Complete Guide.pdfSAP HANA Implementation A Complete Guide.pdf
SAP HANA Implementation A Complete Guide.pdf
 
【●●株式会社 御中】提案資料 2015.04.11
【●●株式会社 御中】提案資料 2015.04.11【●●株式会社 御中】提案資料 2015.04.11
【●●株式会社 御中】提案資料 2015.04.11
 
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例
 
Salesforceでの大規模データの取り扱い
Salesforceでの大規模データの取り扱いSalesforceでの大規模データの取り扱い
Salesforceでの大規模データの取り扱い
 

Viewers also liked

Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseAmazon Web Services
 
The Future of Data
The Future of DataThe Future of Data
The Future of Datablynnbuckley
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Hortonworks
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep duttaCapgemini
 
Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Holden Ackerman
 
Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake Olga Zinkevych
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWSAmazon Web Services
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...Lucas Jellema
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Viewers also liked (19)

Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
 
The Future of Data
The Future of DataThe Future of Data
The Future of Data
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
 
Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI
 
Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar to A beginners guide to Cloudera Hadoop

Introduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQLIntroduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQLTushar Shende
 
How To Tell if Your Business Needs NoSQL
How To Tell if Your Business Needs NoSQLHow To Tell if Your Business Needs NoSQL
How To Tell if Your Business Needs NoSQLDataStax
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lakesambiswal
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digitalsambiswal
 
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRobertsWP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRobertsJane Roberts
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKRajesh Jayarman
 
Introduction to NoSQL database technology
Introduction to NoSQL database technologyIntroduction to NoSQL database technology
Introduction to NoSQL database technologynicolausalex722
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
 
Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...Sheena Crouch
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lakepunedevscom
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Denodo
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchSheetal Pratik
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databasesJames Serra
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Big data - Cassandra
Big data - CassandraBig data - Cassandra
Big data - CassandraJen Wei Lee
 
Ijaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerIjaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerijaprr
 

Similar to A beginners guide to Cloudera Hadoop (20)

Introduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQLIntroduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQL
 
How To Tell if Your Business Needs NoSQL
How To Tell if Your Business Needs NoSQLHow To Tell if Your Business Needs NoSQL
How To Tell if Your Business Needs NoSQL
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lake
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digital
 
Big data rmoug
Big data rmougBig data rmoug
Big data rmoug
 
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRobertsWP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Introduction to NoSQL database technology
Introduction to NoSQL database technologyIntroduction to NoSQL database technology
Introduction to NoSQL database technology
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...
 
The new EDW
The new EDWThe new EDW
The new EDW
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
 
Benefits of a data lake
Benefits of a data lake Benefits of a data lake
Benefits of a data lake
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Big data - Cassandra
Big data - CassandraBig data - Cassandra
Big data - Cassandra
 
Ijaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerIjaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseer
 

Recently uploaded

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 

Recently uploaded (20)

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 

A beginners guide to Cloudera Hadoop

  • 1. Cloudera Hadoop 
 as your Data Lake
 Introduction to BigData and Hadoop for beginners David Yahalom, CTO
 NAYA Technologies davidy@naya-tech.co.il
 www.naya-tech.com
 
 2015, All Rights reserved NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812
 All Rights Reserved. Do not Distribute.
  • 2. About NAYA Technologies Global leader in Data Platform consulting and managed services. Established in 2009, NAYA is a leading provider of Data Platform managed services with emphasis on planning, deploying, and managing business critical database systems for large enterprises and leading startups. Our company provides everything data platform architecture design through implementation and 24/7/365 support for mission critical systems. NAYA is one of the fastest growing consultants in the market with teams that provide clients with the peace of mind they need when it comes to their critical data and database systems. NAYA delivers the most professional, respected and experienced consultants in the industry. The company uses multi-national consulting teams that can manage projects consistently across time zones. NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812
  • 3. BigData as a “Game Changer”
 • What is BigData? • What makes BigData different ?
 As data becomes increasingly more complex and difficult to manage and analyze, organizations are looking for new solutions that span beyond the scope of the traditional RDBMS. It used to be very simple! A decade ago, everything was running on top of relational databases - Realtime, Analytics, BI, OLTP, OLAP, batch …
 Back then data sets were much smaller and usually well structured (native to a relational database) so a single type of database paradigm - the relational database was a great match for all data requirements supporting all major use cases. Things aren’t so simple anymore. In the past few years, the nature of our data has changed – datasets have become larger, more complex and include tremendous increases in rate of data flow both in and out of our databases. This change brought about a new way to think about database platforms. The traditional role of the relational database as the single and unified platform for all types of data is no more. The market has evolved to embrace a more specialized approach where different database technologies are used to store and process different sets of data. NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812
  • 4. The Challenge of “BigData”
 • How do I know if I have a “BigData problem” ?
 
The changing nature of data can be discussed in terms of Volume, Velocity, and Variety. These are the differentiating factors which help separate classic data use cases from next-generation ones. These "three Vs" are the business challenges which force organizations to look beyond the traditional RDBMS as the sole data platform.
Volume
• Collecting and analyzing more data helps make more educated business decisions. We want to store all data that has or might have business value. Don't throw anything away, as you never know when a piece of data will become valuable for your organization.
• Flexibility in the ability to store data is also extremely important. Organizations require solutions that can scale easily. You might have only 1 Terabyte of data today, but that may increase to 10 Terabytes in a few years, and your data architecture must seamlessly and easily support that growth without "throw away" architectures.
Velocity
• The rate of data collected - data flowing into our applications and systems - is increasing dramatically. Thousands, tens of thousands or even hundreds of thousands of critical business events are generated by our applications and systems every second. These business events are meaningful to us and have to be stored, cataloged and analyzed.
• Rapid data ingestion isn't the only challenge: users are demanding realtime access to analytics based on up-to-date data. No longer can we provide users with reports based on yesterday's data. No longer can we rely on periodic nightly ETL jobs. Data needs to be fresh and immediately available to users for analytics as it is being generated.
Variety
• Traditional data sets used to be strictly structured, either natively or after an ETL imposed structure - ETLs which are slow, non-scalable, difficult to change
  • 5. and prone to errors and failures. Nowadays, applications need to store different types of data, some structured and some unstructured: data generated from social networks, sensors, application logs, user interactions, geo-spatial data, etc. This data is much more complex and has to be made accessible for processing and analysis alongside more traditional data models.
• In addition, different applications with different data structures and use cases can benefit from different processing frameworks / paradigms. Some datasets require batch processing (such as recommendation engines) while other datasets rely on realtime analytics (such as fraud detection). Flexibility in data access APIs - a "best of breed" approach - can benefit users by making complex data easily accessible for everyone in our organization.
 Enter the world of NoSQL databases
• What are NoSQL databases and how do they relate to BigData? • How are NoSQL databases different compared to traditional SQL-based databases?
The solution to the challenges we described? The next generation of NoSQL databases: databases which try to address the "Volume, Velocity, Variety" challenges by thinking outside the box. Remember, relational databases are optimized for storing structured data, are difficult to scale, and rely on SQL for data retrieval. They are optimized for some use cases, but not all. NoSQL databases, on the other hand, are designed to store and process large amounts of data (Velocity, Volume), to scale out (Volume), to handle complex data (Variety), and to provide immediate access to fresh data (Velocity).
Relational databases:
• Structured: data is stored in tables. Tables have data types, primary keys and constraints.
• Transactional: data can be inserted and manipulated in "grouped units" = transactions. We can commit and rollback.
  • 6. • Versatile but limited: traditional relational databases can do OLTP, OLAP, DWH and batch, but are generally not specialized.
• Examples: Oracle, SQL Server, DB2, MySQL, PostgreSQL.
• Do not easily scale out: traditional relational databases usually rely on a single database instance; scale-out requires manual sharding, complex application-level Data Access Layers (DALs), or expensive and specialized hardware.
• Well-known and easy to work with: everyone knows the RDBMS and SQL.
NoSQL databases:
• Non-structured or semi-structured data model: NoSQL databases usually provide a flexible data model, support un/semi-structured data, are schema-less, and support rapid data model changes. Some NoSQL databases provide native JSON support, others provide a BigTable-type data model.
• Extremely scalable: designed to be scalable from the ground up, usually deployed in a cluster architecture to achieve easy and rapid scalability.
• Usually specialized: specific NoSQL database technologies are designed for specific use cases.
High-volume operational processing? HBase. Advanced analytics? Hadoop.
• Examples: Hadoop, HBase, MongoDB, CouchBase, Cassandra, etc.
• Variety of data retrieval and development APIs: each NoSQL database has its own query API and query language. Some even support SQL, some do not.
BigData as a "One Liner"
Generating value from large datasets that cannot be analyzed using traditional technologies.
  • 7. Hadoop as your Data Lake
• How does Hadoop fit the BigData picture?
Apache Hadoop is an open source data platform facilitating a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process different types of data, Hadoop allows for centralized, distributed, parallel processing of huge amounts of data across inexpensive, industry-standard servers.
Hadoop can become your organization's centralized master location for all raw data, structured or unstructured, and thus become a central "Data Lake" to which all other databases, data silos and applications can connect and from which they can retrieve data. Hadoop doesn't just store your data; all data in Hadoop can be easily accessed using multiple frameworks and APIs.
Data can be ingested into Hadoop without pre-processing or the need for complex ETL - you can just load the data as-is in near realtime. This minimizes processing overhead when storing raw data and does not require changing the way your data looks so that it can fit a particular target schema. Changes to the raw data do not mandate changing the data model during data ingestion. The data model is usually created during queries (reads) and not during data load (writes). Hadoop provides a "store everything now and decide how to access later" approach.
 
  • 8. The “store everything now and decide how to process later“ architecture 
- All required raw data is ingested in near realtime into a Hadoop cluster, from both unstructured and structured sources.
- Once loaded into Hadoop, all of your data is immediately accessible for all the different use cases in your organization. With Hadoop, no data is "too big" or "too complex".
[Diagram: all valuable data, both raw and processed, flows from unstructured and relational sources into Hadoop with no ETL; once data is stored in Hadoop, it can be accessed anytime and queried using a variety of APIs and frameworks for both batch and realtime processing.]
  • 9. Cloudera Hadoop
• What is Cloudera Hadoop and how does it differ from plain "Hadoop"? • What is the difference between Cloudera Express and Enterprise?
Hadoop is an open-source platform; Cloudera provides a pre-packaged, tested and enhanced open-source distribution of the Hadoop platform. The relation between Cloudera Hadoop and "vanilla Hadoop" can be thought of as similar to the relation between Red Hat Linux and "vanilla Linux". Cloudera is one of the leading innovators in the Hadoop space and one of the largest contributors to the open source Apache Hadoop ecosystem.
Cloudera packages the Hadoop source code in a distribution which includes enhanced Cloudera-developed Hadoop capabilities (such as Impala for interactive SQL-based analytics), graphical web interfaces for cluster management and development (Cloudera Manager / HUE), as well as important Hadoop bug fixes and 24x7 support.
Cloudera Hadoop comes in both Express and Enterprise editions.
• Cloudera Express is the free-to-use version of Cloudera Hadoop, with support for unlimited cluster size; it runs all the Apache Hadoop features without any limitations. Cloudera Express includes the Cloudera Manager web UI.
• Cloudera Enterprise adds support directly from Cloudera and some cluster management enhancements such as rolling upgrades, SNMP alerts, and more.
 
 
 
  • 10. In addition to the core Hadoop components - HDFS & YARN, which we will discuss later - Cloudera Hadoop (both Express and Enterprise) also includes multiple supplementary open-source Hadoop Ecosystem components which come bundled as part of the Cloudera Hadoop installation. The Hadoop ecosystem components complement each other and allow Hadoop to reach its full potential.
Components such as HBase (online near-realtime key/value access for "operational" database use cases), Impala (interactive SQL-based analytics on top of Hadoop data), Spark (in-memory analytics and stream data processing) and more.
 
 
  • 11. The Hadoop Architecture
• What does a Hadoop cluster look like?
At a high level, Hadoop is built on a Master/Slave architecture where the master nodes are responsible for providing cluster-wide services (such as resource scheduling and coordination, or storing metadata for data which resides in Hadoop) and the slave nodes are responsible for actual data storage and processing. Both master and slave nodes are highly available: more than one master node can be brought online for failover purposes, and multiple slave nodes are always online due to the distributed nature of Hadoop.
The core of Hadoop is made of two components which provide scalable & highly available data storage and fast & flexible data retrieval.
• HDFS – Hadoop's distributed filesystem. The core Hadoop component that is responsible for storing data in a highly available way.
• YARN – Hadoop's job scheduling and data access resource management framework, allowing fast, parallel processing of data stored in HDFS.
Both HDFS and YARN are deployed on Hadoop in a Master/Slave architecture:
 
The HDFS master node is responsible for handling filesystem metadata, while the slave nodes store the actual business data.
 
The YARN master node is responsible for cross-cluster resource scheduling and job execution, while the slave nodes are responsible for actually executing user queries and jobs.
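This master/slave split is easy to observe in practice. As a minimal sketch (assuming a node with the standard Hadoop client tools configured; exact output varies by version), the bundled command-line utilities report on both layers:

# Ask the HDFS master (NameNode) for a cluster report: total capacity,
# the number of live DataNodes (slaves) and per-node disk usage
hdfs dfsadmin -report

# Ask the YARN master (ResourceManager) which worker nodes (NodeManagers)
# are registered and available to run jobs
yarn node -list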
  • 12. These two core components work together seamlessly to provide:
• High Availability of your data - Hadoop provides an internal distributed storage architecture that protects against multiple kinds of data loss, from single-block corruption to complete server or rack failures. Automatic re-balancing of the Hadoop cluster is done in the background to ensure constant availability of your data and sustained workloads.
• Scalability - Hadoop clusters can scale virtually without limits.
 Adding new Hadoop nodes to an existing cluster can be done online without any downtime or interruption of existing workloads. 
Because each Hadoop "worker node" in the cluster is a server equipped with its own processor cores and hard drives, adding new nodes to your Hadoop cluster adds both storage capacity and computation capacity.
When scaling Hadoop, you are not just expanding your data storage capability but also increasing your data processing power.
 
This method of scaling can be considered a paradigm shift compared to the traditional database model, where scaling the storage does not also increase data retrieval performance - you end up with the capacity to store more data but without the capacity to quickly query it.
• Data Model Flexibility - Hadoop can handle any and all types of data. The underlying Hadoop HDFS filesystem allows for storing any type of structured or unstructured data. During data load, Hadoop is agnostic to the data model and can store JSON documents, CSVs, tab-delimited files, unstructured text files, XML files, binary files - you name it!
No need for expensive ETL or data pre-processing during data load. With Hadoop you can load your data first and decide later how to query it and what the data model is. This is also known as "schema on read".

This approach decouples the application data model (schema, data types, access patterns) from data storage and is considered an essential requirement for a scalable and flexible next-generation database: "store everything and decide how to query it later". (A concrete schema-on-read example appears later, in the Impala section, with CREATE EXTERNAL TABLE over an HDFS folder.)
 
In addition to flexible data models, Hadoop also provides flexible data access
  • 13. with a "pluggable" architecture allowing for multiple query APIs and data processing frameworks on top of the same dataset.
Hadoop HDFS
• How is data stored in Hadoop? Tables? Files?
The first component of the core Hadoop architecture is a fault-tolerant and self-healing distributed filesystem designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for large-scale data processing workloads where scalability, flexibility and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100PB and beyond.
Scale-Out Architecture - add servers to increase storage capacity.
High Availability - serve mission-critical workflows and applications.
Fault Tolerance - automatically and seamlessly recover from failures without affecting data availability.
Load Balancing - place data intelligently across cluster nodes for maximum efficiency and utilization.
Tunable Replication - multiple copies of each piece of data provide failure protection and computational performance.
Security - optional LDAP integration.
 
As the name suggests, HDFS is the Hadoop Distributed FileSystem. As such, HDFS behaves in a similar way to traditional Linux/Unix filesystems. At its lowest level, Hadoop stores data as files which are made of individual blocks. During data ingestion into Hadoop, HDFS stripes your loaded files across all nodes in the cluster, with replication for fault tolerance. A file loaded onto Hadoop will be split into multiple individual blocks which will be spread across the entire cluster, and each block will be stored more than once, on more than one server.
The replication factor (number of block copies) and the block size are configurable on a per-file basis.
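For example, both settings can be controlled from the standard Hadoop command line. A minimal sketch (paths, sizes and replication values below are hypothetical):

# Load a file with a 128 MB block size (given in bytes), then set its replication to 2
hadoop fs -D dfs.blocksize=134217728 -put local_file.txt /user/hadoop/data/
hadoop fs -setrep 2 /user/hadoop/data/local_file.txt

# Inspect how the file was split into blocks and where each replica is stored
hdfs fsck /user/hadoop/data/local_file.txt -files -blocks -locations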
  • 14. Working with HDFS, at its lowest level, is simple. You can either access the HDFS file browser using the HUE web interface:
[Screenshot: using the Hadoop HUE WebUI to browse HDFS]
Or use the Hadoop command-line tool ("hadoop") that is part of the Hadoop client. Specifying the "fs" argument allows for interacting with the filesystem of a remote Hadoop cluster:
  • 15. Some more very basic examples:
hadoop fs -mkdir /user/hadoop/dir1
hadoop fs -ls /user/hadoop/dir1
hadoop fs -rm -r /user/hadoop/dir1
hadoop fs -put /path_to_local_dir/local_file.txt /user/hadoop/hdfs_dir/hdfs_file.txt
Note that the paths shown in the examples above are HDFS paths, not paths on the local filesystem of the machine where the "hadoop fs" command line is executed.
It's important to note that while writing and reading files is the lowest level of access to HDFS, end-users (developers, data analysts) working with Hadoop rely on several other Hadoop data access frameworks which allow queries and data processing on top of HDFS-stored data without having to interact directly with the Hadoop filesystem. Frameworks such as Cloudera Impala or HIVE allow end-users to write SQL queries on top of data stored in HDFS.
[Screenshot: a SQL query used to directly access and visualize data from HDFS using the Hadoop HUE web UI]
Bottom line - HDFS is the Hadoop filesystem, the low-level data storage layer. Users can interact with HDFS using both the HUE WebUI and the Hadoop command line. Using these tools you can treat HDFS as if it were a regular (but distributed) filesystem - create directories, write files, read files, delete files, etc.
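Reading data back out, which the examples above don't show, uses the same pattern. A small sketch (paths are again just examples):

# Print an HDFS file's contents to the terminal
hadoop fs -cat /user/hadoop/hdfs_dir/hdfs_file.txt

# Copy a file from HDFS back to the local filesystem
hadoop fs -get /user/hadoop/hdfs_dir/hdfs_file.txt /tmp/

# Show space used, in human-readable units, under a directory
hadoop fs -du -h /user/hadoop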
  • 16. 
All data in Hadoop, at its lowest level, is just files on HDFS.
"Structure" (semantics, tables, records, fields) is created when accessing data, not when writing it.
Hadoop YARN
• Once data is stored in Hadoop, how can we coordinate access to it?
The second component of the core Hadoop architecture is the data processing, resource management and scheduling framework called YARN.
 
Different workloads (realtime and batch) can co-exist on your Hadoop cluster. YARN facilitates scheduling, resource management, and application/query-level execution failure protection for all types of Hadoop workloads. If Hadoop HDFS takes care of data storage, YARN takes care of managing data retrieval. With YARN, data processing workloads are executed at the same location where the data is stored, rather than relying on moving data from a dedicated storage tier to a database tier. Data storage and computation coexist on the same physical nodes in the cluster. Workloads running in Hadoop under YARN can process exceedingly large amounts of data without being affected by traditional bottlenecks like network bandwidth by taking advantage of this data proximity.
Scale-out architecture - adding servers to your Hadoop cluster increases both processing power and storage capacity.
 
 Security & authentication – YARN works with HDFS security to make sure that only approved users can operate against the data in Hadoop.
 
Resource management and job scheduling – YARN employs data locality and manages cluster resources intelligently to determine optimal locations (nodes) across the cluster for data processing, while allowing both long-running (batch) and short-running (realtime) applications to co-exist and access the same datasets.
 
Flexibility – YARN allows for various data processing APIs and query frameworks to work on the same data at the same time. Some of the Hadoop data processing frameworks running under YARN are optimized for batch analytics while others
  • 17. provide near-realtime in-memory event processing, thus providing a "best of breed" approach for accessing your data based on your use cases.
 
Resiliency & high availability – YARN runs as a distributed architecture across your Hadoop cluster, ensuring that if a submitted job or query fails, it can independently and automatically restart and resume processing. No user intervention is required.
When data stored in Hadoop is accessed, data processing is distributed across all nodes in the cluster. Distributed data sets are pieced together automatically, providing parallel reads and processing to construct the final output.
Bottom line: YARN, by itself, isn't a "query engine" or a data processing framework in Hadoop. It's a cluster resource manager and coordinator that allows various data processing and query engines (discussed later) to access data stored in Hadoop and "play nice" with one another - that is, share cluster resources (CPU cores, memory, etc.).
Hadoop Query APIs and Data Processing Frameworks
• What do end-users (data analysts, developers, etc.) actually use to query and process data in Hadoop?
Unlike most traditional relational databases, which only offer SQL-based access to data, Hadoop provides a variety of APIs, each optimized for specific use cases. While SQL has its benefits in simplicity and very short development cycles, it is limited when taxed with more complex computational or analytical workloads. Continuing Hadoop's "one size does not fit all" approach, multiple different "pluggable" APIs and processing languages are available, each specifically designed to address an individual use case with custom-tailored performance and flexibility.
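As a practical aside before surveying the individual frameworks, you can watch YARN arbitrating between these engines from the command line. A sketch assuming the standard YARN client (the application ID format below is just an example):

# List the applications (queries, jobs) currently running under YARN
yarn application -list

# Show status details for one application, or stop a misbehaving one
yarn application -status application_1420000000000_0001
yarn application -kill application_1420000000000_0001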
  • 18. Identify your data processing use case and then select the framework best optimized for the job. Some of these modern frameworks for retrieving and processing data stored in Hadoop are:
Cloudera Impala (Interactive SQL) – high-performance interactive access to data via SQL
Impala provides second-level latency (responses in seconds) for SQL-based data retrieval in Hadoop. Impala is a fully integrated, state-of-the-art analytic Hadoop database engine specifically designed to leverage the flexibility and scalability strengths of Hadoop, combining the familiar SQL language and multi-user performance of a traditional analytic database with the performance and scalability of Hadoop. Impala workloads are not converted to Map/Reduce when executed; they access Hadoop data directly.
 
Example Impala create table statement + query:

-- Create a table on top of an existing HDFS folder (note that the
-- LOCATION clause points to a folder on HDFS, not to a local path)
CREATE EXTERNAL TABLE tab2 (
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/some_hdfs_folder/';

-- The Impala query runs on top of the previously created table,
-- accessing the HDFS data directly
SELECT tab2.*
FROM tab2,
     (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
      FROM tab2, tab1
      WHERE tab1.id = tab2.id
      GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;
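Assuming the standard impala-shell client that ships with Cloudera Hadoop (the hostname below is hypothetical), the same statements can also be run non-interactively:

impala-shell -i impala-host.example.com -q "SELECT COUNT(*) FROM tab2;"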
HIVE (Batch SQL) – batch-optimized SQL interface
HIVE allows non-developers to access data directly from Hadoop using the SQL language while providing batch-processing optimizations. HIVE automatically converts SQL code into Map/Reduce programs and pushes them onto the Hadoop cluster. Because HIVE leverages Map/Reduce, it is suited for batch processing and provides the same performance, reliability and scalability which are the core strengths of Map/Reduce on Hadoop.
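HiveQL, Hive's SQL dialect, looks very much like the Impala example above. A minimal sketch (table name, columns and path are hypothetical):

-- Define a table over raw CSV files already sitting in HDFS
CREATE EXTERNAL TABLE page_views (
  user_id INT,
  url STRING,
  view_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/raw_data/page_views/';

-- This aggregation is compiled into Map/Reduce jobs behind the scenes
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;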
  • 19. Spark (In-memory "fast" processing) – next-generation memory-optimized data processing engine
Spark is an extremely fast, memory-optimized, general-purpose processing engine and is considered the next-generation data processing framework for Hadoop.
Spark exposes a functional data processing API and supports development in Python, Java and Scala. Spark is designed for batch processing workloads as well as streaming workloads (using Spark Streaming), interactive queries, and machine learning.
Example Spark word count application in Python:
 
# Open a CSV file on Hadoop HDFS (run from the pyspark shell,
# where the SparkContext is already available as "sc")
text_file = sc.textFile("hdfs://raw_data/my_raw_datafile.csv")

# Count the words in the file; Spark transformations are lazy,
# so an action (here, saving the result to an example output path)
# is what triggers the actual computation
counts = text_file.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://raw_data/word_counts")

Map/Reduce (BATCH) – Distributed batch processing framework
MapReduce is the original core of processing in Hadoop: the programming paradigm which allows data processing to scale massively across hundreds or thousands of servers in a Hadoop cluster. With MapReduce and Hadoop, compute workloads are executed in the same location as the data - data storage and computation coexist on the same physical nodes in the Hadoop cluster - so MapReduce can process exceedingly large amounts of data without being affected by traditional bottlenecks like network bandwidth by taking advantage of this data proximity.
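Production Map/Reduce jobs are usually written in Java, but the bundled Hadoop Streaming utility gives a quick feel for the paradigm by letting any executable act as mapper and reducer. A sketch (the streaming jar's exact path varies by distribution; input/output paths are examples):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /raw_data/my_raw_datafile.csv \
  -output /raw_data/streaming_output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc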
Apache Mahout (Machine Learning) – Scalable machine learning framework
Mahout provides the core algorithms for clustering, classification and collaborative filtering, implemented on top of the scalable, distributed Hadoop platform, such as:
- Recommendation mining: takes user behavior and from that tries to find items users might like.
- Clustering: takes data (e.g. documents) and groups it into groups of topically related data or documents.
  • 20. - Classification: learns from existing categorized data what data of a specific category looks like, and is able to assign unlabeled documents to the correct category.
Note that the frameworks detailed above are just some of the most popular Hadoop frameworks which run under YARN; many more exist and are being developed.
NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812
 All Rights Reserved. Do not Distribute.