Cloudera Hadoop 

as your Data Lake

Introduction to BigData and Hadoop for beginners
David Yahalom, CTO

NAYA Technologies
davidy@naya-tech.co.il

www.naya-tech.com



2015, All Rights Reserved
NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812

All Rights Reserved. Do not Distribute.
About NAYA Technologies
Global leader in Data Platform consulting and managed services.
Established in 2009, NAYA is a leading provider of Data Platform managed services
with emphasis on planning, deploying, and managing business critical database
systems for large enterprises and leading startups.
Our company provides everything from data platform architecture design through
implementation to 24/7/365 support for mission-critical systems.
NAYA is one of the fastest-growing consultancies in the market, with teams that provide
clients with the peace of mind they need when it comes to their critical data and
database systems.
NAYA delivers the most professional, respected and experienced consultants in the
industry. The company uses multi-national consulting teams that can manage projects
consistently across time zones.
BigData as a “Game Changer”

• What is BigData?
• What makes BigData different?

As data becomes increasingly complex and difficult to manage and analyze,
organizations are looking for new solutions that span beyond the scope of the
traditional RDBMS.
It used to be very simple! A decade ago, everything was running on top of relational
databases - Realtime, Analytics, BI, OLTP, OLAP, batch …

Back then, data sets were much smaller and usually well structured (native to a
relational database), so a single database paradigm - the relational database -
was a great match for all data requirements, supporting all major use cases.
Things aren’t so simple anymore. In the past few years, the nature of our data has
changed: datasets have become larger and more complex, and the rate of data flow
both in and out of our databases has increased tremendously. This change brought
about a new way to think about database platforms.
The traditional role of the relational database as the single and unified platform for
all types of data is no more. The market has evolved to embrace a more specialized
approach where different database technologies are used to store and process
different sets of data.
The Challenge of “BigData”

• How do I know if I have a “BigData problem”?



The changing nature of data can be discussed in terms of Volume, Velocity and
Variety. These are the differentiating factors which help separate classic data use
cases from next-generation ones. These “three Vs” are the business challenges
which force organizations to look beyond the traditional RDBMS as the sole data
platform.
Volume
• Collecting and analyzing more data helps make more educated business
decisions. We want to store all data that has or might have business value.
Don’t throw anything away as you never know when a piece of data will be
valuable for your organization.
• Flexibility in the ability to store data is also extremely important.
Organizations require solutions that can scale easily. You might have only 1
Terabyte of data today but that may increase to 10 Terabytes in a few years
and your data architecture must seamlessly and easily support scalability
without “throw away” architectures.
Velocity
• The rate of data collected, data which is flowing into our applications and
systems, is increasing dramatically. Thousands, tens of thousands or even
hundreds of thousands of critical business events are generated by our
applications and systems every second. These business events are meaningful
to us and have to be stored, cataloged and analyzed.
• Rapid data ingestion isn’t the only challenge: users are demanding realtime
access to analytics based on up-to-date data. No longer can we provide users
with reports based on yesterday’s data. No longer can we rely on periodic nightly
ETL jobs. Data needs to be fresh and immediately available to users for
analytics as it is being generated.
Variety
• Traditional data sets used to be strictly structured, either natively or after
an ETL created the structure – ETLs which are slow, non-scalable, difficult to change
and prone to errors and failures. Nowadays, applications need to store
different types of data, some structured and some unstructured: data
generated from social networks, sensors, application logs, user interactions,
geo-spatial data, etc. This data is much more complex and has to be
made accessible for processing and analysis alongside more traditional data
models.
• In addition, different applications with different data structures and use cases
can benefit from different processing frameworks / paradigms. Some datasets
require batch processing (such as recommendation engines) while other
datasets rely on realtime analytics (such as fraud detection). Flexibility in data
access APIs – a “best of breed” approach – benefits users by making complex
data easily accessible to everyone in the organization.


Enter the world of NoSQL databases

• What are NoSQL databases and how do they relate to BigData?
• How are NoSQL databases different compared to traditional SQL-based databases?
The solution to the challenges we described? The next generation of NoSQL databases -
databases which try to address the “Volume, Velocity, Variety” challenges by thinking
outside the box.
Remember, relational databases are optimized for storing structured data, are
difficult to scale and rely on SQL for data retrieval. They are optimized for some use
cases, but not all.
NoSQL databases, on the other hand, are designed to store and process large amounts
of data (Volume, Velocity), scale easily (Volume), handle complex data (Variety) and
provide immediate access (Velocity) to fresh data.
Relational databases:
• Structured: data is stored in tables. Tables have data types, primary keys and
constraints.
• Transactional: data can be inserted and manipulated in “grouped units” =
transactions. We can commit and rollback.
• Versatile but limited: traditional relational databases can do OLTP, OLAP, DWH
and batch but are generally not specialized.
• Examples: Oracle, SQL Server, DB2, MySQL, PostgreSQL.
• Do not easily scale-out: traditional relational databases usually rely on a single
database instance; scale-out requires manual sharding, complex application-
level Data Access Layers (DALs) or expensive and specialized hardware.
• Well-known and easy to work with: everyone knows the RDBMS and SQL.

NoSQL databases:
• Non-structured or semi-structured data model: NoSQL databases usually
provide a flexible, often schema-less data model, support un/semi-structured
data and allow rapid data model changes. Some NoSQL databases provide native
JSON support; others provide a BigTable-type data model.
• Extremely Scalable: designed to be scalable from the ground up; usually
deployed in a cluster architecture to achieve easy and rapid scalability.
• Usually specialized: Specific NoSQL database technologies are designed for
specific use cases. 

High-volume operational processing? HBase. Advanced analytics? Hadoop.
• Examples: Hadoop, HBase, MongoDB, CouchBase, Cassandra, etc.
• Variety of data retrieval and development APIs: each NoSQL database has its
own unique query API and query language. Some even support SQL; some do
not.
BigData as a “One Liner” 

Generating value from large datasets that cannot be analyzed using traditional
technologies.
Hadoop as your Data Lake
• How does Hadoop fit the BigData picture?
Apache Hadoop is an open source data platform facilitating a fundamentally new way
of storing and processing data. Instead of relying on expensive, proprietary hardware
and different systems to store and process different types of data, Hadoop allows for
centralized, distributed parallel processing of huge amounts of data across
inexpensive, industry-standard servers.
Hadoop can become your organization’s centralized master location for all raw data,
structured or unstructured, and thus become a central “Data Lake” to which all other
databases, data silos and applications can connect and retrieve data.
Hadoop doesn’t just store your data: all data in Hadoop can be easily accessed using
multiple frameworks and APIs.
Data can be ingested into Hadoop without pre-processing or the need for complex ETL.
You can just load the data as-is in near realtime. This minimizes any processing
overhead when storing raw data and does not require changing the way your data
looks so that it can fit a particular target schema. Changes to the raw data do not
mandate changing the data model during data ingestion. The data model is usually
created during queries (reads) and not during data load (writes).
Hadoop provides a “store everything now and decide how to access later” approach.



The “store everything now and decide how to process later”
architecture


- All required raw data is ingested in near realtime into a Hadoop cluster from
both unstructured and structured sources.
- Once loaded into Hadoop, all of your data is immediately accessible for all the
different use cases in your organization.
With Hadoop, no data is “too big” or “too complex”.
[Diagram: all valuable data, both raw and processed, from unstructured and relational sources, is loaded into Hadoop with NO ETL; once stored in Hadoop, data can be accessed anytime and queried using a variety of APIs and data access frameworks for both batch and realtime processing.]
Cloudera Hadoop
• What is Cloudera Hadoop and how is it different from plain “Hadoop”?
• What is the difference between Cloudera Express and Enterprise?
Hadoop is an open-source platform; Cloudera provides a pre-packaged, tested and
enhanced open-source distribution of the Hadoop platform. The relation between
Cloudera Hadoop and “vanilla Hadoop” can be thought of as similar to the relation
between Red Hat Linux and “vanilla Linux”.
Cloudera is one of the leading innovators in the Hadoop space and one of the largest
contributors to the open-source Apache Hadoop ecosystem.
Cloudera packages the Hadoop source code in a special distribution which includes
enhanced Cloudera-developed Hadoop capabilities (such as Impala for interactive
SQL-based analytics), graphical web user interfaces for cluster management and
development (Cloudera Manager / HUE), as well as important Hadoop bug fixes and
24x7 support.


Cloudera Hadoop is available in both Express and Enterprise editions.
• Cloudera Express is the free-to-use version of Cloudera Hadoop; it supports
unlimited cluster sizes and runs all the Apache Hadoop features without any
limitations. Cloudera Express includes the Cloudera Manager web UI.
• Cloudera Enterprise includes support directly from Cloudera and some cluster
management enhancements such as rolling upgrades, SNMP alerts, etc.








In addition to the core Hadoop components - HDFS & YARN, which we will discuss later -
Cloudera Hadoop (both Express and Enterprise) also includes multiple supplementary
open-source Hadoop ecosystem components which come bundled as part of the
Cloudera Hadoop installation. The Hadoop ecosystem components complement each
other and allow Hadoop to reach its full potential.

Components such as HBase (online near-realtime key/value access for “operational”
database use cases), Impala (interactive SQL-based analytics on top of Hadoop
data), Spark (in-memory analytics and stream data processing) and more.





The Hadoop Architecture
• What does a Hadoop cluster look like?
At a high level, Hadoop uses a Master/Slave architecture where the master
nodes are responsible for providing cluster-wide services (such as resource scheduling
and coordination, or storing metadata for data which resides in Hadoop) and the slave
nodes are responsible for actual data storage and processing. Both master and
slave nodes are highly available: more than one master node can be brought online for
failover purposes, and multiple slave nodes will always be online due to the distributed
nature of Hadoop.
The core of Hadoop is made of two components which provide scalable & highly
available data storage and fast & flexible data retrieval.
• HDFS – Hadoop’s distributed filesystem. The core Hadoop component that is
responsible for storing data in a highly available way.
• YARN – Hadoop’s job scheduling and data access resource management
framework allowing fast, parallel processing of data stored in HDFS.
Both HDFS and YARN are deployed in a Master/Slave architecture:



The HDFS master node is responsible for handling file system metadata while the slave
nodes store the actual business data.



The YARN master node is responsible for cross-cluster resource scheduling and job
execution while the slave nodes are responsible for actually executing user queries
and jobs.
These two core components work together seamlessly to provide:
• High Availability of your data - Hadoop provides an internal distributed
storage architecture that allows for protection against multiple kinds of data
loss, from single-block corruption to complete server or rack failures. Automatic
re-balancing of the Hadoop cluster is done in the background to ensure
constant availability for your data and sustained workloads.
• Scalability - Hadoop clusters can scale virtually without limits. 

Adding new Hadoop nodes to an existing cluster can be done online without any
downtime or interruption of existing workloads. 

Because each Hadoop “worker node” in the cluster is a server equipped with its
own processor cores and hard drives, adding new nodes to your Hadoop cluster
adds both storage capacity as well as computation capacity.

When scaling Hadoop, you are not just expanding your data storage capability
but also increasing your data processing power.



This method of scaling can be considered a paradigm shift compared to the
traditional database model where scaling the storage does not also increase
data retrieval performance - so you end up with the capacity to store more
data but without the capacity to quickly query it.
• Data Model Flexibility - Hadoop can handle any and all types of data. The
underlying Hadoop HDFS filesystem allows for storing any type of structured or
unstructured data. During data load, Hadoop is agnostic to the data model and
can store JSON documents, CSVs, Tab Delimited Files, unstructured text files,
XML files, Binary files – you name it!

No need for expensive ETL or data pre-processing during data load.
With Hadoop you can first load your data and later decide on how to query it
and what is the data model. This is also known as “schema on read”.



This approach decouples the application data model (schema, data types,
access patterns) from data storage and is considered an essential requirement
for a scalable and flexible next generation database. “Store everything and
decide how to query it later”.



In addition to flexible data models, Hadoop also provides flexible data access
with a “pluggable” architecture allowing for multiple query APIs and data
processing frameworks on top of the same dataset.
Hadoop HDFS
• How is data stored in Hadoop? Tables? Files?
The first component of the Core Hadoop architecture is a fault tolerant and self-
healing distributed file system designed to turn a cluster of industry standard servers
into a massively scalable pool of storage.
Developed specifically for large-scale data processing workloads where scalability,
flexibility and throughput are critical, HDFS accepts data in any format regardless of
schema, optimizes for high bandwidth streaming, and scales to proven deployments of
100PB and beyond.
Scale-Out Architecture - add servers to increase storage capacity.

High Availability - serve mission-critical workflows and applications.

Fault Tolerance - automatically and seamlessly recover from failures without affecting
data availability.

Load Balancing - place data intelligently across cluster nodes for maximum efficiency
and utilization.

Tunable Replication - multiple copies of each piece of data provide failure protection
and computational performance.

Security - optional LDAP integration.



As the name suggests, HDFS is the Hadoop Distributed File System. As such, HDFS
behaves in a similar way to traditional Linux/Unix filesystems. At its lowest level,
Hadoop stores data as files which are made of individual blocks.
During data ingestion into Hadoop, HDFS stripes your loaded files across all nodes in
the cluster, with replication for fault tolerance. A file loaded into Hadoop is
split into multiple individual blocks which are spread across the entire cluster.
Each block is stored more than once, on more than one server.

The replication factor (number of block copies) and block size are configurable on a
per-file basis.
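
For example, the replication factor of an existing file can be changed with “-setrep”, and a block size can be requested at write time with a “-D” configuration override. A minimal sketch via the “hadoop fs” command line (introduced below), wrapped in Python; the paths are hypothetical and the block-size property name (dfs.blocksize) is per Hadoop 2.x:

import subprocess

# Hedged sketch: per-file replication and block size, hypothetical paths.
# "-setrep -w 2" sets the replication factor of an existing HDFS file and
# waits for re-replication to finish; "-D dfs.blocksize=..." requests a
# 256 MB block size for the file being written (property name per Hadoop 2.x).
subprocess.run(["hadoop", "fs", "-setrep", "-w", "2",
                "/user/hadoop/hdfs_dir/hdfs_file.txt"], check=True)
subprocess.run(["hadoop", "fs", "-D", "dfs.blocksize=268435456",
                "-put", "big_local_file.bin", "/user/hadoop/hdfs_dir/"], check=True)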
Working with HDFS, at its lowest level, is simple. You can either access the HDFS file
browser using the HUE web interface:
Using the Hadoop HUE WebUI to browse HDFS
Or you can use the Hadoop command line tool (“hadoop”) that is part of the Hadoop
client. Specifying the “fs” argument allows for interacting with the filesystem of a
remote Hadoop cluster:
Some more very basic examples:

# create a directory on HDFS
hadoop fs -mkdir /user/hadoop/dir1

# list the contents of an HDFS directory
hadoop fs -ls /user/hadoop/dir1

# recursively remove an HDFS directory
hadoop fs -rm -r /user/hadoop/dir1

# copy a file from the local filesystem into HDFS
hadoop fs -put /path_to_local_dir/local_file.txt /user/hadoop/hdfs_dir/hdfs_file.txt
Note that the paths shown in the examples above are HDFS paths, not paths on the
local file system of the machine where the “hadoop fs” command line is executed.
It’s important to note that while writing / reading files from HDFS is the lowest-level
access to HDFS, end-users (developers, data analysts) working with Hadoop rely on
several other Hadoop data access frameworks which allow queries and data processing
on top of HDFS-stored data without having to directly interact with the Hadoop
filesystem. Frameworks such as Cloudera Impala or HIVE allow end-users to write SQL
queries on top of data stored in HDFS.
A SQL query used to directly access and visualize data from HDFS using the Hadoop HUE web UI.
Bottom line - HDFS is the Hadoop filesystem, the low-level data storage layer. Users
can interact with HDFS using both the HUE WebUI and the Hadoop command line.
Using these tools you can treat HDFS as if it is a regular (but distributed) filesystem -
create directories, write files, read files, delete files, etc… 



All data in Hadoop, at its lowest level, is just files on HDFS. 

“Structure” (semantics, tables, records, fields) is created when accessing data, not
when writing it.
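
To make this concrete, here is a minimal sketch of the load side of schema-on-read (file and folder names are hypothetical): raw data is simply copied into HDFS as-is, and structure is layered on later, for example by the Impala CREATE EXTERNAL TABLE statement shown further below.

import subprocess

# Hedged sketch: load raw data into HDFS with no schema declared up front.
# Folder and file names are hypothetical.
subprocess.run(["hadoop", "fs", "-mkdir", "-p", "/some_hdfs_folder"], check=True)
subprocess.run(["hadoop", "fs", "-put", "local_events.csv",
                "/some_hdfs_folder/"], check=True)
# Tables, columns and types are defined only at read time, e.g. by an
# Impala or HIVE external table whose LOCATION points at /some_hdfs_folder/.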
Hadoop YARN
• Once data is stored in Hadoop, how can we coordinate access to it?
The second component of the core Hadoop architecture is the data processing,
resource management and scheduling framework called YARN. 



Different workloads (realtime and batch) can co-exist on your Hadoop cluster. YARN
facilitates scheduling, resource management and application/query-level execution
failure protection for all types of Hadoop workloads.
If Hadoop HDFS takes care of data storage, YARN takes care of managing data
retrieval.
With YARN, data processing workloads are executed at the same location where the
data is stored, rather than relying on moving data from a dedicated storage tier to a
database tier.
Data storage and computation coexist on the same physical nodes in the cluster.
Workloads running in Hadoop under YARN can process exceedingly large amounts of
data without being affected by traditional bottlenecks like network bandwidth by
taking advantage of this data proximity.
Scale-out architecture - adding servers to your Hadoop cluster increases both
processing power and storage capacity.



Security & authentication – YARN works with HDFS security to make sure that only
approved users can operate against the data in Hadoop.



Resource manager and job scheduling – YARN employs data locality and manages
cluster resources intelligently to determine optimal locations (nodes) across the
cluster for data processing while allowing both long-running (batch) and short-running
(realtime) applications to co-exist and access the same datasets.



Flexibility – YARN allows for various data processing APIs and query frameworks to
work on the same data at the same time. Some of the Hadoop data processing
frameworks running under YARN are optimized for batch analytics while others
provide near realtime in-memory event processing thus providing a “best of breed”
approach for accessing your data based on your use cases.



Resiliency & high availability – YARN runs as a distributed architecture across your
Hadoop cluster, ensuring that if a submitted job or query fails, it can independently
and automatically restart and resume processing. No user intervention is required.
When data stored in Hadoop is accessed, data processing is distributed across all
nodes in the cluster. Distributed data sets are pieced together automatically, providing
parallel reads and processing to construct the final output.
Bottom line: YARN, by itself, isn’t a “query engine” or a data processing framework in
Hadoop. It’s a cluster resource manager and coordinator that allows various data
processing and query engines (discussed later) to access data stored in Hadoop and
“play nice” with one another - that is, share cluster resources (CPU cores, memory,
etc.).
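
For a concrete touchpoint, the standard YARN command line exposes this coordination layer. A minimal sketch (assuming the Hadoop client tools are on the PATH) that asks the ResourceManager which applications it is tracking:

import subprocess

# Hedged sketch: list applications known to the YARN ResourceManager.
# "yarn application -list" is part of the standard YARN CLI; the output
# format varies between Hadoop versions.
subprocess.run(["yarn", "application", "-list"], check=True)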
Hadoop Query APIs and Data Processing Frameworks
• What do end-users (data analysts, developers, etc.) actually use to query and
process data in Hadoop?
Unlike most traditional relational databases, which only offer SQL-based access to
data, Hadoop provides a variety of APIs, each optimized for specific use cases.
While SQL has its benefits in simplicity and very short development cycles, it is
limited when taxed with more complex computation or analytical workloads.
Continuing Hadoop’s “one size does not fit all” approach, multiple different
“pluggable” APIs and processing languages are available, each specifically designed to
address an individual use case with custom-tailored performance and flexibility.
Identify your data processing use case and then select the best optimized framework
for the job.
Some of these modern frameworks for retrieving and processing data stored in Hadoop
are:
Cloudera Impala (Interactive SQL) – high-performance interactive access
to data via SQL.

Impala provides SQL-based data retrieval in Hadoop with latencies on the order of seconds.

Impala is a fully integrated, state-of-the-art analytic Hadoop database engine
specifically designed to leverage the flexibility and scalability strengths of Hadoop -
combining the familiar SQL language and multi-user performance of a traditional
analytic database with the performance and scalability of Hadoop. Impala workloads
are not converted to Map/Reduce when executed and access Hadoop data directly. 



Example Impala create table statement + query:

CREATE EXTERNAL TABLE tab2
(
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/some_hdfs_folder/';  -- note: this is the location of a folder on HDFS

-- the query runs on top of the previously created table, accessing HDFS data
SELECT tab2.*
FROM tab2,
  (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
   FROM tab2, tab1
   WHERE tab1.id = tab2.id
   GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;


HIVE (Batch SQL) – batch-optimized SQL interface

HIVE allows non-developers to access data directly from Hadoop using the SQL
language while providing batch processing optimizations.
HIVE automatically converts SQL code to Map/Reduce programs and pushes them onto
the Hadoop cluster. Because HIVE leverages Map/Reduce, it is suited for batch
processing and provides the same performance, reliability and scalability which are
core strengths of Map/Reduce on Hadoop.
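
For the same flavor of example as the other frameworks, here is a minimal, hedged sketch of submitting a batch HiveQL query from a script. The table and column names reuse the hypothetical tab2 from the Impala example above (Impala and HIVE share the same metastore), and “hive -e” executes a query string from the shell:

import subprocess

# Hedged sketch: submit a batch HiveQL aggregation through the Hive CLI.
# "hive -e" executes the query string; tab2 is the hypothetical table
# created in the Impala example above.
query = "SELECT col_1, COUNT(*) AS cnt FROM tab2 GROUP BY col_1;"
subprocess.run(["hive", "-e", query], check=True)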
Spark (In-memory “fast” processing) – next generation memory-
optimized data processing engine 

Spark is an extremely fast, memory-optimized, general-purpose processing engine,
considered the next-generation data processing framework for Hadoop.

Spark exposes a functional data processing API and supports development in Python,
Java and Scala. Spark is designed for both batch processing workloads as well as
streaming workloads (using Spark Streaming), interactive queries, and machine
learning.
Example Spark word count application in Python:

# Assumes a SparkContext named sc, e.g. as provided by the pyspark shell
# Open a CSV file on Hadoop HDFS
text_file = sc.textFile("hdfs://raw_data/my_raw_datafile.csv")

# Count the words in the file
counts = (text_file.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
Map/Reduce (BATCH) – distributed batch processing framework

MapReduce is the original core of processing in Hadoop. Map/Reduce is a
programming paradigm which allows data processing to scale massively
across hundreds or thousands of servers in a Hadoop cluster.
With MapReduce and Hadoop, compute workloads are executed in the same location
as the data. Data storage and computation coexist on the same physical nodes in the
Hadoop cluster. MapReduce processes exceedingly large amounts of data without
being affected by traditional bottlenecks like network bandwidth by taking advantage
of this data proximity.
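
MapReduce jobs are traditionally written in Java, but the Hadoop Streaming utility that ships with Hadoop lets any executable act as the mapper and reducer. A minimal word-count sketch in Python (file names hypothetical), submitted with the usual “hadoop jar hadoop-streaming.jar -input … -output … -mapper mapper.py -reducer reducer.py” invocation:

# mapper.py - hedged sketch of a Hadoop Streaming mapper (word count).
# Streaming feeds raw input lines on stdin; we emit tab-separated key/value pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

# reducer.py - the matching Streaming reducer.
# Hadoop sorts mapper output by key, so all counts for one word arrive contiguously.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))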
Apache Mahout (Machine Learning) – scalable machine learning
framework.

Mahout provides core algorithms for clustering, classification and collaborative
filtering, implemented on top of the scalable, distributed Hadoop platform,
such as:

- Recommendation mining: takes user behavior and from that tries to find
items users might like.
- Clustering: takes data (such as documents) and groups it into topically
related groups.
- Classification: learns from existing categorized data what data of a specific
category looks like, and is able to assign unlabeled documents to the correct
category.
Note that the frameworks detailed above are just some of the most popular Hadoop
frameworks which run under YARN. Many more exist and are being developed.
NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812

All Rights Reserved. Do not Distribute.

More Related Content

What's hot

How to create a workflow
How to create a workflow How to create a workflow
How to create a workflow Atlassian
 
Evolution of the SAP User Experience and Technology Stack
Evolution of the SAP User Experience and Technology StackEvolution of the SAP User Experience and Technology Stack
Evolution of the SAP User Experience and Technology StackVictor Ionescu
 
SAP HCM Consultant
SAP HCM ConsultantSAP HCM Consultant
SAP HCM ConsultantIT LearnMore
 
Oracle
OracleOracle
Oraclensah
 
대학교 취업성공프로젝트 찬스 제안서
대학교 취업성공프로젝트 찬스 제안서대학교 취업성공프로젝트 찬스 제안서
대학교 취업성공프로젝트 찬스 제안서the Learning & Company
 
Grafana is not enough: DIY user interfaces for Prometheus
Grafana is not enough: DIY user interfaces for PrometheusGrafana is not enough: DIY user interfaces for Prometheus
Grafana is not enough: DIY user interfaces for PrometheusWeaveworks
 
Salesforce point of License 20200819
Salesforce point of License 20200819Salesforce point of License 20200819
Salesforce point of License 20200819Hiroki Iida
 
SAP HANA Implementation A Complete Guide.pdf
SAP HANA Implementation A Complete Guide.pdfSAP HANA Implementation A Complete Guide.pdf
SAP HANA Implementation A Complete Guide.pdfZoe Gilbert
 
【●●株式会社 御中】提案資料 2015.04.11
【●●株式会社 御中】提案資料 2015.04.11【●●株式会社 御中】提案資料 2015.04.11
【●●株式会社 御中】提案資料 2015.04.11Naomichi Sawamura
 
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例Amazon Web Services Japan
 
Salesforceでの大規模データの取り扱い
Salesforceでの大規模データの取り扱いSalesforceでの大規模データの取り扱い
Salesforceでの大規模データの取り扱いSalesforce Developers Japan
 

What's hot (14)

How to create a workflow
How to create a workflow How to create a workflow
How to create a workflow
 
Evolution of the SAP User Experience and Technology Stack
Evolution of the SAP User Experience and Technology StackEvolution of the SAP User Experience and Technology Stack
Evolution of the SAP User Experience and Technology Stack
 
K smart lighthouse factory(acs)
K smart lighthouse factory(acs)K smart lighthouse factory(acs)
K smart lighthouse factory(acs)
 
Business workflow
Business workflowBusiness workflow
Business workflow
 
Camunda BPM 7.12 Release Webinar
Camunda BPM 7.12 Release WebinarCamunda BPM 7.12 Release Webinar
Camunda BPM 7.12 Release Webinar
 
SAP HCM Consultant
SAP HCM ConsultantSAP HCM Consultant
SAP HCM Consultant
 
Oracle
OracleOracle
Oracle
 
대학교 취업성공프로젝트 찬스 제안서
대학교 취업성공프로젝트 찬스 제안서대학교 취업성공프로젝트 찬스 제안서
대학교 취업성공프로젝트 찬스 제안서
 
Grafana is not enough: DIY user interfaces for Prometheus
Grafana is not enough: DIY user interfaces for PrometheusGrafana is not enough: DIY user interfaces for Prometheus
Grafana is not enough: DIY user interfaces for Prometheus
 
Salesforce point of License 20200819
Salesforce point of License 20200819Salesforce point of License 20200819
Salesforce point of License 20200819
 
SAP HANA Implementation A Complete Guide.pdf
SAP HANA Implementation A Complete Guide.pdfSAP HANA Implementation A Complete Guide.pdf
SAP HANA Implementation A Complete Guide.pdf
 
【●●株式会社 御中】提案資料 2015.04.11
【●●株式会社 御中】提案資料 2015.04.11【●●株式会社 御中】提案資料 2015.04.11
【●●株式会社 御中】提案資料 2015.04.11
 
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例
[よくわかるクラウドデータベース] Amazon RDS for SQL Server導入事例
 
Salesforceでの大規模データの取り扱い
Salesforceでの大規模データの取り扱いSalesforceでの大規模データの取り扱い
Salesforceでの大規模データの取り扱い
 

Viewers also liked

Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseAmazon Web Services
 
The Future of Data
The Future of DataThe Future of Data
The Future of Datablynnbuckley
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Hortonworks
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep duttaCapgemini
 
Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Holden Ackerman
 
Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake Olga Zinkevych
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWSAmazon Web Services
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...Lucas Jellema
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Viewers also liked (19)

Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
 
The Future of Data
The Future of DataThe Future of Data
The Future of Data
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
 
Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI
 
Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar to A beginners guide to Cloudera Hadoop

Introduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQLIntroduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQLTushar Shende
 
How To Tell if Your Business Needs NoSQL
How To Tell if Your Business Needs NoSQLHow To Tell if Your Business Needs NoSQL
How To Tell if Your Business Needs NoSQLDataStax
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lakesambiswal
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digitalsambiswal
 
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRobertsWP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRobertsJane Roberts
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKRajesh Jayarman
 
Introduction to NoSQL database technology
Introduction to NoSQL database technologyIntroduction to NoSQL database technology
Introduction to NoSQL database technologynicolausalex722
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
 
Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...Sheena Crouch
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lakepunedevscom
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Denodo
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchSheetal Pratik
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databasesJames Serra
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Big data - Cassandra
Big data - CassandraBig data - Cassandra
Big data - CassandraJen Wei Lee
 
Ijaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerIjaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerijaprr
 

Similar to A beginners guide to Cloudera Hadoop (20)

Introduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQLIntroduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQL
 
How To Tell if Your Business Needs NoSQL
How To Tell if Your Business Needs NoSQLHow To Tell if Your Business Needs NoSQL
How To Tell if Your Business Needs NoSQL
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lake
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digital
 
Big data rmoug
Big data rmougBig data rmoug
Big data rmoug
 
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRobertsWP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Introduction to NoSQL database technology
Introduction to NoSQL database technologyIntroduction to NoSQL database technology
Introduction to NoSQL database technology
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...
 
The new EDW
The new EDWThe new EDW
The new EDW
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
 
Benefits of a data lake
Benefits of a data lake Benefits of a data lake
Benefits of a data lake
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Big data - Cassandra
Big data - CassandraBig data - Cassandra
Big data - Cassandra
 
Ijaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerIjaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseer
 

Recently uploaded

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 

Recently uploaded (20)

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 

A beginners guide to Cloudera Hadoop

  • 1. Cloudera Hadoop 
 as your Data Lake
 Introduction to BigData and Hadoop for beginners David Yahalom, CTO
 NAYA Technologies davidy@naya-tech.co.il
 www.naya-tech.com
 
 2015, All Rights reserved NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812
 All Rights Reserved. Do not Distribute.
  • 2. About NAYA Technologies Global leader in Data Platform consulting and managed services. Established in 2009, NAYA is a leading provider of Data Platform managed services with emphasis on planning, deploying, and managing business critical database systems for large enterprises and leading startups. Our company provides everything data platform architecture design through implementation and 24/7/365 support for mission critical systems. NAYA is one of the fastest growing consultants in the market with teams that provide clients with the peace of mind they need when it comes to their critical data and database systems. NAYA delivers the most professional, respected and experienced consultants in the industry. The company uses multi-national consulting teams that can manage projects consistently across time zones. NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812
  • 3. BigData as a “Game Changer”
 • What is BigData? • What makes BigData different ?
 As data becomes increasingly more complex and difficult to manage and analyze, organizations are looking for new solutions that span beyond the scope of the traditional RDBMS. It used to be very simple! A decade ago, everything was running on top of relational databases - Realtime, Analytics, BI, OLTP, OLAP, batch …
 Back then data sets were much smaller and usually well structured (native to a relational database) so a single type of database paradigm - the relational database was a great match for all data requirements supporting all major use cases. Things aren’t so simple anymore. In the past few years, the nature of our data has changed – datasets have become larger, more complex and include tremendous increases in rate of data flow both in and out of our databases. This change brought about a new way to think about database platforms. The traditional role of the relational database as the single and unified platform for all types of data is no more. The market has evolved to embrace a more specialized approach where different database technologies are used to store and process different sets of data. NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812
  • 4. The Challenge of “BigData”
 • How do I know if I have a “BigData problem” ?
 
The changing nature of data can be discussed in terms of Volume, Velocity, and Variety. These are the differentiating factors which help separate classic data use cases from next-generation ones. These "three Vs" are the business challenges which force organizations to look beyond the traditional RDBMS as the sole data platform.
Volume
• Collecting and analyzing more data helps make more educated business decisions. We want to store all data that has or might have business value. Don't throw anything away, as you never know when a piece of data will become valuable for your organization.
• Flexibility in the ability to store data is also extremely important. Organizations require solutions that can scale easily. You might have only 1 Terabyte of data today, but that may increase to 10 Terabytes in a few years, and your data architecture must seamlessly and easily support that growth without "throw away" architectures.
Velocity
• The rate of data collected - data flowing into our applications and systems - is increasing dramatically. Thousands, tens of thousands or even hundreds of thousands of critical business events are generated by our applications and systems every second. These business events are meaningful to us and have to be stored, cataloged and analyzed.
• Rapid data ingestion isn't the only challenge: users are demanding realtime access to analytics based on up-to-date data. No longer can we provide users with reports based on yesterday's data. No longer can we rely on periodic nightly ETL jobs. Data needs to be fresh and immediately available to users for analytics as it is being generated.
Variety
• Traditional data sets used to be strictly structured, either natively or after an ETL imposed structure - ETLs which are slow, non-scalable, difficult to change
  • 5. and prone to errors and failures. Nowadays, applications need to store different types of data, some structured and some unstructured: data generated from social networks, sensors, application logs, user interactions, geo-spatial data, etc. This data is much more complex and has to be made accessible for processing and analysis alongside more traditional data models.
• In addition, different applications with different data structures and use cases can benefit from different processing frameworks / paradigms. Some datasets require batch processing (such as recommendation engines) while other datasets rely on realtime analytics (such as fraud detection). Flexibility in data access APIs - a "best of breed" approach - can benefit users by making complex data easily accessible for everyone in our organization.
 Enter the world of NoSQL databases
• What are NoSQL databases and how do they relate to BigData? • How are NoSQL databases different compared to traditional SQL-based databases?
The solution to the challenges we described? The next generation of NoSQL databases: databases which try to address the "Volume, Velocity, Variety" challenges by thinking outside the box. Remember, relational databases are optimized for storing structured data, are difficult to scale, and rely on SQL for data retrieval. They are optimized for some use cases, but not all. NoSQL databases, on the other hand, are designed to store and process large amounts of data (Velocity, Volume), to scale out (Volume), to handle complex data (Variety), and to provide immediate access to fresh data (Velocity).
Relational databases:
• Structured: data is stored in tables. Tables have data types, primary keys and constraints.
• Transactional: data can be inserted and manipulated in "grouped units" = transactions. We can commit and rollback.
  • 6. • Versatile but limited: traditional relational databases can do OLTP, OLAP, DWH and batch, but are generally not specialized.
• Examples: Oracle, SQL Server, DB2, MySQL, PostgreSQL.
• Do not easily scale out: traditional relational databases usually rely on a single database instance; scale-out requires manual sharding, complex application-level Data Access Layers (DALs), or expensive and specialized hardware.
• Well-known and easy to work with: everyone knows the RDBMS and SQL.
NoSQL databases:
• Non-structured or semi-structured data model: NoSQL databases usually provide a flexible data model, support un/semi-structured data, are schema-less, and support rapid data model changes. Some NoSQL databases provide native JSON support, others provide a BigTable-type data model.
• Extremely scalable: designed to be scalable from the ground up, usually deployed in a cluster architecture to achieve easy and rapid scalability.
• Usually specialized: specific NoSQL database technologies are designed for specific use cases.
High-volume operational processing? HBase. Advanced analytics? Hadoop.
• Examples: Hadoop, HBase, MongoDB, CouchBase, Cassandra, etc.
• Variety of data retrieval and development APIs: each NoSQL database has its own query API and query language. Some even support SQL, some do not.
BigData as a "One Liner"
Generating value from large datasets that cannot be analyzed using traditional technologies.
  • 7. Hadoop as your Data Lake
• How does Hadoop fit the BigData picture?
Apache Hadoop is an open source data platform facilitating a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process different types of data, Hadoop allows for centralized, distributed, parallel processing of huge amounts of data across inexpensive, industry-standard servers.
Hadoop can become your organization's centralized master location for all raw data, structured or unstructured, and thus become a central "Data Lake" to which all other databases, data silos and applications can connect and from which they can retrieve data. Hadoop doesn't just store your data; all data in Hadoop can be easily accessed using multiple frameworks and APIs.
Data can be ingested into Hadoop without pre-processing or the need for complex ETL - you can just load the data as-is in near realtime. This minimizes processing overhead when storing raw data and does not require changing the way your data looks so that it can fit a particular target schema. Changes to the raw data do not mandate changing the data model during data ingestion. The data model is usually created during queries (reads) and not during data load (writes). Hadoop provides a "store everything now and decide how to access later" approach.
 
  • 8. The “store everything now and decide how to process later“ architecture 
- All required raw data is ingested in near realtime into a Hadoop cluster, from both unstructured and structured sources.
- Once loaded into Hadoop, all of your data is immediately accessible for all the different use cases in your organization. With Hadoop, no data is "too big" or "too complex".
[Diagram: all valuable data, both raw and processed, flows from unstructured and relational sources into Hadoop with no ETL; once data is stored in Hadoop, it can be accessed anytime and queried using a variety of APIs and frameworks for both batch and realtime processing.]
  • 9. Cloudera Hadoop
• What is Cloudera Hadoop and how does it differ from plain "Hadoop"? • What is the difference between Cloudera Express and Enterprise?
Hadoop is an open-source platform; Cloudera provides a pre-packaged, tested and enhanced open-source distribution of the Hadoop platform. The relation between Cloudera Hadoop and "vanilla Hadoop" can be thought of as similar to the relation between Red Hat Linux and "vanilla Linux". Cloudera is one of the leading innovators in the Hadoop space and one of the largest contributors to the open source Apache Hadoop ecosystem.
Cloudera packages the Hadoop source code in a distribution which includes enhanced Cloudera-developed Hadoop capabilities (such as Impala for interactive SQL-based analytics), graphical web interfaces for cluster management and development (Cloudera Manager / HUE), as well as important Hadoop bug fixes and 24x7 support.
Cloudera Hadoop comes in both Express and Enterprise editions.
• Cloudera Express is the free-to-use version of Cloudera Hadoop, with support for unlimited cluster size; it runs all the Apache Hadoop features without any limitations. Cloudera Express includes the Cloudera Manager web UI.
• Cloudera Enterprise adds support directly from Cloudera and some cluster management enhancements such as rolling upgrades, SNMP alerts, and more.
 
 
 
  • 10. In addition to the core Hadoop components - HDFS & YARN, which we will discuss later - Cloudera Hadoop (both Express and Enterprise) also includes multiple supplementary open-source Hadoop Ecosystem components which come bundled as part of the Cloudera Hadoop installation. The Hadoop ecosystem components complement each other and allow Hadoop to reach its full potential.
Components such as HBase (online near-realtime key/value access for "operational" database use cases), Impala (interactive SQL-based analytics on top of Hadoop data), Spark (in-memory analytics and stream data processing) and more.
 
 
  • 11. The Hadoop Architecture
• What does a Hadoop cluster look like?
At a high level, Hadoop is built on a Master/Slave architecture where the master nodes are responsible for providing cluster-wide services (such as resource scheduling and coordination, or storing metadata for data which resides in Hadoop) and the slave nodes are responsible for actual data storage and processing. Both master and slave nodes are highly available: more than one master node can be brought online for failover purposes, and multiple slave nodes are always online due to the distributed nature of Hadoop.
The core of Hadoop is made of two components which provide scalable & highly available data storage and fast & flexible data retrieval.
• HDFS – Hadoop's distributed filesystem. The core Hadoop component that is responsible for storing data in a highly available way.
• YARN – Hadoop's job scheduling and data access resource management framework, allowing fast, parallel processing of data stored in HDFS.
Both HDFS and YARN are deployed on Hadoop in a Master/Slave architecture:
 
The HDFS master node is responsible for handling filesystem metadata, while the slave nodes store the actual business data.
 
The YARN master node is responsible for cross-cluster resource scheduling and job execution, while the slave nodes are responsible for actually executing user queries and jobs.
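This master/slave split is easy to observe in practice. As a minimal sketch (assuming a node with the standard Hadoop client tools configured; exact output varies by version), the bundled command-line utilities report on both layers:

# Ask the HDFS master (NameNode) for a cluster report: total capacity,
# the number of live DataNodes (slaves) and per-node disk usage
hdfs dfsadmin -report

# Ask the YARN master (ResourceManager) which worker nodes (NodeManagers)
# are registered and available to run jobs
yarn node -list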
  • 12. These two core components work together seamlessly to provide:
• High Availability of your data - Hadoop provides an internal distributed storage architecture that protects against multiple kinds of data loss, from single-block corruption to complete server or rack failures. Automatic re-balancing of the Hadoop cluster is done in the background to ensure constant availability of your data and sustained workloads.
• Scalability - Hadoop clusters can scale virtually without limits.
 Adding new Hadoop nodes to an existing cluster can be done online without any downtime or interruption of existing workloads. 
Because each Hadoop "worker node" in the cluster is a server equipped with its own processor cores and hard drives, adding new nodes to your Hadoop cluster adds both storage capacity and computation capacity.
When scaling Hadoop, you are not just expanding your data storage capability but also increasing your data processing power.
 
This method of scaling can be considered a paradigm shift compared to the traditional database model, where scaling the storage does not also increase data retrieval performance - you end up with the capacity to store more data but without the capacity to quickly query it.
• Data Model Flexibility - Hadoop can handle any and all types of data. The underlying Hadoop HDFS filesystem allows for storing any type of structured or unstructured data. During data load, Hadoop is agnostic to the data model and can store JSON documents, CSVs, tab-delimited files, unstructured text files, XML files, binary files - you name it!
No need for expensive ETL or data pre-processing during data load. With Hadoop you can load your data first and decide later how to query it and what the data model is. This is also known as "schema on read".

This approach decouples the application data model (schema, data types, access patterns) from data storage and is considered an essential requirement for a scalable and flexible next-generation database: "store everything and decide how to query it later". (A concrete schema-on-read example appears later, in the Impala section, with CREATE EXTERNAL TABLE over an HDFS folder.)
 
In addition to flexible data models, Hadoop also provides flexible data access
  • 13. with a "pluggable" architecture allowing for multiple query APIs and data processing frameworks on top of the same dataset.
Hadoop HDFS
• How is data stored in Hadoop? Tables? Files?
The first component of the core Hadoop architecture is a fault-tolerant and self-healing distributed filesystem designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for large-scale data processing workloads where scalability, flexibility and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100PB and beyond.
Scale-Out Architecture - add servers to increase storage capacity.
High Availability - serve mission-critical workflows and applications.
Fault Tolerance - automatically and seamlessly recover from failures without affecting data availability.
Load Balancing - place data intelligently across cluster nodes for maximum efficiency and utilization.
Tunable Replication - multiple copies of each piece of data provide failure protection and computational performance.
Security - optional LDAP integration.
 
As the name suggests, HDFS is the Hadoop Distributed FileSystem. As such, HDFS behaves in a similar way to traditional Linux/Unix filesystems. At its lowest level, Hadoop stores data as files which are made of individual blocks. During data ingestion into Hadoop, HDFS stripes your loaded files across all nodes in the cluster, with replication for fault tolerance. A file loaded onto Hadoop will be split into multiple individual blocks which will be spread across the entire cluster, and each block will be stored more than once, on more than one server.
The replication factor (number of block copies) and the block size are configurable on a per-file basis.
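For example, both settings can be controlled from the standard Hadoop command line. A minimal sketch (paths, sizes and replication values below are hypothetical):

# Load a file with a 128 MB block size (given in bytes), then set its replication to 2
hadoop fs -D dfs.blocksize=134217728 -put local_file.txt /user/hadoop/data/
hadoop fs -setrep 2 /user/hadoop/data/local_file.txt

# Inspect how the file was split into blocks and where each replica is stored
hdfs fsck /user/hadoop/data/local_file.txt -files -blocks -locations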
  • 14. Working with HDFS, at its lowest level, is simple. You can either access the HDFS file browser using the HUE web interface:
[Screenshot: using the Hadoop HUE WebUI to browse HDFS]
Or use the Hadoop command-line tool ("hadoop") that is part of the Hadoop client. Specifying the "fs" argument allows for interacting with the filesystem of a remote Hadoop cluster:
  • 15. Some more very basic examples:
hadoop fs -mkdir /user/hadoop/dir1
hadoop fs -ls /user/hadoop/dir1
hadoop fs -rm -r /user/hadoop/dir1
hadoop fs -put /path_to_local_dir/local_file.txt /user/hadoop/hdfs_dir/hdfs_file.txt
Note that the paths shown in the examples above are HDFS paths, not paths on the local filesystem of the machine where the "hadoop fs" command line is executed.
It's important to note that while writing and reading files is the lowest level of access to HDFS, end-users (developers, data analysts) working with Hadoop rely on several other Hadoop data access frameworks which allow queries and data processing on top of HDFS-stored data without having to interact directly with the Hadoop filesystem. Frameworks such as Cloudera Impala or HIVE allow end-users to write SQL queries on top of data stored in HDFS.
[Screenshot: a SQL query used to directly access and visualize data from HDFS using the Hadoop HUE web UI]
Bottom line - HDFS is the Hadoop filesystem, the low-level data storage layer. Users can interact with HDFS using both the HUE WebUI and the Hadoop command line. Using these tools you can treat HDFS as if it were a regular (but distributed) filesystem - create directories, write files, read files, delete files, etc.
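Reading data back out, which the examples above don't show, uses the same pattern. A small sketch (paths are again just examples):

# Print an HDFS file's contents to the terminal
hadoop fs -cat /user/hadoop/hdfs_dir/hdfs_file.txt

# Copy a file from HDFS back to the local filesystem
hadoop fs -get /user/hadoop/hdfs_dir/hdfs_file.txt /tmp/

# Show space used, in human-readable units, under a directory
hadoop fs -du -h /user/hadoop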
  • 16. 
All data in Hadoop, at its lowest level, is just files on HDFS.
"Structure" (semantics, tables, records, fields) is created when accessing data, not when writing it.
Hadoop YARN
• Once data is stored in Hadoop, how can we coordinate access to it?
The second component of the core Hadoop architecture is the data processing, resource management and scheduling framework called YARN.
 
Different workloads (realtime and batch) can co-exist on your Hadoop cluster. YARN facilitates scheduling, resource management, and application/query-level execution failure protection for all types of Hadoop workloads. If Hadoop HDFS takes care of data storage, YARN takes care of managing data retrieval. With YARN, data processing workloads are executed at the same location where the data is stored, rather than relying on moving data from a dedicated storage tier to a database tier. Data storage and computation coexist on the same physical nodes in the cluster. Workloads running in Hadoop under YARN can process exceedingly large amounts of data without being affected by traditional bottlenecks like network bandwidth by taking advantage of this data proximity.
Scale-out architecture - adding servers to your Hadoop cluster increases both processing power and storage capacity.
 
 Security & authentication – YARN works with HDFS security to make sure that only approved users can operate against the data in Hadoop.
 
Resource management and job scheduling – YARN employs data locality and manages cluster resources intelligently to determine optimal locations (nodes) across the cluster for data processing, while allowing both long-running (batch) and short-running (realtime) applications to co-exist and access the same datasets.
 
Flexibility – YARN allows for various data processing APIs and query frameworks to work on the same data at the same time. Some of the Hadoop data processing frameworks running under YARN are optimized for batch analytics while others
  • 17. provide near-realtime in-memory event processing, thus providing a "best of breed" approach for accessing your data based on your use cases.
 
Resiliency & high availability – YARN runs as a distributed architecture across your Hadoop cluster, ensuring that if a submitted job or query fails, it can independently and automatically restart and resume processing. No user intervention is required.
When data stored in Hadoop is accessed, data processing is distributed across all nodes in the cluster. Distributed data sets are pieced together automatically, providing parallel reads and processing to construct the final output.
Bottom line: YARN, by itself, isn't a "query engine" or a data processing framework in Hadoop. It's a cluster resource manager and coordinator that allows various data processing and query engines (discussed later) to access data stored in Hadoop and "play nice" with one another - that is, share cluster resources (CPU cores, memory, etc.).
Hadoop Query APIs and Data Processing Frameworks
• What do end-users (data analysts, developers, etc.) actually use to query and process data in Hadoop?
Unlike most traditional relational databases, which only offer SQL-based access to data, Hadoop provides a variety of APIs, each optimized for specific use cases. While SQL has its benefits in simplicity and very short development cycles, it is limited when taxed with more complex computational or analytical workloads. Continuing Hadoop's "one size does not fit all" approach, multiple different "pluggable" APIs and processing languages are available, each specifically designed to address an individual use case with custom-tailored performance and flexibility.
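As a practical aside before surveying the individual frameworks, you can watch YARN arbitrating between these engines from the command line. A sketch assuming the standard YARN client (the application ID format below is just an example):

# List the applications (queries, jobs) currently running under YARN
yarn application -list

# Show status details for one application, or stop a misbehaving one
yarn application -status application_1420000000000_0001
yarn application -kill application_1420000000000_0001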
  • 18. Identify your data processing use case and then select the framework best optimized for the job. Some of these modern frameworks for retrieving and processing data stored in Hadoop are:
Cloudera Impala (Interactive SQL) – high-performance interactive access to data via SQL
Impala provides second-level latency (responses in seconds) for SQL-based data retrieval in Hadoop. Impala is a fully integrated, state-of-the-art analytic Hadoop database engine specifically designed to leverage the flexibility and scalability strengths of Hadoop, combining the familiar SQL language and multi-user performance of a traditional analytic database with the performance and scalability of Hadoop. Impala workloads are not converted to Map/Reduce when executed; they access Hadoop data directly.
 
Example Impala create table statement + query:

-- Create a table on top of an existing HDFS folder (note that the
-- LOCATION clause points to a folder on HDFS, not to a local path)
CREATE EXTERNAL TABLE tab2 (
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/some_hdfs_folder/';

-- The Impala query runs on top of the previously created table,
-- accessing the HDFS data directly
SELECT tab2.*
FROM tab2,
     (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
      FROM tab2, tab1
      WHERE tab1.id = tab2.id
      GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;
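Assuming the standard impala-shell client that ships with Cloudera Hadoop (the hostname below is hypothetical), the same statements can also be run non-interactively:

impala-shell -i impala-host.example.com -q "SELECT COUNT(*) FROM tab2;"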
HIVE (Batch SQL) – batch-optimized SQL interface
HIVE allows non-developers to access data directly from Hadoop using the SQL language while providing batch-processing optimizations. HIVE automatically converts SQL code into Map/Reduce programs and pushes them onto the Hadoop cluster. Because HIVE leverages Map/Reduce, it is suited for batch processing and provides the same performance, reliability and scalability which are the core strengths of Map/Reduce on Hadoop.
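HiveQL, Hive's SQL dialect, looks very much like the Impala example above. A minimal sketch (table name, columns and path are hypothetical):

-- Define a table over raw CSV files already sitting in HDFS
CREATE EXTERNAL TABLE page_views (
  user_id INT,
  url STRING,
  view_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/raw_data/page_views/';

-- This aggregation is compiled into Map/Reduce jobs behind the scenes
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;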
  • 19. Spark (In-memory "fast" processing) – next-generation memory-optimized data processing engine
Spark is an extremely fast, memory-optimized, general-purpose processing engine and is considered the next-generation data processing framework for Hadoop.
Spark exposes a functional data processing API and supports development in Python, Java and Scala. Spark is designed for batch processing workloads as well as streaming workloads (using Spark Streaming), interactive queries, and machine learning.
Example Spark word count application in Python:
 
# Open a CSV file on Hadoop HDFS (run from the pyspark shell,
# where the SparkContext is already available as "sc")
text_file = sc.textFile("hdfs://raw_data/my_raw_datafile.csv")

# Count the words in the file; Spark transformations are lazy,
# so an action (here, saving the result to an example output path)
# is what triggers the actual computation
counts = text_file.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://raw_data/word_counts")

Map/Reduce (BATCH) – Distributed batch processing framework
MapReduce is the original core of processing in Hadoop: the programming paradigm which allows data processing to scale massively across hundreds or thousands of servers in a Hadoop cluster. With MapReduce and Hadoop, compute workloads are executed in the same location as the data - data storage and computation coexist on the same physical nodes in the Hadoop cluster - so MapReduce can process exceedingly large amounts of data without being affected by traditional bottlenecks like network bandwidth by taking advantage of this data proximity.
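Production Map/Reduce jobs are usually written in Java, but the bundled Hadoop Streaming utility gives a quick feel for the paradigm by letting any executable act as mapper and reducer. A sketch (the streaming jar's exact path varies by distribution; input/output paths are examples):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /raw_data/my_raw_datafile.csv \
  -output /raw_data/streaming_output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc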
Apache Mahout (Machine Learning) – Scalable machine learning framework
Mahout provides the core algorithms for clustering, classification and collaborative filtering, implemented on top of the scalable, distributed Hadoop platform, such as:
- Recommendation mining: takes user behavior and from that tries to find items users might like.
- Clustering: takes data (e.g. documents) and groups it into groups of topically related data or documents.
  • 20. - Classification: learns from existing categorized data what data of a specific category looks like, and is able to assign unlabeled documents to the correct category.
Note that the frameworks detailed above are just some of the most popular Hadoop frameworks which run under YARN; many more exist and are being developed.
NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812
 All Rights Reserved. Do not Distribute.