Community Experience Distilled
Delve into the key concepts of Hadoop and get a thorough
understanding of the Hadoop ecosystem
Hadoop Essentials
Shiva Achari
This book introduces the components and tools of the Hadoop
ecosystem in a simplified manner, and provides you with the
skills to utilize them effectively for faster and more
effective development of Hadoop projects.
Starting with the concepts of Hadoop YARN, MapReduce,
HDFS, and other Hadoop ecosystem components, you
will soon learn many exciting topics such as MapReduce
patterns, data management, and real-time data analysis
using Hadoop. You will also get acquainted with many
Hadoop ecosystem tools, such as Hive, HBase, Pig,
Sqoop, Flume, Storm, and Spark.
By the end of the book, you will be confident enough to
begin working with Hadoop straightaway and to apply the
knowledge gained in your real-world scenarios.
Who this book is written for
If you are a system or application developer interested
in learning how to solve practical problems using the
Hadoop framework, then this book is ideal for you. This
book is also meant for Hadoop professionals who want to
find solutions to the different challenges they come across
in their Hadoop projects.
$ 29.99 US
£ 19.99 UK
Prices do not include
local sales tax or VAT
where applicable
Shiva Achari
Visit www.PacktPub.com for books, eBooks,
code, downloads, and PacktLib.
What you will learn from this book
 Get introduced to Hadoop, big data, and
the pillars of Hadoop such as HDFS,
MapReduce, and YARN
 Understand different use cases of Hadoop
along with big data analytics and real-time
analysis in Hadoop
 Explore the Hadoop ecosystem tools and
effectively use them for faster development
and maintenance of a Hadoop project
 Demonstrate YARN's capacity for database
processing
 Work with Hive, HBase, and Pig with Hadoop
to easily solve your big data problems
 Gain insights into widely used tools such
as Sqoop, Flume, Storm, and Spark using
practical examples
In this package, you will find:
 The author biography
 A preview chapter from the book, Chapter 2 'Hadoop Ecosystem'
 A synopsis of the book’s content
 More information on Hadoop Essentials
About the Author
Shiva Achari has over 8 years of extensive industry experience and is currently
working as a Big Data Architect consultant with companies such as Oracle and
Teradata. Over the years, he has architected, designed, and developed multiple
innovative and high-performance large-scale solutions, such as distributed systems,
data centers, big data management tools, SaaS cloud applications, Internet
applications, and Data Analytics solutions.
He is also experienced in designing big data and analytics applications, such as
ingestion, cleansing, transformation, correlation of different sources, data mining,
and user experience in Hadoop, Cassandra, Solr, Storm, R, and Tableau.
He specializes in developing solutions for the big data domain and possesses sound
hands-on experience on projects migrating to the Hadoop world, new developments,
product consulting, and POCs. He also has hands-on expertise in technologies such as
Hadoop, YARN, Sqoop, Hive, Pig, Flume, Solr, Lucene, Elasticsearch, ZooKeeper,
Storm, Redis, Cassandra, HBase, MongoDB, Talend, R, Mahout, Tableau, Java,
and J2EE.
He has been involved in reviewing Mastering Hadoop, Packt Publishing. Shiva has
expertise in requirement analysis, estimations, technology evaluation, and system
architecture along with domain experience in telecoms, Internet applications,
document management, healthcare, and media.
Currently, he is supporting presales activities such as writing technical proposals
(RFP), providing technical consultation to customers, and managing deliveries of
big data practice groups in Teradata.
He is active on his LinkedIn page.
Hadoop Essentials
Hadoop is a fascinating project that has seen a great deal of interest and
contributions from various organizations and institutions. Hadoop has come a
long way: from being a batch processing system, it has grown into a data lake
capable of high-volume, low-latency streaming analysis with the help of various
Hadoop ecosystem components, specifically YARN. This progress has been
substantial and has made Hadoop a powerful system, which can be designed as a
storage, transformation, batch processing, analytics, or streaming and real-time
processing system.
A Hadoop project as a data lake can be divided into multiple phases, such as data
ingestion, data storage, data access, data processing, and data management. For each
phase, we have different sub-projects, that is, tools, utilities, or frameworks that help
and accelerate the process. The Hadoop ecosystem components are tested,
configurable, and proven; building similar utilities on our own would take a huge
amount of time and effort. The core of the Hadoop framework is complex to develop
against and optimize directly. The smart way to speed up and ease development is to
utilize the various Hadoop ecosystem components, so that we can concentrate more
on the application flow design and integration with other systems.
With the emergence of many useful sub-projects in Hadoop and other tools within
the Hadoop ecosystem, the question that arises is which tool to use when, and how
to use it effectively. This book is intended to complete that jigsaw puzzle: it explains
when and how to use the various ecosystem components, and makes you well aware
of the Hadoop ecosystem utilities and the cases and scenarios where they should
be used.
What This Book Covers
Chapter 1, Introduction to Big Data and Hadoop, covers an overview of big data and
Hadoop, plus different use case patterns with advantages and features of Hadoop.
Chapter 2, Hadoop Ecosystem, explores the different phases or layers of Hadoop project
development and some components that can be used in each layer.
Chapter 3, Pillars of Hadoop – HDFS, MapReduce, and YARN, is about the three key
basic components of Hadoop, which are HDFS, MapReduce, and YARN.
Chapter 4, Data Access Components – Hive and Pig, covers the data access components
Hive and Pig, which provide abstraction layers, the SQL-like HiveQL language and the
Pig Latin procedural language respectively, on top of the MapReduce framework.
Chapter 5, Storage Components – HBase, covers the NoSQL database HBase in detail.
Chapter 6, Data Ingestion in Hadoop – Sqoop and Flume, covers the data ingestion
library tools Sqoop and Flume.
Chapter 7, Streaming and Real-time Analysis – Storm and Spark, is about the streaming
and real-time frameworks Storm and Spark built on top of YARN.
Hadoop Ecosystem
Now that we have discussed and understood big data and Hadoop, we can move
on to understanding the Hadoop ecosystem. A Hadoop cluster may have hundreds
or thousands of nodes, which are difficult to design, configure, and manage
manually. Hence, there is a need for tools and utilities that manage systems
and data easily and effectively. Alongside Hadoop, we have separate sub-projects,
contributed by various organizations and contributors and managed mostly by
Apache. These sub-projects integrate very well with Hadoop, letting us concentrate
more on design and development rather than maintenance and monitoring, and
they also help with development and data management.
Before we understand different tools and technologies, let's understand a use
case and how it differs from traditional systems.
Traditional systems
Traditional systems are good for OLTP (online transaction processing) and some
basic data analysis and BI use cases. Within that scope, traditional systems offer
the best performance and manageability. The following figure shows a high-level
overview of a traditional system:
Traditional systems with BIA: transactional databases feed a batch ETL process into
a data warehouse, which serves business intelligence applications (reporting
applications, OBIEE) and data analysis applications (Excel, SAS/R/SPSS, custom
applications), both through standard interfaces such as JDBC and ODBC.
The steps for typical traditional systems are as follows:
1. Data resides in a database
2. ETL (Extract Transform Load) processes
3. Data moved into a data warehouse
4. Business Intelligence Applications can have some BI reporting
5. Data can be used by Data Analysis Application as well
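The five steps above can be sketched in miniature. The following Python sketch uses in-memory lists as stand-ins for the transactional database and the data warehouse; the table and field names are invented purely for illustration:

```python
# 1. Data resides in a (simulated) transactional database.
transactions = [
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": 75.5},
    {"id": 3, "region": "EU", "amount": 40.0},
]

def etl(rows):
    """2. Extract, Transform, Load: aggregate sales per region."""
    warehouse = {}
    for row in rows:                      # extract each record
        region = row["region"]            # transform: group by region
        warehouse[region] = warehouse.get(region, 0.0) + row["amount"]
    return warehouse                      # load into the warehouse store

# 3. Data is moved into the (simulated) data warehouse.
warehouse = etl(transactions)

# 4./5. BI reporting and data analysis applications query the warehouse.
print(warehouse)  # {'EU': 160.0, 'US': 75.5}
```

The point of the sketch is the shape of the flow: operational data is periodically extracted, reshaped, and loaded into a store optimized for analysis, and it is this batch hop that becomes the bottleneck as data volumes grow.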
When the data grows, traditional systems fail to process, or even store, the data;
and even if they do, it comes at a very high cost and effort because of limitations
in the architecture, scalability issues, resource constraints, and the difficulty or
inability to scale horizontally.
Database trend
Database technologies have evolved over a period of time. We had the RDBMS
(relational database), then the EDW (enterprise data warehouse), and now Hadoop-
and NoSQL-based databases have emerged. Hadoop- and NoSQL-based databases
are now the preferred technologies for big data problems, and some traditional
systems are gradually moving towards Hadoop and NoSQL alongside their existing
systems. Some systems combine different technologies to process the data, such as
Hadoop with RDBMS, Hadoop with EDW, NoSQL with EDW, and NoSQL with
Hadoop. The following figure depicts the database trend according to
Forrester Research:
Database trends: in the 1990s, the RDBMS served both operational data and
OLAP/BI; in the 2000s, the data warehouse took over OLAP/BI alongside the
operational RDBMS; since 2010, NoSQL (key-value/column stores) and Hadoop
have joined the RDBMS and OLAP/BI landscape.
The figure depicts the design trends and the technology that was available and
adopted in each decade.
The 1990s were the RDBMS era, designed for OLTP processing, when data
processing was not so complex.
Data warehouses emerged and were adopted in the 2000s, used for OLAP
processing and BI.
Since 2010, big data systems, especially Hadoop, have been adopted by many
organizations to solve big data problems.
All these technologies can practically co-exist in a single solution; each has its
pros and cons, and no single technology can solve every problem.
The Hadoop use cases
Hadoop can help in solving the big data problems that we discussed in Chapter 1,
Introduction to Big Data and Hadoop. Based on data velocity (batch and real time)
and data variety (structured, semi-structured, and unstructured), we have different
sets of use cases across different domains and industries. All of these are big data
use cases, and Hadoop can effectively help in solving them. Some use cases are
depicted in the following figure:
Potential use cases for big data analytics, arranged by data velocity (batch to real
time) and data variety (structured, semi-structured, unstructured): credit and market
risk in banks; fraud detection (credit card) and financial crimes (AML) in banks,
including social network analysis; event-based marketing in financial services and
telecoms; markdown optimization in retail; claims and tax fraud in the public sector;
video surveillance/analysis; predictive maintenance in aerospace; social media
sentiment analysis; demand forecasting in manufacturing; disease analysis on
electronic health records; traditional data warehousing; and text mining.
Hadoop's basic data flow
A basic data flow of the Hadoop system can be divided into four phases:
1. Capture Big Data: The sources can be extensive: structured, semi-structured,
and unstructured data; streaming, real-time data sources; sensors, devices,
and machine-captured data; and many others. For data capture and storage,
the Hadoop ecosystem provides different data integrators, such as Flume,
Sqoop, and Storm, depending on the type of data.
2. Process and Structure: We cleanse, filter, and transform the data by using a
MapReduce-based framework or some other framework that can perform
distributed programming in the Hadoop ecosystem. The frameworks currently
available include MapReduce, Hive, Pig, Spark, and so on.
3. Distribute Results: The processed data can be used by the BI and
analytics system or the big data analytics system for performing analysis
or visualization.
4. Feedback and Retain: The analyzed data can be fed back to Hadoop and
used for improvements and audits.
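The Process and Structure phase is classically a MapReduce job. The following is a minimal local simulation of the map, shuffle, and reduce steps (a word count) in plain Python; it sketches the programming model only, not a real distributed Hadoop job, and the sample log lines are invented for illustration:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (word, 1) pairs from each input line."""
    for line in records:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework
    does between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

logs = ["big data big ideas", "data flows into hadoop"]
result = reduce_phase(shuffle(map_phase(logs)))
print(result["big"], result["data"])  # 2 2
```

On a cluster, the same map and reduce functions would run in parallel on many nodes, with the framework handling the shuffle, scheduling, and fault tolerance.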
The following figure shows data being captured and then processed in a Hadoop
platform, with the results used by a business transactions and interactions system
and a business intelligence and analytics system:
Hadoop basic data flow: unstructured data (log files, exhaust data, social media,
sensors and devices) and DB data (CRM, ERP, web, mobile, point of sale) are
(1) captured into the enterprise Hadoop platform, where they are (2) processed and
structured; the results are (3) distributed to business transactions and interactions
systems and, through classic data integration and ETL, to business intelligence and
analytics systems (dashboards, reports, visualization); finally, analyzed data is
(4) fed back and retained.
Hadoop integration
The Hadoop architecture is designed to be easily integrated with other systems.
Integration is very important because, although we can process data efficiently
in Hadoop, we should also be able to send the results to other systems to take
the data to the next level. Data has to be integrated with other systems to achieve
interoperability and flexibility.
The following figure depicts the Hadoop system integrated with different systems
and with some implemented tools for reference:
Hadoop integration with other systems: data warehouses/RDBMSs and streaming
data feed Hadoop through data import/export and data integration tools, while
NoSQL stores and BI/analytics tools consume and exchange data with it.
Systems that are usually integrated with Hadoop are:
• Data integration tools such as Sqoop, Flume, and others
• NoSQL tools such as Cassandra, MongoDB, Couchbase, and others
• ETL tools such as Pentaho, Informatica, Talend, and others
• Visualization tools such as Tableau, SAS, R, and others
The Hadoop ecosystem
The Hadoop ecosystem comprises many sub-projects, and we can configure these
projects as needed in a Hadoop cluster. As Hadoop is open source software and
has become popular, we see many contributions and improvements supporting
Hadoop from different organizations. All these utilities are genuinely useful and
help in managing the Hadoop system efficiently. For simplicity, we will understand
the different tools by categorizing them.
The following figure depicts the layers, and the tools and utilities within each
layer, of the Hadoop ecosystem:
Hadoop ecosystem: the Hadoop core consists of the distributed filesystem (HDFS)
and the MapReduce framework on YARN; around it sit distributed programming and
data access tools (Hive for SQL queries, Pig for scripting), NoSQL databases, data
ingestion tools (Storm among them), coordination (ZooKeeper), workflow scheduling
(Oozie), machine learning (Mahout), service programming, and system deployment
utilities.
Distributed filesystem
In Hadoop, data is stored in a distributed computing environment, so files are
scattered across the cluster. We need an efficient filesystem to manage the files
in Hadoop. The filesystem used by Hadoop is HDFS, which stands for Hadoop
Distributed File System.
HDFS
HDFS is extremely scalable and fault tolerant. It is designed to perform parallel
processing efficiently in a distributed environment, even on commodity hardware.
HDFS has daemon processes in Hadoop that manage the data; these are the
NameNode, DataNode, BackupNode, and Checkpoint NameNode.
We will discuss HDFS in detail in the next chapter.
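To make the storage model concrete before the next chapter, here is a toy sketch, not HDFS's actual implementation, of how a file is split into fixed-size blocks replicated across DataNodes while a NameNode-style structure tracks only the metadata. The block size, node names, and round-robin placement are illustrative assumptions (real HDFS defaults to a 128 MB block size and rack-aware placement):

```python
from itertools import cycle

BLOCK_SIZE = 4            # bytes; tiny for illustration (HDFS uses ~128 MB)
REPLICATION = 3           # HDFS's default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical node names

def place_blocks(data: bytes):
    """Split data into blocks and assign each block to REPLICATION nodes.
    Returns NameNode-style metadata: block id -> list of hosting nodes."""
    nodes = cycle(DATANODES)              # naive round-robin placement
    block_map = {}
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = offset // BLOCK_SIZE
        block_map[block_id] = [next(nodes) for _ in range(REPLICATION)]
    return block_map

meta = place_blocks(b"hello hdfs!")       # 11 bytes -> 3 blocks
print(len(meta))                          # 3
print(meta[0])                            # ['dn1', 'dn2', 'dn3']
```

The sketch shows why a DataNode failure is survivable: every block still exists on other nodes, and the NameNode's metadata is all that is needed to find the remaining copies.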
Where to buy this book
You can buy Hadoop Essentials from the Packt Publishing website.
Alternatively, you can buy the book from Amazon, BN.com, Computer Manuals and most internet
book retailers.
Click here for ordering and shipping details.
www.PacktPub.com
Stay Connected:
Get more information on Hadoop Essentials
More Related Content

What's hot

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduceRyan Tabora
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introductionsaisreealekhya
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big dataYukti Kaura
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesJyrki Määttä
 
Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947CMR WORLD TECH
 
Introducing the hadoop ecosystem
Introducing the hadoop ecosystemIntroducing the hadoop ecosystem
Introducing the hadoop ecosystemGeert Van Landeghem
 
Hw09 Welcome To Hadoop World
Hw09   Welcome To Hadoop WorldHw09   Welcome To Hadoop World
Hw09 Welcome To Hadoop WorldCloudera, Inc.
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopFebiyan Rachman
 

What's hot (19)

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
 
Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Hadoop technology doc
Hadoop technology docHadoop technology doc
Hadoop technology doc
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Introducing the hadoop ecosystem
Introducing the hadoop ecosystemIntroducing the hadoop ecosystem
Introducing the hadoop ecosystem
 
Hw09 Welcome To Hadoop World
Hw09   Welcome To Hadoop WorldHw09   Welcome To Hadoop World
Hw09 Welcome To Hadoop World
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
IJET-V3I2P14
IJET-V3I2P14IJET-V3I2P14
IJET-V3I2P14
 

Similar to Hadoop essentials by shiva achari - sample chapter

Big Data Hadoop Technology
Big Data Hadoop TechnologyBig Data Hadoop Technology
Big Data Hadoop TechnologyRahul Sharma
 
Hadoop Training in Delhi
Hadoop Training in DelhiHadoop Training in Delhi
Hadoop Training in DelhiAPTRON
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 
Hadoop 2.0-development
Hadoop 2.0-developmentHadoop 2.0-development
Hadoop 2.0-developmentKnowledgehut
 
Hadoop Based Data Discovery
Hadoop Based Data DiscoveryHadoop Based Data Discovery
Hadoop Based Data DiscoveryBenjamin Ashkar
 
BIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfBIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfDIVYA370851
 
Hadoop training kit from lcc infotech
Hadoop   training kit from lcc infotechHadoop   training kit from lcc infotech
Hadoop training kit from lcc infotechlccinfotech
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paperSupratim Ray
 
Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceAssignment Help
 
Introduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxIntroduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxPratimakumari213460
 
Big Data Training in Amritsar
Big Data Training in AmritsarBig Data Training in Amritsar
Big Data Training in AmritsarE2MATRIX
 
Big Data Training in Mohali
Big Data Training in MohaliBig Data Training in Mohali
Big Data Training in MohaliE2MATRIX
 
Big Data Training in Ludhiana
Big Data Training in LudhianaBig Data Training in Ludhiana
Big Data Training in LudhianaE2MATRIX
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptxTazeenSayed3
 

Similar to Hadoop essentials by shiva achari - sample chapter (20)

Big Data Hadoop Technology
Big Data Hadoop TechnologyBig Data Hadoop Technology
Big Data Hadoop Technology
 
Hadoop Training in Delhi
Hadoop Training in DelhiHadoop Training in Delhi
Hadoop Training in Delhi
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Case study on big data
Case study on big dataCase study on big data
Case study on big data
 
finap ppt conference.pptx
finap ppt conference.pptxfinap ppt conference.pptx
finap ppt conference.pptx
 
Hadoop 2.0-development
Hadoop 2.0-developmentHadoop 2.0-development
Hadoop 2.0-development
 
Hadoop Based Data Discovery
Hadoop Based Data DiscoveryHadoop Based Data Discovery
Hadoop Based Data Discovery
 
BIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfBIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdf
 
Hadoop training kit from lcc infotech
Hadoop   training kit from lcc infotechHadoop   training kit from lcc infotech
Hadoop training kit from lcc infotech
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant Resource
 
Introduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxIntroduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptx
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Big Data Training in Amritsar
Big Data Training in AmritsarBig Data Training in Amritsar
Big Data Training in Amritsar
 
Big Data Training in Mohali
Big Data Training in MohaliBig Data Training in Mohali
Big Data Training in Mohali
 
Big Data Training in Ludhiana
Big Data Training in LudhianaBig Data Training in Ludhiana
Big Data Training in Ludhiana
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
 
HDFS
HDFSHDFS
HDFS
 
Bigdata ppt
Bigdata pptBigdata ppt
Bigdata ppt
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Hadoop essentials by shiva achari - sample chapter
  • 1. Community Experience Distilled

Hadoop Essentials
Shiva Achari

Delve into the key concepts of Hadoop and get a thorough understanding of the Hadoop ecosystem.

This book jumps into the world of Hadoop ecosystem components and tools in a simplified manner, and gives you the skills to utilize them effectively for faster development of Hadoop projects. Starting with the concepts of Hadoop YARN, MapReduce, HDFS, and other Hadoop ecosystem components, you will soon learn many exciting topics such as MapReduce patterns, data management, and real-time data analysis using Hadoop. You will also get acquainted with many Hadoop ecosystem tools such as Hive, HBase, Pig, Sqoop, Flume, Storm, and Spark. By the end of the book, you will be confident enough to begin working with Hadoop straightaway and apply the knowledge gained in all your real-world scenarios.

Who this book is written for

If you are a system or application developer interested in learning how to solve practical problems using the Hadoop framework, then this book is ideal for you. It is also meant for Hadoop professionals who want to find solutions to the different challenges they come across in their Hadoop projects.

$29.99 US / £19.99 UK (prices do not include local sales tax or VAT where applicable)

Visit www.PacktPub.com for books, eBooks, code, downloads, and PacktLib.

What you will learn from this book

• Get introduced to Hadoop, big data, and the pillars of Hadoop such as HDFS, MapReduce, and YARN
• Understand different use cases of Hadoop along with big data analytics and real-time analysis in Hadoop
• Explore the Hadoop ecosystem tools and effectively use them for faster development and maintenance of a Hadoop project
• Demonstrate YARN's capacity for database processing
• Work with Hive, HBase, and Pig with Hadoop to easily figure out your big data problems
• Gain insights into widely used tools such as Sqoop, Flume, Storm, and Spark using practical examples
  • 2. In this package, you will find:

• The author biography
• A preview chapter from the book, Chapter 2, 'Hadoop Ecosystem'
• A synopsis of the book's content
• More information on Hadoop Essentials

About the Author

Shiva Achari has over 8 years of extensive industry experience and is currently working as a Big Data Architect consultant with companies such as Oracle and Teradata. Over the years, he has architected, designed, and developed multiple innovative and high-performance large-scale solutions, such as distributed systems, data centers, big data management tools, SaaS cloud applications, Internet applications, and data analytics solutions.

He is also experienced in designing big data and analytics applications, covering ingestion, cleansing, transformation, correlation of different sources, data mining, and user experience, in Hadoop, Cassandra, Solr, Storm, R, and Tableau. He specializes in developing solutions for the big data domain and has sound hands-on experience on projects migrating to the Hadoop world, new development, product consulting, and POCs. He also has hands-on expertise in technologies such as Hadoop, YARN, Sqoop, Hive, Pig, Flume, Solr, Lucene, Elasticsearch, ZooKeeper, Storm, Redis, Cassandra, HBase, MongoDB, Talend, R, Mahout, Tableau, Java, and J2EE. He was involved in reviewing Mastering Hadoop, Packt Publishing.

Shiva has expertise in requirement analysis, estimation, technology evaluation, and system architecture, along with domain experience in telecoms, Internet applications, document management, healthcare, and media. Currently, he supports presales activities such as writing technical proposals (RFPs), providing technical consultation to customers, and managing deliveries of big data practice groups in Teradata. He is active on his LinkedIn page.
  • 3. Hadoop Essentials

Hadoop is quite a fascinating and interesting project that has seen a lot of interest and contributions from various organizations and institutions. Hadoop has come a long way, from being a batch processing system to a data lake with high-volume, low-latency streaming analysis, with the help of various Hadoop ecosystem components, specifically YARN. This progress has been substantial and has made Hadoop a powerful system, which can be designed as a storage, transformation, batch processing, analytics, or streaming and real-time processing system.

A Hadoop project as a data lake can be divided into multiple phases, such as data ingestion, data storage, data access, data processing, and data management. For each phase, we have different sub-projects that are tools, utilities, or frameworks to help and accelerate the process. The Hadoop ecosystem components are tested, configurable, and proven, and building similar utilities on our own would take a huge amount of time and effort. The core of the Hadoop framework is complex for development and optimization. The smart way to speed up and ease the process is to utilize the different Hadoop ecosystem components that are very useful, so that we can concentrate more on application flow design and integration with other systems.

With the emergence of many useful sub-projects in Hadoop and other tools within the Hadoop ecosystem, the question that arises is which tool to use when, and how to use it effectively. This book is intended to complete the jigsaw puzzle of when and how to use the various ecosystem components, and to make you well aware of the Hadoop ecosystem utilities and the cases and scenarios where they should be used.
  • 4. What This Book Covers

Chapter 1, Introduction to Big Data and Hadoop, covers an overview of big data and Hadoop, plus different use case patterns with the advantages and features of Hadoop.

Chapter 2, Hadoop Ecosystem, explores the different phases or layers of Hadoop project development and some components that can be used in each layer.

Chapter 3, Pillars of Hadoop – HDFS, MapReduce, and YARN, is about the three key basic components of Hadoop, which are HDFS, MapReduce, and YARN.

Chapter 4, Data Access Components – Hive and Pig, covers the data access components Hive and Pig, which are abstraction layers of the SQL-like and Pig Latin procedural languages, respectively, on top of the MapReduce framework.

Chapter 5, Storage Components – HBase, is about the NoSQL database component HBase in detail.

Chapter 6, Data Ingestion in Hadoop – Sqoop and Flume, covers the data ingestion library tools Sqoop and Flume.

Chapter 7, Streaming and Real-time Analysis – Storm and Spark, is about the streaming and real-time frameworks Storm and Spark built on top of YARN.
  • 5. Hadoop Ecosystem

Now that we have discussed and understood big data and Hadoop, we can move on to understanding the Hadoop ecosystem. A Hadoop cluster may have hundreds or thousands of nodes, which are difficult to design, configure, and manage manually. This creates a need for tools and utilities to manage systems and data easily and effectively. Along with Hadoop, we have separate sub-projects, contributed by various organizations and contributors and managed mostly by Apache. The sub-projects integrate very well with Hadoop, can help us concentrate more on design and development rather than maintenance and monitoring, and can also help in development and data management.

Before we look at the different tools and technologies, let's understand a use case and how it differs from traditional systems.

Traditional systems

Traditional systems are good for OLTP (online transaction processing) and some basic data analysis and BI use cases. Within that scope, traditional systems are best in performance and management. The following figure shows a traditional system at a high-level overview:

[Figure: Traditional systems with BI and analytics — transactional databases feed a batch ETL process into a data warehouse, which serves Business Intelligence applications (reporting applications, OBIEE, standard interfaces such as JDBC/ODBC) and data analysis applications (Excel, SAS/R/SPSS, custom applications, standard interfaces such as JDBC/ODBC).]
  • 6. The steps in a typical traditional system are as follows:

1. Data resides in a database
2. ETL (Extract, Transform, Load) processes run
3. Data is moved into a data warehouse
4. Business Intelligence applications produce BI reporting
5. Data can be used by data analysis applications as well

When the data grows, traditional systems fail to process, or even store, the data; and even if they do, it comes at a very high cost and effort, because of limitations in the architecture, issues with scalability, resource constraints, and the difficulty or incapability of scaling horizontally.

Database trend

Database technologies have evolved over a period of time. We have RDBMS (relational databases), EDW (enterprise data warehouses), and now Hadoop and NoSQL-based databases have emerged. Hadoop and NoSQL-based databases are now the preferred technologies for big data problems, and some traditional systems are gradually moving towards Hadoop and NoSQL, along with their existing systems. Some systems combine different technologies to process the data, such as Hadoop with RDBMS, Hadoop with EDW, NoSQL with EDW, and NoSQL with Hadoop. The following figure depicts the database trend according to Forrester Research:

[Figure: Database trends — the 1990s: RDBMS for operational data; the 2000s: RDBMS plus data warehouses for OLAP/BI; from 2010: RDBMS, data warehouses, Hadoop, and NoSQL key-value/column stores coexisting.]

The figure depicts the design trends and the technologies that were available and adopted in each decade. The 1990s were the RDBMS era, when systems were designed for OLTP processing and data processing was not so complex.
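The batch ETL step in the flow above can be sketched in miniature. The following is a purely illustrative Python sketch (the data, function names, and aggregation are hypothetical, not from the book) of extract, transform, and load over in-memory rows:

```python
# Minimal ETL sketch: extract rows from a "transactional" source,
# transform (cleanse and aggregate) them, and load them into a
# "warehouse" table. All data and names are illustrative.

def extract(source_rows):
    """Extract: read raw transaction records from the source."""
    return list(source_rows)

def transform(rows):
    """Transform: cleanse product names and aggregate revenue per product."""
    totals = {}
    for row in rows:
        product = row["product"].strip().lower()   # cleansing step
        totals[product] = totals.get(product, 0) + row["amount"]
    return totals

def load(warehouse, totals):
    """Load: write the aggregated facts into the warehouse table."""
    warehouse.update(totals)
    return warehouse

transactions = [
    {"product": " Widget", "amount": 10},
    {"product": "widget ", "amount": 5},
    {"product": "Gadget", "amount": 7},
]
warehouse = {}
load(warehouse, transform(extract(transactions)))
print(warehouse)  # {'widget': 15, 'gadget': 7}
```

In a real traditional system, the extract would read from an OLTP database and the load would write to a warehouse schema; the point here is only the shape of the pipeline, which is exactly what breaks down when data volume outgrows a single machine.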
  • 7. The emergence and adoption of the data warehouse came in the 2000s; it is used for OLAP processing and BI. From 2010, big data systems, especially Hadoop, have been adopted by many organizations to solve big data problems. All these technologies can practically co-exist in one solution, as each technology has its pros and cons, and not all problems can be solved by any one technology.

The Hadoop use cases

Hadoop can help in solving the big data problems that we discussed in Chapter 1, Introduction to Big Data and Hadoop. Based on data velocity (batch and real time) and data variety (structured, semi-structured, and unstructured), we have different sets of use cases across different domains and industries. All of these are big data use cases, and Hadoop can effectively help in solving them. Some use cases are depicted in the following figure:

[Figure: Potential use cases for big data analytics, arranged by data velocity (batch to real time) and data variety (structured to unstructured) — credit and market risk in banks; fraud detection (credit card) and financial crimes (AML) in banks, including social network analysis; event-based marketing in financial services and telecoms; markdown optimization in retail; claims and tax fraud in the public sector; video surveillance/analysis; predictive maintenance in aerospace; social media sentiment analysis; demand forecasting in manufacturing; disease analysis on electronic health records; traditional data warehousing; text mining.]
  • 8. Hadoop's basic data flow

A basic data flow of the Hadoop system can be divided into four phases:

1. Capture Big Data: The sources can be an extensive list of structured, semi-structured, and unstructured sources, streaming and real-time data sources, sensors, devices, machine-captured data, and many others. For data capture and storage, we have different data integrators in the Hadoop ecosystem, such as Flume, Sqoop, and Storm, depending on the type of data.

2. Process and Structure: We cleanse, filter, and transform the data by using a MapReduce-based framework, or other frameworks that can perform distributed programming in the Hadoop ecosystem. The frameworks currently available are MapReduce, Hive, Pig, Spark, and so on.

3. Distribute Results: The processed data can be used by the BI and analytics system or the big data analytics system for performing analysis or visualization.

4. Feedback and Retain: The analyzed data can be fed back into Hadoop and used for improvements and audits.

The following figure shows data being captured and then processed in a Hadoop platform, with the results used in a business transactions and interactions system and a business intelligence and analytics system:

[Figure: Hadoop basic data flow — unstructured data (log files, exhaust data, social media, sensors and devices) and DB data are (1) captured into an enterprise Hadoop platform, (2) processed and structured, (3) distributed as results to business transactions and interactions systems (CRM, ERP, web, mobile, point of sale, via classic data integration and ETL) and business intelligence and analytics systems (dashboards, reports, visualization), and (4) fed back and retained.]
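Phase 2 (Process and Structure) centres on the MapReduce model. The sketch below simulates the map, shuffle, and reduce steps of the classic word-count job in plain Python, in a single process — a conceptual illustration of the programming model, not Hadoop's actual Java API:

```python
from collections import defaultdict

# Conceptual MapReduce word count, simulated in-process.
# A real Hadoop job distributes these three phases across a cluster.

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

records = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

The value of the framework is that the shuffle, fault tolerance, and parallelism come for free: the developer writes only the map and reduce functions, which is why the ecosystem tools in later chapters (Hive, Pig) can compile higher-level queries down to exactly this pattern.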
  • 9. Hadoop integration

The Hadoop architecture is designed to be easily integrated with other systems. Integration is very important: although we can process data efficiently in Hadoop, we should also be able to send the results to other systems to take the data to the next level. Data has to be integrated with other systems to achieve interoperability and flexibility. The following figure depicts the Hadoop system integrated with different systems, with some implemented tools for reference:

[Figure: Hadoop integration with other systems — data warehouse/RDBMS, streaming data, BI/analytics tools, data import/export, NoSQL stores, and data integration tools.]

Systems that are usually integrated with Hadoop are:

• Data integration tools, such as Sqoop, Flume, and others
• NoSQL tools, such as Cassandra, MongoDB, Couchbase, and others
• ETL tools, such as Pentaho, Informatica, Talend, and others
• Visualization tools, such as Tableau, SAS, R, and others

The Hadoop ecosystem

The Hadoop ecosystem comprises many sub-projects, and we can configure these projects as needed in a Hadoop cluster. As Hadoop is open source software and has become popular, we see a lot of contributions and improvements supporting Hadoop from different organizations. All the utilities are absolutely useful and help in managing the Hadoop system efficiently. For simplicity, we will understand the different tools by categorizing them.
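As a rough picture of how a data integration tool such as Flume moves events, the sketch below wires a source to a sink through a buffering channel. This is purely a conceptual model of the source/channel/sink design (the class and function names are made up; Flume's real API differs):

```python
from collections import deque

# Conceptual source -> channel -> sink pipeline, loosely modelled on
# Flume's event-flow design. Not Flume's actual API.

class Channel:
    """Buffers events between source and sink, decoupling their rates."""
    def __init__(self):
        self.queue = deque()
    def put(self, event):
        self.queue.append(event)
    def take(self):
        return self.queue.popleft() if self.queue else None

def source(lines, channel):
    """Source: push incoming log lines into the channel as events."""
    for line in lines:
        channel.put({"body": line})

def sink(channel, store):
    """Sink: drain the channel and persist event bodies (here, a list)."""
    while True:
        event = channel.take()
        if event is None:
            break
        store.append(event["body"])

channel = Channel()
store = []
source(["GET /index", "POST /login"], channel)
sink(channel, store)
print(store)  # ['GET /index', 'POST /login']
```

The channel in the middle is the design point: because the source and sink only ever talk to the buffer, either side can fail or slow down without losing data, which is the property that makes such tools suitable for feeding HDFS from bursty log sources.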
  • 10. The following figure depicts the layers, and the tools and utilities within each layer, of the Hadoop ecosystem:

[Figure: Hadoop ecosystem layers — a distributed filesystem (HDFS) at the core; the MapReduce framework on YARN for distributed programming; NoSQL databases; data access via Hive (SQL query) and Pig (scripting); data ingestion via Storm; coordination via ZooKeeper; machine learning via Mahout; workflow scheduling via Oozie; and system deployment services.]

Distributed filesystem

In Hadoop, data is stored in a distributed computing environment, so the files are scattered across the cluster. We need an efficient filesystem to manage the files in Hadoop. The filesystem used in Hadoop is HDFS, which stands for Hadoop Distributed File System.

HDFS

HDFS is extremely scalable and fault tolerant. It is designed to support efficient parallel processing in a distributed environment, even on commodity hardware. HDFS has daemon processes in Hadoop that manage the data. The processes are the NameNode, DataNode, BackupNode, and Checkpoint NameNode. We will discuss HDFS elaborately in the next chapter.
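HDFS achieves its scalability and fault tolerance by splitting files into fixed-size blocks and replicating each block across several DataNodes (the NameNode tracks which node holds which block). The toy sketch below mimics that bookkeeping; the 4-byte block size, the round-robin placement, and the node names are invented for illustration (HDFS's actual defaults are much larger blocks and a rack-aware placement policy, with 3 replicas):

```python
# Toy model of HDFS block splitting and replica placement.
# Block size, node names, and placement policy are illustrative only.

BLOCK_SIZE = 4          # pretend block size, in bytes
REPLICATION = 3         # 3 replicas, as in HDFS's default

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """A file is stored as a sequence of fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    The returned mapping plays the role of the NameNode's metadata:
    it records where every replica of every block lives."""
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

datanodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(b"hello hdfs!")      # 11 bytes -> 3 blocks
placement = place_replicas(blocks, datanodes)
print(len(blocks))   # 3
print(placement[0])  # ['dn1', 'dn2', 'dn3']
```

Because every block lives on three different nodes, the loss of any single DataNode leaves at least two copies of each of its blocks elsewhere in the cluster — the property that lets HDFS run reliably on commodity hardware.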
  • 11. Where to buy this book

You can buy Hadoop Essentials from the Packt Publishing website. Alternatively, you can buy the book from Amazon, BN.com, Computer Manuals, and most Internet book retailers.

www.PacktPub.com