This document provides an introduction to Hadoop, including:
- An overview of big data and the challenges it poses for data storage and processing.
- How Hadoop addresses these challenges through its distributed, scalable architecture based on MapReduce and HDFS.
- Descriptions of key Hadoop components like MapReduce, HDFS, Hive, and Sqoop.
- Examples of how to perform common data processing tasks like word counting and friend recommendations using MapReduce.
- Some best practices, limitations, and other tools in the Hadoop ecosystem.
2. 2/39
About me
● Working with Hadoop and Hadoop-related technologies for the last 4 years
● Deployed 2 large clusters; the bigger one held almost 0.5 PB of total storage
● Currently working as a consultant / freelancer in Java and Hadoop
● On-site Hadoop trainings from time to time
● In the meantime, working on Android apps
3. 3/39
Agenda
● Big Data
● Hadoop
● MapReduce basics
● Hadoop processing framework – MapReduce on YARN
● Hadoop storage system – HDFS
● Using SQL on Hadoop with Hive
● Connecting Hadoop with an RDBMS using Sqoop
● Examples of real Hadoop architectures
4. 4/39
Big Data from a technological perspective
● Huge amount of data
● Data collection
● Data processing
● Hardware limitations
● System reliability:
– Partial failures
– Data recoverability
– Consistency
– Scalability
5. 5/39
Approaches to the Big Data problem
● Vertical scaling
● Horizontal scaling
● Moving data to processing
● Moving processing close to data
6. 6/39
Hadoop - motivations
● Data won't fit on one machine
● Data processing won't fit on one machine
● More machines → higher chance of failure
● Disk scan is faster than seek
● Batch vs real-time processing
● Move computation close to data
7. 7/39
Hadoop properties
● Linear scalability
● Distributed, shared-(almost)-nothing architecture
● Whole ecosystem of tools and techniques
● Handles unstructured data and raw data analysis
● Transparent data compression
● Replication at its core
● Self-managing (replication, master election, etc.)
● Easy to use
● Massively parallel processing
8. 8/39
Hadoop Architecture
● “Lower” layer: HDFS – data storage and retrieval system
● “Higher” layer: MapReduce – execution engine that relies on HDFS
● Please note that there are other systems that rely on HDFS for data storage, but they won't be covered in this presentation
9. 9/39
MapReduce basics
● Batch processing system
● Handles many distributed systems problems
● Automatic parallelization and distribution
● Fault tolerance
● Job status and monitoring
● Borrows from functional programming
● Based on Google's work: MapReduce: Simplified Data Processing on Large Clusters
10. 10/39
Word Count pseudocode
def map(String key, String value):
    foreach word in value:
        emit(word, 1)

def reduce(String key, int[] values):
    int result = 0
    foreach val in values:
        result += val
    emit(key, result)
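The pseudocode above can be exercised as a minimal, single-process sketch in Python. The driver below stands in for the shuffle step that Hadoop performs between the map and reduce phases; the function names are my own, not part of any Hadoop API:

```python
from collections import defaultdict

def map_word_count(key, value):
    """Map phase: emit (word, 1) for every word in the input line."""
    for word in value.split():
        yield (word, 1)

def reduce_word_count(key, values):
    """Reduce phase: sum the counts emitted for a single word."""
    return (key, sum(values))

def run_mapreduce(lines, mapper, reducer):
    """Minimal single-process MapReduce driver: map, shuffle, reduce."""
    groups = defaultdict(list)
    for i, line in enumerate(lines):
        for k, v in mapper(i, line):   # shuffle: group values by key
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

counts = run_mapreduce(["to be or not to be"], map_word_count, reduce_word_count)
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

On a real cluster, each mapper would process one HDFS block and the shuffle would move data across machines; the logic per key is the same.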
13. 13/39
What can be expressed as MapReduce?
● grep
● sort
● SQL operators, for example:
– GROUP BY
– DISTINCT
– JOIN
● Recommending friends
● Inverting web indexes (building an inverted index)
● And many more
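To illustrate how a SQL operator from the list above fits the model, here is a sketch of a reduce-side JOIN in Python; the tables and column names are invented for the example, and a real Hadoop job would tag each row with its table of origin and shuffle the tagged rows across the cluster:

```python
from collections import defaultdict

def reduce_side_join(users, orders):
    """Reduce-side join sketch: the 'map' phase groups rows from both
    tables by the join key (user_id); the 'reduce' phase pairs every
    left-table row with every right-table row sharing that key."""
    groups = defaultdict(lambda: ([], []))
    for user_id, name in users:          # map + shuffle, left table
        groups[user_id][0].append(name)
    for user_id, item in orders:         # map + shuffle, right table
        groups[user_id][1].append(item)
    joined = []
    for user_id, (names, items) in sorted(groups.items()):  # reduce
        joined += [(user_id, n, i) for n in names for i in items]
    return joined

# Hypothetical rows: users(user_id, name) and orders(user_id, item)
users  = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]
print(reduce_side_join(users, orders))
# [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'mug')]
```

GROUP BY and DISTINCT fall out of the same pattern: the shuffle already groups by key, so the reducer only has to aggregate or deduplicate.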
14. 14/39
HDFS – Hadoop Distributed File System
● Optimized for streaming access (prefers throughput over latency, no caching)
● Built-in replication
● One master server storing all metadata (Name Node)
● Multiple slaves that store data and report to the master (Data Nodes)
● JBOD optimized
● Works better on a moderate number of large files than on many small files
● Based on Google's work: The Google File System
16. 16/39
HDFS limitations
● No file updates
● Name Node as SPOF in basic configurations
● Limited security
● Inefficient at handling lots of small files
● No way to provide global synchronization or shared mutable state (this can be an advantage)
17. 17/39
HDFS + MapReduce: Simplified Architecture
● Master Node: runs the Name Node and the Job Tracker
● Slave Nodes (many): each runs a Data Node and a Task Tracker
* A real setup will include a few more boxes, but they are omitted here for simplicity
18. 18/39
Hive
● “Data warehousing for Hadoop”
● SQL interface to HDFS files (language is called HiveQL)
● SQL is translated into multiple MR jobs that are executed in order
● Doesn't support UPDATE
● Powerful and easy to use UDF mechanism:
add jar /home/hive/my-udfs.jar
create temporary function my_lower as 'com.example.Lower';
select my_lower(username) from users;
19. 19/39
Hive components
● Shell – similar to MySQL shell
● Driver – responsible for executing jobs
● Compiler – translates SQL into MR job
● Execution engine – manages jobs and job stages (one SQL query is usually translated into multiple MR jobs)
● Metastore – schema, location in HDFS, data format
● JDBC interface – allows any JDBC-compatible client to connect
20. 20/39
Hive examples 1/2
● CREATE TABLE page_view
(view_time INT, user_id BIGINT,
page_url STRING, referrer_url STRING,
ip STRING);
● CREATE TABLE users (user_id BIGINT, age INT);
● SELECT * FROM page_view LIMIT 10;
● SELECT
user_id,
COUNT(*) AS c
FROM page_view
WHERE view_time > 10
GROUP BY user_id;
21. 21/39
Hive examples 2/2
● CREATE TABLE page_views_age AS
SELECT
pv.page_url,
u.age,
COUNT(*) AS count
FROM page_view pv
JOIN users u ON (u.user_id = pv.user_id)
GROUP BY pv.page_url, u.age;
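To make the semantics of this query concrete, here is a small Python sketch computing the same join-then-group-by over in-memory stand-ins for the two tables (the sample rows are invented for illustration):

```python
from collections import Counter

def page_views_by_age(page_view, users):
    """What the HiveQL above computes: join page_view with users on
    user_id, then count views per (page_url, age) group."""
    counts = Counter((url, users[uid]) for url, uid in page_view if uid in users)
    return sorted((url, age, c) for (url, age), c in counts.items())

# Stand-ins for the Hive tables: (page_url, user_id) rows and user_id -> age
page_view = [("a.com", 1), ("a.com", 2), ("b.com", 1)]
users     = {1: 25, 2: 30}
print(page_views_by_age(page_view, users))
# [('a.com', 25, 1), ('a.com', 30, 1), ('b.com', 25, 1)]
```

Hive compiles the query into MR jobs that do exactly this, with the join and the aggregation each riding on a shuffle.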
22. 22/39
Hive best practices 1/2
● Use partitions, especially on date columns
● Compress where possible
● JOIN optimization hive.auto.convert.join=true
● Improve parallelism: hive.exec.parallel=true
23. 23/39
Hive best practices 2/2
● Avoid: SELECT COUNT(DISTINCT user_id) FROM logs; – the distinct count runs in a single reducer
● Prefer: SELECT COUNT(*) FROM (SELECT DISTINCT user_id FROM logs) t; – the inner DISTINCT parallelizes
image source: http://www.slideshare.net/oom65/optimize-hivequeriespptx
24. 24/39
Sqoop
● SQL to Hadoop import/export tool
● Performs a MapReduce job that interacts with the target database via JDBC
● Can work with almost all JDBC databases
● Can “natively” import and export Hive tables
● Import supports:
– Full databases
– Full tables
– Query results
● Export can update/append data to SQL tables
26. 26/39
Hadoop problems
● Relatively hard to set up – Linux knowledge required
● Hard to find logs – multiple directories on each server
● Name Node can be a SPOF if configured incorrectly
● Not real time – jobs take some setup/warm-up time (other projects try to address that)
● Performance benefits not visible until you exceed 3-5 servers
● Hard to convince people to use it from the start in some projects (Hive via JDBC can help here)
● Relatively complicated configuration management
27. 27/39
Hadoop ecosystem
● HBase – BigTable-style database
● Spark – fast, general-purpose in-memory processing engine
● Flume – log collection
● Impala – low-latency SQL query engine for Hadoop
● HUE – web UI for Hive and the rest of the stack (think MySQL Workbench / phpMyAdmin) + user permissions
● Oozie – job scheduling, orchestration, dependencies, etc.
28. 28/39
Use case examples
● Generic production snapshot updates
– Using asynchronous mechanisms
– Using a more synchronous approach
● Friends/product recommendations
29. 29/39
Hadoop use case example: snapshots
● Log collection, aggregation
● Periodic batch jobs (hourly, daily)
● Jobs integrate collected logs and production data
● Results from batch jobs feed production system
● Hadoop jobs generate reports for business users
30. 30/39
Hadoop pipeline – feedback loop
● Production systems X and Y generate logs and publish them to RabbitMQ
● Multiple RabbitMQ consumers write the logs to HDFS
● Daily jobs (HDFS + MR) integrate the collected logs with the current “snapshots”
● Results of the daily processing – the updated “snapshots” – are written to an RDBMS, which stores the models and feeds the production systems
● Updated “snapshots” are stored on the production servers
31. 31/39
Feedback loop using Sqoop
● sqoop import: pulls current data from the RDBMS into HDFS
● Daily Hadoop MR jobs process it
● sqoop export: pushes the results back to the RDBMS that stores data for the production system
33. 33/39
How to recommend friends – PYMK 1/5
● Database of users
– CREATE TABLE users (id INT);
● Each user has a list of friends (assume integers)
– CREATE TABLE friends (user1 INT, user2 INT);
● For simplicity: the relationship is always bidirectional
● Possible to do in SQL (run on an RDBMS or on Hive):
SELECT users.id, new_friend, COUNT(*) AS common_friends
FROM users JOIN friends f1 JOIN f2 ….
….
….
34. 34/39
PYMK 2/5: Example
Friend lists (user: friends):
0: 1,2,3
1: 3
2: 1,4,5
3: 0,1
4: 5
5: 2,4
We expect to see the following recommendations: (1,3), (0,4), (0,5)
35. 35/39
PYMK 3/5
● For each user, emit pairs for all of their friends
– Example: user X has friends 1, 5, 6; we emit: (1,5), (1,6), (5,6)
● Sort all pairs by first user
● Eliminate direct friendships, if 5&6 are friends, remove them
● Sort all pairs by frequency
● Group by each user in pair
36. 36/39
PYMK 4/5 mapper
// user: integer, friends: integer list
function map(user, friends):
    for i = 0 to friends.length-1:
        emit(user, (1, friends[i]))          // direct friends
        for j = i+1 to friends.length-1:
            // indirect friends (common friend: user)
            emit(friends[i], (2, friends[j]))
            emit(friends[j], (2, friends[i]))
37. 37/39
PYMK 5/5 reducer
// user: integer, rlist: list of pairs (path_length, rfriend)
reduce(user, rlist):
    recommended = new Map()
    direct = new Set()
    for (path_length, rfriend) in rlist:
        if path_length == 1:       // direct friend – never recommend
            direct.add(rfriend)
        if path_length == 2:       // candidate – count common friends
            recommended.incrementOrAdd(rfriend)
    recommended.removeAll(direct)  // drop existing friends, whatever order they arrived in
    recommend_list = recommended.toList()
    recommend_list.sortBy(_._2)    // sort by common-friend count
    emit(user, recommend_list.toString())
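The mapper and reducer above can be sketched as runnable Python. The single-process driver below stands in for Hadoop's shuffle, and the example graph is made up for illustration:

```python
from collections import defaultdict

def pymk_map(user, friends):
    """Mapper from the slide: direct friendships are tagged 1 and keyed
    by the user; every pair of the user's friends is a length-2 path,
    tagged 2 and keyed by each endpoint."""
    for i, f in enumerate(friends):
        yield (user, (1, f))                  # direct friend
        for g in friends[i + 1:]:
            yield (f, (2, g))                 # f and g share `user`
            yield (g, (2, f))

def pymk_reduce(user, rlist):
    """Reducer from the slide: count common friends per candidate,
    then drop candidates the user already knows."""
    counts = defaultdict(int)
    direct = set()
    for path_length, rfriend in rlist:
        if path_length == 1:
            direct.add(rfriend)
        else:
            counts[rfriend] += 1
    recs = [(f, c) for f, c in counts.items() if f not in direct]
    recs.sort(key=lambda fc: (-fc[1], fc[0]))  # most common friends first
    return user, recs

def pymk(adjacency):
    """Driver standing in for Hadoop's shuffle: group mapper output by key."""
    groups = defaultdict(list)
    for user, friends in adjacency.items():
        for k, v in pymk_map(user, friends):
            groups[k].append(v)
    return dict(pymk_reduce(u, vs) for u, vs in groups.items())

# Tiny hypothetical graph: 0-1, 0-2, 1-2, 2-3 (bidirectional lists)
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(pymk(graph)[3])  # [(0, 1), (1, 1)] -> recommend 0 and 1 to user 3
```

User 3's only friend is 2, so both of 2's other friends (0 and 1) are recommended, each with one common friend.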
38. 38/39
Additional sources
● Data-Intensive Text Processing with MapReduce: http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
● Programming Hive: http://shop.oreilly.com/product/0636920023555.do
● Cloudera Quick Start VM: http://www.cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-1-x1.html
● Hadoop: The Definitive Guide: http://shop.oreilly.com/product/0636920021773.do