Hadoop data analysis


Vakul Vankadaru

This presentation explains how Hadoop is useful and describes the architecture of Hadoop.

Technology Education

Solution
• Open-source Apache project
• Runs on
  - Linux, Mac OS X, Windows, and Solaris
  - Commodity hardware

Who uses Hadoop?
Hadoop optimized content

What is Hadoop?
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
• Large datasets → terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes
• Hadoop is based on a simple programming model called MapReduce

Hadoop clusters
One of the largest clusters in the world, at Yahoo, runs on around 2,000 nodes and handles hundreds of jobs every month.

Hadoop Architecture
Master node (single node)
Many slave nodes
• Distributed file system (HDFS)
• Execution engine (MapReduce)

Hadoop Distributed File System (HDFS)
Centralized namenode
- Maintains metadata about files
Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks
- Each block is replicated N times (default = 3)
Example: file F is divided into blocks 1 to 5 (64 MB each)
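
The block and replica bookkeeping described above can be sketched in a few lines of Python. This is only a toy model: the 64 MB block size and the replication factor of 3 come from the slide, while the function names, the node names, and the round-robin placement are assumptions made for illustration. It is not HDFS's actual placement policy.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide
REPLICATION = 3                # default replication factor quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks needed to hold file_size bytes."""
    return math.ceil(file_size / block_size)


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes (toy round-robin)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for block in range(1, num_blocks + 1):
        start = (block - 1) % len(datanodes)
        placement[block] = [
            datanodes[(start + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


# A 300 MB file needs 5 blocks of 64 MB (the last one is only partly filled).
num_blocks = split_into_blocks(300 * 1024 * 1024)

# The dictionary plays the role of the namenode's metadata: block -> datanodes.
namenode_metadata = place_replicas(num_blocks, ["node1", "node2", "node3", "node4"])
```

The namenode in this sketch holds only the mapping from blocks to nodes; the block contents would live on the datanodes.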

Main Properties of HDFS
• Large
• Replication
• Failure
• Fault Tolerance

Hadoop MapReduce
• Handles large-scale web search applications
• Popular programming model for data centers
• Effective programming model for data mining, machine learning, and search applications in data centers
• Abstracts away the issues of parallelization, scheduling, input partitioning, failover, and replication

Properties of MapReduce Engine
• Job Tracker is the master node (runs with the namenode)
  - Receives the user’s job
  - Decides the number of mappers
  - Decides where to run each mapper (concept of locality)
• Example: this file has 5 blocks → run 5 map tasks
• Where to run the task reading block “1”?
  - Try to run it on Node 1 or Node 3
[Figure: Node 1, Node 2, Node 3]
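
The locality idea above can be illustrated with a small scheduling sketch. It assumes, as the slide suggests, that block 1 has replicas on Node 1 and Node 3. The function name `pick_node` and the fallback rule (take any free node when no replica holder is free) are hypothetical and are not taken from the Job Tracker's real scheduler.

```python
def pick_node(block, block_locations, free_nodes):
    """Prefer a free node that stores a replica of the block.

    Returns (node, is_local). Falls back to any free node, which then
    has to read the block over the network.
    """
    for node in block_locations.get(block, []):
        if node in free_nodes:
            return node, True    # data-local task
    if free_nodes:
        return sorted(free_nodes)[0], False  # remote read
    return None, False           # nothing free: the task has to wait


# Assumed replica map matching the slide: block 1 lives on Node 1 and Node 3.
block_locations = {1: ["Node 1", "Node 3"]}

local_choice = pick_node(1, block_locations, {"Node 1", "Node 2", "Node 3"})
fallback_choice = pick_node(1, block_locations, {"Node 2"})
```

With all three nodes free the task lands on Node 1, next to its data; when only Node 2 is free the task still runs, but without locality.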

Properties of MapReduce Engine
(Cont’d)
• Task Tracker is the slave node (runs on each datanode)
  - Receives the task from the Job Tracker
  - Runs the task until completion (either a map or a reduce task)
  - Always in communication with the Job Tracker, reporting progress
[Figure: Map (×4), Parse-hash (×4), Reduce (×3)]
In this example, one MapReduce job consists of 4 map tasks and 3 reduce tasks.
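
One plausible reading of the parse-hash step in the figure is that each map output key is hashed to decide which of the three reduce tasks receives it. The sketch below works under that assumption. CRC32 is used only as a stable stand-in hash, and the input pairs are made-up data; Hadoop's own partitioner is not reproduced here.

```python
import zlib

NUM_REDUCERS = 3  # matches the three reduce tasks in the example


def partition(key, num_reducers=NUM_REDUCERS):
    """Map a key to a reducer index with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(map_outputs, num_reducers=NUM_REDUCERS):
    """Distribute (key, value) pairs from all map tasks into per-reducer buckets."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))
    return buckets


# Four map tasks, each emitting a few pairs (made-up data).
map_outputs = [
    [("apple", 1), ("pear", 1)],
    [("apple", 1)],
    [("plum", 1), ("pear", 1)],
    [("apple", 1), ("plum", 1)],
]
buckets = route(map_outputs)
```

Because the hash depends only on the key, every pair with the same key ends up at the same reducer, whichever map task produced it.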

Key-Value Pairs
• Mappers and Reducers are users’ code (provided functions)
• They just need to obey the key-value pairs interface
• Mappers:
  - Consume <key, value> pairs
  - Produce <key, value> pairs
• Reducers:
  - Consume <key, <list of values>>
  - Produce <key, value>
• Shuffling and Sorting:
  - Hidden phase between mappers and reducers
  - Groups all identical keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
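
The shuffle-and-sort phase described above can be sketched as a single grouping function. This is a minimal in-memory illustration: the name `shuffle_and_sort` and the sample pairs are assumptions, and a real cluster performs this step across machines rather than in one process.

```python
from collections import defaultdict


def shuffle_and_sort(mapper_outputs):
    """Group values by key across all mappers and return them sorted by key.

    Input:  a list of per-mapper lists of (key, value) pairs.
    Output: a list of (key, [values]) pairs, the form a reducer consumes.
    """
    groups = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())


# Two mappers emitting <key, value> pairs (made-up data).
mapper_outputs = [[("b", 2), ("a", 1)], [("a", 3)]]
reducer_input = shuffle_and_sort(mapper_outputs)
```

Here `reducer_input` holds one entry per distinct key, each with the list of every value emitted for that key.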

Example 1: Word Count
• Job: Count the occurrences of each word in a data set
[Figure: map tasks and reduce tasks for the word count job]
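
A minimal single-process sketch of the word count job follows. The mapper and reducer obey the key-value interface from the previous slide; the function names, the lowercasing of words, and the sample lines are assumptions made for illustration and do not use the Hadoop API.

```python
from collections import defaultdict


def map_words(_offset, line):
    """Mapper: consume <offset, line> and produce a <word, 1> pair per word."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reducer: consume <word, [1, 1, ...]> and produce <word, total>."""
    return word, sum(counts)


def run_word_count(lines):
    """Run map, shuffle-and-sort, and reduce in one process."""
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, one in map_words(offset, line):
            grouped[word].append(one)
    return dict(reduce_counts(word, ones) for word, ones in sorted(grouped.items()))


counts = run_word_count(["the quick fox", "the lazy dog", "The end"])
```

On a cluster, each map task would process one block of the input and each reduce task would handle a subset of the words; the logic per pair stays the same.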
