This document discusses big data and how Hadoop addresses the problems of storing and processing extremely large datasets. It introduces Hadoop and its main components: HDFS for distributed storage and MapReduce for distributed processing. Hadoop lets applications run on large clusters of commodity hardware, handling failures automatically and scaling easily. The document gives examples of how MapReduce and Hive are used and describes a Twitter sentiment analysis application.
2. Contents
What is Big Data?
Limitations of existing solutions
How Hadoop solves the problem
Introduction to Hadoop
Hadoop Ecosystem
Hadoop main Components
MapReduce execution
File Read and Write
Sentiment Analysis
4. Big Data
Extremely large datasets (data in TBs and PBs):
Facebook has the world's largest Hadoop cluster, with 400 TB of data in 2011 (currently 22 PB) and around 20 TB of new data generated per day,
NYSE generates about 1 TB of data per day,
The Internet Archive stores around 2 PB of data and is growing at a very fast rate,
The Wayback Machine is an example of an Internet Archive store: a digital archive of the World Wide Web and other information on the internet, whose intent is to capture and archive content that would otherwise be lost whenever a site is changed or closed down,
7. Limitations of existing solutions
Slow to process (a rough worked example follows this list),
Transfer rates and seek times of typical storage devices:
IDE drive – 75 MB/s, 10 ms
SATA drive – 300 MB/s, 8.5 ms
SSD – 800 MB/s, 2 ms
Scaling is expensive,
Unreliable machines: risk of data loss
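As a rough worked example of "slow to process" (figures assumed only for illustration): scanning 1 TB sequentially from a single disk at 75 MB/s takes about 10^6 MB / 75 MB/s ≈ 13,300 s, i.e. roughly 3.7 hours, before any computation starts. Spreading the same 1 TB across 100 disks read in parallel cuts this to a little over two minutes, which is exactly the kind of parallel I/O Hadoop is designed to exploit.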
10. Introduction to Hadoop
Apache Hadoop is an open-source software framework, written in Java, for distributed storage and distributed processing of very large data sets (Big Data) on clusters of computers.
All the modules in Hadoop are designed with a fundamental assumption that
hardware failures (of individual machines, or racks of machines) are common
and thus should be automatically handled in software by the framework.
11. In December 2004, Google Labs published a paper on
the MapReduce algorithm, which allows very large scale computations to be
trivially parallelized across large clusters of servers.
Doug Cutting, an employee at Yahoo, recognized the importance of this paper
and applied its ideas to the extremely large search problems he was working on.
In 2005, he created the open-source Hadoop framework that allows
applications based on the MapReduce paradigm to be run on large clusters of
commodity hardware.
13. Hadoop main components
Two main components:
HDFS – Hadoop Distributed File System (storage):
Data is distributed across nodes (DataNodes),
The NameNode tracks block locations,
Self-healing, high-bandwidth clustered storage (see the HDFS API sketch after this list),
MapReduce (processing):
Splits a job into tasks across processors,
The JobTracker manages the TaskTrackers
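As an illustration of the storage side, here is a minimal sketch of writing and then reading a file through the HDFS Java API; the path and file contents are made up for the example, and the client is assumed to pick up the cluster configuration from the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);          // handle to HDFS; the NameNode knows where blocks live
        Path path = new Path("/user/demo/sample.txt"); // hypothetical path

        // Write: the client streams data and HDFS splits it into blocks stored on DataNodes
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the NameNode returns block locations and data is streamed from the DataNodes
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}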
15. Modes of working
Three modes:
Standalone mode (default): Hadoop does not use HDFS; files are kept on the local filesystem, which is helpful for debugging,
Pseudo-distributed mode (single-node cluster): the configuration files are set up to run everything on a single node, with R = 1 (a configuration sketch follows this list),
Fully distributed mode: Hadoop at full scale, on clusters of up to thousands of nodes; used when working on large data
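To sketch what "configuring the files" means for pseudo-distributed mode, the snippet below sets the two key properties programmatically; the same keys normally live in core-site.xml and hdfs-site.xml, and the localhost address and port are assumptions for a single-node setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PseudoModeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // single NameNode on this machine (assumed port)
        conf.set("dfs.replication", "1");                  // R = 1: only one DataNode to replicate to
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Filesystem in use: " + fs.getUri());
    }
}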
16. Replication and Block Size
The default replication factor is 3 and the default block size is 64 MB (128 MB is recommended),
Both can be changed through the configuration files (see the sketch below)
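A minimal sketch of changing these settings from Java; the property names mirror what would go in hdfs-site.xml (check hdfs-default.xml for your Hadoop version), and the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationAndBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");                    // cluster-wide default replication factor
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // 128 MB block size, as recommended

        FileSystem fs = FileSystem.get(conf);
        // Replication can also be changed per file after it has been written (hypothetical existing file)
        fs.setReplication(new Path("/user/demo/sample.txt"), (short) 2);
    }
}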
21. Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis.
Originally developed by Facebook.
HiveQL – an SQL-like query language,
Hive queries are compiled into MapReduce jobs behind the scenes, so they run slower than an equivalent hand-written MapReduce program (see the sketch below),
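For illustration, here is a minimal sketch of running a HiveQL query from Java over the HiveServer2 JDBC driver (the hive-jdbc artifact is assumed to be on the classpath); the host, port, credentials, and the tweets table with its word column are assumptions made up for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // driver class from the hive-jdbc artifact
        String url = "jdbc:hive2://localhost:10000/default"; // HiveServer2 endpoint (assumed host/port)
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             // HiveQL looks like SQL; behind the scenes Hive compiles it into MapReduce jobs
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS cnt FROM tweets GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}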
27. Big Data – The road ahead of us
Huge repositories of structured and unstructured data across various digital platforms and social media,
Analysing it goes beyond traditional database methods,
Big data promises growth and long-term sustainability,
Threats: data integrity, security breaches
Editor's Notes
1) They have been archiving cached pages of websites onto their large cluster of Linux nodes. They revisit sites every few weeks or months and archive a new version if the content has changed,
1) Seek time - time a program or device takes to locate a particular piece of data
Hadoop's design principles were:
The system should manage and heal itself in case of failures,
Automatically and transparently route around failures,
Scale capacity proportionally as resources are added,
Lower latency,
Keep the core simple,
Store and process large amounts of data,
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
ZooKeeper is a centralized service for maintaining configuration information and naming, and for coordinating distributed services.
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Apache HBase is used when we need random, realtime read/write access to Big Data.
The NameNode is the master; it is the metadata store of HDFS, i.e. it keeps track of all the files, their blocks, and the DataNodes holding each block.
It also maintains transaction logs of operations such as file creation and deletion.
There is also a secondary node for the NameNode, known as the SNN (Secondary NameNode): it connects to the NameNode at regular intervals and fetches the edit logs and the fsimage.
The edit logs contain the details of file additions, deletions, and so on.
The fsimage contains the inode details such as modification time, access time, and access permissions.
If the NameNode fails, the SNN already holds the edit logs and fsimage, so when the cluster is restarted the NameNode's fsimage is brought up to date automatically and there is no overhead of copying the edit logs at restart time, which saves time.
This is a Hadoop cluster. Each cluster contains racks, each rack contains DataNodes, and each file is split into blocks that are stored across those DataNodes.
The cluster also contains master nodes, i.e. the JobTracker and the NameNode.
R = 1 because only one JobTracker and NameNode is used.
Files 1 and 3 have r = 2.
Files 2, 4 and 5 have r = 3.
Executed in two phases – mapping and reducing,
The two phases are implemented by user-defined functions called the mapper and the reducer,
The map phase takes the user's input and feeds it into the mapper class,
The reduce phase processes the output generated by the mapper class,
Put simply, mapping is to filter and reducing is to aggregate (see the WordCount sketch below),
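To make the two phases concrete, below is a sketch of the classic WordCount job in the Hadoop Java API: the mapper "filters" each input line into (word, 1) pairs and the reducer "aggregates" the counts per word; the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: turn each input line into (word, 1) pairs
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: aggregate the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}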