Contents
• What is Big Data?
• Limitations to the existing solutions
• How Hadoop solves the problem
• Introduction to Hadoop
• Hadoop Eco-System
• Hadoop main Components
• MapReduce execution
• File Read and Write
• Sentiment Analysis
Big Data
• Extremely large datasets (data measured in TBs and PBs).
• Facebook has the world's largest Hadoop cluster, holding 400 TB of data in
2011 (currently 22 PB) and generating 20 TB of data per day.
• The NYSE generates 1 TB of data per day.
• The Internet Archive stores around 2 PB of data and is growing at a very fast
rate.
• The Wayback Machine is the Internet Archive's digital archive of the WWW and
other information on the internet. Its intent is to capture and archive
content that would otherwise be lost whenever a site is changed or closed
down.
Unstructured Data (80:20)
web.archive.org
• http://web.archive.org/web/20140626111157/http://thapar.edu/index.asp
Limitations to the existing solutions
• Slow to process
• Transfer rate and seek time of common storage:
  • IDE drive – 75 MB/s, 10 ms
  • SATA drive – 300 MB/s, 8.5 ms
  • SSD – 800 MB/s, 2 ms
• Scaling is expensive
• Unreliable machines: risk of data loss
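To make the "slow to process" point concrete, here is a rough back-of-the-envelope sketch using the transfer rates quoted above. It assumes decimal units (1 TB = 1,000,000 MB), ignores seek time, and assumes a hypothetical cluster of 100 disks reading in parallel, which is the idea Hadoop builds on. The class and method names are illustrative only.

```java
// Rough scan-time arithmetic using the transfer rates quoted on the slide.
// Assumptions: decimal units (1 TB = 1,000,000 MB), no seek overhead,
// and perfectly even splitting of the data across disks.
class ScanTime {
    /** Seconds needed to read dataMb megabytes at mbPerSec, split evenly over `disks` drives. */
    static double seconds(double dataMb, double mbPerSec, int disks) {
        return dataMb / mbPerSec / disks;
    }

    public static void main(String[] args) {
        double oneTbInMb = 1_000_000.0;
        String[] names = {"IDE", "SATA", "SSD"};
        double[] rates = {75.0, 300.0, 800.0};
        for (int i = 0; i < names.length; i++) {
            double single = seconds(oneTbInMb, rates[i], 1);     // one drive reads everything
            double parallel = seconds(oneTbInMb, rates[i], 100); // hypothetical 100 drives in parallel
            System.out.printf("%-4s 1 disk: %.1f min, 100 disks: %.1f s%n",
                    names[i], single / 60.0, parallel);
        }
    }
}
```

Under these assumptions, a single IDE drive needs roughly 3.7 hours to scan 1 TB, while 100 such drives working in parallel need a little over two minutes.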
Infrastructure Providers
Hadoop solves the problem
Introduction to Hadoop
• Apache Hadoop is an open-source software framework, written in Java, for
distributed storage and distributed processing of very large data sets (Big
Data) on clusters of computers.
• All the modules in Hadoop are designed with the fundamental assumption that
hardware failures (of individual machines, or racks of machines) are common
and thus should be handled automatically in software by the framework.
• In December 2004, Google published a paper on the MapReduce algorithm, which
allows very large-scale computations to be parallelized almost trivially
across large clusters of servers.
• Doug Cutting, who later joined Yahoo, recognized the importance of this paper
and applied its ideas to extremely large search problems.
• In 2005, he created the open-source Hadoop framework, which allows
applications based on the MapReduce paradigm to run on large clusters of
commodity hardware.
Hadoop main components
• Two main components:
HDFS – Hadoop Distributed File System (Storage):
  • Data is distributed across nodes (DataNodes),
  • NameNode tracks block locations,
  • Self-healing, high-bandwidth clustered storage
MapReduce (Processing):
  • Splits a job into tasks across processors,
  • JobTracker manages the TaskTrackers
Modes of working
Three modes:
• Standalone mode (default): Hadoop does not use HDFS and stores files on the
local file system; helpful for debugging.
• Pseudo-distributed mode (single-node cluster): all Hadoop daemons are
configured to run on a single machine, with replication factor R = 1.
• Fully distributed mode: Hadoop at full scale, on clusters that can grow to
thousands of nodes; use this mode when working with large data.
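Replication and Block Size
• Default replication factor is 3; default block size is 64 MB in Hadoop 1.x
(128 MB is recommended, and is the default from Hadoop 2.x onward).
• Both can be changed in the configuration files (the dfs.replication and
dfs.blocksize properties in hdfs-site.xml; the latter is named dfs.block.size
in Hadoop 1.x).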
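As a small worked example of what these two settings mean for storage, the sketch below computes how many blocks a file occupies and how much raw disk space its replicas consume. The 1 GB file size and the class and method names are assumptions chosen for illustration; the block sizes and replication factor are the ones quoted above.

```java
// Block-count and raw-storage arithmetic for one file in HDFS.
// The 1 GB file is a hypothetical example.
class BlockMath {
    static final long MB = 1024L * 1024L;

    /** Number of HDFS blocks needed for a file (the last block may be only partially filled). */
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    /** Raw bytes consumed across the cluster once every block is replicated. */
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 1024 * MB; // hypothetical 1 GB file
        System.out.println(blockCount(file, 64 * MB));  // 16 blocks with 64 MB blocks
        System.out.println(blockCount(file, 128 * MB)); // 8 blocks with 128 MB blocks
        System.out.println(rawBytes(file, 3) / MB);     // 3072 MB of raw storage at replication 3
    }
}
```

Larger blocks mean fewer block entries for the NameNode to track, which is one common argument for the 128 MB recommendation.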
MapReduce Programming Model
Example of MapReduce
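The original slide here was a diagram. As a stand-in, the following is a minimal in-memory sketch of the MapReduce model using the classic word-count task. It deliberately does not use the Hadoop API: the map, shuffle, and reduce methods are plain Java written for illustration, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of map -> shuffle -> reduce (not the Hadoop API).
class WordCountSketch {
    /** Map phase: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle: group all emitted values by key, as the framework does between the phases. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs the three steps over all input lines and returns word -> count. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data and hadoop", "hadoop stores big data")));
        // prints {and=1, big=2, data=2, hadoop=2, stores=1}
    }
}
```

In a real cluster the map calls run in parallel on the nodes holding the input blocks, and the shuffle moves data across the network; here everything runs in one process to keep the data flow visible.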
Hive
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for
providing data summarization, query, and analysis.
• Initially developed by Facebook.
• HiveQL: an SQL-like query language.
• Hive queries are compiled into MapReduce jobs behind the scenes, so they are
typically slower than an equivalent hand-written MapReduce program.
Twitter Sentiment Analysis
Java program to get tweets
Sample Data
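The slides for this section were screenshots, so the deck's actual program and data are not reproduced here. Purely as an illustration of one common approach, the sketch below scores text with a tiny word lexicon. The word lists, class name, and sample sentences are all hypothetical, and this is not claimed to be the method used in the presentation.

```java
import java.util.Set;

// Lexicon-based sentiment scoring: a simple illustrative technique,
// not the presentation's own program.
class SentimentSketch {
    // Hypothetical word lists for illustration; a real lexicon would be far larger.
    static final Set<String> POSITIVE = Set.of("good", "great", "love", "happy");
    static final Set<String> NEGATIVE = Set.of("bad", "poor", "hate", "sad");

    /** Score = (number of positive words) - (number of negative words) in the text. */
    static int score(String tweet) {
        int score = 0;
        for (String word : tweet.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(word)) {
                score++;
            } else if (NEGATIVE.contains(word)) {
                score--;
            }
        }
        return score;
    }

    /** Maps the numeric score to a coarse label. */
    static String label(String tweet) {
        int s = score(tweet);
        return s > 0 ? "positive" : s < 0 ? "negative" : "neutral";
    }

    public static void main(String[] args) {
        System.out.println(label("I love Hadoop, it is great"));        // positive
        System.out.println(label("Bad cluster, poor uptime"));          // negative
        System.out.println(label("Hadoop runs on commodity hardware")); // neutral
    }
}
```

In a MapReduce setting, the scoring step would fit naturally in the mapper (one tweet in, one label out) and the counting of labels in the reducer.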
Big Data – The road ahead
• Huge repositories of structured and unstructured data across various digital
platforms and social media.
• Analysis goes beyond what traditional database methods can handle.
• Big data promises growth and long-term sustainability.
• Threats: data integrity, security breaches.

Big data and hadoop


Editor's Notes

  1. The Internet Archive has been archiving cached pages of web sites on its large cluster of Linux nodes. It revisits sites every few weeks or months and archives a new version if the content has changed.
  2. Seek time: the time a program or device takes to locate a particular piece of data.
  3. Hadoop's design principles: the system should manage and heal itself in case of failures; route around failures automatically and transparently; scale capacity in proportion to the resources added; offer lower latency; keep a simple core; store and process large amounts of data.
  4. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Apache ZooKeeper is a centralized service for maintaining configuration information, naming, and distributed synchronization. Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Apache HBase is used when we need random, real-time read/write access to Big Data.
  5. The NameNode is the master. It is the metadata store of HDFS: it keeps track of all the files, their blocks, and the DataNodes holding each block. It also keeps a transaction log of operations such as file creation and deletion. Alongside it runs the Secondary NameNode (SNN), which despite its name is a checkpointing helper rather than a hot standby. At regular intervals it connects to the NameNode and fetches the edit logs and the fsimage. The edit logs record changes such as the addition and deletion of files; the fsimage holds the inode details such as modification time, access time, and access permissions. The SNN merges the edit logs into the fsimage, so if the NameNode fails, a recent checkpoint already exists. When the cluster is restarted, the NameNode starts from an up-to-date fsimage and does not have to replay a long edit log, which saves time.
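  6. This is a Hadoop cluster. A cluster is organized into racks; each rack holds DataNodes, and each DataNode stores blocks of the files (files are split into blocks before being stored). The master daemons, the JobTracker and the NameNode, run on master nodes.
  7. R = 1 because a single-node cluster has only one DataNode (alongside one JobTracker and one NameNode), so there is no other node to hold a replica.
  8. Files 1 and 3 have r = 2; files 2, 4, and 5 have r = 3.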
  9. A job executes in two phases, mapping and reducing, implemented by functions called the mapper and the reducer. The map phase takes the user's input and feeds it to the mapper class; the reduce phase processes the output generated by the mapper class. Put simply, mapping filters and reducing aggregates.
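The checkpointing idea from note 5 (edit logs periodically merged into the fsimage) can be sketched as a toy model. This is a deliberate simplification: the namespace is modeled as a set of paths, the edit-log format with CREATE and DELETE operations is hypothetical, and none of this is the real HDFS on-disk format or API.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of a checkpoint: replay logged edits on top of the last namespace image.
class CheckpointSketch {
    /** One logged namespace change; op is "CREATE" or "DELETE" (simplified, hypothetical format). */
    static final class Edit {
        final String op;
        final String path;

        Edit(String op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    /** Replays the edit log on top of the last fsimage and returns the new checkpoint. */
    static Set<String> checkpoint(Set<String> fsimage, List<Edit> editLog) {
        Set<String> merged = new TreeSet<>(fsimage);
        for (Edit e : editLog) {
            if (e.op.equals("CREATE")) {
                merged.add(e.path);
            } else if (e.op.equals("DELETE")) {
                merged.remove(e.path);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Set<String> fsimage = Set.of("/logs/a.txt", "/logs/b.txt");
        List<Edit> edits = List.of(
                new Edit("CREATE", "/logs/c.txt"),
                new Edit("DELETE", "/logs/a.txt"));
        System.out.println(checkpoint(fsimage, edits)); // [/logs/b.txt, /logs/c.txt]
    }
}
```

The shorter the edit log that remains after a checkpoint, the less replay work is left for the NameNode at restart, which is the time saving the note describes.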