Hadoop Jon 
By HumoyunJon Lee
90% OF THE WORLD’S DATA HAS BEEN GENERATED IN THE LAST 
THREE YEARS ALONE, AND IT IS GROWING 
AT AN EVEN MORE RAPID RATE. 
BIG DATA 
The world has seen exponential data growth, driven by social media, 
mobility, e-commerce, and other factors. 
• Volume 
• Variety 
• Velocity
“Big Data is like teenage sex; 
everyone talks about it, 
nobody really knows how to do it, 
everyone thinks everyone else is doing it, 
so everyone claims they are doing it” 
Dan Ariely, Duke University
Big Data Ecosystem
To Address This Issue 
We need Hadoop
A Shared-Nothing Architecture, or: 
What is Hadoop?
The Apache Hadoop software library is a framework that allows for the 
distributed processing of large data sets across clusters of computers using 
simple programming models. It is designed to scale up from single servers to 
thousands of machines, each offering local computation and storage. Rather 
than rely on hardware to deliver high-availability, the library itself is designed 
to detect and handle failures at the application layer, so delivering a highly-available 
service on top of a cluster of computers, each of which may be prone 
to failures.
Prerequisites: 
• Installing Java v1.5+ 
• Adding a dedicated Hadoop system user. 
• Configuring SSH access. 
• Disabling IPv6. 
Installing Hadoop
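On a Debian/Ubuntu system, the user and SSH prerequisites above might look like the following sketch (the `hduser`/`hadoop` names are illustrative choices, and the commands require root):

```shell
# Create a dedicated group and user for running Hadoop daemons
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

# Generate a passphrase-less SSH key so Hadoop can start daemons
# on localhost without prompting, then authorize it for that user
sudo -u hduser ssh-keygen -t rsa -P "" -f /home/hduser/.ssh/id_rsa
sudo -u hduser sh -c \
  'cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys'

# Disable IPv6, which can interfere with Hadoop binding to 0.0.0.0;
# append these lines to /etc/sysctl.conf, then reload:
#   net.ipv6.conf.all.disable_ipv6 = 1
#   net.ipv6.conf.default.disable_ipv6 = 1
#   net.ipv6.conf.lo.disable_ipv6 = 1
sudo sysctl -p
```

The dedicated user keeps Hadoop's files and processes separate from other services on the machine; the passphrase-less key is what lets the start-up scripts SSH to localhost unattended.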
Configuring Hadoop: 
a. hadoop-env.sh 
b. core-site.xml 
c. mapred-site.xml 
d. hdfs-site.xml
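As a sketch, minimal single-node values for the XML files above might look as follows (the port numbers and the replication factor of 1 are illustrative choices for a pseudo-distributed setup, not required values):

```xml
<!-- core-site.xml: the default file system URI -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: where the JobTracker listens -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

<!-- hdfs-site.xml: one replica is enough on a single node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

`hadoop-env.sh` is a shell script rather than XML; its main job is setting `JAVA_HOME` for the daemons.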
Hadoop comes with several web interfaces which are by 
default available at these locations: 
• http://localhost:50070/ – web UI of the NameNode daemon 
• http://localhost:50030/ – web UI of the JobTracker daemon 
• http://localhost:50060/ – web UI of the TaskTracker daemon 
Hadoop Web Interfaces
Hadoop Key Characteristics: 
• Scalable 
• Economical 
• Flexible 
• Reliable 
• Scalable – New nodes can be added as needed, without changing 
data formats, how data is loaded, how jobs are written, or the 
applications on top. 
• Economical – Hadoop brings massively parallel computing to 
commodity servers. The result is a sizeable decrease in the cost per 
terabyte of storage, which in turn makes it affordable to model all 
your data.
• Flexible – Hadoop is schema-less, and can absorb any type of data, 
structured or not, from any number of sources. Data from multiple 
sources can be joined and aggregated in arbitrary ways enabling 
deeper analyses than any one system can provide. 
• Reliable – When you lose a node, the system redirects work to 
another copy of the data and continues processing without missing 
a beat. 
Hadoop Ecosystem
HDFS Architecture
• HDFS is designed to store a very large amount of information 
(terabytes or petabytes). This requires spreading the data across a 
large number of machines. 
• HDFS stores data reliably. If individual machines in the cluster fail, 
data remains available thanks to redundancy. 
Hadoop Distributed File 
System (HDFS):
• HDFS provides fast, scalable access to the information loaded on the 
clusters. It is possible to serve a larger number of clients by simply 
adding more machines to the cluster. 
• HDFS integrates well with Hadoop MapReduce, allowing data to be 
read and computed upon locally whenever possible. 
• HDFS was originally built as infrastructure for the Apache Nutch 
web search engine project.
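Once a cluster is running, day-to-day interaction with HDFS goes through the `hadoop fs` file-system shell. A sketch (the paths and file names are illustrative, and a running cluster is assumed):

```shell
# Make a directory in HDFS and copy a local file into it
hadoop fs -mkdir /user/hduser/input
hadoop fs -put access.log /user/hduser/input/

# List the directory and read the file back out
hadoop fs -ls /user/hduser/input
hadoop fs -cat /user/hduser/input/access.log | head
```

The same URIs then serve as input and output paths for MapReduce jobs, which is what lets computation run next to the data.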
Hadoop does not require expensive, highly reliable hardware. It is 
designed to run on clusters of commodity hardware, an HDFS instance 
may consist of hundreds or thousands of server machines, each storing 
part of the file system’s data. The fact that there are a huge number of 
components and that each component has a non-trivial probability of 
failure means that some component of HDFS is always non-functional. 
Therefore, detection of faults and quick, automatic recovery from them 
is a core architectural goal of HDFS. 
Commodity Hardware Failure:
Applications that run on HDFS need continuous access to their data 
sets. HDFS is designed more for batch processing rather than interactive 
use by users. The emphasis is on high throughput of data access rather 
than low latency of data access. 
Continuous Data Access:
Applications that run on HDFS have large data sets. A typical file in 
HDFS is gigabytes to terabytes in size. So, HDFS is tuned to support 
large files. 
It is also worth examining the 
applications for which using HDFS 
does not work so well. While this 
may change in the future, these are 
areas where HDFS is not a good fit 
today: 
Very Large Data Files:
• Low-latency data access 
• Lots of small files 
• Multiple writers, arbitrary file modifications
• Pig is an open-source high-level dataflow 
system. 
• It provides a simple language for queries and 
data manipulation, Pig Latin, which is compiled 
into MapReduce jobs that run on Hadoop. 
• Why is it important? 
- Companies like Yahoo, Google and Microsoft 
are collecting vast data sets in the form of click 
streams, search logs, and web crawls. 
- Some form of ad-hoc processing and analysis 
of all of this information is required. 
What is Pig?
• An ad-hoc way of creating and executing MapReduce jobs on very 
large data sets 
• Rapid Development 
• No Java is required 
• Developed by Yahoo! 
Why was Pig created?
• Pig is a data flow language. It sits on top of Hadoop and makes it 
possible to create complex jobs to process large volumes of data 
quickly and efficiently. 
• It will consume any data that you feed it: Structured, semi-structured, 
or unstructured. 
• Pig provides the common data operations (filters, joins, ordering) and 
nested data types (tuple, bags, and maps) which are missing in 
MapReduce. 
• Pig scripts are easier and faster to write than standard Java Hadoop 
jobs, and Pig has a lot of clever optimizations, like multi-query 
execution, which can make complex queries run more quickly. 
Where Should I Use Pig?
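As an illustration of the data-flow style, a classic word count might be written in Pig Latin roughly as follows (the file paths and alias names are illustrative):

```pig
-- Word count: each line defines a new relation from the previous one
lines  = LOAD 'input/access.log' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'output/wordcount';
```

Pig compiles this pipeline into one or more MapReduce jobs; the author never writes a mapper or reducer class.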
• Hive is a data warehouse infrastructure built 
on top of Hadoop. 
• It facilitates querying large datasets residing 
in distributed storage. 
• It provides a mechanism to project structure 
on to the data and query the data using a 
SQL-like query language called “HiveQL”. 
What is Hive?
• Hive was developed by Facebook and was open-sourced in 2008. 
• Data stored in Hadoop is inaccessible to business users. 
• High-level languages like Pig, Cascading, etc. are geared towards 
developers. 
• SQL is a common language that many people already know. Hive was 
developed to give access to data stored in Hadoop, translating 
SQL-like queries into MapReduce jobs. 
Why Hive was developed
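A HiveQL session might look like the following sketch (the table, columns, and paths are illustrative):

```sql
-- Project a tab-separated log file into a table
CREATE TABLE page_views (user_id STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Point the table at data already sitting in HDFS
LOAD DATA INPATH '/user/hduser/logs/page_views.tsv'
INTO TABLE page_views;

-- A familiar SQL query; Hive compiles it to MapReduce jobs
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

Note that the schema is projected onto the data at query time; the underlying files in HDFS are untouched.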
Thank you, everyone. 
Today's presentation is 
over.
