Big Data on Hadoop
Dina Abu Khader
Tech BuzzWord of 2012
What is Big Data
Data with the 4V’s Properties (Characteristics)
(Volume/Velocity/Variety/Value)
● Volume: Data is too big (petabytes/exabytes) and exceeds the capacity of a
traditional RDBMS.
● Velocity: Data is generated at a high rate, e.g. 30 TB/day.
● Variety: Structured and unstructured data: access logs, DB records,
NoSQL documents, images.
● Value: What you are trying to solve and what type of information you want
to get: live recommendations, analytics, processing large amounts of data.
“For every two degrees the temperature goes up, check-ins at ice-cream
shops go up by 2%” - Andrew Hogue, Foursquare.
W Questions on Hadoop
● Why use Hadoop?
● When to use Hadoop?
● What is Hadoop?
● How to set up?
Why/When to use Hadoop
● When you don't need your answers in real time.
● Storage trends: cost per gigabyte is high and datasets are
big.
● Time/Skills: steep learning curve.
● Non-confidential data.
● When you are throwing away valuable data.
(Java developers with data science skills are in incredibly high demand)
What is Hadoop
Open source framework for storing and processing large sets of data in a
distributed environment.
Core of Hadoop:
● HDFS - Storage
● YARN - Cluster Resource Manager
● MapReduce - Processing Part
● EcoSystem - Applications
Hadoop Architecture
HDFS - Hadoop Distributed File System
Similar to existing distributed file systems, but it runs on
low-cost servers and is highly fault-tolerant.
Goals :
● To overcome hardware failures
● Large datasets, horizontally scalable
● Simple coherency model: write-once, read-many
access model for files
Example: 3 servers × 4 TB (RAID 0) = 12 TB raw; with Hadoop replication factor 3, usable capacity = 12/3 = 4 TB
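The capacity arithmetic above can be sketched in a few lines of Python. This is only a rough estimate under the slide's assumptions (every file uses the same replication factor, and no space is reserved for the OS or temporary data); the function name `usable_capacity_tb` is made up for illustration.

```python
def usable_capacity_tb(servers: int, disk_tb_per_server: float, replication_factor: int) -> float:
    """Approximate usable HDFS capacity: raw capacity divided by the replication factor."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    raw_tb = servers * disk_tb_per_server  # total raw disk across the cluster
    return raw_tb / replication_factor     # each byte is stored replication_factor times


# The slide's example: 3 servers, 4 TB each, replication factor 3.
print(usable_capacity_tb(3, 4, 3))
```

In practice the usable figure would be somewhat lower, since real clusters keep headroom for non-HDFS data.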
HDFS - continued
HDFS splits files into fixed-size blocks, which are distributed
among nodes.
HDFS has a NameNode (NN) and DataNodes (DN)
● NN : Master of the system; tracks block locations {Filename,
#Replicas, BlockId}
● DN : Stores file contents as blocks.
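To make the NameNode/DataNode split concrete, here is a toy Python model of the kind of table a NameNode keeps: block IDs mapped to the DataNodes holding each replica. The round-robin placement, the block ID format, and the name `place_blocks` are all simplifying assumptions for illustration; real HDFS uses a rack-aware placement policy and its own block identifiers.

```python
def place_blocks(filename, num_blocks, datanodes, replication):
    """Toy NameNode table: map each block ID to `replication` distinct DataNodes.

    Placement is plain round-robin, which is NOT the real HDFS policy.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    table = {}
    for block_index in range(num_blocks):
        block_id = f"{filename}#blk_{block_index}"  # hypothetical ID format
        # Consecutive DataNodes, wrapping around, so replicas never share a node.
        replicas = [
            datanodes[(block_index + offset) % len(datanodes)]
            for offset in range(replication)
        ]
        table[block_id] = replicas
    return table


table = place_blocks("readme", 4, ["dn1", "dn2", "dn3", "dn4"], 3)
for block_id, replicas in table.items():
    print(block_id, "->", replicas)
```

The point of the sketch is the division of labour: the table (metadata) lives on the NameNode, while the block contents live only on the DataNodes it names.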
HDFS Features
● Rack awareness
● Minimal data motion
● Utilities
● Rollback
● StandBy-NN
● Highly operable
HDFS - continued
Hadoop FS (FileSystem) shell commands are very similar to Linux commands
● hadoop fs -ls
● hadoop fs -cat /user/dina.khader/readme
Options:
cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, get, ls,
lsr, mkdir, moveFromLocal, mv, put, rm, rmr, setrep, stat, tail, test, text,
touchz
YARN
Introduced in Hadoop 2.x
Main components of YARN:
1. ResourceManager
2. NodeManager
3. JobHistoryServer
MapReduce
A programming model for processing large
datasets with a parallel, distributed algorithm
on a cluster.
MapReduce Example
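As an illustration of the model (not necessarily the example shown in the talk), the classic word count can be sketched in plain Python. The three functions mimic the map, shuffle, and reduce steps on a single machine; the names are hypothetical, and on a real cluster the framework would run many map and reduce tasks in parallel across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    # Keys are processed in sorted order, mirroring the sort step before reduce.
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


counts = word_count(["big data on hadoop", "hadoop stores big data"])
print(counts)  # {'big': 2, 'data': 2, 'hadoop': 2, 'on': 1, 'stores': 1}
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over many machines without changing the result.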
Eco-System
MapReduce gives data seekers a lot of power and flexibility, but it also adds a
lot of complexity.
Therefore, a set of tools exists to make this easier, such as:
● Hive: SQL-like interface to access data stored on HDFS.
● Pig: Scripting platform to process data.
● HBase: Column-oriented NoSQL DB, well suited for sparse data.
Other Hadoop EcoSystem components:
● ZooKeeper: Centralized service for maintaining configuration
information.
● Oozie: Workflow scheduler system to manage Hadoop jobs.
● Sqoop/Flume: Transferring data from RDBMS/other sources into Hadoop.
How to set up Hadoop
● Cloudera (Cloudera Manager)
● Hortonworks (Ambari)
● Manual (Painful :) )
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.2.5.17/ambari.repo
cp ambari.repo /etc/yum.repos.d
yum install ambari-server
ambari-server setup
Quiz
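Assuming the following:
● A configured block size of 64 MB
● Replication factor 3
● Rack awareness
● File size 224 MB
● 3 servers with 4 TB RAID 0
Questions :
Copy the file to HDFS, then explain:
1. How many blocks will be generated?
2. What is the size of these blocks?
3. What will happen if one node goes down?
Answers :
1. ceil(224/64) × 3 = 4 × 3 = 12 block replicas (4 distinct blocks, each stored 3 times).
2. 9 replicas of 64 MB and 3 replicas of 32 MB (per copy: 3 × 64 MB + 1 × 32 MB = 224 MB).
3. HDFS re-replicates the lost blocks to the nearest available server in the
rack. (With only 3 servers and replication factor 3 there is no spare DataNode, so the blocks stay under-replicated until a node returns or is added.)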
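The quiz arithmetic can be checked with a short Python sketch. The function name `hdfs_block_layout` and the returned dictionary keys are made up for illustration; the only facts assumed are the ones in the quiz (fixed block size, a last block that holds only the remainder, and every block replicated the same number of times).

```python
import math


def hdfs_block_layout(file_mb, block_mb, replication):
    """Split a file into fixed-size blocks and count the stored replicas."""
    if file_mb <= 0 or block_mb <= 0 or replication < 1:
        raise ValueError("sizes must be positive and replication at least 1")
    distinct = math.ceil(file_mb / block_mb)          # blocks in one copy of the file
    last_mb = file_mb - (distinct - 1) * block_mb     # the final block holds the remainder
    sizes = [block_mb] * (distinct - 1) + [last_mb]
    return {
        "distinct_blocks": distinct,
        "block_replicas": distinct * replication,
        "sizes_mb": sizes,
    }


layout = hdfs_block_layout(224, 64, 3)
print(layout)
# {'distinct_blocks': 4, 'block_replicas': 12, 'sizes_mb': [64, 64, 64, 32]}
```

Note that the last block occupies only 32 MB on disk; HDFS does not pad it out to the full block size.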
Thank You!
Questions?

JOSA TechTalks - Big Data on Hadoop


Editor's Notes

  1. When you want to make predictions based on large historical data.
  2. Ideas: a framework to organize work on Big Data; a set of rules for working with Big Data. Philosophy: HDFS, MapReduce, move processing to the data, the ecosystem. Hadoop is more of a data warehousing system, so it needs a system like MapReduce to actually process the data.
  3. http://hortonworks.com/hadoop/yarn/
  4. Use cheap servers and don't worry: it's designed to be robust, in that your Big Data applications will continue to run even when individual servers, or clusters, fail. 3 servers × 4 TB (RAID 0) = 12 TB raw; with Hadoop replication factor 3, usable capacity = 12/3 = 4 TB.
  5. NN: master of the system; it maintains and manages the blocks stored on the DNs. These specific features ensure that Hadoop clusters are highly functional and highly available:
     ● Rack awareness allows consideration of a node's physical location when allocating storage and scheduling tasks.
     ● Minimal data motion: MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces network I/O, keeps most of the I/O on the local disk or within the same rack, and provides very high aggregate read/write bandwidth.
     ● Utilities diagnose the health of the file system and can rebalance the data across nodes.
     ● Rollback allows system operators to bring back the previous version of HDFS after an upgrade, in case of human or system errors. An upgrade of HDFS makes a copy of the previous version's metadata and data. Doing an upgrade does not double the storage requirements of the cluster, as the DataNodes use hard links to keep two references (for the current and previous version) to the same block of data. This design makes it straightforward to roll back to the previous version of the file system, should you need to. Any changes made to the data on the upgraded system will be lost after the rollback completes. Only the previous version of the file system is kept: you can't roll back several versions. Therefore, to carry out another upgrade of HDFS data and metadata, you need to delete the previous version, a process called finalizing the upgrade. Once an upgrade is finalized, there is no procedure for rolling back to a previous version.
     ● Standby NameNode provides redundancy and supports high availability.
     ● Highly operable: Hadoop handles different types of cluster that might otherwise require operator intervention. This design allows a single operator to maintain a cluster of thousands of nodes.
  6. Check HDFS
  7. http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.5.0/bk_getting-started-guide/content/ch_hdp2_getting_started_chp2_1.html YARN splits the two major responsibilities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster.
  8. Moving data through the network is slow and very expensive in bandwidth and I/O. 10-100 maps/node. Map: a function that converts a list of items into another kind of list of items. Reduce: a function that “collects” the items in lists and performs some computation on all of them, thus reducing them to a single value. MapReduce moves code (JARs) to the nodes that hold the required data and processes the data in parallel.
  9. Shuffling is the process of transferring data from the mappers to the reducers. It is necessary because otherwise the reducers would have no input (or would not receive input from every mapper). Sorting saves time for the reducer by helping it easily distinguish when a new reduce task should start: put simply, it starts a new reduce task when the next key in the sorted input data differs from the previous one. Partitioning is a different process: it determines to which reducer a (key, value) pair, output by the map phase, will be sent. Note that a reducer is different from a reduce task; a reducer can run multiple reduce tasks. Note also that shuffling and sorting are performed locally, by each reducer, for its own input data, whereas partitioning is not local.
  10. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
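The partitioning and sorting described in note 9 can be sketched in Python as follows. This is a simplified single-process illustration: `zlib.crc32` stands in for the key hash purely because it is deterministic across Python runs, and the names `partition` and `route_and_sort` are hypothetical. The idea it mirrors is hash partitioning, where a key's hash modulo the number of reducers selects the reducer.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Choose the reducer for a key: stable hash of the key modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route_and_sort(pairs, num_reducers):
    """Send each (key, value) pair to its reducer, then sort each reducer's input by key."""
    reducers = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[partition(key, num_reducers)].append((key, value))
    # Sorting is local to each reducer, so equal keys end up adjacent in its input.
    return [sorted(bucket) for bucket in reducers]


map_output = [("hadoop", 1), ("big", 1), ("data", 1), ("hadoop", 1), ("big", 1)]
for reducer_id, bucket in enumerate(route_and_sort(map_output, 2)):
    print(reducer_id, bucket)
```

Since the partition depends only on the key, every pair with the same key reaches the same reducer, no matter which mapper emitted it.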