Debarchan Sarkar
Sunil Kumar Chakrapani
The call will start soon; please stay on mute.
Thanks for your time and patience.
• Recap: What is Big Data?
• Problems introduced
• Traditional architecture
• Cluster architecture
• Where did it all start?
• How does it work: a 50,000-foot overview
• How does it work, parts 1 and 2
• Hadoop distributed architecture
• HDFS architecture
[Diagram: growth of data sources, from ERP / CRM through WEB 2.0 to the Internet of things]
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Velocity - Variety - Variability
Storage cost per GB: 1980: $190,000; 1990: $9,000; 2000: $15; 2010: $0.07
1990: a drive stores 1,370 MB of data, read at a 4.4 MB/s transfer rate. Reading it all takes 5 minutes.
2010: 1 TB is the norm, read at a 100 MB/s transfer rate. Reading it all takes 2.5 hours.
1 Machine
• 4 I/O channels
• Each channel: 100 MB/s
• ~45 minutes
10 Machines
• 4 I/O channels
• Each channel: 100 MB/s
• ~4.5 minutes
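The read times quoted above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes decimal units (1 TB = 1,000,000 MB), assumes the machine comparison reads the same 1 TB, and uses a hypothetical helper named scanSeconds. The results land close to the slide figures, which appear to be rounded.

```javascript
// Seconds needed to scan a data set at a given aggregate bandwidth.
// Sizes are in MB; channels and machines default to 1.
function scanSeconds(sizeMB, mbPerSecond, channels, machines) {
    var aggregate = mbPerSecond * (channels || 1) * (machines || 1);
    return sizeMB / aggregate;
}

var ONE_TB_MB = 1000000; // assumption: decimal terabyte

var disk1990 = scanSeconds(1370, 4.4);                 // one 1990 drive
var disk2010 = scanSeconds(ONE_TB_MB, 100);            // one 2010 drive
var oneMachine = scanSeconds(ONE_TB_MB, 100, 4);       // 4 channels, 1 machine
var tenMachines = scanSeconds(ONE_TB_MB, 100, 4, 10);  // 4 channels, 10 machines

console.log("1990 drive:  " + (disk1990 / 60).toFixed(1) + " minutes");
console.log("2010 drive:  " + (disk2010 / 3600).toFixed(1) + " hours");
console.log("1 machine:   " + (oneMachine / 60).toFixed(1) + " minutes");
console.log("10 machines: " + (tenMachines / 60).toFixed(1) + " minutes");
```

Capacity grew roughly 700-fold while transfer rate grew only about 20-fold, which is why spreading the read across many machines matters.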
A common way of avoiding data loss is through replication
[Diagram: traditional architecture: Servers, SAN, Storage]
[Diagram: cluster architecture: ten 1U servers]
• Google File System
• MapReduce
• HDFS: Hadoop Distributed File System
• MapReduce
// MapReduce word count in JavaScript

// map: emit (word, 1) for every alphabetic word in the input value
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// reduce: add up the counts emitted for one word
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
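To see what the two functions do without a cluster, they can be exercised with a small local harness. This is a sketch under assumptions: runLocal is a hypothetical helper, not a Hadoop API, and the shapes of context (a write method) and values (hasNext and next) are inferred from the slide's code.

```javascript
// Word-count map and reduce, as on the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local runner: map every line, group by key, then reduce each group.
function runLocal(lines) {
    var grouped = Object.create(null); // shuffle stage: key -> emitted values
    var mapContext = {
        write: function (k, v) {
            if (!grouped[k]) { grouped[k] = []; }
            grouped[k].push(v);
        }
    };
    lines.forEach(function (line, offset) {
        map(offset, line, mapContext);
    });

    var result = Object.create(null);
    var reduceContext = {
        write: function (k, v) { result[k] = v; }
    };
    Object.keys(grouped).sort().forEach(function (k) {
        var vals = grouped[k];
        var pos = 0;
        var iterator = {
            hasNext: function () { return pos < vals.length; },
            next: function () { return vals[pos++]; }
        };
        reduce(k, iterator, reduceContext);
    });
    return result;
}

var counts = runLocal(["Hadoop stores data", "hadoop processes DATA"]);
console.log(counts);
```

The grouping step in the middle stands in for the shuffle that the framework performs between the map and reduce phases.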
[Diagram: RACK 1 - DataNodes and RACK 2 - DataNodes; each of blocks 1 to 6 appears on three DataNodes]
File Metadata
/user/kc/data01.txt: Blocks 1, 2, 3, 4
/user/apb/data02.txt: Blocks 5, 6
Block 1: R1DN01, R1DN02, R2DN01
Block 2: R1DN01, R1DN02, R2DN03
Block 3: R1DN02, R1DN03, R2DN01
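The placement listed above puts every block on both racks. A minimal sketch can check that property, assuming the node names follow the slide's R<rack>DN<node> pattern; rackOf, racksUsed, and blocksLostIfRackFails are hypothetical helpers written for this illustration.

```javascript
// Replica locations for blocks 1 to 3, copied from the slide.
var placement = {
    1: ["R1DN01", "R1DN02", "R2DN01"],
    2: ["R1DN01", "R1DN02", "R2DN03"],
    3: ["R1DN02", "R1DN03", "R2DN01"]
};

// Assumption: the rack name is the part before "DN", for example "R1".
function rackOf(node) {
    return node.split("DN")[0];
}

// Distinct racks that hold a replica of one block.
function racksUsed(nodes) {
    var seen = {};
    nodes.forEach(function (n) { seen[rackOf(n)] = true; });
    return Object.keys(seen).sort();
}

// Blocks whose replicas all live on the given rack, and so vanish if it fails.
function blocksLostIfRackFails(rack) {
    return Object.keys(placement).filter(function (block) {
        return placement[block].every(function (n) {
            return rackOf(n) === rack;
        });
    });
}

Object.keys(placement).forEach(function (block) {
    console.log("Block " + block + " racks: " + racksUsed(placement[block]).join(", "));
});
console.log("Lost if R1 fails: " + blocksLostIfRackFails("R1").length);
console.log("Lost if R2 fails: " + blocksLostIfRackFails("R2").length);
```

With two replicas on one rack and one on the other, losing either rack still leaves at least one copy of each block.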
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
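The two settings above translate directly into block counts and raw disk usage. The sketch below works through that arithmetic; the helper names blockCount and rawBytes and the example file sizes are assumptions made for illustration.

```javascript
var BLOCK_SIZE = 134217728; // dfs.block.size in bytes (128 MiB)
var REPLICATION = 3;        // dfs.replication

// Number of HDFS blocks a file of the given size is split into.
function blockCount(fileBytes) {
    return Math.ceil(fileBytes / BLOCK_SIZE);
}

// Raw bytes stored across the cluster once every block is replicated.
// The last block only occupies as much space as the data it holds.
function rawBytes(fileBytes) {
    return fileBytes * REPLICATION;
}

var ONE_GIB = 1024 * 1024 * 1024;
var FIVE_HUNDRED_MIB = 500 * 1024 * 1024;

console.log("Block size in MiB: " + BLOCK_SIZE / (1024 * 1024));
console.log("1 GiB file blocks: " + blockCount(ONE_GIB));
console.log("500 MiB file blocks: " + blockCount(FIVE_HUNDRED_MIB));
console.log("1 GiB file raw bytes: " + rawBytes(ONE_GIB));
```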
NameNode
• On startup, reads the fsimage and edits files
• Transactions in edits are merged with fsimage, and edits is emptied
• A client application creates a new file in HDFS
• The NameNode logs that transaction in the edits file
Secondary NameNode: Checkpoint
• The Secondary NameNode periodically creates checkpoints of the namespace
• It downloads fsimage and edits from the active NameNode
• Merges fsimage and edits locally
• Uploads the new image back to the active NameNode
• fs.checkpoint.period: maximum delay between two consecutive checkpoints
• fs.checkpoint.size: size of the edits file that forces a checkpoint before the period has elapsed
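The checkpoint cycle can be pictured with a toy model. This is a sketch only: the namespace is a plain object mapping paths to block lists, the edit record format is invented for the example, and applyEdit and checkpoint are hypothetical names rather than anything in Hadoop.

```javascript
// Apply one logged transaction to an in-memory namespace.
function applyEdit(namespace, edit) {
    if (edit.op === "create") {
        namespace[edit.path] = edit.blocks;
    } else if (edit.op === "delete") {
        delete namespace[edit.path];
    }
    return namespace;
}

// Merge the edits log into a copy of fsimage and start a fresh, empty log.
function checkpoint(fsimage, edits) {
    var merged = {};
    Object.keys(fsimage).forEach(function (path) {
        merged[path] = fsimage[path];
    });
    edits.forEach(function (edit) {
        applyEdit(merged, edit);
    });
    return { fsimage: merged, edits: [] };
}

var fsimage = { "/user/kc/data01.txt": [1, 2, 3, 4] };
var edits = [
    { op: "create", path: "/user/apb/data02.txt", blocks: [5, 6] },
    { op: "create", path: "/tmp/scratch.txt", blocks: [7] },
    { op: "delete", path: "/tmp/scratch.txt" }
];

var result = checkpoint(fsimage, edits);
console.log(Object.keys(result.fsimage).sort());
console.log("edits left: " + result.edits.length);
```

Merging periodically keeps the edits file short, so a restarting NameNode has less log to replay.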
• During startup, the NameNode loads the file system state from the fsimage and the edits log file.
• It then waits for DataNodes to report their blocks.
• During this time the NameNode stays in Safemode.
• Safemode for the NameNode is essentially a read-only mode for the HDFS cluster, where it does not allow any modifications to the file system or blocks.
• Normally the NameNode leaves Safemode automatically after the DataNodes have reported that most file system blocks are available.
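The "most blocks are available" condition can be sketched as a simple ratio test. This is a simplification under assumptions: the threshold mirrors the Hadoop 1.x property dfs.safemode.threshold.pct (default 0.999), and shouldLeaveSafemode is a hypothetical function that ignores other conditions the NameNode may apply.

```javascript
// Assumed default of dfs.safemode.threshold.pct in Hadoop 1.x.
var DEFAULT_THRESHOLD = 0.999;

// True when enough blocks have been reported for the NameNode to leave Safemode.
function shouldLeaveSafemode(reportedBlocks, totalBlocks, thresholdPct) {
    var threshold = (thresholdPct === undefined) ? DEFAULT_THRESHOLD : thresholdPct;
    if (totalBlocks === 0) {
        return true; // an empty file system has nothing to wait for
    }
    return reportedBlocks / totalBlocks >= threshold;
}

console.log(shouldLeaveSafemode(900, 1000));  // still waiting for block reports
console.log(shouldLeaveSafemode(1000, 1000)); // every block reported
```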
[Diagram: Name Node, Data Node; numbered items 1, 2, 3]
Step 1: The HDFS client caches the file data into a temporary local file
Step 2
Step 3
Step 4
Step 5
Support Team’s blog:
http://blogs.msdn.com/b/bigdatasupport/
Facebook Page:
https://www.facebook.com/MicrosoftBigData
Facebook Group:
https://www.facebook.com/groups/bigdatalearnings/
Twitter: @debarchans
Read more:
http://en.wikipedia.org/wiki/Hadoop
http://en.wikipedia.org/wiki/Big_data
Next Session:
Apache Hadoop – Map Reduce

Apache Hadoop - A Deep Dive (Part 1 - HDFS)


Editor's Notes

  1. Explain checkpoint