HADOOP 
BIG DATA 
Presented by Chandra Sekhar 
What is Hadoop? 
Apache Hadoop is a framework that allows for the 
distributed processing of large data sets across clusters 
of commodity computers using a simple programming 
model.
PRESENTATION FLOW 
1. How Hadoop STORES Data. 
2. How Hadoop PROCESSES Data.
3. Architecture of Hadoop 
4. ROI 
5. Resources
CHALLENGES LIKE 
OPPORTUNITIES: 
● Of all the people who sailed between 1997 and 2005, should I target
those who purchased the alcohol package or the spa package?
● Based on the onboard spending of adult men from New York who
have ever sailed with us, who can be targeted to sail on Azamara?
● Which first-time guest will be a high roller?
COST SAVINGS:
● On a sailing, who and how many will have genuine complaints vs.
whining?
● Which propulsion will break next?
PRODUCTIVITY:
● Which employee will quit next?
We have answers to most of these questions somewhere in our warehouses.
What is so Great about Hadoop ? 
● Why all this buzz? 
● Is it hype?
● Is it a dot-com?
● How does Hadoop Handle ? 
Next Slide is a good example
At Yahoo in 2008
Hadoop is ideal For 
● Write-once, read-many-times operations.
● No edits, no updates.
● Movie files, music files, flight data
recorder output, logs, and XML files are all fine
(DB records as well).
HOW HADOOP STORES 
DATA 
● Hadoop stores files as blocks.
● The default block size is 64 MB (in Hadoop 1.x; later releases default to 128 MB).
● Every block is stored as three replicas by default.
● A 100 MB file takes up 2 blocks
(× replication factor of 3 = 6 block replicas).
● A 1 GB file is not a problem: 16 blocks, or 48 block replicas.
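The block arithmetic on this slide can be checked with a short sketch. It assumes the 64 MB block size and the replication factor of 3 quoted above; `block_usage` is a hypothetical helper name, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide

def block_usage(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas) needed to store one file."""
    # The last block may be only partly filled, so round up.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

print(block_usage(100))    # (2, 6)
print(block_usage(1024))   # (16, 48)
```

The same function shows the effect of a different block size or replication factor by passing other values for `block_size_mb` and `replication`.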
OLD VS NEW 
● You can set the replication factor per file: for example, 2 for older files
and 3 or even 4 for new files.
● You can compress the files.
More on Blocks.. 
Because the unit of storage is the block, it
does not really matter how many
files there are or how big they are.
But:
Hadoop prefers a few large files to
many small files. Why?
Why Large Files ? 
When a block is created, the address of its
location is stored in the NameNode's
memory for faster retrieval.
It is not mandatory, but it is more efficient to keep
the number of entries small. Usually multiple files are
merged into a single file (e.g., all Assignment
Manager logs for a day into a single huge file).
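A rough sketch of why merging helps: it counts how many block entries the NameNode would have to track for many small files versus one merged file. It is only an illustration. It assumes the 64 MB block size from the storage slide, counts one entry per block, and ignores per-file metadata. `block_entries` is a hypothetical helper and the file sizes are made up.

```python
import math

BLOCK_SIZE_MB = 64  # block size assumed from the storage slide

def block_entries(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """Count the block entries needed for a list of file sizes (in MB)."""
    # A file smaller than one block still needs its own block entry.
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)

small_files = [1] * 1000   # 1,000 log files of 1 MB each (made-up sizes)
merged_file = [1000]       # the same data merged into one 1,000 MB file

print(block_entries(small_files))  # 1000
print(block_entries(merged_file))  # 16
```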
Data Loss is extremely Rare .. 
Here is why
HOW HADOOP 
PROCESSES DATA 
MAP REDUCE
MAP REDUCE 
Map Function 
● Reads the data 
● Usually does the preprocessing 
● Hands over the records to Reduce 
Function for further processing 
(Ex: eliminate all records where the age is
less than 18.)
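The age filter in the example above could look like the following minimal sketch. It is plain Python, not the Hadoop API. The record layout (`name`, `age`), the choice of `name` as the key, and the function name `map_adults` are all assumptions made for illustration.

```python
def map_adults(records, min_age=18):
    """Map step: read records, drop those under min_age, pass the rest on."""
    for record in records:
        if record["age"] >= min_age:
            # Emit a (key, value) pair for the reduce step.
            yield record["name"], record

# Made-up sample records, for illustration only.
guests = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 16},
    {"name": "Cid", "age": 18},
]

print([key for key, _ in map_adults(guests)])   # ['Ann', 'Cid']
```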
More about Processing 
● A single huge file (e.g., 1 GB) can be
processed by several mappers (usually one block =
one mapper, so about 16 map tasks).
● If the logic is simple, you can disable the reduce
function and let the map job handle it.
● A MapReduce job can pick up a web log from our
website, join it to a Siebel table, and write the output
to a TIBCO queue for delivery to the AS400 (or to MongoDB
directly).
Hadoop Eco-System
Mapreduce Flow
KEY VALUE PAIR 
Hello World Example 
File Content : 
The mouse runs faster than the Cat
Map function output 
Map Job output : (K1, V1)
(The, 1) 
(mouse,1) 
(runs,1) 
(faster,1) 
(than,1) 
(the,1) 
(cat,1)
Reducer Function 
Reducer Job output : (K1, V1)
(The, 2) 
(mouse,1) 
(runs,1) 
(faster,1) 
(than,1) 
(cat,1)
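The Hello World example can be reproduced with a small in-memory sketch. It is plain Python rather than a real Hadoop job, and the function names are hypothetical. It assumes words are lowercased before counting, which is what lets "The" and "the" add up to 2.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit (word, 1) for every word in the line."""
    for word in line.split():
        # Assumption: lowercase so 'The' and 'the' share one key.
        yield word.lower(), 1

def reduce_counts(pairs):
    """Reduce step: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

line = "The mouse runs faster than the Cat"
mapped = list(map_words(line))
print(mapped)                 # seven (word, 1) pairs
print(reduce_counts(mapped))  # 'the' maps to 2, every other word to 1
```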
Hadoop Programming 
Languages 
Java, any scripting language, Hive, Pig,
etc.
Sample code in Java
Same Code in Python
Same Code in PIG 
A = load '/home/cloudera/wordcountproblem' using TextLoader as (data:chararray);
B = foreach A generate FLATTEN(TOKENIZE(data)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into '/home/cloudera/Chandra7' using PigStorage(',');
Same Code in HIVE 
SELECT word, COUNT(*)
FROM input LATERAL VIEW explode(split(text, ' ')) lTable AS word
GROUP BY word;
More on data processing 
● Map function output is always sorted by
key.
● Map output is intermediate data, so it is not
saved in HDFS, only on the local node, and it
is deleted after the reducer finishes.
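Sorting by key matters because it lets a reducer handle each key as one contiguous run of pairs instead of holding every key in memory. The sketch below illustrates the idea in plain Python. It is not Hadoop code: `sorted()` stands in for the framework's sort, and `reduce_sorted` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

def reduce_sorted(pairs):
    """Sum values per key, assuming pairs arrive sorted by key."""
    # groupby only merges adjacent equal keys, so sorted input is required.
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)

map_output = [("the", 1), ("mouse", 1), ("the", 1), ("cat", 1)]
sorted_output = sorted(map_output, key=itemgetter(0))  # stands in for the framework's sort

print(list(reduce_sorted(sorted_output)))   # [('cat', 1), ('mouse', 1), ('the', 2)]
```

Feeding the unsorted `map_output` to the same function would report "the" twice, which is why the sort step comes first.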
ARCHITECTURE.
ROI 
One study: storing and processing 1 TB
Traditional RDBMS: $37,000 / year
Data appliance: $5,000 / year
Hadoop cluster: $2,000 / year
Source: HBR, Big Data@Work, page 60
Wikibon Study 
BREAK EVEN TIMEFRAME 
Big data Approach : 
4 months 
Traditional DW Appliance Approach : 26 months
Resources 
YouTube: “Stanford University, Amr Awadallah”
‘Must Read’ to get Certified.. 
http://www.amazon.com/review/R3BSEBI4I4SNUL
THANK YOU

Hadoop And Big Data - My Presentation To Selective Audience

  • 1. HADOOP BIG DATA Presented by Chandra Sekhar
  • 2. What is Hadoop? Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
  • 3. PRESENTATION FLOW 1. How Hadoop STORES Data. 2. How Hadoop PROCESSES Data. 3. Architecture of Hadoop. 4. ROI. 5. Resources.
  • 4.
  • 5. CHALLENGES LIKE: OPPORTUNITIES: ● Of all the people who sailed between 1997 and 2005, should I target those who purchased the alcohol package or the spa package? ● Based on the onboard spending of adult men from New York who have ever sailed with us, who can be targeted to sail on Azamara? ● Which first-time guest will be a high roller? COST SAVINGS: ● On a sailing, who and how many will have genuine complaints vs. whining? ● Which propulsion system will break next? PRODUCTIVITY: ● Which employee will quit next? We have answers to most of these questions somewhere in our warehouses.
  • 6.
  • 7. What is so great about Hadoop? ● Why all this buzz? ● Is it hype? ● Is it a dot-com? ● How does Hadoop handle it? The next slide is a good example.
  • 8. At Yahoo in 2008
  • 9.
  • 10. Hadoop is ideal for ● Write-once, read-many-times operation. ● No edits, no updates. ● Movie files, music files, flight data recorders, logs, and XML files are all fine (DB records as well).
  • 11. HOW HADOOP STORES DATA ● Hadoop uses blocks to store files. ● The default block size is 64 MB (the Hadoop 1.x default; Hadoop 2 and later default to 128 MB). ● Every block gets replicated three times. ● A 100 MB file takes up 2 blocks (× replication factor of 3 = 6 block replicas). ● A 1 GB file is not a problem: 16 blocks, or 48 with replication.
  • 12. OLD VS NEW ● You can set replication for older files to 2, and for new files to 3 or even 4. ● You can compress the files.
  • 13. More on Blocks. Because the unit of storage is the block, it does not really matter how many files there are, or how big they are. But Hadoop prefers large files to many small files. Why?
  • 14. Why Large Files? When a block gets created, the address of the block location gets stored in the namenode's memory for faster retrieval. It is not mandated, but it is more efficient to have few entries. Usually multiple files get merged into a single file (ex: all Assignment Manager logs of a day into a single huge file).
  • 15. Data loss is extremely rare. Here is why.
  • 16. HOW HADOOP PROCESSES DATA MAP REDUCE
  • 17. MAP REDUCE Map Function ● Reads the data. ● Usually does the preprocessing (ex: eliminate all records where the age is less than 18). ● Hands the records over to the Reduce function for further processing.
  • 18. More about Processing ● A single huge file (ex: 1 GB) could be processed by several mappers (usually one block = one mapper, so about 16 map tasks). ● If the logic is simple, you can disable the reduce function and let the map job do all the processing. ● A MapReduce job can pick up a web log from our website, join it to a Siebel table, and write the output to a TIBCO queue for delivery to AS400 (or to MongoDB directly).
  • 21. KEY VALUE PAIR Hello World Example. File content: The mouse runs faster than the Cat
  • 22. Map function output. Map job output: (K1, V1) (The, 1) (mouse, 1) (runs, 1) (faster, 1) (than, 1) (the, 1) (cat, 1)
  • 23. Reducer Function. Reducer job output, counting "The" and "the" as the same word: (K1, V1) (The, 2) (mouse, 1) (runs, 1) (faster, 1) (than, 1) (cat, 1)
  • 24. Hadoop Programming Languages: Java, any scripting language, Hive, Pig, etc.
  • 26. Same Code in Python
  • 27. Same Code in PIG A = load '/home/cloudera/wordcountproblem' using TextLoader as (data:chararray); B = foreach A generate FLATTEN(TOKENIZE(data)) as word; C = group B by word; D = foreach C generate group, COUNT(B); store D into '/home/cloudera/Chandra7' using PigStorage(',');
  • 28. Same Code in HIVE SELECT word, COUNT(*) FROM input LATERAL VIEW explode(split(text, ' ')) lTable as word GROUP BY word;
  • 29. More on data processing ● Map function output is always sorted by key. ● Map output is intermediate data, so it is not saved in HDFS; it is kept only on the local node and gets deleted after the reducer finishes.
  • 31.
  • 32.
  • 33.
  • 34. ROI One study: storing and processing 1 TB. Traditional RDBMS: $37,000/year. Data appliance: $5,000/year. Hadoop cluster: $2,000/year. Source: HBR, Big Data@Work, page 60.
  • 35. Wikibon Study BREAK-EVEN TIMEFRAME. Big data approach: 4 months. Traditional DW appliance approach: 26 months.
  • 36. Resources: YouTube, “Stanford University, Amr Awadallah”
  • 37. ‘Must Read’ to get certified: http://www.amazon.com/review/R3BSEBI4I4SNUL
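
The word-count walkthrough on slides 21 to 23, together with the sort-by-key note on slide 29, can be sketched in plain Python without a Hadoop cluster. This is only an illustrative sketch, not the code from slide 26 (which did not survive in the transcript). The function names `map_words`, `reduce_counts`, and `word_count` are hypothetical, and lowercasing each word is an assumption made so that "The" and "the" are counted together, as the reducer output on slide 23 suggests.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    # Lowercasing is an assumption, so "The" and "the" share one key.
    for word in line.lower().split():
        yield (word, 1)


def reduce_counts(sorted_pairs):
    """Reduce step: sum the values of each run of identical keys.

    groupby only merges adjacent items, so the input must already be
    sorted by key, which is what Hadoop does before the reduce step.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


def word_count(lines):
    """Run map, sort-by-key, and reduce in memory on a list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    pairs.sort(key=itemgetter(0))  # stands in for Hadoop's sort by key
    return list(reduce_counts(pairs))


result = word_count(["The mouse runs faster than the Cat"])
print(result)
# [('cat', 1), ('faster', 1), ('mouse', 1), ('runs', 1), ('than', 1), ('the', 2)]
```

In a real job the map and reduce steps would run on different nodes; the in-memory sort here only imitates the ordering guarantee that lets the reducer sum one key at a time.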

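The block arithmetic from slides 11, 14, and 18 can be checked with a few lines of Python. This is a rough sketch under the deck's own figures (64 MB blocks, replication factor 3, one mapper per block). The helper names are hypothetical, and the 1,000-file comparison is an illustrative assumption, not a number from the slides.

```python
import math

BLOCK_SIZE_MB = 64  # default block size quoted on slide 11
REPLICATION = 3     # replication factor quoted on slide 11


def blocks_for_file(size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Blocks needed for one file; the last block may be partly filled."""
    return math.ceil(size_mb / block_size_mb)


def stored_replicas(size_mb, replication=REPLICATION):
    """Total block replicas kept across the cluster for one file."""
    return blocks_for_file(size_mb) * replication


# Slide 11: a 100 MB file and a 1 GB file.
print(blocks_for_file(100), stored_replicas(100))    # 2 6
print(blocks_for_file(1024), stored_replicas(1024))  # 16 48

# Slide 18: one mapper per block, so a 1 GB file gets about 16 mappers.
mappers_for_1gb = blocks_for_file(1024)

# Slide 14: many small files mean many block entries in namenode memory.
# Illustrative assumption: 1,000 files of 1 MB vs. one merged 1,000 MB file.
small_file_blocks = sum(blocks_for_file(1) for _ in range(1000))
merged_file_blocks = blocks_for_file(1000)
print(small_file_blocks, merged_file_blocks)  # 1000 16
```

The same data costs the namenode 1,000 block entries as small files but only 16 once merged, which is the reason slide 14 recommends combining logs into one large file.
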
Editor's Notes

  1. SLIDE FROM YAHOO IN 2008
  2. More on the block size in the next couple of slides
  3. http://wikibon.org/wiki/v/Financial_Comparison_of_Big_Data_MPP_Solution_and_Data_Warehouse_Appliance