Solving Big Data problems using Hadoop
Ravi Chaturvedi
© COPYRIGHT 2013 SAPIENT CORPORATION | CONFIDENTIAL
Background
Why Do Big Data Problems Even Exist?
Matrix Multiplication Problem
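[Figure: matrix x matrix = matrix]
O(n^3) total work = O(n^2) output cells x O(n) work per cell
With the O(n^2) cells computed in parallel, elapsed time drops to O(n) !!!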
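A minimal sketch of the decomposition above, assuming plain Python lists and a thread pool standing in for a cluster. The helper names `cell` and `matmul_parallel` are illustrative and do not come from the slides. Each output cell is an independent O(n) dot product, so the cells can be handed to separate workers in any order.

```python
from concurrent.futures import ThreadPoolExecutor


def cell(a, b, i, j):
    """Compute one output cell: an O(n) dot product of row i of a and column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul_parallel(a, b, workers=4):
    """Multiply two matrices by handing each independent cell to a worker."""
    rows, cols = len(a), len(b[0])
    coords = [(i, j) for i in range(rows) for j in range(cols)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(lambda ij: cell(a, b, ij[0], ij[1]), coords))
    # Reassemble the flat list of cell values into rows.
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_parallel(A, B)
```

Threads in CPython will not speed up this arithmetic; the sketch only shows that no cell depends on any other, which is the property a distributed system can exploit.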
Google Search Architecture
Britney Spears
[‘Britney’, ‘Spears’]
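One common way to serve a tokenized query like this is an inverted index that maps each term to the documents containing it. The sketch below is an assumption about the idea behind the slide, not a description of Google's actual design. The sample documents and the names `tokenize`, `build_index` and `search` are made up for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase terms (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, for illustration only.
docs = {
    1: "Britney Spears releases new album",
    2: "Spears of ancient warriors",
    3: "Britney Spears tour dates",
}
index = build_index(docs)
hits = search(index, "Britney Spears")
```

Looking up a term is a dictionary access, so the expensive part is building the index over the whole web, which is the problem the following slide sizes up.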
Search Index Creation Problem
50+ billion web pages x 20 KB = 1000+ terabytes (1+ PB)
- One computer can read 50 MB/sec from disk
- 7+ months to read the web
http://googleblog.blogspot.in/2008/07/we-knew-web-was-big.html
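The arithmetic behind these figures can be checked with a short script. It assumes decimal units (1 KB = 1,000 bytes) and 30-day months. The 1,000-machine cluster at the end is a hypothetical figure, not from the slides, and assumes perfect scaling with no coordination overhead.

```python
PAGES = 50 * 10**9            # 50 billion pages (figure from the slide)
PAGE_SIZE_BYTES = 20 * 10**3  # 20 KB per page, decimal units
READ_RATE = 50 * 10**6        # 50 MB/sec from one disk

total_bytes = PAGES * PAGE_SIZE_BYTES   # 10**15 bytes, i.e. 1 PB
seconds = total_bytes / READ_RATE       # time for one machine to read it all
days = seconds / 86_400
months = days / 30                      # assumption: 30-day months

# Hypothetical cluster size; assumes reads are spread perfectly evenly.
machines = 1_000
hours_with_cluster = seconds / machines / 3_600

print(f"{total_bytes / 10**15:.0f} PB, {days:.0f} days (~{months:.1f} months) on one machine")
print(f"~{hours_with_cluster:.1f} hours on {machines} machines")
```

One disk needs roughly 231 days, which matches the "7+ months" on the slide, while spreading the same reads over many disks brings the job down to hours.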
Storage
- GFS (Google File System Paper)
- HDFS (Open Source Impl.)
Computation
- Map Reduce (Paper)
- Hadoop Map Reduce (Open Source Impl.)
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) – Goals
- Hardware Failure (MTBF)
- Streaming data access (Throughput)
- Large Data Set (TB to PB)
- Simple Coherency Model (write-once – read-many)
- Portability Across Heterogeneous Hardware and Software Platforms.
- Moving Computation is Cheaper than Moving Data (locality of data)
GFS – Architecture
HDFS – Architecture
HDFS – Cluster Architecture
HDFS – Rack Failure
Functional Programming Review
Map Function
[Figure: func applied to each input element]
Map Function Properties
- Does not change the existing data structure
- Idempotence
- Order of operations doesn’t matter
- Independent: each element is processed on its own
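These properties can be seen with Python's built-in `map`. This is a minimal sketch; `square` and the sample data are illustrative choices, and the two chunks stand in for work that different machines could pick up.

```python
def square(x):
    """A pure function: the output depends only on the input element."""
    return x * x


data = [1, 2, 3, 4, 5, 6]

# map builds a new list; the input list is left untouched.
squared = list(map(square, data))

# Independence: map each chunk separately (as different machines could)
# and concatenate; the result is the same as mapping the whole list.
chunks = [data[:3], data[3:]]
squared_from_chunks = [y for chunk in chunks for y in map(square, chunk)]

# Re-running the map on the same input yields the same output, which is
# what makes it safe to retry a failed task.
squared_again = list(map(square, data))
```

Because no call to `square` reads or writes anything shared, a framework is free to split the input wherever it likes and to rerun any piece that fails.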
Reduce Function
[Figure: func applied across the elements in turn, carrying an initial value through to the result]
Reduce Function Properties
- Does not change the existing data structure
- Order of operations doesn’t matter if the operation is commutative and associative
Commutative Law:
You can swap numbers over and still get the same answer.
a + b = b + a
a x b = b x a
Associative Law:
It doesn't matter how you group the numbers (i.e. which you calculate first).
a + (b + c) = (a + b) + c
a x (b x c) = (a x b) x c
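The two laws are what allow a reduce to be split across machines. The sketch below uses `functools.reduce` on made-up data. The chunk boundaries and their shuffled order are arbitrary illustrative choices, and subtraction is included only as a counterexample.

```python
from functools import reduce
from operator import add, sub

data = [1, 2, 3, 4, 5, 6]
initial = 0  # identity for addition, so it is safe to reuse per chunk

# Sequential reduce: fold the elements into the initial value, left to right.
total = reduce(add, data, initial)

# Addition is associative and commutative, so the data can be reduced in
# separate chunks, in any order, and the partial results combined afterwards.
chunks = [data[4:], data[:2], data[2:4]]
partials = [reduce(add, chunk, initial) for chunk in chunks]
total_from_chunks = reduce(add, partials, initial)

# Subtraction is neither, so the same regrouping changes the answer.
diff = reduce(sub, data, initial)
diff_partials = [reduce(sub, chunk, initial) for chunk in chunks]
diff_from_chunks = reduce(sub, diff_partials, initial)

print(total, total_from_chunks)  # equal
print(diff, diff_from_chunks)    # not equal
```

Note that the per-chunk initial value must be an identity for the operation, otherwise it would be counted once per chunk.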
Hadoop Map-Reduce
Hadoop Map Reduce – Problem Fitment
- Lazy Convergence / Eventual Consistency – no ordering required
- Idempotence – the same operation can be run multiple times
- Straightforward Partial Restart – no state to carry over
- Process Isolation – shared nothing
Hadoop Map Reduce – Properties
- Automatic parallelization and distribution
- Fault tolerance
- Status and monitoring tools
- Clean abstraction for the programmer
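The abstraction amounts to the programmer supplying a map function and a reduce function while the framework handles grouping, distribution and retries. The word count below is a minimal in-process sketch in plain Python, not the Hadoop API. The names `map_phase`, `shuffle`, `reduce_phase` and `run_job` and the sample input are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle and reduce in-process (a real cluster distributes these)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in groups.items())


# Hypothetical input splits, for illustration only.
documents = ["big data big problems", "hadoop solves big data problems"]
counts = run_job(documents)
```

Only `map_phase` and `reduce_phase` contain problem-specific logic; everything in `shuffle` and `run_job` is the part a framework such as Hadoop Map Reduce takes over.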
Hadoop Map Reduce - Architecture
Problems in Finance
Real-time P&L Calculation
Risk Calculation
Thank You!
