Functional Big Data
Agenda 
MapReduce 
Google 
Scaling Out 
Key Value Store 
Chaining 
Fault Tolerance 
Functional Example 
Business Problem 
Design 
Processes 
Schema 
Big Data Guidelines
MapReduce
Google MapReduce 
+ Paper published in 2004 
+ Implemented in 2003 
+ Production use at Google 
+ Built for Google 
+ Not open sourced
Google in 2004 
+ Clusters of 100s or 1000s of servers 
o Linux 
o dual-processor x86 
o 2-4 GB memory 
o 100BaseT or GigE 
o inexpensive IDE hard drives 
+ Servers fail every day 
+ Network maintenance is constant
Scaling Out 
+ Scaling up (faster computer) doesn’t get far 
+ Scaling out is the only next step 
+ Hundreds/thousands of modest computers 
outperform the biggest single computers 
+ Scaling one to a few is hard 
+ Scaling a few to many is easy 
+ Scaling many to massive is (almost) trivial
Concurrency
Intermediate Data 
+ Input data is split between the workers 
+ Map workers create key/value pairs 
+ Reduce workers read in all intermediate 
data and sort by key 
+ Reduce workers then iterate over the sorted 
data producing a result for each key
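The flow above can be sketched in a few lines of Python. This is a minimal single-process illustration, not Google's implementation; the names `map_reduce`, `count_words` and `total` are made up for the example, and word counting stands in for a real workload.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(splits, map_fn, reduce_fn):
    # Map phase: each input split yields intermediate (key, value) pairs.
    intermediate = []
    for split in splits:
        intermediate.extend(map_fn(split))
    # Reduce phase: sort by key, then produce one result per key.
    intermediate.sort(key=itemgetter(0))
    return {
        key: reduce_fn(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    }

def count_words(split):
    # Hypothetical map function: emit (word, 1) for every word in the split.
    return [(word, 1) for word in split.split()]

def total(key, values):
    # Hypothetical reduce function: add up the values collected for one key.
    return sum(values)

result = map_reduce(["a b a", "b c"], count_words, total)
print(result)
```

In a real cluster the splits would go to separate map workers and the sorted intermediate data would be partitioned across reduce workers; here everything runs in one loop to keep the shape visible.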
Key Value Store
Rinse and Repeat 
+ Often the results of one MapReduce are 
used as input to another 
+ Building on a powerful basic functional 
model, complex data processing can be 
accomplished
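A small sketch of chaining, assuming a toy in-process `map_reduce` helper (hypothetical, written only for this example): the first pass counts words, and its output is fed unchanged into a second pass that groups words by how often they occur.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, map_fn, reduce_fn):
    # Toy helper: map every record, sort the pairs by key, reduce per key.
    pairs = sorted((kv for r in records for kv in map_fn(r)), key=itemgetter(0))
    return [reduce_fn(k, [v for _, v in g])
            for k, g in groupby(pairs, key=itemgetter(0))]

# Pass 1: count occurrences of each word.
counts = map_reduce(
    ["a b a", "b c"],
    lambda line: [(w, 1) for w in line.split()],
    lambda word, ones: (word, sum(ones)),
)

# Pass 2: takes the output of pass 1 as its input
# and groups the words by their frequency.
by_frequency = map_reduce(
    counts,
    lambda pair: [(pair[1], pair[0])],
    lambda n, words: (n, sorted(words)),
)
print(by_frequency)
```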
Chaining
Fault Tolerance 
+ Likelihood of failure rises with number of 
servers and processing time 
+ Resiliency is a necessity at scale 
+ Scheduler/Supervisor (master) reassigns 
failed jobs and ensures reduce workers find 
the (right) data
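One way to picture the reassignment step is sketched below. It assumes the master keeps a simple task-to-worker table and already knows which workers are alive; the function name, the round-robin policy and the worker names are all illustrative assumptions, not details from the MapReduce paper.

```python
def reassign_failed(assignments, alive):
    """Return a new task -> worker map with dead workers' tasks moved to live ones."""
    live = sorted(alive)
    if not live:
        raise RuntimeError("no live workers left")
    updated = {}
    moved = 0
    for task, worker in sorted(assignments.items()):
        if worker in alive:
            updated[task] = worker
        else:
            # Round-robin the orphaned tasks over the surviving workers.
            updated[task] = live[moved % len(live)]
            moved += 1
    return updated

assignments = {"split-0": "w1", "split-1": "w2", "split-2": "w3"}
after = reassign_failed(assignments, alive={"w1", "w3"})
print(after)
```

Reduce workers would then be told to fetch the interim data for `split-1` from its new owner rather than from the failed worker.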
Scheduling
Supervision
Functional Example
Example Business Problem 
Scenario: 
A mobile operator wants to know if an instant 
messaging (IM) service would be useful to 
current subscribers. 
Question: 
What percentage of text messages (SMS) 
are part of a conversation?
Challenge 
✓ 10 million subscribers 
✓ average of 100 SMS a month per subscriber 
✓ ∴ one billion SMS each month 
✓ call detail records (CDR) include SMS but also 
voice and data events 
✓ ∴ 20 billion (20,000,000,000) records/month
Requirements 
+ Identify SMS conversations 
o messages sent or received with one other party 
o interval between messages < 10 minutes 
o at least three messages exchanged 
+ Provide result as 
o ratio of conversational to non-conversational SMS 
o per subscriber 
o per month
Process Design
Filter 
+ Read events from CDR files 
o records are in chronological order 
o read files in chronological order 
+ Discard non-SMS events 
+ Distribute SMS events to Map processes 
o Consistent distribution by subscriber
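A rough sketch of the filter stage. The CDR record layout (dictionaries with `type`, `subscriber` and `time` fields) is an assumption made for the example, and `pid_for` is a placeholder for whatever consistent routing function is chosen.

```python
from collections import defaultdict

def filter_and_distribute(events, n_processes, pid_for):
    """Keep SMS events and queue each one for the process that owns its subscriber."""
    queues = defaultdict(list)
    for event in events:  # events are assumed to arrive in chronological order
        if event["type"] != "sms":
            continue  # discard voice and data events
        queues[pid_for(event["subscriber"], n_processes)].append(event)
    return dict(queues)

events = [
    {"type": "sms", "subscriber": "5551230001", "time": 0},
    {"type": "voice", "subscriber": "5551230001", "time": 5},
    {"type": "sms", "subscriber": "5551230002", "time": 9},
    {"type": "sms", "subscriber": "5551230001", "time": 12},
]
# Placeholder routing function: any deterministic function of the subscriber works.
queues = filter_and_distribute(events, 2, lambda number, n: int(number) % n)
print(queues)
```

Because the routing function is deterministic, every event for a given subscriber lands in the same queue, and chronological order is preserved within each queue.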
Hashing 
+ To analyze interval between 
messages one process must 
handle all events for a 
particular subscriber 
+ Simple Hash: 
o M = last four digits of subscriber’s 
mobile number 
o N = number of processes available 
o Pid = M rem N
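The simple hash can be written directly in Python. The sketch assumes the mobile number is a string of digits; `process_id` is a name chosen for the example. For non-negative operands Python's `%` gives the same result as `rem`.

```python
def process_id(mobile_number: str, n_processes: int) -> int:
    m = int(mobile_number[-4:])  # M = last four digits of the mobile number
    return m % n_processes       # Pid = M rem N

# Example with a made-up number and 8 processes: 4567 rem 8 = 7.
pid = process_id("94771234567", 8)
print(pid)
```

The same subscriber always maps to the same process, which is what the interval analysis needs.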
Map 
+ Read subscriber’s stored data 
+ Find other party in set 
+ Increment total count of messages 
+ Was the previous message < 10 minutes ago? 
o Was the next-previous message < 10 minutes before the previous one? 
▪ Increment conversational messages count 
+ Update previous and next-previous times
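The map steps above might look like this in Python. It is a sketch under stated assumptions: timestamps are integer seconds, the store is a plain dictionary, and the `OtherParty` fields and the `map_event` name are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

WINDOW = 10 * 60  # ten minutes, in seconds

@dataclass
class OtherParty:
    total: int = 0
    conversational: int = 0
    previous: Optional[int] = None       # time of the last message
    next_previous: Optional[int] = None  # time of the message before that

def map_event(store, subscriber, other, when):
    # Read the subscriber's stored data and find the other party.
    parties = store.setdefault(subscriber, {})
    party = parties.setdefault(other, OtherParty())
    party.total += 1
    # Conversational only if the two preceding gaps are both under ten minutes.
    if party.previous is not None and when - party.previous < WINDOW:
        if (party.next_previous is not None
                and party.previous - party.next_previous < WINDOW):
            party.conversational += 1
    # Update previous and next-previous times.
    party.next_previous = party.previous
    party.previous = when
    return party

store = {}
for t in (0, 60, 120, 180, 5000):
    map_event(store, "A", "B", t)
print(store["A"]["B"])
```

With four messages a minute apart followed by one much later, only the third and fourth messages are counted as conversational; the first two of the conversation are not.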
Schema Design
Interim Data 
+ We are using an in-memory key/value store 
+ The key is the subscriber number 
+ The value is a set of OtherParty 
+ OtherParty data structure contains counts 
+ When the map is complete we transfer the 
data to disk for persistence
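A possible shape for the interim data, shown as a sketch. The slides describe the value as a set of OtherParty; here it is modelled as a dictionary keyed by the other party's number, and JSON is used as the on-disk format. Both choices, and the file name, are assumptions for illustration.

```python
import json
import os
import tempfile

# subscriber number -> {other party number -> counts}
store = {
    "5551230001": {
        "5559870001": {"total": 5, "conversational": 2},
        "5559870002": {"total": 1, "conversational": 0},
    },
}

def persist(store, path):
    # Transfer the in-memory store to disk once the map is complete.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(store, f)

def load(path):
    # Read a disk copy back, as a reduce worker would.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "interim-0.json")
persist(store, path)
restored = load(path)
print(restored == store)
```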
Reduce 
+ Collect intermediate data 
from disk copies 
+ Iterate through all parties for 
each subscriber 
+ Total all party counts 
+ Provide result as percentage 
of conversational messages 
to total messages
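A sketch of the reduce step, assuming each disk copy is a dictionary of subscriber to per-party counts and that every subscriber appears in exactly one copy (which the hashing step is meant to guarantee). The function names are hypothetical.

```python
def reduce_subscriber(parties):
    # Total all party counts for one subscriber.
    total = sum(p["total"] for p in parties.values())
    conversational = sum(p["conversational"] for p in parties.values())
    # Percentage of conversational messages to total messages.
    return 100.0 * conversational / total if total else 0.0

def reduce_all(interim_copies):
    # Collect intermediate data from the disk copies.
    merged = {}
    for copy in interim_copies:
        for subscriber, parties in copy.items():
            merged.setdefault(subscriber, {}).update(parties)
    return {s: reduce_subscriber(p) for s, p in merged.items()}

interim = [
    {"A": {"B": {"total": 5, "conversational": 2},
           "C": {"total": 5, "conversational": 0}}},
    {"D": {"E": {"total": 4, "conversational": 2}}},
]
result = reduce_all(interim)
print(result)
```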
Big Data Guidelines 
+ Find opportunities for concurrency 
+ Choose the right containers for your data 
+ Use memory as effectively as possible 
+ Minimize copying data 
+ Avoid any unnecessary overhead 
+ Anything you are going to do hundreds of 
billions of times should be efficient!
Thank you.
SLASSCOM TECH TALKS 
https://www.facebook.com/SlasscomTechnologyForum 
http://www.slasscom.lk/events 
https://twitter.com/slasscom 
www.slideshare.net/slasscomtechforum

Functional Big Data (by Vance Shipley)


Editor's Notes

  1. Successfully handling really big data requires massive concurrency, and in the real world this requires fault tolerance.
  2. Google didn’t invent map and reduce but they were the first to apply the paradigm in a general way on a massive scale.
  3. … or, more probably, a number of results. By dividing the work we can assign it to many servers. This concurrency is what allows scale.
  4. Here is an example of something which Google do as part of their core business. Google places web sites which are linked to by many other web sites higher in search results (PageRank). To determine this a map reads web pages found by crawlers and creates key/value pairs. These are written in memory and then pushed out in blocks to disk. A reduce reads these disk blocks and sorts all the intermediate data by key. The reduce function then iterates over all the pairs for a key and outputs one result for each key.
  5. The results from one MapReduce can be, and often are, provided as input for further MapReduce runs.
  6. Something like RAID, maybe Redundant Array of Inexpensive Servers (RAIS)? They can and do fail individually without the system failing.
  7. The user process forks all of the other processes which will be used including a master process. The master then assigns those processes work to perform, either map or reduce roles.
  8. The master process monitors each worker by sending a ping periodically. When it detects that a server has failed (or is no longer reachable) it will reassign that server’s work to another worker. After this reassignment each of the reduce workers will be notified to ignore the failed server and instead get the interim data from the newly assigned server.
  9. This is a contrived example.
  10. That’s billion with a ‘B’. In Canada that’s 1,000 million.
  11. There is an obvious hole in this pseudocode: the first two messages of the conversation are not included in the conversational totals. I could have accommodated that but I left it out to keep the example as simple as possible.