A Brief Introduction
Intro
 Intended to whet the audience’s appetite and possibly
start a discussion on this subject among interested
parties in this environment
 Big Data is relatively new and still emerging. In the last
decade it has influenced the emergence of new
companies as well as the way established businesses handle data
Puzzles for the IA
 How do the likes of Google and Bing index the entire
World Wide Web?
 How do Amazon and eBay maintain a global, dynamic,
open online market?
 How can the NSA process phone records and make
meaning of the data?
 How does Facebook handle all your nice pictures and
produce them when you ask?
The 3Vs of Big Data
The 4Vs of Big Data
6Vs of Big Data
 Volume – massive data approaching petabytes
 Velocity – highly transient data
 Variety – data is not pre-defined and not quite
structured
 Veracity – Is the data that is being stored and
mined meaningful to the problem being analyzed?
 Volatility – How long can we keep the data?
 Validity – Is the data valid and accurate for the
intended use?
Huh? 3, 4, or 6?
Why Big Data? – Saptak Sen
 Human Fault Tolerance
 Minimize CAPEX
 Hyper Scale on Demand
 Low Learning Curve
Key Drivers for Big Data Platforms
 Businesses want systems that can survive human error or
malicious intent.
 Businesses do not want to spend too much when the ROI
is not yet clear.
 Businesses want the assurance that even if they start
small, they can easily grow big by expanding rather than
replacing systems.
 Businesses and their staff want to spend the minimum
amount of time, money and effort on training.
Roadblock: CAP Theorem
Consistency
Availability
Partition Tolerance
No distributed system can guarantee all three of consistency, availability and
partition tolerance at the same time
CAP Explained
 Consistency
 At any point, all result sets fetched from different nodes in a
distributed system are the same (every read reflects the latest write)
 Availability
 Data is available when required and response time is within
acceptable limits
 Partition Tolerance
 The system keeps operating when a network failure cuts off
communication between some of its nodes
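The trade-off described by the three properties can be illustrated with a toy two-replica store. This is a minimal sketch, not a real database: the `Replica` and `TinyCluster` classes, the `mode` switch and the `partitioned` flag are all hypothetical names invented for the illustration. In "CP" mode the store refuses writes while the replicas cannot reach each other; in "AP" mode it accepts them and lets the replicas diverge.

```python
class Replica:
    """One node holding a copy of a single key-value map."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class TinyCluster:
    """Two replicas; mode is 'CP' or 'AP' (hypothetical toy, not a real store)."""

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True = the replicas cannot talk to each other

    def write(self, replica, key, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent: refuse the write, giving up availability.
                raise RuntimeError("unavailable during partition")
            # AP: accept locally, giving up consistency until the partition heals.
            replica.data[key] = value
            return
        replica.data[key] = value
        other.data[key] = value  # synchronous replication while connected

    def read(self, replica, key):
        return replica.data.get(key)


# CP cluster: the write during the partition is rejected.
cp = TinyCluster("CP")
cp.write(cp.a, "x", 1)
cp.partitioned = True
try:
    cp.write(cp.a, "x", 2)
    cp_write_accepted = True
except RuntimeError:
    cp_write_accepted = False

# AP cluster: the write is accepted, so the replicas now disagree.
ap = TinyCluster("AP")
ap.write(ap.a, "x", 1)
ap.partitioned = True
ap.write(ap.a, "x", 2)

print(cp_write_accepted, cp.read(cp.a, "x"), cp.read(cp.b, "x"))
print(ap.read(ap.a, "x"), ap.read(ap.b, "x"))
```

In the CP run both replicas still agree, at the cost of a rejected write; in the AP run the write succeeds but a read from the other replica returns stale data.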
Lambda Architecture
 Batch Layer
 Holds the more or less raw data, which grows continuously. The original
data is considered immutable (non-updateable and non-deleteable)
 Speed Layer
 Stores near real-time data. Provides a view of data within a
specific window. Compensates for the Batch layer’s high latency
 Serving Layer
 Allows for low-latency queries by serving views precomputed from the
Batch layer; results are combined with the Speed layer’s recent data.
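The interaction of the three layers can be sketched with a page-view counter. This is only an illustration under simple assumptions: everything lives in memory, and the names `master_dataset`, `batch_view`, `realtime_view`, `ingest`, `run_batch` and `query` are hypothetical. Merging the two views at query time is one common way to describe the architecture.

```python
from collections import Counter

master_dataset = []         # batch layer: append-only raw events
batch_view = Counter()      # serving layer: precomputed from the master dataset
realtime_view = Counter()   # speed layer: events seen since the last batch run


def ingest(page):
    """Append the raw event and let the speed layer count it immediately."""
    master_dataset.append(page)
    realtime_view[page] += 1


def run_batch():
    """Recompute the batch view from all raw data, then reset the speed layer."""
    global batch_view
    batch_view = Counter(master_dataset)
    realtime_view.clear()


def query(page):
    """Merge the batch view with the recent, not-yet-batched counts."""
    return batch_view[page] + realtime_view[page]


for p in ["home", "cart", "home"]:
    ingest(p)
run_batch()        # the slow, high-latency recomputation
ingest("home")     # arrives after the batch run; only the speed layer knows it
print(query("home"))
```

The raw events are never updated or deleted; only the derived views are rebuilt, which is what makes a full recomputation safe.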
Lambda Architecture
MapReduce
 Programming model developed and publicized by Google
in 2004 to address the need to analyze extremely large
volumes of data
 Map() and Reduce() functions are typically written in Java
but can also be written in other programming languages
 “… Synchronisation is the worst enemy of parallelism …”
- Saptak Sen
Detour: CXPACKET Wait Events
[Figure slides; only the label “FINISH” survives from the images]
MapReduce Process
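The process can be sketched with the classic word-count problem. The version below runs the map, shuffle and reduce steps in a single Python process purely for illustration; it does not use Hadoop, and the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical. In a real cluster the map calls run in parallel on different nodes and the framework performs the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


lines = ["apple banana apple", "banana cherry", "apple"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, which is why both phases can be parallelised without synchronisation between workers.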
Apache Hadoop
 An open-source, Java-based software framework that
supports data-intensive distributed applications.
 Supports running applications on large clusters of
commodity hardware.
 Created by Doug Cutting and Mike Cafarella; Yahoo was the main
early contributor to the open-source project
 Offers scalability, built-in data redundancy and relatively low
cost
Hadoop Architecture
RDBMS vs Hadoop MapReduce
Hadoop Framework
 Hadoop Common
 contains libraries and utilities needed by other Hadoop
modules
 Hadoop Distributed File System (HDFS)
 distributed file-system that stores data on commodity
machines, providing very high aggregate bandwidth across
the cluster.
 Hadoop YARN
 resource-management platform for managing compute
resources in clusters and using them for scheduling apps
 Hadoop MapReduce
 a programming model for large-scale data processing
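How HDFS spreads a file across commodity machines can be sketched as follows. This is a simplified illustration, not the HDFS implementation: the function `place_blocks` and the node names are hypothetical, and the round-robin placement stands in for the rack-aware policy a real NameNode applies. The block size of 128 MB and replication factor of 3 are the usual HDFS defaults.

```python
import math

BLOCK_SIZE_MB = 128   # usual HDFS default block size (Hadoop 2.x and later)
REPLICATION = 3       # usual HDFS default replication factor


def place_blocks(file_size_mb, data_nodes):
    """Return NameNode-style metadata: block index -> list of DataNodes."""
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for block in range(n_blocks):
        # Simple round-robin (an assumption); real HDFS placement is rack-aware.
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)]
            for r in range(REPLICATION)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]
metadata = place_blocks(300, nodes)
for block, replicas in metadata.items():
    print(block, replicas)
```

A 300 MB file becomes three blocks, each stored on three different nodes, so the loss of any single machine leaves every block readable.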
NoSQL
 NoSQL databases are distributed and are considered
better options than RDBMSes for applications that can
tolerate relaxing one of the CAP properties
 Whether NoSQL is superior to SQL for dealing with
the complexities of large volumes of data is a debatable
proposition
 NoSQL as a term was first used by Carlo Strozzi in 1998
and subsequently made popular by Eric Evans in 2009
NoSQL Families
 Key-Value Databases
 Dynamo, Riak, Oracle NoSQL
 Document
 MongoDB
 Column-Family Databases
 Cassandra, Bigtable, HBase
 Graph Databases
 Neo4j
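The four families differ mainly in the shape of the data they store. The sketch below models the same user record in each shape using plain Python structures; it is an illustration only, uses none of the products listed above, and all field names and the helper `followers_of` are hypothetical.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": b'{"name": "Ada", "city": "Lagos"}'}

# Document: a structured, nestable document the store can query into.
document = {"_id": 42, "name": "Ada", "address": {"city": "Lagos"}, "tags": ["admin"]}

# Column family: row key -> column family -> column -> value.
column_family = {"42": {"profile": {"name": "Ada"}, "location": {"city": "Lagos"}}}

# Graph: nodes plus labelled edges between them.
graph = {
    "nodes": {42: {"name": "Ada"}, 7: {"name": "Grace"}},
    "edges": [(42, "FOLLOWS", 7)],
}


def followers_of(g, node_id):
    """Traverse edges to find who follows the given node."""
    return [src for src, label, dst in g["edges"]
            if label == "FOLLOWS" and dst == node_id]


# The key-value store cannot see inside the value; the client must decode it.
decoded = json.loads(key_value["user:42"])
print(decoded["name"], document["address"]["city"], followers_of(graph, 7))
```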
Dynamo
 Created by Amazon to meet their requirements for
Extreme Availability, Extreme Performance and Extreme
Scalability
 Amazon was willing to sacrifice consistency (the ‘C’ in
CAP, which is not the same as the ‘C’ in the well-known ACID
relational database model) in order to provide higher availability
 Components of the Dynamo Key-Value Store were
Functional Segmentation, Sharding, Replication, and
BLOBs
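Sharding and replication in a Dynamo-style key-value store are commonly explained with a consistent-hash ring. The sketch below is a simplified illustration rather than Amazon's implementation: the `HashRing` class, the node names and the replica count of 3 are assumptions, and it omits virtual nodes for brevity.

```python
import bisect
import hashlib


def ring_position(key):
    """Hash a string to a position on the ring (MD5 used only to spread keys)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical toy ring: each node owns one position, no virtual nodes."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.ring = sorted((ring_position(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def preference_list(self, key):
        """Walk clockwise from the key's position and take the next nodes."""
        start = bisect.bisect_right(self.positions, ring_position(key))
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.preference_list("cart:12345")
print(owners)
```

The first node in the list acts as the shard owner for the key and the following nodes hold its replicas. Adding or removing a node only moves the keys adjacent to it on the ring, which is what lets the store grow by expanding rather than replacing systems.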
MongoDB
 Cross-platform document-oriented database system
developed between 2007 and 2010 by a company then
known as 10gen
 Considered the most popular NoSQL database, now used
by such big names as eBay, SourceForge and The New
York Times
 http://www.mongodb.org/
Large Volume: Amazon vs eBay
Summary
 Necessity is the mother of invention. The Internet
Revolution has driven deep thinkers to amazing heights in
the last decade, and Big Data is one of the results
 Big companies like Google and Yahoo showed
maturity and security by sharing such
groundbreaking discoveries with the world
 There is certainly a place in the future for Big Data, and
that future is approaching at light speed. It may pay to
key in.
… or questions
Authorities’ Quotes
 Ignatius Fernandez
 “… still believes in the superiority of relational technology
but believes that the relational camp needs to get its act
together if it wants to compete with the NoSQL camp in
performance, scalability, and availability …”
 Arup
 “… but in the case of the phone records, and especially
collated with other records to identify criminal or
pseudo-criminal activities such as financial records, travel records,
etc., the traditional databases such as Oracle and DB2 likely
will not scale well …”
 Cetin Ozbutun
 “Oracle Big Data Appliance X4-2 continues to raise the big
data bar, offering the industry's only comprehensive
appliance for Hadoop to securely meet enterprise big data
challenges”
 Google
 “… Bigtable is used by more than sixty Google products and
projects, including Google Analytics, Google Finance,
Orkut, Personalized Search, Writely, and Google Earth …”

Big Data Basic Concepts | Presented in 2014


Editor's Notes

  1. IA – Internet Age
  2. Source
  3. Source: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
  4. Volume: Gigabytes = 10^9 bytes; Terabytes = 10^12 bytes; Petabytes = 10^15 bytes; Exabytes = 10^18 bytes. Source: http://inside-bigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/ Ref: http://theinnovationenterprise.com/summits/big-data-boston-2014
  5. Human Fault Tolerance – Businesses require systems that can survive human error or malicious intent. Hence the craze for Data Protection and Disaster Recovery Hyper Scale on Demand – businesses want the assurance that even if they start small, they can easily grow big by expanding rather than replacing systems. Minimize CAPEX – Businesses do not want to spend too much when the ROIs are not yet clear Low Learning Curve – Businesses and their staff want to spend the minimum amount of time, money and effort on training Saptak Sen – Senior Product Manager, Azure Data Platform
  6. Meeting the above requirements faces a challenge called the CAP Theorem, or Brewer’s Theorem
  7. Consistency: at any point, all result sets fetched from different nodes are the same. Availability: data is available when required and the response time is as defined in the SLA. Partition Tolerance: the system keeps working when the network between its nodes is partitioned. As you scale, disk I/O lags the other components of a typical system. Consistency here is slightly different from ACID consistency. BASE is an acronym for Basically Available, Soft-state services with Eventual consistency.
  8. OLAP can sacrifice Consistency for Availability, but OLTP cannot. Highlight Eventual Consistency. Sharding: partition all tables in a schema in exactly the same way. Shards live on different servers in a shared-nothing fashion.
  9. Batch Layer – All raw data comes in here to the system Speed Layer – The response time limitations of the first layer are mitigated Serving Layer - Acquisition – the Batch Layer Organization Analysis
  10. “ … Synchronisation is the worst enemy of parallelism … “ - Saptak Sen
  11. Source: Ignatius Fernandez http://www.confio.com/webinars/nosql-big-data/lib/playback.html Parallelism occurs in both Map and Reduce phases Note Key Value Pairs. Highlight Word Count problem (or Fruit Count problem) Acquisition – the Batch Layer Organization Analysis Hello World for Big Data
  12. From Google’s and Yahoo’s approach to solving their high-volume data problems. http://arup.blogspot.com/2013/06/demystifying-big-data-for-oracle.html Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. HDInsight is Microsoft’s implementation of Hadoop. http://searchcloudcomputing.techtarget.com/definition/Hadoop
  13. The Name Node stores metadata and manages access. Each block is replicated across several Data Nodes.
  14. http://www.snia.org/sites/default/education/tutorials/2013/fall/BigData/SergeBazhievsky_Introduction_to_Hadoop_MapReduce_v2.pdf Designed for analytics, but Facebook customized it to support real-time use through messages in 2008.
  15. Hadoop MapReduce A programming model for large scale data processing
  16. http://www.slideshare.net/AswaniVonteddu/big-data-nosql-the-dba
  17. Concepts: Distributed Hash Tables, Eventual Consistency, Replication and Data Partitioning, Example: Amazon Dynamo Concepts: Distributed Key Value Stores, Supports Nested Columns, Example: Cassandra
  18. If you are not prepared to work with small schemas, by definition you are not interested in NoSQL. The first task in Amazon’s development of Dynamo was breaking up the monolithic schema. Amazon’s Dynamo did not use distributed transactions; it used asynchronous replication and aimed to provide what is called “Eventual Consistency”. Worthy of note is that eBay uses Oracle and regular SQL to achieve the same objectives as Amazon, according to Ignatius Fernandez.
  19. Source: Randy Shoup. http://www.infoq.com/presentations/shoup-ebay-architectural-principles