Big Data
Tiago Knoch - Software developer
Agenda
• What and Why Big Data?
• 4 Vs
• NoSQL
• CAP Theorem
What is Big Data?
Source: http://olap.com/wp-content/uploads/2013/11/bigstock-Big-data-concept-in-word-tag-c-49922318.jpg
Why Big Data?
Source: http://www.go-gulf.com/wp-content/themes/go-gulf/blog/60seconds.jpg
4 Vs
• Volume
• Velocity
• Variety
• Veracity
Volume
Source: http://www-01.ibm.com/software/data/bigdata/images/4-Vs-of-big-data.jpg
Volume
• 2.7 zettabytes of data exist in the digital universe today (2012)
• Ford’s modern hybrid Fusion model generates up to
25 GB of data per hour
• Google processes 20 PB a day (2008)
• Facebook has 30+ PB of user generated data
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
A petabyte (PB) is 10^15 bytes of data: 1,000 terabytes (TB) or 1,000,000 gigabytes (GB).
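To make the unit sizes above concrete, here is a minimal sketch using decimal (SI) prefixes. The constant and function names are illustrative assumptions, not part of the slides, and the per-day LHC figure simply spreads the quoted 15 PB evenly over a year.

```python
# Decimal (SI) byte units; the names are illustrative.
GB = 10**9
TB = 10**12
PB = 10**15
ZB = 10**21

def convert(value, from_unit, to_unit):
    """Convert a quantity from one byte unit to another."""
    return value * from_unit / to_unit

pb_in_tb = convert(1, PB, TB)                  # terabytes in one petabyte
pb_in_gb = convert(1, PB, GB)                  # gigabytes in one petabyte
zb_in_pb = convert(2.7, ZB, PB)                # the 2.7 ZB figure, in petabytes
lhc_tb_per_day = convert(15, PB, TB) / 365     # 15 PB a year, averaged per day

print(pb_in_tb, pb_in_gb, zb_in_pb, round(lhc_tb_per_day, 1))
```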
Velocity
• YouTube users upload 48 hours of new video every minute
• The LHC experiments represent about 150 million sensors
delivering data 40 million times per second.
• Twitter has 50 million tweets per day (2012)
• Prozone tracks 10 data points per second for every player,
or 1.4 million data points per game
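The rates above can be turned into back-of-the-envelope numbers. The sketch below is only an order-of-magnitude check: the player count and match length are assumptions for illustration (the slide does not say how Prozone arrives at 1.4 million), and multiplying sensors by sampling rate treats every LHC sensor as reporting on every sample, which is an upper-bound reading of the slide.

```python
# Figures quoted on the slide.
POINTS_PER_PLAYER_PER_SECOND = 10
TWEETS_PER_DAY = 50_000_000
LHC_SENSORS = 150_000_000
LHC_SAMPLES_PER_SECOND = 40_000_000

# Assumptions for illustration only (not from the slide).
PLAYERS_ON_PITCH = 22
MATCH_SECONDS = 90 * 60

points_per_game = POINTS_PER_PLAYER_PER_SECOND * PLAYERS_ON_PITCH * MATCH_SECONDS
tweets_per_second = TWEETS_PER_DAY / (24 * 60 * 60)
lhc_readings_per_second = LHC_SENSORS * LHC_SAMPLES_PER_SECOND

print(points_per_game)                    # same order of magnitude as 1.4 million
print(round(tweets_per_second))
print(f"{lhc_readings_per_second:.1e}")
```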
Variety
• Text, numerical, images, audio, video,
sequences, time series, social media data,
multi-dim arrays, etc…
• Static data vs. streaming data
Veracity (complexity)
Ventana report (02/2014) indicated that, in every
analytic exercise, 40-60% of time is spent on "data
preparation" processes - removing duplicates, fixing
partial entries, eliminating null/blank entries,
concatenating data, collapsing columns or splitting
columns, aggregating results into buckets, and
more.
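A few of the data preparation steps listed above can be sketched in plain Python. The sample rows, field names, and ten-year age buckets are invented for illustration; real pipelines usually rely on a dedicated library, which this sketch deliberately avoids.

```python
from collections import Counter

# Invented sample data with the usual problems.
raw_rows = [
    {"id": 1, "age": 34, "city": "Lisbon"},
    {"id": 1, "age": 34, "city": "Lisbon"},   # exact duplicate
    {"id": 2, "age": None, "city": "Porto"},  # null entry
    {"id": 3, "age": 51, "city": ""},         # blank entry
    {"id": 4, "age": 27, "city": "Braga"},
]

def prepare(rows):
    """Drop exact duplicates and rows containing null or blank values."""
    seen = set()
    clean = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        if any(value is None or value == "" for value in row.values()):
            continue
        clean.append(row)
    return clean

def age_bucket(age):
    """Map an age to a ten-year bucket label such as '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

clean_rows = prepare(raw_rows)
buckets = Counter(age_bucket(row["age"]) for row in clean_rows)
print(clean_rows)
print(dict(buckets))
```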
NoSQL
A NoSQL database provides a mechanism for storage and
retrieval of data that is modeled in means other than the tabular
relations used in relational databases. Motivations for this
approach include simplicity of design, horizontal scaling, and
finer control over availability. The data structures used by
NoSQL databases (e.g. key-value, graph, or document) differ
from those used in relational databases, making some
operations faster in NoSQL and others faster in relational
databases.
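To illustrate how the data structures differ, the sketch below stores the same invented record three ways: in relational tables (using Python's built-in sqlite3), as a key-value entry, and as a nested document. The in-memory dictionaries only stand in for real key-value and document stores; table names, keys, and fields are assumptions.

```python
import json
import sqlite3

# Relational: fixed columns; related rows live in a second table and are joined.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE orders (user_id INTEGER, item TEXT)")
db.execute("INSERT INTO users VALUES (1, 'Ana')")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "book"), (1, "pen")])
relational_items = [
    item
    for (item,) in db.execute(
        "SELECT o.item FROM users u JOIN orders o ON o.user_id = u.id "
        "WHERE u.id = ? ORDER BY o.item",
        (1,),
    )
]

# Key-value: the store sees an opaque value that is looked up by key.
kv_store = {"user:1": json.dumps({"name": "Ana", "orders": ["book", "pen"]})}
kv_items = json.loads(kv_store["user:1"])["orders"]

# Document: a nested structure fetched in one lookup, with no join.
documents = {1: {"name": "Ana", "orders": [{"item": "book"}, {"item": "pen"}]}}
document_items = [order["item"] for order in documents[1]["orders"]]

print(relational_items, kv_items, document_items)
```

Fetching one user's orders is a single lookup in the document form, while a question such as "which users bought a pen" is easier to ask of the relational tables.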
NoSQL
• Large Volumes of data
• Dynamic Schemas
• Auto-Sharding
• Replication
• Horizontally Scalable
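Sharding and replication can be sketched as a function that maps each key to the nodes that store it. This is a toy under stated assumptions: the node names and replica count are hypothetical, and simple modulo placement moves most keys when a node is added or removed, which is why production systems typically use consistent hashing instead.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names
REPLICAS = 2                            # copies kept of each key

def owners(key, nodes=NODES, replicas=REPLICAS):
    """Return the nodes that hold a key: a primary shard plus its successors."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

placement = {key: owners(key) for key in ("user:1", "user:2", "user:3")}
for key, nodes in placement.items():
    print(key, "->", nodes)
```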
CAP Theorem
• Consistency - A read is guaranteed to return the
most recent write for a given client.
• Availability - The system will always respond to
a request (even if it's not the latest data or
consistent across the system).
• Partition Tolerance - The system continues to
operate if individual servers fail or can't be
reached.
CAP Theorem
Source: http://robertgreiner.com/2014/08/cap-theorem-revisited/
CAP Theorem
AP - Availability/Partition Tolerance -
Return the most recent version of the data
you have, which could be stale. This
system state will also accept writes that
can be processed later when the partition
is resolved. Choose Availability over
Consistency when your business
requirements allow for some flexibility
around when the data in the system
synchronizes or when the system needs to
continue to function in spite of external
errors (shopping carts, etc.)
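The AP behaviour described above can be sketched with a toy two-replica store. The class and attribute names are hypothetical, and conflict resolution is deliberately left out: queued writes are simply replayed to the peer when the partition heals.

```python
class APReplica:
    """Toy replica that always answers, even when cut off from its peer."""

    def __init__(self):
        self.data = {}
        self.pending = []        # writes accepted while partitioned
        self.partitioned = False
        self.peer = None

    def write(self, key, value):
        self.data[key] = value
        if self.partitioned:
            self.pending.append((key, value))   # process later
        else:
            self.peer.data[key] = value

    def read(self, key):
        return self.data.get(key)               # may be stale

    def heal(self):
        """Replay queued writes once the peer is reachable again."""
        self.partitioned = False
        for key, value in self.pending:
            self.peer.data[key] = value
        self.pending.clear()


a, b = APReplica(), APReplica()
a.peer, b.peer = b, a

a.write("cart", ["book"])
a.partitioned = b.partitioned = True
a.write("cart", ["book", "pen"])   # accepted despite the partition
stale = b.read("cart")             # b still answers, with old data
a.heal()
b.heal()
fresh = b.read("cart")
print(stale, fresh)
```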
CAP Theorem
CP - Consistency/Partition Tolerance -
Wait for a response from the partitioned
node which could result in a timeout
error. The system can also choose to
return an error, depending on the
scenario you desire. Choose
Consistency over Availability when your
business requirements dictate atomic
reads and writes.
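The CP behaviour described above can be sketched the same way: a toy replica that refuses to answer rather than risk an inconsistent result. The class and exception names are hypothetical, and a real system would normally time out while waiting for the peer instead of checking a flag.

```python
class Unavailable(Exception):
    """Raised when the replica cannot guarantee a consistent answer."""


class CPReplica:
    """Toy replica that returns an error while its peer is unreachable."""

    def __init__(self):
        self.data = {}
        self.partitioned = False
        self.peer = None

    def _require_peer(self):
        if self.partitioned:
            raise Unavailable("peer unreachable; refusing to answer")

    def write(self, key, value):
        self._require_peer()
        self.data[key] = value
        self.peer.data[key] = value   # both copies change together

    def read(self, key):
        self._require_peer()
        return self.data.get(key)


a, b = CPReplica(), CPReplica()
a.peer, b.peer = b, a

a.write("balance", 100)
a.partitioned = b.partitioned = True
try:
    a.write("balance", 50)
    rejected = False
except Unavailable:
    rejected = True                   # consistency kept, availability lost

a.partitioned = b.partitioned = False
balance = b.read("balance")
print(rejected, balance)
```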
CAP Theorem
• CP can also have some Availability through Replication
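One common way to read this point is majority quorums: with three replicas, the store keeps serving consistent reads and writes while any two are reachable. The sketch below is an illustration under that assumption, not something the slides specify; the class name is hypothetical and the single global version counter is a simplification.

```python
class QuorumStore:
    """Toy store: reads and writes succeed only with a majority of replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [dict() for _ in range(replica_count)]
        self.up = [True] * replica_count
        self.quorum = replica_count // 2 + 1
        self.version = 0

    def _reachable(self):
        live = [r for r, ok in zip(self.replicas, self.up) if ok]
        if len(live) < self.quorum:
            raise RuntimeError("no majority reachable")
        return live

    def write(self, key, value):
        live = self._reachable()
        self.version += 1
        for replica in live:
            replica[key] = (self.version, value)

    def read(self, key):
        live = self._reachable()
        versions = [r[key] for r in live if key in r]
        if not versions:
            return None
        # Any two majorities overlap, so the newest version is always visible.
        return max(versions, key=lambda pair: pair[0])[1]


store = QuorumStore(replica_count=3)
store.write("x", 1)
store.up[0] = False                      # one replica fails
store.write("x", 2)                      # still a majority: write succeeds
store.up[0], store.up[1] = True, False   # a different replica fails
value = store.read("x")                  # newest value despite a stale replica

store.up[0] = False                      # now only one replica is left
try:
    store.read("x")
    minority_served = True
except RuntimeError:
    minority_served = False
print(value, minority_served)
```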
Summary
• The solution depends on the problem!
• There’s more than one way.
• Don’t throw away RDBMS
• Big Data is FUN!
Questions?

Big Data Introduction


Editor's Notes

  1. Ask audience for first word when they think of Big Data
  2. Let’s find out first WHY the need for Big Data!
  3. Web 2.0 story: from content consumers to content producers
  4. PB = petabytes. F1, pharma, industry, military, governments, etc.
  5. Everything has a sensor now, everything is mobile, everything has internet access
  6. Broadband!
  7. LHC = Large Hadron Collider
  8. Big Data Veracity refers to the biases, noise and abnormality in data Analysis!
  9. Doesn’t mean “No SQL”. Classic relational databases are not good for big data!
  10. Source: Wikipedia
  11. Wikipedia
  12. CAP (by Eric Brewer, 1998) states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees. The classic model is CA: it doesn’t have partition tolerance, so replication of data is not that easy. NEXT SLIDE HAS IMAGES
  13. Let’s just focus on CP and AP
  14. AMAZON EXAMPLE!
  15. Mobile games example! Or Web ads! Atomicity - Everything in a transaction must happen successfully or none of the changes are committed. This avoids a transaction that changes multiple pieces of data from failing halfway and only making a few changes.
  16. If the data is replicated, the probability that it is available (and up to date) is high
  17. RDBMS = Relational database management system
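
The atomicity described in note 15 can be sketched with Python's built-in sqlite3 module. The table, account names, and amounts are invented for illustration; the point is only that a transaction failing halfway leaves no partial change behind.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
db.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = transfer(db, "alice", "bob", 60)    # succeeds
second = transfer(db, "alice", "bob", 60)   # debit would go negative: rolled back
balances = dict(db.execute("SELECT name, balance FROM accounts"))
print(first, second, balances)
```

In the second call the credit to bob runs before the debit fails, yet bob's balance is unchanged afterwards because the whole transaction is rolled back.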