Introduction to Big Data
Definition of Big Data
• "Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate."
• "Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information."
• Data is growing much faster than computation speeds
• A single machine can no longer process or even store all this data!
The Big Data problem
Where does Big Data come from?
• Online recorded content:
  • Clicks
  • Ad views
  • Server requests
  • ... everything that happens online can potentially be recorded
• User-generated content (Facebook, Twitter, Instagram, etc.)
  • Smartphone users reach for their phones 150 times a day (2013)
• Health and scientific computing
  • The Large Hadron Collider produces about twice as much data as Twitter every year
• Internet of Things (IoT)
  • smart thermostat systems
  • automobiles with built-in sensors
  • all kinds of “smart” devices of various sizes
Example scales of Big Data
• EIR communication logs: 1.4 TB / day
• Facebook logs: 60 TB / day
• Google total web index: ~10+ PB (10,000+ TB)
• Facebook total data: 300 PB with an incoming rate of 600 TB / day (2014)
• ... as a reminder ...
  • Time to read 1 TB from disk: about 3 hours (at 100 MB/s)
  • Reading the Google web index serially from a single disk would take ~3.4 years
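The read times above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes) and takes the 100 MB/s throughput and the 10 PB index size from the slide; the function name is only illustrative.

```python
# Back-of-the-envelope sequential read times for a single disk.
# Assumption: decimal units, i.e. 1 MB = 10**6 bytes, 1 TB = 10**12 bytes.

DISK_BYTES_PER_SECOND = 100 * 10**6  # 100 MB/s, the figure used on the slide


def read_time_seconds(size_bytes: int) -> float:
    """Time needed to read size_bytes sequentially from one disk."""
    return size_bytes / DISK_BYTES_PER_SECOND


one_tb_hours = read_time_seconds(10**12) / 3600
index_years = read_time_seconds(10 * 10**15) / (3600 * 24 * 365)

print(f"1 TB:  {one_tb_hours:.1f} hours")   # 1 TB:  2.8 hours
print(f"10 PB: {index_years:.1f} years")    # 10 PB: 3.2 years
```

The slide's ~3.4 years follows from rounding the 1 TB figure up to 3 hours first (10,000 x 3 hours); the unrounded result is closer to 3.2 years. Either way, one machine is not an option.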
How do we program this thing?
OK but I don’t work at Google yet ...
Startup example
• Let’s design a simple web tracker from scratch
• Register and count each page view for a number of clients
• “Keep simple things simple”
• Version 1.0:
• Problem?
  • Huge number of page views => massive DB load on concurrent updates => DB timeouts => FAIL
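The Version 1.0 design itself is not preserved in the text, so the following is only one plausible reading of it: every page view triggers its own database write. The table layout and function names are hypothetical, and an in-memory SQLite database stands in for whatever database the tracker would really use.

```python
import sqlite3

# Hypothetical schema: one counter row per (client, url) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_views ("
    "client_id TEXT, url TEXT, views INTEGER NOT NULL, "
    "PRIMARY KEY (client_id, url))"
)


def record_page_view(client_id: str, url: str) -> None:
    """Naive approach: one write transaction for every single page view."""
    with conn:  # commits on success, rolls back on error
        updated = conn.execute(
            "UPDATE page_views SET views = views + 1 "
            "WHERE client_id = ? AND url = ?",
            (client_id, url),
        ).rowcount
        if updated == 0:
            # First view of this page: create the counter row.
            conn.execute(
                "INSERT INTO page_views (client_id, url, views) "
                "VALUES (?, ?, 1)",
                (client_id, url),
            )


def view_count(client_id: str, url: str) -> int:
    """Read the current counter, or 0 if the page was never viewed."""
    row = conn.execute(
        "SELECT views FROM page_views WHERE client_id = ? AND url = ?",
        (client_id, url),
    ).fetchone()
    return row[0] if row else 0


for _ in range(3):
    record_page_view("client-a", "/home")
print(view_count("client-a", "/home"))  # 3
```

With this shape, the number of write transactions equals the number of page views, which is where the concurrent-update load comes from.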
Version 2.0
• Why write each count?!
• Let’s introduce a queue and buffer updates
• Problem?
  • The number of page views and the number of clients keep increasing => DB overload => FAIL
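Buffering can be sketched as follows. This is a simplified, single-process illustration: an in-memory counter stands in for the queue, and `apply_batch` is a hypothetical stand-in for the real database write. The class name and the `max_pending` threshold are assumptions, not part of the original design.

```python
from collections import Counter
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (client_id, url)


class BufferedCounter:
    """Accumulates page views in memory and writes them out in batches."""

    def __init__(self, flush_fn: Callable[[Dict[Key, int]], None],
                 max_pending: int = 1000) -> None:
        self._pending: Counter = Counter()
        self._pending_events = 0
        self._flush_fn = flush_fn
        self._max_pending = max_pending

    def record(self, client_id: str, url: str) -> None:
        self._pending[(client_id, url)] += 1
        self._pending_events += 1
        if self._pending_events >= self._max_pending:
            self.flush()

    def flush(self) -> None:
        """Send all buffered increments to the store as one batch."""
        if not self._pending:
            return
        batch = dict(self._pending)
        self._pending.clear()
        self._pending_events = 0
        self._flush_fn(batch)


# Stand-in for the database: a Counter plus a log of batch sizes.
store: Counter = Counter()
batches_written = []


def apply_batch(batch: Dict[Key, int]) -> None:
    store.update(batch)  # adds the buffered increments to the totals
    batches_written.append(len(batch))


buffered = BufferedCounter(apply_batch, max_pending=100)
for _ in range(250):
    buffered.record("client-a", "/home")
buffered.flush()  # write whatever is still buffered

print(store[("client-a", "/home")])  # 250
print(len(batches_written))          # 3
```

Here 250 page views turn into 3 writes instead of 250. The load on the database still grows with traffic and with the number of clients, only more slowly, which is the problem the slide points out.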
Version 3.0
• The bottleneck is the write-heavy DB
• Let’s shard the database!
• Problems?!
  • We have to keep adding new servers and re-sharding existing databases
  • Re-sharding online is tricky (maybe introduce pending queues?)
  • A single code failure can corrupt a huge set of data collected over years
  • Maintenance nightmare
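The re-sharding pain can be made concrete with a small experiment. The slides do not say which sharding scheme is used; this sketch assumes the simplest one, hash-modulo placement, and the key names are made up. It counts how many keys land on a different shard when a fifth server is added to a four-server setup.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Hash-modulo placement: a stable hash of the key, modulo shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical client identifiers used as shard keys.
keys = [f"client-{i}" for i in range(10_000)]

# Growing from 4 to 5 shards: which keys change their home shard?
moved = sum(1 for key in keys if shard_for(key, 4) != shard_for(key, 5))
fraction_moved = moved / len(keys)

print(f"{fraction_moved:.0%} of keys must be migrated")
```

With a uniform hash, a key stays put only when its hash gives the same remainder modulo 4 and modulo 5, which happens for about one key in five. Roughly 80% of the data therefore has to move for a single added server, and it has to move while writes keep arriving.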
Is there a way out?
• We need new tools that handle:
  • automatic sharding and re-sharding
  • automatic replication and rebalancing
  • fault tolerance
  • effortless horizontal scaling
• But we need to adapt ourselves as well. We need:
  • a new definition of “data” (data ≠ information)
  • new architectures (Lambda Architecture)
  • immutable data (for scaling and fault tolerance)
  • functional programming concepts
• No, writing 25-year-old-style structured code in this year’s favorite language won’t cut it anymore
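Applied to the page-view tracker, these ideas could look roughly like this. It is a toy, single-process sketch of the Lambda Architecture idea, not a reference implementation: raw events are stored as immutable facts, a pure function derives the counts, and a query merges a periodically recomputed batch view with a view over recent events. All names and sample events are hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class PageView(NamedTuple):
    """One immutable fact: a page of a client was viewed at some time."""
    client_id: str
    url: str
    timestamp: int  # seconds since the epoch


def count_views(events: Iterable[PageView]) -> Counter:
    """Pure function: derives counts (information) from raw events (data)."""
    return Counter((event.client_id, event.url) for event in events)


# Master dataset: append-only, never updated in place.
master_log = (
    PageView("client-a", "/home", 1000),
    PageView("client-a", "/home", 1010),
    PageView("client-b", "/pricing", 1020),
)

# Events that arrived after the last batch run.
recent_events = (
    PageView("client-a", "/home", 2000),
)

batch_view = count_views(master_log)        # recomputed from scratch
realtime_view = count_views(recent_events)  # small, incremental


def query(client_id: str, url: str) -> int:
    """Answer a query by merging the batch view and the real-time view."""
    key = (client_id, url)
    return batch_view[key] + realtime_view[key]


print(query("client-a", "/home"))  # 3
```

Because the counts are derived rather than stored as the source of truth, a bug in `count_views` can be fixed and the views recomputed from the untouched log, instead of corrupting years of collected data.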
Big Data tooling
• Apache Hadoop Distributed File System (HDFS)
  • Distributed, scalable, portable filesystem written in Java
  • Open source, 10-year-old (!) project
  • Handles files in the gigabyte-to-terabyte range
  • Manages automatic replication and rebalancing of data
  • Facebook had 21 PB of storage on HDFS in 2010
  • Yahoo had a cluster of 10,000 Hadoop nodes in 2008
• Apache Spark
  • Next-generation data processing engine written in Scala
  • Open source, 5-year-old project
  • Up to 100 times faster than Hadoop MapReduce (the project's own figure for in-memory workloads)
  • Uses functional programming techniques to process data
  • Can scale down to run in an IDE!
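The functional style mentioned for Spark can be illustrated without Spark itself. The sketch below is plain Python, not the Spark API: the data is split into partitions, each partition is counted independently with a pure function, and the partial results are merged with an associative operation. The partition contents and function names are made up for illustration.

```python
from collections import Counter
from functools import reduce
from typing import List

# Hypothetical page-view URLs, already split into partitions as they might
# be across the machines of a cluster.
partitions: List[List[str]] = [
    ["/home", "/pricing", "/home"],
    ["/home", "/docs"],
    ["/docs", "/docs", "/pricing"],
]


def count_partition(urls: List[str]) -> Counter:
    """Map step: count one partition without touching shared state."""
    return Counter(urls)


def merge(left: Counter, right: Counter) -> Counter:
    """Reduce step: associative merge, so any grouping gives the same total."""
    return left + right


partial_counts = map(count_partition, partitions)
total = reduce(merge, partial_counts, Counter())

print(total["/home"], total["/docs"], total["/pricing"])  # 3 3 2
```

Since neither function mutates shared state, the partitions can be processed in any order or on different machines, and a failed partition can simply be recomputed. The same program also runs unchanged on a laptop, which is the sense in which such a model "scales down".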
Apache Spark at a glance
The good news
• The right tools are available and open source
• The knowledge is available and mostly free
• It’s all ready to be learned!
