Where Is Your Data?: An Introduction to Problems and Bottlenecks in Data Systems
John Joo, Program Director
David Drummond, Program Director
Insight Data Engineering
Program mentors are data engineers from
top technology companies including:
Goals
• Understand the different components of the
tech stack at a high level.
• Understand the hardware bottlenecks that
dictate the tech stack.
• Understand the tech stacks that are generally
used for different types of companies, and why.
Computing basics
• Various ports (I/O): up to ~10 GB/s
• CPU (processor): ~1 GHz
• Hard drive (storage): ~250 GB
• RAM (memory): ~8 GB
• Roles: network (I/O ports), processing (CPU and RAM), storage (hard drive)
What does this look like for a
business?
Data @ Point of Sale
• 1 transaction → 2 kB
• What did Customer buy?
• How much did Customer
spend?
• When did Customer make
this transaction?
Daily Data @ Individual Store
• ~50,000 transactions / store /
day → 100 MB
• Servers at back of store
• What items were sold today?
• What was our revenue for
today?
• How much was refunded today?
• What do we need to do to
restock for tomorrow?
Yearly Data @ Individual Store
• 20 million transactions → 40 GB /
year
• What are some seasonal trends in
purchased items?
• How should we target our coupons or
advertisements to local customers?
• Who were the most efficient
employees?
• Should the store’s hours change
depending on the time of year?
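The per-store volumes on these slides follow from simple multiplication. A minimal sketch of that arithmetic, assuming decimal units (1 kB = 1,000 bytes) and a 365-day year; the function names are illustrative and not from the talk:

```python
# Assumption: decimal units, chosen because they reproduce the slide's round numbers.
KB, MB, GB = 10**3, 10**6, 10**9

TRANSACTION_BYTES = 2 * KB        # from the slides: 1 transaction -> 2 kB
TRANSACTIONS_PER_DAY = 50_000     # from the slides: per store, per day
DAYS_PER_YEAR = 365               # assumption: store is open every day


def daily_bytes(transactions_per_day=TRANSACTIONS_PER_DAY):
    """Bytes one store produces in a day."""
    return transactions_per_day * TRANSACTION_BYTES


def yearly_bytes(transactions_per_day=TRANSACTIONS_PER_DAY):
    """Bytes one store produces in a year."""
    return daily_bytes(transactions_per_day) * DAYS_PER_YEAR


if __name__ == "__main__":
    print(f"daily:  {daily_bytes() / MB:.0f} MB")
    print(f"yearly: {yearly_bytes() / GB:.1f} GB")
```

Under these assumptions the daily figure is exactly 100 MB and the yearly figure is 36.5 GB, which the slide rounds up to 40 GB.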
Yearly Data @ All Stores
• 7 billion transactions → 10 TB / year
• Requires storage in data centers
• What national sales campaigns should we
run? Ads, coupons, commercials, web.
• What should the CEO's compensation
be?
• Where should we open Supercenters,
Discount Stores, Neighborhood Stores,
Walmart Expresses?
• What music should we play in the stores?
Complete Historic Data @ All Stores
• 16 years (1992 - 2008)
• 1 trillion transactions → 2.5 PB
• Data centers
  • “Area 71” in Caverna, Missouri: 125,000 square feet, 460 TB
  • Colorado Springs: 210,000 square feet, $100 million
Bottlenecks in Data Systems
Proper data system design should consider
these limiting bottlenecks:
• Loading data into the CPU and memory
• Finding data on the disk
• Moving data across the network
Bottlenecks: Loading Data
• All data that is processed must be loaded into the CPU
(Figure: disk storage, memory, and CPU compared by price and speed)
• Solution: Distributed computing with ample memory
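One way to read "distributed computing with ample memory" is as a sizing question: how many machines are needed so that the whole dataset sits in RAM at once? A rough sketch, using the ~8 GB RAM and 10 TB / year figures from earlier slides; `usable_fraction` is an assumed allowance for the operating system and working space, not a number from the talk:

```python
GB, TB = 10**9, 10**12  # assumption: decimal units


def machines_needed(dataset_bytes, ram_bytes_per_machine, usable_fraction=0.5):
    """Smallest number of machines whose combined usable RAM holds the dataset.

    usable_fraction is a hypothetical knob: the share of each machine's RAM
    that is actually free for data.
    """
    usable = int(ram_bytes_per_machine * usable_fraction)
    if usable <= 0:
        raise ValueError("no usable memory per machine")
    return -(-dataset_bytes // usable)  # ceiling division on integers


if __name__ == "__main__":
    # One year of data for all stores, on machines like the one sketched earlier.
    print(machines_needed(10 * TB, 8 * GB))       # half of RAM usable
    print(machines_needed(10 * TB, 8 * GB, 1.0))  # all of RAM usable
```

The count scales linearly with the dataset, which is why larger memory per machine reduces the size of the cluster.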
Bottlenecks: Finding Data
• Finding a new file on disk (known as random seeks)
(Figure: hard disk with an actuator arm whose head reads from the disk; the beginning and end of the desired file are marked)
• Solution: SSD and structuring data in the order it is accessed
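The effect of "structuring data in the order it is accessed" can be illustrated by counting how often the head has to reposition. This is a toy model, not a disk simulator: it assumes a file laid out in numbered blocks and treats any read that does not start at the block right after the previous one as a seek. The block count and the seed are arbitrary.

```python
import random


def count_seeks(block_positions):
    """Count reads that do not continue directly from the previous block."""
    seeks = 0
    previous = None
    for position in block_positions:
        if previous is None or position != previous + 1:
            seeks += 1  # head must move to a new location
        previous = position
    return seeks


blocks = list(range(1000))            # hypothetical file: 1,000 contiguous blocks
shuffled = blocks[:]
random.Random(0).shuffle(shuffled)    # same blocks, requested in random order

if __name__ == "__main__":
    print("in stored order:", count_seeks(blocks))
    print("in random order:", count_seeks(shuffled))
```

Reading in stored order costs a single initial seek, while the shuffled order pays for a seek on nearly every block.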
Bottlenecks: Moving Data
• Moving data from machine to machine over a network
• Solution: Keeping data close to the processors
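Keeping data close to the processors often means running the computation where the data lives and sending only the result. A small sketch using the retail example: a store either ships every transaction to headquarters or computes its daily revenue locally and ships one number. The record layout, the fixed amount, and the use of JSON as the wire format are assumptions made for illustration.

```python
import json


def local_revenue(transactions):
    """Aggregate at the store, next to the data."""
    return sum(t["amount_cents"] for t in transactions)


def bytes_on_wire(payload):
    """Size of the payload if sent as UTF-8 JSON (assumed wire format)."""
    return len(json.dumps(payload).encode("utf-8"))


# Hypothetical day of sales at one store: 50,000 transactions, as on the slides.
store_transactions = [{"id": i, "amount_cents": 1999} for i in range(50_000)]

ship_everything = bytes_on_wire(store_transactions)
ship_summary = bytes_on_wire({"revenue_cents": local_revenue(store_transactions)})

if __name__ == "__main__":
    print("bytes if raw data is moved:   ", ship_everything)
    print("bytes if only result is moved:", ship_summary)
```

The summary is a few dozen bytes regardless of how many transactions the store recorded, so network cost no longer grows with the data.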
Bottlenecks: Example
• Processing a 2 kB transaction in memory, sequentially and
randomly on disk, or across the network
• Ratios shown: 100:1, 200:1, 50:1
Tech Stacks for Companies
Depending on your growth plans:
• Single system with small data
• Distributed data center with large data
• Renting computers for flexibility
Small Firms with Small Data
• Example: Small medical firm with slow growth
• Pros: Easy to maintain, data locality, inexpensive
• Cons: Difficult to grow quickly, risky, not ideal for analysis
Large Firms with Stable Growth
• Example: Facebook with steadily growing data centers
• Pros: Economies of scale, redundancy, innovative design
• Cons: Upfront capital, dedicated maintenance
• >100 PB of Data
• 7 PB / Day
• 1 kW / TB
• ~$20 / TB / Month
Start-Ups with Exponential Growth
• Example: Airbnb - rents processing and storage from AWS
• Pros: Scales easily, no maintenance, no upfront capital
• Cons: Expensive in the long run, dependent on the provider
• 50 GB / Day
• $20-50 / TB / Month
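The storage figures on these slides can be turned into a rough monthly bill. A back-of-the-envelope sketch, assuming nothing is ever deleted, prices stay flat, and 1 TB = 1,000 GB; the function names are illustrative:

```python
GB_PER_TB = 1000  # assumption: decimal units


def stored_tb(days, gb_per_day=50):
    """Total data held after `days` of growth at a constant daily rate."""
    return days * gb_per_day / GB_PER_TB


def monthly_cost_range(tb, low=20, high=50):
    """(low, high) monthly storage cost in dollars at $low-$high per TB."""
    return tb * low, tb * high


if __name__ == "__main__":
    after_one_year = stored_tb(365)            # start-up growing at 50 GB / day
    print(after_one_year, "TB stored")
    print(monthly_cost_range(after_one_year))  # rented at $20-50 / TB / month
    print(monthly_cost_range(100_000, 20, 20)) # 100 PB at ~$20 / TB / month
```

At start-up scale the rented bill stays in the hundreds of dollars per month, while at 100 PB even the lower rate comes to about $2 million per month, which is where owning data centers starts to pay off.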
Start-Ups with Exponential Growth
• Example: Netflix - AWS fails on Christmas Eve
• Con: You can rent the computers, but you own the failure
Data Pipeline
• Ingestion: Gathering data in a reliable way
• File System: Storing the unstructured data redundantly
• Batch Processing: Processing the data in large batches at the data center
• Realtime Processing: Processing live streaming data reliably
• Database: Organizing data for quick access
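The five pipeline stages can be sketched end to end in a few lines. This is a toy, in-memory illustration under stated assumptions: the event fields (`store`, `amount_cents`), the three-replica store, and every function and class name are hypothetical, and real systems use dedicated tools for each stage.

```python
from collections import defaultdict


def ingest(raw_events):
    """Ingestion: gather events, dropping ones that lack required fields."""
    for event in raw_events:
        if "store" in event and "amount_cents" in event:
            yield event


class FileSystem:
    """File system: keep every raw event on several replicas."""

    def __init__(self, replicas=3):
        self.replicas = [[] for _ in range(replicas)]

    def append(self, event):
        for replica in self.replicas:
            replica.append(dict(event))  # independent copy per replica

    def read_all(self):
        return list(self.replicas[0])


def batch_revenue_by_store(events):
    """Batch processing: one pass over everything that has been stored."""
    totals = defaultdict(int)
    for event in events:
        totals[event["store"]] += event["amount_cents"]
    return dict(totals)


def run_pipeline(raw_events):
    file_system = FileSystem()
    live_total = 0
    for event in ingest(raw_events):
        file_system.append(event)
        live_total += event["amount_cents"]  # realtime: updated per event
    # Database: results keyed by store for quick lookup.
    database = batch_revenue_by_store(file_system.read_all())
    return database, live_total


if __name__ == "__main__":
    events = [
        {"store": "A", "amount_cents": 500},
        {"store": "B", "amount_cents": 250},
        {"malformed": True},
        {"store": "A", "amount_cents": 100},
    ]
    print(run_pipeline(events))
```

The realtime path answers immediately from a running total, while the batch path recomputes from the stored raw data, so a bug in either can be corrected by replaying what the file system kept.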
Conclusion
• Understand the different components of the
tech stack at a high level
• Understand the hardware bottlenecks that
dictate the tech stack
• Understand the tech stacks that are generally
used for different types of companies, and why
