What is Big Data in a Nutshell?:
An Introduction to Problems and
Bottlenecks in Data Systems
Zach Gazak
David E Drummond
Insight Data Science & Engineering
Program mentors are data teams from top
technology companies.
• 500+ Fellows
• 100+ Companies
Goals
• Understand what can be done with “Big Data” and
the scale of the data.
• Understand the hardware bottlenecks that dictate
the technology “stack”.
• Understand different stacks that are used for
different types of companies, and why.
Facebook is Data
Types of Data
• Audio / Visual:
Images and Videos
• Text: Comments,
Notes, Profile Content
• Interactions: Likes,
Friendships, Groups
• Site usage: Log in,
Scroll, Click, Post, etc.
• Unstructured: Audio / Visual, Text
• Structured: Interactions, Site usage
How is it Used?
• Business Intelligence / Analytics
• Customer engagement
• Research and Development
• Product Iteration and Improvement
How much data is there?
For Zach:
• ~1 MB per month
• Unstructured data only
For 1.2 billion Zachs:
• ~1.2 petabytes per month
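The step from one user to the whole user base is a single multiplication. A minimal sketch of that arithmetic, assuming decimal (SI) units where 1 PB = 10^9 MB; the function name is illustrative and not from the talk:

```python
# Assumption: SI units, so 1 PB = 1e9 MB. Inputs are the slide's round estimates.
MB_PER_PB = 10**9

def monthly_volume_pb(users: int, mb_per_user_per_month: float) -> float:
    """Total data generated per month, in petabytes."""
    return users * mb_per_user_per_month / MB_PER_PB

total_pb = monthly_volume_pb(users=1_200_000_000, mb_per_user_per_month=1.0)
print(f"{total_pb:.1f} PB per month")
```

Because the per-user figure covers unstructured data only, the result is best read as a lower bound.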
How is this done?
Hardware basics
• Network: Various ports (I/O), up to ~10 GB/s
• Processing: CPU (processor), ~1 GHz
• Processing: RAM (memory), ~8 GB
• Storage: Hard Drive, ~250 GB
Bottlenecks in Data Systems
Proper data system design should consider these
limiting bottlenecks:
• Processing time by the CPU
• Loading data into the CPU and memory
• Finding data on the disk
• Reading data from the disk
• Moving data across the network
Bottlenecks: Processing Data
• All data that is processed must be loaded into the CPU
[Figure: storage hierarchy of Disk Storage, Memory, and CPU, labeled with Price and Speed]
• Solution: Storage Hierarchy, Supercomputers, Distributed Systems
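One way to picture a storage hierarchy is a small, fast tier that sits in front of a large, slow one and keeps only recently used items. The sketch below is an illustration under stated assumptions, not a design from the talk: `MemoryCache` is a hypothetical name, a Python dict stands in for the disk, and least-recently-used eviction is one of several possible policies.

```python
from collections import OrderedDict

class MemoryCache:
    """Small LRU cache standing in for fast, expensive memory (hypothetical)."""

    def __init__(self, capacity, slow_read):
        self.capacity = capacity
        self.slow_read = slow_read   # function that reads from the slow tier
        self.items = OrderedDict()
        self.slow_reads = 0          # how often we had to go to the slow tier

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)     # mark as most recently used
            return self.items[key]
        self.slow_reads += 1
        value = self.slow_read(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used item
        return value

# A dict stands in for the disk; a real slow tier would be far slower to read.
disk = {f"user:{i}": f"profile {i}" for i in range(1000)}
cache = MemoryCache(capacity=2, slow_read=disk.__getitem__)

for key in ["user:1", "user:2", "user:1", "user:3", "user:2"]:
    cache.get(key)
print("reads that reached the slow tier:", cache.slow_reads)
```

The counter makes the trade-off visible: a larger cache costs more but sends fewer reads to the slow tier.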
Bottlenecks: Finding Data
• Finding a new file on disk (known as random seeks)
[Figure: hard disk with an actuator arm and a head that reads from the disk; the beginning and end of the desired file are marked]
• Solution: SSD and structured databases for specific use cases
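The cost of random seeks can be estimated with a simple model: a fixed delay for each seek plus the time to transfer the bytes. The seek time and throughput below are illustrative assumptions for a spinning disk, not figures from the talk, and `read_seconds` is a hypothetical helper.

```python
# Illustrative assumptions for a spinning disk, not figures from the talk.
SEEK_SECONDS = 0.010            # time to reposition the head once
THROUGHPUT_BYTES_PER_S = 100e6  # sequential read rate

def read_seconds(total_bytes: float, seeks: int) -> float:
    """Estimated read time: a fixed cost per seek plus the transfer time."""
    return seeks * SEEK_SECONDS + total_bytes / THROUGHPUT_BYTES_PER_S

record_bytes = 2_000  # a 2 kB record
records = 10_000

# One contiguous file: a single seek, then stream everything.
sequential = read_seconds(record_bytes * records, seeks=1)
# Records scattered across the disk: one seek per record.
scattered = read_seconds(record_bytes * records, seeks=records)

print(f"sequential: {sequential:.2f} s, scattered: {scattered:.2f} s")
```

Under these assumptions the seeks dominate the scattered case. An SSD has no head to move, so its per-access cost is far smaller.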
Bottlenecks: Moving Data
• Moving data from machine to machine over a network
• Solution: Keeping data close to the processors (MapReduce)
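The idea of keeping data close to the processors can be sketched in a single process: each partition is summarized where it is stored, and only the small summaries are combined. This is a toy illustration of the map and reduce steps, not Hadoop or any specific framework; the partition contents and function names are made up for the example.

```python
from collections import Counter
from functools import reduce

# Each inner list stands in for the events stored on one machine (made-up data).
partitions = [
    ["like", "post", "like"],
    ["click", "like"],
    ["post", "click", "click"],
]

def map_partition(events):
    """Runs where the data lives; returns a small summary, not the raw data."""
    return Counter(events)

def reduce_counts(left, right):
    """Merges two partial summaries, the only data that needs to move."""
    return left + right

partials = [map_partition(p) for p in partitions]    # local work per machine
totals = reduce(reduce_counts, partials, Counter())  # summaries are merged

print(dict(totals))
```

In a real cluster the map step would run on the machine holding each partition, so the network carries a handful of counts instead of every raw event.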
Bottlenecks: Example
• Processing a 2 kB transaction in memory, sequentially and
randomly on disk, or across the network
100:1, 200:1, 50:1
Open Questions
• Will processors continue to improve?
• Are there new types of processing?
• What if memory replaced hard disks?
Quantum Computing
GPU and Deep Learning
Memory Optimized
Tech Stacks for Companies
Depending on your growth plans:
• Single system with small data
• Distributed data center with large data
• Renting computers for flexibility (cloud)
Small Firms with Small Data
• Example: Small medical firm with slow growth
• Pros: Easy to maintain, data locality, inexpensive
• Cons: Difficult to grow quickly, risky, not ideal for analysis
Large Firms with Stable Growth
• Example: Facebook with steadily growing data centers
• Pros: Economies of scale, redundancy, innovative design
• Cons: Upfront capital, dedicated maintenance
• >100 PB of Data
• 7 PB / Day
• 1 kW / TB
• ~$20 / TB / Month
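Combining the stored volume with the per-terabyte rate gives a rough monthly figure. A minimal sketch, assuming SI units (1 PB = 1,000 TB) and assuming the ~$20 rate applies uniformly to all stored data; the slide does not say either, and the function name is illustrative.

```python
TB_PER_PB = 1_000  # assumption: SI units

def monthly_storage_cost(stored_pb: float, usd_per_tb_month: float) -> float:
    """Rough monthly cost of keeping stored_pb petabytes online."""
    return stored_pb * TB_PER_PB * usd_per_tb_month

# 100 PB is the slide's lower bound (">100 PB"), so this is a floor, not a total.
cost = monthly_storage_cost(stored_pb=100, usd_per_tb_month=20)
print(f"at least ${cost:,.0f} per month")
```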
Start-Ups with Exponential Growth
• Example: AirBnB - rent processing and storage from AWS
• Pros: Scales easily, no maintenance, no upfront capital
• Cons: Expensive in the long run, dependent on the data provider
• 50 GB / Day
• $20-50 / TB / Month
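A rented storage bill grows with everything kept so far, not just with the new data. The sketch below accumulates the slide's 50 GB per day over a year at both ends of the quoted price range. It assumes 30-day months, SI units, a constant daily volume, and that nothing is deleted; all of these are simplifications, and `yearly_bill` is a hypothetical helper.

```python
GB_PER_TB = 1_000      # assumption: SI units
DAYS_PER_MONTH = 30    # assumption: uniform months

def yearly_bill(gb_per_day: float, usd_per_tb_month: float, months: int = 12) -> float:
    """Sum of monthly storage charges while data accumulates and is never deleted."""
    stored_tb = 0.0
    total = 0.0
    for _ in range(months):
        stored_tb += gb_per_day * DAYS_PER_MONTH / GB_PER_TB  # this month's new data
        total += stored_tb * usd_per_tb_month                 # billed on all data held
    return total

low = yearly_bill(gb_per_day=50, usd_per_tb_month=20)
high = yearly_bill(gb_per_day=50, usd_per_tb_month=50)
print(f"first-year storage bill: ${low:,.0f} to ${high:,.0f}")
```

With exponential rather than constant growth the daily volume itself rises each month, so the bill climbs faster than this estimate.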
Start-Ups with Exponential Growth
• Example: Netflix - AWS fails on Christmas Eve
• Con: You can rent the computers, but you own the failure
Questions?
• info@insightdatascience.com
• jzgazak@gmail.com
• david@insightdatascience.com

Data for Action Talk - 2016-02-22
