1. Where Is Your Data?:
An Introduction to Problems and
Bottlenecks in Data Systems
John Joo, Program Director
David Drummond, Program Director
Insight Data Engineering
4. Goals
• Understand the different components of the
tech stack at a high level.
• Understand the hardware bottlenecks that
dictate the tech stack.
• Understand the tech stacks that are generally
used for different types of companies, and why.
11. Data @ Point of Sale
• 1 Transaction → 2 kB
• What did Customer buy?
• How much did Customer
spend?
• When did Customer make
this transaction?
12. Daily Data @ Individual Store
• ~50,000 transactions / store /
day → 100 MB
• Servers at back of store
• What items were sold today?
• What was our revenue for
today?
• How much was refunded today?
• What do we need to do to
restock for tomorrow?
13. Yearly Data @ Individual Store
• 20 million transactions → 40 GB /
year
• What are some seasonal trends in
purchased items?
• How should we target our coupons or
advertisements to local customers?
• Who were the most efficient
employees?
• Should the store’s hours change
depending on the time of year?
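The size figures on the last three slides follow from simple multiplication. The sketch below reproduces them in Python. It assumes decimal units (1 kB = 1,000 bytes) and a store that is open 365 days a year, neither of which is stated on the slides, and the function names are illustrative.

```python
KB = 1_000            # assumption: decimal units
MB = 1_000 * KB
GB = 1_000 * MB

BYTES_PER_TRANSACTION = 2 * KB   # from the slide: 1 transaction -> 2 kB
TRANSACTIONS_PER_DAY = 50_000    # from the slide: ~50,000 / store / day
DAYS_PER_YEAR = 365              # assumption: the store is open every day


def daily_store_bytes() -> int:
    """Bytes generated by one store in one day."""
    return TRANSACTIONS_PER_DAY * BYTES_PER_TRANSACTION


def yearly_store_transactions() -> int:
    """Transactions handled by one store in one year."""
    return TRANSACTIONS_PER_DAY * DAYS_PER_YEAR


def yearly_store_bytes() -> int:
    """Bytes generated by one store in one year."""
    return yearly_store_transactions() * BYTES_PER_TRANSACTION


if __name__ == "__main__":
    print(f"daily:  {daily_store_bytes() / MB:.0f} MB")
    print(f"yearly: {yearly_store_transactions() / 1e6:.2f} million transactions, "
          f"{yearly_store_bytes() / GB:.1f} GB")
```

Under these assumptions the result is 100 MB per day and about 18 million transactions (36.5 GB) per year, which the slides round to 20 million and 40 GB.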
14. Computer components (figure labels)
• Various ports (I/O): up to ~10 GB/s
• RAM (memory): ~8 GB
• CPU (processor): ~1 GHz
• Hard drive (storage): ~250 GB
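One way to read these numbers is to ask which of the store-level data sets from the earlier slides fit on this single machine. The following is a rough sketch that assumes decimal units and ignores operating-system and file-system overhead; `fits` is a hypothetical helper.

```python
MB = 1_000_000        # assumption: decimal units
GB = 1_000 * MB

RAM_BYTES = 8 * GB      # from the slide: ~8 GB of RAM
DISK_BYTES = 250 * GB   # from the slide: ~250 GB hard drive

DATASETS = {
    "one store, one day": 100 * MB,   # from the earlier slides
    "one store, one year": 40 * GB,   # from the earlier slides
}


def fits(size_bytes: int) -> str:
    """Return the fastest tier of this machine that can hold the whole data set."""
    if size_bytes <= RAM_BYTES:
        return "memory"
    if size_bytes <= DISK_BYTES:
        return "disk"
    return "neither"


if __name__ == "__main__":
    for name, size in DATASETS.items():
        print(f"{name}: fits in {fits(size)}")
```

A day of store data fits comfortably in memory, while a year of it only fits on disk, which is where the bottlenecks discussed later start to matter.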
15. Yearly Data @ All Stores
• 7 billion transactions → 10 TB / year
• Requires servers in data centers
• What national sales campaigns should we
run? Ads, coupons, commercials, web.
• What should the CEO's compensation
be?
• Where should we open Supercenters,
Discount Stores, Neighborhood Stores,
Walmart Expresses?
• What music should we play in the stores?
16. Complete Historic
Data @ All Stores
• 16 years (1992 - 2008)
• 1 trillion transactions → 2.5 PB
• Data centers:
  • “Area 71” in Caverna, Missouri: 125,000 square feet, 460 TB
  • Colorado Springs: 210,000 square feet, $100 million
18. Bottlenecks in Data Systems
Proper data system design should consider
these limiting bottlenecks:
• Loading data into the CPU and memory
• Finding data on the disk
• Moving data across the network
19. Bottlenecks: Loading Data
• All data that is processed must be loaded into the CPU
[Figure: storage hierarchy of disk storage, memory, and CPU, compared by price and speed]
• Solution: Distributed computing with ample memory
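As a rough illustration of what "ample memory" implies, the sketch below estimates how many machines are needed to hold a data set entirely in RAM. It reuses the ~8 GB per machine and the 10 TB yearly total quoted on earlier slides, and it assumes decimal units with no replication or overhead. `machines_needed` is a hypothetical helper.

```python
GB = 1_000_000_000    # assumption: decimal units
TB = 1_000 * GB


def machines_needed(data_bytes: int, ram_per_machine_bytes: int) -> int:
    """Smallest number of machines whose combined RAM holds the data (ceiling division)."""
    return (data_bytes + ram_per_machine_bytes - 1) // ram_per_machine_bytes


if __name__ == "__main__":
    # 10 TB of yearly data spread over machines with 8 GB of RAM each.
    print(machines_needed(10 * TB, 8 * GB))
```

With these inputs the estimate is 1,250 machines, before accounting for any redundancy.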
21. Bottlenecks: Finding Data
• Finding a new file on disk (known as random seeks)
[Figure: hard disk with an actuator arm whose head reads from the disk; the beginning and end of the desired file are marked]
• Solution: SSD and structuring data in the order it is accessed
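To make "structuring data in the order it is accessed" concrete, the toy model below counts a seek whenever the next block requested is not the one immediately after the block just read. The block numbers and the `count_seeks` helper are invented for illustration, and real drives and file systems are more complicated.

```python
def count_seeks(block_order: list[int]) -> int:
    """Count head movements: any read that does not continue from the previous block."""
    seeks = 0
    previous = None
    for block in block_order:
        if previous is None or block != previous + 1:
            seeks += 1
        previous = block
    return seeks


# Hypothetical access pattern: the same eight blocks, scattered vs. laid out in read order.
scattered = [5, 0, 7, 2, 4, 1, 6, 3]
in_access_order = sorted(scattered)

if __name__ == "__main__":
    print("scattered layout:", count_seeks(scattered), "seeks")
    print("access-order layout:", count_seeks(in_access_order), "seeks")
```

In this model the scattered layout needs a seek for every block, while the layout that matches the access order needs only the initial one.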
24. Bottlenecks: Moving Data
• Moving data from machine to machine over a network
• Solution: Keeping data close to the processors
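One common way to keep data close to the processors is to compute a small summary on each machine and send only the summaries over the network. The sketch below compares the bytes moved by that approach with shipping every raw record to one place. The store names and amounts are made up, the 2 kB record size comes from the point-of-sale slide, and the 8-byte summary size is an assumption.

```python
RECORD_BYTES = 2_000   # 2 kB per transaction, as on the point-of-sale slide
SUMMARY_BYTES = 8      # assumption: one 64-bit number per machine

# Hypothetical per-machine data: transaction amounts in cents.
machines = {
    "store_a": [1999, 550, 1200],
    "store_b": [300, 4500],
    "store_c": [725, 725, 725, 100],
}


def ship_data(machines: dict) -> tuple:
    """Move every raw record to a central machine, then sum there."""
    all_amounts = [amt for amounts in machines.values() for amt in amounts]
    bytes_moved = len(all_amounts) * RECORD_BYTES
    return sum(all_amounts), bytes_moved


def ship_summaries(machines: dict) -> tuple:
    """Sum locally on each machine, then move only the partial sums."""
    partial_sums = [sum(amounts) for amounts in machines.values()]
    bytes_moved = len(partial_sums) * SUMMARY_BYTES
    return sum(partial_sums), bytes_moved


if __name__ == "__main__":
    print("ship data:     ", ship_data(machines))
    print("ship summaries:", ship_summaries(machines))
```

Both functions return the same total, but the second moves a few bytes per machine instead of a full record per transaction.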
25. Bottlenecks: Example
• Processing a 2 kB transaction in memory, sequentially and
randomly on disk, or across the network
[Figure: relative costs labeled 100:1, 200:1, and 50:1]
26. Tech Stacks for Companies
Depending on your growth plans:
• Single system with small data
• Distributed data center with large data
• Renting computers for flexibility
27. Small Firms with Small Data
• Example: Small medical firm with slow growth
• Pros: Easy to maintain, data locality, inexpensive
• Cons: Difficult to grow quickly, risky, not ideal for analysis
30. Large Firms with Stable Growth
• Example: Facebook with steadily growing data centers
• Pros: Economies of scale, redundancy, innovative design
• Cons: Upfront capital, dedicated maintenance
• >100 PB of Data
• 7 PB / Day
• 1 kW / TB
• ~$20 / TB / Month
31. Start-Ups with Exponential Growth
• Example: Airbnb - rents processing and storage from AWS
• Pros: Scales easily, no maintenance, no upfront capital
• Cons: Expensive in the long run, dependence on the provider
• 50 GB / Day
• $20-50 / TB / Month
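The last two figures can be combined into a rough monthly storage bill. The sketch assumes data accumulates at a constant 50 GB per day, nothing is deleted, and units are decimal, none of which the slide states. `monthly_cost_range` is a hypothetical helper, and the result is not a quote of actual AWS pricing.

```python
GB_PER_DAY = 50           # from the slide: 50 GB / day
LOW_USD_PER_TB = 20       # from the slide: $20-50 / TB / month
HIGH_USD_PER_TB = 50


def stored_tb(days: int) -> float:
    """Total data kept after `days`, in TB, assuming a constant ingest rate."""
    return days * GB_PER_DAY / 1_000


def monthly_cost_range(days: int) -> tuple:
    """Low and high monthly storage cost in dollars for the data held after `days`."""
    tb = stored_tb(days)
    return tb * LOW_USD_PER_TB, tb * HIGH_USD_PER_TB


if __name__ == "__main__":
    for days in (30, 360):
        low, high = monthly_cost_range(days)
        print(f"after {days} days: {stored_tb(days):.1f} TB, ${low:.0f}-${high:.0f} per month")
```

After 360 days this gives 18 TB and a bill of roughly $360 to $900 per month. Growth that is exponential rather than constant would push the figure up faster.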
32. Start-Ups with Exponential Growth
• Example: Netflix - AWS fails on Christmas Eve
• Con: You can rent the computers, but you own the failure
33. Data Pipeline
• Ingestion: Gathering data in a reliable way
• File System: Storing the unstructured data redundantly
• Batch Processing: Processing the data in large batches at the data center
• Realtime Processing: Processing live streaming data reliably
• Database: Organizing data for quick access
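As a toy illustration of how the five pipeline stages relate, the sketch below runs a handful of made-up events through in-memory stand-ins for each stage. Every name here is hypothetical, and real pipelines use dedicated systems for each stage rather than Python lists and dictionaries.

```python
from collections import defaultdict


def ingest(raw_lines):
    """Ingestion: gather events, skipping malformed ones rather than failing."""
    for line in raw_lines:
        parts = line.split(",")
        if len(parts) == 2 and parts[1].isdigit():
            yield {"store": parts[0], "cents": int(parts[1])}


def store_redundantly(events, replicas=2):
    """File system: keep several copies of the raw events."""
    events = list(events)
    return [list(events) for _ in range(replicas)]


def batch_totals(stored_copy):
    """Batch processing: aggregate over the full stored data set."""
    totals = defaultdict(int)
    for event in stored_copy:
        totals[event["store"]] += event["cents"]
    return dict(totals)


def running_total(events):
    """Realtime processing: update a result as each event arrives."""
    total = 0
    for event in events:
        total += event["cents"]
        yield total


# Made-up input: "store,amount_in_cents", with one malformed line.
raw = ["a,100", "b,250", "garbage", "a,50"]
events = list(ingest(raw))
copies = store_redundantly(events)
database = batch_totals(copies[0])   # Database: results keyed by store for quick lookup
live = list(running_total(events))

if __name__ == "__main__":
    print(database)
    print(live)
```

The batch path produces per-store totals after all data is stored, while the realtime path emits an updated total after every event.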
34. Conclusion
• Understand the different components of the
tech stack at a high level
• Understand the hardware bottlenecks that
dictate the tech stack
• Understand the tech stacks that are generally
used for different types of companies, and why