Data for Action Talk - 2016-02-22
1. What is Big Data in a Nutshell? An Introduction to Problems and Bottlenecks in Data Systems
Zach Gazak
David E Drummond
Insight Data Science & Engineering
3. Program mentors are data teams from top technology companies.
• 500+ Fellows
• 100+ Companies
4. Goals
• Understand what can be done with “Big Data” and the scale of the data.
• Understand the hardware bottlenecks that dictate the technology “stack”.
• Understand the different stacks used by different types of companies, and why.
6. Types of Data
• Audio / Visual: Images and Videos
• Text: Comments, Notes, Profile Content
• Interactions: Likes, Friendships, Groups
• Site usage: Log in, Scroll, Click, Post, etc.
• These types span a range from unstructured to structured data.
8. How is it Used?
• Business Intelligence / Analytics
• Customer engagement
9. How is it Used?
• Research and Development
• Product Iteration and Improvement
15. [Diagram: components of a computer]
• Various ports (I/O): up to ~10 GB/s
• CPU (processor): ~1 GHz
• Hard drive (storage): ~250 GB
• RAM (memory): ~8 GB
16. [Diagram: components grouped by role]
• Network: various ports (I/O), up to ~10 GB/s
• Processing: RAM (memory), ~8 GB; CPU (processor), ~1 GHz
• Storage: hard drive, ~250 GB
17. Bottlenecks in Data Systems
Proper data system design should consider these limiting bottlenecks:
• Processing time by the CPU
• Loading data into the CPU and memory
• Finding data on the disk
• Reading data from the disk
• Moving data across the network
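To make these bottlenecks concrete, here is a rough back-of-envelope sketch that plugs in the approximate figures from the hardware slide (~250 GB of storage, ~8 GB of memory, I/O of up to ~10 GB/s). The function names are illustrative only, and the results are best-case bounds rather than measurements.

```python
import math

# Approximate figures quoted on the hardware slide.
DISK_GB = 250       # hard drive capacity
RAM_GB = 8          # memory capacity
IO_GB_PER_S = 10    # upper bound on I/O throughput

def passes_through_memory(data_gb, ram_gb):
    """Number of memory-sized chunks needed to stream data_gb through RAM."""
    return math.ceil(data_gb / ram_gb)

def best_case_transfer_seconds(data_gb, gb_per_s):
    """Lower bound on transfer time, assuming the peak rate is sustained."""
    return data_gb / gb_per_s

print(passes_through_memory(DISK_GB, RAM_GB))             # 32
print(best_case_transfer_seconds(DISK_GB, IO_GB_PER_S))   # 25.0
```

Even in this best case, a full drive cannot fit in memory at once and must be processed in about 32 pieces.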
18. Bottlenecks: Processing Data
• All data that is processed must be loaded into the CPU
[Diagram: storage hierarchy from Disk Storage through Memory to CPU, with price and speed increasing toward the CPU]
• Solution: Storage Hierarchy, Supercomputers, Distributed Systems
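One way to picture the idea behind these solutions is to process data in pieces that fit in memory and then combine the partial results. The sketch below is a minimal single-machine illustration, not how any particular system implements it; `chunks` and `chunked_sum` are hypothetical helper names.

```python
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def chunked_sum(values, chunk_size):
    """Sum `values` while holding only one chunk in memory at a time."""
    partial_sums = (sum(chunk) for chunk in chunks(values, chunk_size))
    # In a distributed system each partial sum could come from a different
    # machine; here they are simply combined on one.
    return sum(partial_sums)

print(chunked_sum(range(1_000_000), chunk_size=10_000))  # 499999500000
```

The chunk size plays the role of available memory: a smaller chunk means less data loaded at once, at the cost of more passes.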
20. Bottlenecks: Finding Data
• Finding a new file on disk (known as a random seek)
[Diagram: hard disk with an actuator arm whose head reads from the disk; the beginning and end of the desired file are marked]
• Solution: SSDs and structured databases for specific use cases
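A simple cost model shows why random seeks hurt: every separate file adds a seek before any data is read. The seek time and read throughput below are illustrative assumptions for a spinning hard drive, not figures from the slides, and `read_seconds` is a hypothetical helper.

```python
# Illustrative assumptions (not from the slides): rough figures for a
# spinning hard drive.
SEEK_SECONDS = 0.010          # assumed time to move the head to a new file
READ_MB_PER_SECOND = 100.0    # assumed sequential read throughput

def read_seconds(total_mb, num_seeks):
    """Estimated time: one seek per separate file plus sequential reading."""
    return num_seeks * SEEK_SECONDS + total_mb / READ_MB_PER_SECOND

# The same 1000 MB stored as one contiguous file vs. 100,000 small files.
one_file = read_seconds(1000, num_seeks=1)
many_files = read_seconds(1000, num_seeks=100_000)
print(f"one file:   {one_file:.2f} s")
print(f"many files: {many_files:.2f} s")
```

Under these assumptions the scattered layout is dominated by seek time, which is the cost that SSDs and purpose-built databases aim to reduce.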
29. Tech Stacks for Companies
Depending on your growth plans:
• Single system with small data
• Distributed data center with large data
• Renting computers for flexibility (cloud)
30. Small Firms with Small Data
• Example: Small medical firm with slow growth
• Pros: Easy to maintain, data locality, inexpensive
• Cons: Difficult to grow quickly, risky, not ideal for analysis
33. Large Firms with Stable Growth
• Example: Facebook with steadily growing data centers
• Pros: Economies of scale, redundancy, innovative design
• Cons: Upfront capital, dedicated maintenance
• >100 PB of Data
• 7 PB / Day
• 1 kW / TB
• ~$20 / TB / Month
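Taking the slide's figures at face value, a quick calculation gives a sense of scale. This sketch assumes decimal units (1 PB = 1000 TB) and treats ">100 PB" as exactly 100 PB, so the results are rough lower bounds; the function names are illustrative.

```python
TB_PER_PB = 1000  # assumption: decimal units (1 PB = 1000 TB)

def monthly_cost_usd(stored_pb, usd_per_tb_month):
    """Monthly storage cost for stored_pb petabytes."""
    return stored_pb * TB_PER_PB * usd_per_tb_month

def power_mw(stored_pb, kw_per_tb):
    """Power draw in megawatts for stored_pb petabytes."""
    return stored_pb * TB_PER_PB * kw_per_tb / 1000  # kW -> MW

def days_to_ingest(stored_pb, ingest_pb_per_day):
    """Days of ingest needed to equal the stored volume."""
    return stored_pb / ingest_pb_per_day

print(monthly_cost_usd(100, 20))          # 2000000
print(power_mw(100, 1))                   # 100.0
print(round(days_to_ingest(100, 7), 1))   # 14.3
```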
34. Start-Ups with Exponential Growth
• Example: Airbnb, renting processing and storage from AWS
• Pros: Scales easily, no maintenance, no upfront capital
• Cons: Expensive in the long run, dependence on the provider
• 50 GB / Day
• $20-50 / TB / Month
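The growth and price figures above can be combined into a rough monthly bill. This sketch assumes decimal units (1 TB = 1000 GB), that no data is ever deleted, and that the quoted price is the only cost; the helper names are hypothetical.

```python
GB_PER_TB = 1000  # assumption: decimal units (1 TB = 1000 GB)

def stored_tb_after(days, gb_per_day=50):
    """Total data stored after `days` of steady growth with no deletion."""
    return days * gb_per_day / GB_PER_TB

def monthly_bill_range(stored_tb, low_usd_per_tb=20, high_usd_per_tb=50):
    """Low and high monthly cost in USD for stored_tb terabytes."""
    return stored_tb * low_usd_per_tb, stored_tb * high_usd_per_tb

tb_after_year = stored_tb_after(365)
print(tb_after_year)                      # 18.25
print(monthly_bill_range(tb_after_year))  # (365.0, 912.5)
```

The bill starts small but grows with every day of retained data, which is one reading of "expensive in the long run".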
35. Start-Ups with Exponential Growth
• Example: Netflix - AWS fails on Christmas Eve
• Con: You can rent the computers, but you own the failure