©MapR Technologies - Confidential
Scalability in Hadoop and
Similar Systems
Chicago Hadoop in Finance - Ted Dunning

Talk about what scalability really means in terms of interacting processes and statistics of growth by Ted Dunning

Published in: Technology, Business
  • Why is big data so fashionable with big and small companies across different industries? What has suddenly changed?
  • Google searches are up 10x over just four years ago.
  • Hadoop use is exploding. This example shows job trends for Hadoop, which is further evidence that you should pay attention during this talk.
  • But we have seen constant growth for a long time. Simple growth would only explain some kinds of companies (probably big ones) starting with big data, followed by slow adoption elsewhere. Databases started with big companies and took 20 years or more to reach everywhere because the need exceeded the cost at different times for different companies. The internet, on the other hand, largely happened to everybody at the same time, so it changed things in nearly all industries at all scales nearly simultaneously. Why is big data exploding right now, and why is it exploding at all?
  • The different kinds of scaling laws have different shapes, and I think that shape is the key.
  • The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
  • In classical analytics, the cost of analysis increases sharply with scale.
  • The result is a net value that has a sharp optimum in the area where value is increasing rapidly and cost is not yet increasing so rapidly.
  • New techniques such as Hadoop result in linear scaling of cost. This is a change in shape and it causes a qualitative change in the way that costs trade off against value to give net value. As technology improves, the slope of this cost line is also changing rapidly over time.
  • This next sequence shows how the net value changes under linear cost models with different slopes.
  • Notice how the best net value has jumped up significantly.
  • And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.
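The trade-off described in the notes above (value with diminishing returns, a classical cost that rises sharply, and a Hadoop-style cost that rises linearly) can be sketched numerically. This is only an illustration: the curve shapes, the constants, and the function names `value`, `exponential_cost`, `linear_cost`, and `best_net_value` are assumptions chosen for the sketch, not taken from the slides.

```python
import math

SCALES = range(0, 2001)  # same 0..2000 range as the Scale axis on the plots


def value(scale, tau=200.0):
    """Assumed diminishing-returns value curve that saturates at 1."""
    return 1.0 - math.exp(-scale / tau)


def exponential_cost(scale, c0=0.01, k=100.0):
    """Assumed 'old school' cost that grows exponentially with scale."""
    return c0 * (math.exp(scale / k) - 1.0)


def linear_cost(scale, slope):
    """Assumed 'big data' cost that grows linearly with scale."""
    return slope * scale


def best_net_value(cost):
    """Return (scale, net value) at the grid point maximizing value - cost."""
    best = max(SCALES, key=lambda s: value(s) - cost(s))
    return best, value(best) - cost(best)


scale, net = best_net_value(exponential_cost)
print(f"exponential cost: best scale={scale}, net value={net:.3f}")

# Flatten the linear cost line step by step, as in the plot sequence.
for slope in (0.002, 0.001, 0.0005, 0.0001):
    scale, net = best_net_value(lambda s, m=slope: linear_cost(s, m))
    print(f"linear cost, slope={slope}: best scale={scale}, net value={net:.3f}")
```

With these made-up constants, the steeper linear costs give a lower best net value than the exponential cost, while the shallower ones give a higher best net value at a larger scale. That is the qualitative pattern the notes describe, not a measurement.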

    1. Scalability in Hadoop and Similar Systems
    2. Big is the next big thing
       • Big data and Hadoop are exploding
       • Companies are being funded
       • Books are being written
       • Applications sprouting up everywhere
    3. Slow Motion Explosion
    4. Hadoop Explosion
    5. Why Now?
       • But Moore’s law has applied for a long time
       • Why is Hadoop exploding now?
       • Why not 10 years ago?
       • Why not 20?
       Footer date: 8/13/2013
    6. Size Matters, but …
       • If it were just availability of data then existing big companies would adopt big data technology first
    7. Size Matters, but …
       • If it were just availability of data then existing big companies would adopt big data technology first
       They didn’t
    8. Or Maybe Cost
       • If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
    9. Or Maybe Cost
       • If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
       They didn’t
    10. Backwards adoption
       • Under almost any threshold argument startups would not adopt big data technology first
    11. Backwards adoption
       • Under almost any threshold argument startups would not adopt big data technology first
       They did
    12. Everywhere at Once?
       • Something very strange is happening
         – Big data is being applied at many different scales
         – At many value scales
         – By large companies and small
    13. Everywhere at Once?
       • Something very strange is happening
         – Big data is being applied at many different scales
         – At many value scales
         – By large companies and small
       Why?
    14. The Conventional Answer
       • More data is being produced more quickly
       • Data sizes are bigger than even a very large computer can hold
       • Cost to create and store continues to decrease
    15. Analytics Scaling Laws
       • Analytics scaling is all about the 80-20 rule
         – Big gains for little initial effort
         – Rapidly diminishing returns
       • The key to net value is how costs scale
         – Old school: exponential scaling
         – Big data: linear scaling, low constant
       • Cost/performance has changed radically
         – IF you can use many commodity boxes
    16. • We knew that
       • We should have known that
       • We didn’t know that!
       • You’re kidding, people do that?
    17. [Plot: Value (0 to 1) vs. Scale (0 to 2,000). Labels: Anybody with eyes; Intern with a spreadsheet; In-house analytics; Industry-wide data consortium; NSA, non-proliferation]
    18. [Two plots: Value (0 to 1) vs. Scale (0 to 2,000)] Net value optimum has a sharp peak well before maximum effort
    19. But scaling laws are changing both slope and shape
    20. [Plot: Value (0 to 1) vs. Scale (0 to 2,000)] More than just a little
    21. [Plot: Value (0 to 1) vs. Scale (0 to 2,000)] They are changing a LOT!
    22. (no text on slide)
    23. (no text on slide)
    24. [Plot: Value (0 to 1) vs. Scale (0 to 2,000)]
    25. [Plot: Value (0 to 1) vs. Scale (0 to 2,000)]
    26. [Plot: Value (0 to 1) vs. Scale (0 to 2,000)] Initially, linear cost scaling actually makes things worse. A tipping point is reached and things change radically …
    27. Pre-requisites for Tipping
       • To reach the tipping point,
       • Algorithms must scale out horizontally
         – On commodity hardware
         – That can and will fail
       • Data practice must change
         – Denormalized is the new black
         – Flexible data dictionaries are the rule
         – Structured data becomes rare
    28. Yeah… but wait
    29. The Standard Sort of Model
       • People talk about the law of large numbers as if it were …
       • Well, as if it were a law
       • It’s not …
       • It is a context and assumption dependent theorem
    30. What if …
       • These assumptions are:
       • Changes have a
         – stationary,
         – independent,
         – finite variance distribution
       • What happens if these assumptions are wrong?
       • And which of them is really wrong?
    31. For Example [Plot: Stuff vs. Time]
    32. [Plot: Stuff vs. Time] End point has nice tractable distribution
    33. What if the Assumptions are Wrong?
       • Take the finite variance as a simple example
       • This leads to Levy stable distributions
       • Like the Cauchy distribution
    34. Is it Really Different?
    35. [Plot: Stuff vs. Time]
    36. What About Real Life?
    37. (no text on slide)
    38. But is it Really Infinite Variance?
       • Or are there other kinds of phenomena that show this?
       • What about the independence assumption?
       • What if the supposedly independent components of the system communicate?
       • Like we do. Everyday. All the time.
    39. Why the Difference?
       [Diagram labels: Law of large numbers; Infinite variance; Interacting agents; The space of all things that change; The space of interacting things]
       Apologies and credit to Simon DeDeo, SFI
    40. What Happens with Interactions
       • Social phenomena defeat the law of large numbers
       • Distributions are well modeled by “rich get richer” processes
         – Pitman-Yor process, Indian buffet process
       • Limiting distributions are heavy tailed, power law
       • We see these distributions everywhere
         – price of cotton in the 19th century
         – word frequencies
         – popularity of Github projects
         – equity pricing and volumes
         – sizes of cities
         – popularity of web-sites
    41. What are the Implications?
    42. [Plot: Value (0 to 1) vs. Scale (0 to 2,000)]
    43. In a Nutshell
       • Scalability is much more important than we thought
       • Mashups are more important than we thought
       • Network effects are more important than we thought
       • Exploration is more important than we thought
       • Hadoop style linear scaling must be mixed with ad hoc analysis
    44. Thank You
    45. whoami?
       • Ted Dunning
         – @ted_dunning
         – tdunning@maprtech.com (MapR distribution for Hadoop)
         – tdunning@apache.com (Mahout, Hadoop, Lucene, Zookeeper, Drill)
         – ted.dunning@gmail.com (me)
       • More info: http://www.mapr.com/company/events/hadoop-in-finance-2012
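Slides 29 through 35 argue that the law of large numbers depends on its assumptions, and that dropping finite variance (for example with the Cauchy distribution) changes the picture. A minimal sketch of that claim follows, assuming a standard normal as the finite-variance case. The helper names `cauchy` and `typical_mean_error`, the sample sizes, and the seed are hypothetical choices for illustration.

```python
import math
import random
import statistics


def cauchy(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))


def normal(rng):
    """Standard normal draw (finite variance)."""
    return rng.gauss(0.0, 1.0)


def typical_mean_error(draw, n, trials, rng):
    """Median of |sample mean| over repeated trials of n draws each."""
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(trials)]
    return statistics.median(abs(m) for m in means)


rng = random.Random(42)  # fixed seed so repeated runs agree
for n in (10, 100, 1000):
    normal_err = typical_mean_error(normal, n, 200, rng)
    cauchy_err = typical_mean_error(cauchy, n, 200, rng)
    print(f"n={n:5d}  normal: {normal_err:.4f}  Cauchy: {cauchy_err:.4f}")
```

For the normal case the typical error of the sample mean shrinks roughly like 1/sqrt(n). For the Cauchy case the mean of n draws has the same distribution as a single draw, so averaging more samples does not tighten the estimate.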
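Slide 40 names “rich get richer” processes as a model for interacting agents. One way to see the effect is to simulate the seating rule usually used to describe the Pitman-Yor process: a new customer joins an existing table with probability proportional to its size minus a discount, or opens a new table otherwise. The sketch below assumes that seating rule. The function name `pitman_yor_table_sizes` and the parameter values are hypothetical and do not come from the talk.

```python
import random


def pitman_yor_table_sizes(n, alpha=1.0, discount=0.5, seed=0):
    """Seat n customers one at a time; return table sizes, largest first."""
    rng = random.Random(seed)
    sizes = []
    for seated in range(n):
        # Total weight is (seated + alpha): the new-table weight below
        # plus (size - discount) for every existing table.
        u = rng.random() * (seated + alpha)
        new_table_weight = alpha + discount * len(sizes)
        if u < new_table_weight:
            sizes.append(1)
            continue
        u -= new_table_weight
        for k, size in enumerate(sizes):
            u -= size - discount
            if u < 0:
                sizes[k] += 1  # bigger tables are more likely to grow
                break
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sorted(sizes, reverse=True)


sizes = pitman_yor_table_sizes(5000)
print("tables:", len(sizes))
print("five largest:", sizes[:5])
print("tables with one customer:", sum(1 for s in sizes if s == 1))
```

A run of this kind typically shows a few very large tables alongside many tiny ones, which is the heavy-tailed shape the slide refers to. The exact counts depend on the assumed parameters and seed.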