Big Data Sampling


Statistical sampling has established itself in all facets of our lives, from physics to medical research to presidential elections. Yet when it comes to Big Data, we most often favor a brute-force approach and attempt to process the entire data set: it's all or nothing. However, we don't need to count every grain of sand on a beach to conclude that it will be a great holiday destination. When we analyze our business performance, do we compare every digit of last week's 365,514,134 visitors to this week's 366,364,615, or do we want to know that one is 0.2% bigger than the other? Or maybe we can say there is no difference? Properly posing questions to Big Data is the key to reducing the overall cost of data systems and getting information faster, while reserving brute-force crunching for tasks that really have to count every penny and every drop in the ocean. We will present sampling methodologies useful for Hadoop environments, show how to structure data properly for export to non-Hadoop systems, and discuss how to establish the right sampling rate for different tasks, with emphasis on digital marketing and on variable-rate sampling for tracking valuable needles in unimportant haystacks.

Published in: Technology
  • Foundation for new discoveries and inventions. Source of additional revenue. We don’t love the data, we love what it gives us.
  • Time: less time to run a report, more reports in the same time frame. Money: systems cost less, more profit.
  • Statistical packages, reporting, data mining, business intelligence.
  • Data collection, finance, very small populations.

    1. 1. Big Data Sampling: How to make all of your data useful again
        Mikhail Petrenko, Sr. Data Architect, Adobe, mikhail193@gmail.com
    2. 2. Agenda
        • What is sampling?
        • Why don’t we use Big Data sampling more?
        • Why sampling is a good idea
        • When sampling is a bad idea
        • Accuracy of sampled reports
        • Variable rate sampling
    3. 3. Analysis
    4. 4. Summaries
    5. 5. Why we don’t sample
        • Results are not accurate
        • It takes time and effort to implement
        • It is hard to maintain
        • We can perform all the analysis we want, just give us more hardware.
    6. 6. Why do we need Big Data?
    7. 7. The Future!
    8. 8. Your real goals
    9. 9. Biggest benefits of sampling
    10. 10. Legacy Tools
    11. 11. Is sampling always a good idea?
    12. 12. How accurate are we?
        • Profits +/- 30%
        • EPS + 40%
        • Sales forecast +/- 15%-20% is considered pretty accurate.
    13. 13. How big of a sample?
        • 1,000 EPS analysts, 30% accuracy
        • How many do we need to pay to get the same accuracy? Just 18.
    14. 14. How big of a sample?
        • 100,000 site visitors
        • How many do we need to analyze to get a yes/no answer accurate to +/- 1%?
        • At 99% confidence: just 14,267 (1/7)
        • At 95% confidence: 8,763 (1/12)
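The slides do not say how the sample sizes on slide 14 were computed. As a hedged sketch, they are consistent with the standard sample-size formula for a proportion with a finite population correction, assuming the worst case p = 0.5 and z-values of 1.96 (95%) and 2.58 (99%). The function name `sample_size` and those z-values are assumptions made for this illustration.

```python
import math


def sample_size(population, margin, z, p=0.5):
    """Sample size for estimating a yes/no proportion.

    population: size of the full data set (N)
    margin:     acceptable error, e.g. 0.01 for +/- 1%
    z:          z-value for the confidence level (assumed: 1.96 or 2.58)
    p:          expected proportion; 0.5 is the most conservative choice
    """
    # Sample size for an infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction shrinks it for a known N.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size(100_000, 0.01, 1.96))  # 95% confidence
    print(sample_size(100_000, 0.01, 2.58))  # 99% confidence
```

Under these assumptions the two calls give 8,763 and 14,267, matching the slide.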
    15. 15. Sample of the big picture
        • 10,000,000 buyers
        • 10% are your visitors
        • What price to set for SummitSneaker 2013 (€200 +/- €98)?
        • [Chart: buyers marked as excluded or included along a price axis from 0 to 400]
    16. 16. Cluster price
        Hardware    €10,000
        Software    €4,000
        Nodes       30
        Total       €420,000
    17. 17. Results
        [Chart: avg loss, cost and loss of profit, from €0 to €450,000, for sample sizes 1,048,576, 104,858, 10,486 and 1,049]
    18. 18. Adjust for sampling: Bessel’s correction
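Bessel's correction divides the sum of squared deviations by n - 1 instead of n when the variance is estimated from a sample. A minimal sketch follows; the data values are made up for illustration and do not come from the slides.

```python
import statistics


def population_variance(values):
    """Variance when `values` is the whole population (divide by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def sample_variance(values):
    """Variance estimated from a sample (Bessel's correction: divide by n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only
    print(population_variance(data))
    print(sample_variance(data))
    # The standard library applies the same correction in statistics.variance.
    print(statistics.variance(data))
```

The corrected estimate is always larger than the uncorrected one, and the gap shrinks as the sample grows.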
    19. 19. Online marketing
        • 100,000 impressions
        • Buy and sell 3 blocks per day
        • 340 days
        • PPM €1.0 (€1.0 profit per 1,000 impressions)
    20. 20. Cluster price
        Hardware    €10,000
        Software    €4,000
        Nodes       5
        Total       €70,000
    21. 21. Results
        [Chart: avg loss, cost and loss of profit, from €0 to €80,000, for sample sizes 107,394, 10,739, 1,074 and 107]
    22. 22. What makes a good sampling algorithm?
        • Uniform
        • Unbiased
        • Consistent
        • Can be repeatable or non-repeatable
        • In Big Data we mostly use systematic sampling
    23. 23. How
        • Unique ID: modulo (remainder of a division), hash
        • Time: every N-th minute, every X-th visitor
        • Location: use only 1 server out of 6
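The two unique-ID approaches from slide 23 can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and MD5 is used only as a convenient, evenly spreading hash, not because the slides name it.

```python
import hashlib


def in_sample_modulo(user_id: int, rate: int) -> bool:
    """Keep 1 out of `rate` users by taking the remainder of the numeric ID."""
    return user_id % rate == 0


def in_sample_hash(key: str, rate: int) -> bool:
    """Keep roughly 1 out of `rate` keys by hashing first.

    Hashing helps when IDs are not numeric or are assigned by some rule
    that would make a plain modulo biased.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0


if __name__ == "__main__":
    ids = range(1, 31)
    print([i for i in ids if in_sample_modulo(i, 3)])
    keys = [f"visitor-{i}" for i in range(20_000)]
    kept = sum(in_sample_hash(k, 10) for k in keys)
    print(kept, "of", len(keys), "keys kept at a 1/10 rate")
```

Both functions are repeatable: the same ID always gets the same answer, so the same users are sampled on every run.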
    24. 24. Hadoop/Hive buckets
        [Diagram: the Visitors table is split by date (18-Mar-2013, 19-Mar-2013, 20-Mar-2013), and each date is split into sample buckets 1, 2 and 3]
    25. 25. Beware of buckets
        CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
        COMMENT 'A bucketed copy of user_info'
        PARTITIONED BY(ds STRING)
        CLUSTERED BY(user_id) INTO 256 BUCKETS;
        • Clustering depends on data type
        • Clustering of INT is different from BIGINT
        • Strings are even more complicated
        • Preserve the ability of all systems to sample
        • Use INT or make it an INT
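Slide 25 warns that bucket assignment depends on the column type. The sketch below shows why, assuming the engine derives buckets from Java-style hash codes (the scheme Java uses for Integer, Long and String) and takes the non-negative hash modulo the bucket count. That is an assumption for illustration, not something the slides state, and the exact hashing varies between systems and versions. All function names here are hypothetical.

```python
def _to_int32(x: int) -> int:
    """Wrap a Python int to a signed 32-bit value, as a Java int would."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def java_int_hash(value: int) -> int:
    """Java-style Integer hash: the (32-bit) value itself."""
    return _to_int32(value)


def java_long_hash(value: int) -> int:
    """Java-style Long hash: upper and lower 32 bits XOR-ed together."""
    v64 = value & 0xFFFFFFFFFFFFFFFF
    return _to_int32(v64 ^ (v64 >> 32))


def java_string_hash(text: str) -> int:
    """Java-style String hash: h = 31 * h + char, wrapped to 32 bits."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return _to_int32(h)


def bucket(hash_code: int, num_buckets: int) -> int:
    """Assumed bucket rule: non-negative hash modulo the bucket count."""
    return (hash_code & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    big_id = 2 ** 32 + 1000  # does not fit in a 32-bit INT
    print(bucket(java_int_hash(1000), 256), bucket(java_long_hash(1000), 256))
    print(bucket(java_int_hash(big_id), 256), bucket(java_long_hash(big_id), 256))
    print(bucket(java_string_hash("1000"), 256))
```

Under these assumptions small IDs land in the same bucket as INT or BIGINT, large IDs do not, and the string "1000" lands somewhere else entirely. A system that only sees the ID as a string would therefore draw a different sample.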
    26. 26. Repeatable (UserID % 3) vs. non-repeatable (1st visitor of 3)
        [Diagram: Y/N grids showing which visitors fall into the sample yesterday and today under each method]
    27. 27. Don’t forget the weights
        • We estimate the whole by adding weights to the sample
        • If you sampled 1/10 of the whole data set, multiply the appropriate metrics by 10
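The difference between the two columns on slide 26 can be sketched as follows. The visitor IDs and arrival orders are made up for illustration, and the function names are hypothetical.

```python
def repeatable_sample(visitor_ids, rate=3):
    """UserID % rate: membership depends only on the ID itself."""
    return {v for v in visitor_ids if v % rate == 0}


def every_nth_sample(visitor_ids, rate=3):
    """1st visitor of every `rate`: membership depends on arrival order."""
    return {v for pos, v in enumerate(visitor_ids) if pos % rate == 0}


if __name__ == "__main__":
    # Same nine visitors, arriving in a different order each day.
    yesterday = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    today = [9, 3, 5, 1, 6, 2, 8, 4, 7]

    print(repeatable_sample(yesterday), repeatable_sample(today))
    print(every_nth_sample(yesterday), every_nth_sample(today))
```

The ID-based sample follows the same visitors from day to day, which suits per-visitor comparisons over time. The order-based sample picks different visitors each day while keeping the same sampling rate.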
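A minimal sketch of the weighting rule on slide 27: sample 1 in 10 visitors, then multiply the sampled total by 10 to estimate the whole. The visitor data is synthetic and only there to make the example runnable.

```python
SAMPLING_RATE = 10  # keep 1 visitor out of 10


def purchases(visitor_id: int) -> int:
    """Synthetic metric for illustration: purchases made by a visitor."""
    return visitor_id % 7


def true_total(visitor_ids):
    """Brute force: process every visitor."""
    return sum(purchases(v) for v in visitor_ids)


def estimated_total(visitor_ids, rate=SAMPLING_RATE):
    """Process only the sample, then weight the result back up by `rate`."""
    sampled = [v for v in visitor_ids if v % rate == 0]
    return rate * sum(purchases(v) for v in sampled)


if __name__ == "__main__":
    visitors = range(1000)
    exact = true_total(visitors)
    estimate = estimated_total(visitors)
    print(exact, estimate, f"{abs(estimate - exact) / exact:.1%} off")
```

Only additive metrics such as counts and totals are multiplied by the weight. Ratios and averages computed within the sample are used as they are.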
    28. 28. What can go wrong
        • Unique ID: IDs assigned by some rule
        • Time: grab the 1st hour of the day and midnight traffic won’t match day traffic; Monday won’t match Sunday; different servers may have different schedules
        • Location: servers allocated based on region or storefront
    29. 29. Variable rate sampling: sometimes we want to be biased
    30. 30. Why Variable Rate?
    31. 31. Flat sample
        [Diagram: every sampled visitor carries weight x3]
    32. 32. Guarantee inclusion of VIPs
        [Diagram: sampled visitors carry weight x3, the included VIP carries weight x1]
    33. 33. Careful: include VIPs only once
        [Diagram: sampled visitors carry weight x3, the VIP appears once with weight x1]
    34. 34. Watch out for weights: variable rate introduces additional skew
    35. 35. Weight correction when needed
    36. 36. Stratified weights
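Slides 31 to 36 can be summarized in a short sketch: regular visitors are sampled at 1 in 3 and carry weight 3, while VIPs are always included, exactly once, with weight 1. The population, the VIP set and the function name below are assumptions for illustration.

```python
def variable_rate_sample(user_ids, vip_ids, rate=3):
    """Return {user_id: weight} for the sampled users.

    VIPs are always kept with weight 1; everyone else is kept at
    1 in `rate` with weight `rate`. Using a dict keyed by user ID
    guarantees a VIP is never counted twice.
    """
    weights = {}
    for uid in user_ids:
        if uid in vip_ids:
            weights[uid] = 1
        elif uid % rate == 0:
            weights[uid] = rate
    return weights


if __name__ == "__main__":
    users = range(300)
    vips = set(range(30))  # illustrative: first 30 users are VIPs

    sample = variable_rate_sample(users, vips)
    stratified_estimate = sum(sample.values())
    naive_estimate = len(sample) * 3  # wrongly treats VIPs as weight 3

    print(len(sample), "users sampled")
    print("stratified estimate of population:", stratified_estimate)
    print("naive estimate of population:", naive_estimate)
```

With per-stratum weights the estimate recovers the true population of 300. Applying the flat x3 weight to everyone overstates it, which is the skew slide 34 warns about.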
    37. 37. Questions?
    38. 38. Shoes data
        Sample                                        Take              Avg loss      Cost        Loss of profit
        All market                                    $ 1,325,994,929
        All data (sample of 1/10 of market)           $ 1,325,993,312   $ 1,616       $ 420,000   421,616.08
        Sample 1/100 of market or 1/10 of all data    $ 1,325,989,167   $ 5,762       $ 42,000    47,761.83
        Sample 1/1000 of market                       $ 1,325,965,877   $ 29,052      $ 4,200     33,251.85
        Sample 1/10,000 of market                     $ 1,325,576,009   $ 418,920     $ 420       419,339.65
        Sample 1/100,000 of market                    $ 1,321,523,057   $ 4,471,872   $ 42        4,471,913.92
    39. 39. Marketing data
        Sample                         Take        Avg loss   Cost       Loss of profit
        All data                       € 109,969   € 0        € 70,000   € 70,000
        Sample 1/10 of population      € 108,358   € 1,611    € 7,000    € 8,611
        Sample 1/100 of population     € 104,610   € 5,359    € 700      € 6,059
        Sample 1/1000 of population    € 92,981    € 16,989   € 70       € 17,059
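In the marketing table on slide 39, the loss-of-profit column equals average loss plus system cost. A small sketch, using the figures from that slide, finds the sampling rate with the lowest combined loss. The variable and function names are hypothetical.

```python
# (label, average loss in EUR, system cost in EUR), taken from slide 39
MARKETING_ROWS = [
    ("All data", 0, 70_000),
    ("Sample 1/10 of population", 1_611, 7_000),
    ("Sample 1/100 of population", 5_359, 700),
    ("Sample 1/1000 of population", 16_989, 70),
]


def loss_of_profit(avg_loss: int, cost: int) -> int:
    """Combined loss: accuracy lost to sampling plus what the system costs."""
    return avg_loss + cost


def cheapest_option(rows):
    """Return the (label, combined loss) pair with the smallest combined loss."""
    totals = [(label, loss_of_profit(loss, cost)) for label, loss, cost in rows]
    return min(totals, key=lambda pair: pair[1])


if __name__ == "__main__":
    for label, loss, cost in MARKETING_ROWS:
        print(f"{label:30s} {loss_of_profit(loss, cost):>8,d}")
    print("Lowest combined loss:", cheapest_option(MARKETING_ROWS))
```

For this data the 1/100 sample comes out ahead: a smaller sample loses more accuracy than it saves in hardware, and a larger one costs more than the accuracy is worth.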
    40. 40. Shoes €200 +/- €98, 1 million buyers
        [Chart: avg loss, cost and loss of profit, from 0 to 500,000, for sample sizes all, 104,858, 10,486, 1,049 and 105]
    41. 41. Shoes €200 +/- €20, 1 million buyers
        [Chart: avg loss, system cost and loss of profit, from 0 to 450,000, for sample sizes all, 104,858, 10,486, 1,049 and 105]
