Good Enough Analytics

Presented @ Bigdata Singapore Meetup. Good Enough Analytics is a methodology I am working on to achieve decent analytical results at a reasonable cost. Warning: for the consumption of data nerds only. For 99% of normal humans, these slides are snooze-inducing =P.

Slide notes:
  • Hello, I am here to share a little of what I have learnt so far about big data analytics. I do not claim to be a data scientist, a statistician, or an expert in big data technologies. Which is good, because many experts become too focused on their particular domain, be it data modeling, stats, or big data tools. Instead, I hope to get everyone to think of big data analytics as a process with numerous components where data scientists, statisticians and big data technologists can work together to achieve the goal of good enough analytics.
  • Imagine you are a lemonade stand owner. What is good enough analytics for you? Perhaps some simple analytics in Excel or an open-source program to determine the optimum price, weather conditions and location for selling the lemonade? Or perhaps you just set a price and call it a day?
  • What is good enough analytics if you are working in the fraud detection section of a bank? You will probably need to run some sophisticated fraud pattern detection algorithms, buy some large, expensive analytical tools and hardware, and hire a good team of data scientists. So inherently, we do roughly know what good enough analytics is for different scenarios. What I am trying to do here is give good enough analytics a clearer structure for us to think about analytical problems.
  • To begin, I would like to talk about analytical tools.
  • Analytical tools are like spoons. There are established, standard spoons that we use for everyday purposes.
  • There are also niche, special-purpose spoons to bring out the best flavor of the food.
  • Sometimes you need a big spoon.
  • Other times you need a baby spoon. The point is, just as you would never swear by a spoon, you should not be too caught up with picking the best analytical tool. It is important for big data people to be open-minded: just like spoons, every analytical tool has its own purpose and usage according to the business scenario. Throughout this presentation, I will try to be as platform agnostic as possible, presenting general ideas that will work no matter whether you are using SAS, Greenplum, IBM, Oracle, Revolution R, etc.
  • Here we have a usefulness vs cost graph of analytical tools. The graph is shaped this way due to the law of diminishing returns. Initially, you see great returns on investment when you start purchasing new analytical tools. As you progress further up the curve, however, you will need more and more expensive tools and experienced data scientists to uncover the less obvious trends/traits in your data. Axes: usefulness in the form of helping the business make decisions; cost in the form of hardware/software/manpower.
  • And like all good graphs, there is always a point of stupidity, where you face rapidly diminishing usefulness for the price you pay.
  • For the lemonade stand business, the point of stupidity is when you decide to set up a Hadoop cluster to counter the other lemonade stalls around you instead of just selling something else or changing location.
  • For the bank, it will be less obvious. Will you buy a supercomputer? Or a third Oracle rack? It all depends. Sometimes it is worth it to push the limits of analytics; other times it is not. It depends on the situation, the company and experience. Nevertheless, it is important to know that the point of stupidity exists and to always ask yourself: “am I approaching the point of stupidity, or do I just need the extra edge for the breakthrough?”
  • Good enough analytics is simply analytics that lies before the point of stupidity. It is the cost-efficient solution. As we saw from the previous examples, the point of stupidity for the lemonade stand and the bank is very different, and hence so is their line of good enough analytics. And what is stupid today may not be stupid tomorrow. Things change, more budget comes in, new challengers enter the market, and now we really need to move the point of stupidity further up the curve. That’s normal.
  • So here we have the first definition of Good Enough Analytics
  • Moving beyond tools, I would now like to talk about models.
  • The “perfect model” is the kind of model overzealous A+ students come up with while in school. Real-world big data, on the other hand, is too complex an animal for anyone to come up with a perfect model for.
  • You cannot build a perfect model because there are things we simply do not know we don’t know. We will never have perfect data nor a perfect understanding of the data. Knowing that is kind of liberating: the goal is no longer the impossible one of seeking perfection; instead, it becomes constant improvement towards a good enough answer. We should fear perfection, because if perfection were attained, data scientists would be out of a job.
  • This is a graphical example. Each model has its own decision boundary and errors. When we combine them into an ensemble, their extremities average out and we obtain a much better result than any individual model. (The final ensemble in the image shows a perfect result, but in the real world we probably won’t be so lucky.)
  • In this case, this complex 4-shaped data cannot be easily represented/detected by any single model.
  • An ensemble of multiple models, each forming a piece of the “4”, might come close to representing the actual data.
  • Beyond all the theorycraft and babies, I would like to show you some actual code and statistical theories. I will be using R code here, but really, the concepts presented can easily be done in Python, SAS, or whatever language you like. Again, I am no statistician nor the world’s number one data scientist, so I will try my best to give a good enough explanation of the concepts.
  • Bagging simply means we repeat the model multiple times, each time randomly sampling (with replacement) a portion of the data, to prevent overfitting and get closer to the “truth”. One of the fastest among simple ensemble models, it is also the least accurate unless the data is linear. In the code, what we are trying to do is obtain the mean of 1000 linear models, each built on a random sample of 90% of the data.
  • GBM has great accuracy but is harder to tweak and understand, and is slower (I have not found a way to run it concurrently/multi-node). It is a stronger version of bagging: at each step of resampling, instead of always picking 90% of the data randomly, it smartly selects the subset of data with the most information gain. In essence, each iteration of boosting creates three weak classifiers: the first classifier C1 is trained with a random subset of the available training data. The training data subset for the second classifier C2 is chosen as the most informative subset given C1; specifically, C2 is trained on training data only half of which is correctly classified by C1, while the other half is misclassified. The third classifier C3 is trained with instances on which C1 and C2 disagree. The three classifiers are combined through a three-way majority vote. Key parameters: interaction.depth is the maximum depth of variable interactions (1 implies an additive model, 2 implies a model with up to 2-way interactions, etc.); n.minobsinnode is the minimum number of observations in the trees’ terminal nodes (the actual number of observations, not the total weight); shrinkage, also known as the learning rate or step-size reduction, modifies the update rule and regularizes the model; smaller shrinkage improves accuracy but more iterations are needed. Refer to this link for more details: http://www.scholarpedia.org/article/Ensemble_learning#Bagging
  • A random forest is simply a collection of decision trees. In the code, each decision tree is based on a random sample of 1/3 of the columns, and the trees are combined using a voting system (the mode) across the 999 trees. Random forests have a good balance of speed and accuracy.
  • There is still a lot to learn, but I got to the top 1% of Kaggle using this. The ensemble-of-ensembles method is pretty good. The toughest challenge is the third method: you will need to run multiple cross-validations and fit a regression model on top of them to determine the optimum ratio of the models that minimizes the mean squared error. It is quite some work.
  • One of the key points to note here is the need for diversity among the constituent models to get better results. The idea is to look at the problem from different perspectives and methodologies.
  • An example from the Kaggle competition I participated in. While the data itself is not strictly big data per se, the challenge lies in the fairly large number of attributes/columns.
  • The columns shown are not from the actual data set; I added some labeling to better illustrate my point.
  • The main problem is that these useless columns are held in RAM/HDD and waste compute power as models run through them and ignore them. The nearZeroVar function in R removes columns with zero variance (like the column in yellow) or near-zero variance. The near-zero-variance part needs some explanation. For example, if out of 1 million drugs, one drug happens to be manufactured by company ABC and that drug happens to be a rogue drug (it poisons the patient instead of curing him), can we generalize that all drugs produced by company ABC are rogue drugs? Of course not. By tweaking freqCut and/or uniqueCut, we can set the cutoff point for such outliers and remove them to increase the accuracy of the model. Although… there are times when we want to keep these outliers, for example to train fraud detection models.
  • A blue pill or red pill does not determine the potency of the drug (this is not The Matrix). The idea here is simple: with just one line of code in R, you can select only the important variables and shave off the less important columns. Doing so correctly should slightly increase the accuracy and greatly reduce the time taken to run the analysis. Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: for each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The differences between the two are then averaged over all trees and normalized by the standard deviation of the differences.
  • Next, I want to compress the data, reducing the number of columns without removing the attributes completely. Most of the time, the top 10% of principal components account for 90% of the variance in the data, so you can compress 90% of the data and still keep the data model somewhat intact. Although it is only one line of code in R, PCA is a very powerful way to greatly decrease analysis time and to visualize complex, high-dimensional data. I will explain how it works in the following slides. tol: a value indicating the magnitude below which components should be omitted (components are omitted if their standard deviations are less than or equal to tol times the standard deviation of the first component). With the default NULL setting, no components are omitted; other settings could be tol = 0 or tol = sqrt(.Machine$double.eps), which would omit essentially constant components.
  • Andrew Ng: always try the analysis without PCA first. The reason is that by reducing attributes based on 0.90, 0.95 or 0.99 of the variance, we are still losing accuracy, however small. It does speed up processing considerably, and we can thus apply stronger models to compensate for the loss in accuracy. Also, as PCA merges attributes together, it is a lot harder to use/understand compared to importance(). If I ask you to represent the data with a single line, how will you do it? Probably you will just laugh and draw a best-fit line across the points (the red line).
  • The red line (best-fit line) is the principal component that represents the two attributes. We can now remove the attributes: we have effectively compressed two attributes into one principal component.
  • Now we have slightly more complex data. Again, to best represent the data, you will probably draw a best-fit line across it (the red line). The blue lines are the shortest distances from the points to the best-fit line.
  • We now project the points via the blue lines onto the red line, and here we can see the principal component that represents the two attributes. Again, we have effectively compressed two attributes into one principal component.
  • I will not embarrass myself in front of the cloud experts who presented before me; they have done a great job explaining why to do analytics on the cloud and how to do it. In essence, unless you need real-time analytics, good enough analytics on an hourly/daily basis is much more manageable and fits the on-demand nature of the cloud. We just need to pay for the time we need to run the analysis.
  • Here we will talk about preparing data for the cloud. We cannot just ship our entire database onto the cloud and pray that we don’t get sued by our clients.
  • Here are some ways we can massage the data. The main idea is to remove any features that can uniquely identify an individual. Of course, the data will also need to be encrypted, both at rest in the cloud database and in flight during data transfer.
  • Amazon spot instances are very popular among Kaggle players: cheap and powerful. *GiB is gibibyte, the actual usable amount; 60.5 GiB ~ 64 GB of RAM.
  • There are limitations to spot instances, and here are some common solutions people use.
  • Sample workable code for KNN built on PCA.
  • Revolution R data chunking is powerful and allows “small big data” analytics on laptops.
  • Full multicore code in R
  • No time to talk about visualization, but it is important: at the start of the analysis, to help us understand the data; throughout the analysis, to help us understand our models; and at the end, when presenting the data/findings.
  • Some useful references for the various topics covered
  • I have a sixth sense that good enough analytics will be a great fit in Asia. The reason is that most analytics tools/methodologies are developed in the West, mainly the USA, by big companies for big companies. I think there is a lot of market out there for applying good enough analytics on big data to the smaller, leaner Asian companies. Things in Asia work differently, and analytics here is, frankly, still in its infancy. Perhaps I will explore this more in future presentations.
  • Good Enough Analytics

    1. Good Enough Analytics, by Kai Xin
    2. The Good Enough Stuff: Analytical Tools
    3. Analytical Tools are like spoons
    4. Analytical Tools are like spoons
    5. [Graph: usefulness vs. cost of analytical tools]
    6. [Graph: usefulness vs. cost, with the point of stupidity marked]
    7. [Graph: usefulness vs. cost, with the point of stupidity marked]
    8. [Graph: usefulness vs. cost, with two different points of stupidity marked]
    9. Point of stupidity: what is stupid today might not be stupid tomorrow
    10. Good Enough Analytics: big data analytics using cost-efficient tools
    12. Point of stupidity: the perfect model. [The slide is filled with a wall of random digits.] A “perfect” model is too complex, too costly to build, too hard to maintain and not flexible to change.
    13. Why the perfect model is stupid: “There are known knowns; there are things we know that we know. There are known unknowns; there are things that we now know we don't know. But there are also unknown unknowns; there are things we do not know we don't know.” By Donald Rumsfeld, United States Secretary of Defense and Potential Data Scientist
    14. Good Enough Analytics: Ensembles. “In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models” [blocks of the slide-12 digits joined with + signs]
    15. [Diagram from scholarpedia.org; refer to References]
    16. [Diagram from scholarpedia.org; refer to References]
    17. [Diagram from scholarpedia.org; refer to References]
    18. The Serious Stuff: …beyond theorycraft
    19. Simple Ensembles – GLM: Bootstrap aggregating (bagging)
        predictions <- foreach(1:1000, .combine=cbind) %dopar% {
          training_positions <- sample(nrow(train), size=floor((nrow(train)*0.9)), replace=TRUE)
          train_pos <- 1:nrow(train) %in% training_positions
          glmMod <- rxLinMod(eqn, train[train_pos,])
          rxPredict(glmMod, test, type="response")
        }
        result <- rowMeans(predictions)
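        For readers without Revolution R, a minimal base-R sketch of the same bagging idea, assuming the same train, test and formula eqn as above (lm() stands in for rxLinMod):
        set.seed(42)
        predictions <- sapply(1:1000, function(i) {
          # each model sees a 90% bootstrap sample of the rows
          rows <- sample(nrow(train), size = floor(nrow(train) * 0.9), replace = TRUE)
          predict(lm(eqn, data = train[rows, ]), newdata = test)
        })
        result <- rowMeans(predictions)   # bagged prediction = mean of the 1000 models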
    20. Simple Ensembles – Gradient Boosting Machines
        gbmMod <- gbm(eqn, train, n.trees=10000, shrinkage=0.002, distribution="gaussian",
                      interaction.depth=7, bag.fraction=0.9, n.minobsinnode=50)
        Similar to bagging, boosting also creates an ensemble of classifiers by resampling the data, which are then combined by majority voting. However, in boosting, resampling is strategically geared to provide the most informative training data for each consecutive classifier.
    21. Simple Ensembles – Random Forest
        rf <- foreach(ntree=rep(333,3), .combine=combine, .packages='randomForest') %dopar%
          randomForest(train[,3:length(train)], train$Act, ntree=ntree, do.trace=1000,
                       mtry=round(colNumber/3), replace=FALSE, nodesize=5, na.action=na.omit)
    22. Ensemble of Ensembles
        1. Mean(RF + GBM + BagGLM)
        2. Median(RF + GBM + BagGLM)
        3. 0.4*RF + 0.4*GBM + 0.2*BagGLM
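        A hedged sketch of how the weights in option 3 could be learned, as the notes describe: regress the truth on cross-validated predictions to minimize mean squared error. Here oof is a hypothetical data frame of out-of-fold predictions (columns rf, gbm, bagglm) plus the true target y, and rf_test, gbm_test, bagglm_test are the corresponding test-set predictions:
        blend <- lm(y ~ rf + gbm + bagglm, data = oof)   # fitted coefficients = blending weights
        coef(blend)
        final <- predict(blend, newdata = data.frame(rf = rf_test, gbm = gbm_test, bagglm = bagglm_test))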
    23. Ensembles – Why it matters
        Improve accuracy: ensembles tend to yield better results than their constituent models when there is significant diversity among the models; developing multiple simple models is faster than attempting to develop the perfect model.
        More resistance to overfitting: less reliant on any single model.
        Concurrent development: different models can be run and developed on different instances/machines by different data scientists.
    24. Ensembles – point of stupidity
        Netflix Prize $1 million winner: an ensemble of 107 models for a 10% improvement, too complicated, costly and inflexible to change. Actual deployment: an ensemble of 2 models for an 8.43% improvement. Moral of the story: a good enough ensemble is good enough.
    25. Good Enough Analytics: big data analytics using cost-efficient tools and a good enough ensemble of models
    26. The Good Enough Stuff: Data Optimization
    27. Data cleaning vs data optimization: data cleaning is important, but I assume you know it; data optimization is done AFTER data cleaning.
    28. Kaggle Medical Drug Competition: 15 sets of data; each data set has 1,000 to 2,000 attributes and 500 to 20,000 rows. Qn: identify rogue drugs.
    29. Point of stupidity: trying to run analysis on all attributes
        Drug  Rogue %  Company  Color  Component 1  Component 2…2000
        A     0.0400   XYZ      Red    200          30
        B     0.0002   XYZ      Green  920          50
        C     0.8000   XYZ      Blue   30           1000
        D     ?        XYZ      Red    340          800
    30. Not all attributes are born equal: no variance (Company), irrelevant (Color), too many attributes (Component 1…2000). [Same table as slide 29.]
    31. Remove no-variance / near-zero-variance attributes
        Drug  Rogue %  Company
        A     0.0400   XYZ
        B     0.0002   XYZ      <- the Company attribute does not help in differentiating between the drugs
        C     0.8000   XYZ
        D     ?        XYZ
        R code:
        library(caret)
        healthdata[nearZeroVar(healthdata, freqCut = 95/5, uniqueCut = 10)] <- list(NULL)
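        To see why a column was flagged before dropping it, and to experiment with the freqCut/uniqueCut cutoffs mentioned in the notes, a small sketch on the same hypothetical healthdata data frame:
        library(caret)
        nzv <- nearZeroVar(healthdata, freqCut = 95/5, uniqueCut = 10, saveMetrics = TRUE)
        head(nzv)                              # freqRatio, percentUnique, zeroVar and nzv flags per column
        drop_cols <- rownames(nzv)[nzv$nzv]    # columns flagged as (near-)zero variance
        healthdata <- healthdata[, !(names(healthdata) %in% drop_cols)]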
    32. Remove unimportant attributes
        Drug  Rogue %  Color
        A     0.0400   Red
        B     0.0002   Green    <- the Color attribute has no relevance to the % rogue drug
        C     0.8000   Blue
        D     ?        Red
        R code for Random Forest: importanceScore <- importance(myMod)
        R code for GBM: importanceScore <- summary.gbm(myMod, ntree)
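        A minimal sketch of the permutation-based importance measure described in the notes (randomForest package; train and the target Act reuse the names from slide 21):
        library(randomForest)
        myMod <- randomForest(Act ~ ., data = train, ntree = 500, importance = TRUE)
        imp <- importance(myMod, type = 1)     # type = 1: permuted-OOB importance measure
        head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)   # ten strongest attributes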
    33. Attribute reduction using Principal Component Analysis
        Drug  Rogue %  Component 1  Component 2…2000
        A     0.0400   200          30
        B     0.0002   920          50                <- too many attributes take a very long time to analyse
        C     0.8000   30           1000
        D     ?        340          800
        R code:
        pc <- prcomp(train[, 2:length(train)], tol=0.12)
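        Rather than guessing a tol value, one can check how much variance the leading components carry. A sketch assuming the same train layout (target in column 1, numeric attributes after it, zero-variance columns already removed):
        pc <- prcomp(train[, 2:length(train)], center = TRUE, scale. = TRUE)
        var_explained <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
        n_keep <- which(var_explained >= 0.90)[1]          # components needed for ~90% of the variance
        train_small <- data.frame(label = train[, 1], pc$x[, 1:n_keep])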
    34. Attribute reduction using Principal Component Analysis. Andrew Ng: always try analysis without PCA first. [Scatter plot: Attribute 1 vs Attribute 2] (Andrew Ng: Machine Learning Course; refer to References)
    35. Attribute reduction using Principal Component Analysis. Andrew Ng: always try analysis without PCA first. [The same points with the best-fit line marked as the Principal Component] (Andrew Ng: Machine Learning Course; refer to References)
    36. Attribute reduction using Principal Component Analysis. [A more scattered set of points: Attribute 1 vs Attribute 2] (Andrew Ng: Machine Learning Course; refer to References)
    37. Attribute reduction using Principal Component Analysis. The 1D red line and projected points are now representative of the 2D graph. [Principal Component with the projected points marked] (Andrew Ng: Machine Learning Course; refer to References)
    38. Data Optimization – Why it matters
        Performance improvement (importance, nearZeroVar): cut down attributes which are useless or not “good enough”; more accurate and complex models can be built on the attributes that matter.
        Cost savings (PCA): less data needs to be processed, faster turnaround for models and results.
    39. Good Enough Analytics: big data analytics using cost-efficient tools and a good enough ensemble of models based on optimized data
    40. The Good Enough Stuff: Scaling on cloud
    41. Why use the cloud: how often do you really need a multimillion-dollar machine on standby 24/7 to churn data? Do you really need real-time analytics, or is an hourly/daily/weekly/monthly report good enough?
    42. Cloud – Why it matters
        Excellent bang for the buck: <$5/hr to rent a million dollars’ worth of computing power; no need to purchase/maintain hardware; scale on demand.
        Great for ensemble modeling: you can start multiple instances, each running one simple model, and ensemble them.
        But beware of data security and privacy laws: the cloud is not suitable for all kinds of data/applications. For example, Amazon Web Services is HIPAA compliant but Rackspace is not.
    43. Prepare data for the cloud
        Name   Age  Income  Postal
        Peter  23   $2,000  400573
        Sally  11   $0      520028
        Paul   70   $500    521201
        Mark   30   $8,000  247392
    44. Prepare data for the cloud: remove identity, use general categories, use range categories, masking, rollup (Reference: Dr. Yap Ghim Eng, A*Star)
        Name   Age  Age Group  Income  Income Range    Postal   Postal Area
        Peter  23   Youth      $2,000  $1,000-$3,000   400***   Eunos
        Sally  11   Child      $0      $0              520***   Simei
        Paul   70   Senior     $500    $1-$1,000       521***   Tampines
        Mark   30   Adult      $8,000  >$5,000         247***   Tanglin
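        A hedged R sketch of the masking/rollup ideas in this table, assuming a data frame people with Name, Age, Income and Postal columns (the cut points are illustrative, not taken from the deck):
        people$Name <- NULL                                                   # remove identity
        people$AgeGroup <- cut(people$Age, breaks = c(0, 12, 25, 60, Inf),    # general categories
                               labels = c("Child", "Youth", "Adult", "Senior"))
        people$IncomeRange <- cut(people$Income,                              # range categories
                                  breaks = c(-Inf, 0, 1000, 3000, 5000, Inf),
                                  labels = c("$0", "$1-$1,000", "$1,001-$3,000", "$3,001-$5,000", ">$5,000"))
        people$PostalArea <- paste0(substr(as.character(people$Postal), 1, 3), "***")   # masking
        people$Age <- people$Income <- people$Postal <- NULL                  # drop the raw values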
    45. Good Enough Analytics: big data analytics using cost-efficient tools and a good enough ensemble of models based on optimized data, scaled on cloud services
    46. The Good Enough Stuff (…that we have no time for): Amazon Web Services
    47. Basic code to set up an Amazon instance for analytics
        sudo yum install gcc gcc-c++ gcc-gfortran readline-devel python-devel make atlas blas
        sudo yum install -y lapack-devel blas-devel
        wget http://cran.at.r-project.org/src/base/R-2/R-2.15.2.tar.gz
        tar -xf R-2.15.2.tar.gz
        cd R-2.15.2
        ./configure --with-x=no
        sudo make
        PATH=$PATH:~/R-2.15.2/bin/
        cd ..
        wget http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2.tar.gz/download
        tar -xzf numpy-1.6.2.tar.gz
        cd numpy-1.6.2
        sudo python setup.py install
        cd ..
        wget http://sourceforge.net/projects/scipy/files/scipy/0.11.0/scipy-0.11.0.tar.gz/download
        tar -xzf scipy-0.11.0.tar.gz
        cd scipy-0.11.0
        sudo python setup.py install
        cd ..
        wget http://pypi.python.org/packages/source/n/nose/nose-1.1.2.tar.gz#md5=144f237b615e23f21f6a50b2183aa817
        tar -xzf nose-1.1.2.tar.gz
        cd nose-1.1.2
        sudo python setup.py install
        After sudo-ing and running R, type:
        install.packages('gbm')
        install.packages('randomForest')
        To leave R or Python jobs running while you are not logged on: "nohup R CMD BATCH myfile.r &"
    48. Amazon EC2 Spot Instances
        Cluster Compute Eight Extra Large: 60.5 GiB memory, 88 EC2 Compute Units, 3370 GB of local instance storage, 64-bit platform, 10 Gigabit Ethernet, $0.27 per hour
        High-Memory Quadruple Extra Large: 68.4 GiB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform, $0.14 per hour
    49. Weakness of Spot Instances: bidding system; if your bid < spot instance price, the instance will be terminated.
        Solutions: 1) put the master on a normal cloud instance and the slaves on spot instances; 2) heartbeat + queue with checkpointing (see the sketch below)
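        A hedged sketch of the checkpoint idea in solution 2: persist partial ensemble results so a terminated spot instance can resume instead of restarting. run_one_model() is a placeholder for fitting one constituent model:
        checkpoint_file <- "ensemble_checkpoint.rds"
        done <- if (file.exists(checkpoint_file)) readRDS(checkpoint_file) else list()
        for (i in seq_len(1000)) {
          if (i <= length(done)) next                         # already computed before the interruption
          done[[i]] <- run_one_model(i)                       # e.g. one bagged GLM iteration
          if (i %% 50 == 0) saveRDS(done, checkpoint_file)    # checkpoint every 50 models
        }
        saveRDS(done, checkpoint_file)
        result <- rowMeans(do.call(cbind, done))              # combine the constituent predictions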
    50. The Good Enough Stuff (…that we have no time for): PCA with KNN
    51. Principal Component Analysis with K-Nearest Neighbor
        library(FNN)
        train <- read.csv("train.csv", header=TRUE)
        test <- read.csv("test.csv", header=TRUE)
        pc <- prcomp(train[, 2:length(train)], tol=0.12)
        mydata <- data.frame(label = train[, "label"], pc$x)
        labels <- mydata[,1]
        mydata2 <- mydata[,-1]
        test.p <- predict(pc, newdata = test)
        results <- (0:9)[knn(mydata2, data.frame(test.p), labels, k = 1, algorithm="cover_tree")]
        write(results, file="knn_PCA.csv", ncolumns=1)
    52. The Good Enough Stuff (…that we have no time for): Data Chunking
    53. Data Chunking – Revolution R: uses a format called XDF, loosely based on NoSQL. The XDF format is a binary file format that stores data in blocks and processes data in chunks (groups of blocks) for efficient reading of arbitrary columns and contiguous rows. For more details, visit the Revolution R website.
    54. Data Chunking – Why it matters
        # Chunk 6.5GB worth of data onto HDD in XDF
        rxImport(inData = trainFile, outFile = "trainingData.xdf")
        # Revolution R provides methods like rxGlm to run a huge Poisson regression directly on the XDF file
        myPos <- rxGlm(amount2 ~ Mailed+Donated+RR, data="trainingData", family=poisson())
        *This cannot be done using normal R on my laptop, as R tries to load the entire dataset into memory.
    55. Data Chunking – Speeding it up using an SSD instead of a normal HDD. RAM: fast but expensive. SSD: ~4x faster than a normal HDD when chunking.
    56. The Good Enough Stuff (…that we have no time for): Multicore
    57. Multicore Processing – Revolution R
        library(foreach)
        library(doSNOW)
        cluster <- makeCluster(3, type = "SOCK")
        registerDoSNOW(cluster)
        setMKLthreads(1)
        predictions <- foreach(1:1000, .combine=cbind) %dopar% {
          training_positions <- sample(nrow(train), size=floor((nrow(train)*0.9)), replace=TRUE)
          train_pos <- 1:nrow(train) %in% training_positions
          glmMod <- rxLinMod(eqn, train[train_pos,])
          rxPredict(glmMod, test, type="response")
        }
        result <- rowMeans(predictions)
    58. Multicore Processing – Why it matters
        License cost (usually charged per CPU): 1 CPU with 4 cores = 1 single-user license; distributed 4 CPUs with 1 core each = 4 licenses or a group license.
        Performance improvement: ~2x performance for 3 cores vs 1 core.
    59. Visualization
    60. Good Enough References
        Random Forest:
        • Obtaining knowledge from a random forest
        • Suggestions for speeding up Random Forests
        • Random Forest with classes that are very unbalanced
        GBM:
        • Define boosting
        • Generalized Boosted Models: A guide to the gbm package
        • What are some useful guidelines for GBM parameters?
        • R gbm logistic regression
        • How to win the KDD Cup Challenge with R and gbm
        Ensembles:
        • Ensemble learning introduction
        • Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets
        • Resources for learning how to implement ensemble methods
        • Ensemble methods
        • Intro to ensemble learning in R
        • Predictive analytics & decision tree
    61. Good Enough References
        PCA and NearZero:
        • Principal Component Analysis in R
        • PCA on high dimensional data
        • PCA on training and test data
        • Nearzero R caret library
        Misc:
        • Andrew Ng’s Machine Learning Course
        • A Few Useful Things to Know about Machine Learning
        • Creating HIPAA-Compliant Medical Data Applications With AWS
        • Amazon EC2 Spot Instances
        • Improve Predictive Performance in R with Bagging
        • Kaggle: Visualizing dark world
        • Kaggle: Visualizing handwriting
    62. Good Enough Analytics: big data analytics using cost-efficient tools and a good enough ensemble of models based on optimized data, scaled on cloud services
    63. Qns? Email me @ thiakx@gmail.com (LinkedIn Profile, Kaggle Profile). Good Enough Analytics: big data analytics using cost-efficient tools and a good enough ensemble of models based on optimized data, scaled on cloud services. Asia?
    64. Photo Credits
        • Slide 2: http://3.bp.blogspot.com/-nkP_UHgebKo/T70GJ3ezCrI/AAAAAAAAAZc/mWD6RsDlz6Y/s1600/IMG_0349.JPG
        • Slide 3: http://www.salesmanagementmastery.com/wp-content/uploads/2010/09/money-flying.jpg
        • Slide 5: http://www.pachd.com/free-images/household-images/spoon-01.jpg
        • Slide 6: http://www.bhmpics.com/view-rice_in_a_wooden_spoon-1440x900.html
        • Slide 7: http://2.bp.blogspot.com/-Oj7ji_8CB3Q/TkvdFXAYUcI/AAAAAAAADgQ/XcevbehpPHU/s1600/Big+spoon+3.jpg
        • Slide 8: http://familyhelpers.files.wordpress.com/2012/03/spoon.jpg
        • Slide 11 (Lemon): http://miamiaromatherapy.com/shopping/images/70//Lemon-2.jpg
        • Slide 12 (Bank): http://www.psdgraphics.com/wp-content/uploads/2011/03/bank-icon.jpg
        • Slide 11/12 (Logos): http://commons.wikimedia.org/wiki/Main_Page
        • Slide 19-21: www.scholarpedia.org
        • Slide 23/25: www.wikipedia.org
        • Slide 32: http://www.chipandco.com/wp-content/uploads/2012/08/medicine.jpg
        • Slide 63: www.kaggle.com
