
Data Analysis - Making Big Data Work

Published in: Data & Analytics

  1. 1. Data Analysis: Making Big Data Work. David Chiu, 2014/11/24
  2. 2. About Me: Founder of LargitData. Ex-Trend Micro engineer. ywchiu.com
  3. 3. Big Data & Data Science
  4. 4. US Election Prediction
  5. 5. World Cup Prediction
  6. 6. Hurricane Prediction
  7. 7. Data Science http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
  8. 8. To Be a Data Scientist, You Need to Know That Much? Seriously?
  9. 9. Required Skills. Statistics: single-variable, multivariable, ANOVA. Data Munging: data extraction, transformation, loading. Data Visualization: figures, business intelligence.
  10. 10. What You Probably Need Is a Team. Business Analyst: knowing how to use different tools under different circumstances. Statistician: how to process big data? DBA: how to deal with unstructured data. Software Engineer: knowing how to use statistics.
  11. 11. Four Dimensions. Single Machine: memory, R, local file. Cloud, Distributed: Hadoop, HDFS. Statistics Analysis; Linear Algebra; Architect; Management; Standard. Concept: MapReduce, linear algebra, logistic regression. Tool: Hadoop, PostgreSQL, R. Analyst: how to use these tools. Hackers: R, Python, Java.
  12. 12. What Do Data Scientists Do? “80% are doing summing and averaging.” Content: 1. Data Munging 2. Data Analysis 3. Interpreting Results
  13. 13. Application of Data Analysis. Text Mining: classify spam mail, build index. Data Search Engine. Social Network Analysis: finding opinion leaders. Recommendation System: what do users like? Opinion Mining: positive/negative opinions. Fraud Analysis: credit card fraud.
  14. 14. Feed Data to the Computer. Make the Computer Do the Analysis.
  15. 15. Let the Computer Predict for You
  16. 16. Predictive Analysis. Learn from experience (data) to predict future behavior. What to predict? e.g. Who is likely to click on that ad? For what? e.g. To decide which ad to show, based on click probability and revenue.
  17. 17. Surprising Conclusions. Customers who buy beer will also buy Pampers? People who browse telephone rate plans are likely to switch vendors. People who belong to the same group tend to have the same telecom vendor.
  18. 18. Predictive Model. Based on personal behavior, a predictive model uses personal characteristics to generate a probabilistic score: the higher the score, the more likely the behavior.
  19. 19. Linear Model e.g. Based on a cosmetic ad. We can give 90% weight to female customers, give10% to male customer. Based on the click probability (15%), we can calculate the possibility score (or probability) Female 13.5%,Male1.5% Rule Model e.g. If the user is “She” And Income is over 30k And haven’t seen the ad yet The click rate is 11% Simple Predictive Model
  20. 20. Induction From detail to general A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E -- Tom Mitchell (1998) Discover an effective model Start from a simple model Update the model based on feeding data Keep on improving prediction power Machine Learning
  21. 21. Statistic Analysis Regression Analysis Clustering Classification Recommendation Text Mining Application 22
  22. 22. Image recognition
  23. 23. Decision Tree. Rate > 1,299/month? Yes: probability to switch vendor 15%. No: probability to switch vendor 3%.
  24. 24. Decision Tree. Rate > 1,299/month? Yes: Income > 22,000? (Yes: probability to switch vendor 10%. No: probability to switch vendor 22%.) No: probability to switch vendor 3%.
  25. 25. Decision Tree. Rate > 1,299/month? Yes: Income > 22,000? (Yes: probability to switch vendor 10%. No: probability to switch vendor 22%.) No: Free for intranet? (Yes: probability to switch vendor 1%. No: probability to switch vendor 7%.)
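
The final tree on slide 25 can be read as nested if/else rules. The sketch below is a hand-written Python version, assuming the flattened slide text pairs each Yes/No branch with the probabilities in the order listed; the function and parameter names are hypothetical.

```python
def switch_probability(monthly_rate, income, free_intranet_calls):
    """Walk the slide-25 tree and return the leaf's probability of switching vendor.

    The branch-to-leaf pairing is an assumption read off the flattened slide text.
    """
    if monthly_rate > 1299:
        # High-rate customers are split by income.
        return 0.10 if income > 22000 else 0.22
    # Low-rate customers are split by whether intranet calls are free.
    return 0.01 if free_intranet_calls else 0.07


if __name__ == "__main__":
    print(switch_probability(1500, 20000, free_intranet_calls=False))
```

Each added split refines a coarser leaf: the 15% leaf on slide 23 becomes the 10% and 22% leaves, and the 3% leaf becomes the 1% and 7% leaves.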
  26. 26. Supervised Learning Regression Classification Unsupervised Learning Dimension Reduction Clustering Machine Learning
  27. 27. Supervised Learning
  28. 28. Classification e.g. Stock prediction on bull/bear market Regression e.g. Price prediction Supervised Learning
  29. 29. Dimension Reduction e.g. Making a new index Clustering e.g. Customer Segmentation Unsupervised Learning
  30. 30. Lift The better the lift, the greater the cost? The more decision rule, the more campaign? Design strategy for different persona? The lift for 4 campaign? The lift for 20 ampaign? Lift
  31. 31. Can we use the production rate of butter to predict stock market? Overfitting
  32. 32. Use noise as information Over assumption Over Interpretation What overfitting learn is not truth Like memorize all answers in a single test. Overfitting
  33. 33. Testing Model Use external data or partial data as testing dataset
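
Holding out part of the data, as slide 33 suggests, can be sketched with the Python standard library alone. The helper name, the 30% default, and the seed are illustrative assumptions, not something the deck prescribes.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=2):
    """Shuffle a copy of the rows and hold out a fraction as the testing dataset."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training, testing)


if __name__ == "__main__":
    train, test = train_test_split(range(10))
    print(len(train), len(test))
```

A model is then fitted on the training part only and scored on the testing part, so that memorized noise does not count as predictive power.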
  34. 34. Traditional Analysis Tool
  35. 35. R Language. Statistics on the fly. Built-in math and graphics functions. Free and open source: http://cran.r-project.org/src/base/
  36. 36. R Language. Functional programming: use function definitions to retrieve answers. Interpreted language: statistics on the fly. Object-oriented language: S3 and S4 methods.
  37. 37. Most Used Analytic Language. Most popular languages are R, Python (39%), SQL (37%), and SAS (20%). By Gregory Piatetsky, Aug 27, 2013.
  38. 38. Kaggle: http://www.kaggle.com/ Most often used language in Kaggle competitions
  39. 39. Data Scientists at Google and Apple Use R. What is your programming language of choice, R, Python or something else? “I use R, and occasionally matlab, for data analysis. There is a large, active and extremely knowledgeable R community at Google.” http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/ “Expert knowledge of SAS (With Enterprise Guide/Miner) required and candidates with strong knowledge of R will be preferred” http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook
  40. 40. Customer Churn Analysis. Discover which customers are likely to churn.
  41. 41. Data Description. Account Information: state, account length, area code, phone number. User Behavior: international plan; voice mail plan, number vmail messages; total day minutes, total day calls, total day charge; total eve minutes, total eve calls, total eve charge; total night minutes, total night calls, total night charge; total intl minutes, total intl calls, total intl charge; number customer service calls. Target: Churn (Yes/No).
  42. 42. Split data into training and testing datasets: 70% as training dataset, 30% as testing dataset
      > install.packages("C50")
      > library(C50)
      > data(churn)
      > str(churnTrain)
      > churnTrain = churnTrain[,! names(churnTrain) %in% c("state", "area_code", "account_length") ]
      > set.seed(2)
      > ind <- sample(2, nrow(churnTrain), replace = TRUE, prob=c(0.7, 0.3))
      > trainset = churnTrain[ind == 1,]
      > testset = churnTrain[ind == 2,]
  43. 43. Classification: Build Classifier
      library(rpart)  # rpart() is provided by the rpart package
      churn.rp <- rpart(churn ~ ., data=trainset)
      plot(churn.rp, margin=0.1)
      text(churn.rp, all=TRUE, use.n=TRUE)
  44. 44. Prediction Result
      > predictions <- predict(churn.rp, testset, type="class")
      > table(testset$churn, predictions)
           pred
             no yes
        no  859  18
        yes  41 100
  45. 45. Use Confusion Matrix
      > library(caret)  # confusionMatrix() is provided by the caret package
      > confusionMatrix(table(predictions, testset$churn))
      Confusion Matrix and Statistics
      predictions yes  no
              yes 100  18
              no   41 859
      Accuracy : 0.942
      95% CI : (0.9259, 0.9556)
      No Information Rate : 0.8615
      P-Value [Acc > NIR] : < 2.2e-16
      Kappa : 0.7393
      Mcnemar's Test P-Value : 0.004181
      Sensitivity : 0.70922
      Specificity : 0.97948
      Pos Pred Value : 0.84746
      Neg Pred Value : 0.95444
      Prevalence : 0.13851
      Detection Rate : 0.09823
      Detection Prevalence : 0.11591
      Balanced Accuracy : 0.84435
      'Positive' Class : yes
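
Several of the statistics on slide 45 follow directly from the four cell counts, with "yes" as the positive class. The sketch below recomputes them in Python as a worked check; the function and key names are hypothetical, and only the counts (100, 18, 41, 859) come from the slide.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive summary rates from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # share of actual "yes" that was caught
    specificity = tn / (tn + fp)  # share of actual "no" that was left alone
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Counts from slide 45: predicted yes/actual yes = 100, predicted yes/actual no = 18,
# predicted no/actual yes = 41, predicted no/actual no = 859.
METRICS = confusion_metrics(tp=100, fp=18, fn=41, tn=859)

if __name__ == "__main__":
    for name, value in METRICS.items():
        print(f"{name}: {value:.5f}")
```

The gap between accuracy (about 0.94) and sensitivity (about 0.71) shows why accuracy alone flatters a churn model: only about 14% of customers churn, so most of the correct predictions are easy "no" cases.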
  46. 46. Use Testing Data to Validate Result
      library(ROCR)  # prediction() and performance() are provided by the ROCR package
      predictions <- predict(churn.rp, testset, type="prob")
      pred.to.roc <- predictions[, 1]
      pred.rocr <- prediction(pred.to.roc, as.factor(testset[,(dim(testset)[[2]])]))
      perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
      perf.tpr.rocr <- performance(pred.rocr, "tpr","fpr")
      plot(perf.tpr.rocr, colorize=TRUE, main=paste("AUC:",(perf.rocr@y.values)))
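
The AUC reported by the ROC plot has a simple reading: it is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The Python sketch below computes it by direct pair counting, which is not how ROCR computes it internally; the function name is hypothetical and any scores passed to it are made-up illustrations, not the churn results.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical scores: every positive outranks every negative.
    print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))
```

A value of 1.0 means the scores rank every positive above every negative, and 0.5 means the ranking is no better than chance.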
  47. 47. Finding Most Important Variable
      library(rminer)  # fit(), Importance() and mgraph() are provided by the rminer package
      model = fit(churn~., trainset, model="svm")
      VariableImportance = Importance(model, trainset, method="sensv")
      L = list(runs=1, sen=t(VariableImportance$imp), sresponses=VariableImportance$sresponses)
      mgraph(L, graph="IMP", leg=names(trainset), col="gray", Grid=10)
  48. 48. Python Language. Dynamic language: execution at runtime, dynamic typing. Interpreted language: see the result after execution. OOP.
  49. 49. Benefits of Python. Cross-platform (Python VM). Third-party resources (data analysis, graphics, website development). Simple and easy to learn.
  50. 50. Data Analysis: SciPy, NumPy, scikit-learn, pandas
  51. 51. Companies That Use Python
  52. 52. Use InfoLite Tool To Extract DOM
  53. 53. Use Python To Build a Dashboard
  54. 54. Monitor Social Media and News. Monitor posts on social media. Configure keywords and alerts. Use a line plot to show daily post statistics. Sources: Apple Daily (蘋果), nownews, udn, Central News Agency (中央), Storm Media (風傳媒), and other financial media.
  55. 55. Daily Statistics Report
  56. 56. Examine Associated Articles
  57. 57. Configure Alert and Keyword
  58. 58. Configure Monitor Channel
  59. 59. Track Specific Article
  60. 60. Have You Learned Big Data?
  61. 61. The 3Vs of Big Data
  62. 62. Product Centric vs. Customer Centric
  63. 63. Customer Centric? http://goo.gl/iuy4lY
  64. 64. Personal Recommendation
  65. 65. So… What Is Big Data? Knowing who you are: personal recommendation, customer relationship management. Knowing what the future looks like: from history, we can see the future; predictive analysis. Knowing what is hidden beneath: correlation, correlation, correlation.
  66. 66. So… How To Analyze?
  67. 67. Hadoop. Apache project (from Yahoo). Features: extensible, cost effective, flexible, highly fault tolerant.
  68. 68. Hadoop Ecosystem: HDFS, MR, Impala, HBase, Pig, Hive, Sqoop, Flume, Hue, Oozie, Mahout
  69. 69. Tools for Different Scales
      Size | Classification | Tools
      Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ...
      KBs to low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
      MBs to low GBs | Online Data | Storage: MySQL (DBs), ... Analysis: NumPy, SciPy, Pandas, Weka, ... Visualisation: Flare, AmCharts, Raphael
      GBs to TBs to PBs | Big Data | Storage: HDFS, HBase, Cassandra, ... Analysis: Hive, Giraph, Hama, Mahout
  70. 70. Amazon
  71. 71. Facebook
  72. 72. Recommendation System: JavaScript, Flume, HDFS, HBase, Pig, Mahout
  73. 73. Item-Based
  74. 74. User-Based
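
Slides 73 and 74 name the two classic flavors of collaborative filtering without showing either. The sketch below illustrates the item-based flavor in plain Python: items are compared by the cosine similarity of their rating vectors, and an unseen item is scored by a similarity-weighted average of the user's own ratings. It is not the Mahout pipeline used in the deck, and the ratings, user names, and function names are all hypothetical.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}.
RATINGS = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2},
    "carol": {"B": 5, "C": 1},
}


def item_vectors(ratings):
    """Invert user -> item ratings into item -> user rating vectors."""
    vectors = {}
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recommend_item_based(ratings, user):
    """Rank items the user has not rated by similarity-weighted predicted rating."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        weighted = total_sim = 0.0
        for item, rating in seen.items():
            sim = cosine(vectors[candidate], vectors[item])
            weighted += sim * rating
            total_sim += sim
        if total_sim > 0:
            scores[candidate] = weighted / total_sim
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(recommend_item_based(RATINGS, "bob"))
```

A user-based variant would apply the same similarity function to user vectors instead of item vectors, then borrow ratings from the most similar users.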
  75. 75. Monitor User Rating
  76. 76. Send User Behavior to Backend
  77. 77. Use Flume To Collect Streaming Data From /tmp/postlog.txt To /user/cloudera/flume
  78. 78. Use Pig To Load JSON
      JSON sample data:
      {"food":"Tacos", "person":"Alice", "amount":3}
      {"food":"Tomato Soup", "person":"Sarah", "amount":2}
      {"food":"Grilled Cheese", "person":"Alex", "amount":5}
      Demo code:
      second_table = LOAD 'second_table.json' USING JsonLoader('food:chararray, person:chararray, amount:int');
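
To make the schema in the Pig statement concrete, the sketch below parses the same three sample records in Python. It assumes one JSON object per line, as in the slide's sample data, and casts each field to the type the schema declares (two strings and an integer). It is a stand-in for inspecting the data locally, not a replacement for JsonLoader, and the function name is hypothetical.

```python
import json

# Sample records copied from slide 78, one JSON object per line.
SAMPLE = """{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}"""


def load_json_lines(text):
    """Parse line-delimited JSON into (food, person, amount) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        records.append((str(row["food"]), str(row["person"]), int(row["amount"])))
    return records


if __name__ == "__main__":
    for record in load_json_lines(SAMPLE):
        print(record)
```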
  79. 79. Build Recommendation Model
  80. 80. Build Table In HBase
      $ hbase shell
      > create 'mydata', 'mycf'
  81. 81. Examine Data In HDFS
  82. 82. Use Pig To Transfer Data Into HBase
  83. 83. Examine Data In HBase
  84. 84. Build API
  85. 85. Recommendation System
  86. 86. What You Should Do. Focus on algorithms: divide and conquer, trie, collaborative filtering. Be an expert in a single programming language, but know which tools and algorithms you can use to solve your problem. Define your role: statistician or software engineer.
  87. 87. Contacts. Website: largitdata.com, ywchiu.com. Email: david@largitdata.com, tr.ywchiu@gmail.com
