
Big Data Heterogeneous Mixture Learning on Spark


  1. 1. Big Data Heterogeneous Mixture Learning on Spark Masato Asahara and Ryohei Fujimaki NEC Data Science Research Labs. Jun/30/2016 @Hadoop Summit 2016
  2. 2. 2 © NEC Corporation 2016 Who are we? ▌Masato Asahara (Ph.D.) ▌Researcher, NEC Data Science Research Laboratory  Masato received his Ph.D. degree from Keio University and has worked at NEC for 6 years in the research field of distributed computing systems and computing resource management technologies.  Masato is currently leading the development of Spark-based machine learning and data analytics systems, particularly those using NEC's Heterogeneous Mixture Learning technology. ▌Ryohei Fujimaki (Ph.D.) ▌Research Fellow, NEC Data Science Research Laboratory  Ryohei is responsible for the core technologies behind NEC's leading predictive and prescriptive analytics solutions, including the "heterogeneous mixture learning" and "predictive optimization" technologies. In addition to technology R&D, Ryohei is also involved in co-developing cutting-edge advanced analytics solutions with NEC's global business clients and partners in the North American and APAC regions.  Ryohei received his Ph.D. degree from the University of Tokyo and became the youngest research fellow ever in NEC Corporation's 115-year history.
  3. 3. 3 © NEC Corporation 2016 Agenda HML
  4. 4. 4 © NEC Corporation 2016 Agenda HML
  5. 5. 5 © NEC Corporation 2016 Agenda HML
  6. 6. 6 © NEC Corporation 2016 Agenda Micro Modular Server HML
  7. 7. NEC’s Predictive Analytics and Heterogeneous Mixture Learning
  8. 8. 8 © NEC Corporation 2016 Enterprise Applications of HML Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Product Price Optimization Sales Optimization Energy/Water Operation Mgmt.
  9. 9. 9 © NEC Corporation 2016 NEC's Heterogeneous Mixture Learning (HML) NEC's machine learning that automatically derives "accurate" and "transparent" formulas behind Big Data. (Diagram: heterogeneous mixture data is split into transparent data segments, and a predictive formula of the form Y = a1x1 + ... + anxn is fit for each segment: "Explore Massive Formulas".)
  10. 10. 10 © NEC Corporation 2016 The HML Model (Diagram: a rule-based tree splits on gender and height (tall/short); leaf formulas use features such as age, smoking, food, weight, sports, tea/coffee and beer/wine.) The health risk of a tall man can be predicted by age, alcohol and smoking!
  11. 11. 11 © NEC Corporation 2016 HML (Heterogeneous Mixture Learning)
  12. 12. 12 © NEC Corporation 2016 HML (Heterogeneous Mixture Learning)
  13. 13. 13 © NEC Corporation 2016 HML Algorithm Segment data by a rule-based tree Feature Selection Pruning Data Segmentation Training data
  14. 14. 14 © NEC Corporation 2016 HML Algorithm Select features and fit predictive models for data segments Feature Selection Pruning Data Segmentation Training data
  15. 15. 15 © NEC Corporation 2016 Enterprise Applications of HML Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Operation Mgmt. Sales Optimization Product Price Optimization
  16. 16. 16 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Demand Forecasting Sales Forecasting Product Price Optimization Demand for Big Data Analysis 5 million (customers) × 12 (months) = 60,000,000 training samples
  17. 17. 17 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Operation Mgmt. Product Price Optimization Demand for Big Data Analysis 100 (items) × 1,000 (shops) = 100,000 predictive models Sales Optimization
  18. 18. 18 © NEC Corporation 2016 Driver Risk Assessment Inventory Optimization Churn Retention Predictive Maintenance Energy/Water Operation Mgmt. Sales Optimization Product Price Optimization Demand for Big Data Analysis
  19. 19. Distributed HML on Spark
  20. 20. 20 © NEC Corporation 2016 Why Spark, not Hadoop? Because Spark's in-memory architecture can run HML faster. (Diagram: HML repeatedly passes over the data for data segmentation, feature selection and pruning; Spark keeps the training data in memory across these passes.)
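To make the in-memory point concrete, here is a minimal Scala sketch of the caching pattern. The object name, HDFS path and CSV parsing are illustrative assumptions, and the two "passes" merely stand in for HML's real segmentation/feature-selection/pruning passes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachedTrainingDataSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hml-cache-sketch"))

    // Parse once, then keep the samples in executor memory; every later pass
    // (segmentation, feature selection, pruning) reuses the cached partitions
    // instead of re-reading and re-parsing the files on HDFS.
    val samples = sc.textFile("hdfs:///data/training.csv") // hypothetical path
      .map(_.split(',').map(_.toDouble))
      .persist(StorageLevel.MEMORY_ONLY)

    val pass1 = samples.map(_.sum).reduce(_ + _) // stand-in for one HML pass
    val pass2 = samples.count()                  // a second pass hits the cache
    println(s"pass1=$pass1, samples=$pass2")
    sc.stop()
  }
}
```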
  21. 21. 21 © NEC Corporation 2016 Data Scalability powered by Spark Handle training data at virtually unlimited scale by adding executors. (Diagram: a driver and several executors read the training data from HDFS.) Add executors, and get more memory and CPU power.
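As a rough illustration of how "just add executors" looks in practice, the sketch below sets the standard Spark-on-YARN executor knobs programmatically. The values are examples only (equivalent to the corresponding spark-submit flags), not the configuration used in the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExecutorScalingSketch {
  def main(args: Array[String]): Unit = {
    // More executors bring more aggregate memory and CPU cores, so larger
    // training data stays distributed across the cluster without code changes.
    val conf = new SparkConf()
      .setAppName("distributed-hml")
      .set("spark.executor.instances", "32") // number of executors (on YARN)
      .set("spark.executor.cores", "4")      // CPU cores per executor
      .set("spark.executor.memory", "8g")    // heap per executor
    val sc = new SparkContext(conf)
    // ... load training data from HDFS and train as usual ...
    sc.stop()
  }
}
```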
  22. 22. 22 © NEC Corporation 2016 Distributed HML Engine: Architecture (Diagram: the data scientist invokes the Distributed HML Interface (Python), which drives the Distributed HML Core (Scala) running on YARN; training data is read from HDFS, matrix computation uses libraries such as Breeze and BLAS, and the output is an HML model.)
  23. 23. 23 © NEC Corporation 2016 3 Technical Key Points to Run ML Fast on Spark
  24. 24. 24 © NEC Corporation 2016 3 Technical Key Points to Run ML Fast on Spark
  25. 25. 25 © NEC Corporation 2016 Challenge: Avoid Data Shuffling (Diagram: a naïve design shuffles TB-scale training data across the cluster; Distributed HML instead keeps the data where it is and ships only models of 10KB~10MB to the driver, where the HML Model Merge Algorithm combines them.)
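The following Scala sketch illustrates the general pattern; the LocalModel type and its statistics are hypothetical placeholders, not NEC's HML Model Merge Algorithm. Each executor reduces its local partitions to a small summary object, and only those summaries cross the network to be merged.

```scala
import org.apache.spark.rdd.RDD

object ModelMergeSketch {
  // Hypothetical per-partition summary: a few KB of statistics instead of raw rows.
  case class LocalModel(count: Long, sum: Array[Double]) {
    def merge(o: LocalModel): LocalModel =
      LocalModel(count + o.count, sum.zip(o.sum).map { case (a, b) => a + b })
  }

  // Only the small LocalModel objects are sent to the driver; the TB-scale
  // training data itself never moves between nodes.
  def fitWithoutShuffle(data: RDD[Array[Double]]): LocalModel =
    data.mapPartitions { it =>
      val rows = it.toArray
      if (rows.isEmpty) Iterator.empty
      else {
        val sum = new Array[Double](rows.head.length)
        rows.foreach(r => r.indices.foreach(i => sum(i) += r(i)))
        Iterator.single(LocalModel(rows.length, sum))
      }
    }.reduce((a, b) => a.merge(b)) // driver-side merge of per-executor summaries
}
```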
  26. 26. 26 © NEC Corporation 2016 Execution Flow of Distributed HML Executors build a rule-based tree from their local data in parallel Driver Executor Feature Selection Pruning Data Segmentation
  27. 27. 27 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the trees Driver Executor Feature Selection Pruning Data Segmentation HML Segment Merge Algorithm
  28. 28. 28 © NEC Corporation 2016 Execution Flow of Distributed HML Driver broadcasts the merged tree to executors Driver Executor Feature Selection Pruning Data Segmentation
  29. 29. 29 © NEC Corporation 2016 Execution Flow of Distributed HML Executors perform feature selection with their local data in parallel Driver Executor Feature Selection Pruning Data Segmentation
  30. 30. 30 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the results of feature selection Driver Executor Feature Selection Pruning Data Segmentation HML Feature Merge Algorithm
  31. 31. 31 © NEC Corporation 2016 Execution Flow of Distributed HML Driver merges the results of feature selection Driver Executor Feature Selection Pruning Data Segmentation HML Feature Merge Algorithm
  32. 32. 32 © NEC Corporation 2016 Execution Flow of Distributed HML Driver broadcasts the merged results of feature selection to executors Driver Executor Feature Selection Pruning Data Segmentation (a code sketch of one such round follows)
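A compact Scala sketch of one round of this flow; the TreeSummary type and its toy local statistics are placeholders for illustration, not the actual HML tree or feature-merge algorithms. Executors emit small local summaries, the driver reduces them into one merged object, and a broadcast hands the merged state back to every executor for the next local step.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object MergeBroadcastSketch {
  // Placeholder for the small state exchanged between executors and driver.
  case class TreeSummary(leafCounts: Map[String, Long]) {
    def merge(o: TreeSummary): TreeSummary = TreeSummary(
      (leafCounts.keySet ++ o.leafCounts.keySet)
        .map(k => k -> (leafCounts.getOrElse(k, 0L) + o.leafCounts.getOrElse(k, 0L)))
        .toMap)
  }

  def oneRound(sc: SparkContext, data: RDD[Array[Double]]): TreeSummary = {
    // 1) Executors build local summaries from their own partitions, in parallel.
    val localTrees: RDD[TreeSummary] = data.mapPartitions { it =>
      Iterator.single(TreeSummary(Map("leaf-" + (it.length % 4) -> 1L))) // toy local "tree"
    }
    // 2) Driver merges the local summaries (cheap: the objects are small).
    val merged = localTrees.reduce((a, b) => a.merge(b))
    // 3) Driver broadcasts the merged state so the next local step starts from it.
    val bcTree = sc.broadcast(merged)
    data.foreachPartition { _ =>
      val shared = bcTree.value // each executor reads the merged tree here
    }
    merged
  }
}
```

In Distributed HML the merged objects are the rule-based tree and the selected feature sets, but the communication pattern is the same in both directions: only small models travel, never the training data.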
  33. 33. 33 © NEC Corporation 2016 3 Technical Key Points to Run ML Fast on Spark
  34. 34. 34 © NEC Corporation 2016 Waiting for Other Executors Delays Execution Machine learning is likely to cause an unbalanced computation load across executors. (Diagram: execution timeline of three executors; the slowest one determines the completion time.)
  35. 35. 35 © NEC Corporation 2016 Balance Computational Effort for Each Executor Each executor optimizes all predictive formulas on an equally-divided share of the training data.
  36. 36. 36 © NEC Corporation 2016 Balance Computational Effort for Each Executor Each executor optimizes all predictive formulas on an equally-divided share of the training data. (Diagram: with balanced partitions the executors finish at almost the same time.) Completion time reduced! (see the sketch below)
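A minimal Scala sketch of this balancing idea; the function and parameter names are illustrative, and it assumes per-partition work grows roughly with partition size. Rather than assigning each executor one possibly huge data segment, every task gets an equal slice of all the data and works on all predictive formulas over that slice.

```scala
import org.apache.spark.rdd.RDD

object BalanceSketch {
  // Equal-sized partitions give roughly equal task durations, so no executor
  // sits idle waiting for a straggler that received an oversized segment.
  def balance[T](data: RDD[T], executors: Int, coresPerExecutor: Int): RDD[T] = {
    val tasks = executors * coresPerExecutor // one equally-sized partition per task slot
    data.repartition(tasks)
  }
}
```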
  37. 37. 37 © NEC Corporation 2016 3 Technical Key Points to Run ML Fast on Spark
  38. 38. 38 © NEC Corporation 2016 Machine Learning Consists of Many Matrix Computations Feature Selection Pruning Data Segmentation Basic Matrix Computation Matrix Inversion Combinatorial Optimization … CPU
  39. 39. 39 © NEC Corporation 2016 RDD Design for Leveraging High-Speed Libraries (Diagram: in a straightforward RDD design each element is a single training sample and map processes samples one by one; in Distributed HML's RDD design each element is one matrix object built from all training samples of a partition via mapPartitions, so the matrix computation libraries (Breeze, BLAS) operate on whole matrices at once.)
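The sketch below shows the partition-as-matrix idea in Scala with Breeze; it is an illustration of the pattern, not NEC's actual RDD code. All samples of a partition are packed into one DenseMatrix so that downstream linear algebra runs as a few large BLAS-backed operations instead of millions of tiny per-sample calls.

```scala
import breeze.linalg.DenseMatrix
import org.apache.spark.rdd.RDD

object PartitionMatrixSketch {
  // One DenseMatrix per partition: rows are training samples, columns are features.
  def toMatrices(samples: RDD[Array[Double]]): RDD[DenseMatrix[Double]] =
    samples.mapPartitions { it =>
      val rows = it.toArray
      if (rows.isEmpty) Iterator.empty
      else {
        val m = DenseMatrix.zeros[Double](rows.length, rows.head.length)
        for (i <- rows.indices; j <- rows(i).indices) m(i, j) = rows(i)(j)
        Iterator.single(m) // a single matrix object replaces thousands of sample objects
      }
    }

  // Example downstream use: per-partition Gram matrices X^T * X, combined by one small reduce.
  // val gram = toMatrices(samples).map(x => x.t * x).reduce(_ + _)
}
```

The design choice is simply to trade many small JVM objects for one large, contiguous matrix per partition, which is the shape of data that BLAS-backed libraries handle fastest.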
  40. 40. 40 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Long-chained RDD operations cause high-cost recalculations Feature Selection Pruning Data Segmentation (Schematic lineage: RDD.map(TreeMapFunc).reduce(TreeReduceFunc).map(FeatureSelectionMapFunc).reduce(FeatureSelectionReduceFunc) ... repeated every iteration ... .map(PruningMapFunc))
  41. 41. 41 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Long-chained RDD operations cause high-cost recalculations: when cached partitions are lost, e.g. evicted under memory pressure or GC, Spark recomputes them by replaying the whole lineage chain from the start.
  42. 42. 42 © NEC Corporation 2016 Performance Degradation by Long-Chained RDD Cut long-chained operations periodically with checkpoint(): the checkpointed RDD (savedRDD) is written to stable storage and becomes the new lineage root, so any recomputation restarts from there instead of from the original input. (a sketch of this pattern follows)
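A small Scala sketch of this lineage-truncation pattern; the checkpoint interval, directory and helper name are illustrative choices, not taken from the talk.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CheckpointSketch {
  // Call once on the driver before checkpointing:
  //   sc.setCheckpointDir("hdfs:///tmp/hml-checkpoints") // hypothetical path

  // Periodically materialize the current RDD to stable storage so later stages
  // no longer depend on the full map/reduce chain accumulated by the iterations.
  def truncateLineage[T](rdd: RDD[T], iteration: Int, every: Int = 10): RDD[T] = {
    if (iteration % every == 0) {
      rdd.persist(StorageLevel.MEMORY_AND_DISK) // avoid recomputing while the checkpoint is written
      rdd.checkpoint()                          // mark the RDD for checkpointing
      rdd.count()                               // force an action so the checkpoint is written now
    }
    rdd
  }
}
```

Once the checkpoint files are written, Spark drops the parent lineage of the saved RDD, so later failures or recomputations replay only the stages after the cut.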
  43. 43. Benchmark Performance Evaluations
  44. 44. 44 © NEC Corporation 2016 Eval 1: Prediction Error (vs. Spark MLlib algorithms) Distributed HML achieves low error, competitive with more complex models.
     data                          | # samples  | Distributed HML | Logistic Regression | Decision Tree | Random Forests
     gas sensor array (CO)*        | 4,208,261  | 0.542 (1st)     | 0.597               | 0.587         | 0.576
     household power consumption*  | 2,075,259  | 0.524 (1st)     | 0.531               | 0.529         | 0.655
     HIGGS*                        | 11,000,000 | 0.335 (2nd)     | 0.358               | 0.337         | 0.317
     HEPMASS*                      | 7,000,000  | 0.156 (1st)     | 0.163               | 0.167         | 0.175
     * UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
  45. 45. 45 © NEC Corporation 2016 Eval 2: Performance Scalability (vs. Spark MLlib algos) Distributed HML is competitive with Spark MLlib implementations. (Charts: scalability over # CPU cores for Distributed HML vs. Logistic Regression on HIGGS* (11M samples) and HEPMASS* (7M samples).) * UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
  46. 46. Evaluation in Real Case
  47. 47. 47 © NEC Corporation 2016 ATM Cash Balance Prediction Too much cash → high inventory cost; too little cash → high risk of running out of cash. Around 20,000 ATMs
  48. 48. 48 © NEC Corporation 2016 Training Speed: Serial HML* 9 days → Distributed HML 2 hours (110x speedup) ▌Summary of data  # ATMs: around 20,000 (in Japan)  # training samples: around 10M ▌Cluster spec (10 nodes)  # CPU cores: 128 (Xeon E5-2640 v3)  Memory: 2.5TB  Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2) * Run w/ 1 CPU core and 256GB memory
  49. 49. 49 © NEC Corporation 2016 Prediction Error: average error 17% lower with Distributed HML than with Serial HML ▌Summary of data  # ATMs: around 20,000 (in Japan)  # training samples: around 23M ▌Cluster spec (10 nodes)  # CPU cores: 128  Memory: 2.5TB  Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2) (Chart: per-ATM prediction error, Serial HML vs. Distributed HML.)
  50. 50. NEC’s HML Data Analytics Appliance powered by Micro Modular Server
  51. 51. 51 © NEC Corporation 2016 Dilemma between Cloud and On-Premises
     Cloud: quick start; low operation effort for Spark/YARN/HDFS; no space/power cost; no installation effort. But: high pay-for-use charge; complicated application software design; data leakage risk.
     On-Premises: no pay-for-use charge; low data leakage risk; simple application software design. But: complicated operation for Spark/YARN/HDFS; high installation cost; high space/power cost.
  52. 52. 52 © NEC Corporation 2016 NEC's HML Data Analytics Appliance: Concept Micro Modular Server + HML: Plug-and-Analytics for the data scientist! •Quick deployment •Low data leakage risk •Low operation effort for Spark/YARN/HDFS (powered by HDP) •Significant space savings •Low power cost •Simple application software. 328 CPU cores, 1.3TB memory and 5TB SSD in 2U space!
  53. 53. 53 © NEC Corporation 2016 Speed Test (ATM Cash Balance Prediction): Standard Cluster 7275 sec (about 2h) vs. Micro Modular Server 4529 sec (about 1.25h): 1.6x speedup ▌Summary of data  # ATMs: around 20,000 (in Japan)  # training samples: around 10M ▌Standard Cluster spec (10 nodes)  # CPU cores: 128 (Xeon E5-2640 v3)  Memory: 2.5TB  Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)
  54. 54. 54 © NEC Corporation 2016 Toward More Big Data! Micro Modular Server (DX1000) → Scalable Modular Server (DX2000): • 2.3x faster CPU • 1.7x more memory • 3.3x more SSD. 272 high-speed Xeon-D 2.1GHz cores, 2TB memory and 17TB SSD in 3U space
  55. 55. 55 © NEC Corporation 2016 Summary HML Micro Modular Server
  56. 56. 56 © NEC Corporation 2016 Summary HML Micro Modular Server
  57. 57. Appendix
  58. 58. 59 © NEC Corporation 2016 Micro Modular Server (DX1000) Spec.: Server Module
  59. 59. 60 © NEC Corporation 2016 Micro Modular Server (DX1000) Spec.: Module Enclosure
  60. 60. 61 © NEC Corporation 2016 Scalable Modular Server (DX2000) Spec.: Server Module
  61. 61. 62 © NEC Corporation 2016 Scalable Modular Server (DX2000) Spec.: Module Enclosure
