Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev

221 views

Published on

THORNY PATH TO DATA MINING PROJECTS

Java is often criticized for hard parsing CSV datasets, poor matrix and vectors manipulations. This makes it hard to easy and efficiently implement certain types of machine learning algorithms. In many cases data scientists choose R or Python languages for modeling and problem solution and you as a Java developer should rewrite R algorithms in Java or integrate many small Python scripts in Java application.


But why so many Highload tools like Cassandra, Hadoop, Giraph, Spark are written in Java or executed on JVM? What is the secret of successful implementation and running? Maybe we should forget old manufacturing approach when we separate developers from research engineers in production projects?


During the report, we will discuss typical Data Mining tasks, advantages and disadvantages of Hadoop ecosystem, battle between Spark and Hadoop for a place under the Sun, difference between popular Machine Learning tools and libraries.


Attendees of my talk will become more familiar with different abbreviations and buzz words and also will get useful tips about self-education way in this area.

Published in: Software
  • Be the first to comment

JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev

  1. 1. THORNY PATH TO DATA MINING PROJECTS Alexey Zinovyev, Java Trainer in EPAM
  2. 2. 2JDD conference About I am a <graph theory, machine learning, traffic jams prediction, BigData algorithms> scientist But I'm a <Java, NoSQL, Hadoop, Spark> programmer
  3. 3. 3JDD conference What do I know about Krakow?
  4. 4. 4JDD conference W kociołkach bigos grzano; w słowach wydać trudno bigosu smak przedziwny, kolor i woń cudną
  5. 5. 5JDD conference True story about Poland!
  6. 6. 6JDD conference In this topic … A lot of strange pictures and technologies from crazy zoo We talk about • Data Mining • Hadoop ecosystem • Spark and its friends • Machine Learning libraries
  7. 7. 7JDD conference Are you a Hadoop developer?
  8. 8. 8JDD conference Let’s do THIS!
  9. 9. 9JDD conference The Good Old Days
  10. 10. 10JDD conference One of these fine days...
  11. 11. 11JDD conference We need in Python dev 'cause Data Mining
  12. 12. 12JDD conference No, you are JavaEE developer only, continue …
  13. 13. 13JDD conference Write your backends, dude!
  14. 14. 14JDD conference Let’s talk about it, Java-boy...
  15. 15. 15JDD conference Can a Java programmer to be a Data Scientist?
  16. 16. 16JDD conference Sexy Data Scientist
  17. 17. 17JDD conference Real Data Scientist
  18. 18. 18JDD conference And what I tell you, young man
  19. 19. 19JDD conference And what I tell you, young man
  20. 20. WHAT IS DATA MINING?
  21. 21. 21JDD conference Statistics?
  22. 22. 22JDD conference Not OLAP, 100%
  23. 23. 23JDD conference Hey, man, predict something!
  24. 24. 24JDD conference Hey, man, predict something!
  25. 25. 25JDD conference Man or sofa?
  26. 26. 26JDD conference It’s Time for Java Superhero, yeah!
  27. 27. 27JDD conference Before patterns discovering you should .. • Select small pieces • Define default values for missed data • Remove strange signals from data • Merge some tables in one if required
  28. 28. SUBJECT AREA
  29. 29. 29JDD conference Typical questions for DM • Which loan applicants are high-risk?
  30. 30. 30JDD conference Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud?
  31. 31. 31JDD conference Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud? • What is the revenue prediction for next year?
  32. 32. 32JDD conference Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud? • What is the revenue prediction for next year? • Can you recommend music for users?
  33. 33. DATA
  34. 34. 34JDD conference Datasets • Facebook users, tweets • Trade transactions • Government • Medicine (genomic data) • Telecommunications
  35. 35. 35JDD conference Data Sources • Relational Databases • Data warehouses (Historical data) • Files in CSV or in binary format • Internet or electronic mails • Scientific, research (R, Octave, Matlab)
  36. 36. PATTERN MINING
  37. 37. 37JDD conference Association rule learning
  38. 38. 38JDD conference What is Cluster Analysis? It is the process of finding model of function that describes and distinguishes data class to predict the class of objects whose class label is unknown.
  39. 39. 39JDD conference Different algorithms – different results
  40. 40. 40JDD conference Regression
  41. 41. 41JDD conference • Training set of classified examples (supervised learning) Classification
  42. 42. 42JDD conference • Training set of classified examples (supervised learning) • Test set of non-classified items Classification
  43. 43. 43JDD conference • Training set of classified examples (supervised learning) • Test set of non-classified items • Main goal: find a function (classifier) that maps input data to a category (class) Classification
  44. 44. 44JDD conference Decision trees
  45. 45. 45JDD conference Cruel Tree
  46. 46. 46JDD conference Green circle is blue square or red triangle? Let’s ask its neighbors! kNN (k-nearest neighbor)
  47. 47. FASHION LANGUAGES
  48. 48. 48JDD conference Octave
  49. 49. 49JDD conference • A small amount of ML algorithms • All your matrixes are belong to us! • Single thread model • Java support • Octave in Java? Why not Octave?
  50. 50. 50JDD conference Do you like this GUI?
  51. 51. 51JDD conference • 25% of R packs are written in Java • Syntax is too sweet • You should read 1000 lines in docs to write 1 line of code • Single thread model for 95% algorithms Why not R?
  52. 52. 52JDD conference Now Python is an idol for young scientists due to the low barrier to entry Why not Python?
  53. 53. 53JDD conference • High-level language • Have you ever heard about a Jython? • Long way to real Highload production • We are not Python developers Why not Python?
  54. 54. 54JDD conference DM libraries
  55. 55. 55JDD conference Think Java
  56. 56. JAVA ECOSYSTEM
  57. 57. 57JDD conference Hadoop
  58. 58. 58JDD conference How to make features from Hadoop cluster?
  59. 59. 59JDD conference Pig & Hive
  60. 60. 60JDD conference Hive
  61. 61. 61JDD conference PIG (Triangle count)
  62. 62. 62JDD conference Why do we need in special graph approach?
  63. 63. HOW TO MAKE GRAPH FEATURES
  64. 64. 64JDD conference SNA
  65. 65. 65JDD conference MapReduce for iterative calculations • High complexity of graph problem reduction to key-value model • Iteration algorithms, but multiple chained jobs in M/R with full saving and reading of each state Think like a vertex…
  66. 66. 66JDD conference Data vs Graph
  67. 67. 67JDD conference TRAIN MODEL
  68. 68. 68JDD conference Java API for Data mining, JSR 73 and JSR 247 • javax.datamining.supervised defines the supervised function-related interfaces • javax.datamining.algorithm contains all mining algorithm subclass packages • JDM 2.0 adds Text Mining, Time series and so on.. JDM
  69. 69. 69JDD conference Who knows Weka?
  70. 70. 70JDD conference • Connectors to R, Octave, Matlab, Hadoop, NoSQL/SQL databases • Source code of all algorithms in Java • Preprocessing tools: discretization, normalization, resampling, attribute selection, transforming and combining Weka
  71. 71. 71JDD conference Weka
  72. 72. 72JDD conference SPMF • It’s codebase of algorithms in pattern mining field • It has cool examples and implementation of 109 algorithms • Cool performance results in specific area • Codebase grows very fast • Not so many classification algorithms are covered
  73. 73. 73JDD conference Mahout • Scalable machine learning with Samsara • Advanced Implementations of Java’s Collections Framework for better Performance. • New algorithms will build on Spark platform • Collaborative Filtering, Classification, Clustering, Dimensionality Reduction, Miscellaneous are supported
  74. 74. 74JDD conference Code sample Mahout (K-Means) // read the point values and generate vectors from input data final List vectors = vectorize(points); // Write data to sequence hadoop sequence files writePointsToFile(configuration, vectors); // Write initial centers for clusters writeClusterInitialCenters(configuration, vectors); // Run K-means algorithm final Path inputPath = new Path(POINTS_PATH); final Path clustersPath = new Path(CLUSTERS_PATH); final Path outputPath = new Path(OUTPUT_PATH); HadoopUtil.delete(configuration, outputPath); KMeansDriver.run(configuration, inputPath, clustersPath, outputPath, 0.001, 10, true, 0, false); // Read and print output values readAndPrintOutputValues(configuration);
  75. 75. 75JDD conference Hadoop ecosystem
  76. 76. HADOOP IS NOT SEXY
  77. 77. 77JDD conference Whaaaat?
  78. 78. 78JDD conference Map Reduce Job Writing
  79. 79. 79JDD conference YARN?
  80. 80. 80JDD conference SPARK: the bloody son of MR • MapReduce in memory • Up to 50x faster than Hadoop • RDD is a basic building block (immutable distributed collections of objects)
  81. 81. 81JDD conference Mahout’s killer?
  82. 82. 82JDD conference MLlib supports • Classification and regression • Collaborative filtering • Clustering • Dimensionality reduction • Optimization
  83. 83. 83JDD conference Code sample MLlib (K-Means) // Cluster the data into two classes using KMeans int numClusters = 2; int numIterations = 20; KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations); // Evaluate clustering by computing Within Set Sum of Squared Errors double WSSSE = clusters.computeCost(parsedData.rdd()); System.out.println("Within Set Sum of Squared Errors = " + WSSSE); // Save and load model clusters.save(sc.sc(), "myModelPath"); KMeansModel sameModel = KMeansModel.load(sc.sc(), "myModelPath");
  84. 84. 84JDD conference MLlib • .. extends scikit-learn (Python lib) and Mahout • .. runs fully on Spark • .. is documented • .. is well for large datasets and parallelized algorithms
  85. 85. 85JDD conference It solves all problems!
  86. 86. 86JDD conference In conclusion • Think about your data
  87. 87. 87JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer
  88. 88. 88JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer • Run Spark
  89. 89. 89JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer • Run Spark • Learn algorithms
  90. 90. 90JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer • Run Spark • Learn algorithms • Write Java code
  91. 91. 91JDD conference Hold your data and go ahead!
  92. 92. 92JDD conference CALL ME IF YOU WANT TO KNOW MORE Thanks a lot
  93. 93. 93JDD conference Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw LinkedIn: https://www.linkedin.com/in/zaleslaw

×