Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

JavaDayKiev'15 Java in production for Data Mining Research projects

765 views

Published on

Alexey Zinoviev presented this paper on the JavaDayKiev'15 conference http://javaday.org.ua/kyiv/#schedule

This paper covers next topics: Java, Spark, Hadoop, Mahout, MLlib, Weka, Machine Learning, Data Mining

Published in: Data & Analytics
  • Be the first to comment

JavaDayKiev'15 Java in production for Data Mining Research projects

  1. 1. Java in production for Data Mining Research projects Alexey Zinovyev, Java Trainer in EPAM
  2. 2. 2JDD conference About I am a <graph theory, machine learning, traffic jams prediction, BigData algorithms> scientist But I'm a <Java, NoSQL, Hadoop, Spark> programmer
  3. 3. 3JDD conference In this topic … A lot of strange pictures and technologies from crazy zoo We talk about • Data Mining • Hadoop ecosystem • Spark and its friends • Machine Learning libraries
  4. 4. 4JDD conference Are you a Hadoop developer?
  5. 5. 5JDD conference Let’s do THIS!
  6. 6. 6JDD conference The Good Old Days
  7. 7. 7JDD conference One of these fine days...
  8. 8. 8JDD conference We need in Python dev 'cause Data Mining
  9. 9. 9JDD conference No, you are JavaEE developer only, continue …
  10. 10. 10JDD conference Write your backends, dude!
  11. 11. 11JDD conference Let’s talk about it, Java-boy...
  12. 12. 12JDD conference Can a Java programmer to be a Data Scientist?
  13. 13. 13JDD conference Sexy Data Scientist
  14. 14. 14JDD conference Real Data Scientist
  15. 15. 15JDD conference And what I tell you, young man
  16. 16. 16JDD conference And what I tell you, young man
  17. 17. WHAT IS DATA MINING?
  18. 18. 18JDD conference Statistics?
  19. 19. 19JDD conference Tag cloud from B2B conference?
  20. 20. 20JDD conference Not OLAP, 100%
  21. 21. 21JDD conference Hey, man, predict something!
  22. 22. 22JDD conference Hey, man, predict something!
  23. 23. 23JDD conference Man or sofa?
  24. 24. 24JDD conference SUBJECT AREA
  25. 25. 25JDD conference Typical questions for DM • Which loan applicants are high-risk?
  26. 26. 26JDD conference Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud?
  27. 27. 27JDD conference Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud? • What is the revenue prediction for next year?
  28. 28. 28JDD conference Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud? • What is the revenue prediction for next year? • Can you recommend music for users?
  29. 29. 29JDD conference It’s Time for Java Superhero, yeah!
  30. 30. 30JDD conference Before patterns discovering you should .. • Select small pieces • Define default values for missed data • Remove strange signals from data • Merge some tables in one if required
  31. 31. 31JDD conference TARGET DATA & PERSONAL DATA
  32. 32. 32JDD conference Targeting by …
  33. 33. 33JDD conference Pay with your personal data All your personal data (PD) are being deeply mined
  34. 34. 34JDD conference Pay with your personal data The industry of collecting, aggregating, and brokering PD is “database marketing.”
  35. 35. 35JDD conference Pay with your personal data 1.1 billion browser cookies, 200 million mobile profiles, and an average of 1,500 pieces of data per consumer in Acxiom
  36. 36. 36JDD conference RTB
  37. 37. DATA
  38. 38. 38JDD conference Datasets • Facebook users, tweets • Trade transactions • Government • Medicine (genomic data) • Telecommunications
  39. 39. 39JDD conference Data Sources • Relational Databases • Data warehouses (Historical data) • Files in CSV or in binary format • Internet or electronic mails • Scientific, research (R, Octave, Matlab)
  40. 40. PATTERN MINING
  41. 41. 41JDD conference Association rule learning
  42. 42. 42JDD conference What is Cluster Analysis? It is the process of finding model of function that describes and distinguishes data class to predict the class of objects whose class label is unknown.
  43. 43. 43JDD conference Different algorithms – different results
  44. 44. 44JDD conference Regression
  45. 45. 45JDD conference • Training set of classified examples (supervised learning) Classification
  46. 46. 46JDD conference • Training set of classified examples (supervised learning) • Test set of non-classified items Classification
  47. 47. 47JDD conference • Training set of classified examples (supervised learning) • Test set of non-classified items • Main goal: find a function (classifier) that maps input data to a category (class) Classification
  48. 48. 48JDD conference Decision trees
  49. 49. 49JDD conference Cruel Tree
  50. 50. 50JDD conference Green circle is blue square or red triangle? Let’s ask its neighbors! kNN (k-nearest neighbor)
  51. 51. 51JDD conference Hit parade of algorithms
  52. 52. FASHION LANGUAGES
  53. 53. 53JDD conference Octave
  54. 54. 54JDD conference • A small amount of ML algorithms • All your matrixes are belong to us! • Single thread model • Java support • Octave in Java? Why not Octave?
  55. 55. 55JDD conference Do you like this GUI?
  56. 56. 56JDD conference • 25% of R packs are written in Java • Syntax is too sweet • You should read 1000 lines in docs to write 1 line of code • Single thread model for 95% algorithms Why not R?
  57. 57. 57JDD conference Now Python is an idol for young scientists due to the low barrier to entry Why not Python?
  58. 58. 58JDD conference • High-level language • Have you ever heard about a Jython? • Long way to real Highload production • We are not Python developers Why not Python?
  59. 59. 59JDD conference DM libraries
  60. 60. Let’s run on JVM!
  61. 61. JAVA ECOSYSTEM
  62. 62. Family
  63. 63. Spring Data
  64. 64. HADOOP
  65. 65. 65JDD conference Hadoop
  66. 66. 66JDD conference Hadoop and Data Knights
  67. 67. 67JDD conference MapReduce for WordCount
  68. 68. 68JDD conference How to make features from Hadoop cluster?
  69. 69. 69JDD conference Pig & Hive
  70. 70. 70JDD conference Hive
  71. 71. 71JDD conference PIG
  72. 72. 72JDD conference PIG (Triangle count)
  73. 73. 73JDD conference Pig • User Defined Functions (UDF)
  74. 74. 74JDD conference Pig • User Defined Functions (UDF) • FOREACH
  75. 75. 75JDD conference Pig • User Defined Functions (UDF) • FOREACH • Pipeline-style
  76. 76. 76JDD conference Pig • User Defined Functions (UDF) • FOREACH • Pipeline-style • Easy parallelization
  77. 77. 77JDD conference Pig
  78. 78. 78JDD conference Why do we need in special graph approach?
  79. 79. HOW TO MAKE GRAPH FEATURES
  80. 80. 80JDD conference SNA
  81. 81. 81JDD conference MapReduce for iterative calculations • High complexity of graph problem reduction to key-value model • Iteration algorithms, but multiple chained jobs in M/R with full saving and reading of each state Think like a vertex…
  82. 82. 82JDD conference Data vs Graph
  83. 83. 83JDD conference Messaging
  84. 84. 84JDD conference TRAIN MODEL
  85. 85. 85JDD conference Java API for Data mining, JSR 73 and JSR 247 • javax.datamining.supervised defines the supervised function-related interfaces • javax.datamining.algorithm contains all mining algorithm subclass packages • JDM 2.0 adds Text Mining, Time series and so on.. JDM
  86. 86. 86JDD conference Who knows Weka?
  87. 87. 87JDD conference • Connectors to R, Octave, Matlab, Hadoop, NoSQL/SQL databases • Source code of all algorithms in Java • Preprocessing tools: discretization, normalization, resampling, attribute selection, transforming and combining Weka
  88. 88. 88JDD conference Weka
  89. 89. 89JDD conference Weka + Hadoop
  90. 90. 90JDD conference SPMF • It’s codebase of algorithms in pattern mining field • It has cool examples and implementation of 109 algorithms • Cool performance results in specific area • Codebase grows very fast • Not so many classification algorithms are covered
  91. 91. 91JDD conference Mahout • Scalable machine learning with Samsara • Advanced Implementations of Java’s Collections Framework for better Performance. • New algorithms will build on Spark platform • Collaborative Filtering, Classification, Clustering, Dimensionality Reduction, Miscellaneous are supported
  92. 92. 92JDD conference Collaborative Filtering
  93. 93. 93JDD conference Code sample Mahout (K-Means) // read the point values and generate vectors from input data final List vectors = vectorize(points); // Write data to sequence hadoop sequence files writePointsToFile(configuration, vectors); // Write initial centers for clusters writeClusterInitialCenters(configuration, vectors); // Run K-means algorithm final Path inputPath = new Path(POINTS_PATH); final Path clustersPath = new Path(CLUSTERS_PATH); final Path outputPath = new Path(OUTPUT_PATH); HadoopUtil.delete(configuration, outputPath); KMeansDriver.run(configuration, inputPath, clustersPath, outputPath, 0.001, 10, true, 0, false); // Read and print output values readAndPrintOutputValues(configuration);
  94. 94. 94JDD conference Hadoop ecosystem
  95. 95. HADOOP IS NOT SEXY
  96. 96. 96JDD conference Whaaaat?
  97. 97. 97JDD conference Map Reduce Job Writing
  98. 98. 98JDD conference Hadoop Jobs
  99. 99. 99JDD conference Hadoop Jobs
  100. 100. 100JDD conference YARN?
  101. 101. 101JDD conference SPARK: the bloody son of MR • MapReduce in memory • Up to 50x faster than Hadoop • RDD is a basic building block (immutable distributed collections of objects) • Pipeline API (no needs in PIG)
  102. 102. 102JDD conference GC & Spark • > 100 GB for Spark apps • big pauses as a result • Garbage-First GC • play with spark.storage.memoryFraction (cached data/heap for transformation) • no one recipe
  103. 103. 103JDD conference Mahout’s killer?
  104. 104. 104JDD conference MLlib supports • Classification and regression • Collaborative filtering • Clustering • Dimensionality reduction • Optimization
  105. 105. 105JDD conference Code sample MLlib (K-Means) // Cluster the data into two classes using KMeans int numClusters = 2; int numIterations = 20; KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations); // Evaluate clustering by computing Within Set Sum of Squared Errors double WSSSE = clusters.computeCost(parsedData.rdd()); System.out.println("Within Set Sum of Squared Errors = " + WSSSE); // Save and load model clusters.save(sc.sc(), "myModelPath"); KMeansModel sameModel = KMeansModel.load(sc.sc(), "myModelPath");
  106. 106. 106JDD conference MLlib • .. extends scikit-learn (Python lib) and Mahout • .. runs fully on Spark and supports Spark’s Pipeline API • .. dataset is represented by Spark SQL’s SchemaRDD • .. supports Hive like external data source • .. is well for large datasets and parallelized algorithms
  107. 107. 107JDD conference It solves all problems!
  108. 108. 108JDD conference In conclusion • Think about your data
  109. 109. 109JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer
  110. 110. 110JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer • Run Spark
  111. 111. 111JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer • Run Spark • Learn algorithms
  112. 112. 112JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer • Run Spark • Learn algorithms • Write Java code
  113. 113. 113JDD conference Think Java
  114. 114. 114JDD conference Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw @BigDataRussia LinkedIn: https://www.linkedin.com/in/zaleslaw

×