Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

- Introduction to type classes by Pawel Szulc 442 views
- Monads asking the right question by Pawel Szulc 672 views
- The cats toolbox a quick tour of s... by Pawel Szulc 204 views
- Architektura to nie bzdura by Pawel Szulc 1520 views
- Make your programs free. by Pawel Szulc 288 views
- Category theory is general abolute ... by Pawel Szulc 594 views

221 views

Published on

Java is often criticized for hard parsing CSV datasets, poor matrix and vectors manipulations. This makes it hard to easy and efficiently implement certain types of machine learning algorithms. In many cases data scientists choose R or Python languages for modeling and problem solution and you as a Java developer should rewrite R algorithms in Java or integrate many small Python scripts in Java application.

But why so many Highload tools like Cassandra, Hadoop, Giraph, Spark are written in Java or executed on JVM? What is the secret of successful implementation and running? Maybe we should forget old manufacturing approach when we separate developers from research engineers in production projects?

During the report, we will discuss typical Data Mining tasks, advantages and disadvantages of Hadoop ecosystem, battle between Spark and Hadoop for a place under the Sun, difference between popular Machine Learning tools and libraries.

Attendees of my talk will become more familiar with different abbreviations and buzz words and also will get useful tips about self-education way in this area.

Published in:
Software

No Downloads

Total views

221

On SlideShare

0

From Embeds

0

Number of Embeds

2

Shares

0

Downloads

2

Comments

0

Likes

1

No embeds

No notes for slide

- 1. THORNY PATH TO DATA MINING PROJECTS Alexey Zinovyev, Java Trainer in EPAM
- 2. 2JDD conference About I am a <graph theory, machine learning, traffic jams prediction, BigData algorithms> scientist But I'm a <Java, NoSQL, Hadoop, Spark> programmer
- 3. 3JDD conference What do I know about Krakow?
- 4. 4JDD conference W kociołkach bigos grzano; w słowach wydać trudno bigosu smak przedziwny, kolor i woń cudną
- 5. 5JDD conference True story about Poland!
- 6. 6JDD conference In this topic … A lot of strange pictures and technologies from crazy zoo We talk about • Data Mining • Hadoop ecosystem • Spark and its friends • Machine Learning libraries
- 7. 7JDD conference Are you a Hadoop developer?
- 8. 8JDD conference Let’s do THIS!
- 9. 9JDD conference The Good Old Days
- 10. 10JDD conference One of these fine days...
- 11. 11JDD conference We need in Python dev 'cause Data Mining
- 12. 12JDD conference No, you are JavaEE developer only, continue …
- 13. 13JDD conference Write your backends, dude!
- 14. 14JDD conference Let’s talk about it, Java-boy...
- 15. 15JDD conference Can a Java programmer to be a Data Scientist?
- 16. 16JDD conference Sexy Data Scientist
- 17. 17JDD conference Real Data Scientist
- 18. 18JDD conference And what I tell you, young man
- 19. 19JDD conference And what I tell you, young man
- 20. WHAT IS DATA MINING?
- 21. 21JDD conference Statistics?
- 22. 22JDD conference Not OLAP, 100%
- 23. 23JDD conference Hey, man, predict something!
- 24. 24JDD conference Hey, man, predict something!
- 25. 25JDD conference Man or sofa?
- 26. 26JDD conference It’s Time for Java Superhero, yeah!
- 27. 27JDD conference Before patterns discovering you should .. • Select small pieces • Define default values for missed data • Remove strange signals from data • Merge some tables in one if required
- 28. SUBJECT AREA
- 29. 29JDD conference Typical questions for DM • Which loan applicants are high-risk?
- 30. 30JDD conference Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud?
- 31. 31JDD conference Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud? • What is the revenue prediction for next year?
- 32. 32JDD conference Typical questions for DM • Which loan applicants are high-risk? • How do we detect phone card fraud? • What is the revenue prediction for next year? • Can you recommend music for users?
- 33. DATA
- 34. 34JDD conference Datasets • Facebook users, tweets • Trade transactions • Government • Medicine (genomic data) • Telecommunications
- 35. 35JDD conference Data Sources • Relational Databases • Data warehouses (Historical data) • Files in CSV or in binary format • Internet or electronic mails • Scientific, research (R, Octave, Matlab)
- 36. PATTERN MINING
- 37. 37JDD conference Association rule learning
- 38. 38JDD conference What is Cluster Analysis? It is the process of finding model of function that describes and distinguishes data class to predict the class of objects whose class label is unknown.
- 39. 39JDD conference Different algorithms – different results
- 40. 40JDD conference Regression
- 41. 41JDD conference • Training set of classified examples (supervised learning) Classification
- 42. 42JDD conference • Training set of classified examples (supervised learning) • Test set of non-classified items Classification
- 43. 43JDD conference • Training set of classified examples (supervised learning) • Test set of non-classified items • Main goal: find a function (classifier) that maps input data to a category (class) Classification
- 44. 44JDD conference Decision trees
- 45. 45JDD conference Cruel Tree
- 46. 46JDD conference Green circle is blue square or red triangle? Let’s ask its neighbors! kNN (k-nearest neighbor)
- 47. FASHION LANGUAGES
- 48. 48JDD conference Octave
- 49. 49JDD conference • A small amount of ML algorithms • All your matrixes are belong to us! • Single thread model • Java support • Octave in Java? Why not Octave?
- 50. 50JDD conference Do you like this GUI?
- 51. 51JDD conference • 25% of R packs are written in Java • Syntax is too sweet • You should read 1000 lines in docs to write 1 line of code • Single thread model for 95% algorithms Why not R?
- 52. 52JDD conference Now Python is an idol for young scientists due to the low barrier to entry Why not Python?
- 53. 53JDD conference • High-level language • Have you ever heard about a Jython? • Long way to real Highload production • We are not Python developers Why not Python?
- 54. 54JDD conference DM libraries
- 55. 55JDD conference Think Java
- 56. JAVA ECOSYSTEM
- 57. 57JDD conference Hadoop
- 58. 58JDD conference How to make features from Hadoop cluster?
- 59. 59JDD conference Pig & Hive
- 60. 60JDD conference Hive
- 61. 61JDD conference PIG (Triangle count)
- 62. 62JDD conference Why do we need in special graph approach?
- 63. HOW TO MAKE GRAPH FEATURES
- 64. 64JDD conference SNA
- 65. 65JDD conference MapReduce for iterative calculations • High complexity of graph problem reduction to key-value model • Iteration algorithms, but multiple chained jobs in M/R with full saving and reading of each state Think like a vertex…
- 66. 66JDD conference Data vs Graph
- 67. 67JDD conference TRAIN MODEL
- 68. 68JDD conference Java API for Data mining, JSR 73 and JSR 247 • javax.datamining.supervised defines the supervised function-related interfaces • javax.datamining.algorithm contains all mining algorithm subclass packages • JDM 2.0 adds Text Mining, Time series and so on.. JDM
- 69. 69JDD conference Who knows Weka?
- 70. 70JDD conference • Connectors to R, Octave, Matlab, Hadoop, NoSQL/SQL databases • Source code of all algorithms in Java • Preprocessing tools: discretization, normalization, resampling, attribute selection, transforming and combining Weka
- 71. 71JDD conference Weka
- 72. 72JDD conference SPMF • It’s codebase of algorithms in pattern mining field • It has cool examples and implementation of 109 algorithms • Cool performance results in specific area • Codebase grows very fast • Not so many classification algorithms are covered
- 73. 73JDD conference Mahout • Scalable machine learning with Samsara • Advanced Implementations of Java’s Collections Framework for better Performance. • New algorithms will build on Spark platform • Collaborative Filtering, Classification, Clustering, Dimensionality Reduction, Miscellaneous are supported
- 74. 74JDD conference Code sample Mahout (K-Means) // read the point values and generate vectors from input data final List vectors = vectorize(points); // Write data to sequence hadoop sequence files writePointsToFile(configuration, vectors); // Write initial centers for clusters writeClusterInitialCenters(configuration, vectors); // Run K-means algorithm final Path inputPath = new Path(POINTS_PATH); final Path clustersPath = new Path(CLUSTERS_PATH); final Path outputPath = new Path(OUTPUT_PATH); HadoopUtil.delete(configuration, outputPath); KMeansDriver.run(configuration, inputPath, clustersPath, outputPath, 0.001, 10, true, 0, false); // Read and print output values readAndPrintOutputValues(configuration);
- 75. 75JDD conference Hadoop ecosystem
- 76. HADOOP IS NOT SEXY
- 77. 77JDD conference Whaaaat?
- 78. 78JDD conference Map Reduce Job Writing
- 79. 79JDD conference YARN?
- 80. 80JDD conference SPARK: the bloody son of MR • MapReduce in memory • Up to 50x faster than Hadoop • RDD is a basic building block (immutable distributed collections of objects)
- 81. 81JDD conference Mahout’s killer?
- 82. 82JDD conference MLlib supports • Classification and regression • Collaborative filtering • Clustering • Dimensionality reduction • Optimization
- 83. 83JDD conference Code sample MLlib (K-Means) // Cluster the data into two classes using KMeans int numClusters = 2; int numIterations = 20; KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations); // Evaluate clustering by computing Within Set Sum of Squared Errors double WSSSE = clusters.computeCost(parsedData.rdd()); System.out.println("Within Set Sum of Squared Errors = " + WSSSE); // Save and load model clusters.save(sc.sc(), "myModelPath"); KMeansModel sameModel = KMeansModel.load(sc.sc(), "myModelPath");
- 84. 84JDD conference MLlib • .. extends scikit-learn (Python lib) and Mahout • .. runs fully on Spark • .. is documented • .. is well for large datasets and parallelized algorithms
- 85. 85JDD conference It solves all problems!
- 86. 86JDD conference In conclusion • Think about your data
- 87. 87JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer
- 88. 88JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer • Run Spark
- 89. 89JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer • Run Spark • Learn algorithms
- 90. 90JDD conference In conclusion • Think about your data • Have friendship with DevOps engineer • Run Spark • Learn algorithms • Write Java code
- 91. 91JDD conference Hold your data and go ahead!
- 92. 92JDD conference CALL ME IF YOU WANT TO KNOW MORE Thanks a lot
- 93. 93JDD conference Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw LinkedIn: https://www.linkedin.com/in/zaleslaw

No public clipboards found for this slide

Be the first to comment