JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev

THORNY PATH TO
DATA MINING PROJECTS
Alexey Zinovyev, Java Trainer at EPAM

2JDD conference
About
I am a <graph theory, machine learning,
traffic jams prediction, BigData algorithms>
scientist
But I'm a <Java, NoSQL, Hadoop, Spark>
programmer

What do I know about Krakow?

Bigos was being warmed in the kettles; it is hard to put into words
the wondrous taste, color and marvelous aroma of bigos

True story about Poland!

In this talk …
A lot of strange pictures and technologies from a crazy zoo
We will talk about
• Data Mining
• Hadoop ecosystem
• Spark and its friends
• Machine Learning libraries

Are you a Hadoop developer?

Let’s do THIS!

The Good Old Days

One of these fine days...

We need a Python dev 'cause Data Mining

No, you are only a Java EE developer, carry on …

Write your backends, dude!

Let’s talk about it, Java-boy...

Can a Java programmer be a Data Scientist?

Sexy Data Scientist

Real Data Scientist

And what I tell you, young man

Not OLAP, 100%

Hey, man, predict something!

It’s Time for Java Superhero, yeah!

Before discovering patterns you should ..
• Select small pieces
• Define default values for missing data
• Remove strange signals from data
• Merge some tables into one if required
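Two of the steps above (default values for missing data, removing strange signals) can be sketched in plain Java. This is only an illustration under assumptions: missing values are encoded as NaN, the default is the mean of the known values, and a "strange signal" is any value farther than a chosen distance from the median. The class and method names are hypothetical.

```java
import java.util.Arrays;

class Preprocess {
    // Replace NaN entries (assumed to mark missing data) with the mean of the known entries.
    static double[] fillMissingWithMean(double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(0.0); // no known values at all: fall back to 0.0
        return Arrays.stream(values)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
    }

    // Keep only values whose distance from the median is at most maxDeviation.
    static double[] removeOutliers(double[] values, double maxDeviation) {
        if (values.length == 0) {
            return values.clone();
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        double median = sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
        return Arrays.stream(values)
                .filter(v -> Math.abs(v - median) <= maxDeviation)
                .toArray();
    }
}
```

Removing outliers before filling gaps keeps extreme values from distorting the mean used as the default.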

Typical questions for DM
• Which loan applicants are high-risk?
• How do we detect phone card fraud?
• What is the revenue prediction for next year?
• Can you recommend music for users?

Datasets
• Facebook users, tweets
• Trade transactions
• Government
• Medicine (genomic data)
• Telecommunications

Data Sources
• Relational Databases
• Data warehouses (Historical data)
• Files in CSV or in binary format
• Internet or e-mail
• Scientific, research (R, Octave, Matlab)

Association rule learning
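Association rules are commonly scored by support and confidence. A minimal sketch in plain Java, assuming each transaction is a set of item names; the class and method names are illustrative and do not come from any library mentioned in this talk.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssociationRules {
    // Support: the fraction of transactions that contain every item of the given set.
    static double support(List<Set<String>> transactions, Set<String> items) {
        if (transactions.isEmpty()) {
            return 0.0;
        }
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    // Confidence of the rule lhs -> rhs: support(lhs and rhs together) / support(lhs).
    static double confidence(List<Set<String>> transactions, Set<String> lhs, Set<String> rhs) {
        Set<String> both = new HashSet<>(lhs);
        both.addAll(rhs);
        double lhsSupport = support(transactions, lhs);
        return lhsSupport == 0.0 ? 0.0 : support(transactions, both) / lhsSupport;
    }
}
```

For four baskets {bread, milk}, {bread, butter}, {bread, milk, butter}, {milk}, the rule bread -> milk has support 2/4 and confidence 2/3.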

What is Cluster Analysis?
It is the process of grouping objects so that objects in the same
group (cluster) are more similar to each other than to objects in
other groups; no class labels are known in advance.
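As an illustration of clustering without labels, here is a minimal k-means (Lloyd's algorithm) sketch on one-dimensional points. It is plain Java with hypothetical names, it assumes the caller supplies the initial centroids, and it is not the Mahout or MLlib implementation shown later.

```java
class KMeans1D {
    // Runs Lloyd's algorithm and returns the final centroids.
    static double[] cluster(double[] points, double[] initialCentroids, int maxIterations) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // Assignment step: each point goes to its nearest centroid.
            for (double p : points) {
                int c = nearest(centroids, p);
                sum[c] += p;
                count[c]++;
            }
            // Update step: each centroid moves to the mean of its points.
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double updated = sum[c] / count[c];
                if (updated != centroids[c]) {
                    changed = true;
                }
                centroids[c] = updated;
            }
            if (!changed) {
                break; // converged
            }
        }
        return centroids;
    }

    // Index of the centroid closest to p (the first one wins on ties).
    static int nearest(double[] centroids, double p) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                best = c;
            }
        }
        return best;
    }
}
```

The result depends on the initial centroids, which is one reason different runs and different algorithms can give different clusters.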

Different algorithms – different results

Classification
• Training set of classified examples (supervised learning)
• Test set of non-classified items
• Main goal: find a function (classifier) that maps input data to a category (class)

Decision trees

kNN (k-nearest neighbors)
Is the green circle a blue square or a red triangle?
Let’s ask its neighbors!
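"Asking the neighbors" can be sketched as a majority vote among the k closest training points. The code below is plain Java under assumptions: points are two-dimensional, distance is Euclidean, and the `Knn` and `Sample` names are hypothetical. Ties between labels are broken arbitrarily, so an odd k is the safer choice for two classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Knn {
    // A labeled 2D training point (hypothetical type for this sketch).
    static final class Sample {
        final double x;
        final double y;
        final String label;

        Sample(double x, double y, String label) {
            this.x = x;
            this.y = y;
            this.label = label;
        }
    }

    // Classify the point (x, y) by majority vote among its k nearest training samples.
    static String classify(List<Sample> training, double x, double y, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        List<Sample> byDistance = new ArrayList<>(training);
        // Squared Euclidean distance gives the same ordering as the real distance.
        byDistance.sort(Comparator.comparingDouble(
                (Sample s) -> (s.x - x) * (s.x - x) + (s.y - y) * (s.y - y)));
        Map<String, Integer> votes = new HashMap<>();
        for (Sample s : byDistance.subList(0, Math.min(k, byDistance.size()))) {
            votes.merge(s.label, 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalArgumentException("training set is empty"));
    }
}
```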

Why not Octave?
• A small number of ML algorithms
• All your matrixes are belong to us!
• Single-threaded model
• Java support
• Octave in Java?

Do you like
this GUI?

Why not R?
• 25% of R packages are written in Java
• Syntax is too sweet
• You should read 1000 lines of docs to write 1 line of code
• Single-threaded model for 95% of algorithms

Why not Python?
Now Python is an idol for young scientists due to the low barrier to entry
• High-level language
• Have you ever heard of Jython?
• Long way to real high-load production
• We are not Python developers

How do we build features from data in a Hadoop cluster?

PIG (Triangle count)
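The Pig script from this slide did not survive in the text. For readers who want to see what triangle counting computes, here is a single-machine sketch in plain Java (not Pig), assuming an undirected graph given as an edge list; the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TriangleCount {
    // Counts triangles in an undirected graph; each edge is a pair {u, v}.
    static long count(int[][] edges) {
        Map<Integer, Set<Integer>> adjacency = new HashMap<>();
        for (int[] e : edges) {
            if (e[0] == e[1]) {
                continue; // self-loops cannot be part of a triangle
            }
            adjacency.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adjacency.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        long triangles = 0;
        for (Map.Entry<Integer, Set<Integer>> entry : adjacency.entrySet()) {
            int u = entry.getKey();
            Set<Integer> neighborsOfU = entry.getValue();
            for (int v : neighborsOfU) {
                if (v <= u) {
                    continue;
                }
                // Requiring u < v < w counts every triangle exactly once.
                for (int w : adjacency.get(v)) {
                    if (w > v && neighborsOfU.contains(w)) {
                        triangles++;
                    }
                }
            }
        }
        return triangles;
    }
}
```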

Why do we need a special graph approach?

MapReduce for iterative calculations
• Reducing a graph problem to the key-value model is highly complex
• Iterative algorithms turn into multiple chained M/R jobs, with the
full state saved and re-read on each iteration
Think like a vertex…
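"Think like a vertex" can be illustrated with a toy superstep loop: in every superstep each vertex looks at the values its neighbors held in the previous superstep and adopts the largest one, and the computation stops when no vertex changes. This is a simplified single-JVM sketch with hypothetical names, not the API of any graph framework.

```java
import java.util.ArrayList;
import java.util.List;

class VertexCentric {
    // Propagates the maximum value through each connected component.
    // Vertices are numbered 0..n-1; each edge is an undirected pair {u, v}.
    static int[] propagateMax(int[] initialValues, int[][] edges) {
        int n = initialValues.length;
        List<List<Integer>> neighbors = new ArrayList<>();
        for (int v = 0; v < n; v++) {
            neighbors.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            neighbors.get(e[0]).add(e[1]);
            neighbors.get(e[1]).add(e[0]);
        }
        int[] values = initialValues.clone();
        boolean anyChanged = true;
        while (anyChanged) {
            anyChanged = false;
            // Writing into a copy models the barrier between supersteps:
            // every vertex only sees values from the previous superstep.
            int[] next = values.clone();
            for (int v = 0; v < n; v++) {
                for (int u : neighbors.get(v)) {
                    if (values[u] > next[v]) {
                        next[v] = values[u];
                        anyChanged = true;
                    }
                }
            }
            values = next;
        }
        return values;
    }
}
```

The state stays in memory between supersteps, which is the contrast with chained M/R jobs that save and re-read it each time.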

Data vs
Graph

JDM
Java API for Data Mining, JSR 73 and JSR 247
• javax.datamining.supervised defines the supervised
function-related interfaces
• javax.datamining.algorithm contains all mining algorithm
subclass packages
• JDM 2.0 adds text mining, time series and so on

Who knows Weka?

Weka
• Connectors to R, Octave, Matlab, Hadoop, NoSQL/SQL databases
• Source code of all algorithms in Java
• Preprocessing tools: discretization, normalization,
resampling, attribute selection, transforming and combining

SPMF
• It is a codebase of algorithms from the pattern mining field
• It has cool examples and implementations of 109 algorithms
• Cool performance results in its specific area
• The codebase grows very fast
• Not many classification algorithms are covered

Mahout
• Scalable machine learning with Samsara
• Advanced implementations of Java’s Collections Framework
for better performance
• New algorithms will be built on the Spark platform
• Collaborative Filtering, Classification, Clustering,
Dimensionality Reduction, Miscellaneous are supported

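Code sample Mahout (K-Means)
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.math.Vector;

// vectorize, writePointsToFile, writeClusterInitialCenters and
// readAndPrintOutputValues are helper methods not shown on the slide
// Read the point values and generate vectors from input data
// (assuming vectorize returns Mahout Vector objects)
final List<Vector> vectors = vectorize(points);
// Write data to Hadoop sequence files
writePointsToFile(configuration, vectors);
// Write initial centers for clusters
writeClusterInitialCenters(configuration, vectors);
// Run K-means algorithm
final Path inputPath = new Path(POINTS_PATH);
final Path clustersPath = new Path(CLUSTERS_PATH);
final Path outputPath = new Path(OUTPUT_PATH);
// Delete results of a previous run
HadoopUtil.delete(configuration, outputPath);
// Arguments after the paths: convergence delta, max iterations, run clustering,
// cluster classification threshold, run sequentially
KMeansDriver.run(configuration, inputPath, clustersPath, outputPath, 0.001, 10, true, 0, false);
// Read and print output values
readAndPrintOutputValues(configuration);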

Hadoop
ecosystem

Map Reduce Job Writing
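The job code from this slide is not in the text. As a reminder of the shape of a MapReduce job, here is word count simulated in plain Java: a map step that emits (word, 1) pairs, a shuffle that groups values by key, and a reduce step that sums them. It deliberately avoids the Hadoop API; all names are hypothetical and everything runs in a single JVM.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCountMapReduce {
    // Map phase: emit a (word, 1) pair for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum all counts collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: map every line, group pairs by key (the shuffle), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }
}
```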

SPARK: the bloody son of MR
• MapReduce in memory
• Up to 50x faster than Hadoop
• RDD is a basic building block (an immutable distributed
collection of objects)

Mahout’s killer?

MLlib supports
• Classification and regression
• Collaborative filtering
• Clustering
• Dimensionality reduction
• Optimization

Code sample MLlib (K-Means)
// Cluster the data into two classes using KMeans
int numClusters = 2;
int numIterations = 20;
KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
// Evaluate clustering by computing Within Set Sum of Squared Errors
double WSSSE = clusters.computeCost(parsedData.rdd());
System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
// Save and load model
clusters.save(sc.sc(), "myModelPath");
KMeansModel sameModel = KMeansModel.load(sc.sc(), "myModelPath");

MLlib
• .. builds on ideas from scikit-learn (Python lib) and Mahout
• .. runs fully on Spark
• .. is documented
• .. works well for large datasets and parallelized algorithms

It solves all problems!

In conclusion
• Think about your data
• Make friends with a DevOps engineer
• Run Spark
• Learn algorithms
• Write Java code

Hold your data and go ahead!

CALL ME
IF YOU WANT TO KNOW MORE
Thanks a lot

Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw
LinkedIn: https://www.linkedin.com/in/zaleslaw
