Machine Learning for Java Developers
Nirmal Fernando
WSO2 Inc.
{Java Colombo}
Few things about me...
● Associate Technical Lead at WSO2
● Team Lead of WSO2 Machine Learner
● Just completed 4th year in the industry
● Graduated from Department of Computer Science, University
of Moratuwa.
● Schooled at St. Sebastian’s College, Moratuwa.
● Can sing a bit :-)
https://goo.gl/qbAXLz
Predictive Analytics
Extract information from existing datasets to determine patterns and predict future outcomes and trends.
It does not tell you what will happen in the future, but it forecasts what might happen with an acceptable level of reliability.
source: http://insidebigdata.com/2014/08/25/salespredict-marketo-partner-using-predictive-analytics/
Predictive Analytics
The Forrester Research report “Big Data Predictive Analytics” was the second most read Forrester report in Q3 2015.
https://www.forrester.com
Predictive Analytics - Use cases
http://californialoanfind.com/what-and-who-is-teletrack/
Predictive Analytics - Use cases
http://www.chrisdunn.com/
Machine Learning
Field of study that gives computers
the ability to learn
without being explicitly
programmed.
- Arthur Samuel (1959)
Machine Learning - Pipeline
Machine Learning - Terminology
● Input data must be in tabular format
● Each row is called a data point
● Each column is called a feature
● Value you are going to predict is called the “response
variable”
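To make the terminology concrete, here is a minimal plain-Java sketch of one row of a tabular dataset. The class name `DataPoint`, the choice of the last column as the response variable, and the sample values are all assumptions made for illustration; they do not come from any particular tool.

```java
import java.util.Arrays;
import java.util.List;

/** One row of the table: the feature values plus the response variable. */
final class DataPoint {
    final double[] features;   // one value per column (feature)
    final double response;     // the value we want to predict

    DataPoint(double[] features, double response) {
        this.features = features;
        this.response = response;
    }

    /** Parses one comma-separated row; the last column is taken as the response (an assumption). */
    static DataPoint parse(String csvLine) {
        String[] tokens = csvLine.split(",");
        double[] features = new double[tokens.length - 1];
        for (int i = 0; i < features.length; i++) {
            features[i] = Double.parseDouble(tokens[i].trim());
        }
        double response = Double.parseDouble(tokens[tokens.length - 1].trim());
        return new DataPoint(features, response);
    }

    public static void main(String[] args) {
        // Made-up rows: two features followed by the response variable.
        List<String> rows = Arrays.asList("5.1, 3.5, 0", "6.2, 2.9, 1");
        for (String row : rows) {
            DataPoint p = DataPoint.parse(row);
            System.out.println(Arrays.toString(p.features) + " -> " + p.response);
        }
    }
}
```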
Machine Learning - What type of a problem?
● Next value prediction
● Classification
● Clustering
● Recommendations
etc…
Next value prediction
Example of linear regression on one
independent variable
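As a rough illustration of next value prediction, the following sketch fits a straight line to one independent variable using ordinary least squares. The class name `SimpleLinearRegression` and the sample numbers are made up for this example; it is a teaching sketch, not how any of the libraries mentioned in this deck implement regression.

```java
/** Ordinary least squares fit of y = intercept + slope * x for one independent variable. */
final class SimpleLinearRegression {

    /** Returns {intercept, slope} for the given observations. */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }

    /** Predicts the next value for a new x using a fitted model. */
    static double predict(double[] model, double x) {
        return model[0] + model[1] * x;
    }

    public static void main(String[] args) {
        // Made-up observations that lie on the line y = 1 + 2x.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] model = fit(x, y);
        System.out.println("intercept=" + model[0] + " slope=" + model[1]);
        System.out.println("prediction for x=5: " + predict(model, 5));
    }
}
```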
Classification
Predicting a discrete value.
Clustering
Grouping similar data points together.
Recommendations
Seeking to predict the preference a user would give to an item/product.
Machine Learning - Which algorithm category?
● Supervised learning
● Unsupervised learning
● Reinforcement learning
Supervised vs Unsupervised
Supervised Learning Algorithms
Regression
● Linear Regression
● Lasso Regression
● Ridge Regression
Classification
● Logistic Regression
● Support Vector Machine (SVM)
● Decision Tree
● Random Forest
● Naive Bayes
● Bayesian Network
Unsupervised Learning Algorithms
Clustering
K-means
K-medians
Hierarchical Clustering
….
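To give a feel for how a clustering algorithm groups similar data points, here is a small sketch of K-means (Lloyd's algorithm) restricted to one-dimensional points. The class name `KMeans1D`, the caller-supplied initial centroids, and the sample points are assumptions made to keep the example short and deterministic; real implementations work on multi-dimensional data and usually pick the initial centroids themselves.

```java
import java.util.Arrays;

/** K-means (Lloyd's algorithm) on one-dimensional points, for illustration only. */
final class KMeans1D {

    /** Repeats the assignment and update steps until the centroids stop moving. */
    static double[] cluster(double[] points, double[] initialCentroids, int maxIterations) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];

            // Assignment step: each point joins its nearest centroid.
            for (double p : points) {
                int c = nearest(centroids, p);
                sum[c] += p;
                count[c]++;
            }

            // Update step: each centroid moves to the mean of its points.
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double updated = sum[c] / count[c];
                if (updated != centroids[c]) {
                    centroids[c] = updated;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
        }
        return centroids;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] centroids, double p) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up points forming two obvious groups.
        double[] points = {1, 2, 3, 10, 11, 12};
        double[] centroids = cluster(points, new double[] {0, 5}, 10);
        System.out.println(Arrays.toString(centroids));
    }
}
```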
Java tools for Machine Learning
Tool         License                      URL
Weka         GNU General Public License   http://www.cs.waikato.ac.nz/ml/weka/
JSAT         GPL v3                       https://github.com/EdwardRaff/JSAT
Mahout       Apache v2                    https://mahout.apache.org/
Spark MLlib  Apache v2                    http://spark.apache.org/mllib/
Apache Spark MLlib - scalable machine learning library
Speed
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Ease of Use
Write applications quickly in Java, Scala, Python, R.
Easy to Deploy
Runs on existing Hadoop clusters and data.
Apache Spark - a few terms
SparkConf - Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
SparkContext / JavaSparkContext - Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster. Only one SparkContext may be active per JVM.
RDD / JavaRDD - A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
Apache Spark - a few operations on an RDD
Filter - Returns a new dataset formed by selecting those elements of the source on which a function returns true.
Map - Returns a new distributed dataset formed by passing each element of the source through a function.
Random Split - Splits a dataset randomly based on a given ratio.
Cache - Persists (caches) a dataset in memory across operations.
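The semantics of these operations can be sketched with plain `java.util.stream` code. This is only an in-memory analogy and not the Spark API: Spark applies the same ideas to a distributed dataset and evaluates transformations lazily. The class name `RddStyleOperations`, the `#` comment convention, and the sample lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

/** Plain-Java analogy of filter, map and random split (not Spark code). */
final class RddStyleOperations {

    /** Filter: keep only the elements for which the predicate returns true. */
    static List<String> dropCommentsAndBlanks(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .collect(Collectors.toList());
    }

    /** Map: pass each element through a function (here, a CSV line becomes a double[]). */
    static List<double[]> parse(List<String> lines) {
        return lines.stream()
                .map(line -> Arrays.stream(line.split(","))
                        .map(String::trim)
                        .mapToDouble(Double::parseDouble)
                        .toArray())
                .collect(Collectors.toList());
    }

    /** Random split: each element lands in the first part with probability `ratio`. */
    static <T> List<List<T>> randomSplit(List<T> data, double ratio, long seed) {
        Random random = new Random(seed);
        List<T> first = new ArrayList<>();
        List<T> second = new ArrayList<>();
        for (T item : data) {
            if (random.nextDouble() < ratio) {
                first.add(item);
            } else {
                second.add(item);
            }
        }
        return Arrays.asList(first, second);
    }

    public static void main(String[] args) {
        // Made-up input lines; the collected lists stay in memory, loosely like a cached dataset.
        List<String> lines = Arrays.asList("# header", "1,2", "", "3,4", "5,6");
        List<double[]> rows = parse(dropCommentsAndBlanks(lines));
        List<List<double[]>> parts = randomSplit(rows, 0.7, 42L);
        System.out.println("training rows: " + parts.get(0).size());
        System.out.println("test rows: " + parts.get(1).size());
    }
}
```

Because the split is random per element, the two parts only approximate the requested ratio, especially on small inputs.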
Let’s solve a classification problem using Apache Spark
● Dataset
Pima Indian diabetes dataset
https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Number of instances : 768
Number of features : 8
● Response variable
Name : class
Values : 0 or 1
Interpretation : Whether a given Pima Indian has diabetes or not
● Objective
Build a classification model to predict whether a given Pima Indian has diabetes or not.
Let’s try to build a Logistic Regression model for this.
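To show what a logistic regression model does with a 0 or 1 response variable, here is a minimal plain-Java sketch trained with batch gradient descent. It is not the Spark MLlib implementation and it does not use the Pima dataset: the class name `TinyLogisticRegression`, the single made-up feature, the learning rate, and the iteration count are all assumptions chosen for illustration.

```java
/** Binary logistic regression trained with batch gradient descent (illustrative only). */
final class TinyLogisticRegression {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Probability that the response is 1; w[0] is the intercept. */
    static double probability(double[] w, double[] x) {
        double z = w[0];
        for (int j = 0; j < x.length; j++) {
            z += w[j + 1] * x[j];
        }
        return sigmoid(z);
    }

    /** Minimizes the mean log loss and returns the learned weights. */
    static double[] train(double[][] features, int[] labels, double learningRate, int iterations) {
        int n = features.length;
        int d = features[0].length;
        double[] w = new double[d + 1];
        for (int it = 0; it < iterations; it++) {
            double[] gradient = new double[d + 1];
            for (int i = 0; i < n; i++) {
                double error = probability(w, features[i]) - labels[i];
                gradient[0] += error;
                for (int j = 0; j < d; j++) {
                    gradient[j + 1] += error * features[i][j];
                }
            }
            for (int j = 0; j <= d; j++) {
                w[j] -= learningRate * gradient[j] / n;
            }
        }
        return w;
    }

    /** Predicts class 1 when the estimated probability is at least 0.5. */
    static int predict(double[] w, double[] x) {
        return probability(w, x) >= 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        // Made-up data: one feature, small values are class 0 and large values are class 1.
        double[][] features = {{1}, {2}, {3}, {6}, {7}, {8}};
        int[] labels = {0, 0, 0, 1, 1, 1};
        double[] w = train(features, labels, 0.1, 10000);
        System.out.println("class for 2.0: " + predict(w, new double[] {2.0}));
        System.out.println("class for 7.0: " + predict(w, new double[] {7.0}));
    }
}
```

With the real dataset, each row would carry 8 features instead of one, and part of the data would be held back (for example with a random split) to check how well the model predicts unseen rows.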
Solution using Apache Spark
Code:
https://github.com/nirmal070125/ml-java-meetup
Few words on WSO2 Machine Learner
Powered by Apache Spark and Apache Spark MLlib.
● Manage and explore your data
● Analyze the data using machine learning algorithms
● Build machine learning models
● Compare and manage generated machine learning models
● Predict using the built models
● Use the built models with WSO2 CEP and WSO2 ESB.
http://wso2.com/products/machine-learner/

Editor's Notes

  • #6 Fraud detection
  • #7 Stock market prediction: the act of trying to determine the future value of a company stock or other financial instrument traded on an exchange. The successful prediction of a stock's future price could yield significant profit.
  • #16 Reinforcement learning : A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to its goal or not. Another example is learning to play a game by playing against an opponent
  • #23 Mention the row-wise operations