Gather Data
Train Model
Compare Model Performance (ROC curve, AUC, etc.)
Predict the future
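As a rough illustration of the "compare model performance" step, the sketch below computes AUC for two sets of model scores on the same labels. The function name `auc`, the labels, and both score lists are made up for illustration; they are not from the slides.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels and the scores two models assign to them.
labels = [1, 0, 1, 1, 0, 0]
model_a_scores = [0.9, 0.2, 0.8, 0.6, 0.4, 0.1]
model_b_scores = [0.6, 0.5, 0.4, 0.7, 0.3, 0.8]

auc_a = auc(labels, model_a_scores)
auc_b = auc(labels, model_b_scores)
better = "A" if auc_a >= auc_b else "B"
```

The model with the higher AUC would then be the one used for the prediction step.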
Octave is also an open-source, high-level interpreted language. The Octave language is quite similar to MATLAB, so most programs are easily portable.
But both of these languages are limited in the volume of data they can handle and are not suitable for analytics on huge and dynamic data sets. Hadoop is a de facto standard for storing and processing huge volumes of data.
Two main tools, RHadoop and Mahout, were developed to leverage the distributed processing of the Hadoop framework.
Introduction of YARN: it opens the Hadoop framework to many other frameworks beyond MapReduce.
Rhadoop?
Rhipe?? 2012?
R, along with the RHadoop packages, needs to be installed on all the nodes, including the edge node. RHadoop submits the job from the client/edge node.
Mahout is a Java library with MapReduce implementations of machine learning algorithms. With Mahout, only the Mahout library needs to be present on the client/edge node; the Mahout job, which for distributed algorithms is an MR job, is then submitted to the Hadoop cluster.
R, along with the RHadoop and RHIPE packages, needs to be installed on all the nodes, including the edge node. RHadoop/RHIPE submits the job from the client/edge node.
• plyrmr - provides additional data manipulation capabilities
• rmr2 - functions providing Hadoop MapReduce functionality in R
• rhdfs - functions providing file management of HDFS from within R
• rhbase - functions providing database management for the HBase distributed database from within R
The logistic regression available in R cannot be reused directly.
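One reading of this note is that an algorithm such as logistic regression has to be re-expressed as map and reduce steps before it can run on Hadoop. The sketch below shows that restructuring in plain Python, without any Hadoop or RHadoop API: each "map" call computes a partial gradient over one data partition and a "reduce" call sums them. All function names, the toy data, and the hyperparameters are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def map_partial_gradient(partition, weights):
    """Map step: log-loss gradient summed over one data partition."""
    grad = [0.0] * len(weights)
    for features, label in partition:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for j, x in enumerate(features):
            grad[j] += (pred - label) * x
    return grad, len(partition)


def reduce_gradients(partials):
    """Reduce step: combine partial gradients into one mean gradient."""
    total = [0.0] * len(partials[0][0])
    count = 0
    for grad, n in partials:
        total = [t + g for t, g in zip(total, grad)]
        count += n
    return [t / count for t in total]


def train(partitions, n_features, learning_rate=0.5, iterations=200):
    """Batch gradient descent; every iteration is one map + reduce round."""
    weights = [0.0] * n_features
    for _ in range(iterations):
        partials = [map_partial_gradient(p, weights) for p in partitions]
        gradient = reduce_gradients(partials)
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


# Toy data split into two partitions; each row is ([bias, x], label).
partitions = [
    [([1.0, 0.0], 0), ([1.0, 1.0], 0)],
    [([1.0, 3.0], 1), ([1.0, 4.0], 1)],
]
weights = train(partitions, n_features=2)
```

Note that every iteration needs a full pass over all partitions, which is where the cost of iterative algorithms on MapReduce comes from.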
We saw adoption of Mahout-based recommendation engines across the industry…
- MapReduce tasks must be written as acyclic dataflow programs
- A stateless mapper is followed by a stateless reducer, both executed by a batch job scheduler
- Repeated querying of datasets becomes difficult
- This makes iterative algorithms hard to write
- After each MapReduce iteration, data has to be persisted to disk before the next iteration can proceed
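The points above can be sketched with a toy, single-process simulation of an iterative algorithm (1-D k-means) run as repeated map/reduce rounds. This is not Hadoop code: the function names, the JSON state file, and the sample data are assumptions chosen to show the stateless mapper and reducer and the write-to-disk step between iterations.

```python
import json
import os
import tempfile
from collections import defaultdict


def mapper(point, centroids):
    """Stateless map: emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: abs(point - centroids[i]))
    return nearest, point


def reducer(key, points):
    """Stateless reduce: average the points assigned to one centroid."""
    return key, sum(points) / len(points)


def run_iteration(points, state_path):
    # Each round re-reads the previous round's output from disk ...
    with open(state_path) as f:
        centroids = json.load(f)
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = mapper(p, centroids)
        groups[key].append(value)
    new_centroids = list(centroids)
    for key, values in groups.items():
        _, new_centroids[key] = reducer(key, values)
    # ... and persists its own output for the next round.
    with open(state_path, "w") as f:
        json.dump(new_centroids, f)
    return new_centroids


def kmeans_1d(points, initial, iterations=5):
    centroids = list(initial)
    with tempfile.TemporaryDirectory() as tmp:
        state_path = os.path.join(tmp, "centroids.json")
        with open(state_path, "w") as f:
            json.dump(centroids, f)
        for _ in range(iterations):
            centroids = run_iteration(points, state_path)
    return centroids


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centroids = kmeans_1d(points, initial=[0.0, 5.0])
```

The only state carried between rounds is the file on disk, which mirrors why each extra iteration adds a full read and write cycle.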
The MADlib project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at Pivotal (formerly Greenplum). Today it also includes researchers from Stanford and the University of Florida.
Latest version: 1.5
MADlib's initial release included: Naive Bayes, k-means, SVM, quantile, linear and logistic regression, matrix factorization
After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home. A wide range of contributors now develop the project (over 120 developers from 25 companies). MLlib is developed as part of the Apache Spark project, so it is tested and updated with each Spark release.
Spark became a top-level Apache project in February 2014
Current version: 1.0
Included algorithms: SVM, logistic regression, k-means, ALS
Hadoop YARN support in Spark
MADlib (latest version 1.5): Algorithms Supported
Classification: Naive Bayes classification, Random Forest
Regression: logistic regression, linear regression, multinomial logistic regression, elastic net regularization
Clustering: k-means
Topic modeling: Latent Dirichlet Allocation, etc.
Association rule mining: Apriori