Big Data Analytics – Open Source Toolkits

  • Gather Data – exploratory analytics, variable selection, dimensionality reduction (PCA, SVD)

    Train Model

    Compare Model Performance – ROC curve, AUC, etc.

    Predict the future
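
The "compare model performance" step can be sketched as follows. This is a minimal illustration only, assuming binary labels and per-example scores from two already-trained models; the function name `auc`, the model names, and all sample values are made up for the example and do not come from the talk.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the probability that a randomly chosen positive example is
    scored higher than a randomly chosen negative one (ties count 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels and scores from two candidate models.
labels = [1, 0, 1, 1, 0, 0]
model_a = [0.9, 0.2, 0.8, 0.6, 0.4, 0.1]  # ranks every positive above every negative
model_b = [0.9, 0.7, 0.3, 0.6, 0.4, 0.1]  # makes some ranking mistakes

# Keep the model with the higher AUC on the held-out data.
results = [("model_a", auc(labels, model_a)), ("model_b", auc(labels, model_b))]
best = max(results, key=lambda pair: pair[1])
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a sort-based rank formulation would be the usual choice on larger data.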
  • Now let us understand how analytics was done in the pre-Hadoop era. Tools like R and Octave were widely used; they run on a single machine and perform reasonably well on small datasets. Both R and Octave are open-source, high-level interpreted languages. R started in 1993 and is written mainly in C and Fortran. It is very popular among statisticians and data scientists for computational statistics, visualization and data science. It has a vibrant community noted for its active contribution of packages; it has 5589 packages.

    Octave is also an open-source, high-level interpreted language. The Octave language is quite similar to Matlab, so most programs are easily portable.
    But both of these languages are limited in the volume of data they can handle and are not suitable for analytics on huge and dynamic data sets. Hadoop is a de facto standard for storing and processing huge volumes of data.
  • Hadoop was started by Doug Cutting, growing out of the Nutch project, and was developed further at Yahoo. Until 2007 it had two core components – HDFS and MapReduce. In 2008, tools like HBase and ZooKeeper were added to the Hadoop ecosystem. In 2010 Avro and Sqoop were added, and the ecosystem is still growing.

    Two main tools – RHadoop and Mahout – were developed to leverage the distributed processing of the Hadoop framework.

    Introduction of YARN… it opens the Hadoop framework to many other frameworks beyond MapReduce.

    RHadoop?
    RHIPE?? 2012?
  • RHadoop is an open-source collection of three R packages that allow users to manage and analyze data with Hadoop from the R environment.
    R, along with the RHadoop (or RHIPE) packages, needs to be installed on all the nodes, including the edge node. RHadoop/RHIPE then submits the job from the client/edge node.

    Mahout is a Java library with MapReduce implementations of machine learning algorithms. In the case of Mahout, only the Mahout library needs to be present on the client/edge node; the Mahout job, which is an MR job for the distributed algorithms, is submitted from there to the Hadoop cluster.
  • plyrmr – provides additional data manipulation capabilities
  • RHadoop consists of the following packages:
    • rmr2 - functions providing Hadoop MapReduce functionality in R
    • rhdfs - functions providing file management of HDFS from within R
    • rhbase - functions providing database management for the HBase distributed database from within R
  • This is sample code for logistic regression in RHadoop.

    The logistic regression available in R cannot be reused; the algorithm has to be re-expressed as map and reduce functions.
  • We saw adoption of Mahout-based recommendation engines across the industry…
  • Mahout is a Java library with MapReduce implementations of common machine learning algorithms. It was developed to provide scalable, parallelized machine learning algorithms based on the Hadoop framework. The original aim of the Mahout project was to implement all 10 algorithms discussed in Andrew Ng’s paper “Mapreduce …”.


  • One of the reasons MapReduce is criticized is its restricted programming framework:

    - MapReduce tasks must be written as acyclic dataflow programs
    - A stateless mapper is followed by a stateless reducer, both executed by a batch job scheduler
    - Repeated querying of datasets becomes difficult
    - It is thus hard to write iterative algorithms

    - After each MapReduce iteration, data has to be persisted to disk before the next iteration can proceed.
  • MADlib grew out of discussions between database engine developers, data scientists, IT architects, and academics interested in new approaches to scalable, sophisticated in-database analytics. Their exchanges were written up in a paper in VLDB 2009 that coined the term "MAD Skills" for data analysis.
    The MADlib project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at Pivotal (formerly Greenplum); today it also includes researchers from Stanford and the University of Florida.
    Latest version: 1.5

    MADlib’s initial release included: Naive Bayes, k-means, SVM, quantile, linear and logistic regression, matrix factorization

  • Spark started as a research project at the UC Berkeley AMPLab in 2009, and was open sourced in early 2010. The AMPLab continues to perform research both on improving Spark and on systems built on top of it.
    After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home. A wide range of contributors now develop the project (over 120 developers from 25 companies). MLlib is developed as part of the Apache Spark project. It thus gets tested and updated with each Spark release.
    Spark became a top-level Apache project in February 2014.
    Current version: 1.0
    Included: SVM, logistic regression, k-means, ALS

    Hadoop YARN support in Spark

  • MADlib grew out of discussions between database-engine developers, data scientists, IT architects and academics interested in new approaches to scalable, sophisticated in-database analytics. These discussions were written up in a paper in VLDB 2009 that coined the term “MAD Skills” for data analysis. The MADlib software project began the following year as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal).
    Today it also includes researchers from Stanford and the University of Florida.
    Latest version: 1.5

    Algorithms supported:
    • Classification: Naive Bayes classification, Random Forest
    • Regression: logistic regression, linear regression, multinomial logistic regression, elastic net regularization
    • Clustering: k-means
    • Topic modeling: Latent Dirichlet Allocation, etc.
    • Association rule mining: Apriori

Transcript

  • 1. Big Data Analytics – Open Source Toolkits Prafulla Wani Snehalata Deorukhkar
  • 2. Introduction
    • Talk Background
      – More Data Beats Better Algorithms
      – Evaluate “Analytics Toolkits” that support Hadoop
    • Speaker Backgrounds
      – Data Engineers
      – No PhDs in statistics
  • 3. Big Data Analytics Toolkits
    • Evaluation parameters
      – Ease of use
        • Development APIs
        • # of Algorithms supported
      – Performance
        • Scalable Architecture
        • Disk-based / Memory-based
    • Open-source only
      – RHadoop
      – Mahout
      – MADlib
      – HiveMall
      – H2O
      – Spark-MLlib
  • 4. Analytics Project lifecycle
    [Figure: Gather Data → Train Model(s) → Compare Accuracy → Predict Future]
    • Train Algorithm 1 (Logistic regression)
    • Train Algorithm 2 (SVM)
    • ......
    • Train Algorithm N
  • 5. Analytics (Pre-Hadoop era)
    [Chart: Performance vs. Ease of use; region: Single Machine; tools plotted: R, Octave]
    R –
    • Started in 1993
    • Very Popular
    • 5589 packages
    • Written primarily in C and Fortran
    Octave –
    • Started in 1988
    • Open source and features comparable with Matlab
  • 6. Timeline [figure, 2006–2014]
    Hadoop: Core Hadoop (HDFS, MapReduce); HBase, ZooKeeper, Pig, Hive…; Avro, Sqoop; Cloudera Impala; YARN
  • 7. Architecture [figure]
    RHadoop: R on the client/edge node and on the nodes of the Hadoop cluster
    Mahout: Mahout on the client/edge node; Map/Reduce on the Hadoop cluster
  • 8. Timeline [figure, 2006–2014]
    Hadoop: Core Hadoop (HDFS, MapReduce); HBase, ZooKeeper, Pig, Hive…; Avro, Sqoop; Cloudera Impala; YARN
    RHadoop: rhdfs, rmr; rmr 2.0; plyrmr
  • 9. RHadoop
    • Provides R packages –
      – rhdfs - to read/write from/to HDFS
      – rhbase - to read/write from/to HBase
      – rmr - to express map-reduce programs in R
    • Does not provide out-of-box packages for model training
  • 10. RHadoop
    logistic.regression = function(input, iterations, dims, alpha) {
      plane = t(rep(0, dims))
      g = function(z) 1/(1 + exp(-z))
      for (i in 1:iterations) {
        gradient = values(
          from.dfs(
            mapreduce(
              input,
              map = lr.map,
              reduce = lr.reduce,
              combine = T)))
        plane = plane + alpha * gradient
      }
      plane
    }

    lr.map = function(., M) {
      Y = M[,1]
      X = M[,-1]
      keyval(
        1,
        Y * X * g(-Y * as.numeric(X %*% t(plane))))
    }

    lr.reduce = function(k, Z)
      keyval(k, t(as.matrix(apply(Z, 2, sum))))
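
The RHadoop listing on this slide runs one map-reduce job per gradient step. The same structure can be sketched on a single machine in plain Python. This is an illustrative sketch only: it does not use Hadoop or rmr, it assumes labels in {-1, +1} as the R code does, and the chunked toy data and the names `lr_map`, `lr_reduce` and `logistic_regression` are made up to mirror the R functions.

```python
import math


def g(z):
    """Logistic (sigmoid) function, as in the R listing."""
    return 1.0 / (1.0 + math.exp(-z))


def lr_map(chunk, plane):
    """Map step: gradient contribution of one chunk of rows.

    Each row is (y, x) with y in {-1, +1} and x a list of features.
    """
    grad = [0.0] * len(plane)
    for y, x in chunk:
        margin = y * sum(w * xi for w, xi in zip(plane, x))
        scale = y * g(-margin)
        for j, xi in enumerate(x):
            grad[j] += scale * xi
    return grad


def lr_reduce(partials):
    """Reduce step: element-wise sum of the per-chunk gradients."""
    return [sum(col) for col in zip(*partials)]


def logistic_regression(chunks, iterations, dims, alpha):
    plane = [0.0] * dims
    for _ in range(iterations):
        # One "map-reduce job" per iteration: map over chunks, then reduce.
        gradient = lr_reduce([lr_map(c, plane) for c in chunks])
        plane = [w + alpha * gj for w, gj in zip(plane, gradient)]
    return plane


# Made-up, linearly separable toy data, split into two chunks.
# The first feature is a constant 1.0 acting as a bias term.
chunks = [
    [(+1, [1.0, 2.0]), (-1, [1.0, -2.0])],
    [(+1, [1.0, 1.5]), (-1, [1.0, -1.0])],
]
plane = logistic_regression(chunks, iterations=50, dims=2, alpha=0.1)
```

On a real cluster each iteration would re-read the input and launch a new job, which is where the per-iteration overhead of this approach comes from.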
  • 11. Timeline [figure, 2006–2014]
    Hadoop: Core Hadoop (HDFS, MapReduce); HBase, ZooKeeper, Pig, Hive…; Avro, Sqoop; Cloudera Impala; YARN
    RHadoop: rhdfs, rmr; rmr 2.0; plyrmr
    Mahout: Started as a subproject of Apache Lucene; 4 releases (0.1 – 0.4); Top-level Apache project; Recommendation Engines – Common Case study for Hadoop; 0.8 release; Decision to reject new MapReduce implementation; Future implementations on top of Apache Spark; Integration with H2O platform
  • 12. Mahout
    • Original goal - To implement all 10 algorithms from Andrew Ng's paper "Map-Reduce for Machine Learning on Multicore"
    • Java-based library with MapReduce implementations of common analytics algorithms
    • Key algorithms
      – Recommendation algorithms / Collaborative filtering
      – Classification
      – Clustering
      – Frequent Pattern Growth
  • 13. Mahout
    • Train the model:
      mahout org.apache.mahout.df.mapreduce.BuildForest -Dmapred.max.split.size=1884231 -oob -d train.arff -ds train.info -sl 5 -t 1000 -o crwd_forest
    • Test the model:
      mahout org.apache.mahout.df.mapreduce.TestForest -i test.arff -ds train.info -m crwd_forest -a -mr -o crwd_predictions
  • 14. Summary
    [Chart: Performance vs. Ease of use; regions: Single Machine, Distributed Disk-based; tools plotted: R, Octave, Mahout, RHadoop]
  • 15. Aging MapReduce
    • Machine learning algorithms are iterative in nature
    • Mahout algorithms involve multiple MapReduce stages
    • Intermediate results are written to HDFS
    • MR job is launched for each iteration
    • IO overhead
    [Figure: iterative job: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write …; interactive queries: Input → HDFS read → Query 1, Query 2, Query 3 → result 1, result 2, result 3 …; slow due to replication and disk IO]
  • 16. Disk Trend
    • Disk throughput increasing slowly
    Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
  • 17. Memory Trend
    • RAM throughput increasing exponentially
    Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
  • 18. Timeline [figure, 2009–2014]
    Spark: Started as a research project at UC Berkeley AMPLab; Open-sourced; Wins Best Paper Award at USENIX NSDI; Accepted into Apache incubator; Spark 0.8 release introduced MLlib; Spark-MLlib 1.0 released
  • 19. Spark – Data sharing
    • Resilient Distributed Datasets (RDDs)
      – Distributed collections of objects that can be cached in memory across cluster nodes
      – Manipulated through various parallel operations
      – Automatically rebuilt on failures
    [Figure: Input → one-time processing → distributed memory → iter. 1, iter. 2, …; Query 1, Query 2, Query 3, …; 10-100x faster than network and disk]
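
The difference between re-reading input on every iteration and loading it once into memory can be modelled with a toy counter. The sketch below is plain Python and does not use Spark or HDFS; `DiskBackedDataset` is a hypothetical stand-in for data on disk, and it only models the access pattern, not fault tolerance or distribution.

```python
class DiskBackedDataset:
    """Stand-in for data stored on disk; counts how often it is read."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._rows)


def iterate_without_cache(dataset, iterations):
    """MapReduce-style loop: the input is re-read on every iteration."""
    total = 0
    for _ in range(iterations):
        rows = dataset.load()
        total += sum(rows)
    return total


def iterate_with_cache(dataset, iterations):
    """Cached loop: the input is loaded once and then reused from memory."""
    rows = dataset.load()
    total = 0
    for _ in range(iterations):
        total += sum(rows)
    return total
```

Both loops compute the same result; only the number of reads differs, growing with the iteration count in the first case and staying at one in the second.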
  • 20. MLlib
    • Spark implementation of some common machine learning algorithms and utilities, including
      – Classification
      – Regression
      – Clustering
    • Pre-packaged libraries (in Scala, Java, Python) for analytics algorithms –
      – val model = SVMWithSGD.train(training, numIterations)
      – val clusters = KMeans.train(parsedData, numClusters, numIterations)
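
To make the cluster-count and iteration-count parameters of a k-means call concrete, here is a minimal single-machine sketch of Lloyd's algorithm in plain Python. It is not MLlib's implementation: it takes explicit starting centers (an assumption made to keep the example deterministic), and the function name and toy points are made up.

```python
def kmeans(points, centers, num_iterations):
    """Lloyd's algorithm on a list of equal-length tuples."""
    centers = [tuple(c) for c in centers]
    for _ in range(num_iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(c)  # keep an empty cluster's center as-is
        centers = new_centers
    return centers


# Made-up 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)], num_iterations=5)
```

Each pass over `points` corresponds to one iteration over the full dataset, which is why iterative algorithms like this benefit from keeping the data in memory.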
  • 21. SparkR - R Interface over Spark
    • Currently supports using data transformation functions (lapply() etc.) on the distributed Spark model
    • It does not support running out-of-the-box models (e.g. SVMWithSGD.train or KMeans.train)
    • Work is in progress on SparkR - MLlib integration, which may address this limitation
  • 22. Timeline [figure, 2009–2014]
    Spark: Started as a research project at UC Berkeley AMPLab; Open-sourced; Wins Best Paper Award at USENIX NSDI; Accepted into Apache incubator; Spark 0.8 release introduced MLlib; Spark-MLlib 1.0 released
    MADlib: Began as a collaboration between researchers, engineers and data scientists; Initial release; MADlib port for Impala
  • 23. MADlib
    • An open-source library for scalable in-database analytics
    • Supports Postgres, Pivotal Greenplum Database, and Pivotal HAWQ
    • Key MADlib architecture principles are:
      – Operating on the data locally, in-database.
      – Utilizing best-of-breed database engines, but separating the machine learning logic from database-specific implementation details.
      – Leveraging MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
      – Open implementation maintaining active ties into ongoing academic research.
  • 24. MADlib Architecture [figure; labels:]
    • User Interface
    • “Driver” Functions (outer loops of iterative algorithms, optimizer invocations)
    • High-level Abstraction Layer (iteration controller, …)
    • RDBMS Built-in functions
    • MPP Query Processing (Greenplum, PostgreSQL, Impala …)
    • Functions for Inner Loops (for streaming algorithms)
    • Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
    Side labels: SQL, generated from specification; C++
  • 25. Timeline [figure, 2009–2014]
    Spark: Started as a research project at UC Berkeley AMPLab; Open-sourced; Wins Best Paper Award at USENIX NSDI; Accepted into Apache incubator; Spark 0.8 release introduced MLlib; Spark-MLlib 1.0 released
    MADlib: Began as a collaboration between researchers, engineers and data scientists; Initial release; MADlib port for Impala
    H2O: H2O project open-sourced; Latest stable release of H2O, 2.4.3.4, released on May 13, 2014
  • 26. H2O
    • Open source math and prediction engine
    • Distributed, in-memory computations
    • Creates a cluster of H2O nodes, which are map-only tasks
    • Provides graphical interface to load data, view summaries and train models
    • Certified for major Hadoop distributions
  • 27. H2O on Hadoop Deployment
    [Figure: hadoop jar … (Hadoop edge node) → Job Tracker → H2O map tasks on the Hadoop Task Tracker nodes (H2O cluster); HDFS on the Hadoop HDFS data nodes; all within the Hadoop cluster]
    Reference - http://www.slideshare.net/0xdata/h2o-on-hadoop-dec-12
  • 28. H2O Programming Interface
    • R-Package “H2O”
      – prostate.data = h2o.importURL(localH2O, path = "<path>", key = "<key>")
      – summary(prostate.data)
      – h2o.glm
      – h2o.kmeans
  • 29. Community involvement
                     Mahout   Spark-MLlib   MADlib   H2O
    # of commits     20       249           0        557
    For 30 days ending 27 May,
  • 30. HiveMall
    • Machine learning and feature engineering functions through UDFs/UDAFs/UDTFs of Hive
    • Supports various algorithms for –
      – Classification – Perceptron, Adaptive Regularization of Weight Vectors (AROW)
      – Regression - Logistic Regression using Stochastic Gradient Descent
      – Recommendation - Minhash (LSH with Jaccard index)
      – k-Nearest Neighbor
      – Feature engineering
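
The MinHash idea mentioned for recommendation can be sketched in a few lines of plain Python. This is not HiveMall's UDF implementation; it is a minimal illustration, assuming non-empty sets of string items and using seeded MD5 hashes as a stand-in for a family of hash functions. All function names and sample sets are made up.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard index of two non-empty collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=256):
    """One minimum hash value per seeded hash function."""
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Made-up item sets, e.g. products viewed by two users.
user_a = {"a", "b", "c", "d"}
user_b = {"c", "d", "e", "f"}
sig_a = minhash_signature(user_a)
sig_b = minhash_signature(user_b)
approx = estimate_jaccard(sig_a, sig_b)  # approximates jaccard(user_a, user_b)
```

The fixed-length signatures can be compared, or bucketed for LSH, without touching the original sets, which is what makes the technique attractive for large item collections.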
  • 31. Summary
    [Chart: Performance vs. Ease of use; regions: Single Machine, Distributed Disk-based, Distributed Memory-based; tools plotted: R, Octave, Mahout, RHadoop, H2O, MLlib, MADlib + Impala, HiveMall]
  • 32. MLBase - Vision
    • Optimizer built on top of Spark & MLlib
    • A Declarative Approach
    • Abstracts complexities of variable & algorithm selection
      – var X = load ("als_clinical", 2 to 10)
      – var Y = load ("als_clinical", 1)
      – var (fn-model, summary) = doClassify (X, Y)
    Reference - http://www.slideshare.net/chaochen5496/mlllib-sparkmeetup8613finalreduced
    [Figure: Gather Data → Train Model(s) → Compare Accuracy → Predict Future]
  • 33. Summary
    [Chart: Performance vs. Ease of use; regions: Single Machine, Distributed Disk-based, Distributed Memory-based; tools plotted: R, Octave, Mahout, RHadoop, H2O, MLBase, MLlib, MADlib + Impala, HiveMall]
  • 34. Yes, We Are Hiring! Thank You!