Big data analytics_beyond_hadoop_public_18_july_2013

This was the deck I used for the Hadoop Meetup talk at Bangalore on 18th of July 2013. The talk was titled "Big-data Analytics: Need to Look Beyond Hadoop?"

Published in: Technology, Education
  • Thanks for uploading the slides, I very much enjoyed your presentation in inMobi Meetup in Bangalore!
Transcript of "Big data analytics_beyond_hadoop_public_18_july_2013"

  1. Big-data Analytics: Need to look beyond Hadoop? Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus.
  2. Contents
     • Introduction to Berkeley data analytics stack – Spark
     • Machine learning: 3 generations
     • Iterative Machine Learning (ML) algorithms – Logistic regression.
     • Code snippets
     • Performance comparison with Hadoop.
     • Real-time analytics with Twitter’s Storm
     • Internet traffic use case – ML over Storm.
     • Performance comparison of Mahout with R/ML over Storm
  3. ML realizations: 3 Generational view
  4. Iterative ML Algorithms
     • What are iterative algorithms?
        • Those that need communication among the computing entities
        • Examples – neural networks, PageRank algorithms, network traffic analysis
     • Conjugate gradient descent
        • Commonly used to solve systems of linear equations
        • [CB09] tried implementing CG on dense matrices
        • DAXPY – Multiplies vector x by constant a and adds y.
        • DDOT – Dot product of 2 vectors
        • MatVec – Multiply matrix by vector, produce a vector.
        • 1 MapReduce (MR) job per primitive – 6 MRs per CG iteration, hundreds of MRs per CG computation, leading to tens of GB of communication even for small matrices.
     • Other iterative algorithms – fast Fourier transform, block tridiagonal
     [CB09] C. Bunch, B. Drawert, M. Norman, MapScale: A Cloud Environment for Scientific Computing, Technical Report, University of California, Computer Science Department, 2009.
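The three primitives named on this slide are enough to express a whole conjugate gradient solver. The following is a minimal single-machine sketch in plain Scala, not the MapReduce implementation from [CB09]. The object and method names are illustrative. One common formulation of the loop body uses six primitive calls (one MatVec, two DDOT, three DAXPY), which is consistent with the "6 MRs per CG iteration" figure; whether [CB09] decomposed the iteration in exactly this way is an assumption.

```scala
// Toy dense conjugate gradient built only from the three primitives named above.
// Object and method names are illustrative, not taken from [CB09].
object CgSketch {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  /** DAXPY: returns a * x + y. */
  def daxpy(a: Double, x: Vec, y: Vec): Vec =
    x.zip(y).map { case (xi, yi) => a * xi + yi }

  /** DDOT: dot product of two vectors. */
  def ddot(x: Vec, y: Vec): Double =
    x.zip(y).map { case (xi, yi) => xi * yi }.sum

  /** MatVec: multiplies matrix m by vector x. */
  def matVec(m: Mat, x: Vec): Vec = m.map(row => ddot(row, x))

  /** Solves A x = b for a symmetric positive-definite A, starting from x = 0. */
  def solve(a: Mat, b: Vec, maxIter: Int = 100, tol: Double = 1e-10): Vec = {
    var x: Vec = Array.fill(b.length)(0.0)
    var r: Vec = b.clone() // residual b - A x, with x = 0
    var p: Vec = r.clone() // search direction
    var rsOld = ddot(r, r)
    var iter = 0
    while (iter < maxIter && math.sqrt(rsOld) > tol) {
      val ap = matVec(a, p)            // primitive 1: MatVec
      val alpha = rsOld / ddot(p, ap)  // primitive 2: DDOT
      x = daxpy(alpha, p, x)           // primitive 3: DAXPY
      r = daxpy(-alpha, ap, r)         // primitive 4: DAXPY
      val rsNew = ddot(r, r)           // primitive 5: DDOT
      p = daxpy(rsNew / rsOld, p, r)   // primitive 6: DAXPY
      rsOld = rsNew
      iter += 1
    }
    x
  }
}
```

If each of the six calls in the loop body were run as a separate MapReduce job, every iteration would write and re-read its intermediate vectors through the distributed file system, which is the communication cost the slide points to.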
  5. Berkeley Big-data Analytics Stack
     Components:
     • Hadoop Distributed File System
     • Tachyon: Distributed In-memory File System
     • Spark: Computing Paradigm
     • Bagel/GraphX: Graph Processing
     • Shark: SQL Abstraction
     • Spark Streaming
     • Mesos: Cluster Management
     Notes:
     • Mesos – similar to Nimbus used by Storm, but more sophisticated.
     • Tachyon: DFS – could be replaced by HDFS.
     • Spark – built as a computing paradigm over resilient distributed data sets.
     • Shark – comparable to Impala
  6. Spark: Third Generation ML Realization
     • Resilient distributed data sets (RDDs)
        • Read-only collection of objects partitioned across a cluster
        • Can be rebuilt if a partition is lost.
     • Operations on RDDs
        • Transformations – map, flatMap, reduceByKey, sort, join, partitionBy
        • Actions – foreach, reduce, collect, count, lookup
     • Programmer can build RDDs by
        1. Reading a file in HDFS
        2. Parallelizing a Scala collection - divide into slices.
        3. Transforming an existing RDD - specify operations such as map, filter
        4. Changing the persistence of an RDD: cache it, or use a save action, which saves to HDFS.
     • Shared variables
        • Broadcast variables, accumulators
     [MZ10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10-10.
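The claim that an RDD is read-only, partitioned, and rebuildable when a partition is lost can be illustrated with a toy model in plain Scala. This is a sketch of the idea only and is not Spark's implementation or API: `MiniRdd`, `fromRange`, and `dropCache` are hypothetical names. Each dataset keeps a function that can recompute any partition from its parent, so a lost cached partition is regenerated on demand instead of being restored from a replica.

```scala
// Toy model of lineage-based recovery; NOT Spark's implementation or API.
// MiniRdd, fromRange and dropCache are hypothetical names for illustration.
final class MiniRdd[T](val numPartitions: Int, compute: Int => Vector[T]) {
  // Cached partitions; any entry may disappear when a "node" fails.
  private val cache = scala.collection.mutable.Map.empty[Int, Vector[T]]

  /** Returns partition i, recomputing it from lineage if it is not cached. */
  def partition(i: Int): Vector[T] = cache.getOrElseUpdate(i, compute(i))

  /** Transformation: builds a new read-only dataset; nothing is computed yet. */
  def map[U](f: T => U): MiniRdd[U] =
    new MiniRdd[U](numPartitions, i => partition(i).map(f))

  /** Action: materializes every partition and gathers the results. */
  def collect(): Vector[T] =
    (0 until numPartitions).toVector.flatMap(i => partition(i))

  /** Simulates losing the cached copy of one partition. */
  def dropCache(i: Int): Unit = { cache -= i }
}

object MiniRdd {
  /** Splits a range into contiguous slices, like parallelizing a collection. */
  def fromRange(r: Range, numSlices: Int): MiniRdd[Int] = {
    val sliceSize = math.max(1, math.ceil(r.size.toDouble / numSlices).toInt)
    val slices = r.grouped(sliceSize).map(_.toVector).toVector
    new MiniRdd[Int](slices.size, i => slices(i))
  }
}
```

In this model `map` plays the role of a transformation (it only records how to derive the new dataset) and `collect` plays the role of an action (it forces the computation), mirroring the split between the two kinds of operations listed on the slide.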
  7. Data Flow in Spark and Hadoop
  8. Some Spark(ling) examples
     Scala code (serial)

         var count = 0
         for (i <- 1 to 100000) {
           val x = Math.random * 2 - 1
           val y = Math.random * 2 - 1
           if (x*x + y*y < 1) count += 1
         }
         println("Pi is roughly " + 4 * count / 100000.0)

     Sample random points in the square [-1, 1] x [-1, 1] and count how many fall inside the unit circle (a fraction of roughly PI/4). This gives an approximate value for PI. With PS and PC the numbers of points in the square and in the circle, and AS and AC the corresponding areas, PS/PC = AS/AC = 4/PI, so PI = 4 * (PC/PS).
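The area argument behind the estimate can be written out as a short worked derivation. The symbols $N$ and $\hat{\pi}$ and the standard-error line are additions for illustration, not part of the slide; the error estimate follows from treating each sample as an independent Bernoulli trial with success probability $\pi/4$.

```latex
\begin{align*}
A_S &= 2 \times 2 = 4, \qquad A_C = \pi \cdot 1^2 = \pi, \\
\frac{P_C}{P_S} &\approx \frac{A_C}{A_S} = \frac{\pi}{4}
  \quad\Longrightarrow\quad
  \hat{\pi} = 4 \, \frac{P_C}{P_S}, \\
\operatorname{SE}(\hat{\pi}) &= 4 \sqrt{\frac{\tfrac{\pi}{4}\left(1 - \tfrac{\pi}{4}\right)}{N}}
  \approx \frac{1.64}{\sqrt{N}}
  \approx 0.005 \quad \text{for } N = P_S = 100000 .
\end{align*}
```

With 100000 samples the estimate is therefore typically accurate to about two decimal places, which is why the code prints "roughly".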
  9. Some Spark(ling) examples
     Spark code (parallel)

         val spark = new SparkContext(<Mesos master>)
         var count = spark.accumulator(0)
         for (i <- spark.parallelize(1 to 100000, 12)) {
           val x = Math.random * 2 - 1
           val y = Math.random * 2 - 1
           if (x*x + y*y < 1) count += 1
         }
         // Read the accumulated total on the driver through .value
         println("Pi is roughly " + 4 * count.value / 100000.0)

     Notable points:
     1. Spark context created – talks to the Mesos master.
     2. Count becomes a shared variable – an accumulator.
     3. The for loop runs over an RDD – parallelize breaks the Scala range object (1 to 100000) into 12 slices.
     4. The for loop over the RDD invokes the foreach method of the RDD.
     Note: Mesos is an Apache incubated clustering system – http://mesosproject.org
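The slicing idea can be tried without a cluster. The sketch below is a local analogue in plain Scala that uses `Future`s in place of Spark, so it shows the pattern (independent slices, partial counts combined at the end) but not Spark's accumulator mechanism. `PiSlices`, `countInside`, and `estimate` are hypothetical names, and the per-slice seeds are an assumption made so that runs are repeatable.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Random

// Local analogue of the sliced loop above, using Futures instead of Spark.
// PiSlices and its parameters are hypothetical names for illustration.
object PiSlices {
  /** Counts how many of `samples` random points in [-1, 1]^2 fall inside the unit circle. */
  def countInside(samples: Int, seed: Long): Int = {
    val rng = new Random(seed)
    (1 to samples).count { _ =>
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y < 1
    }
  }

  /** Splits `total` samples across `slices` independent tasks and sums the partial counts. */
  def estimate(total: Int, slices: Int): Double = {
    val perSlice = total / slices
    // One task per slice; each task only touches its own generator and counter.
    val partials = (0 until slices).map(s => Future(countInside(perSlice, s.toLong)))
    val inside = Await.result(Future.sequence(partials), 1.minute).sum
    4.0 * inside / (perSlice * slices)
  }
}
```

A call such as `PiSlices.estimate(120000, 12)` mirrors the 12 slices used on the slide. Summing the partial counts at the end does the job that the accumulator does in the Spark version.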
  10. Logistic Regression in Spark: Serial Code

         // Read data file and convert it into Point objects
         val lines = scala.io.Source.fromFile("data.txt").getLines()
         // Materialize the points: getLines() returns a single-pass iterator,
         // but the loop below reads the points once per iteration
         val points = lines.map(x => parsePoint(x)).toList
         // Run logistic regression
         var w = Vector.random(D)
         for (i <- 1 to ITERATIONS) {
           val gradient = Vector.zeros(D)
           for (p <- points) {
             val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
             gradient += scale * p.x
           }
           w -= gradient
         }
         println("Result: " + w)
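The serial code relies on names the slide does not define (`parsePoint`, `Vector`, `D`, `ITERATIONS`). The following is a self-contained sketch of the same loop using plain arrays, so that it can be run as is. `Point`, `dot`, `train`, and `predict` are assumed names, labels are assumed to be +1.0 or -1.0, and the step size is 1 as in the slide's `w -= gradient`.

```scala
import scala.util.Random

// Self-contained variant of the serial loop, with plain arrays in place of the
// slide's Vector class. Point, dot, train and predict are assumed names.
object LogRegSketch {
  final case class Point(x: Array[Double], y: Double) // y is +1.0 or -1.0

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (ai, bi) => ai * bi }.sum

  /** Runs `iterations` full-batch gradient steps, starting from random weights. */
  def train(points: Seq[Point], dims: Int, iterations: Int, seed: Long = 42L): Array[Double] = {
    val rng = new Random(seed)
    var w = Array.fill(dims)(rng.nextDouble())
    for (_ <- 1 to iterations) {
      val gradient = Array.fill(dims)(0.0)
      for (p <- points) {
        // Same scale factor as on the slide.
        val scale = (1 / (1 + math.exp(-p.y * dot(w, p.x))) - 1) * p.y
        for (j <- 0 until dims) gradient(j) += scale * p.x(j)
      }
      // Step size 1, matching `w -= gradient` on the slide.
      w = w.zip(gradient).map { case (wj, gj) => wj - gj }
    }
    w
  }

  /** Predicts the label sign for a feature vector. */
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0) 1.0 else -1.0
}
```

On a small linearly separable data set, `LogRegSketch.train(data, dims = 2, iterations = 20)` returns weights whose sign predictions match the training labels. Real data would usually need a smaller step size than the fixed value of 1 used here.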
  11. Logistic Regression in Spark

         // Read data file and transform it into Point objects
         val spark = new SparkContext(<Mesos master>)
         val lines = spark.hdfsTextFile("hdfs://.../data.txt")
         val points = lines.map(x => parsePoint(x)).cache()
         // Run logistic regression
         var w = Vector.random(D)
         for (i <- 1 to ITERATIONS) {
           val gradient = spark.accumulator(Vector.zeros(D))
           for (p <- points) {
             val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
             gradient += scale * p.x
           }
           w -= gradient.value
         }
         println("Result: " + w)
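The reason the gradient can be collected in an accumulator is that it is a plain sum over data points. The slide does not state the loss function; the derivation below assumes the standard logistic loss for labels $y_i \in \{-1, +1\}$, whose gradient matches the `scale * p.x` term in the code. The partition sets $P_k$ are notation introduced here for illustration.

```latex
\begin{align*}
L(w) &= \sum_{i} \log\!\left(1 + e^{-y_i \, w \cdot x_i}\right), \\
\nabla L(w) &= \sum_{i}
  \underbrace{\left(\frac{1}{1 + e^{-y_i \, w \cdot x_i}} - 1\right) y_i}_{\texttt{scale}} \; x_i
  \;=\; \sum_{k} \; \sum_{i \in P_k}
  \left(\frac{1}{1 + e^{-y_i \, w \cdot x_i}} - 1\right) y_i \, x_i, \\
w &\leftarrow w - \nabla L(w).
\end{align*}
```

Each partition $P_k$ can compute its inner sum independently with the current $w$, and only the per-partition sums need to be combined. Because the points are cached, every iteration after the first reads them from memory rather than from HDFS.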
  12. Logistic Regression: Spark vs. Hadoop (http://spark-project.org)
  13. Instance of Architecture for Internet Traffic Analysis Use Case
  14. K-means Clustering Algorithm: Mahout vs. ML Over Storm
  15. Spark Use Cases
     • Ooyala
        • Uses Cassandra for video data personalization.
        • Pre-compute aggregates vs. on-the-fly queries.
        • Moved to Spark for ML and computing views.
        • Moved to Shark for on-the-fly queries – C* OLAP aggregate queries take 130 secs on Cassandra, 60 ms in Spark
     • Conviva
        • Uses Hive for repeatedly running ad-hoc queries on video data.
        • Optimized ad-hoc queries using Spark RDDs – found Spark is 30 times faster than Hive
        • ML for connection analysis and video streaming optimization.
     • Quantifind
        • Movie and video game companies can predict the success of new releases
  16. Hadoop (un)Suitability: Discussion
     • Iterative ML algorithms – Spark, Giraph
        • Logistic regression, Kernel SVMs, Conjugate gradient descent, collaborative filtering, Gibbs sampling, Alternating least squares.
     • Interactive/On-the-fly data processing – Storm.
     • OLAP – data cube operations. Dremel/Drill
     • Data sets – not embarrassingly parallel?
     • Graph processing
        • GraphLab, Pregel
  17. Thank You!
     • Mail: vijay.sa@impetus.co.in
     • LinkedIn: http://in.linkedin.com/in/vijaysrinivasagneeswaran
     • Blogs: blogs.impetus.com
     • Twitter: @a_vijaysrinivas