SlideShare a Scribd company logo
1 of 17
1
Big-data Analytics: Need to look
beyond Hadoop?
Dr. Vijay Srinivas Agneeswaran,
Director and Head, Big-data R&D,
Innovation Labs, Impetus.s
• Introduction to Berkeley data analytics stack – Spark
• Machine learning: 3 generations
• Iterative Machine Learning (ML) algorithms – Logistic
regression.
• Code snippets
• Performance comparison with Hadoop.
• Real-time analytics with Twitter’s Storm
• Internet traffic use case – ML over Storm.
• Performance comparison of Mahout with R/ML over
Storm
Contents
2
3
ML realizations: 3 Generational view
Iterative ML Algorithms
 What are iterative algorithms?
 Those that need communication among the computing entities
 Examples – neural networks, PageRank algorithms, network traffic analysis
 Conjugate gradient descent
 Commonly used to solve systems of linear equations
 [CB09] tried implementing CG on dense matrices
 DAXPY – Multiplies vector x by constant a and adds y.
 DDOT – Dot product of 2 vectors
 MatVec – Multiply matrix by vector, produce a vector.
 1 MR per primitive – 6 MRs per CG iteration, hundreds of MRs per CG
computation, leading to 10 of GBs of communication even for small
matrices.
 Other iterative algorithms – fast fourier transform, block tridiagonal
[CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific computing,
Technical Report, University of California, Computer Science Department, 2009.
5
Berkeley Big-data Analytics Stack
Hadoop Distributed File System
Tachyon: Distributed In-memory File System
Spark: Computing Paradigm
Bagel/GraphX:
Graph Processing
• Mesos – similar to Nimbus used by Storm, but more
sophisticated.
• Tachyon: DFS – could be replaced by HDFS.
• Spark – built as a computing paradigm over resilient distributed
data sets.
• Shark – comparable to Impala
Shark: SQL
Abstraction
Spark
Streaming
Mesos: Cluster Management
Spark: Third Generation ML Realization
 Resilient distributed data sets (RDDs)
 Read-only collection of objects partitioned across a cluster
 Can be rebuilt if partition is lost.
 Operations on RDDs
 Transformations – map, flatMap, reduceByKey, sort, join, partitionBy
 Actions – Foreach, reduce, collect, count, lookup
 Programmer can build RDDs from
1. a file in HDFS
2. Parallelizing Scala collection - divide into slices.
3. Transform existing RDD - Specify operations such as Map, Filter
4. Change persistence of RDD Cache or a save action – saves to HDFS.
 Shared variables
 Broadcast variables, accumulators
[MZ10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark:
cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud
computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10-10
7
Data Flow in Spark and Hadoop
Some Spark(ling) examples
Scala code (serial)
var count = 0
for (i <- 1 to 100000)
{ val x = Math.random * 2 - 1
val y = Math.random * 2 - 1
if (x*x + y*y < 1) count += 1 }
println("Pi is roughly " + 4 * count / 100000.0)
Sample random point on unit circle – count how many are inside them (roughly about PI/4).
Hence, u get approximate value for PI.
Based on the PS/PC = AS/AC=4/PI, so PI = 4 * (PC/PS).
Some Spark(ling) examples
Spark code (parallel)
val spark = new SparkContext(<Mesos master>)
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12))
{ val x = Math.random * 2 – 1
val y = Math.random * 2 - 1
if (x*x + y*y < 1) count += 1 }
println("Pi is roughly " + 4 * count / 100000.0)
Notable points:
1. Spark context created – talks to Mesos1 master.
2. Count becomes shared variable – accumulator.
3. For loop is an RDD – breaks scala range object (1 to 100000) into 12 slices.
4. Parallelize method invokes foreach method of RDD.
1 Mesos is an Apache incubated clustering system – http://mesosproject.org
Logistic Regression in Spark: Serial Code
// Read data file and convert it into Point objects
val lines = scala.io.Source.fromFile("data.txt").getLines()
val points = lines.map(x => parsePoint(x))
// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
val gradient = Vector.zeros(D)
for (p <- points) {
val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
gradient += scale * p.x
}
w -= gradient
}
println("Result: " + w)
Logistic Regression in Spark
// Read data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()
// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
val gradient = spark.accumulator(Vector.zeros(D))
for (p <- points) {
val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
gradient += scale * p.x
}
w -= gradient.value
}
println("Result: " + w)
Logistic Regression: Spark VS Hadoop
12http://spark-project.org
Instance of Architecture for Internet Traffic
Analysis Use Case
K-means Clustering Algorithm:
Mahout VS ML Over Storm
14
Spark Use Cases
15
• Ooyala
• Uses Cassandra for video data personalization.
• Pre-compute aggregates VS on-the-fly queries.
• Moved to Spark for ML and computing views.
• Moved to Shark for on-the-fly queries – C* OLAP aggregate queries on Cassandra 130 secs,
60 ms in Spark
• Conviva
• Uses Hive for repeatedly running ad-hoc queries on video data.
• Optimized ad-hoc queries using Spark RDDs – found Spark is 30 times faster than
Hive
• ML for connection analysis and video streaming optimization.
• Quantifind
• Movie , video game companies can predict success of new releases
Hadoop (un)Suitability:
Discussion
16
• Iterative ML algorithms – Spark, Giraph
• Logistic regression, Kernel SVMs, Conjugate
gradient descent, collaborative filtering, Gibbs
sampling, Alternating least squares.
• Interactive/On-the-fly data processing – Storm.
• OLAP – data cube operations. Dremel/Drill
• Data sets – not embarrassingly parallel?
• Graph processing
• GraphLab, Pregel
Thank You!
• Mail
vijay.sa@impetus.co.in
• LinkedIn
http://in.linkedin.com/in/vijaysrinivasagneeswaran
• Blogs
blogs.impetus.com
• Twitter
@a_vijaysrinivas.

More Related Content

What's hot

Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Datio Big Data
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrDatabricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thSparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thAlton Alexander
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying SparkDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkDatabricks
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryIlya Ganelin
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopDataWorks Summit
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...Spark Summit
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
 

What's hot (20)

Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkr
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thSparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 

Similar to Beyond Hadoop? Big-data Analytics Frameworks for ML and Real-Time Processing

20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleDataWorks Summit
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_sparkGeetanjali G
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labImpetus Technologies
 

Similar to Beyond Hadoop? Big-data Analytics Frameworks for ML and Real-Time Processing (20)

20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Scala+data
Scala+dataScala+data
Scala+data
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_spark
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 

More from Vijay Srinivas Agneeswaran, Ph.D

More from Vijay Srinivas Agneeswaran, Ph.D (9)

Dl surface statistical_regularities_vs_high_level_concepts_draft_v0.1
Dl surface statistical_regularities_vs_high_level_concepts_draft_v0.1Dl surface statistical_regularities_vs_high_level_concepts_draft_v0.1
Dl surface statistical_regularities_vs_high_level_concepts_draft_v0.1
 
Distributed computing abstractions_data_science_6_june_2016_ver_0.4
Distributed computing abstractions_data_science_6_june_2016_ver_0.4Distributed computing abstractions_data_science_6_june_2016_ver_0.4
Distributed computing abstractions_data_science_6_june_2016_ver_0.4
 
Distributed deep learning_framework_spark_4_may_2015_ver_0.7
Distributed deep learning_framework_spark_4_may_2015_ver_0.7Distributed deep learning_framework_spark_4_may_2015_ver_0.7
Distributed deep learning_framework_spark_4_may_2015_ver_0.7
 
Open problems big_data_19_feb_2015_ver_0.1
Open problems big_data_19_feb_2015_ver_0.1Open problems big_data_19_feb_2015_ver_0.1
Open problems big_data_19_feb_2015_ver_0.1
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Distributed deep learning_over_spark_20_nov_2014_ver_2.8
Distributed deep learning_over_spark_20_nov_2014_ver_2.8Distributed deep learning_over_spark_20_nov_2014_ver_2.8
Distributed deep learning_over_spark_20_nov_2014_ver_2.8
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013
 
Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013
 

Recently uploaded

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 

Beyond Hadoop? Big-data Analytics Frameworks for ML and Real-Time Processing

  • 1. 1 Big-data Analytics: Need to look beyond Hadoop? Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus.s
  • 2. • Introduction to Berkeley data analytics stack – Spark • Machine learning: 3 generations • Iterative Machine Learning (ML) algorithms – Logistic regression. • Code snippets • Performance comparison with Hadoop. • Real-time analytics with Twitter’s Storm • Internet traffic use case – ML over Storm. • Performance comparison of Mahout with R/ML over Storm Contents 2
  • 3. 3 ML realizations: 3 Generational view
  • 4. Iterative ML Algorithms  What are iterative algorithms?  Those that need communication among the computing entities  Examples – neural networks, PageRank algorithms, network traffic analysis  Conjugate gradient descent  Commonly used to solve systems of linear equations  [CB09] tried implementing CG on dense matrices  DAXPY – Multiplies vector x by constant a and adds y.  DDOT – Dot product of 2 vectors  MatVec – Multiply matrix by vector, produce a vector.  1 MR per primitive – 6 MRs per CG iteration, hundreds of MRs per CG computation, leading to 10 of GBs of communication even for small matrices.  Other iterative algorithms – fast fourier transform, block tridiagonal [CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific computing, Technical Report, University of California, Computer Science Department, 2009.
  • 5. 5 Berkeley Big-data Analytics Stack Hadoop Distributed File System Tachyon: Distributed In-memory File System Spark: Computing Paradigm Bagel/GraphX: Graph Processing • Mesos – similar to Nimbus used by Storm, but more sophisticated. • Tachyon: DFS – could be replaced by HDFS. • Spark – built as a computing paradigm over resilient distributed data sets. • Shark – comparable to Impala Shark: SQL Abstraction Spark Streaming Mesos: Cluster Management
  • 6. Spark: Third Generation ML Realization  Resilient distributed data sets (RDDs)  Read-only collection of objects partitioned across a cluster  Can be rebuilt if partition is lost.  Operations on RDDs  Transformations – map, flatMap, reduceByKey, sort, join, partitionBy  Actions – Foreach, reduce, collect, count, lookup  Programmer can build RDDs from 1. a file in HDFS 2. Parallelizing Scala collection - divide into slices. 3. Transform existing RDD - Specify operations such as Map, Filter 4. Change persistence of RDD Cache or a save action – saves to HDFS.  Shared variables  Broadcast variables, accumulators [MZ10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10-10
  • 7. 7 Data Flow in Spark and Hadoop
  • 8. Some Spark(ling) examples Scala code (serial) var count = 0 for (i <- 1 to 100000) { val x = Math.random * 2 - 1 val y = Math.random * 2 - 1 if (x*x + y*y < 1) count += 1 } println("Pi is roughly " + 4 * count / 100000.0) Sample random point on unit circle – count how many are inside them (roughly about PI/4). Hence, u get approximate value for PI. Based on the PS/PC = AS/AC=4/PI, so PI = 4 * (PC/PS).
  • 9. Some Spark(ling) examples Spark code (parallel) val spark = new SparkContext(<Mesos master>) var count = spark.accumulator(0) for (i <- spark.parallelize(1 to 100000, 12)) { val x = Math.random * 2 – 1 val y = Math.random * 2 - 1 if (x*x + y*y < 1) count += 1 } println("Pi is roughly " + 4 * count / 100000.0) Notable points: 1. Spark context created – talks to Mesos1 master. 2. Count becomes shared variable – accumulator. 3. For loop is an RDD – breaks scala range object (1 to 100000) into 12 slices. 4. Parallelize method invokes foreach method of RDD. 1 Mesos is an Apache incubated clustering system – http://mesosproject.org
  • 10. Logistic Regression in Spark: Serial Code // Read data file and convert it into Point objects val lines = scala.io.Source.fromFile("data.txt").getLines() val points = lines.map(x => parsePoint(x)) // Run logistic regression var w = Vector.random(D) for (i <- 1 to ITERATIONS) { val gradient = Vector.zeros(D) for (p <- points) { val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y gradient += scale * p.x } w -= gradient } println("Result: " + w)
  • 11. Logistic Regression in Spark // Read data file and transform it into Point objects val spark = new SparkContext(<Mesos master>) val lines = spark.hdfsTextFile("hdfs://.../data.txt") val points = lines.map(x => parsePoint(x)).cache() // Run logistic regression var w = Vector.random(D) for (i <- 1 to ITERATIONS) { val gradient = spark.accumulator(Vector.zeros(D)) for (p <- points) { val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y gradient += scale * p.x } w -= gradient.value } println("Result: " + w)
  • 12. Logistic Regression: Spark VS Hadoop 12http://spark-project.org
  • 13. Instance of Architecture for Internet Traffic Analysis Use Case
  • 15. Spark Use Cases 15 • Ooyala • Uses Cassandra for video data personalization. • Pre-compute aggregates VS on-the-fly queries. • Moved to Spark for ML and computing views. • Moved to Shark for on-the-fly queries – C* OLAP aggregate queries on Cassandra 130 secs, 60 ms in Spark • Conviva • Uses Hive for repeatedly running ad-hoc queries on video data. • Optimized ad-hoc queries using Spark RDDs – found Spark is 30 times faster than Hive • ML for connection analysis and video streaming optimization. • Quantifind • Movie , video game companies can predict success of new releases
  • 16. Hadoop (un)Suitability: Discussion 16 • Iterative ML algorithms – Spark, Giraph • Logistic regression, Kernel SVMs, Conjugate gradient descent, collaborative filtering, Gibbs sampling, Alternating least squares. • Interactive/On-the-fly data processing – Storm. • OLAP – data cube operations. Dremel/Drill • Data sets – not embarrassingly parallel? • Graph processing • GraphLab, Pregel
  • 17. Thank You! • Mail vijay.sa@impetus.co.in • LinkedIn http://in.linkedin.com/in/vijaysrinivasagneeswaran • Blogs blogs.impetus.com • Twitter @a_vijaysrinivas.