Apache Spark
History and market overview
Martin Zapletal Cake Solutions
Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and machine learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Core, SQL, GraphX, Streaming
6) Spark’s distributed programming model
7) Deployment
Table of contents
● Motivation - why distributed data processing
● Market overview
● Brief history
● Hadoop MapReduce
● Apache Spark
● Other competitors
● Q & A
Motivation
● production of data in 2002 was around 5 exabytes, roughly 800
megabytes per person; even more via TV, radio and phone
● roughly doubled since 1999
● importance of data for business and society
● data must be stored, processed and analysed to extract its value
● 3Vs of data
o Volume
o Velocity
o Variety
Distributed computing
● from supercomputers to cloud
o economic reasons
o gradual upgrades
o fault tolerance
o scalability
o versatility
o development speed
o ecosystem and tooling
o geographical distribution
o various models and technologies
Distributed computing
● largest Yahoo Hadoop cluster has 4,500 nodes. 40,000 nodes in total. 455
petabytes
● Facebook Hadoop 2000 nodes, each 12TB storage, 32GB RAM, 8-16
cores
● Yahoo Kafka 20 gigabytes/second, LinkedIn 460,000 writes/sec,
2,300,000 reads/sec
● MongoDB 100 nodes, 20-30TB
Distributed computing
● need for new tools, approaches, philosophy, languages, theory
● 7 fallacies of distributed computing
o e.g. the network is reliable, latency is zero, the network is secure
● complexity
o packet loss, ordering, acknowledgement, time, synchronization,
reliable delivery
o many possible states and possibilities
o ubiquitous failures and impact of the distribution
● deployment
● theory
Big Data technologies
● distributed computing frameworks
o batch
o stream
● machine learning and data mining
● support tools
● message queues
● databases
● distributed computing primitives
● cluster operating systems, schedulers
● deployment tools
Big Data technologies
Distributing computation
● efficient use of resources
● ensuring the computation completes
● ensuring correct result
● different levels of abstraction
o GPU
o processes
o threads
o actors
o actor clusters and virtualized actors
o frameworks on top of actors
o distributed computing frameworks
● different computing models
o shared nothing
o shared memory
o actors
o MapReduce
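The MapReduce computing model from the list above can be sketched on plain Scala collections: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase folds each group into a result. This is a single-machine illustration of the idea only (the names, such as MapReduceSketch, are made up for this sketch); it is not Hadoop's or Spark's actual implementation, where the shuffle happens over the network.

```scala
object MapReduceSketch {
  // Map phase: each input record is turned into zero or more (key, value) pairs.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle: group the emitted pairs by key (done over the network in a real cluster).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  // Reduce phase: fold each key's values into a single result.
  def reducePhase(groups: Map[String, Seq[Int]]): Map[String, Int] =
    groups.map { case (k, vs) => (k, vs.sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(mapPhase(lines)))
}
```

Running `MapReduceSketch.wordCount(Seq("a b a", "b c"))` yields the counts a -> 2, b -> 2, c -> 1, mirroring the word-count example that follows.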
Distributing computation
[diagram: timeline t1-t6 comparing data transfer, network and computation]
Distributing computation
[diagram: timeline t1-t3]
Distributing computation
[diagram: timeline t1-t3]
Brief history
● Google File System 2003
● MapReduce 2004
● BigTable 2006
● Dremel 2008
● Colossus 2011
● Spanner 2012
● Amazon Dynamo 2002
Brief history
● Apache Hadoop
o HDFS file system
o HBase database
o MapReduce
o Apache Mahout
o Apache Hive
o Apache Pig
o Apache Drill
o YARN resource management, etc.
Hadoop MapReduce
Hadoop MapReduce
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
Apache Spark
● developed at UC Berkeley, now open source
● written in Scala, uses Akka
● compatible with existing Hadoop infrastructure
● APIs for Java, Scala and Python
● simple, expressive, functional, high-level programming model
● speed
● in-memory caching, query optimizations
● suitable for iterative and ad-hoc queries (ideal for ML)
● used in production at Yahoo, Amazon, ...
● Databricks raised ~$47M last year
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Apache Spark
● RDD (Resilient Distributed Dataset)
● deployment, installation, the programming model and what actually happens in
the background will be covered in the next talks
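The RDD mentioned above is, conceptually, an immutable dataset plus a lineage: each RDD records how it is computed from its parent, transformations are lazy, and lost partitions can be recomputed from the lineage instead of being replicated. A minimal conceptual sketch in plain Scala (the MiniRDD name and shape are invented for illustration; this is not Spark's actual API):

```scala
// Conceptual sketch of an RDD as a lazy, recomputable dataset.
final case class MiniRDD[A](compute: () => Seq[A]) {
  // Transformations are lazy: they only extend the lineage (the recipe).
  def map[B](f: A => B): MiniRDD[B] = MiniRDD(() => compute().map(f))
  def filter(p: A => Boolean): MiniRDD[A] = MiniRDD(() => compute().filter(p))
  // An action replays the whole lineage to produce a result.
  def collect(): Seq[A] = compute()
}

object MiniRDDDemo {
  val base = MiniRDD(() => Seq(1, 2, 3, 4))
  val derived = base.map(_ * 2).filter(_ > 4) // nothing computed yet
  def run(): Seq[A] forSome { type A } = derived.collect()
}
```

Because `derived` is only a recipe, calling `collect()` twice recomputes it from `base`; real Spark adds partitioning, scheduling and optional in-memory caching on top of this idea.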
Competition
● non-exhaustive list
● Akka cluster/remoting
o lower level abstraction
o more work for the developer
o more freedom
Competition
● Intel GearPump
o built on top of Akka
o a scalable, fault-tolerant and expressive solution
o a distributed streaming data solution competing with, for example, Storm
Competition
● Apache Flink
o written in Java, started in 2008 at the Technical University of Berlin, the Humboldt
University of Berlin, and the Hasso Plattner Institute
o ASF Top-Level Project since early 2015
o fast
o a cost-based query optimizer that generalizes relational database query
optimization to the distributed environment
o streaming
o API similar to Spark's
Competition
● Apache Tez
o developed by Hortonworks, an ASF Top-Level Project since July 2014
o generalizes MapReduce to a more powerful framework based on expressing
computations as a dataflow graph
o much richer API
o lower level than Spark or Flink, allowing some extra optimizations
Competition
● Apache Samza
o developed at LinkedIn, joined ASF in September 2013
o distributed stream processing framework
o uses Kafka (also developed at LinkedIn) and other data sources
● Apache Storm
o distributed unbounded stream processing framework
o programming API to define graph topologies
using Spouts (sources) and Bolts (processing nodes)
o used at Yahoo, Twitter, Yelp, Spotify, ...
Conclusion
● why distributed computing frameworks
● why Spark?
o concepts based on theory
o young and progressive, written in Scala
o already mature and production proven
o distributed computing, Big Data, data analysis increasingly important
o potential to replace the market-leading MapReduce in the Hadoop ecosystem
● why not?
o many competitors
o Spark may not always be the best fit
Questions


Editor's Notes

  • #5 Volume - the amount of data. Consider for example social media, logs, emails, HTML web pages, machine-to-machine communication between different systems, or the Internet of Things. The amounts are increasing rapidly. Researchers claim that 90% of the world's total data was generated in the last two years (2013) [3]. YouTube users upload 48 hours of video, Facebook users share 684,478 pieces of content, Instagram users share 3,600 new photos, and Tumblr sees 27,778 new posts published every minute (2012) [4]. According to [5], one of the widely used RDBMSs, PostgreSQL, hits latency limits at 10 million rows, has no robust and inexpensive solution to query across shards, has no robust way to scale horizontally, and its performance improvements are very expensive. Velocity - most of the data is not persistent. Data often exists for a limited amount of time. Consider for example the difference in velocity of a blog post versus an HTTP request, an event from a sensor, or streaming data from social media. If not recorded, the data is gone and cannot be retrieved. High-velocity data is harder to analyse and requires systems to be reactive and scalable. Legacy RDBMSs such as DB2, Oracle and SQL Server were designed (many years ago) as general-purpose DBMSs, appropriate for all applications. As such, their performance on ingesting a firehose of real-time messages will almost certainly be inadequate, both in terms of throughput and latency [6]. Variety - however much we might want it, the data is usually not stored in tables with defined data types. Data is mostly unstructured: there are no explicit relations, the types differ, often some records are missing or require cleaning, some data is erroneous, etc. Huge amounts of data are also in formats that are even harder to store and analyse, such as images or videos.
  • #6 1) Economic reasons. Data centers running distributed applications often use cheap commodity hardware. Of course, applications that require real-time performance use better hardware, networks, etc., but other applications can easily run on cheap hardware and still process huge amounts of data. This also includes maintenance costs, where a malfunctioning machine can easily be removed from the self-organizing cluster. Distributed systems have also allowed companies to outsource infrastructure and only pay for processing power when needed, while being able to cheaply purchase more if scaling is needed. This may result in a major cost decrease, because all the infrastructure, knowledge and skilled personnel are no longer required. 2) Upgrades can be done gradually. Distributed networks can use machines of very different characteristics, and the managers and schedulers should be able to deal with this scenario and efficiently distribute work. 3) Fault tolerance. When using proper software, failure of a single machine and failures of the network should not limit availability of the service. In the case of processing frameworks, the manager process re-runs the failed part of the computation, so the user eventually receives the correct result. 4) Scalability. Applications using certain distributed models can scale almost linearly with increasing cluster size. Data centers can scale by purchasing more machines and adding them to the existing infrastructure. 5) Versatility. Unused nodes can simply be reused for other tasks when they are free. Modern schedulers and deployment pipelines help to automate this task and to distribute load evenly during peak usage. The whole distributed system is then treated as one unit from the user's point of view, and the scheduler acts as a cluster operating system, taking on roles such as service discovery, service registration, load balancing, resource assignment and task scheduling. 6) Development speed. Distributed data processing frameworks often offer a high-level abstraction to express the desired task, and often also a simplified parallel programming model and framework-level optimizations. Together with a deployment tool pipeline, this provides the ability to execute the develop, test and deploy cycle very quickly. On the other hand, sequential programs or lower-level parallel programming frameworks such as MPI are not constrained by an abstracted programming model and can therefore often express some more complex algorithms in a simpler way, as well as provide manual optimization, efficiency and speed, usually at the cost of slower development. 7) Ecosystem of tools and libraries. Probably the strongest reason to use existing distributed frameworks in practice is that they have large communities and many open source libraries and tools that can be adopted and used for free. This greatly simplifies development, increases quality and saves money. 8) Iterative development, feedback and simple implementation for data scientists. With the above in place, the processing frameworks are perfect for a quick feedback loop where the developer can easily deploy a new version of the application very often. Also, the available tools, machine learning libraries and high level of abstraction make them simple to use even for data scientists without much programming background. 9) Geographical distribution. Distributed systems often span multiple countries or even large parts of the world. For example, the round trip of a packet between Stanford and Boston takes around 95ms, while the theoretical limit is 43.2ms (constrained by the speed of light) [11]. Therefore it is a large advantage to have one system span multiple data centers around the world, to utilize proximity for latency-critical requests and reduce communication between the distant partitions.
  • #8 theory - CAP theorem, FLP impossibility, vector clocks, distributed consensus, replication, formal methods (TLA+, invariant-based reasoning), consistency models, ... the Two Generals problem, Byzantine generals
  • #9 many technologies, useful for different cases; different approaches and design choices; different levels of quality, price, support and maturity; different use cases
  • #11 lower levels - manually written parallelization and distribution; higher levels - a constraining programming model, but one that takes care of a lot
  • #13 more complex communication patterns are usually required; just a conceptual view
  • #14 more complex communication patterns are usually required; just a conceptual view
  • #15 BigTable ~= HBase (a proprietary data store built on top of GFS). Spanner is Google's scalable, multi-version, globally-distributed and synchronously-replicated database, the successor of BigTable. Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data (interactive, fast queries; its open source counterpart is Apache Drill). Colossus is the new version of GFS.
  • #17 describe how it works and why it helps distribute computation; huge scalability improvement; map is cheap, shuffle is expensive; poor programming model