The 10 Apache Spark Features 
You (Unlikely) Didn't Hear About 
Roger Brinkley 
Technical Evangelist
The 10 Apache Spark Features You 
(Unlikely) Didn't Hear About 
• 10 minutes – 10 slides 
• Ignite Format 
• No stopping! 
• No going back! 
• Questions? Sure, but only if and until time 
remains on slide (otherwise, save for later) 
• Hire me, I’ll find 45 more
It’s Fast. Really Fast. 
• 10 – 100x faster than MapReduce 
• 10 – 100x faster than Hive 
• Historical perspective 
– MapReduce is listed as the last most 
important software innovation 
– JRuby only got 2-3x faster with the JVM's invokedynamic 
– Hardware rarely improves more than 10x/year
It’s Pure Open Source 
• Commons-based Peer Production 
– Apache Software Foundation Top Level Project 
– 200 people from 50 Organizations Contributing 
– 12 Organizations Committing 
– Peer Governance 
– Participative Decision Making 
“The very essence of free software 
consists in considering contributing 
roles as public trusts, bestowed for the 
good of the community, and not for 
the benefit of an individual or a party” 
– Modern FOSS paraphrase 
“The very essence of a free government 
consists in considering offices as public 
trusts, bestowed for the good of the 
country, and not for the benefit of an 
individual or a party” 
– John C. Calhoun, 1835
Strong Enterprise Relationships 
• Spark ships in every major Hadoop distribution 
• Vertical enterprise use 
– Internet companies, government, financials 
– Churn analysis, fraud detection, risk analytics 
• Used with other data stores 
– DataStax (Cassandra) 
– MongoDB 
• Databricks has a cloud-based implementation
Enhances Other Big Data 
Implementations 
• Hadoop – Replacement for MapReduce 
• Cassandra – Analytics 
• Hive – Faster SQL processing 
• SAP HANA – Faster interactive analysis
API Stability 
• Guaranteed stability of its core API for 1.X 
• Spark has always been conservative with API 
changes 
• Clearly defined annotations for future APIs 
– Experimental 
– Alpha 
– Developer
Don’t Need to Learn a New Language 
• Scala 
• Java – 25% 
• Python – 30% 
• And soon R
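Spark's Scala API deliberately mirrors the standard collections API, which is a large part of why no new language is needed. A minimal sketch of a word count in plain Scala collections, with no Spark required (`LocalWordCount` and the sample lines are illustrative, not from the deck); the same flatMap/map chain runs against an RDD nearly unchanged:

```scala
// Local word count using plain Scala collections. The flatMap/map chain
// mirrors Spark's RDD API; groupBy + sum stands in for reduceByKey.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))               // split each line into words
      .map(word => (word, 1))              // pair each word with a count of 1
      .groupBy(_._1)                       // collect pairs by word
      .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // sum the counts

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("spark is fast", "spark is open")))
}
```

To move this to Spark, `lines` becomes `sc.textFile(...)` and `groupBy` + `sum` collapses into `reduceByKey(_ + _)`.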
Java 8 Lambda Support 
With Java 8 lambdas: 
JavaRDD<String> lines = sc.textFile("hdfs://log.txt"); 
JavaRDD<String> words = 
    lines.flatMap(line -> Arrays.asList(line.split(" "))); 
JavaPairRDD<String, Integer> counts = 
    words.mapToPair(w -> new Tuple2<String, Integer>(w, 1)) 
         .reduceByKey((x, y) -> x + y); 
counts.saveAsTextFile("hdfs://counts.txt"); 
The same code before Java 8: 
JavaRDD<String> lines = sc.textFile("hdfs://log.txt"); 
// Map each line to multiple words 
JavaRDD<String> words = lines.flatMap( 
    new FlatMapFunction<String, String>() { 
      public Iterable<String> call(String line) { 
        return Arrays.asList(line.split(" ")); 
      } 
    }); 
// Turn the words into (word, 1) pairs 
JavaPairRDD<String, Integer> ones = words.mapToPair( 
    new PairFunction<String, String, Integer>() { 
      public Tuple2<String, Integer> call(String w) { 
        return new Tuple2<String, Integer>(w, 1); 
      } 
    }); 
// Group up and add the pairs by key to produce counts 
JavaPairRDD<String, Integer> counts = ones.reduceByKey( 
    new Function2<Integer, Integer, Integer>() { 
      public Integer call(Integer i1, Integer i2) { 
        return i1 + i2; 
      } 
    }); 
counts.saveAsTextFile("hdfs://counts.txt");
Real-Time Stream Processing 
val ssc = new StreamingContext(args(0), "NetworkHashCount", 
  Seconds(10), System.getenv("SPARK_HOME"), 
  Seq(System.getenv("SPARK_EXAMPLES_JAR"))) 
val lines = ssc.socketTextStream("localhost", 9999) 
val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#")) 
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) 
wordCounts.print() 
ssc.start() 
The same counting logic as a batch job: 
val file = sc.textFile("hdfs://.../pagecounts-*.gz") 
val counts = file.flatMap(line => line.split(" ")) 
  .map(word => (word, 1)) 
  .reduceByKey(_ + _) 
counts.saveAsTextFile("hdfs://.../word-count")
Caching Interactive Algorithms 
val points = sc.textFile("...").map(parsePoint).cache() 
var w = Vector.random(D) // current separating plane 
for (i <- 1 to ITERATIONS) { 
  val gradient = points.map(p => 
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x 
  ).reduce(_ + _) 
  w -= gradient 
} 
println("Final separating plane: " + w)
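The gradient step above can be checked without Spark: the sketch below replaces the RDD with a local Seq and implements the same update rule in plain Scala (`Point`, `dot`, and `step` are illustrative names, not from the deck):

```scala
import scala.math.exp

// Plain-Scala sketch of one logistic-regression gradient step from the
// slide: sum each point's contribution, then update w by subtracting it.
object LogisticSketch {
  case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def step(points: Seq[Point], w: Array[Double]): Array[Double] = {
    // Each point contributes (1 / (1 + exp(-y * (w . x))) - 1) * y * x
    val gradient = points
      .map(p => p.x.map(_ * (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y))
      .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
    w.zip(gradient).map { case (wi, gi) => wi - gi } // w -= gradient
  }
}
```

Caching matters here because `points` is re-read on every one of the ITERATIONS passes; `.cache()` keeps it in memory instead of re-loading from HDFS each time.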
New Security Integration 
• Complete Integration with the Hadoop/YARN Security 
Model 
– Authenticate Job Submissions 
– Securely transfer HDFS credentials 
– Authenticate communication between components 
• Other deployments supported 
val conf = new SparkConf() 
conf.set("spark.authenticate", "true") 
conf.set("spark.authenticate.secret", "good")
And Lots More 
• Apache Spark Website 
• Databricks – making big data easy 
– Introduction to Apache Spark 
• Jul 28 – Austin, TX - More Info & Registration 
• Aug 25 – Chicago, IL - More Info & Registration


Editor's Notes

  • #14 There are a lot of features that you probably don’t know about, but you can find them at the Apache Spark Website or at Databricks, the company where a number of the leading contributors to Apache Spark work. Also be aware that Databricks offers an Introduction to Apache Spark, with events coming up on July 28 in Austin and August 25 in Chicago.