A presentation prepared for Data Stack as part of their interview process on July 20.
This presentation, in ignite format, features 10 items that you might not know about the Spark 1.0 release.
10 Things About Spark
1. The 10 Apache Spark Features You (Unlikely) Didn't Hear About
Roger Brinkley
Technical Evangelist
2. The 10 Apache Spark Features You (Unlikely) Didn't Hear About
• 10 minutes – 10 slides
• Ignite Format
• No stopping!
• No going back!
• Questions? Sure, but only if and until time
remains on slide (otherwise, save for later)
• Hire me, I’ll find 45 more
3. It's Fast, Really Fast
• 10–100x faster than MapReduce
• 10–100x faster than Hive
• Historical perspective
– MapReduce is listed as the last most important software innovation
– JRuby is 2–3x faster with the JVM's invokedynamic
– Hardware rarely gets better than 10x/year
4. It’s Pure Open Source
• Commons-based Peer Production
– Apache Software Foundation Top Level Project
– 200 people from 50 organizations contributing
– 12 organizations committing
– Peer Governance
– Participative Decision Making
"The very essence of free software consists in considering contributing roles as public trusts, bestowed for the good of the community, and not for the benefit of an individual or a party"
– Modern FOSS adaptation
"The very essence of a free government consists in considering offices as public trusts, bestowed for the good of the country, and not for the benefit of an individual or a party"
– John C. Calhoun, 2/13/1835
5. Strong Enterprise Relationships
• Spark ships in every major Hadoop distribution
• Vertical enterprise use
– Internet companies, government, financials
– Churn analysis, fraud detection, risk analytics
• Used in other data stores
– DataStax (Cassandra)
– MongoDB
• Databricks has a cloud-based implementation
6. Enhances Other Big Data
Implementations
• Hadoop – Replacement of Map Reduce
• Cassandra – Analytics
• Hive – Faster SQL processing
• SAP Hana – Faster interactive analysis
7. API Stability
• Guaranteed stability of its core API for 1.X
• Spark has always been conservative with API
changes
• Clearly defined annotations for future APIs
– Experimental
– Alpha
– Developer
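The annotations above mark APIs that sit outside the 1.X stability guarantee. As a minimal sketch of the pattern (a hypothetical re-declaration for illustration, not Spark's own `org.apache.spark.annotation` classes), a marker annotation lets both readers and tooling see which methods may still change:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical marker annotation in the style Spark uses to flag
// unstable APIs; retained at runtime so tools can inspect it.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@interface Experimental {}

public class ApiStabilityDemo {
    // Flagged as experimental: excluded from the stability guarantee.
    @Experimental
    static String unstableFeature() { return "subject to change"; }

    public static void main(String[] args) throws Exception {
        // Reflection can report whether a method carries the marker.
        boolean flagged = ApiStabilityDemo.class
            .getDeclaredMethod("unstableFeature")
            .isAnnotationPresent(Experimental.class);
        System.out.println("unstableFeature is experimental: " + flagged);
    }
}
```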
8. Don’t Need to Learn a New Language
• Scala
• Java – 25%
• Python – 30%
• And soon R
9. Java 8 Lambda Support
With Java 8 lambdas:

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
JavaRDD<String> words =
  lines.flatMap(line -> Arrays.asList(line.split(" ")));
JavaPairRDD<String, Integer> counts =
  words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
       .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://counts.txt");

The same program without lambdas:

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
// Map each line to multiple words
JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
      return Arrays.asList(line.split(" "));
    }
  });
// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String w) {
      return new Tuple2<String, Integer>(w, 1);
    }
  });
// Group up and add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  });
counts.saveAsTextFile("hdfs://counts.txt");
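The same flatMap → map → reduce-by-key shape can be run locally with plain Java 8 streams, which shows how much code the lambdas save independently of Spark. This is a hedged sketch with hypothetical names, not Spark API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

// Word count with plain Java 8 streams: flatMap splits lines into
// words, then groupingBy/counting plays the role of reduceByKey.
public class StreamWordCount {
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split(" ")))
            .collect(groupingBy(w -> w, counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
            countWords(Arrays.asList("a b a", "b c"));
        System.out.println(counts); // a=2, b=2, c=1
    }
}
```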
10. Real Time Stream Process
Batch word count:

val file = sc.textFile("hdfs://.../pagecounts-*.gz")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://.../word-count")

Streaming hashtag count:

val ssc = new StreamingContext(args(0), "NetworkHashCount", Seconds(10),
  System.getenv("SPARK_HOME"),
  Seq(System.getenv("SPARK_EXAMPLES_JAR")))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
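The key idea in Spark Streaming is micro-batching: the input stream is chopped into small batches (here, 10 seconds each) and every batch gets the same word-count logic as the batch job. A toy Java sketch of that model, with hypothetical names and none of Spark's scheduling or fault tolerance:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the micro-batch model: each batch of lines is
// processed independently with the same hashtag-count logic as the
// slide's streaming example.
public class MicroBatchSketch {
    static Map<String, Integer> countBatch(List<String> batch) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : batch)
            for (String w : line.split(" "))
                if (w.startsWith("#"))          // keep hashtags, as on the slide
                    counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // Two simulated 10-second batches of socket input.
        List<List<String>> batches = Arrays.asList(
            Arrays.asList("#spark is #fast", "#spark scales"),
            Arrays.asList("plain words only"));
        for (List<String> batch : batches)
            System.out.println(countBatch(batch));
    }
}
```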
11. Caching Interactive Algorithms
val points = sc.textFile("...").map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
println("Final separating plane: " + w)
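Caching pays off here because the same point set is re-scanned on every iteration. A plain-Java, one-dimensional sketch of the same gradient loop (hypothetical data and names, for illustration only) makes that access pattern concrete:

```java
// One-dimensional logistic regression by gradient descent. The data
// arrays play the role of Spark's cached RDD: materialized once,
// then re-scanned on every iteration of the loop.
public class LogisticSketch {
    static double train(double[] x, double[] y, int iterations) {
        double w = 0.0; // current separating plane
        for (int iter = 0; iter < iterations; iter++) {
            double gradient = 0.0;
            for (int i = 0; i < x.length; i++) // full pass over cached points
                gradient += (1.0 / (1.0 + Math.exp(-y[i] * w * x[i])) - 1.0)
                            * y[i] * x[i];
            w -= gradient;
        }
        return w;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, -1.0, -2.0}; // features
        double[] y = {1.0, 1.0, -1.0, -1.0}; // labels (+1 / -1)
        System.out.println("Final separating plane: " + train(x, y, 100));
    }
}
```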
12. New Security Integration
• Complete Integration with the Hadoop/YARN Security Model
– Authenticate Job Submissions
– Securely Transfer HDFS Credentials
– Authenticate Communication Between Components
• Other deployments supported
val conf = new SparkConf
conf.set("spark.authenticate", "true")
conf.set("spark.authenticate.secret", "good")
13. And Lots More
• Apache Spark Website
• Databricks – making big data easy
– Introduction to Apache Spark
• Jul 28 – Austin, TX - More Info & Registration
• Aug 25 – Chicago, IL - More Info & Registration
Editor's Notes
There are a lot of features that you probably don't know about, but you can find them at the Apache Spark Website or at Databricks, the company where a number of the leading contributors to Apache Spark work. Also be aware that Databricks offers an Introduction to Apache Spark, with events coming up on July 28 in Austin and August 25 in Chicago.