The 10 Apache Spark Features 
You (Unlikely) Didn't Hear About 
Roger Brinkley 
Technical Evangelist
The 10 Apache Spark Features You 
(Unlikely) Didn't Hear About 
• 10 minutes – 10 slides 
• Ignite Format 
• No stopping! 
• No going back! 
• Questions? Sure, but only if and until time 
remains on slide (otherwise, save for later) 
• Hire me, I’ll find 45 more
It’s Fast. Really Fast. 
• 10 – 100x faster than MapReduce 
• 10 – 100x faster than Hive 
• Historical perspective 
– MapReduce is listed as the last most 
important software innovation 
– JRuby only got 2-3x faster with the JVM's invokedynamic 
– Hardware rarely improves more than 10x/year
It’s Pure Open Source 
• Commons-based Peer Production 
– Apache Software Foundation Top Level Project 
– 200 people from 50 Organizations Contributing 
– 12 Organizations Committing 
– Peer Governance 
– Participative Decision Making 
“The very essence of free software 
consists in considering contributing 
roles as public trusts, bestowed for the 
good of the community, and not for 
the benefit of an individual or a party” 
– Modern FOSS paraphrase 
“The very essence of a free government 
consists in considering offices as public 
trusts, bestowed for the good of the 
country, and not for the benefit of an 
individual or a party” 
– John C. Calhoun, 1835
Strong Enterprise Relationships 
• Spark ships in every major Hadoop distribution 
• Vertical enterprise use 
– Internet companies, government, financials 
– Churn analysis, fraud detection, risk analytics 
• Used with other data stores 
– DataStax (Cassandra) 
– MongoDB 
• Databricks has a cloud-based implementation
Enhances Other Big Data 
Implementations 
• Hadoop – Replacement for MapReduce 
• Cassandra – Analytics 
• Hive – Faster SQL processing 
• SAP HANA – Faster interactive analysis
API Stability 
• Guaranteed stability of its core API for 1.X 
• Spark has always been conservative with API 
changes 
• Clearly defined annotations for future APIs 
– Experimental 
– Alpha 
– Developer
Don’t Need to Learn a New Language 
• Scala 
• Java – 25% 
• Python – 30% 
• And soon R
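Spark's Scala API deliberately mirrors the standard collections API, which is a large part of why no new language is needed. A minimal sketch of a word count in plain Scala collections, with no Spark required (`LocalWordCount` and the sample lines are illustrative, not from the deck); the same flatMap/map chain runs against an RDD nearly unchanged:

```scala
// Local word count using plain Scala collections. The flatMap/map chain
// mirrors Spark's RDD API; groupBy + sum stands in for reduceByKey.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))               // split each line into words
      .map(word => (word, 1))              // pair each word with a count of 1
      .groupBy(_._1)                       // collect pairs by word
      .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // sum the counts

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("spark is fast", "spark is open")))
}
```

To move this to Spark, `lines` becomes `sc.textFile(...)` and `groupBy` + `sum` collapses into `reduceByKey(_ + _)`.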
Java 8 Lambda Support 
With Java 8 lambdas: 
JavaRDD<String> lines = sc.textFile("hdfs://log.txt"); 
JavaRDD<String> words = 
    lines.flatMap(line -> Arrays.asList(line.split(" "))); 
JavaPairRDD<String, Integer> counts = 
    words.mapToPair(w -> new Tuple2<String, Integer>(w, 1)) 
         .reduceByKey((x, y) -> x + y); 
counts.saveAsTextFile("hdfs://counts.txt"); 
The same code before Java 8: 
JavaRDD<String> lines = sc.textFile("hdfs://log.txt"); 
// Map each line to multiple words 
JavaRDD<String> words = lines.flatMap( 
    new FlatMapFunction<String, String>() { 
      public Iterable<String> call(String line) { 
        return Arrays.asList(line.split(" ")); 
      } 
    }); 
// Turn the words into (word, 1) pairs 
JavaPairRDD<String, Integer> ones = words.mapToPair( 
    new PairFunction<String, String, Integer>() { 
      public Tuple2<String, Integer> call(String w) { 
        return new Tuple2<String, Integer>(w, 1); 
      } 
    }); 
// Group up and add the pairs by key to produce counts 
JavaPairRDD<String, Integer> counts = ones.reduceByKey( 
    new Function2<Integer, Integer, Integer>() { 
      public Integer call(Integer i1, Integer i2) { 
        return i1 + i2; 
      } 
    }); 
counts.saveAsTextFile("hdfs://counts.txt");
Real-Time Stream Processing 
val ssc = new StreamingContext(args(0), "NetworkHashCount", 
  Seconds(10), System.getenv("SPARK_HOME"), 
  Seq(System.getenv("SPARK_EXAMPLES_JAR"))) 
val lines = ssc.socketTextStream("localhost", 9999) 
val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#")) 
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) 
wordCounts.print() 
ssc.start() 
The same counting logic as a batch job: 
val file = sc.textFile("hdfs://.../pagecounts-*.gz") 
val counts = file.flatMap(line => line.split(" ")) 
  .map(word => (word, 1)) 
  .reduceByKey(_ + _) 
counts.saveAsTextFile("hdfs://.../word-count")
Caching Interactive Algorithms 
val points = sc.textFile("...").map(parsePoint).cache() 
var w = Vector.random(D) // current separating plane 
for (i <- 1 to ITERATIONS) { 
  val gradient = points.map(p => 
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x 
  ).reduce(_ + _) 
  w -= gradient 
} 
println("Final separating plane: " + w)
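The gradient step above can be checked without Spark: the sketch below replaces the RDD with a local Seq and implements the same update rule in plain Scala (`Point`, `dot`, and `step` are illustrative names, not from the deck):

```scala
import scala.math.exp

// Plain-Scala sketch of one logistic-regression gradient step from the
// slide: sum each point's contribution, then update w by subtracting it.
object LogisticSketch {
  case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def step(points: Seq[Point], w: Array[Double]): Array[Double] = {
    // Each point contributes (1 / (1 + exp(-y * (w . x))) - 1) * y * x
    val gradient = points
      .map(p => p.x.map(_ * (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y))
      .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
    w.zip(gradient).map { case (wi, gi) => wi - gi } // w -= gradient
  }
}
```

Caching matters here because `points` is re-read on every one of the ITERATIONS passes; `.cache()` keeps it in memory instead of re-loading from HDFS each time.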
New Security Integration 
• Complete Integration with the Hadoop/YARN Security 
Model 
– Authenticate Job Submissions 
– Securely transfer HDFS credentials 
– Authenticate communication between components 
• Other deployments supported 
val conf = new SparkConf() 
conf.set("spark.authenticate", "true") 
conf.set("spark.authenticate.secret", "good")
And Lots More 
• Apache Spark Website 
• Databricks – making big data easy 
– Introduction to Apache Spark 
• Jul 28 – Austin, TX - More Info & Registration 
• Aug 25 – Chicago, IL - More Info & Registration


Editor's Notes

  • #14 There are a lot of features that you probably don’t know about, but you can find them at the Apache Spark Website or at Databricks, the company where a number of the leading contributors to Apache Spark work. Also be aware that Databricks offers an Introduction to Apache Spark, with events coming up on July 28 in Austin and August 25 in Chicago.