A presentation prepared for Data Stack as part of their interview process on July 20.
This presentation, in ignite format, features 10 items that you might not know about the Spark 1.0 release.
10 Things About Spark
1. The 10 Apache Spark Features You (Unlikely) Didn't Hear About
Roger Brinkley
Technical Evangelist
2. The 10 Apache Spark Features You (Unlikely) Didn't Hear About
• 10 minutes – 10 slides
• Ignite Format
• No stopping!
• No going back!
• Questions? Sure, but only if and until time
remains on slide (otherwise, save for later)
• Hire me, I’ll find 45 more
3. It's Fast, Really Fast
• 10–100x faster than MapReduce
• 10–100x faster than Hive
• Historical perspective
– MapReduce is listed as the last most important software innovation
– JRuby is 2–3x faster with the JVM's invokedynamic
– Hardware rarely gets better than 10x/year
4. It’s Pure Open Source
• Commons-based Peer Production
– Apache Software Foundation Top Level Project
– 200 people from 50 organizations contributing
– 12 organizations committing
– Peer Governance
– Participative Decision Making
"The very essence of free software consists in considering contributing roles as public trusts, bestowed for the good of the community, and not for the benefit of an individual or a party"
– Modern FOSS adaptation
"The very essence of a free government consists in considering offices as public trusts, bestowed for the good of the country, and not for the benefit of an individual or a party"
– John C. Calhoun, 2/13/1835
5. Strong Enterprise Relationships
• Spark ships in every major Hadoop distribution
• Vertical enterprise use
– Internet companies, government, financials
– Churn analysis, fraud detection, risk analytics
• Used in other data stores
– DataStax (Cassandra)
– MongoDB
• Databricks has a cloud-based implementation
6. Enhances Other Big Data
Implementations
• Hadoop – Replacement of Map Reduce
• Cassandra – Analytics
• Hive – Faster SQL processing
• SAP Hana – Faster interactive analysis
7. API Stability
• Guaranteed stability of its core API for 1.X
• Spark has always been conservative with API
changes
• Clearly defined annotations for future APIs
– Experimental
– Alpha
– Developer
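The annotations above mark APIs that sit outside the 1.X stability guarantee. As a minimal sketch of the pattern (a hypothetical re-declaration for illustration, not Spark's own `org.apache.spark.annotation` classes), a marker annotation lets both readers and tooling see which methods may still change:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical marker annotation in the style Spark uses to flag
// unstable APIs; retained at runtime so tools can inspect it.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@interface Experimental {}

public class ApiStabilityDemo {
    // Flagged as experimental: excluded from the stability guarantee.
    @Experimental
    static String unstableFeature() { return "subject to change"; }

    public static void main(String[] args) throws Exception {
        // Reflection can report whether a method carries the marker.
        boolean flagged = ApiStabilityDemo.class
            .getDeclaredMethod("unstableFeature")
            .isAnnotationPresent(Experimental.class);
        System.out.println("unstableFeature is experimental: " + flagged);
    }
}
```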
8. Don’t Need to Learn a New Language
• Scala
• Java – 25%
• Python – 30%
• And soon R
9. Java 8 Lambda Support
With Java 8 lambdas:

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
JavaRDD<String> words =
  lines.flatMap(line -> Arrays.asList(line.split(" ")));
JavaPairRDD<String, Integer> counts =
  words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
       .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://counts.txt");

The same program without lambdas:

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
// Map each line to multiple words
JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
      return Arrays.asList(line.split(" "));
    }
  });
// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String w) {
      return new Tuple2<String, Integer>(w, 1);
    }
  });
// Group up and add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  });
counts.saveAsTextFile("hdfs://counts.txt");
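The same flatMap → map → reduce-by-key shape can be run locally with plain Java 8 streams, which shows how much code the lambdas save independently of Spark. This is a hedged sketch with hypothetical names, not Spark API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

// Word count with plain Java 8 streams: flatMap splits lines into
// words, then groupingBy/counting plays the role of reduceByKey.
public class StreamWordCount {
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split(" ")))
            .collect(groupingBy(w -> w, counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
            countWords(Arrays.asList("a b a", "b c"));
        System.out.println(counts); // a=2, b=2, c=1
    }
}
```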
10. Real Time Stream Process
Batch word count:

val file = sc.textFile("hdfs://.../pagecounts-*.gz")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://.../word-count")

Streaming hashtag count:

val ssc = new StreamingContext(args(0), "NetworkHashCount", Seconds(10),
  System.getenv("SPARK_HOME"),
  Seq(System.getenv("SPARK_EXAMPLES_JAR")))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
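The key idea in Spark Streaming is micro-batching: the input stream is chopped into small batches (here, 10 seconds each) and every batch gets the same word-count logic as the batch job. A toy Java sketch of that model, with hypothetical names and none of Spark's scheduling or fault tolerance:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the micro-batch model: each batch of lines is
// processed independently with the same hashtag-count logic as the
// slide's streaming example.
public class MicroBatchSketch {
    static Map<String, Integer> countBatch(List<String> batch) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : batch)
            for (String w : line.split(" "))
                if (w.startsWith("#"))          // keep hashtags, as on the slide
                    counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // Two simulated 10-second batches of socket input.
        List<List<String>> batches = Arrays.asList(
            Arrays.asList("#spark is #fast", "#spark scales"),
            Arrays.asList("plain words only"));
        for (List<String> batch : batches)
            System.out.println(countBatch(batch));
    }
}
```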
11. Caching Interactive Algorithms
val points = sc.textFile("...").map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
println("Final separating plane: " + w)
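Caching pays off here because the same point set is re-scanned on every iteration. A plain-Java, one-dimensional sketch of the same gradient loop (hypothetical data and names, for illustration only) makes that access pattern concrete:

```java
// One-dimensional logistic regression by gradient descent. The data
// arrays play the role of Spark's cached RDD: materialized once,
// then re-scanned on every iteration of the loop.
public class LogisticSketch {
    static double train(double[] x, double[] y, int iterations) {
        double w = 0.0; // current separating plane
        for (int iter = 0; iter < iterations; iter++) {
            double gradient = 0.0;
            for (int i = 0; i < x.length; i++) // full pass over cached points
                gradient += (1.0 / (1.0 + Math.exp(-y[i] * w * x[i])) - 1.0)
                            * y[i] * x[i];
            w -= gradient;
        }
        return w;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, -1.0, -2.0}; // features
        double[] y = {1.0, 1.0, -1.0, -1.0}; // labels (+1 / -1)
        System.out.println("Final separating plane: " + train(x, y, 100));
    }
}
```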
12. New Security Integration
• Complete Integration with the Hadoop/YARN Security Model
– Authenticate Job Submissions
– Securely Transfer HDFS Credentials
– Authenticate Communication Between Components
• Other deployments supported
val conf = new SparkConf
conf.set("spark.authenticate", "true")
conf.set("spark.authenticate.secret", "good")
13. And Lots More
• Apache Spark Website
• Databricks – making big data easy
– Introduction to Apache Spark
• Jul 28 – Austin, TX - More Info & Registration
• Aug 25 – Chicago, IL - More Info & Registration
Editor's Notes
There are a lot of features that you probably don't know about, but you can find them at the Apache Spark Website or at Databricks, the company where a number of the leading contributors to Apache Spark work. Also be aware that Databricks offers an Introduction to Apache Spark, with events coming up on July 28 in Austin and August 25 in Chicago.