The 10 Apache Spark Features 
You (Unlikely) Didn't Hear About 
Roger Brinkley 
Technical Evangelist
The 10 Apache Spark Features You 
(Unlikely) Didn't Hear About 
• 10 minutes – 10 slides 
• Ignite Format 
• No stopping! 
• No going back! 
• Questions? Sure, but only while time 
remains on the slide (otherwise, save them for later) 
• Hire me, I’ll find 45 more
It’s Fast, Really Fast 
• 10-100x faster than MapReduce 
• 10-100x faster than Hive 
• Historical perspective 
– MapReduce is Listed as the Last Most Important Software Innovation 
– JRuby 2-3x faster with the invokedynamic JVM instruction 
– Hardware rarely improves by more than 10x per year
It’s Pure Open Source 
• Commons-based Peer Production 
– Apache Software Foundation Top Level Project 
– 200 people from 50 organizations contributing 
– 12 Organizations Committing 
– Peer Governance 
– Participative Decision Making 
The very essence of a free software 
consists in considering contributing 
roles as public trusts, bestowed for the 
good of the community, and not for 
the benefit of an individual or a party 
Modern FOSS John C. Calhoun 
The very essence of a free government 
consists in considering offices as public 
trusts, bestowed for the good of the 
country, and not for the benefit of an 
individual or a party 
John C. Calhoun, 2/13/1835
Strong Enterprise Relationships 
• Spark is in every major Hadoop distribution 
• Vertical enterprise use 
– Internet companies, government, financials 
– Churn analysis, fraud detection, risk analytics 
• Used in other data stores 
– DataStax (Cassandra) 
– MongoDB 
• Databricks has a cloud-based implementation
Enhances Other Big Data 
Implementations 
• Hadoop – Replacement of MapReduce 
• Cassandra – Analytics 
• Hive – Faster SQL processing 
• SAP HANA – Faster interactive analysis
API Stability 
• Guaranteed stability of its core API for 1.X 
• Spark has always been conservative with API 
changes 
• Clearly defined annotations for future APIs 
– Experimental 
– Alpha 
– Developer
Don’t Need to Learn a New Language 
• Scala 
• Java – 25% 
• Python – 30% 
• And soon R
Java 8 Lambda Support 
// Before Java 8: anonymous inner classes 
JavaRDD<String> lines = sc.textFile("hdfs://log.txt"); 
// Map each line to multiple words 
JavaRDD<String> words = lines.flatMap( 
  new FlatMapFunction<String, String>() { 
    public Iterable<String> call(String line) { 
      return Arrays.asList(line.split(" ")); 
    } 
  }); 
// Turn the words into (word, 1) pairs 
JavaPairRDD<String, Integer> ones = words.mapToPair( 
  new PairFunction<String, String, Integer>() { 
    public Tuple2<String, Integer> call(String w) { 
      return new Tuple2<String, Integer>(w, 1); 
    } 
  }); 
// Group up and add the pairs by key to produce counts 
JavaPairRDD<String, Integer> counts = ones.reduceByKey( 
  new Function2<Integer, Integer, Integer>() { 
    public Integer call(Integer i1, Integer i2) { 
      return i1 + i2; 
    } 
  }); 
counts.saveAsTextFile("hdfs://counts.txt"); 

// With Java 8 lambdas 
JavaRDD<String> lines = sc.textFile("hdfs://log.txt"); 
JavaRDD<String> words = 
  lines.flatMap(line -> Arrays.asList(line.split(" "))); 
JavaPairRDD<String, Integer> counts = 
  words.mapToPair(w -> new Tuple2<String, Integer>(w, 1)) 
       .reduceByKey((x, y) -> x + y); 
counts.saveAsTextFile("hdfs://counts.txt");
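The lambda style on this slide can be tried without a cluster. The following is a minimal sketch that assumes only the JDK's java.util.stream package, not Spark's API; the class name LocalWordCount and the sample input are made up for illustration. It follows the same shape as the slide: split lines into words, pair each word with 1, and add the pairs up by key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical local stand-in for the slide's word count; uses JDK streams, not Spark.
class LocalWordCount {

    // Split each line into words, map each word to 1, and sum the ones per word.
    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split(" ")))
            .filter(word -> !word.isEmpty())
            .collect(Collectors.toMap(
                word -> word,          // key: the word itself
                word -> 1,             // value: one occurrence
                (x, y) -> x + y,       // merge: same role as reduceByKey on the slide
                TreeMap::new));        // sorted map so printed output is stable
    }

    public static void main(String[] args) {
        // Sample input is illustrative only.
        System.out.println(count(Arrays.asList("a b a", "b c")));
    }
}
```

The merge function passed to Collectors.toMap plays the part that reduceByKey plays on the slide, but everything here runs in one JVM with no partitioning or HDFS access.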
Real-Time Stream Processing 
// Batch word count 
val file = sc.textFile("hdfs://.../pagecounts-*.gz") 
val counts = file.flatMap(line => line.split(" ")) 
  .map(word => (word, 1)) 
  .reduceByKey(_ + _) 
counts.saveAsTextFile("hdfs://.../word-count") 

// Streaming: count words starting with "#" from a socket in 10-second batches 
val ssc = new StreamingContext(args(0), 
  "NetworkHashCount", Seconds(10), 
  System.getenv("SPARK_HOME"), 
  Seq(System.getenv("SPARK_EXAMPLES_JAR"))) 
val lines = ssc.socketTextStream("localhost", 9999) 
val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#")) 
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) 
wordCounts.print() 
ssc.start()
Caching Interactive Algorithms 
val points = sc.textFile("...").map(parsePoint).cache() 
var w = Vector.random(D) // current separating plane 
for (i <- 1 to ITERATIONS) { 
  val gradient = points.map(p => 
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x 
  ).reduce(_ + _) 
  w -= gradient 
} 
println("Final separating plane: " + w)
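The loop body on this slide is a logistic-regression gradient step. As a rough single-machine sketch in plain Java (no Spark, so nothing is cached or distributed), assuming labels of +1 or -1 and an implicit step size of 1 as on the slide, the same update can be written out as below. The GradientSketch and Point names and the two sample points are hypothetical.

```java
import java.util.Arrays;

// Hypothetical plain-Java version of the slide's update; not a Spark class.
class GradientSketch {

    // One labelled point: feature vector x and label y (+1 or -1).
    static final class Point {
        final double[] x;
        final double y;
        Point(double[] x, double y) { this.x = x; this.y = y; }
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Sum over all points of (1 / (1 + exp(-y * (w . x))) - 1) * y * x,
    // i.e. the slide's map step followed by its reduce(_ + _).
    static double[] gradient(Point[] points, double[] w) {
        double[] g = new double[w.length];
        for (Point p : points) {
            double scale = (1.0 / (1.0 + Math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y;
            for (int i = 0; i < g.length; i++) g[i] += scale * p.x[i];
        }
        return g;
    }

    // Repeats w -= gradient for a fixed number of iterations.
    static double[] train(Point[] points, double[] w0, int iterations) {
        double[] w = w0.clone();
        for (int it = 0; it < iterations; it++) {
            double[] g = gradient(points, w);
            for (int i = 0; i < w.length; i++) w[i] -= g[i];
        }
        return w;
    }

    public static void main(String[] args) {
        // Two made-up points that are separable along the first axis.
        Point[] points = {
            new Point(new double[] {1.0, 0.0}, 1.0),
            new Point(new double[] {-1.0, 0.0}, -1.0)
        };
        double[] w = train(points, new double[] {0.0, 0.0}, 5);
        System.out.println("Final separating plane: " + Arrays.toString(w));
    }
}
```

Each iteration reads every point again, which is why the slide calls cache() on the RDD: in Spark the parsed points stay in memory between passes instead of being re-read and re-parsed from storage.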
New Security Integration 
• Complete Integration with Hadoop/YARN Security Model 
– Authenticate Job Submissions 
– Securely transfer HDFS credentials 
– Authenticate communication between components 
• Other deployments supported 
val conf = new SparkConf 
conf.set("spark.authenticate", "true") 
conf.set("spark.authenticate.secret", "good")
And Lots More 
• Apache Spark Website 
• Databricks – making big data easy 
– Introduction to Apache Spark 
• Jul 28 – Austin, TX - More Info & Registration 
• Aug 25 – Chicago, IL - More Info & Registration

Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 

10 Things About Spark

  • 1. The 10 Apache Spark Features You (Unlikely) Didn't Hear About Roger Brinkley Technical Evangelist
  • 2. The 10 Apache Spark Features You (Unlikely) Didn't Hear About • 10 minutes – 10 slides • Ignite Format • No stopping! • No going back! • Questions? Sure, but only if and until time remains on slide (otherwise, save for later) • Hire me, I’ll find 45 more
  • 3. It’s Fast, Really Fast • 10 – 100x faster than MapReduce • 10 – 100x faster than Hive • Historical perspective: MapReduce is Listed as the Last Most Important Software Innovation – JRuby 2-3x Faster with InvokeDynamic JVM – Hardware rarely gets greater than 10x/year
  • 4. It’s Pure Open Source • Commons-based Peer Production – Apache Software Foundation Top Level Project – 200 people from 50 Organizations Contributing – 12 Organizations Committing – Peer Governance – Participative Decision Making • Modern FOSS John C. Calhoun: “The very essence of a free software consists in considering contributing roles as public trusts, bestowed for the good of the community, and not for the benefit of an individual or a party” • John C. Calhoun, 2/13/1835: “The very essence of a free government consists in considering offices as public trusts, bestowed for the good of the country, and not for the benefit of an individual or a party”
  • 5. Strong Enterprise Relationships • Spark is in every major Hadoop distributor • Vertical enterprise use – Internet companies, government, financials – Churn analysis, fraud detection, risk analytics • Used in other data stores – Datastax (Cassandra) – MongoDB • Databricks has a cloud based implementation
  • 6. Enhances Other Big Data Implementations • Hadoop – Replacement of MapReduce • Cassandra – Analytics • Hive – Faster SQL processing • SAP Hana – Faster interactive analysis
  • 7. API Stability • Guaranteed stability of its core API for 1.X • Spark has always been conservative with API changes • Clearly defined annotations for future APIs – Experimental – Alpha – Developer
  • 8. Don’t Need to Learn a New Language • Scala • Java – 25% • Python – 30% • And soon R
  • 9. Java 8 Lambda Support
      Before (anonymous inner classes):
        JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
        // Map each line to multiple words
        JavaRDD<String> words = lines.flatMap(
          new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) {
              return Arrays.asList(line.split(" "));
            }
          });
        // Turn the words into (word, 1) pairs
        JavaPairRDD<String, Integer> ones = words.mapToPair(
          new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String w) {
              return new Tuple2<String, Integer>(w, 1);
            }
          });
        // Group up and add the pairs by key to produce counts
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(
          new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer i1, Integer i2) {
              return i1 + i2;
            }
          });
        counts.saveAsTextFile("hdfs://counts.txt");
      With Java 8 lambdas:
        JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
        JavaPairRDD<String, Integer> counts =
          words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
               .reduceByKey((x, y) -> x + y);
        counts.saveAsTextFile("hdfs://counts.txt");
  • 10. Real Time Stream Process
      Batch word count:
        val file = sc.textFile("hdfs://.../pagecounts-*.gz")
        val counts = file.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://.../word-count")
      Streaming hashtag count:
        val ssc = new StreamingContext(args(0), "NetworkHashCount", Seconds(10),
          System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
        val lines = ssc.socketTextStream("localhost", 9999)
        val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
        val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
        wordCounts.print()
        ssc.start()
  • 11. Caching Interactive Algorithms
        val points = sc.textFile("...").map(parsePoint).cache()
        var w = Vector.random(D) // current separating plane
        for (i <- 1 to ITERATIONS) {
          val gradient = points.map(p =>
            (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
          ).reduce(_ + _)
          w -= gradient
        }
        println("Final separating plane: " + w)
  • 12. New Security Integration • Complete Integration with Hadoop/YARN Security Model – Authenticate Job Submissions – Securely transfer HDFS credentials – Authenticate communication between components • Other deployments supported
        val conf = new SparkConf
        conf.set("spark.authenticate", "true")
        conf.set("spark.authenticate.secret", "good")
  • 13. And Lots More • Apache Spark Website • Databricks – making big data easy – Introduction to Apache Spark • Jul 28 – Austin, TX - More Info & Registration • Aug 25 – Chicago, IL - More Info & Registration
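
The contrast on slide 9 (anonymous inner classes versus Java 8 lambdas) can be tried without a cluster. The sketch below is an illustration only: it uses the standard java.util.stream API rather than Spark, and the class name LambdaWordCount, the two splitter constants, and the countWords helper are hypothetical names that do not appear in the deck.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LambdaWordCount {

    // Anonymous inner class syntax, the verbose form the slide compares against.
    static final Function<String, List<String>> SPLIT_ANONYMOUS =
        new Function<String, List<String>>() {
            @Override
            public List<String> apply(String line) {
                return Arrays.asList(line.split(" "));
            }
        };

    // The same splitting logic written as a Java 8 lambda expression.
    static final Function<String, List<String>> SPLIT_LAMBDA =
        line -> Arrays.asList(line.split(" "));

    // Splits each line into words, flattens them, and sums a 1 per word by key,
    // mirroring the flatMap / (word, 1) / reduce-by-key shape of the slide.
    static Map<String, Integer> countWords(List<String> lines,
                                           Function<String, List<String>> splitter) {
        return lines.stream()
            .flatMap(line -> splitter.apply(line).stream())
            .collect(Collectors.toMap(w -> w, w -> 1, (x, y) -> x + y, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark is fast", "spark is open");
        // Both splitters produce the same counts; only the syntax differs.
        System.out.println(countWords(lines, SPLIT_ANONYMOUS));
        System.out.println(countWords(lines, SPLIT_LAMBDA));
    }
}
```

Both calls return identical maps, which is the point of the slide: the lambda form changes how much code is written, not what it computes.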

Editor's Notes

  1. There are a lot of features that you probably don’t know about, but you can find them at the Apache Spark Website or at Databricks, the company where a number of the leading contributors to Apache Spark work. Also be aware that Databricks offers an Introduction to Apache Spark, with events coming up on July 28 in Austin and August 25 in Chicago.