1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Spark and Object Stores
—What you need to know
Steve Loughran
stevel@hortonworks.com
@steveloughran
October 2016
Steve Loughran,
Hadoop committer, PMC member, …
Chris Nauroth,
Apache Hadoop committer & PMC
ASF member
Rajesh Balamohan
Tez Committer, PMC Member
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Elastic ETL
(diagram: inbound datasets land in an external object store, a transient HDFS/Spark cluster runs the ETL, and the results are written back out as ORC/Parquet datasets)
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Notebooks
(diagram: on-demand notebooks work against external datasets, with the notebook library itself kept in the object store)
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Streaming
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
A Filesystem: Directories, Files → Data

/
  work/
    pending/
      part-00   (data blocks: 00 00 00)
      part-01   (data blocks: 01 01 01)
    complete/
      part-01

rename("/work/pending/part-01", "/work/complete")
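A minimal sketch of that commit pattern through Hadoop's FileSystem API (paths as on the slide; on a real filesystem such as HDFS the rename is a quick, atomic metadata operation):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// promote the finished output into the final directory in one atomic step
val committed = fs.rename(new Path("/work/pending/part-01"), new Path("/work/complete"))
require(committed, "rename failed")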
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Object Store: hash(name) -> blob

(diagram: replicated blobs 00/01 scattered across storage shards s01, s02, s03, s04)

hash("/work/pending/part-01") -> ["s02", "s03", "s04"]
hash("/work/pending/part-00") -> ["s01", "s02", "s04"]

copy("/work/pending/part-01", "/work/complete/part01")
delete("/work/pending/part-01")
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
REST APIs

(diagram: the same blobs on shards s01–s04, manipulated through HTTP verbs)

HEAD /work/complete/part-01

PUT /work/complete/part01
x-amz-copy-source: /work/pending/part-01

DELETE /work/pending/part-01

PUT /work/pending/part-01
... DATA ...

GET /work/pending/part-01
Content-Length: 1-8192

GET /?prefix=/work&delimiter=/
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
org.apache.hadoop.fs.FileSystem
hdfs   s3a   wasb   adl   swift   gs
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Four Challenges
1. Classpath
2. Credentials
3. Code
4. Commitment
Let's look at S3 and Azure
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Use S3A to work with S3
(EMR: use Amazon's s3:// )
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Classpath: fix “No FileSystem for scheme: s3a”
hadoop-aws-2.7.x.jar
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
(jackson-*-2.6.5.jar)
See SPARK-7481
Get Spark with
Hadoop 2.7+ JARs
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Credentials
core-site.xml or spark-default.conf
spark.hadoop.fs.s3a.access.key MY_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key MY_SECRET_KEY
spark-submit automatically propagates Environment Variables
export AWS_ACCESS_KEY=MY_ACCESS_KEY
export AWS_SECRET_KEY=MY_SECRET_KEY
NEVER: share, check in to SCM, paste in bug reports…
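A hedged sketch of wiring those keys in when the session is built, reading them from the environment rather than hard-coding them (the app name is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-demo")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_KEY"))
  .getOrCreate()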
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Authentication Failure: 403
com.amazonaws.services.s3.model.AmazonS3Exception:
The request signature we calculated does not match
the signature you provided.
Check your key and signing method.
1. Check joda-time.jar & JVM version
2. Credentials wrong
3. Credentials not propagating
4. Local system clock (more likely on VMs)
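One way to narrow these down, sketched here for the spark-shell against the public bucket used later in this deck: a plain directory listing fails fast with the same 403 if the keys, clock, or classpath are wrong.

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// bind an S3A client from the Spark context's Hadoop configuration and list the bucket root
val fs = FileSystem.get(new URI("s3a://landsat-pds/"), sc.hadoopConfiguration)
fs.listStatus(new Path("/")).take(5).foreach(println)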
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Code: Basic IO
// Read in public dataset
val lines = sc.textFile("s3a://landsat-pds/scene_list.gz")
val lineCount = lines.count()
// generate and write data
val numbers = sc.parallelize(1 to 10000)
numbers.saveAsTextFile("s3a://hwdev-stevel-demo/counts")
All you need is the URL
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Code: just use the URL of the object store
val csvdata = spark.read.options(Map(
"header" -> "true",
"inferSchema" -> "true",
"mode" -> "FAILFAST"))
.csv("s3a://landsat-pds/scene_list.gz")
...read time O(distance)
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DataFrames
val landsat = "s3a://stevel-demo/landsat"
csvData.write.parquet(landsat)
val landsatOrc = "s3a://stevel-demo/landsatOrc"
csvData.write.orc(landsatOrc)
val df = spark.read.parquet(landsat)
val orcDf = spark.read.orc(landsatOrc)
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Finding dirty data with Spark SQL
val sqlDF = spark.sql(
"SELECT id, acquisitionDate, cloudCover"
+ s" FROM parquet.`${landsat}`")
val negativeClouds = sqlDF.filter("cloudCover < 0")
negativeClouds.show()
* filter columns and data early
* whether/when to cache()?
* copy popular data to HDFS (see the sketch below)
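A sketch of those tips combined (the hdfs:// destination path is illustrative): select only the columns you need, filter before anything else, and keep the frequently re-read subset on HDFS.

val clean = spark.read.parquet(landsat)
  .select("id", "acquisitionDate", "cloudCover")
  .filter("cloudCover >= 0")

clean.write.mode("overwrite").parquet("hdfs:///datasets/landsat-clean")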
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
spark-default.conf
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
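Most of the spark.sql.* entries above can also be toggled on a live session while experimenting; the spark.hadoop.* ones are best left in spark-defaults.conf or set when the session is built. A sketch:

spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")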
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Notebooks? Classpath & Credentials
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The Commitment Problem
⬢ rename() used for atomic commitment transaction
⬢ time to copy() + delete() proportional to data * files
⬢ S3: 6+ MB/s
⬢ Azure: a lot faster —usually
spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What about Direct Output Committers?
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Recent S3A Performance (Hadoop 2.8, HDP 2.5, CDH 5.9 (?))
// forward seek by skipping stream
spark.hadoop.fs.s3a.readahead.range 157810688
// faster backward seek for ORC and Parquet input
spark.hadoop.fs.s3a.experimental.input.fadvise random
// PUT blocks in separate threads
spark.hadoop.fs.s3a.fast.output.enabled true
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Azure Storage: wasb://
A full substitute for HDFS
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Classpath: fix “No FileSystem for scheme: wasb”
wasb:// : Consistent, with very fast rename (hence: commits)
hadoop-azure-2.7.x.jar
azure-storage-2.2.0.jar
+ (jackson-core; http-components, hadoop-common)
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Credentials: core-site.xml / spark-default.conf
<property>
<name>fs.azure.account.key.example.blob.core.windows.net</name>
<value>0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c</value>
</property>
spark.hadoop.fs.azure.account.key.example.blob.core.windows.net
0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c
wasb://demo@example.blob.core.windows.net
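Once the account key is in place, the container is addressed by URL like any other filesystem; a sketch reusing the demo container above (the object paths are illustrative):

val container = "wasb://demo@example.blob.core.windows.net"
val df = spark.read.text(s"$container/in")
df.write.orc(s"$container/out-orc")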
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Example: Azure Storage and Streaming
val streaming = new StreamingContext(sparkConf,Seconds(10))
val azure = "wasb://demo@example.blob.core.windows.net/in"
val lines = streaming.textFileStream(azure)
val matches = lines.map(line => {
println(line)
line
})
matches.print()
streaming.start()
* PUT into the streaming directory
* keep the dir clean
* size window for slow scans
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Not Covered
⬢ Partitioning/directory layout
⬢ Infrastructure Throttling
⬢ Optimal path names
⬢ Error handling
⬢ Metrics
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Summary
⬢ Object Stores look just like any other URL
⬢ …but do need classpath and configuration
⬢ Issues: performance, commitment
⬢ Use Hadoop 2.7+ JARs
⬢ Tune to reduce I/O
⬢ Keep those credentials secret!
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Backup Slides
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Often: Eventually Consistent

(diagram: copies of the deleted blob linger on shards s01, s02, s04)

DELETE /work/pending/part-00
GET /work/pending/part-00
GET /work/pending/part-00
(responses: 200, 200, 200; stale copies can still be served after the delete)
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
History of Object Storage Support, 2006 – 2016

s3://     "inode on S3"
s3n://    "Native" S3
s3://     Amazon EMR S3
s3a://    replaces s3n; later: stabilize, then speed and consistency
swift://  OpenStack
wasb://   Azure WASB
oss://    Aliyun
gs://     Google Cloud
adl://    Azure Data Lake
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Cloud Storage Connectors

Azure
  WASB   ● Strongly consistent
         ● Good performance
         ● Well-tested on applications (incl. HBase)
  ADL    ● Strongly consistent
         ● Tuned for big data analytics workloads

Amazon Web Services
  S3A    ● Eventually consistent; consistency work in progress by Hortonworks
         ● Performance improvements in progress
         ● Active development in Apache
  EMRFS  ● Proprietary connector used in EMR
         ● Optional strong consistency for a cost

Google Cloud Platform
  GCS    ● Multiple configurable consistency policies
         ● Currently Google open source
         ● Good performance
         ● Could improve test coverage
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scheme      Stable since   Speed        Consistency   Maintenance
s3n://      Hadoop 1.0                                (Apache)
s3a://      Hadoop 2.7     2.8+         ongoing       Apache
swift://    Hadoop 2.2                                Apache
wasb://     Hadoop 2.7     Hadoop 2.7   strong        Apache
adl://      Hadoop 3
EMR s3://   AWS EMR                     For a fee     Amazon
gs://       ???                                       Google @ github
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
S3 Server-Side Encryption
⬢ Encryption of data at rest at S3
⬢ Supports the SSE-S3 option: each object encrypted by a unique key using the AES-256 cipher (configuration sketch below)
⬢ Now covered in S3A automated test suites
⬢ Support for additional options under development (SSE-KMS and SSE-C)
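A hedged configuration sketch for turning SSE-S3 on from the S3A side, assuming the standard property name; check the documentation for the Hadoop release in use:

spark.hadoop.fs.s3a.server-side-encryption-algorithm AES256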
41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Advanced authentication
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
com.amazonaws.auth.InstanceProfileCredentialsProvider,
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
</value>
</property>
+ encrypted credentials in JCEKS files on HDFS
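A sketch of that setup, assuming the Hadoop credential CLI and the standard provider-path property (the jceks:// path itself is illustrative): create the secrets once, keep the keystore on HDFS, and point jobs at it instead of pasting keys into configs.

hadoop credential create fs.s3a.access.key -provider jceks://hdfs@nn/user/demo/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs@nn/user/demo/s3.jceks

spark.hadoop.hadoop.security.credential.provider.path jceks://hdfs@nn/user/demo/s3.jceks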
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What Next? Performance and integration
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Next Steps for all Object Stores
⬢ Output Committers
– Logical commit operation decoupled from rename (non-atomic and costly in object stores)
⬢ Object Store Abstraction Layer
– Avoid impedance mismatch with FileSystem API
– Provide specific APIs for better integration with object stores: saving, listing, copying
⬢ Ongoing Performance Improvement
⬢ Consistency
44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dependencies in Hadoop 2.8

S3A:
hadoop-aws-2.8.x.jar
aws-java-sdk-core-1.10.6.jar
aws-java-sdk-kms-1.10.6.jar
aws-java-sdk-s3-1.10.6.jar
joda-time-2.9.3.jar
(jackson-*-2.6.5.jar)

Azure:
hadoop-azure-2.8.x.jar
azure-storage-4.2.0.jar
Editor's Notes
  1. Now people may be saying "hang on, these aren't spark developers". Well, I do have some integration patches for spark, but a lot of the integration problems are actually lower down: -filesystem connectors -ORC performance -Hive metastore Rajesh has been doing lots of scale runs and profiling, initially for Hive/Tez, now looking at Spark, including some of the Parquet problems. Chris has done work on HDFS, Azure WASB and most recently S3A Me? Co-author of the Swift connector. author of the Hadoop FS spec and general mentor of the S3A work, even when not actively working on it. Been full time on S3A, using Spark as the integration test suite, since March
  2. This is one of the simplest deployments in cloud: scheduled/dynamic ETL. Incoming data sources saving to an object store; spark cluster brought up for ETL. Either direct cleanup/filter or multistep operations, but either way: an ETL pipeline. HDFS on the VMs for transient storage, the object store used as the destination for data —now in a more efficient format such as ORC or Parquet
  3. Notebooks on demand. ; it talks to spark in cloud which then does the work against external and internal data; Your notebook itself can be saved to the object store, for persistence and sharing.
  4. Example: streaming on Azure
  5. Everything uses the Hadoop APIs to talk to HDFS, Hadoop-compatible filesystems and object stores: the Hadoop FS API. There are actually two: the one with a clean split between client side and "driver side", and the older one which is a direct connect. Most use the latter, and in terms of opportunities for object-store integration tweaking, it is actually the one where we can innovate most easily. That is: there's nothing in the way. Under the FS API go filesystems and object stores. HDFS is a "real" filesystem; WASB/Azure is close enough. What is "real"? Best test: can it support HBase.
  6. you used to have to disable summary data in the spark context Hadoop options, but https://issues.apache.org/jira/browse/SPARK-15719 fixed that for you
  7. It looks the same, you just need to be as aggressive about minimising IO as you can -push down predicates -only select the columns you want -filter -If you read a lot, write to HDFS then re-use. cache()? I don't know. If you do, filter as much as you can first: columns, predicates, ranges, so that parquet/orc can read as little as it needs to, and RAM use is least.
  8. If your distributor didn't stick the JARs in, you can add the hadoop-aws and hadoop-azure dependencies in the interpreter config. Credentials: keep them out of notebooks. Zeppelin can list its settings too, which is always dangerous (mind you, so do HDFS and YARN, so an XInclude is handy there). When running in EC2, S3 credentials are now automatically picked up. And if Zeppelin is launched with the AWS env vars set, its invocation of spark-submit should pass them down.
  9. This invariably ends up reaching us on JIRA, to the extent that I've got a document somewhere explaining the problem in detail. It was taken away because it can corrupt your data without you noticing. This is generally considered harmful.
  10. without going into the details, here are things you will want for Hadoop 2.8. They are in HDP 2.5, possible in the next CDH release. The first two boost input by reducing the cost of seeking, which is expensive as it breaks then re-opens the HTTPS connection. Readahead means that hundreds of KB can be skipped before that connect (yes, it can take that long to reconnect). The experimental fadvise random feature speeds up backward reads at the expense of pure-forward file reads. It is significantly faster for reading in optimized binary formats like ORC and Parquet The last one is a successor to fast upload in Hadoop 2.7. That buffers on heap and needs careful tuning; its memory needs conflict with RDD caching. The new version defaults to buffering as files on local disk, so won't run out of memory. Offers the potential of significantly more effective use of bandwidth; the resulting partitioned files may also offer higher read perf. (No data there, just hearsay).
  11. See "Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency" for details. Essentially it has the semantics HBase needs, that being our real compatibility test.
  12. Azure storage is unique in that there's a published paper (+ video) on its internals. Well worth looking at to understand what's going on. In contrast, if you want to know S3 internals, well, you can ply the original author with gin and he still won't reveal anything. ADL adds okhttp for HTTP/2 performance, and yet another JSON parser for unknown reasons.
  13. Azure storage is unique in that there's a published paper (+ video) on its internals. Well worth looking at to understand what's going on. In contrast, if you want to know S3 internals, well, you can ply the original author with gin and he still won't reveal anything. ADL adds okhttp for HTTP/2 performance, and yet another JSON parser for unknown reasons.
  14. This is the history
  15. Ignoring Hadoop's own S3
  16. Hadoop 2.8 adds a lot of control here (credit: Netflix, + later us & Cloudera). -You can define a list of credential providers to use; the default is simple, env, instance, but you can add temporary and anonymous, choose which are unsupported, etc. -Passwords/secrets can be encrypted in Hadoop credential files stored locally or on HDFS. -IAM auth is what EC2 VMs need.
  17. And this is the big one, as it spans the lot: Hadoop's own code (so far: distcp), Spark, Hive, Flink, related tooling. If we can't speed up the object stores, we can tune the apps