1 © Hortonworks Inc. 2011–2018. All rights reserved
Cloud Storage:
PUT is the new rename()
Steve Loughran
Hortonworks R&D
stevel@hortonworks.com
June 2018
@steveloughran
2 © Hortonworks Inc. 2011–2018. All rights reserved
• Why do the cloud stores matter?
• Where do the stores stand today?
• What new & interesting stuff is there?
Hadoop committer; busy on object storage
Speaker & Talk Overview
3 © Hortonworks Inc. 2011–2018. All rights reserved
Why cloud storage?
• Long-lived persistent store for cloud-hosted analytics applications
• HDFS Backup/restore
• Sharing, data exchange, data collection workflows, …
org.apache.hadoop.fs.FileSystem
hdfs · s3a · wasb · adl · swift · gs · abfs
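All of these connectors sit behind the same org.apache.hadoop.fs.FileSystem API, so application code doesn't change per store. A minimal sketch of using the API directly; the bucket and path are placeholders, and credentials are assumed to come from configuration or the environment:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// The URI scheme ("s3a", "wasb", "adl", "gs", ...) selects the connector.
val conf = new Configuration()
val path = new Path("s3a://example-bucket/data/part-0000")  // placeholder bucket/path
val fs = FileSystem.get(path.toUri, conf)

val status = fs.getFileStatus(path)
println(s"${status.getPath} is ${status.getLen} bytes")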
4 © Hortonworks Inc. 2011–2018. All rights reserved
All examples in Spark 2.3; "S3 landsat CSV" as source
// copy the gzipped Landsat scene index from S3 to the local filesystem
val landsatCsvGZOnS3 = new Path("s3a://landsat-pds/scene_list.gz")
val landsatCsvGZ = new Path("file:///tmp/scene_list.gz")
copyFile(landsatCsvGZOnS3, landsatCsvGZ, conf, false)

// build a DataFrame over the CSV, then cache a small low-cloud-cover sample from 2014
val csvSchema = LandsatIO.buildCsvSchema()
val csvDataFrame = LandsatIO.addLandsatColumns(
  spark.read.options(LandsatIO.CsvOptions).
    schema(csvSchema).csv(landsatCsvGZ.toUri.toString))
val filteredCSV = csvDataFrame.
  sample(false, 0.01d).
  filter("cloudCover < 15 and year=2014").cache()
5 © Hortonworks Inc. 2011–2018. All rights reserved
Azure wasb://
The ~filesystem for Azure
6 © Hortonworks Inc. 2011–2018. All rights reserved
Azure wasb:// connector shipping and stable
• REST API
• Tested heavily by Microsoft
• Tested heavily by Hortonworks
• New! adaptive seek() algorithm
• New! more Ranger lockdown
Note: Ranger implemented client-side — needs locked-down clients to work
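For illustration, a sketch of reading from a wasb:// path in Spark; the storage account, container and environment variable are placeholders, and in practice the account key would live in a credential provider rather than in code:

// placeholder account ("examplestorageacct") and container ("data")
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.examplestorageacct.blob.core.windows.net",
  sys.env.getOrElse("AZURE_STORAGE_KEY", ""))  // key from the environment, not inline

val events = spark.read
  .option("header", "true")
  .csv("wasb://data@examplestorageacct.blob.core.windows.net/events/2018/06/")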
8 © Hortonworks Inc. 2011–2018. All rights reserved
Azure Datalake
adl://
9 © Hortonworks Inc. 2011–2018. All rights reserved
Azure Datalake: Analytics Store
• API matches Hadoop APIs 1:1
• hadoop-azuredatalake JAR
• Runs on YARN
• Authenticates via OAuth2
Ongoing: resilience & supportability
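A sketch of the OAuth2 client-credential settings the adl:// connector expects; the tenant, application ID and secret below are placeholders and would normally be kept in a credential provider:

import org.apache.hadoop.conf.Configuration

// placeholder Azure AD tenant/application values
val conf = new Configuration()
conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
conf.set("fs.adl.oauth2.refresh.url",
  "https://login.microsoftonline.com/TENANT_ID/oauth2/token")
conf.set("fs.adl.oauth2.client.id", "APPLICATION_ID")
conf.set("fs.adl.oauth2.credential", sys.env.getOrElse("ADL_CLIENT_SECRET", ""))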
11 © Hortonworks Inc. 2011–2018. All rights reserved
Azure Data Lake Storage Gen 2 (Preview)
Bigger, better, faster, lower cost.
In beta; live 2H18
12 © Hortonworks Inc. 2011–2018. All rights reserved
abfs:// is the connector
• New REST API
• HADOOP-15407: add the hadoop connector
• Store and client to succeed wasb:// and adl://
• Old endpoints to remain; wasb:// will continue to work.
abfs:// connector being developed with store; usual stabilization process
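The demo on the next slide assumes an abfs base path has already been defined; a sketch of what that might look like, with placeholder container and account names:

import org.apache.hadoop.fs.Path

// abfs:// URIs follow the same container@account pattern as wasb://,
// but against the new .dfs.core.windows.net endpoint (placeholder names)
val abfs = new Path("abfs://demo@exampleaccount.dfs.core.windows.net/landsat")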
13 © Hortonworks Inc. 2011–2018. All rights reserved
abfs demo:
val orcDataAbfs = new Path(abfs, "orcData")
logDuration("write to ABFS ORC") {
  filteredCSV.write.
    partitionBy("year", "month").
    mode(SaveMode.Append).format("orc").
    save(orcDataAbfs.toString)
}
14 © Hortonworks Inc. 2011–2018. All rights reserved
Amazon S3: Use with Care:
inconsistency is its flaw
15 © Hortonworks Inc. 2011–2018. All rights reserved
S3Guard: High Performance & Consistent Metadata for S3
• Uses DynamoDB for a consistent view
• Ensures subsequent operations see new files
• Filters out recently deleted files
• Can speed up listStatus() & getFileStatus() calls
• Stops rename() (and so job commit) missing files
Prevents data loss during Hadoop, Hive, Spark queries
16 © Hortonworks Inc. 2011–2018. All rights reserved
S3Guard: fast consistent metadata via Dynamo DB
[Diagram: file parts 00 and 01 spread across S3 shards s01–s04, with the request sequence
DELETE part-00 → 200, HEAD part-00 → 200, HEAD part-00 → 404, PUT part-00 → 200
illustrating S3's eventually consistent metadata, which the DynamoDB table masks]
17 © Hortonworks Inc. 2011–2018. All rights reserved
S3Guard: bind a bucket to a store, initialize the table, use
<property>
<name>fs.s3a.bucket.hwdev-steve-ireland-new.metadatastore.impl</name>
<value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
hadoop s3guard init -read 50 -write 50 s3a://hwdev-steve-ireland-new
…
Metadata Store Diagnostics:
ARN=arn:aws:dynamodb:eu-west-1:980678866538:table/hwdev-steve-ireland-new
description=S3Guard metadata store in DynamoDB
name=hwdev-steve-ireland-new
read-capacity=20
region=eu-west-1
retryPolicy=ExponentialBackoffRetry(maxRetries=9, sleepTime=100 MILLISECONDS)
size=15497
status=ACTIVE
18 © Hortonworks Inc. 2011–2018. All rights reserved
• Task commit: rename to job directory
• Job commit: rename to final directory
• Filesystems & most stores: O(directories)
• S3: COPY operation at ~6–10 MB/s
• Without S3Guard, risk of loss
Next: the commit problem
/work
 ├── part-00
 ├── part-01
 └── _temporary
      └── attempt_123
           └── part-01

rename("/work/_temporary/attempt_123/part-01", "/work/")
19 © Hortonworks Inc. 2011–2018. All rights reserved
MR
Commit Protocol
20 © Hortonworks Inc. 2011–2018. All rights reserved
Spark
Commit Protocol
21 © Hortonworks Inc. 2011–2018. All rights reserved
PUT replaces rename()
[Diagram: parts 00 and 01 written across S3 shards s01–s04]

Task 00:
  put00 = POST "Multipart Upload" /work/part00
  POST "Block" put00, /work/part00, data
Task 01:
  put01 = POST "Multipart Upload" /work/part01
  POST "Block" put01, /work/part01, data
Job commit:
  POST "Complete" put00, /work/part00
  POST "Complete" put01, /work/part01
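The same three steps, expressed against the AWS SDK for Java v1 purely to illustrate the underlying REST calls; bucket, key and file are placeholders, and in practice the S3A committers drive these calls, not application code:

import java.io.File
import java.util.Arrays
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{CompleteMultipartUploadRequest,
  InitiateMultipartUploadRequest, UploadPartRequest}

val s3 = AmazonS3ClientBuilder.defaultClient()
val bucket = "example-bucket"      // placeholder bucket
val key = "work/part00"            // placeholder destination key

// 1. task start: initiate the multipart upload ("POST Multipart Upload")
val init = s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key))

// 2. task write: upload the data as one or more parts ("POST Block")
val data = new File("/tmp/part00") // placeholder local data
val etag = s3.uploadPart(new UploadPartRequest()
  .withBucketName(bucket).withKey(key)
  .withUploadId(init.getUploadId)
  .withPartNumber(1)
  .withFile(data)
  .withPartSize(data.length())).getPartETag

// 3. job commit: complete the upload ("POST Complete"); only now is the object visible
s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
  bucket, key, init.getUploadId, Arrays.asList(etag)))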
22 © Hortonworks Inc. 2011–2018. All rights reserved
S3A Committers + Spark binding
spark.sql.sources.commitProtocolClass
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class
org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.hadoop.fs.s3a.committer.name staging
spark.hadoop.fs.s3a.committer.staging.conflict-mode replace
Also
spark.hadoop.fs.s3a.committer.name partitioned
spark.hadoop.fs.s3a.committer.name magic
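A sketch of wiring the same settings up programmatically when building the session, assuming the SPARK-23977 cloud-committer module and the S3A committers are on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-committer-demo")
  // route Spark's commit protocol through the cloud committer bindings
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  // pick the staging committer; "partitioned" and "magic" are the alternatives
  .config("spark.hadoop.fs.s3a.committer.name", "staging")
  .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
  .getOrCreate()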
23 © Hortonworks Inc. 2011–2018. All rights reserved
S3A Committer writing ORC and Parquet to S3A
val orcDataS3 = new Path(ireland, "orcData")
// write the dataframe (presumably read back from the earlier WASB ORC output)
// to S3A as partitioned ORC
orcWasbDataframe.write.
  partitionBy("year", "month").
  mode(SaveMode.Append).orc(orcDataS3.toString)

val orcS3DF = spark.read.orc(orcDataS3.toString).as[LandsatImage]

// ...and again as Parquet
val parquetS3 = new Path(ireland, "parquetData")
orcS3DF.write.
  partitionBy("year", "month").
  mode(SaveMode.Append).parquet(parquetS3.toString)

cat(new Path(parquetS3, "_SUCCESS"))
24 © Hortonworks Inc. 2011–2018. All rights reserved
CSV files: the "Dark Matter" of data
25 © Hortonworks Inc. 2011–2018. All rights reserved
CSV: less a format, more a mess
• Commas vs tabs? Mixed?
• Single/double/no quotes. Escaping?
• Encoding?
• Schema? At best: a header
• Missing columns?
• NULL?
• Numbers?
• Date and time?
• Compression?
• CSV injection attacks (cells with = as first char) [Excel, Google Docs]
26 © Hortonworks Inc. 2011–2018. All rights reserved
S3 Select: SQL Queries on CSV/JSON In S3 Itself
• SQL Select issued when opening a file to read
• Filter rows from WHERE clause
• Project subset of columns
• Use CSV header for column names
• Cost of GET as normal
• Save on bandwidth
27 © Hortonworks Inc. 2011–2018. All rights reserved
S3 SELECT against the landsat index dataset
# examine data
hadoop s3guard select -header use -compression gzip \
  -limit 10 \
  s3a://landsat-pds/scene_list.gz \
  "SELECT s.entityId FROM S3OBJECT s WHERE s.cloudCover = '0.0'"

# copy subset of data from S3 to ADL
hadoop s3guard select -header use -compression gzip \
  -out adl://steveleu.azuredatalakestore.net/landsat.csv \
  s3a://landsat-pds/scene_list.gz \
  "SELECT * FROM S3OBJECT s WHERE s.cloudCover = '0.0'"
28 © Hortonworks Inc. 2011–2018. All rights reserved
DEMO: remote client, 300MB /1M rows of AWS landsat index
time hadoop fs -text $LANDSATGZ | grep ",0.0,"
=> 44s

# filter in the select
time hadoop s3guard select -header use -compression gzip $LANDSATGZ \
  "SELECT * from S3OBJECT s where s.cloudCover = '0.0'"
=> 12.20s

# select and project
time hadoop s3guard select -header use -compression gzip $LANDSATGZ \
  "SELECT s.entityId FROM S3OBJECT s WHERE s.cloudCover = '0.0' LIMIT 100"
=> 4s
29 © Hortonworks Inc. 2011–2018. All rights reserved
S3 Select Transcoding
Input
• JSON-per-row format
• UTF-8 CSV with configured separator, quote policy, optional header
• Raw or .gz
Output
• JSON-per-row format
• UTF-8 CSV with configured separator, quote policy, NO HEADER
• loss of CSV header hurts Spark CSV Schema inference
30 © Hortonworks Inc. 2011–2018. All rights reserved
Integration: Work in Progress. Hadoop 3.2?
• Later AWS SDK JAR, upgrades "mildly stressful"
• new S3AFileSystem.select(path, expression) method
• Filesystem config options for: compression, separator, quotes
• TODO: open() call which takes optional & mandatory properties
(HADOOP-15229)
• TODO: adoption by Hive, Spark, …
31 © Hortonworks Inc. 2011–2018. All rights reserved
Google Cloud Storage
32 © Hortonworks Inc. 2011–2018. All rights reserved
Google Cloud: Increasing use drives storage
1. More users (TensorFlow motivator)
2. More tests (e.g. the GCS team using ASF test suites, ...)
3. Google starting to contribute optimizations back to the OSS projects (e.g.
HDFS-13056 for exposing consistent checksums)
4. +Cloud BigTable offers HBase APIs
Attend: Running Apache Hadoop on the Google Cloud Platform
Grand Ballroom 220A, Wednesday, 16:40
33 © Hortonworks Inc. 2011–2018. All rights reserved
What else?
34 © Hortonworks Inc. 2011–2018. All rights reserved
Security & Encryption
• Enable everywhere: transparent encryption at rest
• Customer-key-managed encryption available
— may have cost/performance impact
• Key-with-request encryption: usable with care
• Client side encryption. Troublesome
Permissions:
• ADL, GCS: permissions managed in store
• WASB: Ranger checks in connector: needs locked-down client
• S3: IAM Policies (server side, limited syntax, minimal hadoop support)
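As one concrete example on the S3 side, a sketch of enabling SSE-KMS through the S3A connector; the property names are the documented S3A options of this era, and the key ARN is a placeholder:

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// server-side encryption with a customer-managed KMS key (placeholder ARN)
conf.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
conf.set("fs.s3a.server-side-encryption.key",
  "arn:aws:kms:eu-west-1:000000000000:key/EXAMPLE-KEY-ID")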
35 © Hortonworks Inc. 2011–2018. All rights reserved
DistCP++
• Old: HDFS ⟺ HDFS
• New: HDFS ⟺ cloud, container ⟺ container, cloud ⟺ cloud, …
We are going to have to evolve DistCP for this
1. New: Cloud-optimized -delete option
2. TODO: stop the rename().
3. Need a story for incremental updates across stores
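For reference, a typical HDFS-to-cloud copy with today's DistCp looks something like this (namenode, bucket and paths are placeholders); -update copies only changed files and -delete removes destination files missing from the source:

hadoop distcp -update -delete \
  hdfs://namenode:8020/warehouse/events \
  s3a://example-bucket/warehouse/events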
37 © Hortonworks Inc. 2011–2018. All rights reserved
How do they compare? (excluding abfs)
Can run HBase (i.e. "filesystem"):
  AWS S3: No · Azure WASB: Yes · Azure ADL: No · Google GCS: No
Safe destination of work:
  AWS S3: with S3Guard or EMRFS consistent view · Azure WASB: Yes · Azure ADL: Yes · Google GCS: Yes
Fast destination of work:
  AWS S3: with S3A committers · Azure WASB: Yes · Azure ADL: Yes · Google GCS: Yes
Security:
  AWS S3: IAM Roles · Azure WASB: Ranger in locked-down client · Azure ADL: Ranger · Google GCS: ACLs
Encryption:
  AWS S3: server-side; can use AWS KMS · Azure WASB: yes, keys in Key Vault · Azure ADL: yes, keys in Key Vault · Google GCS: always; can supply key with IO request
38 © Hortonworks Inc. 2011–2018. All rights reserved
To Conclude
• All cloud infrastructures have object stores we can use
• Width and depth of product offerings increasing
• Azure teams most engaged with the Hadoop project
• S3 is the least suited as a destination, but we have solutions there.
• Cloud Storage evolving: S3 Select one notable example
Data analytics is a key cloud workload; everything is adapting
39 © Hortonworks Inc. 2011–2018. All rights reserved
Questions?
stevel@hortonworks.com
40 © Hortonworks Inc. 2011–2018. All rights reserved
Thank you
stevel@hortonworks.com
41 © Hortonworks Inc. 2011–2018. All rights reserved
Code used in this talk
• Hadoop trunk + HADOOP-15407 (abfs://) and HADOOP-15364 (S3 Select) patches
• Spark master with SPARK-23977 PR applied
• Hortonworks Spark Cloud integration module
various utils + schema for Landsat
https://github.com/hortonworks-spark/cloud-integration
• Cloudstore: diagnostics & utils
https://github.com/steveloughran/cloudstore
Editor's Notes
  1. TALK TRACK Hortonworks Powers the Future of Data: data-in-motion, data-at-rest, and Modern Data Applications. [NEXT SLIDE]
  2. I'm going to warn there's a bias in this talk to S3. Not out of any personal preference, just because I've been busiest there
  3. If there's a fault with the adl:// connector, it's usually "error(null)". This can mean anything, including "you don't have your proxy settings and an HTML page is coming back, which confuses our JSON parser"… the kind of problem which surfaces in the field. Also need to highlight the possibility that it's caching things client-side for performance; if true, there's a risk of inconsistency.
  4. A key thing to note here is that the Hadoop connector dev has been going on in sync with the store service, with full-stack regression testing going on. This increases its
  5. Details about out-of-band, and source of truth etc.
  6. Details about out-of-band, and source of truth etc.
  7. I seem to be getting bug reports related to DistCp and clouds assigned to me. This is probably a metric of (a) rising popularity and (b) performance/functionality issues. I've got a quick PoC of actually using Spark to schedule the work: you can do a basic distcp successor in 200 LoC by ignoring all the existing options and throttling. Tempting to use as a basis for a CloudCP, but as distcp is often used as a library for other apps, it fails to meet some core deployment needs. More likely: extend distcp with the ability to save & restore (src, dest) checksums and then compare these on the next incremental update.
  8. Choosing which cloud infrastructure to use is something which tends to choose what your storage option is going to be: AWS and GCS dictate the options. Azure is interesting in that there are two: WASB vs ADL, and a successor coming