Why you should care about
data layout in the file system
Cheng Lian, @liancheng
Vida Ha, @femineer
Spark Summit 2017
TEAM
About Databricks
Started the Spark project (now Apache Spark) at UC Berkeley in 2009

PRODUCT
Unified Analytics Platform

MISSION
Making Big Data Simple
Apache Spark is a
powerful framework
with some temper
Just like Super Mario

Serve him the right ingredients

Powers up and gets more efficient

Keep serving

He even knows how to Spark!

However, once served a wrong dish...

Meh...

And sometimes...

It can be messy...

Secret sauces we feed Spark
File Formats
Choosing a compression scheme

The obvious
• Compression ratio: the higher the better
• De/compression speed: the faster the better

Choosing a compression scheme
Splittable vs. non-splittable
• Affects parallelism, crucial for big data
• Common splittable compression schemes
• LZ4, Snappy, BZip2, LZO, etc.
• GZip is non-splittable
• Still common if file sizes are << 1 GB
• Still applicable for Parquet
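Choosing a codec is mostly a matter of configuration. A minimal sketch (not from the talk; it assumes a SparkSession named spark and a DataFrame df):

// Pick the codec for a single Parquet write...
df.write.option("compression", "snappy").parquet("/data/out")

// ...or globally for all Parquet writes
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")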
Columnar formats
Smart, analytics-friendly, optimized for big data
• Support for nested data types
• Efficient data skipping
• Column pruning
• Min/max statistics based predicate push-down
• Nice interoperability
• Examples:
• Spark SQL built-in support: Apache Parquet and Apache ORC
• Newly emerging: Apache CarbonData and Spinach
Columnar formats
Parquet
• Apache Spark's default output format
• Usually the best practice for Spark SQL
• Relatively heavy write path
• Worth the encoding time for repeated analytics scenarios
• Does not support fine-grained appending
• Not ideal for, e.g., collecting logs
• Check out Parquet presentations for more details
Semi-structured text formats
Sort of structured but not self-describing
• Excellent write path performance but slow on the read path
• Good candidates for collecting raw data (e.g., logs)
• Subject to inconsistent and/or malformed records
• Schema inference provided by Spark (for JSON and CSV)
• Sampling-based
• Handy for exploratory scenarios but can be inaccurate
• Always specify an accurate schema in production (see the sketch below)
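A minimal sketch of pinning an explicit schema instead of relying on inference (the field names are illustrative, not from the talk):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("artist", StringType),
  StructField("rating", IntegerType)))

// No sampling pass: Spark uses exactly this schema
val events = spark.read.schema(schema).json("/data/events")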
Semi-structured text formats
JSON
• Supported by Apache Spark out of the box
• One JSON object per line for fast file splitting
• JSON object: map or struct?
• Spark schema inference always treats JSON objects as structs
• Watch out for an arbitrary number of keys (may OOM executors)
• Specify an accurate schema if you decide to stick with maps (sketch below)
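If the objects really are maps with unbounded keys, a hedged sketch of forcing a map schema (the attributes field is hypothetical):

import org.apache.spark.sql.types._

// One bounded map column instead of a struct with thousands of fields
val schema = StructType(Seq(
  StructField("attributes", MapType(StringType, StringType))))

val df = spark.read.schema(schema).json("/data/events")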
Semi-structured text formats
JSON
• Malformed records
• Bad records are collected into the column _corrupt_record
• All other columns are set to null
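A sketch of separating good and bad records under the default PERMISSIVE mode (the cache() call is the documented workaround for newer Spark versions, which reject filters referencing only the corrupt-record column on an uncached relation):

// The inferred schema gains _corrupt_record when bad lines are present
val raw = spark.read.json("/data/logs").cache()

val bad  = raw.filter(raw("_corrupt_record").isNotNull)
val good = raw.filter(raw("_corrupt_record").isNull).drop("_corrupt_record")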
Semi-structured text formats
CSV
• Supported by Spark 2.x out of the box
• Check out the spark-csv package for Spark 1.x
• Often used for handling legacy data providers & consumers
• Lacks a standard file specification
– Separator, escaping, quoting, etc.
• Lacks support for nested data types
Raw text files

Arbitrary line-based text files
• Split files into lines using spark.read.text()
• Keep your lines a reasonable size
• Keep file size < 1 GB if compressed with a non-splittable compression scheme (e.g., GZip)
• Handling inevitable malformed data (see the sketch below)
• Use a filter() transformation to drop bad lines, or
• Use a map() transformation to fix bad lines
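A minimal sketch of both options (isWellFormed and repairLine are hypothetical placeholders, not Spark APIs):

import spark.implicits._

// Hypothetical line checks; substitute real parsing logic
def isWellFormed(line: String): Boolean = line.contains("\t")
def repairLine(line: String): String = line.trim

val lines = spark.read.text("/data/raw")

// Drop bad lines...
val kept = lines.filter(row => isWellFormed(row.getString(0)))

// ...or fix them, yielding a Dataset[String]
val fixed = lines.map(row => repairLine(row.getString(0)))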
Directory layout

Partitioning
(Figure: a Hive-style partitioned directory tree, e.g. year=2017/genre=classic/ and year=2017/genre=folk/ directories holding album files)
Overview
• Coarse-grained data skipping
• Available for both persisted tables and raw directories
• Automatically discovers Hive-style partitioned directories

Partitioning

SQL
CREATE TABLE ratings
USING PARQUET
PARTITIONED BY (year, genre)
AS SELECT artist, rating, year, genre
FROM music

DataFrame API
spark
  .table("music")
  .select('artist, 'rating, 'year, 'genre)
  .write
  .format("parquet")
  .partitionBy("year", "genre")
  .saveAsTable("ratings")
Partitioning
(Figure: the same directory tree, with only the partitions matching the filter being scanned)

Filter predicates
Use simple filter predicates containing partition columns to leverage partition pruning
Partitioning

Filter predicates that can leverage partition pruning:
• year = 2000 AND genre = 'folk'
• year > 2000 AND rating > 3
• year > 2000 OR genre <> 'rock'

Filter predicates that cannot (a non-partition column appears under OR, or is compared against a partition column):
• year > 2000 OR rating = 5
• year > rating
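A quick sketch of a pruned read, using the ratings table defined earlier:

// Only directories satisfying the predicate are listed and scanned
spark.table("ratings")
  .where("year > 2000 AND genre = 'folk'")
  .show()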
Partitioning
Avoid excessive partitions
• Stresses the metastore for persisted tables
• Stresses the file system when reading directly from the file system
• Suggestions
• Avoid using too many partition columns
• Avoid using partition columns with too many distinct values
– Try hashing the values (see the sketch below)
– E.g., partition by the first letter of the first name rather than the first name
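A sketch of deriving a low-cardinality partition column first (the users DataFrame and first_name column are hypothetical):

import org.apache.spark.sql.functions._

users
  .withColumn("name_initial", upper(substring(col("first_name"), 1, 1)))
  .write
  .partitionBy("name_initial")   // a few dozen partitions instead of one per name
  .parquet("/data/users")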
Partitioning

Scalable partition handling
Using persisted partitioned tables with Spark 2.1+
• Per-partition metadata gets persisted into the metastore
• Avoids unnecessary partition discovery (esp. valuable for S3)
Check our blog post for more details
Bucketing

Overview
• Pre-shuffles and optionally pre-sorts the data while writing
• Layout information gets persisted in the metastore
• Avoids shuffling and sorting when joining large datasets
• Only available for persisted tables
Bucketing

SQL
CREATE TABLE ratings
USING PARQUET
PARTITIONED BY (year, genre)
CLUSTERED BY (rating) SORTED BY (rating) INTO 5 BUCKETS
AS SELECT artist, rating, year, genre
FROM music

DataFrame API
ratings
  .select('artist, 'rating, 'year, 'genre)
  .write
  .format("parquet")
  .partitionBy("year", "genre")
  .bucketBy(5, "rating")
  .sortBy("rating")
  .saveAsTable("ratings")
Bucketing
In combo with columnar formats
• Bucketing
• Per-bucket sorting
• Columnar formats
• Efficient data skipping based on min/max statistics
• Works best when the searched columns are sorted
Bucketing
(Figure: buckets sorted by the bucketing column, with min/max statistics (min=0, max=99; min=100, max=199; min=200, max=249) enabling data skipping)
Bucketing
In combo with columnar formats
Perfect combination, makes your Spark jobs FLY!
More tips
File size and compaction
Avoid small files
• Cause excessive parallelism
• Spark 2.x improves this by packing small files
• Cause extra file metadata operations
• Particularly bad when hosted on S3
File size and compaction

How to control output file sizes (see the sketch below)
• In general, one task in the output stage writes one file
• Tune the parallelism of the output stage
• coalesce(N): reduces parallelism, for small jobs
• repartition(N): increases parallelism for all jobs, or reduces the parallelism of only the final output stage for large jobs
• Still preserves high parallelism for previous stages
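A minimal sketch of both knobs (the file counts are illustrative):

// Small job: merge output into 8 files without an extra shuffle
df.coalesce(8).write.parquet("/data/out")

// Large job: the extra shuffle keeps earlier stages fully parallel
// while the output stage writes exactly 64 files
df.repartition(64).write.parquet("/data/out")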
True story
Customer
• Spark ORC Read Performance is much slower than Parquet
• The same query took
• 3 seconds on a Parquet dataset
• 4 minutes on an equivalent ORC dataset
True story
Me
• Ran a simple count(*), which took
• Seconds on the Parquet dataset, with a handful of IO requests
• 35 minutes on the ORC dataset, with 10,000s of IO requests
• Most task execution threads were reading ORC stripe footers
True story
import org.apache.hadoop.hive.ql.io.orc._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val conf = new Configuration

// Count the stripes of a single ORC file by reading its metadata
def countStripes(file: String): Int = {
  val path = new Path(file)
  val reader = OrcFile.createReader(path, OrcFile.readerOptions(conf))
  val metadata = reader.getMetadata
  metadata.getStripeStatistics.size
}
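Hypothetical usage, assuming the dataset's ORC file paths were collected into a Seq[String] named orcFiles:

val maxStripes = orcFiles.map(countStripes).max   // ~1,400 in this story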
True story

Maximum file size: ~15 MB
Maximum ORC stripe count: ~1,400
True story

Root cause
Malformed (but not corrupted) ORC dataset
• ORC readers read the footer first before consuming a stripe
• ~1,400 stripes within a single file as small as 15 MB
• ~1,400 x 2 read requests issued to S3 for merely 15 MB of data
True story

Much worse than even CSV, not to mention Parquet.
True story

Why?
• Tiny ORC files (~10 KB) generated by streaming jobs
• Resulting in one tiny ORC stripe inside each ORC file
• The footers might take even more space than the actual data!
True story
Why?
Tiny files got compacted into larger ones using
ALTER TABLE ... PARTITION (...) CONCATENATE;
The CONCATENATE command just, well, concatenated those tiny
stripes and produced larger (~15 MB) files with a huge number of
tiny stripes.
True story

Lessons learned
Again, avoid writing small files in columnar formats
• Write output files using CSV or JSON for streaming jobs
• For better write path performance
• Compact small files into large chunks of columnar files later
• For better read path performance
True story

The cure
Simply read the ORC dataset and write it back using

spark.read.orc(input).write.orc(output)

so that stripes are adjusted to more reasonable sizes.
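A slightly fuller sketch that also merges tiny files during the rewrite (the target of 64 output files is illustrative):

spark.read.orc("/data/orc")
  .coalesce(64)   // fewer, larger files; stripes get rebuilt on write
  .write.orc("/data/orc-compacted")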
Schema evolution
Columns come and go
• Never ever change the data type of a published column
• Columns with the same name should have the same data type
• If you really dislike the data type of some column (see the sketch below)
• Add a new column with a new name and the right data type
• Deprecate the old one
• Optionally, drop it after updating all downstream consumers
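A sketch of the add-and-deprecate step (the column names are hypothetical):

import org.apache.spark.sql.functions._

// New, correctly typed column alongside the deprecated one;
// drop "rating" once all consumers read "rating_int"
val migrated = df.withColumn("rating_int", col("rating").cast("int"))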
Schema evolution
Columns come and go
Spark built-in data sources that support schema evolution
• JSON
• Parquet
• ORC
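For Parquet specifically, merging schemas across files must be requested explicitly; a minimal sketch:

// Unions the (compatible) schemas of all files under the path
val merged = spark.read.option("mergeSchema", "true").parquet("/data/evolving")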
Schema evolution
Common columnar formats are less tolerant of data type mismatches. E.g.:
• INT cannot be promoted to LONG
• FLOAT cannot be promoted to DOUBLE
JSON is more tolerant, though:
• LONG → DOUBLE → STRING
True story
Customer
Parquet dataset corrupted!!! HALP!!!
True story
What happened?
Original schema
• {col1: DECIMAL(19, 4), col2: INT}
Accidentally appended data with schema
• {col1: DOUBLE, col2: DOUBLE}
All files written into the same directory
True story
What happened?
Common columnar formats are less tolerant of data type mismatches. E.g.:
• INT cannot be promoted to LONG
• FLOAT cannot be promoted to DOUBLE
Parquet considered these schemas incompatible and refused to merge them.
True story
BTW
JSON schema inference is more tolerant:
• LONG → DOUBLE → STRING
However:
• JSON is NOT suitable for analytics scenarios
• Schema inference is unreliable, not suitable for production
True story
The cure
Correct the schema
• Filter out all the files with the wrong schema
• Rewrite those files using the correct schema (see the sketch below)
Exhausting, because all files were appended into a single directory
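A hedged sketch of the rewrite step, using the schemas from the story (the file path is hypothetical):

import org.apache.spark.sql.functions._

// Cast the mis-typed columns back to the original schema
spark.read.parquet("/data/ratings/bad-part.parquet")
  .select(
    col("col1").cast("decimal(19,4)").as("col1"),
    col("col2").cast("int").as("col2"))
  .write.parquet("/data/ratings-fixed")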
True story
Lessons learned
• Be very careful on the write path
• Consider partitioning when possible
• Better read path performance
• Easier to fix the data when something goes wrong
Recap

File formats
• Compression schemes
• Columnar (Parquet, ORC)
• Semi-structured (JSON, CSV)
• Raw text format

Directory layout
• Partitioning
• Bucketing

Other tips
• File sizes and compaction
• Schema evolution
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (community edition)

DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES

Try for free today.
databricks.com
Early draft available for free today!
go.databricks.com/book
Thank you
Q & A