LESSONS LEARNED AT TWITTER
HADOOP PERFORMANCE OPTIMIZATION AT SCALE
ALEX LEVENSON | @THISWILLWORK
IAN O'CONNELL | @0X138
DATA PLATFORM @TWITTER
Develop, maintain, and support the core data processing
libraries used at Twitter
In a good position to make system-wide performance
improvements
Core Data Libraries Team
DATA PLATFORM @TWITTER
Idiomatic functional Scala library for writing Hadoop map reduce
Functional programming is a natural fit for map reduce
Compile time type checked
Core Data Libraries Team
github.com/twitter/scalding
DATA PLATFORM @TWITTER
Columnar storage format for the Hadoop ecosystem
Uses the Google Dremel column shredding and assembly
algorithm
Core Data Libraries Team
APACHE PARQUET
github.com/apache/parquet-mr
DATA PLATFORM @TWITTER
Streaming map reduce for hybrid realtime / batch topologies
Write once, execute in parallel on Storm / Heron (online) and
Scalding (offline)
Core Data Libraries Team
SUMMINGBIRD
github.com/twitter/summingbird
Hadoop at Twitter Scale
HADOOP AT TWITTER
300+ PETABYTES OF DATA
100k MAP REDUCE JOBS DAILY
MULTIPLE 1000+ MACHINE HADOOP CLUSTERS
AMONG THE LARGEST HADOOP CLUSTERS IN THE WORLD
At this scale, even small system-wide improvements can save significant amounts of compute resources
COST AT SCALE
What does your Hadoop cluster
spend most of its time doing?
WHAT TO IMPROVE?
Pro๏ฌle your cluster, you might be
surprised by what you ๏ฌnd
MEASURE, DON'T GUESS
ENABLE JVM PROFILING WITH -XPROF
Built into the JVM (HotSpot), so there's nothing to install
Xprof: a low-overhead profiler built into the JVM
mapreduce.task.profile=true
mapreduce.task.profile.maps=0-
mapreduce.task.profile.reduces=0-
mapreduce.task.profile.params=-Xprof
ENABLE JVM PROFILING WITH -XPROF
Low overhead (uses stack sampling)
Surfaces the most expensive methods
Prints directly to task logs (stdout)
Xprof: a low-overhead profiler built into the JVM
Flat profile of 412.48 secs (38743 total ticks): SpillThread
Interpreted + native Method
12.5% 0 + 32215 org.apache.hadoop.io.compress.lz4.Lz4Compressor.compressBytesDirect
4.6% 0 + 822 java.io.FileOutputStream.writeBytes
...
19.4% 352 + 3082 Total interpreted (including elided)
Compiled + native Method
50.0% 8549 + 299 java.lang.StringCoding.decode
16.9% 2823 + 158 cascading.tuple.hadoop.io.HadoopTupleInputStream.getNextElement
4.1% 734 + 0 sun.nio.cs.UTF_8$Decoder.decode
2.3% 401 + 0 org.apache.hadoop.mapred.IFileOutputStream.write
2.0% 352 + 0 cascading.tuple.hadoop.util.TupleComparator.compare
1.7% 296 + 0 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare
...
79.0% 13514 + 467 Total compiled
Thread-local ticks:
54.3% 21053 Blocked (of total)
HADOOP CONFIGURATION OBJECT
Looks and behaves a lot like a HashMap
Surprisingly expensive
Configuration conf = new Configuration();
conf.set("myKey", "myValue");
String value = conf.get("myKey");
HADOOP CONFIGURATION OBJECT
Constructor reads + unzips + parses an XML file from disk
Surprisingly expensive
public class KryoSerialization {
  public KryoSerialization() {
    this(new Configuration());
  }
}
HADOOP CONFIGURATION OBJECT
get() method involves regular expressions and variable substitution
Surprisingly expensive
String value = conf.get("myKey")
HADOOP CONFIGURATION OBJECT
Calling these methods in a loop, or once per record, is
expensive
Some (non-trivial) jobs were spending 30% of their time in
Configuration methods
Surprisingly expensive
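The fix is usually mechanical: read the value once, outside the per-record path. The sketch below is hypothetical; SlowConfig is a stand-in for Hadoop's real Configuration, with a get() that mimics the per-call substitution cost, and the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: SlowConfig stands in for Hadoop's Configuration.
// The point is the pattern: look values up once, not once per record.
class ConfigCaching {
    static class SlowConfig {
        private final Map<String, String> values = new HashMap<>();
        void set(String key, String value) { values.put(key, value); }
        String get(String key) {
            String raw = values.get(key);
            // the real Configuration.get() does regex-based variable substitution
            return raw == null ? null : raw.replaceAll("\\$\\{[^}]*\\}", "");
        }
    }

    // Anti-pattern: one conf.get() per record
    static int countMatchesSlow(SlowConfig conf, String[] records) {
        int matches = 0;
        for (String record : records) {
            if (record.equals(conf.get("myKey"))) matches++;
        }
        return matches;
    }

    // Fix: hoist the lookup out of the loop
    static int countMatchesFast(SlowConfig conf, String[] records) {
        String wanted = conf.get("myKey"); // single lookup
        int matches = 0;
        for (String record : records) {
            if (record.equals(wanted)) matches++;
        }
        return matches;
    }
}
```

In a real Hadoop job, the same pattern means reading config values in setup() rather than in map() or reduce().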
It's hard to predict what needs to be optimized without a profiler
LESSON LEARNED
If you don't profile, you could be missing easy wins
LESSON LEARNED
Measure whether IO or CPU is your
biggest cost
LESSON LEARNED
INTERMEDIATE COMPRESSION
Xprof surfaced that compression + decompression in the spill
thread was taking a lot of time
Intermediate outputs are temporary
We now use lz4 instead of lzo level 3, which produces 30%
larger intermediate data that's faster to read
Made some large jobs 1.5X faster
Find the right balance
Record Serialization + Deserialization
can be the most expensive part of
your job
LESSON LEARNED
Record Serialization is CPU intensive,
and may overshadow IO
LESSON LEARNED
How to reduce costs due to record
serialization?
LESSON LEARNED
USE HADOOP'S RAW COMPARATOR API
Hadoop MR deserializes the map output keys in order to sort
them between the map and reduce phases
Don't make sorting more expensive than it already is
deserialize(keyBytes1).compare(deserialize(keyBytes2))
USE HADOOP'S RAW COMPARATOR API
This can cost a lot, especially for complex non-primitive keys,
which is fairly common
Don't make sorting more expensive than it already is
requests.groupBy { req => (req.country, req.client) }
USE HADOOP'S RAW COMPARATOR API
This can cost a lot, especially for complex non-primitive keys,
which is fairly common
Don't make sorting more expensive than it already is
Complex object
that requires sorting
requests.groupBy { req => (req.country, req.client) }
Flat profile of 412.48 secs (38743 total ticks): SpillThread
Interpreted + native Method
12.5% 0 + 32215 org.apache.hadoop.io.compress.lz4.Lz4Compressor.compressBytesDirect
4.6% 0 + 822 java.io.FileOutputStream.writeBytes
...
19.4% 352 + 3082 Total interpreted (including elided)
Compiled + native Method
50.0% 8549 + 299 java.lang.StringCoding.decode
16.9% 2823 + 158 cascading.tuple.hadoop.io.HadoopTupleInputStream.getNextElement
4.1% 734 + 0 sun.nio.cs.UTF_8$Decoder.decode
2.3% 401 + 0 org.apache.hadoop.mapred.IFileOutputStream.write
2.0% 352 + 0 cascading.tuple.hadoop.util.TupleComparator.compare
1.7% 296 + 0 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare
...
79.0% 13514 + 467 Total compiled
Thread-local ticks:
54.3% 21053 Blocked (of total)
USE HADOOP'S RAW COMPARATOR API
Hadoop comes with a RawComparator API for comparing
records in their serialized (raw) form
Don't make sorting more expensive than it already is
deserialize(keyBytes1).compare(deserialize(keyBytes2))
compare(keyBytes1, keyBytes2)
USE HADOOP'S RAW COMPARATOR API
Hadoop comes with a RawComparator API for comparing
records in their serialized (raw) form
Don't make sorting more expensive than it already is
public interface RawComparator<T> {
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2);
}
USE HADOOP'S RAW COMPARATOR API
Unfortunately, this requires you to write a custom comparator
by hand
And assumes that your data is actually easy to compare in its
serialized form
Don't make sorting more expensive than it already is
public interface RawComparator<T> {
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
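The idea can be sketched without any Hadoop dependency. If keys are serialized in an order-preserving way (here, big-endian longs with the sign bit flipped so that unsigned byte order matches numeric order), sorting can compare raw bytes and never deserialize. This is only a sketch: the compare signature mirrors RawComparator.compare, but the class is not Hadoop's.

```java
import java.nio.ByteBuffer;

// Sketch: order-preserving serialization makes raw byte comparison
// equivalent to comparing the deserialized values.
class RawLongComparator {
    // Big-endian with the sign bit flipped: unsigned byte order == numeric order
    static byte[] serialize(long value) {
        return ByteBuffer.allocate(8).putLong(value ^ Long.MIN_VALUE).array();
    }

    // Mirrors RawComparator.compare(byte[], int, int, byte[], int, int)
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff; // unsigned byte comparison
            int b = b2[s2 + i] & 0xff;
            if (a != b) return a - b;
        }
        return l1 - l2;
    }
}
```

No object is allocated and nothing is deserialized on the sort path, which is exactly what the slow default path above pays for on every comparison.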
SCALA MACROS FOR RAW COMPARATORS
Macros to the rescue!
A slightly more hipster API for Raw Comparators in Scala
And a handful of macros to generate implementations of this
API for tuples, case classes, thrift objects, primitives, Strings,
etc.
SCALA MACROS FOR RAW COMPARATORS
Macros to the rescue!
First, creates a custom dense serialization format that's easy to
compare
[diagram: two records in the dense format, with marker bytes flagging null vs non-null String and int fields]
SCALA MACROS FOR RAW COMPARATORS
Macros to the rescue!
Then, creates a compare method that takes advantage of this
format
[diagram: byte-by-byte comparison of two records in the dense format]
SCALA MACROS FOR RAW COMPARATORS
Macros to the rescue!
[chart: total compute time, default comparators vs raw comparators: 1.5X FASTER]
How to reduce costs due to record
serialization?
LESSON LEARNED
COLUMN PROJECTION
Don't read or deserialize data that you don't need
struct User {
1: i64 id
2: Address address
3: string name
4: list<Interest> interests
}
COLUMN PROJECTION
Columnar ๏ฌle formats like Apache Parquet support this directly
Specialized record deserializers can skip over unwanted ๏ฌelds
in row oriented storage
Don't read or deserialize data that you don't need
APACHE PARQUET
Columnar storage for the people
In traditional row-oriented storage layout, an entire record is
stored sequentially
R1.A R1.B R1.C R2.A R2.B R2.C R3.A R3.B R3.C
APACHE PARQUET
Columnar storage for the people
In traditional row-oriented storage layout, an entire record is
stored sequentially
9903489083
"123 elm street"
"alice"
"columnar file formats"
9903489084
"333 oak street"
"bob"
"Hadoop"
Compressed with lzo / gzip / snappy
APACHE PARQUET
Columnar storage for the people
In columnar storage layout, an entire column is stored
sequentially
R1.A R2.A R3.A R1.B R2.B R3.B R1.C R2.C R3.C
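A toy sketch of the two layouts above (class and method names are illustrative, not any real storage API): the same records flattened row-by-row versus column-by-column. In the columnar layout each column is one contiguous run, which is what makes projection cheap.

```java
// Toy illustration of row-oriented vs columnar layout of the same records.
class StorageLayouts {
    // Row-oriented: R1.A R1.B R2.A R2.B R3.A R3.B
    static String[] rowOriented(String[][] records) {
        String[] out = new String[records.length * records[0].length];
        int i = 0;
        for (String[] record : records)
            for (String field : record) out[i++] = field;
        return out;
    }

    // Columnar: R1.A R2.A R3.A R1.B R2.B R3.B
    static String[] columnar(String[][] records) {
        int cols = records[0].length;
        String[] out = new String[records.length * cols];
        int i = 0;
        for (int c = 0; c < cols; c++)
            for (String[] record : records) out[i++] = record[c];
        return out;
    }
}
```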
APACHE PARQUET
Columnar storage for the people
All user ids stored together
In columnar storage layout, an entire column is stored
sequentially
9903489083
9903489084
9903489085
9903489075
9903489088
9903489087
"123 elm street"
"333 oak street"
"827 maple drive"
APACHE PARQUET
Columnar storage for the people
Schema-aware storage can use specialized encodings
9903489083
9903489084
9903489085
9903489075
9903489088
9903489087
delta encoded: 9903489083, +1, +1, -10, +13, -1
"twitter.com/foo/bar"
"blog.twitter.com"
"twitter.com/foo/bar"
"twitter.com/foo/bar"
"blog.twitter.com"
"blog.twitter.com"
"blog.twitter.com"
"blog.twitter.com/123"
"twitter.com/foo/bar": 0
"blog.twitter.com": 1
"blog.twitter.com/123": 2
0
1
0
0
1
1
1
2
dictionary
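Both encodings can be sketched in a few lines. Real Parquet layers bit packing and run-length encoding on top of these, so this shows only the idea, not the actual implementation; names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of delta and dictionary encoding for a single column.
class ColumnEncodings {
    // Delta: store the first value, then differences between neighbors
    static long[] deltaEncode(long[] values) {
        long[] out = new long[values.length];
        out[0] = values[0];
        for (int i = 1; i < values.length; i++) out[i] = values[i] - values[i - 1];
        return out;
    }

    static long[] deltaDecode(long[] encoded) {
        long[] out = new long[encoded.length];
        out[0] = encoded[0];
        for (int i = 1; i < encoded.length; i++) out[i] = out[i - 1] + encoded[i];
        return out;
    }

    // Dictionary: assign each distinct value a small integer id
    static int[] dictionaryEncode(String[] values, Map<String, Integer> dict) {
        int[] ids = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            ids[i] = dict.computeIfAbsent(values[i], k -> dict.size());
        }
        return ids;
    }
}
```

Deltas of nearly-sequential ids are tiny numbers, and repeated URLs collapse to small dictionary ids; both compress far better than the raw values.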
FILE SIZE COMPARISON
[chart: size in GB for B64 Lzo Thrift, Block Lzo Thrift, Gzipped Json, Lzo Parquet: Parquet is 2X SMALLER]
APACHE PARQUET
Columnar storage for the people
Collocating entire columns allows for efficient column projection
Read off disk only the columns you need
Possibly more importantly: deserialize only the columns you need
COLUMN PROJECTION WITH PARQUET
[chart: total compute time, Parquet vs Lzo Thrift when selecting 1, 10, and 40 columns: 3X, 1.5X, and 1.15X FASTER respectively]
APACHE PARQUET
Columnar storage for the people
Parquet is often slower than row-oriented storage when reading all columns
Parquet is a dense format: read performance scales with the number of
columns in the schema, and nulls take time to read
Sparse, row-oriented formats (thrift) scale with the number of
columns present in the data, and nulls take no time to read
COLUMN PROJECTION FOR ROW ORIENTED DATA
Row-oriented is a very common way to store Thrift, Avro,
Protocol Buffers, etc.
Specialized record deserializers can skip over unwanted fields
in these row-oriented storage formats
Prototype implemented as a Scala macro that creates a custom
deserializer at compile time
Don't deserialize data that you don't need
COLUMN PROJECTION FOR ROW ORIENTED DATA
Don't deserialize data that you don't need
[diagram: serialized record bytes: user id, then address, then name]
Decode User Id to Long
Skip over unwanted address field
Decode Name to String
COLUMN PROJECTION FOR ROW ORIENTED DATA
No IO savings
But only decodes the fields you care about into objects
CPU time spent decoding Strings can be huge compared to the
time it takes to load + ignore the encoded bytes
Don't deserialize data that you don't need
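A minimal sketch of such a projecting deserializer, using a hypothetical length-prefixed format rather than the real Thrift wire format: each field is a 4-byte length followed by UTF-8 bytes, and unwanted fields are skipped with skipBytes() instead of being decoded into Strings.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical format: 4-byte length prefix + UTF-8 bytes per field.
class ProjectingReader {
    static byte[] writeRecord(String... fields) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String field : fields) {
                byte[] encoded = field.getBytes(StandardCharsets.UTF_8);
                out.writeInt(encoded.length);
                out.write(encoded);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode only the field at index `wanted`; skip everything before it
    static String readField(byte[] record, int wanted) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            for (int i = 0; ; i++) {
                int len = in.readInt();
                if (i == wanted) {
                    byte[] encoded = new byte[len];
                    in.readFully(encoded);
                    return new String(encoded, StandardCharsets.UTF_8); // decode only this one
                }
                in.skipBytes(len); // no String allocation for skipped fields
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The bytes for the skipped fields are still read off disk, but the expensive UTF-8 decode and object allocation happen only for the projected field.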
[chart: total compute time vs number of columns selected (1, 7, 10, 13, 48) for Parquet Thrift, Parquet Pig, and Lzo Thrift + Projection]
COLUMN PROJECTION: THRIFT VS. PARQUET
Parquet Thrift has a lot
of room for
improvement
Parquet faster than row
oriented until 13 columns
This schema is relatively flat, and most columns are populated
APACHE PARQUET
Columnar storage for the people
Predicate push-down also allows Parquet to skip over records
that don't match your filter criteria
Parquet stores statistics about chunks of
records, so in some cases entire chunks of
data can be skipped after examining these
statistics
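The chunk-statistics idea can be sketched as follows; the class and field names are illustrative, not Parquet's actual API. Each chunk records the min and max of a column, and an equality filter skips any chunk whose range cannot contain the target.

```java
// Sketch: per-chunk min/max statistics let a filter skip whole chunks.
class ChunkStats {
    final long min, max;
    final long[] values;

    ChunkStats(long[] values) {
        long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
        for (long v : values) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
        this.min = lo;
        this.max = hi;
        this.values = values;
    }

    // Returns {match count, chunks actually scanned}
    static int[] countMatches(ChunkStats[] chunks, long target) {
        int matches = 0, chunksScanned = 0;
        for (ChunkStats chunk : chunks) {
            if (target < chunk.min || target > chunk.max) continue; // skip whole chunk
            chunksScanned++;
            for (long v : chunk.values) if (v == target) matches++;
        }
        return new int[]{matches, chunksScanned};
    }
}
```

When the target is rare, most chunks fail the min/max check and are never scanned, which is the best case described above.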
APACHE PARQUET
Columnar storage for the people
Column projection and predicate push-down together are a
powerful combination
FILTER PUSH DOWN WITH PARQUET
[chart: total compute time for Lzo Thrift, Parquet + Filter, and Parquet + Filter + Project: 4.3X FASTER]
APACHE PARQUET
Columnar storage for the people
Predicate push-down performance depends on the nature of the filter
Searching for rare records is the best case: entire chunks of
records are likely not to contain the records you are looking for
IN SUMMARY
Key takeaways
Pro๏ฌle!
Serialization is expensive, and Hadoop does a lot of it
Choose a storage format that fits your access patterns
Use column projection
Sorting is expensive -- use Raw Comparators
IO may not be your bottleneck -- more IO for less CPU may be
a good tradeoff
ACKNOWLEDGEMENTS
Thanks to everyone involved!
Dmitriy Ryaboy @squarecog
Gera Shegalov @gerashegalov
Julien Le Dem @J_
Katya Gonina @katyagonina
Mansur Ashraf @mansur_ashraf
Oscar Boykin @posco
Sriram Krishnan @krishnansriram
Tianshuo Deng @tsdeng
Zak Taylor @zakattacktaylor
And many more!
GET INVOLVED
Contributions always welcome!
github.com/twitter/scalding
github.com/twitter/algebird
github.com/twitter/chill
github.com/apache/parquet-mr
JOIN THE FLOCK
We're Hiring!
Work on data processing challenges at scale
Strong commitment to open source
jobs.twitter.com
Data Platform: (https://about.twitter.com/careers/positions?jvi=oipMYfwb,Job)
QUESTIONS?
ALEX LEVENSON | @THISWILLWORK
IAN O'CONNELL | @0X138

ย 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
ย 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
ย 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
ย 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
ย 

Recently uploaded

CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female serviceCALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
ย 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
ย 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
ย 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
ย 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
ย 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
ย 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
ย 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
ย 
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online โ˜‚๏ธ
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online  โ˜‚๏ธCALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online  โ˜‚๏ธ
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online โ˜‚๏ธanilsa9823
ย 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
ย 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
ย 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
ย 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
ย 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
ย 
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธ
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธcall girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธ
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธDelhi Call girls
ย 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
ย 
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE9953056974 Low Rate Call Girls In Saket, Delhi NCR
ย 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
ย 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
ย 

Recently uploaded (20)

CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female serviceCALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
ย 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
ย 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
ย 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
ย 
Vip Call Girls Noida โžก๏ธ Delhi โžก๏ธ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida โžก๏ธ Delhi โžก๏ธ 9999965857 No Advance 24HRS LiveVip Call Girls Noida โžก๏ธ Delhi โžก๏ธ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida โžก๏ธ Delhi โžก๏ธ 9999965857 No Advance 24HRS Live
ย 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
ย 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
ย 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
ย 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
ย 
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online โ˜‚๏ธ
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online  โ˜‚๏ธCALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online  โ˜‚๏ธ
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online โ˜‚๏ธ
ย 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
ย 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
ย 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
ย 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
ย 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
ย 
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธ
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธcall girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธ
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธ
ย 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
ย 
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
ย 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
ย 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
ย 

Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twitter (Alex Levenson)

  • 1. L E S S O N S L E A R N E D AT T W I T T E R H A D O O P P E R F O R M A N C E O P T I M I Z AT I O N AT S C A L E A L E X L E V E N S O N | I A N O ' C O N N E L L | @ T H I S W I L L W O R K @ 0 X 1 3 8
  • 2. DATA PLATFORM @TWITTER Develop, maintain, and support the core data processing libraries used at Twitter In a good position to make system-wide performance improvements Core Data Libraries Team
  • 3. DATA PLATFORM @TWITTER Idiomatic functional Scala library for writing Hadoop map reduce Functional programming is a natural fit for map reduce Compile time type checked Core Data Libraries Team github.com/twitter/scalding
  • 4. DATA PLATFORM @TWITTER Columnar storage format for the Hadoop ecosystem Uses the Google Dremel column shredding and assembly algorithm Core Data Libraries Team APACHE PARQUET github.com/apache/parquet-mr
  • 5. DATA PLATFORM @TWITTER Streaming map reduce for hybrid realtime / batch topologies Write once, execute in parallel on Storm / Heron (online) and Scalding (offline) Core Data Libraries Team SUMMINGBIRD github.com/twitter/summingbird
  • 6. Hadoop at Twitter Scale H A D O O P AT T W I T T E R
  • 8. 100k MAP REDUCE JOBS DAILY MULTIPLES OF
  • 11. At this scale, even small system-wide improvements can save significant amounts of compute resources C O S T AT S C A L E
  • 12. What does your Hadoop cluster spend most of its time doing? W H AT T O I M P R O V E ?
  • 13. Profile your cluster, you might be surprised by what you find M E A S U R E - D O N ' T G U E S S
  • 14. ENABLE JVM PROFILING WITH -XPROF Built into the JVM (HotSpot), so there's nothing to install Xprof: a low overhead profiler built into the JVM mapreduce.task.profile='true' mapreduce.task.profile.maps='0-' mapreduce.task.profile.reduces='0-' mapreduce.task.profile.params='-Xprof'
  • 15. ENABLE JVM PROFILING WITH -XPROF Low overhead (uses stack sampling) Surfaces the most expensive methods Prints directly to task logs (stdout) Xprof: a low overhead profiler built into the JVM
  • 16. Flat profile of 412.48 secs (38743 total ticks): SpillThread
        Interpreted + native    Method
        12.5%     0 + 32215     org.apache.hadoop.io.compress.lz4.Lz4Compressor.compressBytesDirect
         4.6%     0 +   822     java.io.FileOutputStream.writeBytes
         ...
        19.4%   352 +  3082     Total interpreted (including elided)
        Compiled + native       Method
        50.0%  8549 +   299     java.lang.StringCoding.decode
        16.9%  2823 +   158     cascading.tuple.hadoop.io.HadoopTupleInputStream.getNextElement
         4.1%   734 +     0     sun.nio.cs.UTF_8$Decoder.decode
         2.3%   401 +     0     org.apache.hadoop.mapred.IFileOutputStream.write
         2.0%   352 +     0     cascading.tuple.hadoop.util.TupleComparator.compare
         1.7%   296 +     0     org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare
         ...
        79.0% 13514 +   467     Total compiled
        Thread-local ticks: 54.3% 21053 Blocked (of total)
  • 17. HADOOP CONFIGURATION OBJECT Looks and behaves a lot like a HashMap Surprisingly expensive Configuration conf = new Configuration() conf.set("myKey", "myValue") String value = conf.get("myKey")
  • 18. HADOOP CONFIGURATION OBJECT Constructor reads + unzips + parses an XML file from disk Surprisingly expensive public class KryoSerialization { public KryoSerialization() { this(new Configuration()) } }
  • 19. HADOOP CONFIGURATION OBJECT get() method involves regular expressions, variable substitution Surprisingly expensive String value = conf.get("myKey")
  • 20. HADOOP CONFIGURATION OBJECT Calling these methods in a loop, or once per record, is expensive Some (non trivial) jobs were spending 30% of their time in Configuration methods Surprisingly expensive
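The fix is to hoist Configuration lookups out of per-record code (for example into a task's setup method) and reuse the value. A minimal self-contained sketch of the pattern; FakeConfig is a hypothetical stand-in for Hadoop's Configuration that just counts how many expensive get() calls were made:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration (illustration only):
// each get() represents the regex + variable-substitution cost the real class pays.
class FakeConfig {
    private final Map<String, String> props = new HashMap<>();
    int lookups = 0;  // counts expensive get() calls

    void set(String key, String value) { props.put(key, value); }

    String get(String key) {
        lookups++;  // the real Configuration.get() runs regexes + substitution here
        return props.get(key);
    }
}

public class ConfigCaching {
    public static void main(String[] args) {
        FakeConfig conf = new FakeConfig();
        conf.set("my.delimiter", ",");
        String[] records = {"a", "b", "c"};

        // Anti-pattern: one Configuration lookup per record.
        for (String record : records) {
            String delim = conf.get("my.delimiter");
        }
        System.out.println("per-record lookups: " + conf.lookups);  // 3

        // Fix: hoist the lookup out of the loop and cache the value.
        conf.lookups = 0;
        String cachedDelim = conf.get("my.delimiter");
        for (String record : records) {
            // use cachedDelim for every record
        }
        System.out.println("cached lookups: " + conf.lookups);  // 1
    }
}
```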
  • 21. It's hard to predict what needs to be optimized without a profiler L E S S O N L E A R N E D
  • 22. If you don't profile, you could be missing easy wins L E S S O N L E A R N E D
  • 23. Measure whether IO or CPU is your biggest cost L E S S O N L E A R N E D
  • 24. INTERMEDIATE COMPRESSION Xprof surfaced that compression + decompression in the spill thread was taking a lot of time Intermediate outputs are temporary We now use lz4 instead of lzo level 3, which produces 30% larger intermediate data that's faster to read Made some large jobs 1.5X faster Find the right balance
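For reference, intermediate (map output) compression is controlled by standard Hadoop properties. A minimal mapred-site.xml fragment along these lines would switch intermediate compression to lz4 as described above (property names are the standard Hadoop 2.x ones; treat this as a sketch, not Twitter's exact configuration):

```xml
<configuration>
  <!-- compress intermediate map output (temporary spill / shuffle data) -->
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <!-- trade compression ratio for speed: lz4 instead of lzo -->
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.Lz4Codec</value>
  </property>
</configuration>
```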
  • 25. Record Serialization + Deserialization can be the most expensive part of your job L E S S O N L E A R N E D
  • 26. Record Serialization is CPU intensive, and may overshadow IO L E S S O N L E A R N E D
  • 27. How to reduce costs due to record serialization? L E S S O N L E A R N E D
  • 28. USE HADOOP'S RAW COMPARATOR API Hadoop MR deserializes the map output keys in order to sort them between the map and reduce phases Don't make sorting more expensive than it already is deserialize(keyBytes1).compare(deserialize(keyBytes2))
  • 29. USE HADOOP'S RAW COMPARATOR API This can cost a lot, especially for complex non-primitive keys, which is fairly common Don't make sorting more expensive than it already is requests.groupBy { req => (req.country, req.client) }
  • 30. USE HADOOP'S RAW COMPARATOR API This can cost a lot, especially for complex non-primitive keys, which is fairly common Don't make sorting more expensive than it already is Complex object that requires sorting requests.groupBy { req => (req.country, req.client) }
  • 31. Flat profile of 412.48 secs (38743 total ticks): SpillThread
        Interpreted + native    Method
        12.5%     0 + 32215     org.apache.hadoop.io.compress.lz4.Lz4Compressor.compressBytesDirect
         4.6%     0 +   822     java.io.FileOutputStream.writeBytes
         ...
        19.4%   352 +  3082     Total interpreted (including elided)
        Compiled + native       Method
        50.0%  8549 +   299     java.lang.StringCoding.decode
        16.9%  2823 +   158     cascading.tuple.hadoop.io.HadoopTupleInputStream.getNextElement
         4.1%   734 +     0     sun.nio.cs.UTF_8$Decoder.decode
         2.3%   401 +     0     org.apache.hadoop.mapred.IFileOutputStream.write
         2.0%   352 +     0     cascading.tuple.hadoop.util.TupleComparator.compare
         1.7%   296 +     0     org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare
         ...
        79.0% 13514 +   467     Total compiled
        Thread-local ticks: 54.3% 21053 Blocked (of total)
  • 32. USE HADOOP'S RAW COMPARATOR API Hadoop comes with a RawComparator API for comparing records in their serialized (raw) form Don't make sorting more expensive than it already is deserialize(keyBytes1).compare(deserialize(keyBytes2)) compare(keyBytes1, keyBytes2)
  • 33. USE HADOOP'S RAW COMPARATOR API Hadoop comes with a RawComparator API for comparing records in their serialized (raw) form Don't make sorting more expensive than it already is public interface RawComparator<T> { public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2); }
  • 34. USE HADOOP'S RAW COMPARATOR API Unfortunately, this requires you to write a custom comparator by hand And assumes that your data is actually easy to compare in its serialized form Don't make sorting more expensive than it already is public interface RawComparator<T> { public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2); }
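To make the API concrete, here is a self-contained sketch: the interface is the one shown on the slide (reproduced locally so this compiles without Hadoop), and the comparator assumes keys are non-negative longs serialized big-endian, so unsigned lexicographic byte order matches numeric order and no deserialization is needed. Real keys are usually more complex, which is exactly the pain the macros address:

```java
// The RawComparator interface from the slide, reproduced locally so this
// sketch compiles without Hadoop on the classpath.
interface RawComparator<T> {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

// Illustrative raw comparator for keys serialized as big-endian longs.
// Assumes non-negative values, so unsigned lexicographic byte order matches
// numeric order; the keys are never deserialized into objects.
class LongRawComparator implements RawComparator<Long> {
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        for (int i = 0; i < Math.min(l1, l2); i++) {
            int a = b1[s1 + i] & 0xff;  // unsigned byte
            int b = b2[s2 + i] & 0xff;
            if (a != b) return a < b ? -1 : 1;
        }
        return Integer.compare(l1, l2);
    }
}

public class RawCompareDemo {
    // Serialize a long as 8 big-endian bytes.
    static byte[] bigEndian(long v) {
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) { out[i] = (byte) v; v >>>= 8; }
        return out;
    }

    public static void main(String[] args) {
        RawComparator<Long> cmp = new LongRawComparator();
        byte[] k1 = bigEndian(42L);
        byte[] k2 = bigEndian(1000L);
        // 42 < 1000, decided entirely on serialized bytes
        System.out.println(cmp.compare(k1, 0, 8, k2, 0, 8) < 0);  // true
    }
}
```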
  • 35. SCALA MACROS FOR RAW COMPARATORS Macros to the rescue! A slightly more hipster API for Raw Comparators in Scala And a handful of macros to generate implementations of this API for tuples, case classes, thrift objects, primitives, Strings, etc.
  • 36. SCALA MACROS FOR RAW COMPARATORS 1 3 f o o 0 1 17 1 88 ... Macros to the rescue! First, creates a custom dense serialization format that's easy to compare 1 3 f o o 0 1 22 0 ... ... non-null String null value non-null int non-null int null value
  • 37. SCALA MACROS FOR RAW COMPARATORS 1 3 f o o 0 1 17 1 88 ... Macros to the rescue! Then, creates a compare method that takes advantage of this format 1 3 f o o 0 1 22 0 ... ...
  • 38. SCALA MACROS FOR RAW COMPARATORS Macros to the rescue! Total Compute Time: Default vs. Raw Comparators, 1.5X FASTER
  • 39. How to reduce costs due to record serialization? L E S S O N L E A R N E D
  • 40. COLUMN PROJECTION Don't read or deserialize data that you don't need struct User { 1: i64 id 2: Address address 3: string name 4: list<Interest> interests }
  • 41. COLUMN PROJECTION Columnar file formats like Apache Parquet support this directly Specialized record deserializers can skip over unwanted fields in row oriented storage Don't read or deserialize data that you don't need
  • 42. APACHE PARQUET Columnar storage for the people In traditional row-oriented storage layout, an entire record is stored sequentially R1.A R1.B R1.C R2.A R2.B R2.C R3.A R3.B R3.C
  • 43. APACHE PARQUET Columnar storage for the people In traditional row-oriented storage layout, an entire record is stored sequentially 9903489083 "123 elm street" "alice" "columnar file formats" 9903489084 "333 oak street" "bob" "Hadoop" Compressed with lzo / gzip / snappy
  • 44. APACHE PARQUET Columnar storage for the people In columnar storage layout, an entire column is stored sequentially R1.A R2.A R3.A R1.B R2.B R3.B R1.C R2.C R3.C
  • 45. APACHE PARQUET Columnar storage for the people All user ids stored together In columnar storage layout, an entire column is stored sequentially 9903489083 9903489084 9903489085 9903489075 9903489088 9903489087 "123 elm street" "333 oak street" "827 maple drive"
  • 46. APACHE PARQUET Columnar storage for the people Schema aware storage can use specialized encodings 9903489083 9903489084 9903489085 9903489075 9903489088 9903489087 9903489083 +1 +1 -10 +3 -1 delta "twitter.com/foo/bar" "blog.twitter.com" "twitter.com/foo/bar" "twitter.com/foo/bar" "blog.twitter.com" "blog.twitter.com" "blog.twitter.com" "blog.twitter.com/123" "twitter.com/foo/bar": 0 "blog.twitter.com": 1 "blog.twitter.com/123": 2 0 1 0 0 1 1 1 2 dictionary
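The dictionary encoding in the slide's URL example can be sketched in a few lines: each distinct string is stored once, and the column becomes a list of small integer indices into that dictionary (the slide's data reproduced; real Parquet dictionary pages add delta and bit-packing on top of this idea):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal dictionary encoder in the spirit of the slide's example.
public class DictionaryEncode {
    // Encodes a column of strings as dictionary ids, filling `dictionary`
    // with value -> id mappings in first-seen order.
    static List<Integer> encode(String[] column, Map<String, Integer> dictionary) {
        List<Integer> encoded = new ArrayList<>();
        for (String value : column) {
            Integer id = dictionary.get(value);
            if (id == null) {              // first sight: assign the next id
                id = dictionary.size();
                dictionary.put(value, id);
            }
            encoded.add(id);               // afterwards: reuse the small id
        }
        return encoded;
    }

    public static void main(String[] args) {
        String[] column = {
            "twitter.com/foo/bar", "blog.twitter.com", "twitter.com/foo/bar",
            "twitter.com/foo/bar", "blog.twitter.com", "blog.twitter.com",
            "blog.twitter.com", "blog.twitter.com/123"
        };
        System.out.println(encode(column, new HashMap<>()));  // [0, 1, 0, 0, 1, 1, 1, 2]
    }
}
```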
  • 47. FILE SIZE COMPARISON Size in GB: B64 Lzo Thrift, Block Lzo Thrift, Gzipped Json, Lzo Parquet. Lzo Parquet is 2X SMALLER
  • 48. APACHE PARQUET Columnar storage for the people Collocating entire columns allows for efficient column projection Read off disk only the columns you need Possibly more importantly: deserialize only the columns you need
  • 49. COLUMN PROJECTION WITH PARQUET Total Compute Time for 1 column, 10 columns, 40 columns: Parquet vs. Lzo Thrift. 3X FASTER, 1.5X FASTER, 1.15X FASTER
  • 50. COLUMN PROJECTION WITH PARQUET Total Compute Time for 1 column, 10 columns, 40 columns: Parquet vs. Lzo Thrift. 3X FASTER, 1.5X FASTER, 1.15X FASTER
  • 51. APACHE PARQUET Columnar storage for the people Parquet is often slower to read all columns than row oriented storage Parquet is a dense format, read performance scales with the number of columns in the schema -- nulls take time to read Sparse, row oriented formats (thrift) scale with the number of columns present in the data -- nulls take no time to read
  • 52. COLUMN PROJECTION FOR ROW ORIENTED DATA Row oriented is a very common way to store Thrift, Avro, Protocol Buffers, etc. Specialized record deserializers can skip over unwanted fields in these row oriented storage formats Prototype implemented as a Scala macro that creates a custom deserializer at compile time Don't deserialize data that you don't need
  • 53. COLUMN PROJECTION FOR ROW ORIENTED DATA Don't deserialize data that you don't need 198 111 121 054 e l m _ s t r ... a l i c e ... Decode User Id to Long Skip over unwanted address field Decode Name to String
  • 54. COLUMN PROJECTION FOR ROW ORIENTED DATA No IO savings But only decodes the fields you care about into objects CPU time spent decoding Strings can be huge compared to time it takes to load + ignore the encoded bytes Don't deserialize data that you don't need
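The skipping idea can be sketched with a hypothetical length-prefixed row format (not the real Thrift wire format): each field is [field id: 1 byte][length: 1 byte][payload bytes], and a projecting reader advances past unwanted fields without ever decoding their payloads, which is where String decoding costs would be paid:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical row format for illustration only: [field id][length][payload].
public class SkipFieldsDemo {
    // Returns the decoded payload of wantedField, skipping everything else.
    static String readField(byte[] record, int wantedField) {
        int pos = 0;
        while (pos < record.length) {
            int fieldId = record[pos++];
            int len = record[pos++] & 0xff;
            if (fieldId == wantedField) {
                return new String(record, pos, len, StandardCharsets.UTF_8);
            }
            pos += len;  // skip: no String decode, no object allocation
        }
        return null;  // field not present in this record
    }

    // Build a sample record: field 1 = id, field 2 = address, field 3 = name.
    static byte[] encode() {
        byte[] id = "42".getBytes(StandardCharsets.UTF_8);
        byte[] addr = "elm_str".getBytes(StandardCharsets.UTF_8);
        byte[] name = "alice".getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[6 + id.length + addr.length + name.length];
        int p = 0;
        p = put(out, p, 1, id);
        p = put(out, p, 2, addr);
        p = put(out, p, 3, name);
        return out;
    }

    static int put(byte[] out, int p, int fieldId, byte[] payload) {
        out[p++] = (byte) fieldId;
        out[p++] = (byte) payload.length;
        System.arraycopy(payload, 0, out, p, payload.length);
        return p + payload.length;
    }

    public static void main(String[] args) {
        // Decode only the name field; id and address bytes are stepped over.
        System.out.println(readField(encode(), 3));  // alice
    }
}
```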
  • 55. COLUMN PROJECTION: THRIFT VS. PARQUET Total Compute Time vs. number of columns selected (1, 7, 10, 13, 48): Parquet Thrift, Parquet Pig, Lzo Thrift + Projection. Parquet Thrift has a lot of room for improvement Parquet faster than row oriented until 13 columns This schema is relatively flat, and most columns populated
  • 56. APACHE PARQUET Columnar storage for the people Predicate push-down also allows parquet to skip over records that don't match your filter criteria Parquet stores statistics about chunks of records, so in some cases entire chunks of data can be skipped after examining these statistics
  • 57. APACHE PARQUET Columnar storage for the people Combining both column projection and predicate push down is a powerful combination
  • 58. FILTER PUSH DOWN WITH PARQUET Total Compute Time: Lzo Thrift vs. Parquet + Filter vs. Parquet + Filter + Project. 4.3X FASTER
  • 59. APACHE PARQUET Columnar storage for the people Predicate push-down performance depends on the nature of the filter Searching for rare records is the best case, entire chunks of records are likely to not contain the records you are looking for
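A toy illustration of the statistics idea: hypothetical chunks carry min/max for a column, and an equality filter can rule out whole chunks before paying any read or deserialization cost (Parquet's actual chunk metadata is richer, but the mechanism is the same):

```java
// Sketch of statistics-based chunk skipping, the idea behind predicate push-down.
public class ChunkSkipDemo {
    static class Chunk {
        final long min, max;   // per-chunk statistics, cheap to examine
        final long[] values;   // only read when the chunk can't be ruled out
        Chunk(long[] values) {
            this.values = values;
            long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
            for (long v : values) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
            this.min = lo; this.max = hi;
        }
    }

    static int chunksRead = 0;  // how many chunks we actually had to scan

    static boolean contains(Chunk[] chunks, long target) {
        for (Chunk c : chunks) {
            if (target < c.min || target > c.max) continue;  // skipped via stats
            chunksRead++;  // only now pay the IO + deserialization cost
            for (long v : c.values) if (v == target) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Chunk[] chunks = {
            new Chunk(new long[]{1, 5, 9}),
            new Chunk(new long[]{100, 150, 190}),
            new Chunk(new long[]{1000, 1500})
        };
        System.out.println(contains(chunks, 150));          // true
        System.out.println("chunks read: " + chunksRead);   // 1
    }
}
```

Searching for a rare value is the best case, as the slide notes: most chunks' min/max ranges exclude it, so most chunks are never scanned.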
  • 60. Key takeaways I N S U M M A R Y
  • 61. IN SUMMARY Key takeaways Profile! Serialization is expensive, and Hadoop does a lot of it Choose a storage format that fits your access patterns Use column projection Sorting is expensive -- use Raw Comparators IO may not be your bottleneck -- more IO for less CPU may be a good tradeoff
  • 62. ACKNOWLEDGEMENTS Thanks to everyone involved! Dmitriy Ryaboy @squarecog Gera Shegalov @gerashegalov Julien Le Dem @J_ Katya Gonina @katyagonina Mansur Ashraf @mansur_ashraf Oscar Boykin @posco Sriram Krishnan @krishnansriram Tianshuo Deng @tsdeng Zak Taylor @zakattacktaylor And many more!
  • 63. GET INVOLVED Contributions always welcome! github.com/twitter/scalding github.com/twitter/algebird github.com/twitter/chill github.com/apache/parquet-mr
  • 64. JOIN THE FLOCK We're Hiring! Work on data processing challenges at scale Strong commitment to open source jobs.twitter.com Data Platform: (https://about.twitter.com/careers/positions?jvi=oipMYfwb,Job)
  • 65. Q U E S T I O N S ? A L E X L E V E N S O N | I A N O ' C O N N E L L | @ T H I S W I L L W O R K @ 0 X 1 3 8