Big Data Benchmarks 2018
Tom Grek
Software & Data Engineer
December 2018
Data Lakes
Essentials of a data lake
● Large amount of data
○ At the point where you wonder if Postgres will cope
● Immutable data - no upserts (I’ll get to this ...)
○ In other words: not suitable as a transactional data store
● Predominantly batch access (I’ll get to this ...)
○ Random access can be ok with good design but it’s an add-on, not the primary use case
First ask yourself ...
Does it fit on a desktop computer?
Can I keep it in memory? (If so, do!)
Can I keep it on SSD/HDD? (If so, do! Or an EBS volume, that is ...)
Can I leverage caching to keep hot data in RAM? (If so, do!)
Can I just use SQLite on that desktop computer? (If so, do!)
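As a concrete illustration of the "just use SQLite" escape hatch, here is a minimal sketch; the file names, table name, and `id` column are hypothetical:

import sqlite3
import pandas as pd

# Hypothetical CSV that fits comfortably on one machine
df = pd.read_csv('events.csv')

# A single-file, zero-ops database on that same desktop
conn = sqlite3.connect('lake.db')
df.to_sql('events', conn, if_exists='replace', index=False)

# Indexed random access, no cluster required
conn.execute('CREATE INDEX IF NOT EXISTS idx_events_id ON events(id)')
hit = pd.read_sql_query('SELECT * FROM events WHERE id = ?', conn, params=(1234567890,))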
$10
Figure on about $10 / month per terabyte in AWS
Storage is cheap!
Examples of data lakes
On-prem:
● HDFS / Hadoop
Cloud & managed:
● Google BigQuery / BigTable
● Apache HBase
● AWS Redshift / EMR
Primary design consideration
Ease of access and use
One big user will be your internal analysts / data scientists
Make them happy by making it as easy as `pd.read_csv( ... )`
Data lakes can also back external applications, e.g. web apps
Make application developers happy by making it as easy as `pd.read_csv( ... )`
No need to differentiate between the two users! A well-built data lake can serve both.
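For illustration, a sketch of the analyst experience this implies; the bucket, paths, and `date` column are hypothetical, and s3fs plus a Parquet engine (fastparquet or pyarrow) are assumed to be installed:

import pandas as pd
import dask.dataframe as dd

# A single partition: plain pandas, straight from object storage
recent = pd.read_parquet('s3://my-bucket/events/2018-12-01.parquet')

# The whole lake: same idioms, just lazy and partitioned
events = dd.read_parquet('s3://my-bucket/events/*.parquet')
daily_counts = events.groupby('date').size().compute()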
Front it with a microservice
Allows you to provide an easy REST interface for various query types
Allows you to build up indexes of metadata about the data lake
(e.g. an inverted unigram/bigram index)
Indexes can be stored in a secondary database and used to direct the microservice to retrieve only the relevant partition of data, limiting the network access and processing required.
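A minimal sketch of such a microservice, assuming a hypothetical Postgres table `partition_index(token, s3_path)` populated at ingest time and a hypothetical `text` column in the data:

from flask import Flask, jsonify
import dask.dataframe as dd
import psycopg2

app = Flask(__name__)
pg = psycopg2.connect('dbname=lake_meta')

@app.route('/search/<token>')
def search(token):
    # The inverted index tells us which partitions could contain the token ...
    with pg.cursor() as cur:
        cur.execute('SELECT s3_path FROM partition_index WHERE token = %s', (token,))
        paths = [row[0] for row in cur.fetchall()]
    if not paths:
        return jsonify([])
    # ... so only those partitions are pulled from S3, not the whole lake
    df = dd.read_parquet(paths, engine='fastparquet').compute()
    hits = df[df['text'].str.contains(token)]
    return jsonify(hits.to_dict(orient='records'))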
My preferred design
Object storage (e.g. S3)
If storage is in S3, compute must also run in AWS, in the same region/AZ as the bucket!
Python based
Dask (https://dask.org) is a mature platform for distributed, delayed DAG computation with all the familiar Python/SciPy idioms
No need for Spark, Hive, Pig, etc.; note, however, that Dask has no SQL interface
Postgres maintains an index/metadata, updated on ingest (yes, appending to the data lake is fine)
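A sketch of what append-only ingest looks like under this design; the bucket, the `text` column, and the `partition_index(token, s3_path)` table are hypothetical (the same ones assumed in the microservice sketch above):

import pandas as pd
import psycopg2

def ingest(batch: pd.DataFrame, batch_id: str):
    # 1. Append a new immutable Parquet partition; existing partitions are never rewritten
    s3_path = f's3://my-bucket/events/{batch_id}.parquet'
    batch.to_parquet(s3_path, engine='fastparquet', compression='gzip')

    # 2. Update the inverted unigram index in Postgres so queries can prune partitions
    tokens = {tok for text in batch['text'] for tok in str(text).lower().split()}
    with psycopg2.connect('dbname=lake_meta') as pg, pg.cursor() as cur:
        cur.executemany(
            'INSERT INTO partition_index (token, s3_path) VALUES (%s, %s)',
            [(tok, s3_path) for tok in tokens],
        )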
S3 stores what?
Raw JSONL? (Not JSON!)
Raw CSV?
GZIP’d JSONL?
Parquet?
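One way to answer that question empirically: write the same sample through each candidate format and compare file sizes and read times. Paths here are hypothetical local files; the same calls work against s3:// paths with s3fs installed:

import pandas as pd

df = pd.read_json('sample.jsonl', lines=True)   # some representative slice of the data

df.to_json('sample.out.jsonl', orient='records', lines=True)                      # raw JSONL
df.to_csv('sample.out.csv', index=False)                                          # raw CSV
df.to_csv('sample.out.csv.gz', index=False, compression='gzip')                   # GZIP'd CSV
df.to_parquet('sample.out.parquet', engine='fastparquet', compression=None)       # uncompressed Parquet
df.to_parquet('sample.out.gz.parquet', engine='fastparquet', compression='gzip')  # GZIP'd Parquet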
Time for some benchmarks!
Methodology
● Time to scan through 1.2GB or ~300,000 rows of text data (5 columns including an ID)
● To retrieve a row with a given ID (random access)
● 6 Dask workers / 6 partitions
● Repeat until < 10ms standard deviation
● Not exactly a representative use case ...
● With more traditional tabular data with categoricals, Parquet would have a clear advantage
Methodology
Not tremendously scientific...
Files in S3
Compute nodes in a Kubernetes pod in the same region and AZ as the bucket
Wall time (CPU time << network read)
In Dask (with `import dask.dataframe as dd`, plus fastparquet and s3fs installed):
%%timeit -n 4
df = dd.read_parquet('s3://my-bucket/*.parquet', engine='fastparquet')  # lazy; only Parquet metadata is read here
df[df['id'] == 1234567890].compute()  # random access by ID; this triggers the actual S3 reads
Results
Pandas / single-process, single-threaded benchmarks, local files
● pd.read_json, a single JSONL file: 9.6s (1.3GB)
● pd.read_parquet, 6 partitions: 4.4s (1.2GB)

Dask / distributed benchmarks, local files
● Parquet, 6 partitions: 1.29s (1.2GB)

Dask / distributed benchmarks, remote files (S3)
● JSONL, 6 partitions: N/A; Dask's JSON handling is memory-hungry and buggy* (1.3GB)
● CSV, 6 partitions: 13.7s (1.2GB)
● CSV, 6 partitions, GZIP'd: 16.9s (415MB)
● Parquet, 6 partitions: 8.6s (1.2GB)
● Parquet, 6 partitions, GZIP compression: 9.53s (407MB)
● Parquet with predicate pushdown, 6 partitions: 1.75s (1.2GB, only 300MB read)
● Parquet, GZIP'd, with predicate pushdown, 6 partitions: 1.62s (350MB, only 80MB read)

* Or I could not figure out how to implement it in a way that works.
Parquet wins, even for row-oriented, largely unstructured data.

Predicate pushdown limits the data read, which is the bottleneck.
(Thanks, Parquet metadata.)
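For reference, a sketch of how predicate pushdown can be expressed in Dask: the `filters` argument prunes row groups using the min/max statistics in the Parquet footer before any row data is downloaded (the exact benchmark code may have differed):

import dask.dataframe as dd

df = dd.read_parquet(
    's3://my-bucket/*.parquet',
    engine='fastparquet',
    filters=[('id', '==', 1234567890)],     # evaluated against row-group statistics only
)
row = df[df['id'] == 1234567890].compute()  # row-level filter on the (much smaller) data actually read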
Thanks!
Tom Grek
tom@primer.ai
