FROM FLAT FILES TO DECONSTRUCTED
DATABASE
The evolution and future of the Big Data ecosystem.
Julien Le Dem
@J_
julien.ledem@wework.com
April 2018
Julien Le Dem
@J_
Principal Data Engineer
• Author of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
• Used Hadoop first at Yahoo in 2007
• Formerly Twitter Data platform and Dremio
Agenda
❖ At the beginning there was Hadoop (2005)
❖ Actually, SQL was invented in the 70s
“MapReduce: A major step backwards”
❖ The deconstructed database
❖ What next?
At the beginning there was Hadoop
Hadoop
Storage:
A distributed file system
Execution:
Map Reduce
Based on Google’s GFS and MapReduce papers
Great at looking for a needle in a haystack…
…with snowplows
Original Hadoop abstractions
Execution
Map/Reduce
Storage
File System
Just flat files
● Any binary
● No schema
● No standard
● The job must know how to split the data
[Diagram: mappers (M) read input splits locally, shuffle intermediate data over the network, and reducers (R) write output locally]
Simple
● Flexible/Composable
● Logic and optimizations tightly coupled
● Ties execution with persistence
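A minimal sketch of the Map/Reduce abstraction in Python (illustrative only, not Hadoop's Java API; the input and names are made up): the job itself supplies the parsing and the logic, since flat files carry no schema.

```python
# Word count as map -> shuffle -> reduce over a "flat file" of raw lines.
from collections import defaultdict

def map_phase(lines):
    # Map: the job decides how to split and parse the raw data.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key (done over the network between M and R tasks).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key and write the result.
    for key, values in grouped:
        yield (key, sum(values))

if __name__ == "__main__":
    flat_file = ["needle in a haystack", "a haystack of flat files"]
    print(dict(reduce_phase(shuffle(map_phase(flat_file)))))
```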
“MapReduce: A major step backwards”
(2008)
SQL
Databases have been around for a long time
• Originally SEQUEL developed in the early 70s at IBM
• 1986: first SQL standard
• Updated in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, 2016
Relational model
Global Standard
• Universal Data access language
• Widely understood
• First described in 1969 by Edgar F. Codd
Underlying principles of relational databases
Standard
● SQL is understood by many
Separation of logic and optimization
● High level language focusing on logic (SQL)
● Indexing
● Optimizer
Separation of Schema and Application
● Evolution
● Views
● Schemas
Integrity
● Transactions
● Integrity constraints
● Referential integrity
Storage
Tables
Abstracted notion of data
● Defines a Schema
● Format/layout decoupled from queries
● Has statistics/indexing
● Can evolve over time
Execution
SQL
● Decouples logic of query from:
○ Optimizations
○ Data representation
○ Indexing
Relational Database abstractions
SELECT a, AVG(b) FROM FOO GROUP BY a
Query evaluation
Syntax → Semantic → Optimization → Execution
A well integrated system
SELECT f.c, AVG(b.d)
FROM FOO f
JOIN BAR b ON f.a = b.b
WHERE f.d = x GROUP BY f.c
[Diagram: the query goes through Syntax → Semantic → Optimization → Execution; the logical plan (Scan FOO, Scan BAR, FILTER, JOIN, GROUP BY, Select) is rewritten by the optimizer using table metadata (schema, stats, layout, …); execution pushes projections and predicates down to storage, which returns columnar data to the user]
So why? Why Hadoop?
Why Snowplows?
The relational model was constrained
We need the right constraints and the right abstractions
Traditional SQL implementations:
• Flat schema
• Inflexible schema evolution
• History rewrite required
• No lower level abstractions
• Not scalable
Constraints are good: they allow optimizations (see the join-selection sketch below)
• Statistics
• Pick the best join algorithm
• Change the data layout
• Reusable optimization logic
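As an illustration of why statistics enable optimizations, here is a hedged sketch (not any specific engine's optimizer; the threshold and table names are made up) of using table stats to pick a join strategy.

```python
# Table statistics let a planner choose a join algorithm, e.g. broadcast the
# small side instead of shuffling both sides.
from dataclasses import dataclass

BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # hypothetical 10 MB cutoff

@dataclass
class TableStats:
    name: str
    row_count: int
    avg_row_bytes: int

    @property
    def size_bytes(self) -> int:
        return self.row_count * self.avg_row_bytes

def choose_join_strategy(left: TableStats, right: TableStats) -> str:
    # With statistics the optimizer can pick the best algorithm; without them
    # (plain flat files), the job has to hard-code a strategy.
    small, large = sorted((left, right), key=lambda t: t.size_bytes)
    if small.size_bytes <= BROADCAST_THRESHOLD_BYTES:
        return f"broadcast {small.name} to every task joining {large.name}"
    return f"shuffle {left.name} and {right.name} on the join key (sort-merge join)"

print(choose_join_strategy(TableStats("FOO", 1_000_000_000, 100),
                           TableStats("BAR", 10_000, 50)))
```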
It’s just code
Hadoop is flexible and scalable
• Room to scale algorithms that are not part of the standard
• Machine learning
• Your imagination is the limit
No data shape constraint (a nested-data sketch follows below)
• Nested data structures
• Unstructured text with semantic annotations
• Graphs
• Non-uniform schemas
Open source
• You can improve it
• You can expand it
• You can reinvent it
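A small sketch of the nested-data point, assuming the pyarrow package (the file name and fields are made up): nested structs and lists round-trip through Parquet, which a flat relational schema of the time could not express directly.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Records with nested structs and lists of structs.
events = pa.table({
    "user": [
        {"id": 1, "tags": ["eng", "sf"]},
        {"id": 2, "tags": ["analyst"]},
    ],
    "clicks": [
        [{"url": "/a", "ms": 12}],
        [{"url": "/b", "ms": 7}, {"url": "/c", "ms": 3}],
    ],
})

pq.write_table(events, "events.parquet")
print(pq.read_table("events.parquet").schema)  # the nested schema is preserved at rest
```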
You can actually implement SQL with this
SELECT f.c, AVG(b.d)
FROM FOO f
JOIN BAR b ON f.a = b.b
WHERE f.d = x GROUP BY f.c
[Diagram: the SQL query is parsed into a logical plan (Scan FOO, Scan BAR, FILTER, JOIN, GROUP BY, Select) and executed on Hadoop]
And they did…
(open-sourced 2009)
10 years later
The deconstructed database
Author: gamene https://www.flickr.com/photos/gamene/4007091102
Query model
Machine learning
Storage
Batch execution
Data exchange
Stream processing
We can mix and match individual components
*not exhaustive!
Specialized components
[Grid of project logos grouped by category: stream processing, storage, execution, SQL, stream persistence, streams, resource management, Iceberg, machine learning]
We can mix and match individual components
Storage
Row oriented or columnar
Immutable or mutable
Stream storage vs analytics optimized
Query model
SQL
Functional
…
Machine Learning
Training models
Data Exchange
Row oriented
Columnar
Batch Execution
Optimized for high throughput and historical analysis
Streaming Execution
Optimized for high throughput and low latency processing
Emergence of standard components
Columnar storage: Apache Parquet as the columnar representation at rest.
SQL parsing and optimization: Apache Calcite as a versatile query optimizer framework.
Schema model: Apache Avro as a pipeline-friendly schema for the analytics world.
Columnar exchange: Apache Arrow as the next-generation in-memory representation and no-overhead data exchange.
Table abstraction: Netflix’s Iceberg has great potential to provide snapshot isolation and layout abstraction on top of distributed file systems.
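To make the schema-model point concrete, here is a minimal sketch assuming the fastavro package (the schema, fields, and file name are made up): an Avro record schema with a defaulted field, the kind of evolution-friendly schema that pipelines pass around.

```python
from fastavro import parse_schema, writer, reader

schema = parse_schema({
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
        # Added later with a default, so readers using the old schema still resolve.
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
})

# Write a record with the schema embedded in the file.
with open("pageviews.avro", "wb") as out:
    writer(out, schema, [{"user_id": 1, "url": "/home", "referrer": None}])

# Read it back; records come out as plain dicts.
with open("pageviews.avro", "rb") as fin:
    for record in reader(fin):
        print(record)
```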
The deconstructed database’s optimizer: Calcite
Storage
Execution engine
Schema plugins
Optimizer rules
SELECT f.c, AVG(b.d)
FROM FOO f
JOIN BAR b ON f.a = b.b
WHERE f.d = x GROUP BY f.c
[Diagram: Calcite handles Syntax → Semantic → Optimization; schema plugins describe the storage, optimizer rules rewrite the logical plan (Scan FOO, Scan BAR, FILTER, JOIN, GROUP BY, Select), and the optimized plan is handed to the execution engine]
Apache Calcite is used in:
Streaming SQL
• Apache Apex
• Apache Flink
• Apache SamzaSQL
• Apache StormSQL
…
Batch SQL
• Apache Hive
• Apache Drill
• Apache Phoenix
• Apache Kylin
…
The deconstructed database’s storage
[Diagram: storage systems placed along two axes: columnar vs. row oriented, and immutable (optimized for analytics) vs. mutable (optimized for serving)]
Query integration
To be performant, a query engine requires deep integration with the storage layer: implementing push downs and a vectorized reader that produces data in an efficient representation (for example Apache Arrow).
Storage: Push downs (a hedged pyarrow sketch follows below)
PROJECTION: read only what you need
Read only the columns that are needed:
• Columnar storage makes this efficient.
PREDICATE: filter during the scan
Evaluate filters during the scan to:
• Leverage storage properties (min/max stats, partitioning, sort, etc.)
• Avoid decoding skipped data.
• Reduce IO.
AGGREGATION: avoid materializing intermediary data
To reduce IO, aggregation can also be implemented during the scan to:
• Minimize materialization of intermediary data.
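A hedged sketch of projection and predicate push downs as exposed by a Parquet reader, assuming the pyarrow package and a local events.parquet file (the column names are made up):

```python
import pyarrow.parquet as pq

table = pq.read_table(
    "events.parquet",
    columns=["user_id", "url"],        # projection push down: decode only these columns
    filters=[("url", "==", "/home")],  # predicate push down: skip row groups via min/max stats
)
print(table.num_rows, table.schema.names)
```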
The deconstructed database interchange: Apache Arrow
[Diagram: scanners read Parquet files with projection push down (read only a and b); partial aggregations feed a shuffle and final aggregations, exchanging Arrow batches to produce the result]
SELECT SUM(a)
FROM t
WHERE c = 5
GROUP BY b
Projection and predicate push downs
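A minimal sketch of columnar data exchange with Arrow record batches, assuming the pyarrow package (the batch contents stand in for partial aggregation results and are made up):

```python
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"b": ["x", "y"], "sum_a": [10, 7]})

# Producer side: serialize batches with the Arrow IPC stream format
# (in practice the sink would be a socket or shared memory, not a buffer).
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

# Consumer side: read the same columnar layout back, no row-by-row re-encoding.
reader = pa.ipc.open_stream(sink.getvalue())
for received in reader:
    print(received.to_pydict())  # {'b': ['x', 'y'], 'sum_a': [10, 7]}
```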
Storage: Stream persistence
Open source projects [logos: an incumbent and interesting alternatives]
Features (a hedged consumer sketch follows below)
• State of the consumer: how do we recover from failure?
• Snapshot
• Decoupling reads from writes
• Parallel reads
• Replication
• Data isolation
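As a hedged illustration of consumer state and decoupled reads, here is a sketch assuming the kafka-python package and a local broker; the topic and group names are made up.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "page-views",                      # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="analytics-loader",       # each consumer group tracks its own offsets
    enable_auto_commit=False,          # commit explicitly after successful processing
    auto_offset_reset="earliest",
)

for message in consumer:
    # Process the record, then commit the offset: a snapshot of progress used for
    # recovery. Reads are decoupled from writes, so other groups can replay the log.
    print(message.topic, message.partition, message.offset, message.value)
    consumer.commit()
```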
Big Data infrastructure blueprint
[Diagram: data infrastructure components, grouped per the legend.
Persistence: stream persistence; data storage (S3, GCS, HDFS) holding Parquet/Avro.
Processing: stream processing; batch processing; real time publishing; batch publishing.
Metadata: schema registry; datasets metadata; scheduling/job dependencies.
Data API: shared data access layer.
UI: real time dashboards; periodic dashboards; interactive analysis (used by Analysts and Eng).
Monitoring/Observability spans the whole stack.]
The Future
Still improving
Better interoperability
• More efficient interoperability: continued Apache Arrow adoption
A better data abstraction
• A better metadata repository
• A better table abstraction: Netflix/Iceberg
• Common push down implementations (filter, projection, aggregation)
Better data governance
• Global data lineage
• Access control
• Protect privacy
• Record user consent
Some predictions
A common access layer
Distributed access service centralizes:
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability
[Diagram: SQL engines (Impala, Drill, Presto, Hive, …), batch engines (Spark, MR, Beam), and ML (TensorFlow, …) all go through a table abstraction layer (schema, push downs, access control, anonymization) on top of file systems (HDFS, S3, GCS) and other storage (HBase, Cassandra, Kudu, …); a hypothetical interface sketch follows below]
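A purely hypothetical sketch of what such a distributed access service could centralize; none of these class or method names come from an existing system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]

@dataclass
class TableService:
    rows: List[Row]
    masked_columns: List[str] = field(default_factory=list)  # anonymization policy

    def scan(self, columns: List[str], predicate: Callable[[Row], bool]) -> List[Row]:
        # Push downs: apply the filter and projection as close to storage as possible,
        # and enforce access control/anonymization in one shared place.
        out = []
        for row in self.rows:
            if predicate(row):
                out.append({c: ("***" if c in self.masked_columns else row.get(c))
                            for c in columns})
        return out

service = TableService(
    rows=[{"user": "ada", "email": "ada@example.com", "visits": 3}],
    masked_columns=["email"],
)
print(service.scan(["user", "email"], lambda r: r["visits"] > 1))
# [{'user': 'ada', 'email': '***'}]
```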
A multi-tiered streaming/batch storage system
Batch-stream storage integration: convergence of Kafka and Kudu
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability
[Diagram: a time-based ingestion API writes into 1) an in-memory row-oriented store, 2) an in-memory column-oriented store, and 3) an on-disk column-oriented store; a stream consumer API (projection, predicate, aggregation push downs) reads mainly from the in-memory tiers, while a batch consumer API (same push downs) reads mainly from the on-disk columnar store; a tiering sketch follows below]
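An illustrative, hypothetical sketch of the tiering idea (not any real system's API; thresholds and names are made up): recent reads are served from memory, historical reads mostly from the on-disk columnar tier.

```python
import time

class TieredStore:
    def __init__(self, hot_seconds=60, warm_seconds=3600):
        self.hot_seconds, self.warm_seconds = hot_seconds, warm_seconds
        self.row_tier, self.column_tier, self.disk_tier = [], [], []

    def ingest(self, record):
        # Time-based ingestion API: newest data lands in the row-oriented in-memory tier.
        self.row_tier.append((time.time(), record))

    def tiers_for(self, oldest_ts):
        # Stream consumers ask for recent data -> in-memory tiers;
        # batch consumers ask for history -> mostly the on-disk columnar tier.
        age = time.time() - oldest_ts
        if age <= self.hot_seconds:
            return ["row_tier"]
        if age <= self.warm_seconds:
            return ["row_tier", "column_tier"]
        return ["row_tier", "column_tier", "disk_tier"]

store = TieredStore()
store.ingest({"user": 1, "url": "/home"})
print(store.tiers_for(time.time() - 5))       # recent read: ['row_tier']
print(store.tiers_for(time.time() - 86400))   # historical read: all three tiers
```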
THANK YOU!
julien.ledem@wework.com
Julien Le Dem @J_
April 2018
We’re hiring!
Contact: julien.ledem@wework.com
We’re growing
We’re hiring!
• WeWork doubled in size last year and the year before
• We expect to double in size this year as well
• Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, …
WeWork in numbers
• 242 locations in 72 cities internationally
• Over 210,000 members worldwide
• More than 20,000 member companies
Technology jobs: locations
• San Francisco: Platform teams
• New York
• Tel Aviv
• Montevideo
• Singapore
Contact: julien.ledem@wework.com
Questions?
Julien Le Dem @J_ julien.ledem@wework.com
April 2018
