Schema-on-Read vs Schema-on-Write
Amr Awadallah
CTO, Cloudera, Inc.
aaa@cloudera.com

This is the first time I introduced the concept of Schema-on-Read vs Schema-on-Write to the public. It was at the Berkeley EECS RAD Lab retreat Open Mic Session on May 28, 2009, in Santa Cruz, California.

Published in: Technology
Schema-on-Read vs Schema-on-Write

  1. Schema-on-Read vs Schema-on-Write. Amr Awadallah, CTO, Cloudera, Inc. aaa@cloudera.com
  2. Schema-on-Read. Traditional data systems require users to create a schema before loading any data into the system. This lets such systems tightly control the placement of the data at load time, which enables them to answer interactive queries very fast. However, it leads to a loss of agility. In this talk I will demonstrate Hadoop's schema-on-read capability. With this approach, data can start flowing into the system in its original form, and it is then parsed against a schema at read time (each user can apply their own "data lens" to interpret the data). This allows for extreme agility when dealing with complex, evolving data structures.
  3. Agility/Flexibility

     Schema-on-Write (RDBMS): Prescriptive Data Modeling
       • Create static DB schema
       • Transform data into RDBMS
       • Query data in RDBMS format
       • New columns must be added explicitly before new data can propagate into the system.
       • Good for Known Unknowns (Repetition)

     Schema-on-Read (Hadoop): Descriptive Data Modeling
       • Copy data in its native format
       • Create schema + parser
       • Query data in its native format (does ETL on the fly)
       • New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
       • Good for Unknown Unknowns (Exploration)
  4. Traditional Data Stack (diagram labels, in slide order)
       • Business Intelligence Software (OLAP, etc.)
       • Datamart Database
       • 200GB/day
       • Extract-Transform-Load
       • Foundational Warehouse
       • Grid Processing System (1st stage ETL)
       • File Server Farm
       • Log Collection
       • Instrumentation
       • 20TB/day
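
The contrast drawn on slides 2 and 3 can be sketched in a few lines of Python. This is only an illustration, not the talk's demo: an in-memory SQLite table stands in for the RDBMS, a plain list of JSON lines stands in for files kept in their native format, and the sample events, field names, and the helpers `load_row` and `read_with_schema` are all hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events, kept exactly as they arrived (native format).
RAW_LINES = [
    '{"user_name": "ann", "action": "click", "ms": 120}',
    '{"user_name": "bob", "action": "view", "ms": 340}',
    '{"user_name": "ann", "action": "view", "ms": 95, "country": "US"}',
]

# --- Schema-on-write: the schema exists before any data is loaded. ---
conn = sqlite3.connect(":memory:")  # stand-in for an RDBMS
conn.execute("CREATE TABLE events (user_name TEXT, action TEXT, ms INTEGER)")

def load_row(record):
    """Insert one record; fails if it carries a column the table lacks."""
    columns = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    conn.execute(
        f"INSERT INTO events ({columns}) VALUES ({placeholders})",
        tuple(record.values()),
    )

rejected = []
for line in RAW_LINES:
    record = json.loads(line)
    try:
        load_row(record)
    except sqlite3.OperationalError:
        # The new "country" field cannot enter until the schema is altered.
        rejected.append(record)

# --- Schema-on-read: store raw lines, apply a schema when reading. ---
def read_with_schema(lines, schema):
    """Parse each raw line and project it through a field -> converter map."""
    for line in lines:
        record = json.loads(line)
        # Fields missing from older records come back as None.
        yield {
            name: (convert(record[name]) if name in record else None)
            for name, convert in schema.items()
        }

# Two "data lenses" over the same stored lines.
latency_lens = {"user_name": str, "ms": int}
geo_lens = {"user_name": str, "country": str}

latency_rows = list(read_with_schema(RAW_LINES, latency_lens))
geo_rows = list(read_with_schema(RAW_LINES, geo_lens))
```

In this sketch the third event is rejected by the fixed table, while on the read side it was already stored and its `country` field becomes visible as soon as a lens that describes it is applied. The price is that every read re-parses the raw lines, which mirrors the "does ETL on the fly" point on slide 3.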