This is the first time I introduced the concept of Schema-on-Read vs Schema-on-Write to the public. It was at the Berkeley EECS RAD Lab retreat Open Mic Session on May 28th, 2009, in Santa Cruz, California.
Schema-on-Read vs Schema-on-Write
CTO, Cloudera, Inc.
Traditional data systems require users to create a schema before loading any data into the system. This allows such systems to tightly control the placement of the data at load time, which in turn enables them to answer interactive queries very fast. However, this leads to a loss of agility.
In this talk I will demonstrate Hadoop's schema-on-read capability. Using this approach, data can start flowing into the system in its original form, and the schema is applied when the data is read (each user can apply their own "data-lens" to interpret the data). This allows for extreme agility while dealing with complex, evolving data structures.
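The "data-lens" idea can be sketched in a few lines of Python. This is only an illustration, not the demo from the talk: the JSON record layout, the two lens functions, and the `read_with` helper are all hypothetical names chosen here, and plain in-memory strings stand in for files stored in Hadoop.

```python
import json

# Raw events kept exactly as they arrived; no schema was declared at load time.
# The record layout is hypothetical and chosen only for illustration.
RAW_EVENTS = [
    '{"user": "alice", "action": "click", "ms": 120}',
    '{"user": "bob", "action": "view", "ms": 340, "referrer": "search"}',
    '{"user": "alice", "action": "view", "ms": 95}',
]

def latency_lens(line):
    """One reader's view: only the user and the latency matter."""
    record = json.loads(line)
    return {"user": record["user"], "ms": record["ms"]}

def referrer_lens(line):
    """Another reader's view: where the traffic came from, if recorded."""
    record = json.loads(line)
    return {"action": record["action"],
            "referrer": record.get("referrer", "unknown")}

def read_with(lens, raw_lines):
    """Apply a schema (the lens) while reading; the stored data is untouched."""
    return [lens(line) for line in raw_lines]

# Two readers interpret the same stored bytes in two different ways.
latencies = read_with(latency_lens, RAW_EVENTS)
referrers = read_with(referrer_lens, RAW_EVENTS)
```

Both readers work from the same unmodified `RAW_EVENTS`; neither had to agree on a schema before the data was stored.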
Prescriptive Data Modeling:
Create static DB schema
Transform data into RDBMS
Query data in RDBMS format
New columns must be added explicitly before new data can propagate into the system.
Good for Known Unknowns

Descriptive Data Modeling:
Copy data in its native format
Create schema + parser
Query data in its native format (does ETL on the fly)
New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
Good for Unknown Unknowns
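The difference in how a new field propagates under the two modeling styles can be shown with a small Python sketch. It rests on assumptions that are not in the slides: an in-memory SQLite table stands in for the RDBMS, JSON lines stand in for data kept in its native format, and the `purchases` table, the `region` field, and the two parser functions are hypothetical.

```python
import json
import sqlite3

# Hypothetical incoming records; the last two carry a new "region" field.
incoming = [
    {"user": "alice", "amount": 10},
    {"user": "bob", "amount": 25, "region": "EU"},
    {"user": "carol", "amount": 7, "region": "US"},
]

# Prescriptive: create the static schema first, then transform and load.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE purchases (user TEXT, amount INTEGER)")
for rec in incoming:
    # The load step keeps only the declared columns, so "region" is dropped.
    db.execute("INSERT INTO purchases VALUES (?, ?)",
               (rec["user"], rec["amount"]))

# Adding the column later does not recover what the earlier load discarded.
db.execute("ALTER TABLE purchases ADD COLUMN region TEXT")
written_regions = [row[0] for row in
                   db.execute("SELECT region FROM purchases ORDER BY user")]

# Descriptive: copy the data in its native format, describe it afterwards.
raw_store = [json.dumps(rec) for rec in incoming]

def parser_v1(line):
    """First schema/parser: it does not yet know about "region"."""
    rec = json.loads(line)
    return (rec["user"], rec["amount"])

def parser_v2(line):
    """Updated schema/parser: "region" now appears for old records too."""
    rec = json.loads(line)
    return (rec["user"], rec["amount"], rec.get("region"))

read_v1 = [parser_v1(line) for line in raw_store]
read_v2 = [parser_v2(line) for line in raw_store]
```

In this sketch every value in `written_regions` is `None`, because the field was discarded at load time, while `read_v2` recovers the region for records that were stored before the parser knew about it.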
Traditional Data Stack
Business Intelligence Software (OLAP, etc.)
Grid Processing System (1st stage ETL)
File Server Farm