This talk was the first time I publicly introduced the concept of Schema-on-Read vs. Schema-on-Write. I gave it at the Berkeley EECS RAD Lab retreat Open Mic Session on May 28th, 2009, in Santa Cruz, California.
2. Schema-on-Read
Traditional data systems require users to create a schema before loading any data into the system. This allows such systems to tightly control the placement of the data at load time, which enables them to answer interactive queries very fast. However, it comes at the cost of agility.
In this talk I will demonstrate Hadoop's schema-on-read capability. With this approach, data can start flowing into the system in its original form, and the schema is applied at read time (each user can apply their own "data-lens" to interpret the data). This allows for extreme agility when dealing with complex, evolving data structures.
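The "data-lens" idea can be illustrated with a minimal sketch. It is plain Python rather than Hadoop, and the sample records, the `read_with_lens` helper, and both lenses are hypothetical names made up for illustration. The raw lines are stored untouched, and each reader supplies its own mapping of output columns to raw fields at read time.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"user": "alice", "action": "click", "ms": 120}',
    '{"user": "bob", "action": "view", "ms": 340, "referrer": "search"}',
]

def read_with_lens(lines, lens):
    """Parse each raw line at read time and project it through a reader's lens.

    `lens` maps an output column name to (raw field name, cast function).
    Fields missing from a record come back as None instead of failing.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[field]) if field in record else None
            for name, (field, cast) in lens.items()
        }

# Two readers interpret the same stored bytes differently.
latency_lens = {"user": ("user", str), "seconds": ("ms", lambda v: v / 1000)}
traffic_lens = {"action": ("action", str), "referrer": ("referrer", str)}

latency_rows = list(read_with_lens(RAW_LINES, latency_lens))
traffic_rows = list(read_with_lens(RAW_LINES, traffic_lens))
```

Neither lens required any change to the stored data, which is the agility the slide refers to. The trade-off is that parsing work is repeated on every read.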
3. Agility/Flexibility
Schema-on-Write (RDBMS):
• Prescriptive Data Modeling:
  • Create static DB schema
  • Transform data into RDBMS
  • Query data in RDBMS format
• New columns must be added explicitly before new data can propagate into the system.
• Good for Known Unknowns (Repetition)

Schema-on-Read (Hadoop):
• Descriptive Data Modeling:
  • Copy data in its native format
  • Create schema + parser
  • Query data in its native format (does ETL on the fly)
• New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
• Good for Unknown Unknowns (Exploration)
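The point about new columns appearing retroactively can be sketched as follows. This is a simplified illustration in plain Python, not an RDBMS or Hadoop. The log lines, `WRITE_SCHEMA`, and the `query` helper are hypothetical. The schema-on-write path keeps only the predeclared columns at load time, while the schema-on-read path keeps the raw lines and picks the columns at query time.

```python
import json

# Hypothetical raw log; the second record introduces a new "referrer" field.
RAW_LOG = [
    '{"user": "alice", "action": "click"}',
    '{"user": "bob", "action": "view", "referrer": "search"}',
]

# Schema-on-write: columns are fixed before loading, and the load-time
# transform keeps only those columns.
WRITE_SCHEMA = ("user", "action")
table = [tuple(json.loads(line)[col] for col in WRITE_SCHEMA) for line in RAW_LOG]
# "referrer" was dropped during load. Adding the column later does not bring
# it back for rows already loaded unless the source data is reloaded.

# Schema-on-read: raw lines are kept, and the parser is chosen at query time.
def query(raw_lines, columns):
    """Parse raw lines on the fly and project the requested columns."""
    return [tuple(json.loads(line).get(col) for col in columns) for line in raw_lines]

old_view = query(RAW_LOG, ("user", "action"))
# Once the parser describes the new field, it shows up for data already stored.
new_view = query(RAW_LOG, ("user", "action", "referrer"))
```

Here `table` and `old_view` hold the same rows, but only the schema-on-read path can produce `new_view` without going back to the original source.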
4. Traditional Data Stack
Business Intelligence Software (OLAP, etc.)
Datamart Database
200GB/day
Extract-Transform-Load
Foundational Warehouse
Grid Processing System (1st stage ETL)
File Server Farm
Log Collection
Instrumentation
20TB/day