Schema-on-Read vs Schema-on-Write


This is the first time I introduced the concept of Schema-on-Read vs Schema-on-Write to the public. It was at the Berkeley EECS RAD Lab retreat Open Mic Session on May 28th, 2009, in Santa Cruz, California.


Schema-on-Read vs Schema-on-Write
Amr Awadallah
CTO, Cloudera, Inc.
aaa@cloudera.com

Schema-on-Read
Traditional data systems require users to create a
schema before loading any data into the system.
This lets such systems tightly control the
placement of the data at load time, which enables
them to answer interactive queries very
fast. However, it comes at the cost of agility.
In this talk I will demonstrate Hadoop's schema-on-read
capability. With this approach, data can start
flowing into the system in its original form, and the
schema is applied when the data is read (each user can apply
their own "data lens" to interpret the data). This
allows for extreme agility when dealing with
complex, evolving data structures.
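
The "data lens" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Hadoop mechanism from the talk: the JSON log format, the field names, and the helper functions `latency_lens`, `page_lens`, and `read_with` are all hypothetical. The raw lines are stored untouched, and each reader applies its own schema while reading.

```python
import json

# Raw events are kept exactly as they arrived; no schema was declared up front.
# (The JSON format and field names are assumptions made for this sketch.)
RAW_LINES = [
    '{"user": "alice", "action": "click", "ms": 120}',
    '{"user": "bob", "action": "view", "ms": 340, "page": "/home"}',
    '{"user": "alice", "action": "view", "ms": 95, "page": "/docs"}',
]


def latency_lens(line):
    """One reader's schema: (user, latency in milliseconds)."""
    record = json.loads(line)
    return record["user"], record["ms"]


def page_lens(line):
    """Another reader's schema: (action, page), where page may be absent."""
    record = json.loads(line)
    return record["action"], record.get("page")


def read_with(lens, lines):
    """Apply a schema (lens) while reading, not while loading."""
    return [lens(line) for line in lines]


latencies = read_with(latency_lens, RAW_LINES)
pages = read_with(page_lens, RAW_LINES)
print(latencies)
print(pages)
```

Both readers work from the same stored bytes. Neither lens had to exist when the data was written, and adding a third lens requires no reload.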

Agility/Flexibility

Schema-on-Write (RDBMS):
• Prescriptive Data Modeling:
• Create static DB schema
• Transform data into RDBMS
• Query data in RDBMS format
• New columns must be added explicitly before new data can propagate into the system.
• Good for Known Unknowns (Repetition)

Schema-on-Read (Hadoop):
• Descriptive Data Modeling:
• Copy data in its native format
• Create schema + parser
• Query data in its native format (does ETL on the fly)
• New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
• Good for Unknown Unknowns (Exploration)
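
The contrast in how a new field propagates can be sketched as follows. This is a toy model under stated assumptions, not either system's actual behavior: an in-memory SQLite table stands in for the RDBMS, a list of JSON strings stands in for files in Hadoop, and the event fields and the `parser_v1`/`parser_v2` helpers are hypothetical.

```python
import json
import sqlite3

# Two incoming events; the second carries a field the schema does not know yet.
events = [
    {"user_name": "alice", "action": "click"},
    {"user_name": "bob", "action": "view", "referrer": "search"},
]

# Schema-on-write: the table is created first, and the load step keeps only
# the columns that were declared at that time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_name TEXT, action TEXT)")
db.executemany(
    "INSERT INTO events (user_name, action) VALUES (?, ?)",
    [(e["user_name"], e["action"]) for e in events],
)
# Adding the column afterwards does not bring back values dropped during load.
db.execute("ALTER TABLE events ADD COLUMN referrer TEXT")
written = db.execute(
    "SELECT user_name, referrer FROM events ORDER BY user_name"
).fetchall()

# Schema-on-read: records are copied in their native form, untouched.
raw_store = [json.dumps(e) for e in events]


def parser_v1(line):
    """Original parser: knows nothing about 'referrer'."""
    r = json.loads(line)
    return r["user_name"], r["action"]


def parser_v2(line):
    """Updated parser: describes the new field, absent values become None."""
    r = json.loads(line)
    return r["user_name"], r["action"], r.get("referrer")


before = [parser_v1(line) for line in raw_store]
after = [parser_v2(line) for line in raw_store]
print(written)
print(after)
```

In the SQLite half, bob's referrer is lost because it arrived before the column existed. In the raw-store half, swapping in `parser_v2` makes the value visible retroactively for data that was already stored.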

Traditional Data Stack
Business Intelligence Software (OLAP, etc.)
Datamart Database

200GB/day

Extract-Transform-Load
Foundational Warehouse
Grid Processing System (1st stage ETL)
File Server Farm
Log Collection

Instrumentation

20TB/day
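
As a rough worked calculation, the two figures on this slide can be compared directly. The reading below is an assumption, since the slide is a diagram without explanatory text: 20TB/day is taken as the raw volume entering at the bottom of the stack, 200GB/day as the volume that reaches the datamart, and both are treated with decimal prefixes.

```python
# Daily volumes quoted on the slide. Which layer each figure belongs to is an
# assumption made for this sketch, as is the use of decimal (SI) prefixes.
RAW_TB_PER_DAY = 20        # assumed: raw data collected at the bottom of the stack
DATAMART_GB_PER_DAY = 200  # assumed: data loaded into the datamart at the top

GB_PER_TB = 1000

raw_gb_per_day = RAW_TB_PER_DAY * GB_PER_TB
reduction_factor = raw_gb_per_day / DATAMART_GB_PER_DAY
fraction_retained = DATAMART_GB_PER_DAY / raw_gb_per_day

print(f"raw volume: {raw_gb_per_day} GB/day")
print(f"reduction factor: {reduction_factor:.0f}x")
print(f"fraction reaching the datamart: {fraction_retained:.0%}")
```

Under these assumptions, about one part in a hundred of the collected data ends up queryable in the datamart.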


