Schema-on-Read vs Schema-on-Write

Amr Awadallah
Amr AwadallahCTO, Founder at Cloudera, Inc
Schema-on-Read vs Schema-on-Write
Amr Awadallah
CTO, Cloudera, Inc.
aaa@cloudera.com
Schema-on-Read
Traditional data systems require users to create a
schema before loading any data into the system.
This allows such systems to tightly control the
placement of the data during load time hence
enabling them to answer interactive queries very
fast. However, this leads to loss of agility.
In this talk I will demonstrate Hadoop's schema-onread capability. Using this approach data can start
flowing into the system in its original form, then the
schema is parsed at read time (each user can apply
their own "data-lens“ to interpret the data). This
allows for extreme agility while dealing with
complex evolving data structures.
Agility/Flexibility
Schema-on-Write (RDBMS):
•

Prescriptive Data Modeling:

Schema-on-Read (Hadoop):
•

Descriptive Data Modeling:

•

Create static DB schema

•

Copy data in its native format

•

Transform data into RDBMS

•

Create schema + parser

•

Query data in RDBMS format

•

Query Data in its native format
(does ETL on the fly)

•

New columns must be added
explicitly before new data can
propagate into the system.

•

New data can start flowing any time
and will appear retroactively once the
schema/parser properly describes it.

•

Good for Known Unknowns
(Repetition)

•

Good for Unknown Unknowns
(Exploration)
3
Traditional Data Stack
Business Intelligent Software (OLAP, etc)
Datamart Database

200GB/day

Extract-Transform-Load
Foundational Warehouse
Grid Processing System (1st stage ETL)
File Server Farm
Log Collection

Instrumentation

20TB/day
1 of 4

Recommended

Espresso: LinkedIn's Distributed Data Serving Platform (Paper) by
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Amy W. Tang
40.1K views12 slides
The Real Cost of Slow Time vs Downtime by
The Real Cost of Slow Time vs DowntimeThe Real Cost of Slow Time vs Downtime
The Real Cost of Slow Time vs DowntimeRadware
20.3K views56 slides
What is in a Lucene index? by
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
48K views38 slides
Modularized ETL Writing with Apache Spark by
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkDatabricks
374 views54 slides
Databricks Delta Lake and Its Benefits by
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
5.1K views21 slides
Relational databases vs Non-relational databases by
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databasesJames Serra
22.6K views46 slides

More Related Content

What's hot

Kafka replication apachecon_2013 by
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
21.3K views31 slides
Data Lake Overview by
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
19.8K views53 slides
Building robust CDC pipeline with Apache Hudi and Debezium by
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
2.7K views17 slides
Presto by
PrestoPresto
PrestoKnoldus Inc.
5.1K views42 slides
Data pipelines from zero to solid by
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
10.7K views58 slides
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... by
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
10.8K views45 slides

What's hot(20)

Kafka replication apachecon_2013 by Jun Rao
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
Jun Rao21.3K views
Data Lake Overview by James Serra
Data Lake OverviewData Lake Overview
Data Lake Overview
James Serra19.8K views
Building robust CDC pipeline with Apache Hudi and Debezium by Tathastu.ai
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai2.7K views
Data pipelines from zero to solid by Lars Albertsson
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
Lars Albertsson10.7K views
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... by Databricks
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks10.8K views
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake by Databricks
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks2.2K views
Scaling Data Analytics Workloads on Databricks by Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
Databricks1.3K views
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... by Databricks
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks1.5K views
Modernizing to a Cloud Data Architecture by Databricks
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
Databricks649 views
Koalas: Pandas on Apache Spark by Databricks
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
Databricks3K views
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac... by HostedbyConfluent
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent4.3K views
Designing Structured Streaming Pipelines—How to Architect Things Right by Databricks
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks4.5K views
Intro to Delta Lake by Databricks
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks1.5K views
Scalability, Availability & Stability Patterns by Jonas Bonér
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
Jonas Bonér516K views

Similar to Schema-on-Read vs Schema-on-Write

Big Data_Architecture.pptx by
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptxbetalab
9 views30 slides
From discovering to trusting data by
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting datamarkgrover
423 views50 slides
MongoDB by
MongoDBMongoDB
MongoDBfsbrooke
1.3K views22 slides
Big data architectures and the data lake by
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
54.1K views53 slides
no sql presentation by
no sql presentationno sql presentation
no sql presentationchandanm2
262 views14 slides
Master.pptx by
Master.pptxMaster.pptx
Master.pptxKarthikR780430
2 views43 slides

Similar to Schema-on-Read vs Schema-on-Write(20)

Big Data_Architecture.pptx by betalab
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
betalab9 views
From discovering to trusting data by markgrover
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
markgrover423 views
MongoDB by fsbrooke
MongoDBMongoDB
MongoDB
fsbrooke1.3K views
Big data architectures and the data lake by James Serra
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra54.1K views
no sql presentation by chandanm2
no sql presentationno sql presentation
no sql presentation
chandanm2262 views
Cloudera Impala - San Diego Big Data Meetup August 13th 2014 by cdmaxime
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
cdmaxime1.3K views
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS by Mark Kromer
Microsoft Data Integration Pipelines: Azure Data Factory and SSISMicrosoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Mark Kromer2.3K views
Migrating Oracle database to Cassandra by Umair Mansoob
Migrating Oracle database to CassandraMigrating Oracle database to Cassandra
Migrating Oracle database to Cassandra
Umair Mansoob1.8K views
Google Data Engineering.pdf by avenkatram
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdf
avenkatram2 views
Data Engineering on GCP by BlibBlobb
Data Engineering on GCPData Engineering on GCP
Data Engineering on GCP
BlibBlobb5 views
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL by Arseny Chernov
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Arseny Chernov635 views
Agile data lake? An oxymoron? by samthemonad
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
samthemonad287 views
Data management in cloud study of existing systems and future opportunities by Editor Jacotech
Data management in cloud study of existing systems and future opportunitiesData management in cloud study of existing systems and future opportunities
Data management in cloud study of existing systems and future opportunities
Editor Jacotech359 views

More from Amr Awadallah

How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics... by
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
3.6K views28 slides
Cloudera/Stanford EE203 (Entrepreneurial Engineer) by
Cloudera/Stanford EE203 (Entrepreneurial Engineer)Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Cloudera/Stanford EE203 (Entrepreneurial Engineer)Amr Awadallah
3.4K views14 slides
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook by
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
22.4K views24 slides
Service Primitives for Internet Scale Applications by
Service Primitives for Internet Scale ApplicationsService Primitives for Internet Scale Applications
Service Primitives for Internet Scale ApplicationsAmr Awadallah
1.3K views14 slides
Applications of Virtual Machine Monitors for Scalable, Reliable, and Interact... by
Applications of Virtual Machine Monitors for Scalable, Reliable, and Interact...Applications of Virtual Machine Monitors for Scalable, Reliable, and Interact...
Applications of Virtual Machine Monitors for Scalable, Reliable, and Interact...Amr Awadallah
2.9K views42 slides
Yahoo Microstrategy 2008 by
Yahoo Microstrategy 2008Yahoo Microstrategy 2008
Yahoo Microstrategy 2008Amr Awadallah
1.8K views18 slides

More from Amr Awadallah(6)

How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics... by Amr Awadallah
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
Amr Awadallah3.6K views
Cloudera/Stanford EE203 (Entrepreneurial Engineer) by Amr Awadallah
Cloudera/Stanford EE203 (Entrepreneurial Engineer)Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Amr Awadallah3.4K views
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook by Amr Awadallah
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
Amr Awadallah22.4K views
Service Primitives for Internet Scale Applications by Amr Awadallah
Service Primitives for Internet Scale ApplicationsService Primitives for Internet Scale Applications
Service Primitives for Internet Scale Applications
Amr Awadallah1.3K views
Applications of Virtual Machine Monitors for Scalable, Reliable, and Interact... by Amr Awadallah
Applications of Virtual Machine Monitors for Scalable, Reliable, and Interact...Applications of Virtual Machine Monitors for Scalable, Reliable, and Interact...
Applications of Virtual Machine Monitors for Scalable, Reliable, and Interact...
Amr Awadallah2.9K views
Yahoo Microstrategy 2008 by Amr Awadallah
Yahoo Microstrategy 2008Yahoo Microstrategy 2008
Yahoo Microstrategy 2008
Amr Awadallah1.8K views

Recently uploaded

Future of AR - Facebook Presentation by
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook PresentationRob McCarty
22 views27 slides
Piloting & Scaling Successfully With Microsoft Viva by
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft VivaRichard Harbridge
13 views160 slides
Evolving the Network Automation Journey from Python to Platforms by
Evolving the Network Automation Journey from Python to PlatformsEvolving the Network Automation Journey from Python to Platforms
Evolving the Network Automation Journey from Python to PlatformsNetwork Automation Forum
17 views21 slides
Uni Systems for Power Platform.pptx by
Uni Systems for Power Platform.pptxUni Systems for Power Platform.pptx
Uni Systems for Power Platform.pptxUni Systems S.M.S.A.
58 views21 slides
20231123_Camunda Meetup Vienna.pdf by
20231123_Camunda Meetup Vienna.pdf20231123_Camunda Meetup Vienna.pdf
20231123_Camunda Meetup Vienna.pdfPhactum Softwareentwicklung GmbH
45 views73 slides
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 by
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院IttrainingIttraining
69 views8 slides

Recently uploaded(20)

Future of AR - Facebook Presentation by Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty22 views
Piloting & Scaling Successfully With Microsoft Viva by Richard Harbridge
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 by IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
"Running students' code in isolation. The hard way", Yurii Holiuk by Fwdays
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk
Fwdays24 views
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by The Digital Insurer
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Special_edition_innovator_2023.pdf by WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2218 views
The Forbidden VPN Secrets.pdf by Mariam Shaba
The Forbidden VPN Secrets.pdfThe Forbidden VPN Secrets.pdf
The Forbidden VPN Secrets.pdf
Mariam Shaba20 views
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by sugiuralab
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors
sugiuralab23 views
SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman38 views

Schema-on-Read vs Schema-on-Write

  • 1. Schema-on-Read vs Schema-on-Write Amr Awadallah CTO, Cloudera, Inc. aaa@cloudera.com
  • 2. Schema-on-Read Traditional data systems require users to create a schema before loading any data into the system. This allows such systems to tightly control the placement of the data during load time hence enabling them to answer interactive queries very fast. However, this leads to loss of agility. In this talk I will demonstrate Hadoop's schema-onread capability. Using this approach data can start flowing into the system in its original form, then the schema is parsed at read time (each user can apply their own "data-lens“ to interpret the data). This allows for extreme agility while dealing with complex evolving data structures.
  • 3. Agility/Flexibility Schema-on-Write (RDBMS): • Prescriptive Data Modeling: Schema-on-Read (Hadoop): • Descriptive Data Modeling: • Create static DB schema • Copy data in its native format • Transform data into RDBMS • Create schema + parser • Query data in RDBMS format • Query Data in its native format (does ETL on the fly) • New columns must be added explicitly before new data can propagate into the system. • New data can start flowing any time and will appear retroactively once the schema/parser properly describes it. • Good for Known Unknowns (Repetition) • Good for Unknown Unknowns (Exploration) 3
  • 4. Traditional Data Stack Business Intelligent Software (OLAP, etc) Datamart Database 200GB/day Extract-Transform-Load Foundational Warehouse Grid Processing System (1st stage ETL) File Server Farm Log Collection Instrumentation 20TB/day