www.scling.com
Schema on read is obsolete.
Welcome metaprogramming.
Data Innovation Summit, 2024-04-24
Lars Albertsson
Scling

IT craft to factory
Craft → factory:
● Waterfall → Agile
● Application delivery → Containers
● Traditional operations → DevOps
● Traditional QA → CI/CD
● Infrastructure → Infrastructure as code
● Security → DevSecOps

Data factories
Craft → factory:
● Waterfall → Agile
● Application delivery → Containers
● Traditional operations → DevOps
● Traditional QA → CI/CD
● Infrastructure → Infrastructure as code
● Security → DevSecOps
● DB-oriented architecture → Data factories, data pipelines, DataOps

Craft vs industry
Craft:
● Each step steered by human
○ Or primitive automation
● Improving artifacts
● Craft is primary competence
● Components made for humans
○ Look nice, "easy to use"
○ More popular
Industry:
● Autonomous processes
● Improving process that creates artifacts
● Multitude of competences
● Some components unusable by humans
○ Hard, greasy
○ Made for integration
○ Less popular

Data engineering in the future
DW
~10 year capability gap
"data factory engineering"
Enterprise big data failures
"Modern data stack": traditional workflows, new technology
4GL / UML phase of data engineering
Data engineering education

Efficiency gap, data cost & value
● Data processing produces datasets
○ Each dataset has business value
● Proxy value/cost metric: datasets / day
○ S-M traditional: < 10
○ Bank, telecom, media: 100-1000
2014: 6500 datasets / day
2016: 20000 datasets / day
2018: 100000+ datasets / day, 25% of staff use BigQuery
2021: 500B events collected / day
2016: 1600 000 000 datasets / day
[Chart: value vs effort. Financial, reporting; insights, data-fed features; disruptive value of data, machine learning]

Data-factory-as-a-service
Data lake
● Data factory
○ Collected, raw data → processed, valuable data
● Data pipelines customised for client
○ Analytics (BI, reports, A/B testing)
○ Data-fed features (autocomplete, search)
○ Learning systems (recommendations, fraud)
● Compete with data leaders:
○ Quick idea-to-production
○ Operational efficiency

Data agility
● Siloed: 6+ months
Cultural work
● Autonomous: 1 month
Technical work
● Coordinated: days
Data lake
∆
∆
Latency?

What is a schema?
● Lowest common denominator = name, type, required
○ Types: string, long, double, binary, array, map, union, record
● Schema specification may support additional constraints, e.g. integer range, other collections

Id  Name    Age  Phone
1   "Anna"  34   null
2   "Bob"   42   "08-123456"

Fields: name, type, required?
In RDBMS, relations are explicit
In lake/stream datasets, relations are implicit

Schema definitions
{
"type" : "record",
"namespace" : "com.mapflat.example",
"name" : "User",
"fields" : [
{ "name" : "id" , "type" : "int" },
{ "name" : "name" , "type" : "string" },
{ "name" : "age" , "type" : "int" },
{ "name" : "phone" , "type" : ["null", "string"],
"default": null }
]
}
● RDBMS: Table metadata
● Avro format: JSON/DSL definition
○ Definition is bundled with avro data files
○ Reused by Parquet format
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
{ "id": 1, "name": "Alice", "age": "34" }
{ "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" }
case class User(id: String, name: String, age: Int,
phone: Option[String] = None)
val users = Seq( User("1", "Alice", 32),
User("2", "Bob", 43, Some("08-123456")))

Schema on write
● Schema defined by writer
● Destination (table / dataset / stream topic) has defined schema
○ Technical definition with metadata (e.g. RDBMS, Kafka + registry)
○ By convention
● Writes not in compliance are not accepted
○ Technically aborted (e.g. RDBMS)
○ In violation of intent (e.g. HDFS datasets)
● Can be technically enforced by producer driver
○ Through ORM / code generation
○ Schema registry lookup
Strict checking philosophy
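
The strict philosophy can be sketched as a writer that validates every record against the destination schema before accepting it. This is only an illustration: `FieldSpec`, `SchemaOnWrite` and `validate` are hypothetical names, records are plain maps rather than Avro or RDBMS rows, and only `int` and `string` types are handled.

```scala
// Hypothetical minimal schema model: name, type, required.
final case class FieldSpec(name: String, typeName: String, required: Boolean)

object SchemaOnWrite {
  val userSchema: List[FieldSpec] = List(
    FieldSpec("id", "int", required = true),
    FieldSpec("name", "string", required = true),
    FieldSpec("age", "int", required = true),
    FieldSpec("phone", "string", required = false)
  )

  // Maps a runtime value to the schema type name it satisfies, if any.
  private def typeOf(value: Any): Option[String] = value match {
    case _: Int    => Some("int")
    case _: String => Some("string")
    case _         => None
  }

  // Right(record) if the record complies, otherwise Left with all violations.
  def validate(
      schema: List[FieldSpec],
      record: Map[String, Any]
  ): Either[List[String], Map[String, Any]] = {
    val fieldErrors = schema.flatMap { f =>
      record.get(f.name) match {
        case None if f.required                        => Some(s"missing required field ${f.name}")
        case None                                      => None
        case Some(v) if typeOf(v).contains(f.typeName) => None
        case Some(v)                                   => Some(s"field ${f.name} expected ${f.typeName}, got $v")
      }
    }
    // Fields not declared in the schema are rejected as well.
    val unknown = (record.keySet -- schema.map(_.name)).toList.sorted.map(k => s"unknown field $k")
    val errors = fieldErrors ++ unknown
    if (errors.isEmpty) Right(record) else Left(errors)
  }
}
```

A record with `"age": "34"` (a string) is rejected at write time, so the producer sees the error rather than a downstream consumer.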

Schema on read
● Anything (technically) accepted when writing
● Schema defined by reader, at consumption
○ Reader may impose requirements on type & value
● In dynamic languages, fields propagate implicitly
○ E-shopping example:
i. Join order + customer.
ii. Add device_type to order schema
iii. device_type becomes available in downstream datasets
● Violations of constraints are detected at read
○ Perhaps long after production?
○ By team not owning producer?
Loose checking philosophy
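
The loose philosophy and the e-shopping example can be sketched with untyped records. The assumptions here are that records are `Map[String, Any]` and that `SchemaOnRead`, `joinOrderCustomer` and `readAge` are hypothetical names. The join never names individual fields, so a new `device_type` field flows downstream untouched, while a bad `age` is only noticed when a reader asks for it.

```scala
object SchemaOnRead {
  type Record = Map[String, Any]

  // Join step: merges order and customer fields without naming any of them,
  // so fields added upstream propagate implicitly.
  def joinOrderCustomer(order: Record, customer: Record): Record = customer ++ order

  // The reader imposes its own type requirement at consumption time.
  def readAge(record: Record): Either[String, Int] = record.get("age") match {
    case Some(age: Int) => Right(age)
    case Some(other)    => Left(s"age is not an int: $other")
    case None           => Left("age is missing")
  }
}
```

Nothing stops the producer from writing `"age": "34"`; the violation surfaces in `readAge`, possibly long after production and in a team that does not own the producer.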

Dynamic vs static typing
Schema on write: static typing, strict
Schema on read: dynamic typing, loose
Possible
Java:
user.setName("Alice");
user2.getName();
Scala:
user = User(name = "Alice", ...)
user2.name
Java:
user.set("name", "Alice");
user2.get("name");
Python:
user.name = "Alice"
user2.name

Schema on read or write?
[Diagram: services with their DBs, exported to business intelligence]
Change agility important here
Production stability important here

Schema definition choice
● Expressive
● Custom types
● IDE support
● Avro for data lake storage
● RDBMS: Table metadata
● Avro: JSON/DSL definition
○ Definition is bundled with avro data files
● Parquet
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
case class User(id: String, name: String, age: Int,
phone: Option[String] = None)
val users = Seq( User("1", "Alice", 32),
User("2", "Bob", 43, Some("08-123456")))

Schema offspring
Generated from case classes:
● Test record difference render type classes
● Test equality type classes
● Avro definitions
● Java Avro codec classes
● Java <-> Scala converters
● Avro type annotations
● MySQL schemas
● CSV codecs
● Privacy by design machinery
● Python
● Logical types

Avro codecs
case classes
Avro definitions
Java Avro codec classes
Java <-> Scala converters
{
  "type": "record",
  "name": "JavaUser",
  "fields": [
    { "name": "age", "type": "int" },
    { "name": "phone", "type": [ "null", "string" ] }
  ]
}
public class JavaUser implements SpecificRecord {
public Integer getAge() { ... }
public String getPhone() { ... }
}
object UserConverter extends AvroConverter[User] {
def fromSpecific(u: JavaUser): User
def toSpecific(u: User): JavaUser
}
case class User(age: Int,
phone: Option[String] = None)
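
To make the converter's job concrete, here is a hand-written sketch of what a generated converter might contain. `JavaUser` below is a plain stand-in class, not a real Avro `SpecificRecord`, and the converter drops the `AvroConverter` base trait, so the snippet runs with the standard library alone.

```scala
// Stand-in for the Avro-generated Java class; phone may be null.
final class JavaUser(val age: Integer, val phone: String)

final case class User(age: Int, phone: Option[String] = None)

// Hypothetical converter, simplified from the AvroConverter[User] shape above.
object UserConverter {
  // Nullable Java field -> Option
  def fromSpecific(u: JavaUser): User = User(u.age.intValue, Option(u.phone))

  // Option -> nullable Java field
  def toSpecific(u: User): JavaUser = new JavaUser(Int.box(u.age), u.phone.orNull)
}
```

The only real logic is the null/`Option` bridging, which is why this code is a good candidate for generation from the case class definition.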

Scalameta
● Parsing and analysis of Scala source code
val a = b() + 3
lex:
["val", " ", "a", " ", "=", " ", "b", "(", ")", " ", "+", " ", "3"]
parse:
[val, "a", =, Call("b"), +, Int(3)]
semantic analysis:
[val, Int(a), =, Call(com.scling.func.b), +, Int(3)]

Scalameta use cases
● Scalafmt
● Scalafix
○ Static analysis
○ Code transformation
● Online code generation - macros
● Offline code generation
// Example from scio 0.7 -> 0.8 upgrade rules
import scalafix.v1._
import scala.meta._

final class FixTensorflow extends SemanticRule("FixTensorflow") {
  override def fix(implicit doc: SemanticDocument): Patch =
    doc.tree.collect {
      // Rewrite calls to the renamed method, keeping the receiver `s`
      case t @ Term.Select(s, Term.Name("saveAsTfExampleFile")) =>
        Patch.replaceTree(t, q"$s.saveAsTfRecordFile".syntax)
    }.asPatch
}

Schema & syntax tree
Defn.Class(
List(Mod.Annot(Init(Type.Name("PrivacyShielded"), , List())), case),
Type.Name("SaleTransaction"),
List(),
Ctor.Primary(
List(),
,
List(
List(
Term.Param(
List(Mod.Annot(Init(Type.Name("PersonalId"), , List()))),
Term.Name("customerClubId"),
Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
None
),
Term.Param(
List(Mod.Annot(Init(Type.Name("PersonalData"), , List()))),
Term.Name("storeId"),
Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
None
),
Term.Param(
List(),
Term.Name("item"),
Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
None
),
Term.Param(List(), Term.Name("timestamp"), Some(Type.Name("String")), None)
)
)
),
Template(List(), List(), Self(, None), List()))
@PrivacyShielded
case class SaleTransaction(
@PersonalId customerClubId: Option[String],
@PersonalData storeId: Option[String],
item: Option[String],
timestamp: String
)

Quasiquotes
val stat: Stat = "val a = b() + 3".parse[Stat].get
val stat: Stat = q"val a = b() + 3"

Quasiquotes in practice
q"""
object $converterName extends AvroConverter[${srcClass.clazz.name}] {
  import RecordFieldConverters._
  type S = $jClassName
  def schema: Schema = $javaClassTerm.getClassSchema()
  def tag: ClassTag[S] = implicitly[ClassTag[S]]
  def datumReader: SpecificDatumReader[S] = new SpecificDatumReader[$jClassName](classOf[$jClassName])
  def datumWriter: SpecificDatumWriter[S] = new SpecificDatumWriter[$jClassName](classOf[$jClassName])
  def fromSpecific(record: $jClassName): ${srcClass.clazz.name} =
    ${Term.Name(srcClass.clazz.name.value)}(..$fromInits)
  def toSpecific(record: ${srcClass.clazz.name}): $jClassName =
    new $jClassName(..$specificArgs)
}
"""

Test equality
case classes
Test record difference render type classes
Test equality type classes
trait REquality[T] { def equal(left: T, right: T): Boolean }
object REquality {
  implicit val double: REquality[Double] = new REquality[Double] {
    def equal(left: Double, right: Double): Boolean = {
      // Use a combination of absolute and relative tolerance
      left === right +- 1e-5.max(left.abs * 1e-5).max(right.abs * 1e-5)
    }
  }
  /** binds the Magnolia macro to the `gen` method */
  implicit def gen[T]: REquality[T] = macro Magnolia.gen[T]
}
object Equalities {
  implicit val equalityUser: REquality[User] =
    REquality.gen[User]
}
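
For readers without Magnolia at hand, the same idea can be sketched with hand-written instances. This version is an assumption-laden simplification: it replaces the ScalaTest `===` / `+-` syntax with plain arithmetic, and `Measurement` is a hypothetical record type. The case class instance is the field-by-field boilerplate that macro derivation would otherwise produce.

```scala
trait REquality[T] { def equal(left: T, right: T): Boolean }

object REquality {
  // Doubles compare with a combination of absolute and relative tolerance.
  implicit val double: REquality[Double] = new REquality[Double] {
    def equal(left: Double, right: Double): Boolean = {
      val tolerance = math.max(1e-5, math.max(math.abs(left) * 1e-5, math.abs(right) * 1e-5))
      math.abs(left - right) <= tolerance
    }
  }

  implicit val string: REquality[String] = new REquality[String] {
    def equal(left: String, right: String): Boolean = left == right
  }
}

// Hypothetical record type used only for this illustration.
final case class Measurement(sensor: String, value: Double)

object Measurement {
  // Field-by-field comparison, delegating to each field's type class instance.
  implicit val equality: REquality[Measurement] = new REquality[Measurement] {
    def equal(left: Measurement, right: Measurement): Boolean =
      implicitly[REquality[String]].equal(left.sensor, right.sensor) &&
        implicitly[REquality[Double]].equal(left.value, right.value)
  }
}
```

With this in place, test records containing `0.1 + 0.2` and `0.3` compare as equal, which plain `==` on case classes would not allow.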

Python + RDBMS
case classes
Avro definitions
Avro type annotations
MySQL schemas
Python
case class User(
  age: Int,
  @AvroProp("sqlType", "varchar(1012)")
  phone: Option[String] = None)
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "age", "type": "int" },
    { "name": "phone",
      "type": [ "null", "string" ],
      "sqlType": "varchar(1012)" }
  ]
}
class UserEgressJob(CopyToTable):
    columns = [
        ("age", "int"),
        ("phone", "varchar(1012)"),
    ]
    ...

Logical types
case classes
Logical types
case t"""Instant""" =>
  JObject(List(JField("type", JString("long")), JField("logicalType", JString("timestamp-micros"))))
case t"""LocalDate""" =>
  JObject(List(JField("type", JString("int")), JField("logicalType", JString("date"))))
case t"""YearMonth""" => JObject(List(JField("type", JString("int"))))
case t"""JObject""" => JString("string")
● Avro logical types
○ E.g. date → int, timestamp → long
○ Default is timestamp-millis
■ Great for year > 294441 (!)
● Custom logical types
○ Time
○ Collections
○ Physical
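
The two standard logical types mentioned above map to primitives as follows. This sketch uses only `java.time`; the object and method names are hypothetical, and no Avro library is involved.

```scala
import java.time.{Instant, LocalDate}
import java.time.temporal.ChronoUnit

object LogicalTypes {
  // timestamp-micros: microseconds since 1970-01-01T00:00:00Z, stored as long.
  def toTimestampMicros(instant: Instant): Long =
    ChronoUnit.MICROS.between(Instant.EPOCH, instant)

  def fromTimestampMicros(micros: Long): Instant =
    Instant.EPOCH.plus(micros, ChronoUnit.MICROS)

  // date: days since 1970-01-01, stored as int.
  def toDate(date: LocalDate): Int = Math.toIntExact(date.toEpochDay)

  def fromDate(days: Int): LocalDate = LocalDate.ofEpochDay(days.toLong)
}
```

Generated codecs would apply conversions like these wherever a case class field has type `Instant` or `LocalDate`.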

Schema on read or write?
[Diagram: services with their DBs, exported to business intelligence]
Change agility important here
Production stability important here

Hydration boilerplate

Chimney - case class transformer
● Commonality in schema classes
○ Copy + a few more fields
○ Drop fields
● Statically typed
○ Forgot a field - error
○ Wrong type - error

Chimney in real code

Stretching the type system
● Fail: mixup kW and kWh
● Could be a compile-time error. Should be.
● Physical dimension libraries
○ Boost.Units - C++
○ Coulomb - Scala
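
A minimal way to turn the kW/kWh mix-up into a compile-time error is to wrap each quantity in its own type. The wrappers below are hypothetical and far less general than a dimension library such as Coulomb; they only show the principle.

```scala
// Hypothetical unit wrappers, one type per physical quantity.
final case class Hours(value: Double)

final case class KilowattHours(value: Double) {
  def +(other: KilowattHours): KilowattHours = KilowattHours(value + other.value)
}

final case class Kilowatts(value: Double) {
  // Power multiplied by time yields energy.
  def *(duration: Hours): KilowattHours = KilowattHours(value * duration.value)
}

object Billing {
  // Accepts energy only; passing Kilowatts here does not compile.
  def cost(energy: KilowattHours, pricePerKwh: Double): Double =
    energy.value * pricePerKwh
}
```

`Billing.cost(Kilowatts(2.0), 0.5)` is rejected by the compiler, whereas with bare `Double` fields the same mistake would pass silently into a dataset.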

Ingest prepared for deletion
[Diagram labels: data lake; private pond; cold store; mutation; landing pond; append + delete; immutable, limited retention]

Lost key pattern
● PII fields encrypted
● Per-user decryption key table
● Clear single user key => oblivion
- Extra join + decrypt
- Decryption (user) id needed
+ Multi-field oblivion
+ Single dataset leak → no PII leak
+ Handles transformed PII fields
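
The pattern can be sketched with the JDK's built-in crypto. Everything here is an assumption made for illustration: the in-memory key table, the choice of AES-GCM with a random 12-byte IV, and the names `LostKey`, `encrypt`, `decrypt` and `forget`. A production key store would be a persistent, access-controlled table.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.SecureRandom
import java.util.Base64
import javax.crypto.{Cipher, KeyGenerator, SecretKey}
import javax.crypto.spec.GCMParameterSpec
import scala.collection.mutable

object LostKey {
  private val random = new SecureRandom()
  // Per-user decryption key table (in-memory stand-in for a real key store).
  private val keys = mutable.Map.empty[String, SecretKey]

  private def keyFor(userId: String): SecretKey =
    keys.getOrElseUpdate(userId, {
      val generator = KeyGenerator.getInstance("AES")
      generator.init(128)
      generator.generateKey()
    })

  // Encrypts a PII field with the user's key; output is Base64(iv ++ ciphertext).
  def encrypt(userId: String, plain: String): String = {
    val iv = new Array[Byte](12)
    random.nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, keyFor(userId), new GCMParameterSpec(128, iv))
    Base64.getEncoder.encodeToString(iv ++ cipher.doFinal(plain.getBytes(UTF_8)))
  }

  // Returns None once the user's key has been cleared: the field is in oblivion.
  def decrypt(userId: String, encrypted: String): Option[String] =
    keys.get(userId).map { key =>
      val (iv, cipherText) = Base64.getDecoder.decode(encrypted).splitAt(12)
      val cipher = Cipher.getInstance("AES/GCM/NoPadding")
      cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv))
      new String(cipher.doFinal(cipherText), UTF_8)
    }

  // Serving a deletion request: clear the single user key.
  def forget(userId: String): Unit = { keys.remove(userId); () }
}
```

After `forget`, every encrypted copy of that user's fields, in any dataset, becomes unreadable without rewriting the datasets themselves. The cost is visible too: `decrypt` needs the user id and a key lookup, which corresponds to the extra join listed above.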

Shieldformation
@PrivacyShielded
case class Sale(
@PersonalId customerClubId: Option[String],
@PersonalData storeId: Option[String],
item: Option[String],
timestamp: String
)
case class SaleShielded(
shieldId: Option[String],
customerClubIdEncrypted: Option[String],
storeIdEncrypted: Option[String],
item: Option[String],
timestamp: String
)
case class SaleAnonymous(
item: Option[String],
timestamp: String
)
object SaleAnonymize extends SparkJob {
...
}
ShieldForm
object SaleExpose extends SparkJob {
...
}
object SaleShield extends SparkJob {
...
}
case class Shield(
shieldId: String,
personId: Option[String],
keyStr: Option[String],
encounterDate: String
)
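
What the generated shield and anonymise jobs do per record might look like the following sketch. It omits the annotations and Spark, and `SaleTransforms`, `shieldIdFor` and `encrypt` are hypothetical; in particular, dropping personal fields when no person id is present is an assumption, not something the slides state.

```scala
final case class Sale(
    customerClubId: Option[String],
    storeId: Option[String],
    item: Option[String],
    timestamp: String)

final case class SaleShielded(
    shieldId: Option[String],
    customerClubIdEncrypted: Option[String],
    storeIdEncrypted: Option[String],
    item: Option[String],
    timestamp: String)

final case class SaleAnonymous(item: Option[String], timestamp: String)

object SaleTransforms {
  // Anonymise: keep only the fields without privacy annotations.
  def anonymize(sale: Sale): SaleAnonymous = SaleAnonymous(sale.item, sale.timestamp)

  // Shield: replace personal fields with encrypted counterparts.
  // shieldIdFor and encrypt are hypothetical hooks supplied by the caller.
  def shield(
      sale: Sale,
      shieldIdFor: String => String,
      encrypt: (String, String) => String
  ): SaleShielded = {
    val shieldId = sale.customerClubId.map(shieldIdFor)
    // Assumption: without a person id there is no key, so the field is dropped.
    def enc(field: Option[String]): Option[String] =
      for { id <- shieldId; value <- field } yield encrypt(id, value)
    SaleShielded(shieldId, enc(sale.customerClubId), enc(sale.storeId), sale.item, sale.timestamp)
  }
}
```

Because the three case classes differ only mechanically, driven by the annotations, both the types and these transformations are natural targets for code generation.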

Shieldformation & lost key
[Diagram labels: Sale; SaleShield; Sale Shielded; Shield; deletion requests; customer history; exposed egress; SaleExpose; limited retention; SaleAnonymize; Sale Anonymous; sale stats]

Schema on write!
[Diagram: services with their DBs, exported to business intelligence]
Change agility important here
Production stability important here

Data factory track record
Industry | Time to first flow | Staff size | 1st flow effort, weeks | 1st flow cost (w * 50K ?) | Time to innovation | Flows 1y after first
Media | 1+ years | 10-30 | 1500? | 100M (0.5-1B) | 1+ year | ?
Finance | 2 years | 10-50 | 2000? | 100M? | Years | 10?
Media | 3 weeks | 4.5 - 8 | 15 | 750K | 3 months | 30
Retail | 7 weeks | 1-3 | 7 | 500K * | 6 months | 70
Telecom | 12 weeks | 2-5 | 30 | 1500K | 6 months | 50
Consumer products | 20+ weeks | 1.5 | 30+ | 1200+K | 6+ months | 20
Construction | 8 weeks | 0.5 | 4 | 150K * | 7 months | 10
Manufacturing | 8 weeks | 0.5 | 4 | 200K * | 6 months | ?
Schema on read is obsolete. Welcome metaprogramming..pdf

  • 1. www.scling.com Schema on read is obsolete. Welcome metaprogramming. Data Innovation Summit, 2024-04-24 Lars Albertsson Scling 1
  • 2. www.scling.com IT craft to factory 2 Security Waterfall Application delivery Traditional operations Traditional QA Infrastructure DevSecOps Agile Containers DevOps CI/CD Infrastructure as code
  • 4. www.scling.com Craft vs industry 4 ● Each step steered by human ○ Or primitive automation ● Improving artifacts ● Craft is primary competence ● Components made for humans ○ Look nice, "easy to use" ○ More popular ● Autonomous processes ● Improving process that creates artifacts ● Multitude of competences ● Some components unusable by humans ○ Hard, greasy ○ Made for integration ○ Less popular
  • 5. www.scling.com Data engineering in the future 5 DW ~10 year capability gap "data factory engineering" Enterprise big data failures "Modern data stack" - traditional workflows, new technology 4GL / UML phase of data engineering Data engineering education
  • 6. www.scling.com Efficiency gap, data cost & value ● Data processing produces datasets ○ Each dataset has business value ● Proxy value/cost metric: datasets / day ○ S-M traditional: < 10 ○ Bank, telecom, media: 100-1000 6 2014: 6500 datasets / day 2016: 20000 datasets / day 2018: 100000+ datasets / day, 25% of staff use BigQuery 2021: 500B events collected / day 2016: 1600 000 000 datasets / day Disruptive value of data, machine learning Financial, reporting Insights, data-fed features effort value
  • 7. www.scling.com Data-factory-as-a-service 7 Data lake ● Data factory ○ Collected, raw data → processed, valuable data ● Data pipelines customised for client ○ Analytics (BI, reports, A/B testing) ○ Data-fed features (autocomplete, search) ○ Learning systems (recommendations, fraud) ● Compete with data leaders: ○ Quick idea-to-production ○ Operational efficiency {....} {....} {....}
  • 8. www.scling.com Data agility 8 ● Siloed: 6+ months Cultural work ● Autonomous: 1 month Technical work ● Coordinated: days Data lake ∆ ∆ Latency?
  • 9. www.scling.com ● Lowest common denominator = name, type, required ○ Types: string, long, double, binary, array, map, union, record ● Schema specification may support additional constraints, e.g. integer range, other collections What is a schema? 9 Id Name Age Phone 1 "Anna" 34 null 2 "Bob" 42 "08-123456" Fields Name Type Required? In RDBMS, relations are explicit In lake/stream datasets, relations are implicit
  • 10. www.scling.com Schema definitions 10 { "type" : "record", "namespace" : "com.mapflat.example", "name" : "User", "fields" : [ { "name" : "id" , "type" : "int" }, { "name" : "name" , "type" : "string" }, { "name" : "age" , "type" : "int" }, { "name" : "phone" , "type" : ["null", "string"], "default": null } ] } ● RDBMS: Table metadata ● Avro format: JSON/DSL definition ○ Definition is bundled with avro data files ○ Reused by Parquet format ● pyschema / dataclass ● Scala case classes ● JSON-schema ● JSON: Each record ○ One record insufficient to deduce schema { "id": 1, "name": "Alice", "age": "34" } { "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" } case class User(id: String, name: String, age: Int, phone: Option[String] = None) val users = Seq( User("1", "Alice", 32), User("2", "Bob", 43, Some("08-123456")))
  • 11. Schema on write
    ● Schema defined by writer
    ● Destination (table / dataset / stream topic) has defined schema
      ○ Technical definition with metadata (e.g. RDBMS, Kafka + registry)
      ○ By convention
    ● Writes not in compliance are not accepted
      ○ Technically aborted (e.g. RDBMS)
      ○ In violation of intent (e.g. HDFS datasets)
    ● Can be technically enforced by producer driver
      ○ Through ORM / code generation
      ○ Schema registry lookup
    Strict checking philosophy
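As a rough illustration of "writes not in compliance are technically aborted", here is a small in-memory destination that checks every record at write time. `StrictDataset` and `SchemaViolation` are hypothetical names; a real system would delegate this to the RDBMS, a schema registry, or generated producer code.

```python
class SchemaViolation(Exception):
    """Raised when a record does not comply with the destination schema."""


class StrictDataset:
    """Hypothetical destination whose schema is enforced on every write."""

    def __init__(self, schema):
        self.schema = schema  # {field name: (type, required)}
        self.rows = []

    def write(self, record):
        unknown = set(record) - set(self.schema)
        if unknown:
            raise SchemaViolation(f"unknown fields: {sorted(unknown)}")
        for name, (typ, required) in self.schema.items():
            value = record.get(name)
            if value is None:
                if required:
                    raise SchemaViolation(f"missing required field: {name}")
            elif not isinstance(value, typ):
                raise SchemaViolation(f"{name}: expected {typ.__name__}")
        self.rows.append(record)  # only reached for compliant records


users = StrictDataset(
    {"id": (int, True), "name": (str, True), "age": (int, True), "phone": (str, False)}
)
users.write({"id": 1, "name": "Anna", "age": 34})

try:
    users.write({"id": 2, "name": "Bob", "age": "42"})  # age is a string
    rejected = False
except SchemaViolation:
    rejected = True
```

The failure surfaces in the producer, at the moment of writing, which is the strict checking philosophy in a nutshell.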
  • 12. Schema on read
    ● Anything (technically) accepted when writing
    ● Schema defined by reader, at consumption
      ○ Reader may impose requirements on type & value
    ● In dynamic languages, fields propagate implicitly
      ○ E-shopping example:
        i. Join order + customer
        ii. Add device_type to order schema
        iii. device_type becomes available in downstream datasets
    ● Violations of constraints are detected at read
      ○ Perhaps long after production?
      ○ By team not owning producer?
    Loose checking philosophy
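The e-shopping example can be sketched in Python with plain dictionaries. All names here (`join_orders_customers`, `read_order_value`, the field names) are made up for illustration. The join job never mentions `device_type`, yet the field shows up downstream once the producer adds it, and a type violation is only noticed by the consumer.

```python
def join_orders_customers(orders, customers):
    """Merge each order with its customer; all fields pass through implicitly."""
    by_id = {c["customer_id"]: c for c in customers}
    return [{**by_id[o["customer_id"]], **o} for o in orders]


def read_order_value(record):
    """Reader-side schema: this consumer requires a numeric 'amount'."""
    amount = record["amount"]
    if not isinstance(amount, (int, float)):
        raise ValueError(f"amount must be numeric, got {amount!r}")
    return amount


customers = [{"customer_id": "c1", "country": "SE"}]
orders_v1 = [{"order_id": "o1", "customer_id": "c1", "amount": 120}]
# Later the producer adds device_type (and accidentally writes amount as text).
orders_v2 = [
    {"order_id": "o2", "customer_id": "c1", "amount": "120", "device_type": "mobile"}
]

joined_v1 = join_orders_customers(orders_v1, customers)
joined_v2 = join_orders_customers(orders_v2, customers)

try:
    read_order_value(joined_v2[0])
    detected_at_read = False
except ValueError:
    detected_at_read = True
```

The write and the join both succeed; the bad `amount` is only detected when some reader applies its own requirements.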
  • 13. Dynamic vs static typing
    [Figure: schema on write / schema on read against static typing / dynamic typing; cells labelled "Strict", "Loose", "Possible"]
    Static typing:
      Java:
        user.setName("Alice");
        user2.getName();
      Scala:
        user = User(name = "Alice", ...)
        user2.name
    Dynamic typing:
      Java:
        user.set("name", "Alice");
        user2.get("name");
      Python:
        user.name = "Alice"
        user2.name
  • 14. Schema on read or write?
    [Figure: DB, DB, DB, Service, Service, Export, Business intelligence; annotations: "Change agility important here", "Production stability important here"]
  • 15. Schema definition choice
    ● Expressive
    ● Custom types
    ● IDE support
    ● Avro for data lake storage
    Candidates:
    ● RDBMS: Table metadata
    ● Avro: JSON/DSL definition
      ○ Definition is bundled with avro data files
    ● Parquet
    ● pyschema / dataclass
    ● Scala case classes
    ● JSON-schema
    ● JSON: Each record
      ○ One record insufficient to deduce schema
    Scala case class:
      case class User(id: String, name: String, age: Int, phone: Option[String] = None)
      val users = Seq(
        User("1", "Alice", 32),
        User("2", "Bob", 43, Some("08-123456")))
  • 16. Schema offspring
    [Figure: case classes and their offspring]
    ● Test equality type classes
    ● Test record difference render type classes
    ● Avro definitions
    ● Java Avro codec classes
    ● Java <-> Scala converters
    ● Avro type annotations
    ● MySQL schemas
    ● CSV codecs
    ● Privacy by design machinery
    ● Python
    ● Logical types
  • 17. Avro codecs
    Case classes:
      case class User(age: Int, phone: Option[String] = None)
    Avro definitions:
      {
        "type": "record",
        "name": "JavaUser",
        "fields": [
          { "name": "age", "type": "int" },
          { "name": "phone", "type": [ "null", "string" ] }
        ]
      }
    Java Avro codec classes:
      public class JavaUser implements SpecificRecord {
        public Integer getAge() { ... }
        public String getPhone() { ... }
      }
    Java <-> Scala converters:
      object UserConverter extends AvroConverter[User] {
        def fromSpecific(u: JavaUser): User
        def toSpecific(u: User): JavaUser
      }
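The first step of this chain, deriving an Avro definition from a class, can be sketched in Python with dataclass introspection. This is an analogue under stated assumptions, not the Scala tooling from the talk: `avro_type`, `avro_schema`, and the `AVRO_PRIMITIVES` mapping are hypothetical, and only a handful of primitive types plus `Optional` are handled.

```python
import typing
from dataclasses import dataclass, fields
from typing import Optional

# Assumed mapping; Python's int is unbounded, Avro "int" is 32-bit.
AVRO_PRIMITIVES = {int: "int", str: "string", float: "double", bool: "boolean"}


def avro_type(tp):
    """Map a type hint to an Avro type; Optional[X] becomes ["null", X]."""
    if typing.get_origin(tp) is typing.Union:
        args = typing.get_args(tp)
        non_null = [a for a in args if a is not type(None)]
        if len(args) == 2 and len(non_null) == 1:
            return ["null", avro_type(non_null[0])]
        raise TypeError(f"unsupported union: {tp}")
    return AVRO_PRIMITIVES[tp]


def avro_schema(cls, name=None):
    """Build an Avro record definition from the fields of a dataclass."""
    hints = typing.get_type_hints(cls)
    return {
        "type": "record",
        "name": name or cls.__name__,
        "fields": [
            {"name": f.name, "type": avro_type(hints[f.name])} for f in fields(cls)
        ],
    }


@dataclass
class User:
    age: int
    phone: Optional[str] = None


schema = avro_schema(User, name="JavaUser")
```

The class remains the single source of truth; the Avro definition is derived output and never edited by hand.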
  • 18. Scalameta
    ● Parsing and analysis of Scala source code
    Source:
      val a = b() + 3
    After lex:
      ["val", " ", "a", " ", "=", " ", "b", "(", ")", " ", "+", " ", "3"]
    After parse:
      [val, "a", =, Call("b"), +, Int(3)]
    After semantic analysis:
      [val, Int(a), =, Call(com.scling.func.b), +, Int(3)]
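The lex and parse stages have a close analogue in Python's standard library, sketched here with `tokenize` and `ast` on the equivalent statement `a = b() + 3`. This is not Scalameta: Python's tokenizer drops whitespace tokens, and the standard library has no counterpart to the semantic analysis stage that resolves `b` to a fully qualified symbol.

```python
import ast
import io
import tokenize

source = "a = b() + 3"

# Lex: split the source text into tokens (names, operators, numbers).
KEPT = {tokenize.NAME, tokenize.OP, tokenize.NUMBER}
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type in KEPT
]

# Parse: build a syntax tree from the same source.
tree = ast.parse(source)
assign = tree.body[0]          # the assignment statement
expression = assign.value      # the binary operation b() + 3
```

Inspecting `ast.dump(tree)` shows the same kind of nested structure as the parse output on the slide.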
  • 19. Scalameta use cases
    ● Scalafmt
    ● Scalafix
      ○ Static analysis
      ○ Code transformation
    ● Online code generation - macros
    ● Offline code generation
    Example:
      // Example from scio 0.7 -> 0.8 upgrade rules
      import scalafix.v1._
      import scala.meta._

      final class FixTensorflow extends SemanticRule("FixTensorflow") {
        override def fix(implicit doc: SemanticDocument): Patch =
          doc.tree.collect {
            // Rewrite calls to the renamed method, keeping the receiver
            case t @ Term.Select(s, Term.Name("saveAsTfExampleFile")) =>
              Patch.replaceTree(t, q"$s.saveAsTfRecordFile".syntax)
          }.asPatch
      }
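The same kind of code transformation can be sketched in Python with `ast.NodeTransformer`. This is a syntactic analogue of the rename rule, not Scalafix: it matches on the attribute name only, whereas a semantic rule can also check which symbol the name resolves to. `RenameMethod` is a hypothetical class, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


class RenameMethod(ast.NodeTransformer):
    """Rewrite every attribute access named `old` to `new`."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def visit_Attribute(self, node):
        self.generic_visit(node)  # transform nested expressions first
        if node.attr == self.old:
            node.attr = self.new
        return node


source = "examples.saveAsTfExampleFile(path)"
tree = ast.parse(source)
tree = RenameMethod("saveAsTfExampleFile", "saveAsTfRecordFile").visit(tree)
rewritten = ast.unparse(tree)
```

Note that `ast.unparse` regenerates source from the tree, so comments and original formatting are lost; patch-based tools such as Scalafix avoid that by replacing only the matched span.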
  • 20. Schema & syntax tree
    Source:
      @PrivacyShielded
      case class SaleTransaction(
        @PersonalId customerClubId: Option[String],
        @PersonalData storeId: Option[String],
        item: Option[String],
        timestamp: String
      )
    Syntax tree:
      Defn.Class(
        List(Mod.Annot(Init(Type.Name("PrivacyShielded"), , List())), case),
        Type.Name("SaleTransaction"),
        List(),
        Ctor.Primary(
          List(),
          ,
          List(
            List(
              Term.Param(
                List(Mod.Annot(Init(Type.Name("PersonalId"), , List()))),
                Term.Name("customerClubId"),
                Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
                None
              ),
              Term.Param(
                List(Mod.Annot(Init(Type.Name("PersonalData"), , List()))),
                Term.Name("storeId"),
                Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
                None
              ),
              Term.Param(
                List(),
                Term.Name("item"),
                Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
                None
              ),
              Term.Param(List(), Term.Name("timestamp"), Some(Type.Name("String")), None)
            )
          )
        ),
        Template(List(), List(), Self(, None), List()))
  • 21. Quasiquotes
    Parsing a string:
      val stat: Stat = "val a = b() + 3".parse[Stat].get
    Equivalent quasiquote:
      val stat: Stat = q"val a = b() + 3"
  • 22. Quasiquotes in practice
      q"""
        object $converterName extends AvroConverter[${srcClass.clazz.name}] {
          import RecordFieldConverters._
          type S = $jClassName
          def schema: Schema = $javaClassTerm.getClassSchema()
          def tag: ClassTag[S] = implicitly[ClassTag[S]]
          def datumReader: SpecificDatumReader[S] =
            new SpecificDatumReader[$jClassName](classOf[$jClassName])
          def datumWriter: SpecificDatumWriter[S] =
            new SpecificDatumWriter[$jClassName](classOf[$jClassName])
          def fromSpecific(record: $jClassName): ${srcClass.clazz.name} =
            ${Term.Name(srcClass.clazz.name.value)}(..$fromInits)
          def toSpecific(record: ${srcClass.clazz.name}): $jClassName =
            new $jClassName(..$specificArgs)
        }
      """
  • 23. Test equality
    [Figure: case classes → test equality type classes, test record difference render type classes]
      trait REquality[T] {
        def equal(left: T, right: T): Boolean
      }

      object REquality {
        implicit val double: REquality[Double] = new REquality[Double] {
          def equal(left: Double, right: Double): Boolean = {
            // Use a combination of absolute and relative tolerance
            left === right +- 1e-5.max(left.abs * 1e-5).max(right.abs * 1e-5)
          }
        }

        /** binds the Magnolia macro to the `gen` method */
        implicit def gen[T]: REquality[T] = macro Magnolia.gen[T]
      }

      object Equalities {
        implicit val equalityUser: REquality[User] = REquality.gen[User]
      }
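A Python sketch of the same idea: equality with a combined absolute and relative tolerance for floating point values, derived structurally for record types. `r_equal` and `Reading` are hypothetical names. `math.isclose` with `rel_tol=1e-5` and `abs_tol=1e-5` accepts a difference of at most `max(1e-5 * max(|left|, |right|), 1e-5)`, which is the same bound as in the Scala code above. The derivation happens at runtime through `dataclasses.fields`, where Magnolia does it at compile time.

```python
import math
from dataclasses import dataclass, fields, is_dataclass


def r_equal(left, right):
    """Tolerant equality, derived field by field for dataclass instances."""
    if isinstance(left, float) and isinstance(right, float):
        return math.isclose(left, right, rel_tol=1e-5, abs_tol=1e-5)
    if is_dataclass(left) and type(left) is type(right):
        return all(
            r_equal(getattr(left, f.name), getattr(right, f.name))
            for f in fields(left)
        )
    if isinstance(left, (list, tuple)) and type(left) is type(right):
        return len(left) == len(right) and all(
            r_equal(a, b) for a, b in zip(left, right)
        )
    return left == right  # exact comparison for everything else


@dataclass
class Reading:
    sensor: str
    value: float
```

For example, `Reading("a", 1000.0)` and `Reading("a", 1000.005)` compare equal because the tolerance scales to about 0.01 at that magnitude, while a difference of 0.5 does not.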
  • 24. Python + RDBMS
    Case classes:
      case class User(
        age: Int,
        @AvroProp("sqlType", "varchar(1012)") phone: Option[String] = None)
    Avro definitions with Avro type annotations:
      {
        "type": "record",
        "name": "User",
        "fields": [
          { "name": "age", "type": "int" },
          { "name": "phone", "type": [ "null", "string" ], "sqlType": "varchar(1012)" }
        ]
      }
    Python, MySQL schemas:
      class UserEgressJob(CopyToTable):
          columns = [
              ("age", "int"),
              ("phone", "varchar(1012)"),
          ]
          ...
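A rough Python sketch of the last step, deriving a column list from an Avro definition that carries `sqlType` annotations. `sql_columns` and the fallback mapping `AVRO_TO_SQL` are assumptions made for this illustration; the talk does not show how its generator picks default SQL types.

```python
# Assumed fallback mapping, used only when a field has no "sqlType" annotation.
AVRO_TO_SQL = {"int": "int", "long": "bigint", "string": "text", "double": "double"}

USER_AVRO = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "age", "type": "int"},
        {"name": "phone", "type": ["null", "string"], "sqlType": "varchar(1012)"},
    ],
}


def sql_columns(avro_schema):
    """Return (column name, SQL type) pairs for an Avro record definition."""
    columns = []
    for field in avro_schema["fields"]:
        avro_type = field["type"]
        if isinstance(avro_type, list):
            # Nullable union such as ["null", "string"]: use the non-null branch.
            avro_type = [t for t in avro_type if t != "null"][0]
        sql_type = field.get("sqlType") or AVRO_TO_SQL[avro_type]
        columns.append((field["name"], sql_type))
    return columns


columns = sql_columns(USER_AVRO)
```

The resulting list has the shape expected by a `columns` attribute like the one in the egress job, so the job definition could be generated rather than typed by hand.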
  • 25. Logical types
    [Figure: case classes → logical types]
      case t"""Instant""" =>
        JObject(List(JField("type", JString("long")),
                     JField("logicalType", JString("timestamp-micros"))))
      case t"""LocalDate""" =>
        JObject(List(JField("type", JString("int")),
                     JField("logicalType", JString("date"))))
      case t"""YearMonth""" =>
        JObject(List(JField("type", JString("int"))))
      case t"""JObject""" =>
        JString("string")
    ● Avro logical types
      ○ E.g. date → int, timestamp → long
      ○ Default is timestamp-millis
        ■ Great for year > 294441 (!)
    ● Custom logical types
      ○ Time
      ○ Collections
      ○ Physical
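A small Python sketch of what the two standard logical types encode. Per the Avro specification, `date` is an int counting days since 1970-01-01 and `timestamp-micros` is a long counting microseconds since the Unix epoch in UTC. The helper names and the `LOGICAL_TYPES` table are hypothetical; the conversion assumes timezone-aware datetimes.

```python
from datetime import date, datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Python type -> Avro definition fragment, mirroring the match cases above.
LOGICAL_TYPES = {
    datetime: {"type": "long", "logicalType": "timestamp-micros"},
    date: {"type": "int", "logicalType": "date"},
}


def to_timestamp_micros(ts: datetime) -> int:
    """Microseconds since the Unix epoch; `ts` must be timezone-aware."""
    delta = ts - EPOCH
    # Integer arithmetic avoids the rounding of float-based total_seconds().
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def to_date_days(d: date) -> int:
    """Days since 1970-01-01."""
    return (d - date(1970, 1, 1)).days


talk_day = to_date_days(date(2024, 4, 24))
talk_micros = to_timestamp_micros(datetime(2024, 4, 24, tzinfo=timezone.utc))
```

The date of the talk encodes as 19837 days, and the timestamp at midnight UTC is that number multiplied by 86 400 000 000.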
  • 26. Schema on read or write?
    [Figure: DB, DB, DB, Service, Service, Export, Business intelligence; annotations: "Change agility important here", "Production stability important here"]
  • 28. Chimney - case class transformer
    ● Commonality in schema classes
      ○ Copy + a few more fields
      ○ Drop fields
    ● Statically typed
      ○ Forgot a field - error
      ○ Wrong type - error
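A Python sketch of the record-to-record transformation idea: copy fields that match by name, supply the extra ones explicitly, and fail if anything is missing. `transform_into` and the example classes are hypothetical. Unlike Chimney, which reports a forgotten field at compile time, this sketch can only raise at runtime, and it does not check field types.

```python
from dataclasses import MISSING, dataclass, fields


def transform_into(source, target_cls, **overrides):
    """Build target_cls from matching fields of `source` plus explicit overrides."""
    values = {}
    for f in fields(target_cls):
        if f.name in overrides:
            values[f.name] = overrides[f.name]
        elif hasattr(source, f.name):
            values[f.name] = getattr(source, f.name)
        elif f.default is MISSING and f.default_factory is MISSING:
            raise TypeError(
                f"no value for field {f.name!r} of {target_cls.__name__}"
            )
    return target_cls(**values)


@dataclass
class Order:
    order_id: str
    amount: int
    internal_note: str


@dataclass
class OrderWithCountry:  # copy + one more field, drops internal_note
    order_id: str
    amount: int
    country: str


order = Order("o1", 120, "call customer")
enriched = transform_into(order, OrderWithCountry, country="SE")

try:
    transform_into(order, OrderWithCountry)  # forgot the country field
    forgot_field_is_error = False
except TypeError:
    forgot_field_is_error = True
```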
  • 30. Stretching the type system
    ● Fail: mixup kW and kWh
    ● Could be a compile-time error. Should be.
    ● Physical dimension libraries
      ○ Boost.Units - C++
      ○ Coulomb - Scala
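To make the kW versus kWh mixup concrete, here is a deliberately small Python sketch in which quantities carry their unit and mismatched additions fail. `Quantity` and `energy` are hypothetical and do not reproduce the API of Boost.Units or Coulomb. The important difference is that this check fires at runtime, while the libraries on the slide reject the same mistake at compile time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A value tagged with its unit, e.g. "kW" (power) or "kWh" (energy)."""
    value: float
    unit: str

    def __add__(self, other):
        if not isinstance(other, Quantity) or other.unit != self.unit:
            raise TypeError(f"cannot add {self.unit} and {getattr(other, 'unit', other)!r}")
        return Quantity(self.value + other.value, self.unit)


def energy(power: Quantity, hours: float) -> Quantity:
    """Power in kW sustained for a number of hours gives energy in kWh."""
    if power.unit != "kW":
        raise TypeError(f"expected kW, got {power.unit}")
    return Quantity(power.value * hours, "kWh")


total_power = Quantity(2.0, "kW") + Quantity(3.0, "kW")
used = energy(Quantity(2.0, "kW"), 3.0)

try:
    Quantity(2.0, "kW") + Quantity(3.0, "kWh")
    mixup_detected = False
except TypeError:
    mixup_detected = True
```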
  • 31. Ingest prepared for deletion
    [Figure: data lake, private pond, cold store, landing pond; annotations: "Mutation", "Append + delete", "Immutable, limited retention"]
  • 32. Lost key pattern
    ● PII fields encrypted
    ● Per-user decryption key table
    ● Clear single user key => oblivion
    - Extra join + decrypt
    - Decryption (user) id needed
    + Multi-field oblivion
    + Single dataset leak → no PII leak
    + Handles transformed PII fields
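The pattern can be sketched in a few lines of Python. Everything here is hypothetical: the field names, the `expose` function, and in particular the cipher, which is a toy XOR keystream built from SHA-256 so that the example needs no third-party library. It is not secure and stands in for a real authenticated cipher.

```python
import hashlib
import secrets


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key + counter. NOT secure, illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key: bytes, plaintext: str) -> bytes:
    data = plaintext.encode("utf-8")
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def toy_decrypt(key: bytes, ciphertext: bytes) -> str:
    stream = _keystream(key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream)).decode("utf-8")


# Per-user decryption key table.
keys = {"user-1": secrets.token_bytes(32), "user-2": secrets.token_bytes(32)}

# Dataset with the PII field encrypted; the user id is needed for the join.
sales = [
    {"user_id": "user-1", "club_id_enc": toy_encrypt(keys["user-1"], "club-123"), "item": "milk"},
    {"user_id": "user-2", "club_id_enc": toy_encrypt(keys["user-2"], "club-456"), "item": "tea"},
]


def expose(rows, key_table):
    """Extra join + decrypt; rows whose key is gone keep only non-PII fields."""
    result = []
    for row in rows:
        key = key_table.get(row["user_id"])
        club_id = toy_decrypt(key, row["club_id_enc"]) if key else None
        result.append({"club_id": club_id, "item": row["item"]})
    return result


before = expose(sales, keys)
del keys["user-1"]  # deletion request: clear a single user key
after = expose(sales, keys)
```

After the key is cleared, every dataset that holds the encrypted field for that user becomes unreadable at once, without rewriting any of them.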
  • 33. Shieldformation
      @PrivacyShielded
      case class Sale(
        @PersonalId customerClubId: Option[String],
        @PersonalData storeId: Option[String],
        item: Option[String],
        timestamp: String
      )

      case class SaleShielded(
        shieldId: Option[String],
        customerClubIdEncrypted: Option[String],
        storeIdEncrypted: Option[String],
        item: Option[String],
        timestamp: String
      )

      case class SaleAnonymous(
        item: Option[String],
        timestamp: String
      )

      object SaleAnonymize extends SparkJob { ... }
    ShieldForm
      object SaleExpose extends SparkJob { ... }
      object SaleShield extends SparkJob { ... }

      case class Shield(
        shieldId: String,
        personId: Option[String],
        keyStr: Option[String],
        encounterDate: String
      )
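How the shielded and anonymous schemas follow from the annotations can be sketched in Python with dataclass field metadata standing in for the Scala annotations. The metadata keys and the two helper functions are hypothetical; the naming rules (a leading `shieldId`, an `Encrypted` suffix, dropping personal fields) are read off the case classes on the slide, not taken from the actual generator.

```python
from dataclasses import dataclass, field, fields
from typing import Optional

# Stand-ins for the @PersonalId and @PersonalData annotations.
PERSONAL_ID = {"privacy": "personal_id"}
PERSONAL_DATA = {"privacy": "personal_data"}


@dataclass
class Sale:
    customerClubId: Optional[str] = field(metadata=PERSONAL_ID)
    storeId: Optional[str] = field(metadata=PERSONAL_DATA)
    item: Optional[str]
    timestamp: str


def shielded_fields(cls):
    """Field names of the shielded variant: key reference + encrypted PII."""
    names = ["shieldId"]
    for f in fields(cls):
        names.append(f.name + "Encrypted" if "privacy" in f.metadata else f.name)
    return names


def anonymous_fields(cls):
    """Field names of the anonymous variant: everything without a privacy tag."""
    return [f.name for f in fields(cls) if "privacy" not in f.metadata]


sale_shielded = shielded_fields(Sale)
sale_anonymous = anonymous_fields(Sale)
```

A generator built this way would emit the derived classes and the jobs that convert between them from the one annotated definition.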
  • 34. Shieldformation & lost key
    [Figure: datasets and jobs: Sale, SaleShield, Sale Shielded, Shield, Deletion requests, Customer History, SaleExpose, SaleAnonymize, Sale Anonymous, Sale Stats; annotations: "Exposed egress", "Limited retention"]
  • 36. Data factory track record
    Industry          | Time to first flow | Staff size | 1st flow effort, weeks | 1st flow cost (w * 50K ?) | Time to innovation | Flows 1y after first
    Media             | 1+ years           | 10-30      | 1500?                  | 100M (0.5-1B)             | 1+ year            | ?
    Finance           | 2 years            | 10-50      | 2000?                  | 100M?                     | Years              | 10?
    Media             | 3 weeks            | 4.5 - 8    | 15                     | 750K                      | 3 months           | 30
    Retail            | 7 weeks            | 1-3        | 7                      | 500K *                    | 6 months           | 70
    Telecom           | 12 weeks           | 2-5        | 30                     | 1500K                     | 6 months           | 50
    Consumer products | 20+ weeks          | 1.5        | 30+                    | 1200+K                    | 6+ months          | 20
    Construction      | 8 weeks            | 0.5        | 4                      | 150K *                    | 7 months           | 10
    Manufacturing     | 8 weeks            | 0.5        | 4                      | 200K *                    | 6 months           | ?