How fast can you modify your data collection to include a new field, make all the necessary changes in data processing and storage, and then use that field in analytics or product features? For many companies, the answer is a few quarters, whereas others do it in a day. This data agility latency has a direct impact on a company's ability to innovate with data. Schema-on-read has been a key strategy for lowering that latency: as the community has shifted towards storing data outside relational databases, we no longer need to make a series of schema changes through the whole data chain, coordinated between teams to minimise operational risk. Schema-on-read comes at a cost, however. Errors that we used to catch during testing or in early test deployments can now sneak into production undetected and surface later, as product errors or hard-to-debug data quality problems, than they would with schema-on-write solutions.
In this presentation, we will show how we have rejected the tradeoff between slow schema change rate and quality to achieve the best of both worlds. By using metaprogramming and versioned pipelines that are tested end-to-end, we can achieve fast schema changes with schema-on-write and the protection of static typing. We will describe the tools in our toolbox - Scalameta, Chimney, Bazel, and custom tools. We will also show how we leverage them to take static typing one step further and differentiate between domain types that share representation, e.g. EmailAddress vs ValidatedEmailAddress or kW vs kWh, while maintaining harmony with data technology ecosystems.
1. Schema on read is obsolete. Welcome metaprogramming.
Data Innovation Summit, 2024-04-24
Lars Albertsson, Scling
www.scling.com
2. IT craft to factory
[Diagram labels, craft side: Security, Waterfall, Application delivery, Traditional operations, Traditional QA, Infrastructure]
[Diagram labels, factory side: DevSecOps, Agile, Containers, DevOps, CI/CD, Infrastructure as code]
4. Craft vs industry
Craft:
● Each step steered by human
○ Or primitive automation
● Improving artifacts
● Craft is primary competence
● Components made for humans
○ Look nice, "easy to use"
○ More popular
Industry:
● Autonomous processes
● Improving process that creates artifacts
● Multitude of competences
● Some components unusable by humans
○ Hard, greasy
○ Made for integration
○ Less popular
5. Data engineering in the future
[Diagram labels: DW; Enterprise big data failures; "Modern data stack" - traditional workflows, new technology; 4GL / UML phase of data engineering; Data engineering education; "data factory engineering"; ~10 year capability gap]
6. Efficiency gap, data cost & value
● Data processing produces datasets
○ Each dataset has business value
● Proxy value/cost metric: datasets / day
○ S-M traditional: < 10
○ Bank, telecom, media: 100-1000
2014: 6500 datasets / day
2016: 20000 datasets / day
2018: 100000+ datasets / day, 25% of staff use BigQuery
2021: 500B events collected / day
2016: 1 600 000 000 datasets / day
[Chart: value vs effort. Labels: "Financial, reporting"; "Insights, data-fed features"; "Disruptive value of data, machine learning"]
7. Data-factory-as-a-service
[Diagram label: Data lake]
● Data factory
○ Collected, raw data → processed, valuable data
● Data pipelines customised for client
○ Analytics (BI, reports, A/B testing)
○ Data-fed features (autocomplete, search)
○ Learning systems (recommendations, fraud)
● Compete with data leaders:
○ Quick idea-to-production
○ Operational efficiency
9. What is a schema?
● Lowest common denominator = name, type, required
○ Types: string, long, double, binary, array, map, union, record
● Schema specification may support additional constraints, e.g. integer range, other collections
Id  Name    Age  Phone
1   "Anna"  34   null
2   "Bob"   42   "08-123456"
Fields: name, type, required?
In RDBMS, relations are explicit
In lake/stream datasets, relations are implicit
11. Schema on write
Strict checking philosophy
● Schema defined by writer
● Destination (table / dataset / stream topic) has defined schema
○ Technical definition with metadata (e.g. RDBMS, Kafka + registry)
○ By convention
● Writes not in compliance are not accepted
○ Technically aborted (e.g. RDBMS)
○ In violation of intent (e.g. HDFS datasets)
● Can be technically enforced by producer driver
○ Through ORM / code generation
○ Schema registry lookup
12. Schema on read
Loose checking philosophy
● Anything (technically) accepted when writing
● Schema defined by reader, at consumption
○ Reader may impose requirements on type & value
● In dynamic languages, fields propagate implicitly
○ E-shopping example:
i. Join order + customer.
ii. Add device_type to order schema
iii. device_type becomes available in downstream datasets
● Violations of constraints are detected at read
○ Perhaps long after production?
○ By team not owning producer?
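
The e-shopping example can be sketched in Scala with untyped records. This is an illustration rather than code from the talk: the `Record` alias, the `joinOrderCustomer` function and all field values are assumptions made up for the sketch. It shows how a new field flows through a join without any code change, and how a constraint violation only surfaces when a reader finally looks at the field.

```scala
// Untyped record: the schema is whatever keys happen to be present.
type Record = Map[String, Any]

// Hypothetical join step: merges the two records, knows nothing about their fields.
def joinOrderCustomer(order: Record, customer: Record): Record = customer ++ order

val customer: Record = Map("customer_id" -> "c1", "name" -> "Anna")
val orderV1: Record  = Map("order_id" -> "o1", "customer_id" -> "c1")
// The producer adds device_type; the join code is untouched.
val orderV2: Record  = orderV1 + ("device_type" -> "mobile")

val joinedV1 = joinOrderCustomer(orderV1, customer)
val joinedV2 = joinOrderCustomer(orderV2, customer)

// A downstream reader imposes its own requirement, and fails only at read time.
def deviceType(r: Record): Either[String, String] = r.get("device_type") match {
  case Some(s: String) => Right(s)
  case Some(other)     => Left(s"device_type has unexpected type: $other")
  case None            => Left("device_type missing")
}
```

Here `deviceType(joinedV2)` succeeds while `deviceType(joinedV1)` reports a missing field, and nothing in the writer or the join could have caught that earlier.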
13. Dynamic vs static typing
Schema on write: static typing, strict
Schema on read: dynamic typing, loose
Mixed combinations: possible
Typed field access, Java:
user.setName("Alice");
user2.getName();
Typed field access, Scala:
val user = User(name = "Alice", ...)
user2.name
Field access by name, Java:
user.set("name", "Alice");
user2.get("name");
Dynamic field access, Python:
user.name = "Alice"
user2.name
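
The difference between the two access styles can be sketched in a single language. The following is a minimal illustration, not code from the talk; the `User` fields and the misspelled key are assumptions chosen for the example.

```scala
// Statically typed record: the schema is part of the type.
final case class User(name: String, age: Int)

val typedUser = User(name = "Alice", age = 32)
val typedName: String = typedUser.name   // field existence and type checked by the compiler
// typedUser.nmae                        // would not compile: no such field

// Record accessed by field name: the schema is only implied by the keys present.
val looseUser: Map[String, Any] = Map("name" -> "Alice", "age" -> 32)
val looseName: Option[Any] = looseUser.get("name")
val misspelled: Option[Any] = looseUser.get("nmae") // compiles, and is silently empty
```

The misspelled field is a compile error in the first style and a silent `None` in the second, which is the kind of error that can reach production unnoticed.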
14. Schema on read or write?
[Diagram: three DBs, two services, export, business intelligence; annotated "Change agility important here" and "Production stability important here"]
15. Schema definition choice
● Expressive
● Custom types
● IDE support
● Avro for data lake storage
● RDBMS: Table metadata
● Avro: JSON/DSL definition
○ Definition is bundled with avro data files
● Parquet
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
case class User(id: String, name: String, age: Int,
                phone: Option[String] = None)
val users = Seq(User("1", "Alice", 32),
                User("2", "Bob", 43, Some("08-123456")))
16. Schema offspring
[Diagram: artifacts derived from case classes]
● Test record difference render type classes
● Test equality type classes
● Avro definitions
● Java Avro codec classes
● Java <-> Scala converters
● Avro type annotations
● MySQL schemas
● CSV codecs
● Privacy by design machinery
● Python
● Logical types
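
Deriving artifacts from case classes starts with reading the case class definition as data. The sketch below uses Scalameta to parse a case class and list its fields as (name, type, required). It is an assumption-laden illustration, not the tooling from the talk: the `SchemaFields` object, the `Field` class, the rule that `Option[...]` means "not required", and the dependency version in the directive are all made up for this example, and it assumes a Scala 2.13 build with Scalameta on the classpath.

```scala
//> using scala 2.13
//> using dep org.scalameta::scalameta:4.9.3
import scala.meta._

object SchemaFields {
  // Lowest-common-denominator schema entry: name, type, required.
  final case class Field(name: String, tpe: String, required: Boolean)

  // Parses Scala source for one case class and collects its constructor parameters.
  def fields(source: String): List[Field] = {
    val tree = source.parse[Stat].get
    tree.collect { case p: Term.Param =>
      val tpe = p.decltpe.map(_.syntax).getOrElse("?")
      // Assumption for this sketch: Option[...] marks an optional field.
      Field(p.name.value, tpe, required = !tpe.startsWith("Option["))
    }
  }
}
```

A generator for Avro definitions, converters or MySQL schemas could then be written as a function over such a field list.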
17. Avro codecs
Derived from case classes: Avro definitions, Java Avro codec classes, Java <-> Scala converters
case class User(age: Int,
                phone: Option[String] = None)
{
  "type": "record",
  "name": "JavaUser",
  "fields": [
    { "name": "age", "type": "int" },
    { "name": "phone", "type": [ "null", "string" ] }
  ]
}
public class JavaUser implements SpecificRecord {
  public Integer getAge() { ... }
  public String getPhone() { ... }
}
object UserConverter extends AvroConverter[User] {
  def fromSpecific(u: JavaUser): User
  def toSpecific(u: User): JavaUser
}
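
To make the converter's job concrete, here is a hand-written sketch of what such a converter might contain. It is not the generated code from the talk: `JavaUser` below is a simplified stand-in for the Avro-generated class, and this `AvroConverter` trait takes two type parameters for self-containment, unlike the single-parameter one on the slide.

```scala
// Simplified stand-in for the Avro-generated Java class (hypothetical):
// an absent optional field is represented by null, as in Java Avro classes.
final class JavaUser(age: Integer, phone: String) {
  def getAge: Integer = age
  def getPhone: String = phone
}

final case class User(age: Int, phone: Option[String] = None)

// Hypothetical converter interface, modelled on the slide's AvroConverter.
trait AvroConverter[S, J] {
  def fromSpecific(j: J): S
  def toSpecific(s: S): J
}

object UserConverter extends AvroConverter[User, JavaUser] {
  def fromSpecific(u: JavaUser): User =
    User(age = u.getAge.intValue, phone = Option(u.getPhone)) // null becomes None

  def toSpecific(u: User): JavaUser =
    new JavaUser(Int.box(u.age), u.phone.orNull)              // None becomes null
}
```

Writing this by hand for every schema class is mechanical and error-prone, which is the motivation for generating it from the case class.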
25. Logical types
● Avro logical types
○ E.g. date → int, timestamp → long
○ Default is timestamp-millis
■ Great for year > 294441 (!)
● Custom logical types
○ Time
○ Collections
○ Physical
// Excerpt: Scalameta type patterns mapped to Avro JSON types (json4s AST)
case t"""Instant""" =>
  JObject(List(JField("type", JString("long")),
               JField("logicalType", JString("timestamp-micros"))))
case t"""LocalDate""" =>
  JObject(List(JField("type", JString("int")),
               JField("logicalType", JString("date"))))
case t"""YearMonth""" => JObject(List(JField("type", JString("int"))))
case t"""JObject""" => JString("string")
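
The `date` and `timestamp-micros` logical types are plain integers with an agreed interpretation. A minimal sketch of the value conversions, using only `java.time`; the function names are made up for this example and are not from the talk or from the Avro library.

```scala
import java.time.{Instant, LocalDate}
import java.time.temporal.ChronoUnit

// timestamp-micros: microseconds since 1970-01-01T00:00:00Z, stored as a long.
// Sub-microsecond precision is truncated.
def toTimestampMicros(i: Instant): Long =
  ChronoUnit.MICROS.between(Instant.EPOCH, i)

def fromTimestampMicros(micros: Long): Instant =
  Instant.EPOCH.plus(micros, ChronoUnit.MICROS)

// date: days since 1970-01-01, stored as an int.
def toAvroDate(d: LocalDate): Int = Math.toIntExact(d.toEpochDay)

def fromAvroDate(days: Int): LocalDate = LocalDate.ofEpochDay(days.toLong)
```

With the logical type declared in the schema, readers in other languages can interpret the same integers as timestamps and dates.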
28. Chimney - case class transformer
● Commonality in schema classes
○ Copy + a few more fields
○ Drop fields
● Statically typed
○ Forgot a field - error
○ Wrong type - error
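
A short sketch of the two cases listed above, using Chimney's transformer DSL. The order classes and values are assumptions made up for the example, as is the dependency version in the directive; the code assumes Chimney is available on the classpath.

```scala
//> using dep io.scalaland::chimney:1.0.0
import io.scalaland.chimney.dsl._

final case class Order(id: String, customerId: String, amount: Double)
final case class OrderWithDevice(id: String, customerId: String, amount: Double,
                                 deviceType: String)
final case class OrderSlim(id: String, amount: Double)

val order = Order("o1", "c1", 99.0)

// Copy + one more field: matching fields are copied, the new one is supplied explicitly.
val withDevice: OrderWithDevice =
  order.into[OrderWithDevice].withFieldConst(_.deviceType, "mobile").transform

// Drop fields: the target keeps only the fields it declares.
val slim: OrderSlim = order.transformInto[OrderSlim]

// order.transformInto[OrderWithDevice] // does not compile: no source for deviceType
```

The commented-out line is the "forgot a field" case: the omission is reported by the compiler instead of at run time.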
30. Stretching the type system
● Fail: mixing up kW and kWh
● Could be a compile-time error. Should be.
● Physical dimension libraries
○ Boost.Units - C++
○ Coulomb - Scala
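
The idea can be sketched without a dimension library by giving each domain concept its own type, even when the underlying representation is the same. This is a hand-rolled illustration under that assumption, not Coulomb and not the tooling from the talk; all class and method names are made up for the example.

```scala
// Power, time and energy share the representation (Double) but not the type.
final case class Hours(value: Double)
final case class Kilowatts(value: Double) {
  def *(h: Hours): KilowattHours = KilowattHours(value * h.value)
}
final case class KilowattHours(value: Double) {
  def +(other: KilowattHours): KilowattHours = KilowattHours(value + other.value)
}

val power  = Kilowatts(2.0)
val energy = power * Hours(3.0)
// energy + power   // does not compile: Kilowatts is not KilowattHours

// Same representation (String), different guarantees.
final case class EmailAddress(value: String)
final class ValidatedEmailAddress private (val value: String)
object ValidatedEmailAddress {
  // Deliberately naive check, for illustration only.
  def validate(e: EmailAddress): Option[ValidatedEmailAddress] =
    if (e.value.contains("@")) Some(new ValidatedEmailAddress(e.value)) else None
}
```

A function that requires a `ValidatedEmailAddress` can then only be called with a value that has passed `validate`, and a kW value can no longer be added to a kWh total by accident.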
32. Lost key pattern
● PII fields encrypted
● Per-user decryption key table
● Clear single user key => oblivion
- Extra join + decrypt
- Decryption (user) id needed
+ Multi-field oblivion
+ Single dataset leak → no PII leak
+ Handles transformed PII fields
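
The pattern can be sketched with the JDK's built-in AES-GCM support. This is a minimal in-memory illustration, not the implementation from the talk: the `LostKeySketch` object, the `Encrypted` class, the in-memory key table and the 128-bit key size are all assumptions for the example, and a real deployment would keep the key table in a separate, access-controlled store.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.SecureRandom
import javax.crypto.{Cipher, KeyGenerator, SecretKey}
import javax.crypto.spec.GCMParameterSpec
import scala.collection.mutable

final case class Encrypted(iv: Array[Byte], ciphertext: Array[Byte])

object LostKeySketch {
  private val random = new SecureRandom()
  // Per-user decryption key table (in-memory stand-in for a real key store).
  private val keys = mutable.Map.empty[String, SecretKey]

  private def keyFor(userId: String): SecretKey =
    keys.getOrElseUpdate(userId, {
      val gen = KeyGenerator.getInstance("AES")
      gen.init(128)
      gen.generateKey()
    })

  // PII fields are stored encrypted with the owning user's key.
  def encrypt(userId: String, pii: String): Encrypted = {
    val iv = new Array[Byte](12)
    random.nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, keyFor(userId), new GCMParameterSpec(128, iv))
    Encrypted(iv, cipher.doFinal(pii.getBytes(UTF_8)))
  }

  // The extra join + decrypt: look up the user's key, then decrypt if it still exists.
  def decrypt(userId: String, e: Encrypted): Option[String] =
    keys.get(userId).map { key =>
      val cipher = Cipher.getInstance("AES/GCM/NoPadding")
      cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, e.iv))
      new String(cipher.doFinal(e.ciphertext), UTF_8)
    }

  // Clearing a single user's key makes every field encrypted with it unreadable.
  def forget(userId: String): Unit = { keys.remove(userId); () }
}
```

After `forget`, the encrypted values can stay in every dataset they were copied to; without the key they no longer reveal the personal data.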
33. Shieldformation
@PrivacyShielded
case class Sale(
  @PersonalId customerClubId: Option[String],
  @PersonalData storeId: Option[String],
  item: Option[String],
  timestamp: String
)
ShieldForm
case class SaleShielded(
  shieldId: Option[String],
  customerClubIdEncrypted: Option[String],
  storeIdEncrypted: Option[String],
  item: Option[String],
  timestamp: String
)
case class SaleAnonymous(
  item: Option[String],
  timestamp: String
)
case class Shield(
  shieldId: String,
  personId: Option[String],
  keyStr: Option[String],
  encounterDate: String
)
object SaleShield extends SparkJob {
  ...
}
object SaleExpose extends SparkJob {
  ...
}
object SaleAnonymize extends SparkJob {
  ...
}
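
For readers who want to compile the annotated class, the annotations can be declared as plain marker annotations. The sketch below is a guess at the shape of the idea, not the ShieldForm implementation: the annotation definitions and the hand-written `anonymize` function are assumptions, shown only to make explicit what a per-record anonymisation step has to do.

```scala
import scala.annotation.StaticAnnotation

// Hypothetical marker annotations, mirroring the names on the slide.
final class PrivacyShielded extends StaticAnnotation
final class PersonalId extends StaticAnnotation
final class PersonalData extends StaticAnnotation

@PrivacyShielded
final case class Sale(
  @PersonalId customerClubId: Option[String],
  @PersonalData storeId: Option[String],
  item: Option[String],
  timestamp: String
)

final case class SaleAnonymous(item: Option[String], timestamp: String)

// Hand-written equivalent of a per-record anonymisation step:
// every annotated field is dropped, the rest are copied.
def anonymize(sale: Sale): SaleAnonymous =
  SaleAnonymous(item = sale.item, timestamp = sale.timestamp)
```

Because the annotations carry the privacy classification, a generator can derive classes such as `SaleAnonymous` and the matching jobs instead of relying on each author to remember which fields are personal.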
34. Shieldformation & lost key
[Diagram. Datasets: Sale, Sale Shielded, Shield, Deletion requests, Customer History, Sale Anonymous, Sale Stats. Jobs: SaleShield, SaleExpose, SaleAnonymize. Annotations: "Exposed egress", "Limited retention".]