Schema Registry - Set Your Data Free

1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Schema Registry
Sriharsha Chintalapani, Hortonworks
Satish Duggana, Hortonworks
Dataworks summit, 2017, San Jose

Agenda
 Introduction
 Concepts
 Architecture
 Integration
 Security
 Roadmap

What is Schema Registry? What Value Does it Provide?
 What is Schema Registry?
• A shared repository of schemas that allows applications to flexibly interact with each other
 What Value does Schema Registry Provide?
– Data Governance
• Provide reusable schema
• Define relationship between schemas
• Enable generic format conversion, and generic routing
– Operational Efficiency
• To avoid attaching schema to every piece of data
• Producers and consumers can evolve at different rates
 Example Use
– Register Schemas for Kafka Topics to be used by consumers of Kafka Topic (e.g: Nifi, StreamLine)

Schema Registry Concepts
• Schema Group
A logical grouping/container for
similar type of schemas or
based any criteria that the
customer has from managing
the schemas
• Schema Metadata
Metadata associated with a
named schema.
• Schema Version
The actual versioned schema
associated a schema meta
definition
Schema Metadata 1
Schema Name
Schema Type
Description
Compatibility Policy
Serializers
Deserializers
Schema Group
Group Name
SchemaVersion 3
SchemaVersion 2
Schema Version 1
version
text
Fingerprint

Schema Registry
Schema Registry Component Architecture
SR Web Server
Schema Registry
Web App
REST APISchema Registry Client
Java Client
Integrations
Nifi Processors Kafka Ser/Des StreamLine
Schema
Storage
Pluggable Storage
Serializer/Deserializer
Jar Storage
MySQL In-Memory Local File
System
HDFSPostgres

Sender/Receiver flow
Local
schema/serdes
cache
Serializer
Sender
Schema Registry
Client
Message Store
Local
schema/serdes
cache
Deserializer
Schema Registry
Client
version
payload
version
payload
Schema Storage SerDes Storage
Receiver
SchemaRegistrySchemaRegistry SchemaRegistry

Writer/Reader schemas
 Writer schema
– Senders/Producers use this schema while sending the payloads according to the given schema viz
writer’s schema
 Reader/Projection schema
– Receivers uses this schema to project the received payload written with a writer schema.
Sender Receiver
Writer
Schema
Writer
Schema
Projection
Schema

Schema evolution
Producer
v2
Consumer
v2
Producer
v1
Producer
v4
Consumer
v5
Producer
v1
Consumer
v7

Schema Compatibility Policies
 What is a Compatibility Policy?
– Defines the rules of how the schemas can evolve
– Subsequent version updates has to honor the schema’s original compatibility.
 Policies Supported
– Backward
– Forward
– Both
– None

Backward compatibility
 New version of a schema would be compatible with earlier version of that schema.
 Data written from earlier version of the schema, can be read with a new version of the
schema.
V1
{
"type": "record",
"name": "book",
"namespace": "registry.example",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "color",
"type": "string",
"default": "blue"
}
]
}
V2
{
"type": "record",
"name": "book",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "color",
"type": "string",
"default": "blue"
},
{
"name": "pages",
"type": "int",
"default": -1
}
]
}

Forward compatibility
 Existing schema is compatible with future versions of the schema.
 That means the data written from new version of the schema can still be read with old
version of the schema.
V1
{
"type": "record",
"name": "book",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "color",
"type": "string",
"default": "blue"
}
]
}
V2
{
"type": "record",
"name": "book",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "color",
"type": "string",
"default": "blue"
},
{
"name": "pages",
"type": "int"
}
]
}

Both/Full compatibility
 New version of the schema provides both backward and forward compatibilities.
V1
{
"type": "record",
"name": "book",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "color",
"type": "string",
"default": "blue"
}
]
}
V2
{
"type": "record",
"name": "book",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "color",
"type": "string",
"default": "blue"
},
{
"name": "pages",
"type": "int",
"default": -1
},
{
"name": "title",
"type" : "string",
"default": ""
}
]
}

Schema composition
 Schemas can be shared and reused
 Inbuilt support in default serializer/deserializer to build effective schemas
{
"name": "account",
"namespace": "com.hortonworks.example.types",
"includeSchemas": [
{
"name": "utils”
}
],
"type": "record",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "id",
"type": "com.hortonworks.datatypes.uuid"
}
]
}
{
"name": "uuid",
"type": "record",
"namespace": "com.hortonworks.datatypes",
"doc": "A Universally Unique Identifier, in canonical form in
lowercase. This is generated from java.util.UUID Example:
de305d54-75b4-431b-adb2-eb6b9e546014",
"fields": [
{
"name": "value",
"type": "string",
"default": ""
}
]
}

Serializers/Deserializers
 Snapshot based serializer/deserializer
– Seriliazes the complete payload
– Deserializes the payload to respective type
 Pull based serializer/deserializer
– Serialize whatever elements are required and ignore other elements
– Pull out whatever elements that are required to build the desired object
 Push based deserializer
– Gives callback to receive parsing events for respective fields in schema

Schema registry client
 REST based client
 Caching
– Metadata
– Schema versions
– Ser/des libs and class loaders
 URL selectors
– Round robin
– Failover

HA
 Storage provider
– Depends on transactional support of
underlying SQL stores
– Spinup required schema registry
instances
 Supports HA at SchemaRegistry
– Using ZK/Curator
– Automatic failover of master
– Master gets all writes
– Slaves receives only reads
SchemaRegistry
storage
SchemaRegistrySchemaRegistry
SchemaRegistry
SchemaRegistry
SchemaRegistry
storage

Integration of Schema Registry
 Kafka
– Using producer/consumer API for serializer/deserializer
 Nifi Processors for Schema Registry
– Fetch Schema
– Serialize/Deserialize with Schema
 StreamLine processors for Schema Registry
– Lookup Schema of a Kafka Topic

Kafka integration
Local
schema/serdes
cache
KafkaAvro
Serializer
Sender
Schema Registry
Client
Local
schema/serdes
cache
KafkaAvro
Deserializer
Schema Registry
Client
version
payload
version
payload
Receiver
SchemaRegistrySchemaRegistry SchemaRegistry
Kafka

Kafka Avro ser/des protocol
 Multiple ser/des can be registered with respective protocol versions
 Default ser/des send protocol/schema versions as part of the binary payload of kafka
messages
– This can be enhanced once there is headers/metadata support for kafka messages
– Custom ser/des can be registered for schemas.
Default ser/des message format
<protocol-id><identification-info-for-schema-version><message-payload>

Nifi integration
 Nifi Controller Service
 Nifi processors
– Transforms
• Avro – CSV
• Avro – Json
• Json – CSV
– Extracting Avro fields

Security
 Kerberos support
 Enabled with SPNEGO based filter
 Verified integration with
– Streaming Analytics Manager
– Storm
– NiFi

Schema Registry UI

Roadmap
 Collaboration
– Notifications
– Schema life cycle management
– Audit log
– Improved UI
 Rich data types
 Operations
– Cross-cluster mirroring
 Security
– SSL and OAuth 2.0
– Schema and sub schema level
authorization
– Ranger support
 Integration
– Multi lang client
– Pluggable listeners
– Converters

Try it out!
 Docs
– http://registry-project.readthedocs.io/en/latest/index.html
 Repo
– https://github.com/hortonworks/registry
 Google groups
– https://groups.google.com/forum/#!forum/registry
 Open sourced under Apache License
 Apache incubation soon
 Contributions are welcome

Q&A
https://github.com/hortonworks/registry

Schema Registry - Set Your Data Free

More Related Content

What's hot

Similar to Schema Registry - Set Your Data Free

More from DataWorks Summit

Recently uploaded

Schema Registry - Set Your Data Free

Editor's Notes