Scala idiomatic data binding generator for JSON.
Courier is a language binding for Scala for the Pegasus schema and data system.
Pegasus contains an expressive schema language for JSON structured data that is based on the Avro schema language, but adds optional fields and a few other conveniences to make it easy to define the structure of natural looking JSON. Pegasus also has a rich feature set including schema based validation, data translation between multiple data formats, schema compatibility with Avro, and generated Java data bindings.
By using Courier, all the features of Pegasus can be leveraged by Scala developers but with Scala idiomatic data bindings that look and feel natural to a Scala developer.
2. Courier is a code generator
{
"name": "Fortune",
"namespace": "org.example",
"type": "record",
"fields": [
{
"name": "message",
"type": "string"
}
]
}
{ "message": "Today is your lucky day!" }
case class Fortune(message: String)
Diagram: Pegasus Schema -> (generate) -> Scala <-> (serialize / deserialize) <-> JSON Data
4. ● Extension of Apache Avro’s schema language built at LinkedIn.
● Designed for natural looking JSON.
● Rich type system maps well between JSON and type-safe languages like Scala.
● Schema language is machine readable and easy to extend.
● Tooling and language support.
Pegasus Schema Language
Pegasus Schemas
Avro Schemas
+optional record fields, +typerefs
Core schema language: records, maps, arrays, unions, enums, primitives
Courier
14. Pegasus Schema Types
Pegasus Type | Scala Type | Example JSON
int, long, float, double, boolean, string | Int, Long, Float, Double, Boolean, String | 1, 10000000, 3.14, 2.718281, true, "Coursera"
record | case class R(f1: T1, f2: T2, ...) | { "f1": 1, "f2": "Coursera" }
array | A extends IndexedSeq[T] | [1, 2, 3]
map | M extends Map[String, T] | { "key1": 1, "key2": 2 }
union | sealed abstract class U; case class M1(T1) extends U; case class M2(T2) extends U | { "org.example.M1": <T1 Value> }
enum | object E extends Enumeration | "SYMBOL"
* unions and typerefs will be covered in more detail later.
31. org/example/AnswerFormats.pdsc
{
"name": "AnswerFormats", "namespace": "org.example", "type": "typeref",
"ref": [ "MultipleChoice", "TextEntry" ]
}
Scala
sealed abstract class AnswerFormats()
object AnswerFormats {
case class MultipleChoiceMember(v: MultipleChoice) extends AnswerFormats
case class TextEntryMember(v: TextEntry) extends AnswerFormats
}
Example
MultipleChoiceMember(MultipleChoice(…)) =>
{ "org.example.MultipleChoice": { … } }
Naming a Union with a Typeref
32. org/example/DateTime.pdsc
{
"name": "DateTime", "namespace": "org.example", "type": "typeref",
"ref": "string",
"scala": {
"class": "org.joda.time.DateTime",
"coercerClass": "org.coursera.models.common.DateTimeCoercer"
}
}
Scala
Use org.joda.time.DateTime directly.
Example
Record(createdAt = new org.joda.time.DateTime(…)) =>
{ "createdAt": "2015-06-21T18:24:18Z" }
Custom Bindings with a Typeref
33. Pegasus System
Schema system: Schema based validation + custom validators.
Data system: JSON Object and Array equivalent types, support for binding to native types.
Code generators: Java, Scala (via Courier), Swift (in progress), Android Java (planned)
Codecs:
● via Pegasus:
o JSON - Jackson streaming
o PSON - non-standard JSON equivalent binary protocol
o Avro binary - compact binary protocol
● via Courier:
o StringKeyCodec - compatible with our legacy StringKeyFormats
o InlineStringCodec - a new “URL Friendly” JSON compatible format
Hardened and performance optimized at LinkedIn. In large scale production use for over 3 years.
Hi, my name is “Joe”.
I work on the infrastructure team and I’m here to talk to you about a new project called “Courier”.
Courier is a code generator for Scala.
This slide shows what it essentially does. If you only remember one slide from this presentation, remember this one.
Courier takes pegasus schema files as inputs.
It generates Scala classes.
These classes serialize/deserialize to JSON. Kinda like we do today with our Scala case classes and Play! JSON’s Formats.
Two things I should note about this slide:
For Play JSON’s Formats, we have a utility called AutoSchema that generates a schema from a Scala class, which is the opposite direction from the one Courier generates in. However, AutoSchema is deeply flawed. It could only do this correctly for some of our case classes, and it would require significant changes both to AutoSchema and to our JsonFormats to get it to work properly. Still, we could have chosen the approach where we generate schemas from Scala classes instead of the other way around, and I’ll talk a bit more later about why we prefer generating code from schemas.
While JSON is shown here, Courier supports multiple message formats, not just JSON, and we’ll be looking at those in more detail shortly.
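To make the schema -> class -> JSON flow concrete, here is a minimal hand-written sketch for the Fortune record from the slide. It is not what Courier actually generates: `toJson` and `fromJson` are hypothetical helpers invented for this illustration, and the parser only understands this one single-field shape.

```scala
object FortuneSketch {
  // The shape of the class shown on the slide.
  final case class Fortune(message: String)

  // Escape only backslashes and double quotes; a real codec handles far more.
  private def escape(s: String): String =
    s.replace("\\", "\\\\").replace("\"", "\\\"")

  // Undo the escaping above: drop the backslash in front of any escaped character.
  private def unescape(s: String): String =
    s.replaceAll("""\\(.)""", "$1")

  // Hypothetical serializer for this one record type.
  def toJson(f: Fortune): String =
    "{ \"message\": \"" + escape(f.message) + "\" }"

  // Matches exactly one object with a single "message" string field.
  private val OneField = """\{\s*"message"\s*:\s*"((?:[^"\\]|\\.)*)"\s*\}""".r

  // Hypothetical deserializer; returns None for anything it does not recognize.
  def fromJson(json: String): Option[Fortune] = json.trim match {
    case OneField(raw) => Some(Fortune(unescape(raw)))
    case _             => None
  }
}
```

The point of the sketch is only the round trip: a value of the case class goes out as the JSON on the slide and comes back as an equal value.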
Okay, at this point, some of you might be wondering, ...
What in the world are these pegasus schemas?
I’ve never heard of this pegasus thing!
Pegasus is a schema language and data system used by LinkedIn’s open source Rest.li project. Unless you’ve used Rest.li, odds are you’ve never heard of Pegasus.
The main thing to know is that pegasus’s schema language is just an extension of Avro’s schema language. In fact, pegasus schemas are almost identical to Avro schemas with a few specific improvements, the most important one being direct support for optional fields.
And if you’ve heard of Avro, it’s most likely you know of it as a binary protocol, usually associated with Hadoop. We won’t be looking at the binary protocol much as we’re more focused on JSON, but we will be looking at the schema language Avro uses.
We’ll talk more about the specific advantages of the Pegasus schema language shortly.
Before we dive into Pegasus in more detail, let’s look at why we care about schemas.
If you’re a Web or Mobile developer
you should not need to study our Scala code to try to figure out the JSON structure of our REST resources
you should not need to guess the structure from sample responses
you should not need to ask the developer who wrote the resource to explain the structure of the data it returns
Odds are, if you’re working with our current REST resources, you’ve got no choice but to do one of these things.
With schemas, you can just open a file, and it tells you the exact structure of the data. And there is no possibility that some custom OFormat has been defined somewhere that manipulates the JSON structure in an unexpected way.
Also, from schemas, we should be able to generate documentation like this ...
This is actually a screenshot of our API Explorer. Unfortunately, the current API Explorer uses the broken AutoSchema utility, which often generates these documentation pages incorrectly. This is one it happened to get right. I think. It’s hard to be sure, AutoSchema is not particularly trustworthy. But with a schema language we will be able to generate documentation like this correctly.
Documents like this are great when they’re correct. If you want to use a new REST API, you just pull up the API Explorer documentation and you immediately know what fields exist, if they are required or optional, and what their type is. Huge time saver.
Currently, our backend Scala developers benefit from strong type safety and we’ve seen substantial productivity gains, but that type safety only spans service boundaries if both the client and server are written in Scala, even though mobile developers are also using statically typed languages. If we instead generate bindings from schemas, we can generate bindings for Java and Swift as well, and extend that type safety and all the productivity gains it offers.
Okay, so schemas are great, but why Pegasus Schemas? Why not something else like JSON Schema?
We want a rich type system, richer than what JSON Schema offers. For numerics, JSON Schema offers an “integer” type and a “number” type, which is sort of weird. JSON only has a “number” type, so I’m not sure how the inventors of JSON Schema landed on “integer” and “number” as their numeric types. Pegasus has “int”, “long”, “float” and “double”, which matches up better with how most statically typed languages, not just Scala, define numeric types.
JSON Schema is also missing direct support for type variance (no tagged union type like Thrift or Avro, no subtype polymorphism), so there is no natural and type-safe way to model polymorphic types. Pegasus, on the other hand, has a well thought out type system that is based on ADTs (Algebraic Data Types) and maps well to the level of type-safety we expect in our Scala code.
We’ll talk about this in a bit more detail when we review the Pegasus type system.
Because the schemas themselves are written in JSON, we can read the schemas in just about any programming language because all you need is a JSON parser.
This makes it easy to build tools like
code generators
schema based data validators
documentation tools like our API explorer
Being JSON, we can add additional fields to pegasus schemas for things we need and schema readers that are unaware of the fields simply ignore them.
We’ve already done this a couple of times with Courier, I’ll point those out as we go.
If we had used a grammar-based schema language like Thrift or Protobuf have, we would be unable to do this.
With Play JSON, if we need to support an additional data format, we must add more implicit converters. With Pegasus, we define our schemas once, and then we can use the generated classes with any codecs that exist. If in the future we decide we need a new codec, we can add it and use it immediately with all Courier generated classes.
This is a high level overview of the pegasus types.
I’ll look at each one individually, so you don’t need to worry about understanding this all now.
The left column shows the major types.
The primitive types map directly to Scala types. All four numeric types map to JSON’s “number” type.
Pegasus schemas have the .pdsc file extension.
Let’s look at some examples, starting with Record types.
In each example, the first section is the pegasus schema.
The second section shows basically what the generated class looks like. The actual generated classes are a bit more sophisticated but they will be consistent with the declaration shown here.
The third section shows a few examples of Scala code and the equivalent JSON.
This particular example shows a simple record with two fields. The second field is optional.
Here’s another record.
This shows how optional and default fields work. Note how the defaults in the schema correspond to the constructor defaults in the generated Scala class.
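As a rough sketch of how schema defaults can line up with constructor defaults, consider the hand-written class below. The record name and its fields are hypothetical, and modelling an optional field as `Option[T]` is an assumption made for illustration rather than a statement of what Courier emits.

```scala
object OptionalFieldsSketch {
  // Hypothetical record; none of these names come from the slides.
  final case class Profile(
    name: String,                     // required field: must always be supplied
    nickname: Option[String] = None,  // optional field that defaults to "absent"
    visits: Int = 0                   // field whose schema default becomes a constructor default
  )
}
```

With a class shaped like this, `Profile("Ada")` and `Profile("Ada", None, 0)` are the same value, which is the correspondence between schema defaults and constructor defaults described above.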
The “defaultNone” schema property, highlighted in green, is a custom property added by Courier, it is not part of the pegasus schema language. Being able to add properties like this is an example of why we prefer having a schema language that is written in JSON.
Okay, moving on to arrays and maps.
Here’s an array.
Here, the type definition highlighted in green is an array. The rest is a record type that contains the array.
In Pegasus, arrays are usually defined “inline” like this. It is possible to define an array as a top level type, and I’ll show that later. But for now we’ll look at this inline array.
Courier generates a class for the array, highlighted in green, that extends IndexedSeq and has all the convenience methods one expects of a Scala seq type: map, filter, scan, fold, and so on.
Note that we do not use generic collection types for all arrays, instead we generate a custom array class for each contained type. We do this so that we can attach some schema related information to the array type which is needed for validation and some other Courier internals. This doesn’t end up being a problem in practice. It’s trivial to convert from Seq, List, etc. to the generated classes.
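The idea of a dedicated array class per element type can be sketched as follows. `IntArray` and its companion methods are hypothetical names, and the real generated classes carry schema information that this sketch leaves out.

```scala
object ArraySketch {
  // Hypothetical stand-in for a generated, element-type-specific array class.
  final class IntArray(private val elems: Vector[Int]) extends IndexedSeq[Int] {
    // IndexedSeq only needs these two members; map, filter, fold, etc. are inherited.
    def apply(i: Int): Int = elems(i)
    def length: Int = elems.length
  }

  object IntArray {
    // Build from literal elements.
    def apply(xs: Int*): IntArray = new IntArray(xs.toVector)
    // Convert from any ordinary Scala sequence (Seq, List, Vector, ...).
    def fromSeq(xs: Seq[Int]): IntArray = new IntArray(xs.toVector)
  }
}
```

Because the class is an `IndexedSeq`, code that expects an ordinary sequence keeps working, and converting in the other direction is a single call such as `IntArray.fromSeq(List(1, 2, 3))`.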
Here’s a map.
Currently all maps are string keyed, but we plan to add a feature to Courier called “Typed Map Keys”, which will allow key types to be defined as well, similar to how we define the type of the values here. I’ll show how we will encode the map keys later on.
Just like with arrays, we generate a map class for each contained type.
Pegasus unions are tagged unions, aka disjoint types. For those of you familiar with ADTs (Algebraic Data Types), unions are your sum type.
… There is no notion of sub-types. But type hierarchies can be mapped to unions quite easily.
… this is a bit of a subtle point. The notion here is that, with a union, all the member types are declared directly in the union declaration, so any code written to process the union is guaranteed to know what all the possible types are. This allows, for example, for a compiler to check if the cases of a pattern matching expression against a union are exhaustive or not. Those of you coding in Scala regularly should feel right at home with this concept.
Okay, here’s an example union. Unions are one of the most complex types in Courier, so we’ll spend a bit of extra time on this slide.
Here the union type declaration is highlighted in green. Again, it’s inside a record.
Note the structure. Unions are declared using a JSON array. Each item in the array is a union member type.
If you look at the generated Scala, you’ll see a sealed AnswerFormat class inside the record class’s companion object, and under that a case class for each of the union members.
In Scala, we have to “box” the member types. For example “MultipleChoiceMember” boxes “MultipleChoice”. We do this because Scala does not yet directly support unions. Support for unions in Scala has been mentioned by Odersky in a recent talk (I believe he referred to them as “disjoint types”), so if they are added to Scala, we won’t need to box the member types like we do here.
How unions are represented in JSON is worth looking at more closely. In the example, highlighted in red is a Pegasus type name. This is the union tag. It indicates that this JSON data contains a “MultipleChoice” answer format. If instead the answerFormat had been a TextEntry, this would be “org.example.TextEntry”. This is a union, so it’s either/or. Since this union has two member types, it can either be MultipleChoice or TextEntry.
This is the one feature of Pegasus where the JSON looks “un-natural”. But it is also flexible and clear. Unions can contain all possible member types, including primitives, and the tag used for each type makes it very clear what the member data is.
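The boxing and tagging described above can be sketched in a few lines. The member case classes follow the naming shown on the typeref slide; the fields of `MultipleChoice` and `TextEntry`, and the `unionTag` and `tagged` helpers, are hypothetical and only illustrate the idea.

```scala
object UnionSketch {
  // Member records; their fields are invented for this illustration.
  final case class MultipleChoice(choices: Vector[String])
  final case class TextEntry(maxLength: Int)

  // Sealed base class: every member is declared right here, so the set is closed.
  sealed abstract class AnswerFormats
  object AnswerFormats {
    final case class MultipleChoiceMember(v: MultipleChoice) extends AnswerFormats
    final case class TextEntryMember(v: TextEntry) extends AnswerFormats
  }
  import AnswerFormats._

  // Because AnswerFormats is sealed, the compiler can warn if a case is missing.
  def unionTag(a: AnswerFormats): String = a match {
    case MultipleChoiceMember(_) => "org.example.MultipleChoice"
    case TextEntryMember(_)      => "org.example.TextEntry"
  }

  // Wrap already-serialized member JSON under the member's type name (the union tag).
  def tagged(a: AnswerFormats, memberJson: String): String =
    "{ \"" + unionTag(a) + "\": " + memberJson + " }"
}
```

A consumer reading the JSON looks at the single key to learn which member it is holding, which is the same decision the pattern match makes on the Scala side.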
We’ve spoken with a few javascript experts here at Coursera and written some sample code to make sure that we’ll be able to bind to this union format from javascript in a reasonable way.
I like to think of enums as degenerate unions. In pegasus they are a completely separate type, which is probably a good thing. They map to JSON strings.
They’re pretty simple.
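A small sketch of an enum that maps to JSON strings, following the `object E extends Enumeration` shape from the overview table. The `Fruits` enum and the two helper functions are hypothetical.

```scala
object EnumSketch {
  // Hypothetical enum; the symbols are illustrative only.
  object Fruits extends Enumeration {
    val APPLE, BANANA, ORANGE = Value
  }

  // An enum value serializes to its symbol as a JSON string.
  def toJson(f: Fruits.Value): String = "\"" + f.toString + "\""

  // Look a symbol up by name; unknown symbols yield None instead of throwing.
  def fromSymbol(s: String): Option[Fruits.Value] =
    Fruits.values.find(_.toString == s)
}
```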
We’ve covered all the base types, but there is one language feature left to go over: Typerefs
Typerefs do not exist in Avro, they are added by Pegasus.
Here’s a basic typeref. It just defines an alias to the long type called “Timestamp”.
This is just the most lightweight possible alias to an existing type. It really doesn’t do anything other than introduce a new name for the referenced type. In Scala and JSON, a “Timestamp” is handled exactly like a long is.
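The closest Scala analogy to a basic typeref is a plain type alias. The sketch below assumes that analogy for illustration only; it does not claim that Courier emits an alias, and `elapsedMillis` is a hypothetical helper.

```scala
object TimestampSketch {
  // A new name for Long; the compiler treats the two interchangeably.
  type Timestamp = Long

  // Any Long can be passed where a Timestamp is expected, and vice versa.
  def elapsedMillis(start: Timestamp, end: Timestamp): Long = end - start
}
```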
We will rarely if ever use basic typerefs just to provide lightweight aliases.
While basic typerefs are not all that useful, Pegasus uses typerefs to activate other, more powerful features that are missing from Avro.
Remember how we defined all unions, maps and arrays as inline types inside some other type? This is because Avro does not support defining unions, maps or arrays as top level types.
Typerefs allow us to get around this.
We can define a new typeref, and set the “ref” type to a union.
Courier will then generate a top level class for the union.
We can do the same thing for maps or arrays if we want.
This is the most powerful feature that typerefs enable.
With a typeref we can bind a pegasus type to any Scala class we want. When we do this, Courier will not generate a class, but instead will use the existing class that we specify.
In order for this to work, we must define a separate “coercer” class that converts the pegasus type (a string here) into the scala class.
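To show what a coercer has to do, here is a minimal sketch of a two-way conversion between the schema-level string and a custom class. The `Coercer` trait is a hypothetical interface, not the Pegasus or Courier API, and `java.time.Instant` is swapped in for Joda’s `DateTime` so the example needs no external library.

```scala
import java.time.Instant

object CoercerSketch {
  // Hypothetical interface: converts between the raw schema type and a custom class.
  trait Coercer[Raw, Custom] {
    def toCustom(raw: Raw): Custom
    def toRaw(custom: Custom): Raw
  }

  // The schema type is "string"; the bound class here is java.time.Instant.
  object InstantCoercer extends Coercer[String, Instant] {
    // Parse an ISO-8601 instant such as the createdAt value on the slide.
    def toCustom(raw: String): Instant = Instant.parse(raw)
    // Instant.toString produces the ISO-8601 form again.
    def toRaw(custom: Instant): String = custom.toString
  }
}
```

The binding only works if the two directions are inverses of each other for every value that can appear on the wire, so a round-trip check is a sensible first test for any coercer.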
We expect custom type bindings to be used heavily for core types such as CourseId and UUID.
For arity-1 case classes we plan to add a feature to Courier to automatically handle the coercion, so a coercerClass will not need to be defined for simple “AnyVal” type case classes.
In addition to the schema language, pegasus provides a well engineered Java implementation that we can take advantage of.
One feature that we get “out-of-the-gate” is schema based data validation as well as the ability to write custom validators.
We also get Java and Scala code generators.
Any Android developers might be interested to know that the Java generator is available via a gradle plugin.
LinkedIn is working on code generators for Swift and planning to write a specialized code generator specifically for Android. I checked with LinkedIn a few weeks ago, and at the time they were not willing to give me a firm date for when these will be available. I was told that the Swift generator was basically “code complete”, but they were still working with the mobile team to improve the generated code to address some specific concerns. I’m happy to let them flesh this stuff out and get something polished when it’s ready! I will continue to check in with the rest.li team to make sure this is progressing.
As I mentioned earlier, pegasus supports multiple message formats. It calls these codecs.
PSON is a binary format that is a bit more compact than JSON and is much cheaper, computationally, to serialize/deserialize. We don’t have any immediate plans to use PSON, but have considered using it instead of JSON at the storage layer, and could turn it on between high traffic services to reduce CPU usage if it becomes a bottleneck.
Compatibility with Avro may have a number of advantages as it is a particularly compact binary protocol. Again, we don’t have any immediate plans to use Avro, but could potentially use it at the storage layer.
Courier adds two custom codecs for string keys that I’ll discuss in more detail later.
At LinkedIn, pegasus is the primary system for inter-service communication. It’s battle tested and performance tested. A huge amount of engineering effort has gone into productionalizing this code and performance optimizing it.
Okay, so how does Courier integrate with SBT?
Courier has been part of the “models” project in infra-services for over a month.
The code generator runs before the compile task. You don’t need to change your development workflow. Courier will automatically generate classes as needed.
If you make a change to a .pdsc, just run `compile`. If you want, you can run SBT `~compile`, which will watch for changes to .pdsc files and run courier automatically.
Courier emits contextual error messages like the one shown here if there are any errors in schemas.
The Courier plugin is integrated with Play!
Error messages, like the missing closing quote on line 7 here, will appear in the browser when using Play!