At exactEarth, we have a lot of formats for the same AIS data. We had a proliferation of ad-hoc converters that developers had written with varying quality and performance.
By modelling the problem of format conversion as a graph traversal problem, I was able to create a library of battle-tested code that dynamically produces the converters developers need, which makes format conversion simple for them.
2. Overview
• At exactEarth, we deal with data in a lot of different formats.
• We had problems with a proliferation of converters.
• I’m going to discuss a pattern we developed for dealing with situations like
this that turned out well.
6. Automatic Identification System (AIS)
• Designed during the 1990s
• Adopted as a standard in 2002
• Very High Frequency (VHF)
radio transmissions
• 27 different types of messages
transmitted
7. Message Types 1, 2, 3
• Maritime Mobile Service Identity (MMSI)
• Location
• Speed over ground
• Course over ground
• Heading
• Rate of Turn
Message Type 5
• Maritime Mobile Service Identity (MMSI)
• Name
• IMO Number
• Callsign
• Dimensions of the ship
• Destination and ETA
8. Effective January 1 2005, AIS transceivers are
required by:
• All ships of 300 gross tonnage and upwards engaged
on international voyages
• All cargo ships of 500 gross tonnage and upwards
not engaged on international voyages
• All passenger ships irrespective of size.
AIS transceivers must be on at all times
(with some limited exceptions)
10. Many ways to store AIS messages
• NMEA v3, v4
• GNM v3.1
• Internal Binary formats (with several versions)
• “Adapted” formats (several variations):
• CSV
• XML
• JSON
• KML
• OTH-Gold
• Many third-party and “one-off” formats
Many representations of the same data
12. Conversions between formats
In order to ingest data from third parties, and to satisfy customer demands
for data in a particular format, we need to be able to convert between all the
formats
13. Lossy vs. Lossless Conversions
Some conversions are lossless:
• For example, both NMEAv4 and GNM v3.1 capture all the same data.
But some are lossy, meaning that data is lost in the conversion:
• For example, NMEAv4 to KML
• KML doesn’t have all of the fields that AIS-specific formats do
14. Lossless Conversion: GNM and NMEAv4
GNM:
$PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20
!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
NMEAv4:
s:104,c:1315505346300*0E!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
The other fields are either format syntax, checksums, or some trivial additional fields
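As a rough check that both encodings carry the same timestamp, the sketch below turns the date and time fields of the $PGHP line into the epoch-millisecond value found in the NMEAv4 c: tag. The field positions are inferred from this single example, not taken from a GNM specification, and pghp_to_epoch_ms is a hypothetical helper.

```python
from datetime import datetime, timezone


def pghp_to_epoch_ms(sentence: str) -> int:
    """Convert the date/time fields of a $PGHP line to epoch milliseconds.

    Assumption: fields 2-8 are year, month, day, hour, minute, second and
    millisecond in UTC, as the slide's example suggests.
    """
    body = sentence.split("*")[0]  # drop the trailing checksum
    fields = body.split(",")
    year, month, day, hour, minute, second, ms = (int(f) for f in fields[2:9])
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp()) * 1000 + ms


gnm_header = "$PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20"
epoch_ms = pghp_to_epoch_ms(gnm_header)
print(epoch_ms)  # 1315505346300, the value of the c: tag in the NMEAv4 line
```

The 104 in the $PGHP line likewise reappears as s:104 in the NMEAv4 line, which is why the conversion can go in either direction without losing anything.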
18. Lossy Conversion NMEAv4 -> KML
Message Type 1:
• MMSI (identifier)
• Timestamp
• Longitude/Latitude
• Heading
• Navigation Status
• Rate of Turn
• Speed Over Ground
• Position Accuracy
• Course over Ground
• …
KML:
<Placemark>
<name>431300061</name>
<TimeStamp><when>2011-09-08T18:09:06Z</when></TimeStamp>
<Point><coordinates>140.08116666666666,35.55616666666667</coordinates></Point>
<Style><IconStyle>
<Icon>
<href>http://maps.google.com/mapfiles/kml/shapes/track.png</href>
<w>64</w><h>64</h>
</Icon><color>ffff0000</color>
<heading>344.0</heading>
</IconStyle>
</Style>
</Placemark>
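To make the loss concrete, here is a minimal sketch that renders a simplified Placemark from a hypothetical in-memory message and lists the fields that have nowhere to go. The dictionary keys, KML_FIELDS and to_placemark are illustrative names only. The last three values are placeholders, not data decoded from the slide's message.

```python
from xml.sax.saxutils import escape

# Hypothetical in-memory Type 1 message. The first five values come from the
# KML on the slide; the last three are placeholders, not decoded data.
message = {
    "mmsi": 431300061,
    "timestamp": "2011-09-08T18:09:06Z",
    "longitude": 140.08116666666666,
    "latitude": 35.55616666666667,
    "heading": 344.0,
    "navigation_status": "placeholder",
    "rate_of_turn": "placeholder",
    "speed_over_ground": "placeholder",
}

# The only fields this simplified Placemark has room for.
KML_FIELDS = ("mmsi", "timestamp", "longitude", "latitude", "heading")


def to_placemark(msg):
    """Render a minimal Placemark; every field outside KML_FIELDS is dropped."""
    return (
        "<Placemark>"
        f"<name>{escape(str(msg['mmsi']))}</name>"
        f"<TimeStamp><when>{escape(msg['timestamp'])}</when></TimeStamp>"
        f"<Point><coordinates>{msg['longitude']},{msg['latitude']}</coordinates></Point>"
        f"<Style><IconStyle><heading>{msg['heading']}</heading></IconStyle></Style>"
        "</Placemark>"
    )


placemark = to_placemark(message)
dropped = sorted(set(message) - set(KML_FIELDS))
print(dropped)  # the fields that cannot be recovered from the KML
```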
20. Problem: Proliferation of Converters
• Code Duplication
• Bug prone, not performant
• Testing + optimization efforts were strained by so many
implementations
• Not flexible
• If a component consumed GNM, it was hard to add the ability to
consume NMEA
• Inadvertent use of lossy conversions
21. Step 1: One format to rule them all
We created a new format: EEA
Built for AIS, faithfully reflects the spec.
Extension fields for format-specific metadata
22. Side benefit: Multi-type fields
Some fields in the AIS spec are multi-typed
Example: Speed over Ground (10 bits, 0-1023)
From the spec:
“Speed over ground in 1/10 knot steps (0-102.2 knots)
1023 = not available, 1022 = 102.2 knots or higher”
Developers were often performing mathematical operations on the fields (!)
In EEA, we made the type of these fields:
Either[double, NOT_AVAILABLE, SPEED_102_POINT_2_KNOTS_OR_HIGHER]
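A minimal Python sketch of the same idea, assuming the raw 10-bit value has already been extracted from the message. SpeedSpecial, decode_sog and mean_known_speed are hypothetical names, not the EEA API.

```python
from enum import Enum
from typing import Iterable, Optional, Union


class SpeedSpecial(Enum):
    """The two reserved raw values from the spec."""
    SPEED_102_POINT_2_KNOTS_OR_HIGHER = 1022
    NOT_AVAILABLE = 1023


SpeedOverGround = Union[float, SpeedSpecial]


def decode_sog(raw: int) -> SpeedOverGround:
    """Decode a raw 10-bit speed over ground into knots or a special value."""
    if not 0 <= raw <= 1023:
        raise ValueError(f"speed over ground must fit in 10 bits, got {raw}")
    if raw in (1022, 1023):
        return SpeedSpecial(raw)
    return raw / 10.0  # 1/10 knot steps


def mean_known_speed(raw_values: Iterable[int]) -> Optional[float]:
    """Average only the real speeds; the caller decides what to do with the rest."""
    speeds = [s for s in map(decode_sog, raw_values) if isinstance(s, float)]
    return sum(speeds) / len(speeds) if speeds else None


print(decode_sog(123))                      # 12.3
print(mean_known_speed([100, 200, 1023]))   # 15.0
```

Because the special values are enum members and not numbers, an expression such as decode_sog(1023) + 1.0 raises a TypeError instead of quietly producing a wrong speed.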
35. Generating the converter
Now that we have a graph, to make a converter just compose the functions
on the edges of the shortest path:
NMEA_v4_payload = merge(serialize(nop(deserialize(tokenize(GNM_input)))))
Function composition in Python:
https://mathieularose.com/function-composition-in-python/#solution
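A small sketch of the composition step on its own, assuming the functions along the path have already been collected in order. The five step functions are toy stand-ins that convert comma-separated lines to semicolon-separated lines; they are not the real GNM or NMEA code.

```python
from functools import reduce


def compose(*functions):
    """Return a function that applies the given functions left to right."""
    return lambda value: reduce(lambda acc, fn: fn(acc), functions, value)


# Toy stand-ins for the functions found on the edges of the path.
def tokenize(payload):
    return payload.split("\n")


def deserialize(tokens):
    return [token.split(",") for token in tokens]


def nop(messages):
    return messages


def serialize(messages):
    return [";".join(message) for message in messages]


def merge(tokens):
    return "\n".join(tokens)


convert = compose(tokenize, deserialize, nop, serialize, merge)
print(convert("a,b\nc,d"))  # prints a;b and c;d on separate lines
```

Listing the functions in path order keeps the call readable; it is equivalent to the nested form merge(serialize(nop(deserialize(tokenize(x))))).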
38. Example usage
NM4_to_GNM = get_converter(("NMEA", "4", "PAYLOAD"), ("GNM", "3.1", "PAYLOAD"))
with open("my_nmea_v4_file.nm4", 'rb') as fin:
    with open("my_converted_file.gnm", 'wb') as fout:
        fout.write(NM4_to_GNM(fin))
39. Prevention of lossy conversions
Create 2 different conversion graphs:
1. Only lossless conversions: “FORWARD_FORMAT_CONVERSIONS”
2. Add lossy conversions: “ALL_FORMAT_CONVERSIONS”
Use lossless graph by default, make users explicitly ask to use lossy
conversions
If the user asks for a lossy conversion without being explicit, there will be
no path in the “FORWARD_FORMAT_CONVERSIONS” graph. Library can check for a
path in “ALL_FORMAT_CONVERSIONS” and give them a nice error message:
“No lossless path from NMEAv4 to KML. If you want to perform a
lossy conversion, you must explicitly allow lossy conversions.”
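One way to sketch this check, using plain dictionaries and a breadth-first search in place of the graph library. Nodes are simplified to bare format names, and plan_conversion, allow_lossy and LossyConversionError are hypothetical names, not the library's API.

```python
from collections import deque

# Lossless edges only.
FORWARD_FORMAT_CONVERSIONS = {
    "NMEAv4": {"EEA"},
    "GNM": {"EEA"},
    "EEA": {"NMEAv4", "GNM"},
}

# The same edges plus the lossy ones.
ALL_FORMAT_CONVERSIONS = {
    node: set(targets) for node, targets in FORWARD_FORMAT_CONVERSIONS.items()
}
ALL_FORMAT_CONVERSIONS["EEA"].add("KML")


class LossyConversionError(Exception):
    """Raised when only a lossy path exists and lossy was not requested."""


def find_path(graph, source, target):
    """Breadth-first search; returns the shortest node path or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


def plan_conversion(source, target, allow_lossy=False):
    """Return the conversion path, refusing lossy paths unless asked."""
    graph = ALL_FORMAT_CONVERSIONS if allow_lossy else FORWARD_FORMAT_CONVERSIONS
    path = find_path(graph, source, target)
    if path is not None:
        return path
    if not allow_lossy and find_path(ALL_FORMAT_CONVERSIONS, source, target):
        raise LossyConversionError(
            f"No lossless path from {source} to {target}. If you want to perform "
            "a lossy conversion, you must explicitly allow lossy conversions."
        )
    raise ValueError(f"No conversion path from {source} to {target}.")


print(plan_conversion("NMEAv4", "GNM"))                    # lossless path
print(plan_conversion("NMEAv4", "KML", allow_lossy=True))  # explicit opt-in
```

Calling plan_conversion("NMEAv4", "KML") without the opt-in raises LossyConversionError with the message from the slide.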
41. Extra parameters
1. Mark edges as having required parameters:
conversions.add_edge(
    ("DOF", "3", "PAYLOAD"),
    ("DOF", "3", "UNPARSED"),
    function=tokenize,
    required_params=set(["timestamp"])
)
2. Allow the user to supply arbitrary keyword arguments to get_converter():
get_converter(
    ("DOF", "3", "PAYLOAD"),
    ("DOF", "4", "PAYLOAD"),
    timestamp=get_datetime_for_id(id)
)
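A sketch of how the required parameters might be checked and bound when the converter is built. The path is written as a plain list of (function, required_params) pairs instead of being read from graph edges, and build_converter and the toy step functions are hypothetical.

```python
from functools import partial


def tokenize(payload, timestamp):
    """Toy step: split the payload and stamp each token with the timestamp."""
    return [(timestamp, line) for line in payload.splitlines()]


def deserialize(tokens):
    """Toy step: turn each stamped token into a dictionary."""
    return [{"timestamp": ts, "body": body} for ts, body in tokens]


# Each entry mirrors one edge: the function plus the parameters it requires.
PATH = [
    (tokenize, {"timestamp"}),
    (deserialize, set()),
]


def build_converter(path, **params):
    """Bind the supplied keyword arguments to the steps that require them."""
    steps = []
    for function, required in path:
        missing = required - set(params)
        if missing:
            raise TypeError(f"missing required parameters: {sorted(missing)}")
        bound = {name: params[name] for name in required}
        steps.append(partial(function, **bound))

    def converter(value):
        for step in steps:
            value = step(value)
        return value

    return converter


converter = build_converter(PATH, timestamp="2011-09-08T18:09:06Z")
print(converter("first\nsecond"))
```

Failing at build time, before any data is read, means a missing timestamp is reported once, not once per message.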
43. Benefits
• Centralizes the conversion code
• Fewer bugs, more performant
• Simplifies the code + less duplication
• Don’t need to know all of the input formats a priori
• Dynamic generation of converters
• Reduces chance of accidental lossy conversions
44. Summary
• We had a problem with multiple formats and converters between them
• By modelling it as a graph problem, it was easy to dynamically generate
converters
• This allowed for greater flexibility, greater safety
• When you have a web of conversion steps, you can use graph traversal
libraries to generate the shortest path to get the answers you want.
Why?:
Environmental: reef protection, bilge water dumping, oil spills, but most importantly illegal fishing…
Logistical: Port authorities, logistics companies, scheduling
Security: surveillance, smuggling, piracy
As a ship captain, how do you prevent collisions with other vessels?
People tend to jump immediately to SONAR/RADAR, but there are a few major problems with that:
The equipment is expensive
The equipment requires a lot of power
These systems are actually quite difficult to read. They require some skill to operate.
The most common method was simply to visually observe the other ships and try to estimate their speed, course, heading, and acceleration.
Then you would do the same for your vessel and figure out the calculus to determine if you are going to collide or not.
Obviously, this is also tricky, and fails in situations like:
Night time
Stormy weather
When you are going around a tight curve in a waterway, and can’t see what’s coming at you
Ships don’t stop on a dime. In fact, some of these vessels take in the neighbourhood of 20 minutes to stop. So sometimes, you will have two ships that know they are going to collide well in advance of the collision, but they can’t turn or stop fast enough to do anything about it.
So the problem of ship collisions is what triggered the creation of AIS.
All vessels transmit a “Here I am! Please don’t hit me” message to the other vessels in the area.
Here are some of the fields that are transmitted.
In the Type 1,2,3 messages, we have position information. These are transmitted every few seconds while the vessel is moving.
In other message types, we have more static information that doesn’t change very often, like registration and destination and ETA.
The first thing you do when you have a lot of formats: Create a new format!
We created a new internal format that we called EEA. Not only could it hold all of the AIS message fields, but it also had “extension fields” where we would shove all the fields that might be specific to a format. For example, GNM has some metadata fields that are specific to GNM. We put those into a GNMMetadata field within the EEA spec. So now we have a format that can capture all of the complexity of all the AIS formats.
It was written to be faithful to the AIS spec, which means that it handles the full complexity of the AIS spec.
A side benefit of redesigning our format (and our in-memory representation) with EEA is that we got to fix some of the issues developers were accidentally creating.
AIS is a complicated spec. For example, look at the definition for speed over ground. The field is 10 bits wide, which gives 1024 possible values, and while most of those values can be interpreted as doubles, there are two reserved values that cannot. There is 1023 = Not Available, for when the ship doesn’t know how fast it is going. There is also the 1022 value, which means 102.2 knots or greater.
The problem with all the ad hoc parsers is that their developers would often get the parsing of these fields wrong. They would often just parse the field as a double, not realizing that these special values existed, and then perform mathematical operations on it. For example, they would average the speed over ground values, and end up with massively inflated results when a large number of vessels were reporting “Not available”.
With EEA, we fixed that by changing the type of the field to be either a double or one of two special values. This prevents mathematical operations from being applied to the field directly, and forces the developer to stop and think about how they actually want to handle the math.
The next step was to define a common interface of functions we apply when parsing all formats. We eventually settled on this interface.
You start with a “PAYLOAD” which is a collection of bytes representing messages, which might be the contents of a file for example.
You then call tokenize() which finds the boundaries of the messages within the payload and splits the payload on those boundaries. You still haven’t parsed the messages, so you don’t know what they say yet. You only know the bytes that make up each message. We called this “UNPARSED” in this diagram.
You can then call deserialize, which actually parses the bytes of the message and gives you an in-memory representation of the message. Most commonly, this was the new EEA format.
Then on the reverse direction, we take individual messages and call serialize() on them to return them to the UNPARSED tokens we had before. And finally we call merge(), which writes the messages, one after another, into a payload.
So these four functions are pretty much universal to format parsing. They also form a graph, perhaps a sort of state-machine where the data parse levels are the nodes, and the functions are the edges.
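To make the interface concrete, here is a sketch of the four functions for a toy newline-delimited format with made-up values. The Message class stands in for the parsed in-memory representation; none of this is the real GNM, NMEA or EEA code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    """Stand-in for the parsed, in-memory representation."""
    mmsi: int
    heading: float


def tokenize(payload: bytes) -> List[bytes]:
    """PAYLOAD -> UNPARSED: find message boundaries without parsing."""
    return [line for line in payload.split(b"\n") if line]


def deserialize(tokens: List[bytes]) -> List[Message]:
    """UNPARSED -> PARSED: parse the bytes of each message."""
    messages = []
    for token in tokens:
        mmsi, heading = token.decode("ascii").split(",")
        messages.append(Message(int(mmsi), float(heading)))
    return messages


def serialize(messages: List[Message]) -> List[bytes]:
    """PARSED -> UNPARSED: turn each message back into its bytes."""
    return [f"{m.mmsi},{m.heading}".encode("ascii") for m in messages]


def merge(tokens: List[bytes]) -> bytes:
    """UNPARSED -> PAYLOAD: write the messages one after another."""
    return b"".join(token + b"\n" for token in tokens)


payload = b"431300061,344.0\n431300061,12.5\n"
tokens = tokenize(payload)
messages = deserialize(tokens)
round_trip = merge(serialize(messages))
print(round_trip == payload)  # True
```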
We went and implemented these four functions for all of our data formats.
And when you put all the conversion graphs beside one another, you notice something.
The PARSED node for most of the formats is EEA.
It’s the same format. Really, it’s all the same node. Conceptually, you could add edges between them with a NO-OP function.
So now, if you want to convert between GNM and NM4, you can just follow the edges from the GNM PAYLOAD node to the NMEA4 PAYLOAD node.
And you suddenly have the conversion steps.
In order to represent the graph in code, we need a graphing library.
I came across NetworkX and have had no complaints.
It allows you to create nodes and edges in a graph, and then gives you an easy way to run algorithms such as shortest path on that graph.
Here’s what it looks like in code:
We have our conversions object, which is just a NetworkX graph object.
Then on each line, we add an edge between each of our parse levels. On each edge, we also supply the function to perform.
Notice in the last line, that this is an example where we jump between formats that are already parsed into the EEA in-memory representation. Therefore the function is a simple NO-OP.
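A sketch of what that graph construction could look like with NetworkX, assuming a directed graph and the ("format", "version", "level") node naming used elsewhere in the talk. The five functions are tracing stand-ins that only record their own names, so the order of the steps is visible; the real functions would do the parsing.

```python
import networkx as nx


def make_step(name):
    """Build a tracing stand-in that records its own name instead of parsing."""
    def step(trace):
        return trace + [name]
    return step


tokenize, deserialize, serialize, merge, nop = (
    make_step(n) for n in ("tokenize", "deserialize", "serialize", "merge", "nop")
)

conversions = nx.DiGraph()
conversions.add_edge(("GNM", "3.1", "PAYLOAD"), ("GNM", "3.1", "UNPARSED"), function=tokenize)
conversions.add_edge(("GNM", "3.1", "UNPARSED"), ("GNM", "3.1", "PARSED"), function=deserialize)
conversions.add_edge(("NMEA", "4", "PARSED"), ("NMEA", "4", "UNPARSED"), function=serialize)
conversions.add_edge(("NMEA", "4", "UNPARSED"), ("NMEA", "4", "PAYLOAD"), function=merge)
# Both PARSED nodes hold the same in-memory format, so the edge is a NO-OP.
conversions.add_edge(("GNM", "3.1", "PARSED"), ("NMEA", "4", "PARSED"), function=nop)

source = ("GNM", "3.1", "PAYLOAD")
target = ("NMEA", "4", "PAYLOAD")
path = nx.shortest_path(conversions, source, target)

# Apply the function stored on each edge of the path, in order.
trace = []
for u, v in zip(path, path[1:]):
    trace = conversions[u][v]["function"](trace)
print(trace)  # ['tokenize', 'deserialize', 'nop', 'serialize', 'merge']
```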