FormatConversion

A handy pattern for format conversions
2017-11-27

Overview
• At exactEarth, we deal with data in a lot of different formats.
• We had problems with a proliferation of converters.
• I’m going to discuss a pattern we developed for dealing with situations like
this that turned out well.

What does exactEarth do?
“We track all of the world’s ships using satellites.”

Automatic Identification System (AIS)
• Designed during the 1990s
• Adopted as a standard in 2002
• Very High Frequency (VHF) radio transmissions
• 27 different types of messages transmitted

Message Types 1, 2, 3 (position reports):
• Maritime Mobile Service Identity (MMSI)
• Location
• Speed over ground
• Course over ground
• Heading
• Rate of Turn

Message Type 5 (static and voyage-related data):
• Maritime Mobile Service Identity (MMSI)
• Name
• IMO Number
• Callsign
• Dimensions of the ship
• Destination and ETA

Effective January 1 2005, AIS transceivers are
required by:
• All ships of 300 gross tonnage and upwards engaged
on international voyages
• All cargo ships of 500 gross tonnage and upwards
not engaged on international voyages
• All passenger ships irrespective of size.
AIS transceivers must be on at all times
(with some limited exceptions)

[Figure: example ships. Top pictures: gross tonnage of 300. Right picture: gross tonnage of 500.]

Many ways to store AIS messages
• NMEA v3, v4
• GNM v3.1
• Internal Binary formats (with several versions)
• “Adapted” formats (several variations):
• CSV
• XML
• JSON
• KML
• OTH-Gold
• Many third-party and “one-off” formats
Many representations of the same data

Conversions between formats
To ingest data from third parties, and to satisfy customer demands for data in a particular format, we need to be able to convert between all of these formats.

Lossy vs. Lossless Conversions
Some conversions are lossless:
• For example, both NMEAv4 and GNM v3.1 capture all the same data.
But some are lossy, meaning that data is lost in the conversion:
• For example, NMEAv4 to KML
• KML doesn’t have all of the fields that AIS-specific formats do

Lossless Conversion: GNM and NMEAv4
GNM:
$PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20
!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
NMEAv4:
s:104,c:1315505346300*0E!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
Both carry the same AIVDM sentence, the same timestamp (2011-09-08 18:09:06.300 UTC is 1315505346300 ms since the Unix epoch), and the same identifier 104.
The other fields are either format syntax, checksums, or some trivial additional fields.
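
The timestamp equivalence can be checked by hand. A minimal sketch, assuming the fields after the first in the $PGHP line are year, month, day, hour, minute, second, millisecond (an assumption read off the example above, not taken from the GNM specification), and that the c: value is milliseconds since the Unix epoch. The function name is hypothetical.

```python
import calendar

GNM_HEADER = "$PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20"


def pghp_to_epoch_ms(header: str) -> int:
    """Convert the date/time fields of a $PGHP header to Unix epoch milliseconds.

    Field positions are assumed from the example line, not from the GNM spec.
    """
    body = header.split("*")[0]      # drop the trailing checksum
    fields = body.split(",")
    year, month, day, hour, minute, second, millis = (int(f) for f in fields[2:9])
    # timegm interprets the tuple as UTC and returns whole seconds.
    seconds = calendar.timegm((year, month, day, hour, minute, second))
    return seconds * 1000 + millis


print(pghp_to_epoch_ms(GNM_HEADER))  # 1315505346300, matching c:1315505346300
```

Integer arithmetic is used on purpose: going through a float timestamp could introduce rounding in the millisecond digit.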

Lossy Conversion NMEAv4 -> KML
Message Type 1:
• MMSI (identifier)
• Timestamp
• Longitude/Latitude
• Heading
• Navigation Status
• Rate of Turn
• Speed Over Ground
• Position Accuracy
• Course over Ground
• …
KML:
<Placemark>
  <name>431300061</name>
  <TimeStamp><when>2011-09-08T18:09:06Z</when></TimeStamp>
  <Point><coordinates>140.08116666666666,35.55616666666667</coordinates></Point>
  <Style>
    <IconStyle>
      <Icon>
        <href>http://maps.google.com/mapfiles/kml/shapes/track.png</href>
        <w>64</w><h>64</h>
      </Icon>
      <color>ffff0000</color>
      <heading>344.0</heading>
    </IconStyle>
  </Style>
</Placemark>

Problem: Proliferation of Converters
• Code duplication
• Bug-prone, not performant
• Testing + optimization efforts were strained by so many implementations
• Not flexible
• If a component consumes GNM today, it was hard to add the ability to consume NMEA
• Inadvertent use of lossy conversions

Step 1: One format to rule them all
We created a new format: EEA
Built for AIS, faithfully reflects the spec.
Extension fields for format-specific metadata

Side benefit: Multi-type fields
Some fields in the AIS spec are multi-typed
Example: Speed over Ground (10 bits, 0-1023)
From the spec:
“Speed over ground in 1/10 knot steps (0-102.2 knots)
1023 = not available, 1022 = 102.2 knots or higher”
Developers were often performing mathematical operations on the raw field values, sentinel values included (!)
In EEA, we made the type of these fields:
Either[double, NOT_AVAILABLE, SPEED_102_POINT_2_KNOTS_OR_HIGHER]
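
A rough Python sketch of such a multi-typed field, assuming the raw 10-bit value is decoded once at the boundary. The names SpeedSentinel, SpeedOverGround, decode_speed_over_ground and describe are hypothetical and are not EEA's actual types.

```python
from enum import Enum
from typing import Union


class SpeedSentinel(Enum):
    """The two reserved raw values from the spec text quoted above."""
    SPEED_102_POINT_2_KNOTS_OR_HIGHER = 1022
    NOT_AVAILABLE = 1023


# Either a real speed in knots, or one of the sentinels.
SpeedOverGround = Union[float, SpeedSentinel]


def decode_speed_over_ground(raw: int) -> SpeedOverGround:
    """Decode the raw 10-bit field (0-1023) into a typed value."""
    if not 0 <= raw <= 1023:
        raise ValueError(f"raw value out of range for a 10-bit field: {raw}")
    if raw >= 1022:
        return SpeedSentinel(raw)
    return raw / 10.0  # 1/10 knot steps


def describe(sog: SpeedOverGround) -> str:
    """Callers must handle the sentinels before doing any arithmetic."""
    if isinstance(sog, SpeedSentinel):
        return sog.name
    return f"{sog:.1f} knots"


print(describe(decode_speed_over_ground(123)))   # 12.3 knots
print(describe(decode_speed_over_ground(1023)))  # NOT_AVAILABLE
```

Because a sentinel is not a number, code that tries to average or compare it as a speed fails loudly instead of silently treating 1023 as 102.3 knots.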

Step 2: Standardized low-level API
• tokenize(input:file_like)
• deserialize_message(unparsed_message)
• serialize_message(parsed_message)
• merge(Iterable[unparsed_message], output:file_like)
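
To make the four functions concrete, here is a sketch for a made-up line-based "key=value;key=value" format. The toy format and the function bodies are assumptions for illustration only; real implementations for NMEA or GNM would be considerably more involved. The parameter is named input_file rather than input to avoid shadowing the Python builtin.

```python
import io
from typing import Dict, Iterable, Iterator, TextIO


def tokenize(input_file: TextIO) -> Iterator[str]:
    """Split a file-like object into unparsed messages (one per non-empty line)."""
    for line in input_file:
        line = line.strip()
        if line:
            yield line


def deserialize_message(unparsed_message: str) -> Dict[str, str]:
    """Parse one unparsed message into a field dictionary."""
    return dict(pair.split("=", 1) for pair in unparsed_message.split(";"))


def serialize_message(parsed_message: Dict[str, str]) -> str:
    """Turn a parsed message back into its unparsed form."""
    return ";".join(f"{key}={value}" for key, value in parsed_message.items())


def merge(unparsed_messages: Iterable[str], output: TextIO) -> None:
    """Write unparsed messages to a file-like object, one per line."""
    for message in unparsed_messages:
        output.write(message + "\n")


source = io.StringIO("mmsi=431300061;heading=344.0\n\nmmsi=123456789;heading=10.0\n")
sink = io.StringIO()
merge((serialize_message(deserialize_message(t)) for t in tokenize(source)), sink)
print(sink.getvalue())
```

Using generators throughout means a round trip streams one message at a time rather than loading the whole file.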

Step 3: Conversion Graph
[Figure: conversion graph, built up over several slides. Nodes: GNM, NMEA4 (NM4) and DOF, each linked to an EEA node; the final panel shows only GNM and NMEA4.]

Generating the converter
Now that we have a graph, to make a converter just compose the functions
on the edges of the shortest path:
NMEA_v4_payload = merge(serialize(nop(deserialize(tokenize(GNM_input)))))
Function composition in Python:
https://mathieularose.com/function-composition-in-python/#solution
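
One possible way to write the composition with functools.reduce. This compose helper applies its functions left to right, in the same order as the edges along the path; the linked article's version may order its arguments differently, so treat this as an illustration rather than that article's code.

```python
from functools import reduce
from typing import Any, Callable


def compose(*functions: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Return a function that applies `functions` in order, left to right.

    compose(f, g, h)(x) is h(g(f(x))).
    """
    return lambda value: reduce(lambda acc, fn: fn(acc), functions, value)


# Stand-ins for the edge functions along a path.
shout = compose(str.strip, str.upper, lambda s: s + "!")
print(shout("  ais "))  # AIS!
```

With no functions at all, compose() returns the identity function, which is convenient when source and target are the same node.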

NetworkX
https://networkx.github.io/
3-clause BSD license
I’ve used it heavily and have no complaints about it

Building the graph
conversions.add_edge(("NMEA", "4", "PAYLOAD"), ("NMEA", "4", "UNPARSED"), function=tokenize)
conversions.add_edge(("NMEA", "4", "UNPARSED"), ("NMEA", "4", "EEA_OBJ"), function=deserialize)
conversions.add_edge(("NMEA", "4", "EEA_OBJ"), ("NMEA", "4", "UNPARSED"), function=serialize)
conversions.add_edge(("NMEA", "4", "UNPARSED"), ("NMEA", "4", "PAYLOAD"), function=merge)
…
conversions.add_edge(("NMEA", "4", "EEA_OBJ"), ("GNM", "3.1", "EEA_OBJ"), function=lambda x: x)  # nop
…

Example usage
NM4_to_GNM = get_converter(("NMEA", "4", "PAYLOAD"), ("GNM", "3.1", "PAYLOAD"))
with open("my_nmea_v4_file.nm4", 'rb') as fin:
    with open("my_converted_file.gnm", 'wb') as fout:
        fout.write(NM4_to_GNM(fin))

Prevention of lossy conversions
Create 2 different conversion graphs:
1. Only lossless conversions: “FORWARD_FORMAT_CONVERSIONS”
2. Add lossy conversions: “ALL_FORMAT_CONVERSIONS”
Use the lossless graph by default, and make users explicitly ask for lossy conversions.
If the user asks for a lossy conversion without being explicit, there will be no path in the “FORWARD_FORMAT_CONVERSIONS” graph. The library can then check for a path in “ALL_FORMAT_CONVERSIONS” and give them a nice error message:
“No lossless path from NMEAv4 to KML. If you want to perform a
lossy conversion, you must explicitly allow lossy conversions.”
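
The two-graph check might look roughly like this. For brevity the sketch uses plain format names as nodes and adjacency dictionaries instead of the (format, version, stage) tuples and NetworkX graphs; has_path, check_conversion and LossyConversionError are hypothetical names.

```python
class LossyConversionError(Exception):
    """Raised when only a lossy path exists and the caller did not opt in."""


def has_path(graph, source, target):
    """Depth-first reachability check over an adjacency dictionary."""
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return False


# Lossless edges only.
FORWARD_FORMAT_CONVERSIONS = {
    "NMEAv4": {"EEA"},
    "GNM": {"EEA"},
    "EEA": {"NMEAv4", "GNM"},
}
# The same edges plus the lossy EEA -> KML edge.
ALL_FORMAT_CONVERSIONS = {
    **FORWARD_FORMAT_CONVERSIONS,
    "EEA": {"NMEAv4", "GNM", "KML"},
}


def check_conversion(source, target, graph=FORWARD_FORMAT_CONVERSIONS):
    """Raise a helpful error if `graph` has no path from source to target."""
    if has_path(graph, source, target):
        return
    if graph is not ALL_FORMAT_CONVERSIONS and has_path(ALL_FORMAT_CONVERSIONS, source, target):
        raise LossyConversionError(
            f"No lossless path from {source} to {target}. If you want to perform a "
            "lossy conversion, you must explicitly allow lossy conversions."
        )
    raise LookupError(f"No conversion path from {source} to {target}.")


check_conversion("NMEAv4", "GNM")                                # fine: lossless
check_conversion("NMEAv4", "KML", graph=ALL_FORMAT_CONVERSIONS)  # fine: explicit opt-in
```

Passing the graph as a parameter with the lossless one as the default is what makes the lossy case an explicit decision at the call site.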

Extra parameters
Sometimes conversions need additional information
Example:
Conversion from DOFv3 -> DOFv4 requires a timestamp

Extra parameters
1. Mark edges as having required parameters:
conversions.add_edge(
    ("DOF", "3", "PAYLOAD"),
    ("DOF", "3", "UNPARSED"),
    function=tokenize,
    required_params={"timestamp"}
)
2. Allow the user to supply arbitrary keyword arguments to get_converter():
get_converter(
    ("DOF", "3", "PAYLOAD"),
    ("DOF", "4", "PAYLOAD"),
    timestamp=get_datetime_for_id(id)
)
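
A sketch of how the required parameters could be checked and bound once the path is known. It assumes each edge function accepts its required parameters as keyword arguments; build_converter, tokenize_dof3 and render are hypothetical names, and the path is passed in directly as a list of (function, required_params) pairs to keep the example short.

```python
from functools import partial, reduce


def build_converter(edges, **kwargs):
    """Compose edge functions, binding each edge's required parameters.

    `edges` is a list of (function, required_params) pairs along a chosen path.
    """
    required = set().union(*(params for _, params in edges))
    missing = required - set(kwargs)
    if missing:
        raise TypeError(f"missing required parameters: {sorted(missing)}")
    # Give each function only the keyword arguments it declared.
    steps = [
        partial(function, **{name: kwargs[name] for name in params})
        for function, params in edges
    ]
    return lambda payload: reduce(lambda acc, step: step(acc), steps, payload)


def tokenize_dof3(payload, timestamp):
    """Toy tokenizer: the format has no timestamps, so one is supplied."""
    return [(timestamp, line) for line in payload.splitlines()]


def render(tokens):
    return "\n".join(f"{timestamp} {line}" for timestamp, line in tokens)


path_edges = [(tokenize_dof3, {"timestamp"}), (render, set())]
convert = build_converter(path_edges, timestamp="2017-11-27T00:00:00Z")
print(convert("a\nb"))
```

Checking for missing parameters when the converter is built, rather than when it is first called, surfaces the mistake before any data has been read.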

Final API
get_converter(
source_schema,
target_schema,
graph=FORWARD_FORMAT_CONVERSIONS,
**kwargs
)

Benefits
• Centralizes the conversion code
• Fewer bugs, more performant
• Simplifies the code + less duplication
• Don’t need to know all of the input formats a priori
• Dynamic generation of converters
• Reduces chance of accidental lossy conversions

Summary
• We had a problem with multiple formats and converters between them
• By modelling it as a graph problem, it was easy to dynamically generate
converters
• This allowed for greater flexibility, greater safety
• When you have a web of conversion steps, you can use graph traversal
libraries to generate the shortest path to get the answers you want.

Thanks
Questions?
I’m Michael Overmeyer:
@movermeyer on every platform
