FormatConversion
A handy pattern for format conversions
2017-11-27
Overview
• At exactEarth, we deal with data in a lot of different formats.
• We had problems with a proliferation of converters.
• I’m going to discuss a pattern we developed for dealing with situations like
this that turned out well.
What does exactEarth do?
“We track all of the world’s ships using satellites.”
Automatic Identification System (AIS)
• Designed during the 1990s
• Adopted as a standard in 2002
• Very High Frequency (VHF) radio transmissions
• 27 different types of messages transmitted
Message Types 1, 2, 3
• Maritime Mobile Service Identity (MMSI)
• Location
• Speed over ground
• Course over ground
• Heading
• Rate of Turn
Message Type 5
• Maritime Mobile Service Identity (MMSI)
• Name
• IMO Number
• Callsign
• Dimensions of the ship
• Destination and ETA
Effective January 1, 2005, AIS transceivers are required on:
• All ships of 300 gross tonnage and upwards engaged
on international voyages
• All cargo ships of 500 gross tonnage and upwards
not engaged on international voyages
• All passenger ships irrespective of size.
AIS transceivers must be on at all times
(with some limited exceptions)
Top pictures: gross tonnage of 300
Right picture: gross tonnage of 500
Many ways to store AIS messages
• NMEA v3, v4
• GNM v3.1
• Internal Binary formats (with several versions)
• “Adapted” formats (several variations):
• CSV
• XML
• JSON
• KML
• OTH-Gold
• Many third-party and “one-off” formats
Many representations of the same data
Conversions between formats
In order to ingest data from third parties, and to satisfy customer demands
for data in a particular format, we need to be able to convert between all the
formats
Lossy vs. Lossless Conversions
Some conversions are lossless:
• For example, both NMEAv4 and GNM v3.1 capture all the same data.
But some are lossy, meaning that data is lost in the conversion:
• For example, NMEAv4 to KML
• KML doesn’t have all of the fields that AIS-specific formats do
Lossless Conversion: GNM and NMEAv4
GNM:
$PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20
!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
NMEAv4:
s:104,c:1315505346300*0E!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
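Both carry the same timestamp (2011-09-08 18:09:06.300 UTC), the same source identifier (104) and the same AIS payload.
The other fields are either format syntax, checksums, or some trivial additional fields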
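As a quick check that the two headers in the example really describe the same moment and source, the sketch below compares them. The field layout is an assumption read off the example values, not taken from a format specification: the `$PGHP` fields after the leading `1` are treated as year, month, day, hour, minute, second and millisecond, and `c:` is treated as a Unix timestamp in milliseconds.

```python
import calendar

gnm_header = "$PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20"
nmea4_header = "s:104,c:1315505346300*0E"

# Assumed layout: $PGHP,<type>,<year>,<month>,<day>,<hour>,<minute>,<second>,<ms>,,<source>,...
fields = gnm_header.split("*")[0].split(",")
year, month, day, hour, minute, second, millis = (int(f) for f in fields[2:9])
gnm_source = fields[10]

# calendar.timegm interprets the tuple as UTC and returns whole seconds.
gnm_epoch_ms = calendar.timegm((year, month, day, hour, minute, second)) * 1000 + millis

# Assumed layout: comma-separated key:value pairs, then *<checksum>.
tags = dict(pair.split(":") for pair in nmea4_header.split("*")[0].split(","))

same_timestamp = gnm_epoch_ms == int(tags["c"])
same_source = gnm_source == tags["s"]
```

If both flags are true, a converter between the two formats only has to rewrite syntax and checksums, which is what makes the conversion lossless.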
Lossy Conversion NMEAv4 -> KML
Message Type 1:
• MMSI (identifier)
• Timestamp
• Longitude/Latitude
• Heading
• Navigation Status
• Rate of Turn
• Speed Over Ground
• Position Accuracy
• Course over Ground
• …
KML:
<Placemark>
<name>431300061</name>
<TimeStamp><when>2011-09-08T18:09:06Z</when></TimeStamp>
<Point><coordinates>140.08116666666666,35.55616666666667</coordinates></Point>
<Style><IconStyle>
<Icon>
<href>http://maps.google.com/mapfiles/kml/shapes/track.png</href>
<w>64</w><h>64</h>
</Icon><color>ffff0000</color>
<heading>344.0</heading>
</IconStyle>
</Style>
</Placemark>
Problem: Proliferation of Converters
• Code Duplication
• Bug prone, not performant
• Testing + optimization efforts were strained by so many
implementations
• Not flexible
• If a component consumes GNM today, it is hard to add the ability to consume NMEA
• Inadvertent use of lossy conversions
Step 1: One format to rule them all
We created a new format: EEA
Built for AIS, faithfully reflects the spec.
Extension fields for format-specific metadata
Side benefit: Multi-type fields
Some fields in the AIS spec are multi-typed
Example: Speed over Ground (10 bits, 0-1023)
From the spec:
“Speed over ground in 1/10 knot steps (0-102.2 knots)
1023 = not available, 1022 = 102.2 knots or higher”
Developers were often performing mathematical operations on these fields without handling the special values (!)
In EEA, we made the type of such fields:
Either[double, NOT_AVAILABLE, SPEED_102_POINT_2_KNOTS_OR_HIGHER]
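A minimal sketch of the idea in Python, not the actual EEA definition: the names `SpeedSentinel`, `decode_speed_over_ground` and `average_speed` are hypothetical, and `Union` with an `Enum` stands in for the `Either` type on the slide. Only the numeric rules (1/10 knot steps, 1022, 1023) come from the quoted spec text.

```python
from enum import Enum
from typing import Iterable, Optional, Union


class SpeedSentinel(Enum):
    NOT_AVAILABLE = 1023
    SPEED_102_POINT_2_KNOTS_OR_HIGHER = 1022


SpeedOverGround = Union[float, SpeedSentinel]


def decode_speed_over_ground(raw: int) -> SpeedOverGround:
    """Decode the 10-bit Speed over Ground field into knots or a sentinel."""
    if not 0 <= raw <= 1023:
        raise ValueError(f"SOG must fit in 10 bits, got {raw}")
    if raw in (1022, 1023):
        return SpeedSentinel(raw)
    return raw / 10.0  # 1/10 knot steps


def average_speed(values: Iterable[SpeedOverGround]) -> Optional[float]:
    """Average only real speeds; sentinels are skipped instead of being summed."""
    speeds = [v for v in values if isinstance(v, float)]
    return sum(speeds) / len(speeds) if speeds else None
```

Because a sentinel is no longer a number, code that tries to add it to a speed fails immediately instead of quietly producing a value such as 102.3 knots.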
Step 2: Standardized low-level API
• tokenize(input:file_like)
• deserialize_message(unparsed_message)
• serialize_message(parsed_message)
• merge(Iterable[unparsed_message], output:file_like)
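To make the four functions concrete, here is a sketch for a made-up line-based format ("mmsi,name", one message per line). The format, the dict standing in for an EEA object and the vessel names are all assumptions for illustration; only the function names and parameters come from the slide.

```python
import io
from typing import Dict, Iterable, Iterator


def tokenize(input_file) -> Iterator[bytes]:
    """Split a file-like payload into unparsed messages (here: non-empty lines)."""
    for line in input_file:
        line = line.strip()
        if line:
            yield line


def deserialize_message(unparsed_message: bytes) -> Dict[str, str]:
    """Parse one unparsed message into an in-memory object."""
    mmsi, name = unparsed_message.decode("ascii").split(",")
    return {"mmsi": mmsi, "name": name}


def serialize_message(parsed_message: Dict[str, str]) -> bytes:
    """Turn an in-memory object back into one unparsed message."""
    return f"{parsed_message['mmsi']},{parsed_message['name']}".encode("ascii")


def merge(unparsed_messages: Iterable[bytes], output) -> None:
    """Write unparsed messages to a file-like object, one per line."""
    for message in unparsed_messages:
        output.write(message + b"\n")


source = io.BytesIO(b"431300061,EXAMPLE\n\n123456789,OTHER\n")
target = io.BytesIO()
merge((serialize_message(deserialize_message(m)) for m in tokenize(source)), target)
```

Every format that implements the same four functions can be plugged into the same pipeline, which is what makes the later graph construction possible.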
Step 3: Conversion Graph
(Diagram, built up over several slides: nodes for GNM, NMEA4 and DOF, each with an EEA node; the last frames show only GNM and NMEA4.)
Generating the converter
Now that we have a graph, to make a converter just compose the functions
on the edges of the shortest path:
NMEA_v4_payload = merge(serialize(nop(deserialize(tokenize(GNM_input)))))
Function composition in Python:
https://mathieularose.com/function-composition-in-python/#solution
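One common way to write the composition helper uses `functools.reduce`. This is a sketch, not necessarily the version the talk used; the five step functions are stand-ins that only record the order in which they run.

```python
from functools import reduce


def compose(*functions):
    """compose(f, g, h)(x) == f(g(h(x))): the rightmost function runs first."""
    return reduce(lambda f, g: lambda x: f(g(x)), functions, lambda x: x)


def step(name):
    """Stand-in conversion step that appends its name to a trace list."""
    return lambda trace: trace + [name]


tokenize, deserialize, nop, serialize, merge = (
    step(n) for n in ["tokenize", "deserialize", "nop", "serialize", "merge"]
)

convert = compose(merge, serialize, nop, deserialize, tokenize)
trace = convert([])
```

The resulting `trace` lists the steps in the order they were applied, starting with `tokenize` and ending with `merge`, matching the nested call on the slide.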
NetworkX
https://networkx.github.io/
3-clause BSD license
I’ve used it heavily and have no complaints with it
Building the graph
import networkx as nx

# Assumption: a directed graph, since each edge function converts in one direction only.
conversions = nx.DiGraph()
conversions.add_edge(("NMEA", "4", "PAYLOAD"), ("NMEA", "4", "UNPARSED"), function=tokenize)
conversions.add_edge(("NMEA", "4", "UNPARSED"), ("NMEA", "4", "EEA_OBJ"), function=deserialize)
conversions.add_edge(("NMEA", "4", "EEA_OBJ"), ("NMEA", "4", "UNPARSED"), function=serialize)
conversions.add_edge(("NMEA", "4", "UNPARSED"), ("NMEA", "4", "PAYLOAD"), function=merge)
…
conversions.add_edge(("NMEA", "4", "EEA_OBJ"), ("GNM", "3.1", "EEA_OBJ"), function=lambda x: x)  # nop
…
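The slides do not show `get_converter` itself, so the following is only a guess at its shape. To stay dependency-free it swaps NetworkX for a plain dict of dicts and a hand-written breadth-first search; with NetworkX the path lookup would be `nx.shortest_path(graph, source, target)`. The edge functions are toy string operations, not real tokenizers.

```python
from collections import deque
from functools import reduce


def add_edge(graph, source, target, function):
    """Hypothetical stand-in for the graph: {source_node: {target_node: function}}."""
    graph.setdefault(source, {})[target] = function
    graph.setdefault(target, {})


def shortest_path(graph, source, target):
    """Breadth-first search returning the node list from source to target."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    raise LookupError(f"no path from {source} to {target}")


def get_converter(graph, source, target):
    path = shortest_path(graph, source, target)
    functions = [graph[a][b] for a, b in zip(path, path[1:])]
    # Apply the edge functions in path order.
    return lambda value: reduce(lambda acc, f: f(acc), functions, value)


conversions = {}
add_edge(conversions, ("NMEA", "4", "PAYLOAD"), ("NMEA", "4", "UNPARSED"), str.split)
add_edge(conversions, ("NMEA", "4", "UNPARSED"), ("NMEA", "4", "EEA_OBJ"),
         lambda tokens: [t.upper() for t in tokens])
add_edge(conversions, ("NMEA", "4", "EEA_OBJ"), ("GNM", "3.1", "EEA_OBJ"), lambda x: x)  # nop
add_edge(conversions, ("GNM", "3.1", "EEA_OBJ"), ("GNM", "3.1", "UNPARSED"),
         lambda objs: [o.lower() for o in objs])
add_edge(conversions, ("GNM", "3.1", "UNPARSED"), ("GNM", "3.1", "PAYLOAD"), ";".join)

convert = get_converter(conversions, ("NMEA", "4", "PAYLOAD"), ("GNM", "3.1", "PAYLOAD"))
result = convert("a b")
```

Adding a new format then means adding its four edges plus one nop edge to an existing EEA node; every converter to and from it falls out of the path search.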
Example usage
NM4_to_GNM = get_converter(("NMEA", "4", "PAYLOAD"), ("GNM", "3.1", "PAYLOAD"))
with open("my_nmea_v4_file.nm4", 'rb') as fin:
    with open("my_converted_file.gnm", 'wb') as fout:
        fout.write(NM4_to_GNM(fin))
Prevention of lossy conversions
Create 2 different conversion graphs:
1. Only lossless conversions: “FORWARD_FORMAT_CONVERSIONS”
2. Add lossy conversions: “ALL_FORMAT_CONVERSIONS”
Use lossless graph by default, make users explicitly ask to use lossy
conversions
If the user asks for a lossy conversion without being explicit, there will be
no path in the “FORWARD_FORMAT_CONVERSIONS” graph. The library can then check for a
path in “ALL_FORMAT_CONVERSIONS” and give them a nice error message:
“No lossless path from NMEAv4 to KML. If you want to perform a
lossy conversion, you must explicitly allow lossy conversions.”
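The two-graph check could look roughly like this. The adjacency lists are toy data with placeholder node names, and `has_path`, `check_conversion` and `LossyConversionError` are hypothetical names; only the two graph names and the error text come from the slide.

```python
from collections import deque


class LossyConversionError(Exception):
    """Only a lossy path exists and lossy conversions were not requested."""


def has_path(graph, source, target):
    """Breadth-first reachability check on an adjacency-list graph."""
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


# Toy graphs: the lossy graph is the lossless one plus the NMEAv4 -> KML edge.
FORWARD_FORMAT_CONVERSIONS = {"NMEAv4": ["GNM"], "GNM": ["NMEAv4"]}
ALL_FORMAT_CONVERSIONS = {"NMEAv4": ["GNM", "KML"], "GNM": ["NMEAv4"]}


def check_conversion(source, target, graph=FORWARD_FORMAT_CONVERSIONS):
    if has_path(graph, source, target):
        return
    if has_path(ALL_FORMAT_CONVERSIONS, source, target):
        raise LossyConversionError(
            f"No lossless path from {source} to {target}. If you want to perform a "
            "lossy conversion, you must explicitly allow lossy conversions."
        )
    raise LookupError(f"No path from {source} to {target}.")
```

Passing `graph=ALL_FORMAT_CONVERSIONS` is the explicit opt-in: the same call that raised `LossyConversionError` with the default graph then succeeds.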
Extra parameters
Sometimes conversions need additional information
Example:
Conversion from DOFv3 -> DOFv4 requires a timestamp
1. Mark edges as having required parameters:
conversions.add_edge(
    ("DOF", "3", "PAYLOAD"),
    ("DOF", "3", "UNPARSED"),
    function=tokenize,
    required_params={"timestamp"},
)
2. Allow the user to supply arbitrary keyword arguments to get_converter():
get_converter(
    ("DOF", "3", "PAYLOAD"),
    ("DOF", "4", "PAYLOAD"),
    timestamp=get_datetime_for_id(id),
)
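How the keyword arguments reach the edge functions is not shown on the slides. One plausible approach, sketched here with hypothetical helpers (`collect_required_params`, `bind_params`, `stamp`) and edges modelled as plain dicts, is to check the path's requirements up front and bind the values with `functools.partial`.

```python
from functools import partial


def collect_required_params(edges):
    """Union of the required_params declared on the edges of a path."""
    required = set()
    for edge in edges:
        required |= edge.get("required_params", set())
    return required


def bind_params(edges, **kwargs):
    """Return the edge functions with their required keyword arguments bound."""
    missing = collect_required_params(edges) - set(kwargs)
    if missing:
        raise TypeError(f"missing required parameters: {sorted(missing)}")
    bound = []
    for edge in edges:
        wanted = {name: kwargs[name] for name in edge.get("required_params", set())}
        bound.append(partial(edge["function"], **wanted) if wanted else edge["function"])
    return bound


def stamp(messages, timestamp):
    """Toy step that needs extra information: attach a timestamp to each message."""
    return [(timestamp, m) for m in messages]


edges = [
    {"function": str.split},
    {"function": stamp, "required_params": {"timestamp"}},
]

value = "a b"
for step in bind_params(edges, timestamp=1315505346):
    value = step(value)
```

Checking for missing parameters before any data is read means the caller finds out about a forgotten `timestamp` when the converter is built, not halfway through a file.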
Final API
get_converter(
    source_schema,
    target_schema,
    graph=FORWARD_FORMAT_CONVERSIONS,
    **kwargs
)
Benefits
• Centralizes the conversion code
• Fewer bugs, more performant
• Simplifies the code + less duplication
• Don’t need to know all of the input formats a priori
• Dynamic generation of converters
• Reduces chance of accidental lossy conversions
Summary
• We had a problem with multiple formats and converters between them
• By modelling it as a graph problem, it was easy to dynamically generate
converters
• This allowed for greater flexibility, greater safety
• When you have a web of conversion steps, you can use graph traversal
libraries to generate the shortest path to get the answers you want.
Thanks
Questions?
I’m Michael Overmeyer:
@movermeyer on every platform

More Related Content

Similar to FormatConversion

NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...
NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...
NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...
The HDF-EOS Tools and Information Center
 
Distributed Logging Architecture in the Container Era
Distributed Logging Architecture in the Container EraDistributed Logging Architecture in the Container Era
Distributed Logging Architecture in the Container Era
Glenn Davis
 
Distributed Logging Architecture in Container Era
Distributed Logging Architecture in Container EraDistributed Logging Architecture in Container Era
Distributed Logging Architecture in Container Era
SATOSHI TAGOMORI
 
An introduction to Apache Camel
An introduction to Apache CamelAn introduction to Apache Camel
An introduction to Apache Camel
Kapil Kumar
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
Globus
 
Charting New Waters: Data Integration Excellence for Port & Marine Operations
Charting New Waters: Data Integration Excellence for Port & Marine OperationsCharting New Waters: Data Integration Excellence for Port & Marine Operations
Charting New Waters: Data Integration Excellence for Port & Marine Operations
marketing932765
 
CDC to the Max!
CDC to the Max!CDC to the Max!
CDC to the Max!
Bronco Oostermeyer
 
Biztalk ESB Toolkit Introduction
Biztalk ESB Toolkit IntroductionBiztalk ESB Toolkit Introduction
Biztalk ESB Toolkit Introduction
Saffi Ali
 
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainReconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
MDC_UNICA
 
server-side-fusion-vts
server-side-fusion-vtsserver-side-fusion-vts
server-side-fusion-vts
Ladislav Horký
 
Migrating 500 Nodes from Rackspace to Google Cloud with Zero Downtime
Migrating 500 Nodes from Rackspace to Google Cloud with Zero DowntimeMigrating 500 Nodes from Rackspace to Google Cloud with Zero Downtime
Migrating 500 Nodes from Rackspace to Google Cloud with Zero Downtime
Paul Chandler
 
Using FME Server and Engines to Convert Large Amounts of Data
Using FME Server and Engines to Convert Large Amounts of DataUsing FME Server and Engines to Convert Large Amounts of Data
Using FME Server and Engines to Convert Large Amounts of Data
Safe Software
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase
Carol McDonald
 
PPT_Deploying_Exchange_Server.pdf.pdf
PPT_Deploying_Exchange_Server.pdf.pdfPPT_Deploying_Exchange_Server.pdf.pdf
PPT_Deploying_Exchange_Server.pdf.pdf
TrngTn67
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
Globus
 
SmartMet Server OSGeo
SmartMet Server OSGeoSmartMet Server OSGeo
SmartMet Server OSGeo
Roope Tervo
 
Telnet and FTP.ppt
Telnet and FTP.pptTelnet and FTP.ppt
Telnet and FTP.ppt
ssuser1774d3
 
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year Journey
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year JourneyNagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year Journey
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year Journey
Nagios
 
Evolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsEvolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.js
Steve Jamieson
 
Como definir un esquema de direcciones IPv6
Como definir un esquema de direcciones IPv6Como definir un esquema de direcciones IPv6
Como definir un esquema de direcciones IPv6Edgardo Scrimaglia
 

Similar to FormatConversion (20)

NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...
NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...
NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...
 
Distributed Logging Architecture in the Container Era
Distributed Logging Architecture in the Container EraDistributed Logging Architecture in the Container Era
Distributed Logging Architecture in the Container Era
 
Distributed Logging Architecture in Container Era
Distributed Logging Architecture in Container EraDistributed Logging Architecture in Container Era
Distributed Logging Architecture in Container Era
 
An introduction to Apache Camel
An introduction to Apache CamelAn introduction to Apache Camel
An introduction to Apache Camel
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
 
Charting New Waters: Data Integration Excellence for Port & Marine Operations
Charting New Waters: Data Integration Excellence for Port & Marine OperationsCharting New Waters: Data Integration Excellence for Port & Marine Operations
Charting New Waters: Data Integration Excellence for Port & Marine Operations
 
CDC to the Max!
CDC to the Max!CDC to the Max!
CDC to the Max!
 
Biztalk ESB Toolkit Introduction
Biztalk ESB Toolkit IntroductionBiztalk ESB Toolkit Introduction
Biztalk ESB Toolkit Introduction
 
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainReconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
 
server-side-fusion-vts
server-side-fusion-vtsserver-side-fusion-vts
server-side-fusion-vts
 
Migrating 500 Nodes from Rackspace to Google Cloud with Zero Downtime
Migrating 500 Nodes from Rackspace to Google Cloud with Zero DowntimeMigrating 500 Nodes from Rackspace to Google Cloud with Zero Downtime
Migrating 500 Nodes from Rackspace to Google Cloud with Zero Downtime
 
Using FME Server and Engines to Convert Large Amounts of Data
Using FME Server and Engines to Convert Large Amounts of DataUsing FME Server and Engines to Convert Large Amounts of Data
Using FME Server and Engines to Convert Large Amounts of Data
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase
 
PPT_Deploying_Exchange_Server.pdf.pdf
PPT_Deploying_Exchange_Server.pdf.pdfPPT_Deploying_Exchange_Server.pdf.pdf
PPT_Deploying_Exchange_Server.pdf.pdf
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
 
SmartMet Server OSGeo
SmartMet Server OSGeoSmartMet Server OSGeo
SmartMet Server OSGeo
 
Telnet and FTP.ppt
Telnet and FTP.pptTelnet and FTP.ppt
Telnet and FTP.ppt
 
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year Journey
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year JourneyNagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year Journey
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year Journey
 
Evolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsEvolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.js
 
Como definir un esquema de direcciones IPv6
Como definir un esquema de direcciones IPv6Como definir un esquema de direcciones IPv6
Como definir un esquema de direcciones IPv6
 

Recently uploaded

Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
abdulrafaychaudhry
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
ShamsuddeenMuhammadA
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Yara Milbes
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 

Recently uploaded (20)

Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 

FormatConversion

  • 1. FormatConversion A handy pattern for format conversions 2017-11-27
  • 2. Overview • At exactEarth, we deal with data in a lot of different formats. • We had problems with a proliferation of converters. • I’m going to discuss a pattern we developed for dealing with situations like this that turned out well.
  • 3. What does exactEarth do? “We track all of the world’s ships using satellites.”
  • 4.
  • 6. Automatic Identification System (AIS) • Designed during the 1990s • Adopted as a standard in 2002 • Very High Frequency (VHF) radio transmissions • 27 different types of messages transmitted
  • 7. Maritime Mobile Service Identity (MMSI) Location Speed over ground Course over ground Heading Rate of Turn Message Types 1, 2, 3 • MaritimeMobile ServiceIdentity (MMSI) • Name • IMO Number • Callsign • Dimensions of the ship • Destination and ETA Message Type 5
  • 8. Effective January 1 2005, AIS transceivers are required by: • All ships of 300 gross tonnage and upwards engaged on international voyages • All cargo ships of 500 gross tonnage and upwards not engaged on international voyages • All passenger ships irrespective of size. AIS transceivers must be on at all times (with some limited exceptions)
  • 9. Top pictures: gross tonnage of 300 Right picture: gross tonnage of 500
  • 10. Many ways to store AIS messages • NMEA v3, v4 • GNM v3.1 • Internal Binary formats (with several versions) • “Adapted” formats (several variations): • CSV • XML • JSON • KML • OTH-Gold • Many third-party and “one-off” formats
  • 11. Many ways to store AIS messages • NMEA v3, v4 • GNM v3.1 • Internal Binary formats (with several versions) • “Adapted” formats (several variations): • CSV • XML • JSON • KML • OTH-Gold • Many third-party and “one-off” formats Many representations of the same data
  • 12. Conversions between formats In order to ingest data from third parties, and to satisfy customer demands for data in a particular format, we need to be able to convert between all the formats
  • 13. Lossy vs. Lossless Conversions Some conversions are lossless: • For example, both NMEAv4 and GNM v3.1 capture all the same data. But some are lossy, meaning that data is lost in the conversion: • For example, NMEAv4 to KML • KML doesn’t have all of the fields that AIS-specific formats do
  • 14. Lossless Conversion: GNM and NMEAv4 GNM: $PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20 !AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
  • 15. Lossless Conversion: GNM and NMEAv4 GNM: $PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20 !AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00 NMEAv4: s:104,c:1315505346300*0E!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
  • 16. Lossless Conversion: GNM and NMEAv4 GNM: $PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20 !AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00 NMEAv4: s:104,c:1315505346300*0E!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
  • 17. Lossless Conversion: GNM and NMEAv4 GNM: $PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20 !AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00 NMEAv4: s:104,c:1315505346300*0E!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00 The other fields are either format syntax, checksums, or some trivial additional fields
  • 18. Lossy Conversion NMEAv4 -> KML Message Type 1: • MMSI (identifier) • Timestamp • Longitude/Latitude • Heading • Navigation Status • Rate of Turn • Speed Over Ground • Position Accuracy • Course over Ground • …
  • 19. Lossy Conversion NMEAv4 -> KML Message Type 1: • MMSI (identifier) • Timestamp • Longitude/Latitude • Heading • Navigation Status • Rate of Turn • Speed Over Ground • Position Accuracy • Course over Ground • … KML: <Placemark> <name>431300061</name> <TimeStamp><when>2011-09-08T18:09:06Z</when></TimeStamp> <Point><coordinates>140.08116666666666,35.55616666666667</coordinates></Point> <Style><IconStyle> <Icon> <href>http://maps.google.com/mapfiles/kml/shapes/track.png</href> <w>64</w><h>64</h> </Icon><color>ffff0000</color> <heading>344.0</heading> </IconStyle> </Style> </Placemark>
  • 20. Problem: Proliferation of Converters • Code Duplication • Bug prone, not performant • Testing + optimization efforts were strained by so many implementations • Not flexible • If a component consumes GNM today, it was hard to add the ability to consume NMEA • Inadvertent use of lossy conversions
  • 21. Step 1: One format to rule them all We created a new format: EEA Built for AIS, faithfully reflects the spec. Extension fields for format-specific metadata
  • 22. Side benefit: Multi-type fields Some fields in the AIS spec are multi-typed Example: Speed over Ground (10 bits, 0-1023) From the spec: “Speed over ground in 1/10 knot steps (0-102.2 knots) 1023 = not available, 1022 = 102.2 knots or higher” Developers were often performing mathematical operations on the fields (!) In EEA, we made the types of this fields: Either[double, NOT_AVAILABLE, SPEED_102_POINT_2_KNOTS_OR_HIGHER]
  • 23. Step 2: Standardized low-level API • tokenize(input:file_like) • deserialize_message(unparsed_message) • serialize_message(parsed_message) • merge(Iterable[unparsed_message], output:file_like)
  • 24. Step 3: Conversion Graph GNM NM4 DOF
  • 25. Step 3: Conversion Graph GNM NM4 DOF EEA
  • 26. Step 3: Conversion Graph GNM NMEA4 DOF EEA EEA EEA
  • 27. Step 3: Conversion Graph GNM DOF EEA EEA EEA NMEA4
  • 28. Step 3: Conversion Graph GNM NMEA4
  • 29. Step 3: Conversion Graph GNM NMEA4
  • 30. Step 3: Conversion Graph GNM NMEA4
  • 31. Step 3: Conversion Graph GNM NMEA4
  • 32. Step 3: Conversion Graph GNM NMEA4
  • 33. Step 3: Conversion Graph GNM NMEA4
  • 34. Step 3: Conversion Graph GNM NMEA4
  • 35. Generating the converter Now that we have a graph, to make a converter just compose the functions on the edges of the shortest path: NMEA_v4_payload = merge(serialize(nop(deserialize(tokenize(GNM_input))))) Function composition in Python: https://mathieularose.com/function-composition-in-python/#solution
  • 36. NetworkX https://networkx.github.io/ 3-clause BSD license I’ve used it heavily and have no complaints with it
  • 37. Building the graph conversions.add_edge(("NMEA", "4", "PAYLOAD"), ("NMEA", "4", "UNPARSED"), function=tokenize) conversions.add_edge(("NMEA", "4", "UNPARSED"), ("NMEA", "4", "EEA_OBJ"), function=deserialize) conversions.add_edge(("NMEA", "4", "EEA_OBJ"), ("NMEA", "4", "UNPARSED"), function=serialize) conversions.add_edge(("NMEA", "4", "UNPARSED"), ("NMEA", "4", "PAYLOAD"), function=merge) … conversions.add_edge(("NMEA", "4", "EEA_OBJ"), ("GNM", "3.1", "EEA_OBJ"), function=lambda x: x) # nop …
  • 38. Example usage NM4_to_GNM = get_converter(("NMEA", "4", "PAYLOAD"), ("GNM", "3.1", "PAYLOAD")) with open("my_nmea_v4_file.nm4", 'rb') as fin: with open("my_converted_file.gnm", 'wb') as fout: fout.write(NM4_to_GNM(fin))
  • 39. Prevention of lossy conversions Create 2 different conversion graphs: 1. Only lossless conversions: “FORWARD_FORMAT_CONVERSIONS” 2. Add lossy conversions: “ALL_FORMAT_CONVERSIONS” Use lossless graph by default, make users explicitly ask to use lossy conversions If the user asks for a lossy conversion without being explicit, there will be no path in the “FORWARD_FORMAT_CONVERSIONS” graph. Library can check for a path in “ALL_FORMAT_CONVERSIONS” and give them a nice error message: “No lossless path from NMEAv4 to KML. If you want to perform a lossy conversion, you must explicitly allow lossy conversions.”
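The two-graph check might look something like the sketch below. The graph names and the error message come from the slide; everything else is assumed for illustration, including the simplified one-node-per-format graphs, the `choose_graph` and `has_path` helpers, the `allow_lossy` flag, and the `LossyConversionError` class.

```python
from collections import deque

# Simplified graphs: one node per format, edges as sets of reachable formats.
FORWARD_FORMAT_CONVERSIONS = {
    "NMEAv4": {"EEA"},
    "GNM": {"EEA"},
    "EEA": {"NMEAv4", "GNM"},
    "KML": set(),
}
# The lossy graph is the lossless graph plus the lossy edges.
ALL_FORMAT_CONVERSIONS = {
    node: set(targets) for node, targets in FORWARD_FORMAT_CONVERSIONS.items()
}
ALL_FORMAT_CONVERSIONS["EEA"].add("KML")


class LossyConversionError(Exception):
    """Raised when only a lossy path exists and it was not explicitly allowed."""


def has_path(graph, source, target):
    """Breadth-first reachability check."""
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def choose_graph(source, target, allow_lossy=False):
    """Return the graph to search, defaulting to the lossless one."""
    graph = ALL_FORMAT_CONVERSIONS if allow_lossy else FORWARD_FORMAT_CONVERSIONS
    if has_path(graph, source, target):
        return graph
    if not allow_lossy and has_path(ALL_FORMAT_CONVERSIONS, source, target):
        raise LossyConversionError(
            f"No lossless path from {source} to {target}. If you want to perform "
            "a lossy conversion, you must explicitly allow lossy conversions."
        )
    raise ValueError(f"no conversion path from {source} to {target}")
```

With these toy graphs, `choose_graph("NMEAv4", "KML")` raises the friendly error, while passing `allow_lossy=True` returns the larger graph.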
  • 40. Extra parameters Sometimes conversions need additional information Example: Conversion from DOFv3 -> DOFv4 requires a timestamp
  • 41. Extra parameters 1. Mark edges as having required parameters: conversions.add_edge( ("DOF", "3", "PAYLOAD"), ("DOF", "3", "UNPARSED"), function=tokenize, required_params=set(["timestamp"]) ) 2. Allow the user to supply arbitrary keyword arguments to get_converter(): get_converter( ("DOF", "3", "PAYLOAD"), ("DOF", "4", "PAYLOAD"), timestamp=get_datetime_for_id(id) )
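One way the required parameters could be checked and passed along when the converter is assembled is sketched here. It assumes the path has already been resolved into a list of `(function, required_params)` pairs read off the edges; `build_converter`, `split_records`, and `add_timestamp` are hypothetical names, and the slides do not show how `get_converter()` handles this internally.

```python
def build_converter(steps, **params):
    """Compose the steps along a path, checking required parameters up front.

    steps: list of (function, required_params) pairs, in path order.
    params: keyword arguments supplied by the caller, e.g. timestamp=...
    """
    required = set().union(*(needed for _, needed in steps))
    missing = required - set(params)
    if missing:
        raise TypeError(f"missing required conversion parameters: {sorted(missing)}")

    def convert(data):
        for function, needed in steps:
            # Each step receives only the parameters its edge declared.
            data = function(data, **{name: params[name] for name in needed})
        return data

    return convert


def split_records(payload):
    return [{"mmsi": mmsi} for mmsi in payload.split(",")]


def add_timestamp(records, timestamp):
    return [dict(record, timestamp=timestamp) for record in records]


steps = [(split_records, set()), (add_timestamp, {"timestamp"})]
convert = build_converter(steps, timestamp="2011-09-08T18:09:06Z")
```

Failing at build time rather than halfway through a conversion means a missing `timestamp` is reported before any data has been read.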
  • 43. Benefits • Centralizes the conversion code • Fewer bugs, more performant • Simplifies the code + less duplication • Don’t need to know all of the input formats a priori • Dynamic generation of converters • Reduces chance of accidental lossy conversions
  • 44. Summary • We had a problem with multiple formats and converters between them • By modelling it as a graph problem, it was easy to dynamically generate converters • This allowed for greater flexibility, greater safety • When you have a web of conversion steps, you can use graph traversal libraries to generate the shortest path to get the answers you want.

Editor's Notes

  1. Why? Environmental: reef protection, bilge water dumping, oil spills, but most importantly illegal fishing… Logistical: port authorities, logistics companies, scheduling. Security: surveillance, smuggling, piracy.
  2. As a ship captain, how do you prevent collisions with other vessels? People tend to jump immediately to SONAR/RADAR, but there are a few major problems with that: the equipment is expensive, the equipment requires a lot of power, and these systems are actually quite difficult to read. They require some skill to operate. The most common method was simply to visually observe the other ships and try to estimate their speed, course, heading, and acceleration. Then you would do the same for your vessel and figure out the calculus to determine if you are going to collide or not. Obviously, this is also tricky, and fails in situations like night time, stormy weather, or when you are going around a tight curve in a waterway and can’t see what’s coming at you. Ships don’t stop on a dime. In fact, some of these vessels take in the neighbourhood of 20 minutes to stop. So sometimes, you will have two ships that know they are going to collide well in advance of the collision, but they can’t turn or stop fast enough to do anything about it. So the problem of ship collisions is what triggered the creation of AIS.
  3. All vessels transmit a “Here I am! Please don’t hit me” to the other vessels in the area.
  4. Here are some of the fields that are transmitted. In the Type 1,2,3 messages, we have position information. These are transmitted every few seconds while the vessel is moving. In other message types, we have more static information that doesn’t change very often, like registration and destination and ETA.
  5. The first thing you do when you have a lot of formats: Create a new format! We created a new internal format that we called EEA. Not only could it hold all of the AIS message fields, but it also had “extension fields” where we would shove all the fields that might be specific to a format. For example, GNM has some metadata fields that are specific to GNM. We put those into a GNMMetadata field within the EEA spec. So now we have a format that can capture all of the complexity of all the AIS formats. It was written to be faithful to the AIS spec, which means that it handles the full complexity of the AIS spec.
  6. A side benefit of redesigning our format (and our in-memory representation) with EEA is that we got to fix some of the issues developers were accidentally creating. AIS is a complicated spec. For example, look at the definition for speed over ground. The field is 10 bits wide, giving 1024 possible values, and while most of those values can be interpreted as doubles, there are two reserved values that cannot. There is 1023 = Not Available, for when the ship doesn’t know how fast it is going. There is also the 1022 value, which means 102.2 knots or greater. The problem with all the ad hoc parsers is that their developers would often get the parsing of these fields wrong. They would often just parse the field as a double, not realizing that these special values existed, and then perform mathematical operations on the fields. So they would do things like average the values of the speed over ground, and you would end up with massive values when a large number of vessels were reporting “Not available”. With EEA, we fixed that, changing the type of the field to be either a double, or one of two special values. This prevents mathematical operations from being applied to the field, and forces the developer to stop and think about how they actually want to handle the math.
  7. The next step was to define a common interface of functions we apply when parsing all formats. We eventually settled on this interface. You start with a “PAYLOAD”, which is a collection of bytes representing messages, which might be the contents of a file for example. You then call tokenize(), which finds the boundaries of the messages within the payload and splits the payload on those boundaries. You still haven’t parsed the messages, so you don’t know what they say yet. You only know the bytes that make up each message. We called this “UNPARSED” in this diagram. You can then call deserialize, which actually parses the bytes of the message and gives you an in-memory representation of the message. Most commonly, this was the new EEA format. Then in the reverse direction, we take individual messages and call serialize() on them to return them to the UNPARSED tokens we had before. And finally we call merge(), which writes the messages, one after another, into a payload. So these four functions are pretty much universal to format parsing. They also form a graph, perhaps a sort of state machine, where the data parse levels are the nodes and the functions are the edges.
  8. We went and implemented these four functions for all of our data formats. And when you put all the conversion graphs beside one another, you notice something.
  9. The PARSED node for most of the formats is EEA.
  10. It’s the same format. Really, it’s all the same node. Conceptually, you could add edges between them with a NO-OP function.
  11. So now if you want to convert between GNM and NMEA4, you can just follow the edges from GNM PAYLOAD to NMEA4 PAYLOAD.
  12. And you suddenly have the conversion steps.
  13. In order to represent the graph in code, we need a graph library. I came across NetworkX and have had no complaints. It allows you to create nodes and edges in a graph, and then gives you an easy way to run algorithms like shortest path across graphs.
  14. Here’s what it looks like in code: We have our conversions object, which is just a NetworkX graph object. Then on each line, we add an edge between two of our parse levels. On each edge, we also supply the function to perform. Notice that the last line is an example where we jump between formats that are already parsed into the EEA in-memory representation. Therefore the function is a simple NO-OP.