FormatConversion
A handy pattern for format conversions
2017-11-27
Overview
• At exactEarth, we deal with data in a lot of different formats.
• We had problems with a proliferation of converters.
• I’m going to discuss a pattern we developed for dealing with situations like
this that turned out well.
What does exactEarth do?
“We track all of the world’s ships using satellites.”
Automatic Identification System (AIS)
• Designed during the 1990s
• Adopted as a standard in 2002
• Very High Frequency (VHF) radio transmissions
• 27 different types of messages transmitted
Message Types 1, 2, 3:
• Maritime Mobile Service Identity (MMSI)
• Location
• Speed over ground
• Course over ground
• Heading
• Rate of Turn
Message Type 5:
• Maritime Mobile Service Identity (MMSI)
• Name
• IMO Number
• Callsign
• Dimensions of the ship
• Destination and ETA
Effective January 1, 2005, AIS transceivers are required on:
• All ships of 300 gross tonnage and upwards engaged on international voyages
• All cargo ships of 500 gross tonnage and upwards not engaged on international voyages
• All passenger ships irrespective of size
AIS transceivers must be on at all times (with some limited exceptions)
Top pictures: gross tonnage of 300
Right picture: gross tonnage of 500
Many ways to store AIS messages
• NMEA v3, v4
• GNM v3.1
• Internal Binary formats (with several versions)
• “Adapted” formats (several variations):
• CSV
• XML
• JSON
• KML
• OTH-Gold
• Many third-party and “one-off” formats
Many representations of the same data
Conversions between formats
To ingest data from third parties, and to satisfy customer demands for data in
a particular format, we need to be able to convert between all the formats.
Lossy vs. Lossless Conversions
Some conversions are lossless:
• For example, both NMEAv4 and GNM v3.1 capture all the same data.
But some are lossy, meaning that data is lost in the conversion:
• For example, NMEAv4 to KML
• KML doesn’t have all of the fields that AIS-specific formats do
Lossless Conversion: GNM and NMEAv4
GNM:
$PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20
!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
NMEAv4:
s:104,c:1315505346300*0E!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
The other fields are either format syntax, checksums, or some trivial additional fields
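One way to convince yourself that this pair is lossless is to check that the timestamp survives. The GNM `$PGHP` line spells it out as separate date and time fields, while the NMEAv4 prefix carries a single `c:` value. The sketch below assumes, purely from the example above, that PGHP fields 2 to 8 are year, month, day, hour, minute, second and millisecond in UTC, and that `c:` is milliseconds since the Unix epoch. `pghp_to_epoch_ms` is a hypothetical helper, not part of any real library.

```python
from datetime import datetime, timezone


def pghp_to_epoch_ms(sentence: str) -> int:
    """Turn the date/time fields of a $PGHP line into epoch milliseconds.

    Field positions are an assumption inferred from the single example
    in the slides, not taken from a format specification.
    """
    body = sentence.split("*", 1)[0]          # drop the trailing checksum
    fields = body.split(",")
    year, month, day, hour, minute, second, millis = (int(f) for f in fields[2:9])
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp()) * 1000 + millis


gnm_line = "$PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20"
epoch_ms = pghp_to_epoch_ms(gnm_line)
```

For the GNM line shown above this yields 1315505346300, the same digits that follow `c:` in the NMEAv4 line, so no timestamp information is lost in either direction.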
Lossy Conversion NMEAv4 -> KML
Message Type 1:
• MMSI (identifier)
• Timestamp
• Longitude/Latitude
• Heading
• Navigation Status
• Rate of Turn
• Speed Over Ground
• Position Accuracy
• Course over Ground
• …
KML:
<Placemark>
<name>431300061</name>
<TimeStamp><when>2011-09-08T18:09:06Z</when></TimeStamp>
<Point><coordinates>140.08116666666666,35.55616666666667</coordinates></Point>
<Style><IconStyle>
<Icon>
<href>http://maps.google.com/mapfiles/kml/shapes/track.png</href>
<w>64</w><h>64</h>
</Icon><color>ffff0000</color>
<heading>344.0</heading>
</IconStyle>
</Style>
</Placemark>
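The loss can be made concrete with a small sketch. It assumes a parsed message is a plain dict with the hypothetical keys used below, and it builds only the part of the Placemark that carries AIS data. The extra field values in the example are illustrative, not decoded from the payload above.

```python
import xml.etree.ElementTree as ET

# Fields that the Placemark above has room for (assumed key names).
KML_FIELDS = ("mmsi", "timestamp", "longitude", "latitude", "heading")


def to_placemark(message):
    """Return (placemark_element, dropped_field_names) for one message."""
    placemark = ET.Element("Placemark")
    ET.SubElement(placemark, "name").text = str(message["mmsi"])
    when = ET.SubElement(ET.SubElement(placemark, "TimeStamp"), "when")
    when.text = message["timestamp"]
    point = ET.SubElement(placemark, "Point")
    coordinates = ET.SubElement(point, "coordinates")
    coordinates.text = f'{message["longitude"]},{message["latitude"]}'
    icon_style = ET.SubElement(ET.SubElement(placemark, "Style"), "IconStyle")
    ET.SubElement(icon_style, "heading").text = str(message["heading"])
    # Everything without a slot in the Placemark is silently lost.
    dropped = set(message) - set(KML_FIELDS)
    return placemark, dropped


message = {
    "mmsi": 431300061,
    "timestamp": "2011-09-08T18:09:06Z",
    "longitude": 140.08116666666666,
    "latitude": 35.55616666666667,
    "heading": 344.0,
    "rate_of_turn": 0.0,        # illustrative value
    "speed_over_ground": 0.0,   # illustrative value
    "navigation_status": 0,     # illustrative value
}
placemark, dropped = to_placemark(message)
```

Returning the set of dropped fields alongside the output is one way to make the loss visible to callers instead of hiding it.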
Problem: Proliferation of Converters
• Code Duplication
• Bug prone, not performant
• Testing + optimization efforts were strained by so many
implementations
• Not flexible
• If a component consumed GNM, it was hard to add the ability to also
consume NMEA
• Inadvertent use of lossy conversions
Step 1: One format to rule them all
We created a new format: EEA
Built for AIS, faithfully reflects the spec.
Extension fields for format-specific metadata
Side benefit: Multi-type fields
Some fields in the AIS spec are multi-typed
Example: Speed over Ground (10 bits, 0-1023)
From the spec:
“Speed over ground in 1/10 knot steps (0-102.2 knots)
1023 = not available, 1022 = 102.2 knots or higher”
Developers were often performing mathematical operations directly on these fields (!)
In EEA, we made the type of these fields:
Either[double, NOT_AVAILABLE, SPEED_102_POINT_2_KNOTS_OR_HIGHER]
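A rough Python rendering of that idea, assuming an `Enum` for the two sentinel values and a `Union` in place of `Either`. The names `SpeedSentinel` and `decode_speed_over_ground` are made up for this sketch and are not EEA's actual API.

```python
from enum import Enum
from typing import Union


class SpeedSentinel(Enum):
    """The two raw values that the spec reserves for non-numeric meanings."""
    NOT_AVAILABLE = 1023
    SPEED_102_POINT_2_KNOTS_OR_HIGHER = 1022


SpeedOverGround = Union[float, SpeedSentinel]


def decode_speed_over_ground(raw: int) -> SpeedOverGround:
    """Decode the 10-bit field into knots, or into a sentinel."""
    if not 0 <= raw <= 1023:
        raise ValueError(f"speed over ground must fit in 10 bits, got {raw}")
    if raw in (1022, 1023):
        return SpeedSentinel(raw)
    return raw / 10.0  # 1/10 knot steps


def mean_speed(values):
    """Average only the real speeds; sentinels cannot be added by accident."""
    speeds = [v for v in values if isinstance(v, float)]
    return sum(speeds) / len(speeds) if speeds else None
```

Because a sentinel is not a number, code such as `decode_speed_over_ground(1023) + 1.0` raises a `TypeError` instead of quietly producing a speed of 103.3 knots.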
Step 2: Standardized low-level API
• tokenize(input:file_like)
• deserialize_message(unparsed_message)
• serialize_message(parsed_message)
• merge(Iterable[unparsed_message], output:file_like)
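To show how the four functions fit together, here is a sketch for a made-up line-based `key=value;key=value` format. The format, and the choice of a dict as the parsed message, are assumptions for illustration only; the real implementations would produce and consume EEA objects.

```python
import io
from typing import BinaryIO, Dict, Iterable, Iterator


def tokenize(input: BinaryIO) -> Iterator[bytes]:
    """Split a payload into unparsed messages (here: one per non-empty line)."""
    for line in input:
        line = line.rstrip(b"\r\n")
        if line:
            yield line


def deserialize_message(unparsed_message: bytes) -> Dict[str, str]:
    """Parse one unparsed message into a structured object."""
    pairs = unparsed_message.decode("ascii").split(";")
    return dict(pair.split("=", 1) for pair in pairs)


def serialize_message(parsed_message: Dict[str, str]) -> bytes:
    """Inverse of deserialize_message."""
    text = ";".join(f"{key}={value}" for key, value in parsed_message.items())
    return text.encode("ascii")


def merge(unparsed_messages: Iterable[bytes], output: BinaryIO) -> None:
    """Write unparsed messages back out as one payload."""
    for message in unparsed_messages:
        output.write(message + b"\n")


source = io.BytesIO(b"mmsi=431300061;heading=344\nmmsi=1;heading=2\n")
target = io.BytesIO()
merge((serialize_message(deserialize_message(m)) for m in tokenize(source)), target)
```

Keeping all four functions lazy (generators in, generators out) means a whole file never has to be held in memory, which matters once these steps are chained automatically.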
Step 3: Conversion Graph
[Figure: conversion graph, built up over several slides, with nodes labelled GNM, NMEA4 (also written NM4), DOF and EEA.]
Generating the converter
Now that we have a graph, to make a converter just compose the functions
on the edges of the shortest path:
NMEA_v4_payload = merge(serialize(nop(deserialize(tokenize(GNM_input)))))
Function composition in Python:
https://mathieularose.com/function-composition-in-python/#solution
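The composition step can be sketched as follows. To keep the example dependency-free, a plain breadth-first search stands in for `networkx.shortest_path`, and the graph is a dict mapping `(source, target)` pairs to functions. The node names and functions are toy placeholders, not real format converters.

```python
from collections import deque
from functools import reduce


def shortest_path(edges, source, target):
    """Breadth-first search over a dict of (u, v) -> function edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:   # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    raise LookupError(f"no path from {source!r} to {target!r}")


def compose_along(edges, path):
    """Compose the edge functions along a path, applied in path order."""
    functions = [edges[(u, v)] for u, v in zip(path, path[1:])]
    return lambda value: reduce(lambda acc, f: f(acc), functions, value)


# Toy graph: the direct route is shorter than the one through "detour".
EDGES = {
    ("text", "number"): int,
    ("number", "doubled"): lambda n: n * 2,
    ("doubled", "label"): str,
    ("text", "detour"): str.strip,
    ("detour", "number"): int,
}
path = shortest_path(EDGES, "text", "label")
convert = compose_along(EDGES, path)
```

Here `convert("21")` applies `int`, the doubling step and `str` in that order. With NetworkX the same list of functions would come from reading the `function` attribute of each edge on the path it returns.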
NetworkX
https://networkx.github.io/
3-clause BSD license
I’ve used it heavily and have no complaints about it
Building the graph
import networkx as nx

conversions = nx.DiGraph()  # assumption: directed, since each edge converts one way
conversions.add_edge(("NMEA", "4", "PAYLOAD"), ("NMEA", "4", "UNPARSED"), function=tokenize)
conversions.add_edge(("NMEA", "4", "UNPARSED"), ("NMEA", "4", "EEA_OBJ"), function=deserialize)
conversions.add_edge(("NMEA", "4", "EEA_OBJ"), ("NMEA", "4", "UNPARSED"), function=serialize)
conversions.add_edge(("NMEA", "4", "UNPARSED"), ("NMEA", "4", "PAYLOAD"), function=merge)
…
conversions.add_edge(("NMEA", "4", "EEA_OBJ"), ("GNM", "3.1", "EEA_OBJ"), function=lambda x: x)  # nop
…
Example usage
NM4_to_GNM = get_converter(("NMEA", "4", "PAYLOAD"), ("GNM", "3.1", "PAYLOAD"))
with open("my_nmea_v4_file.nm4", 'rb') as fin:
    with open("my_converted_file.gnm", 'wb') as fout:
        fout.write(NM4_to_GNM(fin))
Prevention of lossy conversions
Create 2 different conversion graphs:
1. Only lossless conversions: “FORWARD_FORMAT_CONVERSIONS”
2. Add lossy conversions: “ALL_FORMAT_CONVERSIONS”
Use lossless graph by default, make users explicitly ask to use lossy
conversions
If the user asks for a lossy conversion without being explicit, there will be
no path in the “FORWARD_FORMAT_CONVERSIONS” graph. The library can then check
for a path in “ALL_FORMAT_CONVERSIONS” and give them a helpful error message:
“No lossless path from NMEAv4 to KML. If you want to perform a
lossy conversion, you must explicitly allow lossy conversions.”
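A minimal sketch of that check, assuming each graph is just a set of `(source, target)` pairs and using a simplified topology with EEA as the hub. `LossyConversionError`, `has_path` and `check_conversion` are names invented here, not the library's.

```python
from collections import deque

# Simplified, assumed topology: every format converts to and from EEA.
LOSSLESS_EDGES = {("NMEAv4", "EEA"), ("EEA", "NMEAv4"), ("GNM", "EEA"), ("EEA", "GNM")}
LOSSY_EDGES = {("EEA", "KML")}

FORWARD_FORMAT_CONVERSIONS = LOSSLESS_EDGES
ALL_FORMAT_CONVERSIONS = LOSSLESS_EDGES | LOSSY_EDGES


class LossyConversionError(Exception):
    """Raised when only a lossy path exists and lossy was not requested."""


def has_path(edges, source, target):
    """Breadth-first reachability over a set of directed edges."""
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for u, v in edges:
            if u == node and v not in seen:
                seen.add(v)
                queue.append(v)
    return False


def check_conversion(source, target, graph=FORWARD_FORMAT_CONVERSIONS):
    """Raise a helpful error if the conversion is impossible in `graph`."""
    if has_path(graph, source, target):
        return
    if has_path(ALL_FORMAT_CONVERSIONS, source, target):
        raise LossyConversionError(
            f"No lossless path from {source} to {target}. If you want to perform "
            "a lossy conversion, you must explicitly allow lossy conversions."
        )
    raise LookupError(f"No path from {source} to {target}.")
```

Calling `check_conversion("NMEAv4", "KML")` raises the explanatory error, while passing `graph=ALL_FORMAT_CONVERSIONS` lets the same request through. The second lookup only runs on the failure path, so the default case pays nothing for the nicer message.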
Extra parameters
Sometimes conversions need additional information
Example:
Conversion from DOFv3 -> DOFv4 requires a timestamp
1. Mark edges as having required parameters:
conversions.add_edge(
    ("DOF", "3", "PAYLOAD"),
    ("DOF", "3", "UNPARSED"),
    function=tokenize,
    required_params={"timestamp"},
)
2. Allow the user to supply arbitrary keyword arguments to get_converter():
get_converter(
    ("DOF", "3", "PAYLOAD"),
    ("DOF", "4", "PAYLOAD"),
    timestamp=get_datetime_for_id(id),
)
Final API
get_converter(
    source_schema,
    target_schema,
    graph=FORWARD_FORMAT_CONVERSIONS,
    **kwargs
)
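How the `**kwargs` might reach the edges that need them is not shown in the slides. One possible approach, sketched here under the assumption that each edge on the chosen path is available as a dict of its attributes, is to bind the required parameters with `functools.partial` and fail early if any are missing. `bind_required_params` and `tokenize_dof3` are hypothetical names.

```python
from functools import partial


def bind_required_params(path_edges, **kwargs):
    """Return one ready-to-call function per edge, with its parameters bound.

    `path_edges` is a list of edge-attribute dicts, each holding a
    "function" and optionally a "required_params" set.
    """
    bound = []
    for attrs in path_edges:
        required = attrs.get("required_params", set())
        missing = required - set(kwargs)
        if missing:
            raise TypeError(
                "missing conversion parameters: " + ", ".join(sorted(missing))
            )
        wanted = {name: kwargs[name] for name in required}
        bound.append(partial(attrs["function"], **wanted))
    return bound


def tokenize_dof3(payload, timestamp):
    """Stand-in for a tokenizer that needs an externally supplied timestamp."""
    return [(line, timestamp) for line in payload.splitlines()]


path_edges = [
    {"function": tokenize_dof3, "required_params": {"timestamp"}},
    {"function": list},  # an edge with no extra requirements
]
steps = bind_required_params(path_edges, timestamp=1315505346300)
```

Checking for missing parameters when the converter is built, not when it first runs, means the user finds out before any data has been read.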
Benefits
• Centralizes the conversion code
• Fewer bugs, more performant
• Simplifies the code + less duplication
• Don’t need to know all of the input formats a priori
• Dynamic generation of converters
• Reduces chance of accidental lossy conversions
Summary
• We had a problem with multiple formats and converters between them
• By modelling it as a graph problem, it was easy to dynamically generate
converters
• This allowed for greater flexibility, greater safety
• When you have a web of conversion steps, you can use graph traversal
libraries to generate the shortest path to get the answers you want.
Thanks
Questions?
I’m Michael Overmeyer:
@movermeyer on every platform

More Related Content

Similar to FormatConversion

NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...
NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...
NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...The HDF-EOS Tools and Information Center
 
Distributed Logging Architecture in Container Era
Distributed Logging Architecture in Container EraDistributed Logging Architecture in Container Era
Distributed Logging Architecture in Container EraSATOSHI TAGOMORI
 
Distributed Logging Architecture in the Container Era
Distributed Logging Architecture in the Container EraDistributed Logging Architecture in the Container Era
Distributed Logging Architecture in the Container EraGlenn Davis
 
An introduction to Apache Camel
An introduction to Apache CamelAn introduction to Apache Camel
An introduction to Apache CamelKapil Kumar
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System AdministrationGlobus
 
Charting New Waters: Data Integration Excellence for Port & Marine Operations
Charting New Waters: Data Integration Excellence for Port & Marine OperationsCharting New Waters: Data Integration Excellence for Port & Marine Operations
Charting New Waters: Data Integration Excellence for Port & Marine Operationsmarketing932765
 
Biztalk ESB Toolkit Introduction
Biztalk ESB Toolkit IntroductionBiztalk ESB Toolkit Introduction
Biztalk ESB Toolkit IntroductionSaffi Ali
 
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainReconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainMDC_UNICA
 
Migrating 500 Nodes from Rackspace to Google Cloud with Zero Downtime
Migrating 500 Nodes from Rackspace to Google Cloud with Zero DowntimeMigrating 500 Nodes from Rackspace to Google Cloud with Zero Downtime
Migrating 500 Nodes from Rackspace to Google Cloud with Zero DowntimePaul Chandler
 
Using FME Server and Engines to Convert Large Amounts of Data
Using FME Server and Engines to Convert Large Amounts of DataUsing FME Server and Engines to Convert Large Amounts of Data
Using FME Server and Engines to Convert Large Amounts of DataSafe Software
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBaseCarol McDonald
 
PPT_Deploying_Exchange_Server.pdf.pdf
PPT_Deploying_Exchange_Server.pdf.pdfPPT_Deploying_Exchange_Server.pdf.pdf
PPT_Deploying_Exchange_Server.pdf.pdfTrngTn67
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System AdministrationGlobus
 
SmartMet Server OSGeo
SmartMet Server OSGeoSmartMet Server OSGeo
SmartMet Server OSGeoRoope Tervo
 
Telnet and FTP.ppt
Telnet and FTP.pptTelnet and FTP.ppt
Telnet and FTP.pptssuser1774d3
 
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year Journey
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year JourneyNagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year Journey
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year JourneyNagios
 
Evolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsEvolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsSteve Jamieson
 
Como definir un esquema de direcciones IPv6
Como definir un esquema de direcciones IPv6Como definir un esquema de direcciones IPv6
Como definir un esquema de direcciones IPv6Edgardo Scrimaglia
 

Similar to FormatConversion (20)

NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...
NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...
NWGISS: The Web GIS Software Suite for Interoperable Access and Manipulation ...
 
Distributed Logging Architecture in Container Era
Distributed Logging Architecture in Container EraDistributed Logging Architecture in Container Era
Distributed Logging Architecture in Container Era
 
Distributed Logging Architecture in the Container Era
Distributed Logging Architecture in the Container EraDistributed Logging Architecture in the Container Era
Distributed Logging Architecture in the Container Era
 
An introduction to Apache Camel
An introduction to Apache CamelAn introduction to Apache Camel
An introduction to Apache Camel
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
 
Charting New Waters: Data Integration Excellence for Port & Marine Operations
Charting New Waters: Data Integration Excellence for Port & Marine OperationsCharting New Waters: Data Integration Excellence for Port & Marine Operations
Charting New Waters: Data Integration Excellence for Port & Marine Operations
 
CDC to the Max!
CDC to the Max!CDC to the Max!
CDC to the Max!
 
Biztalk ESB Toolkit Introduction
Biztalk ESB Toolkit IntroductionBiztalk ESB Toolkit Introduction
Biztalk ESB Toolkit Introduction
 
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainReconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
 
server-side-fusion-vts
server-side-fusion-vtsserver-side-fusion-vts
server-side-fusion-vts
 
Migrating 500 Nodes from Rackspace to Google Cloud with Zero Downtime
Migrating 500 Nodes from Rackspace to Google Cloud with Zero DowntimeMigrating 500 Nodes from Rackspace to Google Cloud with Zero Downtime
Migrating 500 Nodes from Rackspace to Google Cloud with Zero Downtime
 
Using FME Server and Engines to Convert Large Amounts of Data
Using FME Server and Engines to Convert Large Amounts of DataUsing FME Server and Engines to Convert Large Amounts of Data
Using FME Server and Engines to Convert Large Amounts of Data
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase
 
PPT_Deploying_Exchange_Server.pdf.pdf
PPT_Deploying_Exchange_Server.pdf.pdfPPT_Deploying_Exchange_Server.pdf.pdf
PPT_Deploying_Exchange_Server.pdf.pdf
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
 
SmartMet Server OSGeo
SmartMet Server OSGeoSmartMet Server OSGeo
SmartMet Server OSGeo
 
Telnet and FTP.ppt
Telnet and FTP.pptTelnet and FTP.ppt
Telnet and FTP.ppt
 
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year Journey
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year JourneyNagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year Journey
Nagios Conference 2014 - Simon Finch - Monitoring Maturity A 16 Year Journey
 
Evolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsEvolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.js
 
Como definir un esquema de direcciones IPv6
Como definir un esquema de direcciones IPv6Como definir un esquema de direcciones IPv6
Como definir un esquema de direcciones IPv6
 

Recently uploaded

Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?Watsoo Telematics
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 

Recently uploaded (20)

Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 

FormatConversion

  • 1. FormatConversion A handy pattern for format conversions 2017-11-27
  • 2. Overview • At exactEarth, we deal with data in a lot of different formats. • We had problems with a proliferation of converters. • I’m going to discuss a pattern we developed for dealing with situations like this that turned out well.
  • 3. What does exactEarth do? “We track all of the world’s ships using satellites.”
  • 4.
  • 6. Automatic Identification System (AIS) • Designed during the 1990s • Adopted as a standard in 2002 • Very High Frequency (VHF) radio transmissions • 27 different types of messages transmitted
  • 7. Maritime Mobile Service Identity (MMSI) Location Speed over ground Course over ground Heading Rate of Turn Message Types 1, 2, 3 • MaritimeMobile ServiceIdentity (MMSI) • Name • IMO Number • Callsign • Dimensions of the ship • Destination and ETA Message Type 5
  • 8. Effective January 1 2005, AIS transceivers are required by: • All ships of 300 gross tonnage and upwards engaged on international voyages • All cargo ships of 500 gross tonnage and upwards not engaged on international voyages • All passenger ships irrespective of size. AIS transceivers must be on at all times (with some limited exceptions)
  • 9. Top pictures: gross tonnage of 300 Right picture: gross tonnage of 500
  • 10. Many ways to store AIS messages • NMEA v3, v4 • GNM v3.1 • Internal Binary formats (with several versions) • “Adapted” formats (several variations): • CSV • XML • JSON • KML • OTH-Gold • Many third-party and “one-off” formats
  • 11. Many ways to store AIS messages • NMEA v3, v4 • GNM v3.1 • Internal Binary formats (with several versions) • “Adapted” formats (several variations): • CSV • XML • JSON • KML • OTH-Gold • Many third-party and “one-off” formats Many representations of the same data
  • 12. Conversions between formats In order to ingest data from third parties, and to satisfy customer demands for data in a particular format, we need to be able to convert between all the formats
  • 13. Lossy vs. Lossless Conversions Some conversions are lossless: • For example, both NMEAv4 and GNM v3.1 capture all the same data. But some are lossy, meaning that data is lost in the conversion: • For example, NMEAv4 to KML • KML doesn’t have all of the fields that AIS-specific formats do
  • 14. Lossless Conversion: GNM and NMEAv4 GNM: $PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20 !AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
  • 15. Lossless Conversion: GNM and NMEAv4 GNM: $PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20 !AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00 NMEAv4: s:104,c:1315505346300*0E!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
  • 16. Lossless Conversion: GNM and NMEAv4 GNM: $PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20 !AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00 NMEAv4: s:104,c:1315505346300*0E!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00
  • 17. Lossless Conversion: GNM and NMEAv4 GNM: $PGHP,1,2011,9,8,18,9,6,300,,104,,1,00*20 !AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00 NMEAv4: s:104,c:1315505346300*0E!AIVDM,1,1,,,16KDMo@P00:1?IpDF6i=L?v<0<27,0*00 The other fields are either format syntax, checksums, or some trivial additional fields
  • 18. Lossy Conversion NMEAv4 -> KML Message Type 1: • MMSI (identifier) • Timestamp • Longitude/Latitude • Heading • Navigation Status • Rate of Turn • Speed Over Ground • Position Accuracy • Course over Ground • …
  • 19. Lossy Conversion NMEAv4 -> KML Message Type 1: • MMSI (identifier) • Timestamp • Longitude/Latitude • Heading • Navigation Status • Rate of Turn • Speed Over Ground • Position Accuracy • Course over Ground • … KML: <Placemark> <name>431300061</name> <TimeStamp><when>2011-09-08T18:09:06Z</when></TimeStamp> <Point><coordinates>140.08116666666666,35.55616666666667</coordinates></Point> <Style><IconStyle> <Icon> <href>http://maps.google.com/mapfiles/kml/shapes/track.png</href> <w>64</w><h>64</h> </Icon><color>ffff0000</color> <heading>344.0</heading> </IconStyle> </Style> </Placemark>
  • 20. Problem: Proliferation of Converters • Code Duplication • Bug prone, not performant • Testing + optimization efforts were strained by so many implementations • Not flexible • If a component consumes GNM today, it was hard to add the ability to consume NMEA • Inadvertent use of lossy conversions
  • 21. Step 1: One format to rule them all We created a new format: EEA Built for AIS, faithfully reflects the spec. Extension fields for format-specific metadata
  • 22. Side benefit: Multi-type fields Some fields in the AIS spec are multi-typed Example: Speed over Ground (10 bits, 0-1023) From the spec: “Speed over ground in 1/10 knot steps (0-102.2 knots) 1023 = not available, 1022 = 102.2 knots or higher” Developers were often performing mathematical operations on the fields (!) In EEA, we made the types of this fields: Either[double, NOT_AVAILABLE, SPEED_102_POINT_2_KNOTS_OR_HIGHER]
  • 23. Step 2: Standardized low-level API • tokenize(input:file_like) • deserialize_message(unparsed_message) • serialize_message(parsed_message) • merge(Iterable[unparsed_message], output:file_like)
  • 24. Step 3: Conversion Graph GNM NM4 DOF
  • 25. Step 3: Conversion Graph GNM NM4 DOF EEA
  • 26. Step 3: Conversion Graph GNM NMEA4 DOF EEA EEA EEA
  • 27. Step 3: Conversion Graph GNM DOF EEA EEA EEA NMEA4
  • 28. Step 3: Conversion Graph GNM NMEA4
  • 29. Step 3: Conversion Graph GNM NMEA4
  • 30. Step 3: Conversion Graph GNM NMEA4
  • 31. Step 3: Conversion Graph GNM NMEA4
  • 32. Step 3: Conversion Graph GNM NMEA4
  • 33. Step 3: Conversion Graph GNM NMEA4
  • 34. Step 3: Conversion Graph GNM NMEA4
  • 35. Generating the converter Now that we have a graph, to make a converter just compose the functions on the edges of the shortest path: NMEA_v4_payload = merge(serialize(nop(deserialize(tokenize(GNM_input))))) Function composition in Python: https://mathieularose.com/function-composition-in-python/#solution
  • 36. NetworkX https://networkx.github.io/ 3-clause BSD license I’ve used it heavily and have no complaints with it
  • 37. Building the graph conversions.add_edge((“NMEA”, “4”, “PAYLOAD”), (“NMEA”, "4”, "UNPARSED"), function=tokenize) conversions.add_edge((“NMEA”, “4”, "UNPARSED"), (“NMEA”, "4”, "EEA_OBJ"), function=deserialize) conversions.add_edge((“NMEA”, “4”, "EEA_OBJ"), (“NMEA”, "4”, "UNPARSED"), function=serialize) conversions.add_edge((“NMEA”, “4”, "UNPARSED"), (“NMEA”, "4”, "PAYLOAD"), function=merge) … conversions.add_edge((“NMEA”, “4”, " EEA_OBJ "), (“GNM”, “3.1”, "EEA_OBJ"), function=lambda x: x) # nop …
  • 38. Example usage NM4_to_GNM = get_converter(("NMEA", "4", "PAYLOAD"), ("GNM", "3.1", "PAYLOAD")) with open("my_nmea_v4_file.nm4", 'rb') as fin: with open("my_converted_file.gnm", 'wb') as fout: fout.write(NM4_to_GNM(fin))
  • 39. Prevention of lossy conversions Create 2 different conversion graphs: 1. Only lossless conversions: “FORWARD_FORMAT_CONVERSIONS” 2. Add lossy conversions: “ALL_FORMAT_CONVERSIONS” Use lossless graph by default, make users explicitly ask to use lossy conversions If the user asks for a lossy conversion without being explicit, there will be no path in the “FORWARD_FORMAT_CONVERSIONS” graph. Library can check for a path in “ALL_FORMAT_CONVERSIONS” and give them a nice error message: “No lossless path from NMEAv4 to KML. If you want to perform a lossy conversion, you must explicitly allow lossy conversions.”
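A rough sketch of the two-graph check. The graph contents, the `allow_lossy` flag, and the `LossyConversionError` and `choose_graph` names are hypothetical; only the two graph names and the wording of the error message come from the slide. Which edges count as lossless is assumed here purely for the example.

```python
from collections import deque


class LossyConversionError(Exception):
    """Raised when only a lossy path exists and the caller did not opt in."""


def has_path(graph, source, target):
    """Breadth-first reachability over {node: set_of_neighbours}."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


# Assumed toy graphs: the lossy graph is the lossless one plus extra edges.
FORWARD_FORMAT_CONVERSIONS = {
    "NMEAv4": {"EEA"},
    "GNMv3.1": {"EEA"},
    "EEA": {"NMEAv4", "GNMv3.1"},
}
ALL_FORMAT_CONVERSIONS = {
    **FORWARD_FORMAT_CONVERSIONS,
    "EEA": FORWARD_FORMAT_CONVERSIONS["EEA"] | {"KML"},
}


def choose_graph(source, target, allow_lossy=False):
    """Prefer the lossless graph; fall back to the lossy one only on request."""
    if has_path(FORWARD_FORMAT_CONVERSIONS, source, target):
        return FORWARD_FORMAT_CONVERSIONS
    if has_path(ALL_FORMAT_CONVERSIONS, source, target):
        if allow_lossy:
            return ALL_FORMAT_CONVERSIONS
        raise LossyConversionError(
            f"No lossless path from {source} to {target}. If you want to perform "
            "a lossy conversion, you must explicitly allow lossy conversions."
        )
    raise LookupError(f"No conversion path from {source} to {target}")
```

The second lookup exists only to produce a helpful message: it distinguishes "impossible" from "possible, but you have to ask for it".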
  • 40. Extra parameters Sometimes conversions need additional information Example: Conversion from DOFv3 -> DOFv4 requires a timestamp
  • 41. Extra parameters 1. Mark edges as having required parameters: conversions.add_edge( ("DOF", "3", "PAYLOAD"), ("DOF", "3", "UNPARSED"), function=tokenize, required_params=set(["timestamp"]) ) 2. Allow the user to supply arbitrary keyword arguments to get_converter(): get_converter( ("DOF", "3", "PAYLOAD"), ("DOF", "4", "PAYLOAD"), timestamp=get_datetime_for_id(id) )
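One possible way to wire the required parameters into the generated converter is to bind them to each step when the converter is built, so a missing parameter is reported before any data is processed. This is a sketch under assumptions: `build_step`, `make_converter`, and the `stamp` step are hypothetical names, and the steps are passed as a plain list where the real library would read them from the graph edges.

```python
from functools import partial


def build_step(function, required_params, supplied):
    """Bind the keyword arguments an edge declares as required."""
    missing = set(required_params) - supplied.keys()
    if missing:
        raise TypeError(f"Missing required parameters: {sorted(missing)}")
    return partial(function, **{name: supplied[name] for name in required_params})


def make_converter(steps, **params):
    """steps is a list of (function, required_params) pairs in path order."""
    bound = [build_step(function, required, params) for function, required in steps]

    def convert(value):
        for step in bound:
            value = step(value)
        return value

    return convert


def stamp(message, timestamp):
    """Toy step that needs extra information the input does not carry."""
    return f"{timestamp} {message}"


steps = [(str.strip, set()), (stamp, {"timestamp"})]
converter = make_converter(steps, timestamp="2017-11-27T00:00:00Z")
```

Checking at build time means `make_converter(steps)` without a timestamp raises immediately, not halfway through a file.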
  • 43. Benefits • Centralizes the conversion code • Fewer bugs, more performant • Simplifies the code + less duplication • Don’t need to know all of the input formats a priori • Dynamic generation of converters • Reduces chance of accidental lossy conversions
  • 44. Summary • We had a problem with multiple formats and converters between them • By modelling it as a graph problem, it was easy to dynamically generate converters • This allowed for greater flexibility, greater safety • When you have a web of conversion steps, you can use graph traversal libraries to generate the shortest path to get the answers you want.

Editor's Notes

  1. Why?: Environmental: reef protection, bilge water dumping, oil spills, but most importantly illegal fishing… Logistical: Port authorities, logistics companies, scheduling Security: surveillance, smuggling, piracy
  2. As a ship captain, how do you prevent collisions with other vessels? People tend to jump immediately to SONAR/RADAR, but there are a few major problems with that: the equipment is expensive; the equipment requires a lot of power; and these systems are actually quite difficult to read, so they require some skill to operate. The most common method was simply to visually observe the other ships and try to estimate their speed, course, heading, and acceleration. Then you would do the same for your vessel and work through the calculus to determine whether you are going to collide. Obviously, this is also tricky, and it fails in situations like: night time; stormy weather; and going around a tight curve in a waterway, when you can’t see what’s coming at you. Ships don’t stop on a dime. In fact, some of these vessels take in the neighbourhood of 20 minutes to stop. So sometimes you will have two ships that know they are going to collide well in advance of the collision, but they can’t turn or stop fast enough to do anything about it. So the problem of ship collisions is what triggered the creation of AIS.
  3. All vessels transmit a “Here I am! Please don’t hit me” to the other vessels in the area.
  4. Here are some of the fields that are transmitted. In the Type 1,2,3 messages, we have position information. These are transmitted every few seconds while the vessel is moving. In other message types, we have more static information that doesn’t change very often, like registration and destination and ETA.
  5. The first thing you do when you have a lot of formats: Create a new format! We created a new internal format that we called EEA. Not only could it hold all of the AIS message fields, but it also had “extension fields” where we would shove all the fields that might be specific to a format. For example, GNM has some metadata fields that are specific to GNM. We put those into a GNMMetadata field within the EEA spec. So now we have a format that can capture all of the complexity of all the AIS formats. It was written to be faithful to the AIS spec, which means that it handles the full complexity of the AIS spec.
  6. A side benefit of redesigning our format (and our in-memory representation) with EEA is that we got to fix some of the issues developers were accidentally creating. AIS is a complicated spec. For example, look at the definition for speed over ground. The field is 10 bits wide, giving 1024 possible values, and while most of those values can be interpreted as doubles, there are two reserved values that cannot. There is 1023 = Not Available, for when the ship doesn’t know how fast it is going. There is also the 1022 value, which means 102.2 knots or greater. The problem with all the ad hoc parsers is that their developers would often get the parsing of these fields wrong. They would often just parse the field as a double, not realizing that these special values existed, and then perform mathematical operations on the fields. So they would do things like average the values of the speed over ground, and you would end up with massive values when a large number of vessels were reporting “Not available”. With EEA, we fixed that, changing the type of the field to be either a double or one of two special values. This prevents mathematical operations from being applied to the field, and forces the developer to stop and think about how they actually want to handle the math.
  7. The next step was to define a common interface of functions we apply when parsing all formats. We eventually settled on this interface. You start with a “PAYLOAD” which is a collection of bytes representing messages, which might be the contents of a file for example. You then call tokenize() which finds the boundaries of the messages within the payload and splits the payload on those boundaries. You still haven’t parsed the messages, so you don’t know what they say yet. You only know the bytes that make up each message. We called this “UNPARSED” in this diagram. You can then call deserialize, which actually parses the bytes of the message and gives you an in-memory representation of the message. Most commonly, this was the new EEA format. Then on the reverse direction, we take individual messages and call serialize() on them to return them to the UNPARSED tokens we had before. And finally we call merge(), which writes the messages, one after another, into a payload. So these four functions are pretty much universal to format parsing. They also form a graph, perhaps a sort of state-machine where the data parse levels are the nodes, and the functions are the edges.
  8. We went and implemented these four functions for all of our data formats. And when you put all the conversion graphs beside one another, you notice something.
  9. The PARSED node for most of the formats is EEA.
  10. It’s the same format. Really, it’s all the same node. Conceptually, you could add edges between them with a NO-OP function.
  11. So now if you wanted to convert between GNM and NM4, you can just follow the edges from GNM PAYLOAD to NMEA4 PAYLOAD
  12. And you suddenly have the conversion steps.
  13. In order to represent the graph in code, we need a graphing library. I came across NetworkX and have had no complaints. It allows you to create nodes and edges in a graph, and then gives you an easy way to do algorithms like shortest path across graphs.
  14. Here’s what it looks like in code: We have our conversions object, which is just a NetworkX graph object. Then on each line, we add an edge between two of our parse levels. On each edge, we also supply the function to perform. Notice that the last line is an example where we jump between formats that are already parsed into the EEA in-memory representation. Therefore the function is a simple NO-OP.