In this first post, we’ll take an in-depth look at the backend stack for data processing here at Oscar and how it continues to evolve. This is a lengthy discussion, so it will come in three parts over the next few weeks.
Building an engineering program at a new tech company is an exciting and often tricky business. There are so many amazing new technologies out there, so many battle-tested and reliable services, and so many tools that could be right for the job if you just pick one and tweak it until it works for you. It’s fun to try new things and experiment, and it’s rewarding to build rock-solid services. At Oscar, we’re all about maintaining a high degree of curiosity and technical exploration, while relying on proven methods and technologies. Does that sound consistent to you? It’s not!
So, how do you reconcile conflicting engineering impulses while still having a good time? By making sure you’re addressing the stuff that matters first and foremost. Let’s establish some foundational premises from which arguments can proceed. Those premises will be the properties our systems must achieve by whatever means we come up with. We’ll get back to the conflict in a later post.
First, what are the properties we need? Remember, we’re talking about backend data processes alone here. Let’s start with one of the first things we had to do in engineering - connecting partner feeds. The insurance business is a complicated space, and we’re currently handling multiple data feeds from about ten different partner companies. As of this writing, we’re receiving about 60 individual data feeds, and that number is likely to explode over time. Some of the feeds are trivial, requiring us to simply copy new data into our storage systems and leave it there, but that’s the exception. One of our feeds is an ASCII database transaction log representing hundreds of tables from a remote system written in COBOL. Many are fixed-width text files or CSVs that must be parsed according to a schema of byte offsets and data types. For good measure, we also see EDI and HL7.
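To give a flavor of what that parsing looks like, here’s a minimal sketch of reading one fixed-width record against a schema of byte offsets and data types. The field names, offsets, and sample line are invented for illustration - they’re not a real partner schema.

```python
# Hypothetical example: slicing a fixed-width record according to a schema of
# byte offsets and data types. All field names and offsets are made up.
from datetime import datetime

# (name, start, end, cast) -- character offsets into each fixed-width line
SCHEMA = [
    ("member_id", 0, 10, str),
    ("claim_amount", 10, 19, lambda s: int(s) / 100),  # stored as cents
    ("service_date", 19, 27, lambda s: datetime.strptime(s, "%Y%m%d").date()),
    ("status_code", 27, 29, str),
]

def parse_record(line):
    """Turn one fixed-width line into a dict of typed fields."""
    record = {}
    for name, start, end, cast in SCHEMA:
        raw = line[start:end].strip()
        record[name] = cast(raw) if raw else None
    return record

# A made-up 29-character record: member id, amount, date, status
print(parse_record("MBR000004200001299920230115AP"))
```

Multiply that by hundreds of record types per feed, each with its own schema, and the scope of the problem starts to come into focus.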
Whatever the format, the mission remains the same: to create a unified and up-to-date view of the data for users - both external and internal. And of course, don’t let a single bit get misplaced! This data really matters. It affects the lives of our customers, and if we mishandle it, we can create inconvenience or even add significant stress at a time when focusing on medical treatment should be the top priority. Basically, we shouldn’t ever screw up our data, and when it does get screwed up (yes, it will happen), we must be able to recover fast. Now we have our properties - whatever our choices, our feed systems must:
● Respect order
● Break on failure
● Be idempotent
● Be low-latency
● Maintain privacy
That’s already a fair number of properties. If we have several competing approaches, each of them has to uphold the same general behavior, and that’s more work.
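To make a couple of these concrete, here’s a minimal sketch of what “respect order” and “be idempotent” might look like at the level of a single feed record. The record keys, sequence numbers, and in-memory stores are hypothetical stand-ins, not our actual pipeline.

```python
# Hypothetical sketch: applying feed records in order and idempotently.
# The dicts below stand in for durable storage; nothing here is our real stack.

processed = {}   # record_id -> highest sequence number already applied
datastore = {}   # stand-in for the system of record

def apply_record(record_id, seq, payload):
    """Apply one feed record at most once, and never out of order.

    Respect order: a stale sequence number is skipped rather than applied.
    Be idempotent: replaying the same (record_id, seq) is a no-op, so a
    failed run can be restarted from the top without corrupting data.
    """
    last_seq = processed.get(record_id, -1)
    if seq <= last_seq:
        return "skipped"      # already seen (or out of order): leave data alone
    datastore[record_id] = payload
    processed[record_id] = seq
    return "applied"

# Replaying the same update leaves the data unchanged.
print(apply_record("member-42", 7, {"plan": "silver"}))   # applied
print(apply_record("member-42", 7, {"plan": "silver"}))   # skipped
```

The point is that replaying a feed after a failure - in whole or in part - should converge on exactly the same state as a clean run would have produced.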