At the 2013 Grace Hopper Conference in Minneapolis, MN, one of our mobile engineers, Kristine Delossantos, gave a lightning talk about the technology behind Current Caller ID. @WhitePages
9. 2013
Keeping Data Fresh
Network variance
Data connections
Usage Plans
Push/Pull protocols
Our solution:
• We periodically update the data on a schedule in
the background, in batch.
• ActiveMQ & worker machines
10. 2013
Data Transfer
Our solution:
Thrift over HTTP, and we only
deliver objects changed since the last
successful request.
[Chart: Serialized Contact List Size Comparison, Thrift vs. JSON]
[Diagram labels: GZip, Thrift, HTTP, Updates]
14. 2013
Got Feedback?
Rate and Review the session using the
GHC Mobile App
To download visit www.gracehopper.org
Editor's Notes
My name is Kristine Delossantos and I am a Software Engineer on the Mobile Team at Whitepages. I wanted to take this time to talk about the technical workings of an Android application we released last August called Current Caller ID and leave you with some key takeaways from our development experience.
I’ll start off with a quick overview of our app. Then I’ll show an architectural diagram. Afterwards I’ll get to the key problems and our current solutions. First: what is Current Caller ID?
We have a sweet call alert. Not only will it tell you who’s calling, but it’ll also integrate Facebook, LinkedIn, and Twitter data to show what is happening with them in real-time.
In the app, you can access a consolidated call/text log, with one-tap access to your top contacts.
Then, we make it visual. We have sharable infographics that show insights into your communication style. They show you who you communicate with, when you communicate, and how. Now I’ll give you more detail about the technical side of things with this scenario…
Meet Spongebob. He just got a new cell phone, installed Current, and wired it up to his Facebook account. He posted a status to Facebook with his new number, telling his friends to text him.
Patrick, Squidward, Mr. Krabs, and Sandy are so excited that Spongebob finally has a phone, so they all text him to get Krabby Patties to celebrate. Current recognizes these as new contacts and gets to work.
Current sends the data to our servers and we store it for further processing. Our front ends deliver a message to an asynchronous messaging queue system, alerting the data collection services of the new contacts. Our data collection services pick that up and reach out to our Whitepages data and social networks to collect more information about the contacts. Then we store it. Our data collection services deliver another message to our ActiveMQ pipeline, alerting the entity resolution system that we’ve collected information that needs to be resolved together. The entity resolution system picks that up and fetches data from our contact graph store. (I’ll get into more detail about the entity resolution system in a bit.) It resolves the data, stores it, and sends it back to the client. Now I’ll dive into the key takeaways we learned while trying to make all of this work.
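The flow above can be sketched as two queues with a worker behind each. This is only an illustration under stated assumptions: the in-memory `queue.Queue` objects stand in for ActiveMQ, the dictionaries stand in for our stores, and every function name here is hypothetical. The "resolution" step is a placeholder that just collects names.

```python
import queue

# Hypothetical in-memory stand-ins for the message queues and data stores.
collection_queue = queue.Queue()
resolution_queue = queue.Queue()
contact_store = {}    # user id -> {phone number: collected info}
resolved_store = {}   # user id -> resolved result


def front_end_receive(user_id, new_numbers):
    """Store the raw upload, then alert the data collection services."""
    contact_store[user_id] = {number: None for number in new_numbers}
    collection_queue.put(user_id)


def data_collection_worker(lookup):
    """Collect more info for each new contact, store it, then alert resolution."""
    user_id = collection_queue.get_nowait()
    for number in contact_store[user_id]:
        contact_store[user_id][number] = lookup(number)
    resolution_queue.put(user_id)


def entity_resolution_worker():
    """Placeholder resolution: gather the names found for this user's contacts."""
    user_id = resolution_queue.get_nowait()
    found = [info for info in contact_store[user_id].values() if info]
    resolved_store[user_id] = sorted(info["name"] for info in found)
    return resolved_store[user_id]
```

Each stage only knows about the queue it reads from and the queue it writes to, which is what lets the real workers scale out independently.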
When dealing with large data sets, you want to make sure you keep the data fresh, and do it efficiently. You don’t want to violate your customers’ trust by using up their data plan. We first needed to decide between a push and a pull protocol. Since the client triggers updates from the server, we didn’t need real-time updates every step of the way. Whenever a change happens in your call/text log or address book, the client sends the changes over to the server, then increases its polling frequency to fetch any new associations that have been created on the server, and then refreshes the UI. Doing real-time lookups is not fast enough to present a rich call alert in a timely fashion. Additionally, when we first started, CDMA was more prevalent, so simultaneous voice and data communication wasn’t possible. So we chose to pull, to minimize our customers’ data usage while still responding to updates quickly. The system was designed to perform these jobs on its own by using ActiveMQ, a popular open source messaging queue system, and a scalable host of worker machines that process messages delivered to the queue and update our databases. The key takeaway here is that it was best to deliver data asynchronously as it becomes available, and to deliver only new data, so the user experience doesn’t suffer from long wait times and loading screens.
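The pull behavior described above (poll slowly when idle, poll faster for a while after the client has uploaded a change) could look roughly like this. It is a minimal sketch, not our client code: the class name and all interval values are assumptions chosen for illustration.

```python
class PollScheduler:
    """Chooses the delay before the client's next pull from the server."""

    def __init__(self, idle_interval=3600.0, fast_interval=15.0, fast_polls=4):
        # All three defaults are illustrative assumptions.
        self.idle_interval = idle_interval   # seconds between polls when idle
        self.fast_interval = fast_interval   # seconds between polls after a change
        self.fast_polls = fast_polls         # how many fast polls to attempt
        self._fast_remaining = 0

    def on_local_change(self):
        """Call after uploading call/text log or address book changes."""
        self._fast_remaining = self.fast_polls

    def next_delay(self, got_updates):
        """Return the number of seconds to wait before the next poll."""
        if got_updates:
            # The server delivered the new associations, so relax again.
            self._fast_remaining = 0
        if self._fast_remaining > 0:
            self._fast_remaining -= 1
            return self.fast_interval
        return self.idle_interval
```

Capping the number of fast polls keeps a slow server-side job from turning into unbounded data usage on the device.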
When transferring large sets of data, you want to pay close attention to using compact serialization schemes. Keep in mind that the mobile device may not always be connected, and make sure your app can handle that. When choosing our transfer protocol, we realized that HTTP was easiest to plug into our infrastructure. Then we compared Thrift and JSON for the format of our data. JSON can be compressed and is easy to debug, but ideally we wanted to keep payloads as small as possible, and Thrift was best for the job in its compact binary form. Compact binary Thrift, compared to JSON on the same data set, cut payloads roughly in half. We used Gzip because HTTP supports Gzip compression, which makes it a widely available compression scheme; it gave us an average 30% savings on the Thrift payloads. We also make sure we only deliver changed data, in batches, so the client only receives data that is necessary. Be cognizant of payload size from a serialization format perspective, a compression perspective, and an overall structural perspective (the choice of delivering only deltas). When you’re dealing with large sets of data, you will probably need to store that data somewhere.
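To illustrate the delta idea from the data transfer notes above, here is a small sketch of a server that returns only what changed since the client's last successful sync, with the payload Gzip-compressed. Two loud assumptions: JSON stands in for Thrift here (Thrift needs classes generated from an IDL, which would not fit in a short example), and the class names and version-counter scheme are hypothetical rather than our actual protocol.

```python
import gzip
import json


class ContactServer:
    """Tracks a version number per contact so it can answer delta requests."""

    def __init__(self):
        self.version = 0
        self._contacts = {}     # contact id -> data
        self._changed_at = {}   # contact id -> version of its last change

    def upsert(self, contact_id, data):
        self.version += 1
        self._contacts[contact_id] = data
        self._changed_at[contact_id] = self.version

    def changes_since(self, since_version):
        """Return (current version, compressed payload of changed contacts)."""
        delta = {cid: self._contacts[cid]
                 for cid, changed in self._changed_at.items()
                 if changed > since_version}
        body = json.dumps(delta, sort_keys=True).encode("utf-8")
        return self.version, gzip.compress(body)


class Client:
    def __init__(self):
        self.last_synced = 0    # version of the last successful request
        self.contacts = {}

    def sync(self, server):
        """Fetch and apply the delta; return how many contacts changed."""
        version, payload = server.changes_since(self.last_synced)
        delta = json.loads(gzip.decompress(payload))
        self.contacts.update(delta)
        # Advance the marker only after the payload was applied successfully,
        # so a failed request is simply retried from the same point.
        self.last_synced = version
        return len(delta)
```

If a request fails partway (the device drops off the network, say), `last_synced` is unchanged and the next sync asks for the same delta again.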
It is important for your storage solution to be fault tolerant, maintain consistency, and scale horizontally. You might want to consider using data partitioning to increase I/O throughput and maintain scalability. We use Postgres and treat it as a NoSQL key-value store. We use partitions to spread our data across multiple databases. A drawback of our solution is that it’s difficult to add more partitions without high engineering costs. We are currently exploring tools that can scale automatically so that adding capacity is a simpler task. One thing that helps this effort is that, early in our development, we deliberately separated our API and model code from the underlying storage. It’s important to choose a data model that is efficient and flexible, and a storage engine that can easily adapt to unexpected events. Make sure it meets the customer requirements, I/O requirements, and processing requirements. Keep in mind operational requirements and growth. Make sure it still works at 20x your projections. If you’re developing an application that’s data centric, you might need to detect separate records that refer to the same entities.
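As a rough picture of the partitioning described in the storage notes above, the sketch below spreads keys across several key-value partitions and measures how many keys would have to move if a partition were added. The hash-modulo placement is an assumption made for illustration (the talk does not say how keys map to partitions), and the dictionaries stand in for separate databases.

```python
import hashlib


def partition_for(key, partition_count):
    """Map a key to a partition index using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


class PartitionedStore:
    """Key-value store spread across several partitions (dicts stand in for DBs)."""

    def __init__(self, partition_count):
        self.partitions = [{} for _ in range(partition_count)]

    def _partition(self, key):
        return self.partitions[partition_for(key, len(self.partitions))]

    def put(self, key, value):
        self._partition(key)[key] = value

    def get(self, key):
        return self._partition(key).get(key)


def moved_fraction(keys, old_count, new_count):
    """Fraction of keys whose partition changes when the count changes."""
    moved = sum(1 for k in keys
                if partition_for(k, old_count) != partition_for(k, new_count))
    return moved / len(keys)
```

Under this simple scheme, going from 4 to 5 partitions relocates most keys, which is one way to see why adding partitions can carry a high engineering cost and why keeping API and model code separate from storage leaves room to change the scheme later.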
That means you’ll need an entity resolution system. The InfoLab at Stanford University defines entity resolution as “locating and merging records that refer to the same real-world entities”. In our case, we needed one to match names. If an entity resolution system is required for your application, you want to make sure it is tunable and performs well. The obvious choice when it comes to developing large scale entity resolution systems is machine learning. We originally opted not to do machine learning because we had a predefined set of rules we thought were correct. As we tried to implement our system, we learned that it wasn’t as simple as we thought. You might want to consider machine learning up front because, in hindsight, we could have explored it more. The first step in building our entity resolution system was defining the rules that would resolve two entities together. For example, to resolve two contacts together, they have to have a last name match from two different sources, while the first name can be a nickname or a complete match. We started with a decision tree to support the rules we had outlined. One drawback of the decision tree is that it scales very well vertically, if you were to add additional rules, but doesn’t scale very well horizontally. In our example, if we were to add a few more social networks to match the contacts against, it wouldn’t be easy. We wrote tools that let us run samples of user data through the rules to see whether we got the expected results. This sped up iteration significantly as we improved our match rate and the resolution engine.
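The example rule above (last names match across two different sources, first name is a nickname or a complete match) and the sample-checking tools can be sketched as follows. This is a simplified illustration, not our decision tree: the nickname table, record layout, and function names are all hypothetical.

```python
# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def canonical_first(name):
    """Normalize a first name and map known nicknames to a full name."""
    normalized = name.strip().lower()
    return NICKNAMES.get(normalized, normalized)


def same_entity(a, b):
    """Apply the example rule to two records with 'source', 'first', 'last'."""
    if a["source"] == b["source"]:
        return False  # the rule requires two different sources
    if a["last"].strip().lower() != b["last"].strip().lower():
        return False  # last names must match
    return canonical_first(a["first"]) == canonical_first(b["first"])


def failing_samples(samples):
    """Return the labeled samples where the rule disagrees with the label.

    Each sample is (record_a, record_b, expected_match).
    """
    return [s for s in samples if same_entity(s[0], s[1]) != s[2]]
```

Running a labeled sample set through `failing_samples` after every rule change gives a quick signal on whether the change helped or hurt the match rate.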
I’d like to close out our talk with what to keep in mind when developing mobile applications. The mobile team at WhitePages has developed several applications in the past, but this one was particularly interesting and we came out of it with several takeaways. When you are conceptualizing an idea, the first step is to evaluate the feasibility of the product by exploring various platforms and evaluating the capabilities available to you. For instance, in our case iOS doesn’t provide access to call history or any kind of call/text communication data. The platform we targeted for Current was Android, as it gives us the most access to enable caller ID functionality. We also noticed during development that, since we were working with private APIs, there was a lot of variation between implementations on different carriers and manufacturers. For example: 1) Annotation of call type and notifications of incoming/outgoing calls differ among devices and carriers. 2) Current allows blocking calls and texts, and for blocking calls on HTC we had to set additional state so it would respond to pick-up and hang-up API calls. To avoid surprises, we’d highly recommend defining your device matrix for testing well ahead of time. Note that this matrix can be very different from the top devices published for the platform, depending on the demographics you are targeting and the nature of your product, so do your research early.