Scaling to billions of people and places - QCon London 2010


  • Hi everyone, thanks everyone for coming today. I’m Josh, I’m from Nokia in Berlin and I’m going to talk today about people, places, maps, and building location based services. As I go through the talk here, I’m happy to answer any questions you may have as well, so please feel free to just shoot up your arm or yell out if something comes up.
  • [BACKGROUND]
    - Probably obvious who Nokia is and what we do, or traditionally have done: we make these things!
    - Shift from manufacturing to services in the last few years with the Ovi brand
    - Currently 4 major areas for Ovi services: Store, Music, Messaging and Maps
    - Since I work for Maps, I can really only talk about that
    - More specifically, the group I work in is called Places; you can think of us as the points of interest or POI managers, the ones that deal with all of the things we know about a place
  • [BUBBLE PEOPLE]
    - More affectionately, we’re known as “the bubble people”!
    - Of course there’s more to us than just bubbles, but the name certainly does make light of the fact that we are just a piece of the overall Maps puzzle
    - Integrate with: search, physical vector maps, devices, web, many supporting teams to deliver a complete product
    - More pertinent here is maybe the startup story of Ovi services within a massively scaled, efficient organization and production machine
    - This is an organization that does things on the scale of tens and hundreds of millions, every year
    [TRANSITION into talking about scale and efficiency]
  • Sources: Nokia free navigation press release from 21 January; Nokia Q4 results announcement from 28 January; CEO keynote at CES, January 2010
    - To put some real numbers and context to the scale of what we’re dealing with here
    - Let’s just say there’s lots of phones in use all over the world!
  • Photo by Professor Quentin Ziplash
    - In the context of this talk, of course we have to mention GPS-enabled devices
    - More than 82M GPS-enabled devices shipped since the N95 launched in 2007 – the first GPS phone
    - Reach that keeps on increasing as more and more devices are built and sold that support GPS
    - Basically there’s a whole bunch of devices/computers out there with capabilities that have changed the way we build products
    - Products are now built up from normal web technologies: JavaScript, AJAX, REST and so on
    [extra points]
    - Installation base covers about 10 different device models as well
    - Since the announcement of free walk and drive navigation at the end of January we have had over 3.5M Maps downloads
    - Digital maps for 180 countries, 75 of which have navigation, covering 650K cities, 28M km of road and a population of about 1.5B people
  • What does this all mean to the consumer?
    - In our case, as I mentioned earlier, we’re in the business of managing places
    - We saw the bubble on the web and here it is again
    - And again on the device, although tailored for the user experience you might have on the device
    [TRANSITION]
  • [TRANSITION to big scale]
    - Okay, so hopefully I’ve set the stage a little bit and given some context of where the rest of the talk is coming from
    - But really what we’re here to talk about is stuff that’s big and the consequences of building and operating at scale
    - I think the first thing that comes to mind for me when we talk about size and scale is probably just branding and the name Nokia
    - In a lot of countries, particularly in Asia, Nokia is even regarded as one of the most trusted and well known brands, and this in itself has a whole bunch of consequences
    - Being big naturally also leads to having a very broad public image and generally high visibility to the public and the press
    - Everyone is, at some point, looking at what you are doing, from the press to bloggers to other people in the organisation
    - So when you set out to do something at scale with these kinds of pressures, you better make damn sure that you’ve got a plan and your head screwed on straight
    - For example, when we do things like over-the-air software updates, we really have the potential, the ability, to completely hose everyone's phones, and that’s literally millions of consumers
    - We have a pretty big army of test devices that we put through the paces to make sure this doesn’t happen, but the potential is there
    - The last point here speaks a bit to the speed at which we try to operate
    - Being an existing, successful, global company means that you have high public visibility not only in one place, like Europe or North America, but really worldwide
    - So again, when you launch some product or launch something like Ovi services, you have to do it at a massive scale for customers all over the world
    [cover in the first 5 mins, ideally]
  • Photo by Genista
    [TRANSITION to global presence]
    - Okay, speaking of getting things done, let’s dive into some details
    - I mentioned a minute ago that being big is hard because our services need to be accessible everywhere
    - And really this means not just being available but being usable everywhere, even in the face of varying worldwide network latency
    - To do this… CDNs, “edge applications”, and data centers all around the world
    - We are benchmarking from around the world to try to feel what a user feels, using external services like Keynote
    - Australia story, use case
    - The wire is always there, regardless of scale and performance
  • [TRANSITION to dataset talk]
    - So being global is one dimension of what we operate in
    - Something that we faced almost immediately that is a bit different from maybe typical web applications is that our dataset is really one of our biggest assets, and it also has multiple dimensions to it
    - Part of the immediate scale consequence is that we need to be immediately useful as well, implying that we need to start with a large dataset
    - Started with tens of millions of places, quite a lot of which were sourced from Navteq, our sister company
    - Targets are for upwards of 500M (half a billion) places == 1 place per 13 people in the world
  • Photo by Dr. Jaus
    - Scale needs constant love and attention
    - Insatiable appetite of the consumer to always be adding more places and adding to the density
    - Both as places are created and shut down every day around the world
    - Countries and cities are growing, populations are becoming more dense
  • Speaking of density…
    - 1 sq km of San Francisco
    - Nearly 10K places
  • Photo by mikebaird
    - In order to fulfill this growth in the dataset, we rely on partners
    - 20+ major content partners
    - Not just any partners but really strategic ones to help us fill in gaps internationally and cover the whole world
    - So we have to deal with growth of not just the dataset but providers as well
    - Constant new suppliers – how to onboard, partnerships, non-tech and tech problems
  • [technical interlude]
    Photo by twicepic
    [2 problems we face coming from these varying dimensions of the dataset – storage and matching]
    - Geo-based datasets are not perfectly suitable for simple key-value stores since you are doing a lot of range queries and such, and not just a fetch on ID
    - Need lots of indexing to make that work, and not just regular indexing but real spatial indexing too
    - Right away relied on simple MySQL, RDBMS
    - At startup, didn’t want to rely on bleeding edge technology, or at least on things that we have no experience with as developers or ops
    - Much easier to find people with really strong MySQL experience
    - We knew that MySQL could handle tens of millions of records, so we at least had a starting place
    - Now that we have established services with MySQL, we’re moving on to look at more sophisticated geo-indexing and NoSQL data stores since we plan on eventually reaching the hundreds of millions of place records
    - At the moment we are using Lucene, which I’ll talk more about in a minute, as a basis for some of our geo search
    - We’re also looking most heavily at CouchDB and Project Voldemort right now as our future data store, hopefully in production in the next 6 months or so
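To illustrate why a plain key-value fetch on ID is not enough for geo data: even a crude fixed-grid index lets a bounding-box query scan only the cells it overlaps rather than the whole dataset. This is a minimal sketch in Python (the cell size, class names, and place IDs are all illustrative, not the production MySQL/Lucene schema):

```python
from collections import defaultdict

CELL = 0.5  # grid cell size in degrees (an assumed value for illustration)

def cell_of(lat, lon):
    # map a coordinate to its containing grid cell
    return (int(lat // CELL), int(lon // CELL))

class GridIndex:
    def __init__(self):
        self.cells = defaultdict(list)

    def add(self, place_id, lat, lon):
        self.cells[cell_of(lat, lon)].append((place_id, lat, lon))

    def query_bbox(self, min_lat, min_lon, max_lat, max_lon):
        # scan only the cells the bounding box overlaps, then filter exactly
        results = []
        lo = cell_of(min_lat, min_lon)
        hi = cell_of(max_lat, max_lon)
        for ci in range(lo[0], hi[0] + 1):
            for cj in range(lo[1], hi[1] + 1):
                for pid, lat, lon in self.cells[(ci, cj)]:
                    if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
                        results.append(pid)
        return results

index = GridIndex()
index.add("brandenburg-gate", 52.5163, 13.3777)
index.add("eiffel-tower", 48.8584, 2.2945)
hits = index.query_bbox(52.0, 13.0, 53.0, 14.0)  # a box around Berlin
```

A real spatial index (R-tree, geohash prefixes, Lucene spatial) refines the same idea; the point is that range queries need structure a pure ID lookup cannot provide.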
  • [technical interlude]
    [2 problems we face coming from these varying dimensions of the dataset – storage and deduplication]
    - One of our goals is to create not just a huge dataset, but to create a dataset that is accurate and precise
    - And not just in terms of metadata -- that is, making sure that the address is correct and accurate -- but also meaning that we have only one official representation of every place
    - So given two representations of a place, how do we go about deduplicating and matching up two of the same place representations?
    - The best way to show this is really with an example, so let’s have a look at some of the steps we go through to do this
    - STEPS:
      - narrow down to a geo region
      - narrow down to a category
      - coarse-grained steps using indexing
      - normalise the address string using external geocoding services, again from our sister company Navteq
      - normalise the name
      - name comparison using things like Levenshtein distance
      - fine-grained steps in-memory
    - Challenges:
      - Super dense cities
      - Non-tokenizable languages
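The name-normalisation and Levenshtein-comparison steps above can be sketched as follows. The normalisation rules and the distance threshold are assumptions for illustration only, not the production matcher (which also uses the geo-region, category, and address steps first):

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalise(name):
    # crude name normalisation (assumed rules): lowercase, strip commas,
    # drop leading articles
    words = name.lower().replace(",", "").split()
    return " ".join(w for w in words if w not in {"the", "a"})

def looks_like_duplicate(name_a, name_b, max_distance=2):
    # compare normalised names; threshold is an illustrative choice
    return levenshtein(normalise(name_a), normalise(name_b)) <= max_distance

dup = looks_like_duplicate("The Hotel Adlon", "Hotel Adlon")
```

In practice the city portion ("…, Berlin") would already have been stripped by the address-normalisation step before names are compared.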
  • Tens of millions of places in the dataset and growing daily
    - The dataset has special characteristics though; it’s not just any dataset
    - This is not an actual graph of POIs but gives you a sense of just how much the density changes from area to area
    - Sparse dataset and the nature of it in relation to caching and sourcing and crowd-sourcing, market share, developing nations, etc.
    - Canada as an example (hockey loving nation): look at the difference in density and sparseness
    - Long tail story – if you are in town X, which is not a minor city, but Nokia didn’t care about it: bad experience, bad for trust, etc.
    - You have to pay some attention to the entire spectrum
    - Not just sparseness but density – how do you design UI, algorithms, etc. to work in both dense and sparse? There is no average square kilometer – Canada vs Beijing. Very different experiences.
  • [technical interlude]
    - Non-trivial, text-based search adds a new dimension
    - Best way to look at this is with a couple of examples
    1: Phoenix Bar in Warsaw, Poland? Phoenix Bar in Warsaw, Indiana? Semantics matter; word ordering can matter, but not always; not just about stopwords or tokenization
    - And I’ve purposely put all of this in lower case text since, particularly on the phone, search requests are not punctuated or capitalized properly, so you can’t always rely on that as hints to the meaning of words
    2: Is this a bar in Paris or a bar in Berlin? Context matters: where are you searching from?
    3: Museum in New York (alt name)? Misspelled city name? Momo, a town in Gabon? Alternative place names matter, common spelling mistakes matter
    4: A bar in The, a town in Burgundy, France? Or literally a place called “The Bar” in whatever city you are currently searching from?
    - So, really, non-trivial problems, and can’t just apply simple rules to everything
    - Machine learning, human input, evaluation of search queries, and so on
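One way to see why context matters is a toy ranking that mixes textual match with proximity to the searcher, so that “paris bar” issued from Berlin prefers the Berlin bar named “Paris Bar” over bars in Paris. The weights, scoring formula, and candidate data below are all hypothetical, purely to illustrate the trade-off:

```python
import math

def distance_km(a, b):
    # equirectangular approximation; fine at city scale
    lat1, lon1 = a
    lat2, lon2 = b
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return 6371 * math.hypot(x, y)

def score(query, place, user_pos):
    # 0/1 textual match plus a proximity term that decays with distance;
    # the 0.7/0.3 weights are invented for this sketch
    name_match = 1.0 if query.lower() in place["name"].lower() else 0.0
    proximity = 1.0 / (1.0 + distance_km(user_pos, place["pos"]) / 10.0)
    return 0.7 * name_match + 0.3 * proximity

candidates = [
    {"name": "Paris Bar", "pos": (52.5046, 13.3266)},  # a bar in Berlin
    {"name": "Le Bar", "pos": (48.8566, 2.3522)},      # a bar in Paris
]
user_in_berlin = (52.52, 13.405)
best = max(candidates, key=lambda p: score("paris bar", p, user_in_berlin))
```

A production ranker learns these weights from query logs and human evaluation rather than hard-coding them, which is the “machine learning, human input” point above.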
  • Photo by th.omas
    [TRANSITION from dataset to traffic]
    - Okay, so we’ve seen what the dataset looks like and some of the unique aspects of it
    - Of course the thing that all of us as public facing services have to deal with at some point is traffic and how to deal with it
    - Already mentioned the use of CDNs and edge applications for dealing with worldwide distribution, but of course these are used as well to deal with traffic and, more specifically, traffic spikes
    - Of course caching in multiple places is something that helps deal with traffic spikes as well
    - I don’t think there’s any real big secret there, but sometimes traffic spikes are not exactly what you expect…
    [NEXT]
  • Sometimes a spike isn’t just a spike
    - We cause our own traffic spikes… i.e. marketing campaigns
    - Not only just a spike, but a big uptake in overall, long term users, bringing you up to the next level; people discover just how good things are
    - Maybe nothing technical that we can do in addition, but people need to be informed in lots of places, data centers, etc.
    - Coordination with multiple parts of the organization goes a long way to manage expected spikes and capacity planning
  • Photo by kenleewrites
    - So if public announcements are the predictable side of scale, the complement to that is the fact that we know our user base will grow; we just don’t quite know exactly when and how fast this will happen
    - To try and predict some of this we do typical things like looking at device sales figures and our market growth in various parts of the world
    - But this can only take you so far
    - When we’re looking at where we need to go in the future with speed and size of scale, we look at first trying to prove that we need to scale or use a certain technology
    - It’s about pragmatic optimisation and extrapolation of existing data that we have
    - Using real measurements and not just guessing
  • Speaking of measuring and scale, I want to show a few slides on caching effectiveness and our geo dataset
    - Linear traffic histogram to show the effectiveness of caching given a mixed sparse and dense dataset
    - Attempts to illustrate the hits that would not benefit from caching (hit only once in this timeframe); this is the first column here
    - Every other column shows the number of places that were fetched a particular number of times
    - So basically the long tail of this shows the variety of popularity of places
    [NEXT]
    - The graph is a bit at fault, as you just saw, since with this scale you can’t really see the really low values
    [NEXT]
  • To really see the long tail, this logarithmic projection is better
    - We can more easily see that the distribution of even the cache hits is not very even in the long tail
    - Such a variety of how the rest of the places are accessed
    - Are there smart things that we can do given the nature of the dataset?
    - Pre-seed cache? Smart cache miss algorithms?
    - Some of this we don’t have answers to (yet)
    - But this really comes down to doing good analysis and estimation
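The histograms described above can be reproduced in miniature: count fetches per place, then bucket places by fetch count. The bucket for places fetched exactly once is the traffic caching cannot help. This is a toy sketch over an invented access list, not the actual analysis job:

```python
from collections import Counter

# toy stand-in for millions of access-log lines: one place ID per fetch
fetches = ["p1", "p2", "p1", "p3", "p1", "p2", "p4", "p5"]

hits_per_place = Counter(fetches)             # place ID -> number of fetches
histogram = Counter(hits_per_place.values())  # fetch count -> number of places

# places fetched exactly once in this window gain nothing from caching
uncacheable = histogram[1]
```

Plotting `histogram` on a linear axis gives the first graph; plotting it with a logarithmic axis exposes the long tail, as the notes describe.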
  • [technology interlude]
    - We collect a ton of logs containing a lot of data and this is often the basis for our analysis and estimation
    - We collect usage statistics for specific features, all access logs of course from Tomcat and Apache, and we have to somehow make sense of this
    - To comb through all of this and churn out some pretty graphs, we built a small Hadoop cluster consisting of 60 cores and about 50TB of raw storage space
    - The graphs you just saw were created using standard Tomcat access log files generated from our core place registry service
    - I ran about 17M lines of access logs through Hadoop and really it was relatively easy to do this
    - For looking at just standard access logs, we mostly use Pig
    - For those that haven’t seen or used Pig, it’s a high-level language that lets you express data analysis jobs in a far easier way than being forced to write Java MapReduce jobs
    [NEXT]
    - The top bit of code here shows really how simple it is to do something like filter out specific requests from some standard Apache or Tomcat log file
    - Here I’m just looking for all successful GET requests
    [NEXT]
    - This next snippet shows an aggregation function which lets us count up the number of hits on each URI
    - You can probably already start to see how those graphs from the previous slide were generated
    - If anyone is interested in the nitty gritty details, I’ve got a blog post on using Pig and gnuplot to create those caching graphs
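For readers without a Hadoop cluster at hand, the same filter-and-count job (successful GETs, hits per URI) can be sketched in plain Python. The log line layout here is an assumed Apache/Tomcat combined format, not necessarily the exact production one:

```python
import re

# pull method, URI, and status out of a combined-format access log line
LINE = re.compile(r'"(?P<method>\S+) (?P<uri>\S+) \S+" (?P<status>\d{3})')

def count_hits(lines):
    # equivalent of the Pig FILTER + GROUP + COUNT on the slide
    counts = {}
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue
        if m.group("method") == "GET" and 200 <= int(m.group("status")) < 300:
            uri = m.group("uri")
            counts[uri] = counts.get(uri, 0) + 1
    return counts

log = [
    '1.2.3.4 - - [01/Mar/2010] "GET /places/123 HTTP/1.1" 200 512',
    '1.2.3.4 - - [01/Mar/2010] "GET /places/123 HTTP/1.1" 200 512',
    '1.2.3.4 - - [01/Mar/2010] "POST /places HTTP/1.1" 201 17',
    '1.2.3.4 - - [01/Mar/2010] "GET /places/999 HTTP/1.1" 404 0',
]
hits = count_hits(log)
```

At 17M lines this single-process version would still work; Pig on Hadoop earns its keep when the logs no longer fit comfortably on one machine.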
  • [photo needed]
    - Okay, so we’ve talked a lot about scale and being big and what exactly that means
    - But of course, being a device company as well, we have to mention what is special about services that are consumed by mobile devices
    - What things do you need to take care of, or at least be mindful of?
    [NEXT]
    - Well, for starters, it’s important to see that mobile is not only a concern for us but really it’s something that affects everyone
    - And since we’ve been talking about big, well, I think that quote says it all
  • Photo by Mrs Logic
    - Latency is maybe one of the first things that comes to mind
    - Mobile networks are very different than regular networks
    - Have to deal with things like GSM modems starting up, session startup and teardown and such things that affect the end user’s experience
    - In the end, it’s about optimising the right areas and taking a holistic view of your service
    - Can make services as fast as possible, but without good smart clients that do things like deal with network sessions appropriately, it could be all for naught
    - The user won’t see all of the work you’ve put in on your service
  • Photo by Mathieu Ramage
    - per month average for users of Maps
    - Other maps products are 10x this amount
    - Why? Basically because people load their devices with maps beforehand, leaving bandwidth use only for enhanced features that require you to be online
    - Cost to the consumer – not everyone has flat data plans; not all networks in the world can even do this, in places like India, China, etc.
    - Data roaming is $$$
  • Photo by makerbot
    - People are not always online; people’s connections drop out
    - Plus roaming costs; nobody wants to be travelling and pay roaming costs just to get a city guide or find their way to the Roman Coliseum
    - We can’t always rely on a connection to the internet to be there
    - How do we deal with this?
    - Like I mentioned a minute ago, for starters we let people load up the maps they want onto their phones
    - Second is that the maps data contains not only the actual digital vector maps, but place and address data as well
    - This lets you do things like offline search for addresses and landmarks
    - And really, I’ve travelled a lot with these phones and it’s a real godsend when you’re tired, fresh off the plane in a strange city
  • [TRANSITION to wrap up]
    - Reach of more classes of devices
  • And just to be clear, I’ve talked a lot about big things and scale
    - Don’t want to give you the wrong impression about what it’s like in the trenches
    - Here’s a couple of people from one of our teams
    - And like the previous slide said, we are truly looking for the best
    - We try to work with the best too
    - In the middle there is Simon, who is a Lucene committer and very active in the Hadoop community as well
    - We also work with top notch Hadoop consultants, SpringSource, and ThoughtWorks for architecture and continuous integration and continuous deployment too
  • -And on that note…

    1. Scaling to billions of people and places
       QCon London 2010
       Josh Devins
       Copyright 2010 Nokia
    2.
    3.
    4. Just what scale are we talking about?
       Since the start of this talk
       1K+ Nokia devices were made and sold (13/sec)
       15M+ phone calls were made using Nokia phones
       3M+ text messages were sent using Nokia phones
       At any given moment there are
       350 Nokia devices at the Taj Mahal
       525 Nokia devices at the Eiffel Tower
       5000 Nokia devices at Disney World
       6750 Nokia devices in the Forbidden City
    5. 220 countries and territories
    6. 46 languages
    7. Billions of devices
    8. 82 million GPS devices
    9. What consumers see
    10. With this much reach…
        Brand expectations
        Public visibility
        Nearly immediate scale
    11. Global presence
    12. Legal implications
    13. A dataset of all the places in the world
    14. A dataset that is always growing
    15. A dense dataset
    16. New suppliers arrive all the time
    17. Technology: Storage
    18. Data correctness is vital
    19. Technology: Deduplication
        “Hotel Adlon” / “The Hotel Adlon, Berlin”
    20. The data is not evenly distributed
    21. Technology: Geospatial search
        “phoenix bar in warsaw”
        “paris bar berlin”
        “moma”
        “the bar”
    22. Coping with traffic spikes
    23. Coordinating with public announcements
    24. Anticipating scale
        (but avoiding premature optimisation)
    25. Caching
    26. Caching
    27. Technology: Hadoop and Pig
        logs = FILTER logs BY method == 'GET' AND statusCode >= 200 AND statusCode < 300;
        groupedByUri = GROUP logs BY uri;
        uriCounts = FOREACH groupedByUri GENERATE group AS uri, COUNT(logs) AS numHits;
    28. What’s special about mobile?
        “There are four times as many mobile subscribers in the world as there are installed PCs.”
        - Financial Times
    29. Serving to mobiles: latency
    30. Serving to mobiles: bandwidth
    31. Serving to mobiles: offline
    32. What about the future?
        More devices
        Common web runtimes
        Merger of Maemo and Moblin into MeeGo
        All equals larger reach
    33. Come work with us!
        We’re looking for the best developers, testers, architects, designers
        Send your CV
    34.
    35. Thanks! Any questions?
        Josh Devins
        Come work with us in Berlin
        Free beer! Whittle Room, 5:30pm