Hadoop
Distributed Filesystem
✓ Files as big as you want
✓ Horizontal scalability
✓ Failover
Distributed Computing
✓ MapReduce
✓ Batch oriented
  • Input files are processed and converted into output files
✓ Horizontal scalability
Tuple MapReduce implementation for Hadoop
Easier Hadoop Java API
✓ While keeping similar efficiency
Common design patterns covered
✓ Compound records
✓ Secondary sorting
✓ Joins
Other improvements
✓ Instance-based configuration
✓ First-class multiple input/output
Tuple MapReduce
Our evolution of Google's MapReduce

Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: "Tuple MapReduce: Beyond Classic MapReduce." In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, December 10–13, 2012.
Tuple MapReduce
Sales difference between the top-selling offices for each location
Tuple MapReduce
Main constraint
✓ The group-by clause must be a subset of the sort-by clause
Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
  • Pangool → Tuple MapReduce over Hadoop
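A way to see why this constraint is sufficient: if the group-by fields are a prefix of the sort-by fields, a single sort leaves every group contiguous, so each group can be consumed in one linear scan. A minimal sketch in plain Python (the records and field layout are made up for illustration; this shows the idea, not Pangool's API):

```python
from itertools import groupby
from operator import itemgetter

# (shop, card, amount) tuples; we want to group by shop.
records = [
    ("Shop 2", "5678", 10.0),
    ("Shop 1", "1234", 5.0),
    ("Shop 1", "5678", 7.5),
]

# Because the group-by key (shop) is a prefix of the sort-by key
# (shop, card), one sort makes every group contiguous.
records.sort(key=itemgetter(0, 1))

grouped = {shop: [r[1] for r in rows]
           for shop, rows in groupby(records, key=itemgetter(0))}
print(grouped)  # {'Shop 1': ['1234', '5678'], 'Shop 2': ['5678']}
```

If the group-by fields were *not* a subset of the sort-by fields, records of the same group could be interleaved with others and a single scan would no longer suffice.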
Efficiency
Similar efficiency to Hadoop
http://pangool.net/benchmark.html
Voldemort & Hadoop
Benefits
✓ Scalability & failover
✓ Updating the database does not affect serving queries
✓ All data is replaced at each execution
  • Provides agility/flexibility
    § Big development changes are not a pain
  • Easier recovery from human errors
    § Fix the code and run again
  • Easy to set up new clusters with different topologies
Basic statistics
Easy to implement with Pangool/Hadoop
✓ One job, grouping by the dimension over which you want to calculate the statistics
Count | Average | Min | Max | Stdev
Computing several time periods in the same job
✓ Use the mapper to replicate each datum for each period
✓ Add a period identifier field to the tuple and include it in the group-by clause
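The period-replication trick can be sketched outside Hadoop: the "map" step emits one copy of each datum per matching period, with the period identifier added to the group key, and the "reduce" step computes the statistics per group. The data, period names, and field layout below are illustrative assumptions:

```python
import math
from collections import defaultdict

# Hypothetical sales: (shop, day, amount), with day as an int offset.
sales = [("Shop 1", 3, 10.0), ("Shop 1", 40, 30.0), ("Shop 2", 3, 5.0)]

# Each datum is replicated once per period it falls into, and the
# period identifier becomes part of the group-by key.
periods = {"last_week": lambda d: d <= 7, "last_quarter": lambda d: d <= 90}

groups = defaultdict(list)          # "reduce" side: values per group key
for shop, day, amount in sales:     # "map" side: replicate per period
    for name, contains in periods.items():
        if contains(day):
            groups[(name, shop)].append(amount)

def stats(xs):
    n, mean = len(xs), sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / n   # population variance
    return {"count": n, "avg": mean, "min": min(xs),
            "max": max(xs), "stdev": math.sqrt(var)}

result = {key: stats(values) for key, values in groups.items()}
```

All five statistics come out of one pass over each group, which is why a single job suffices.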
Distinct count
Possible to compute in a single job
✓ Using secondary sorting on the field you want to distinct-count
✓ Detecting changes in that field
Example
✓ Group by shop, sort by shop and card

Shop    Card
Shop 1  1234   ← change: +1
Shop 1  1234
Shop 1  1234
Shop 1  5678   ← change: +1
Shop 1  5678
→ 2 distinct buyers for Shop 1
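The change-detection idea can be sketched in plain Python: within each shop group, the secondary sort makes equal cards adjacent, so counting the places where the card value changes gives the distinct count without holding a set of all cards in memory. Data below is illustrative:

```python
from itertools import groupby
from operator import itemgetter

# (shop, card) purchase records, possibly repeated.
purchases = [("Shop 1", "1234"), ("Shop 1", "5678"), ("Shop 1", "1234"),
             ("Shop 1", "1234"), ("Shop 1", "5678"), ("Shop 2", "1234")]
purchases.sort()  # sort by (shop, card): group by shop, secondary sort by card

distinct_buyers = {}
for shop, rows in groupby(purchases, key=itemgetter(0)):
    count, previous_card = 0, None
    for _, card in rows:            # cards arrive in sorted order
        if card != previous_card:   # a change marks a new distinct card
            count += 1
            previous_card = card
    distinct_buyers[shop] = count
```

Only one previous value per group needs to be remembered, which is what makes this work in a single streaming reduce.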
Histograms
Typically a two-pass algorithm
✓ First pass: detect the minimum and the maximum and determine the bin ranges
✓ Second pass: count the number of occurrences in each bin
Adaptive histogram
✓ One pass
✓ Fixed number of bins
✓ Bins adapt
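One way the one-pass adaptive behavior can be realized is with a merge strategy in the style of streaming histograms: keep at most a fixed number of (center, count) bins, and whenever one too many accumulates, merge the closest adjacent pair. This is an assumed approach for illustration, not necessarily the exact algorithm used in the project:

```python
def add(bins, value, max_bins):
    """Insert one value; merge the two closest bins if over capacity."""
    bins.append((value, 1))
    bins.sort()
    if len(bins) > max_bins:
        # find the adjacent pair with the smallest gap and merge it,
        # keeping the count-weighted average as the new bin center
        i = min(range(len(bins) - 1),
                key=lambda j: bins[j + 1][0] - bins[j][0])
        (v1, c1), (v2, c2) = bins[i], bins[i + 1]
        bins[i:i + 2] = [((v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2)]
    return bins

bins = []
for x in [1.0, 1.1, 5.0, 5.2, 9.0, 9.1, 1.05]:
    bins = add(bins, x, max_bins=3)
# the three surviving bins adapt their centers to the data's clusters
```

No prior pass over the data is needed, and memory stays bounded by `max_bins`, which is what makes the approach suitable for a single MapReduce pass.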
Optimal histogram
Calculate the histogram that best represents the original one using a limited number of flexible-width bins
✓ Reduces storage needs
✓ More representative than fixed-width ones → better visualization
Optimal histogram
Exact algorithm
Petri Kontkanen, Petri Myllymäki: MDL Histogram Density Estimation
http://eprints.pascal-network.org/archive/00002983/
Too slow for production use
Optimal histogram
Alternative: approximate algorithm
Random-restart hill climbing
✓ A solution is just a way of grouping existing bins
✓ From a solution, you can move to some close solutions
✓ Some are better: they reduce the representation error
Algorithm
1. Iterate N times, keeping the best solution:
   1. Generate a random solution
   2. Iterate until no improvement:
      1. Move to the best improving neighboring solution
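The random-restart hill climbing above can be sketched as follows. A solution is a set of cut points grouping the fine-grained bins into k variable-width bins; a neighbor moves one cut point by one position. The error measure used here (sum of squared deviations of each fine-grained count from its group's mean) is an illustrative choice, not necessarily the one used in production:

```python
import random

def error(counts, cuts):
    """Representation error of grouping `counts` at the given cut points."""
    total, edges = 0.0, [0] + cuts + [len(counts)]
    for lo, hi in zip(edges, edges[1:]):
        group = counts[lo:hi]
        mean = sum(group) / len(group)
        total += sum((c - mean) ** 2 for c in group)
    return total

def neighbors(cuts, n):
    """Solutions reachable by moving a single cut point one step."""
    for i in range(len(cuts)):
        for delta in (-1, 1):
            c = sorted(set(cuts[:i] + [cuts[i] + delta] + cuts[i + 1:]))
            if len(c) == len(cuts) and 0 < c[0] and c[-1] < n:
                yield c

def hill_climb(counts, k, restarts=20, seed=0):
    rng, best = random.Random(seed), None
    for _ in range(restarts):                 # random restarts
        cuts = sorted(rng.sample(range(1, len(counts)), k - 1))
        while True:                           # greedy descent
            candidates = list(neighbors(cuts, len(counts)))
            step = min(candidates, key=lambda c: error(counts, c))
            if candidates and error(counts, step) < error(counts, cuts):
                cuts = step
            else:
                break
        if best is None or error(counts, cuts) < error(counts, best):
            best = cuts
    return best

counts = [10, 10, 10, 1, 1, 1, 20, 20, 20]
best = hill_climb(counts, 3)   # the exact optimum here is cuts [3, 6] (error 0)
```

Each restart is cheap because a neighborhood only has up to 2(k-1) candidates, which is where the order-of-magnitude speedup over the exact algorithm comes from.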
Optimal histogram
Alternative: approximate algorithm
Random-restart hill climbing
✓ One order of magnitude faster
✓ 99% accuracy
Everything in one job
Basic statistics → 1 job
Distinct count statistics → 1 job
One-pass histograms → 1 job
Several periods & shops → 1 job
We can put all of this together so that computing all statistics for all shops fits into exactly one job
Shop recommendations
Based on co-occurrences
✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
✓ Only one co-occurrence is counted even if a buyer bought several times in A and B
✓ The top co-occurrences for each shop are its recommendations
Improvements
✓ The most popular shops are filtered out, because almost everybody buys in them
✓ Recommendations by category, by location, and by both
✓ Different calculation periods
Shop recommendations
Implemented in Pangool
✓ Using its counting and joining capabilities
✓ Several jobs
Challenges
✓ If somebody bought in many shops, the list of co-occurrences can explode:
  • Co-occurrences = N * (N – 1), where N = # of distinct shops where the person bought
✓ Alleviated by limiting the total number of distinct shops considered
✓ Only the top M shops where the client bought the most are used
Future
✓ Time-aware co-occurrences: the client bought in A and B within a short period of time
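The core counting logic (collapsing repeat purchases per buyer, capping the number of shops per buyer, and ranking co-occurring shops) can be sketched in plain Python. Buyer names and data are made up, and the cap here simply truncates the sorted shop list rather than keeping the top M by purchase volume as the real system does:

```python
from collections import defaultdict
from itertools import combinations

# (buyer, shop) purchase records.
purchases = [("ann", "A"), ("ann", "B"), ("ann", "B"), ("ann", "C"),
             ("bob", "A"), ("bob", "B"), ("eve", "A"), ("eve", "C")]

shops_by_buyer = defaultdict(set)   # set collapses repeat purchases,
for buyer, shop in purchases:       # so a pair counts once per buyer
    shops_by_buyer[buyer].add(shop)

M = 10                              # cap to tame the N*(N-1) blow-up
cooc = defaultdict(int)
for shops in shops_by_buyer.values():
    for a, b in combinations(sorted(shops)[:M], 2):
        cooc[(a, b)] += 1           # one undirected co-occurrence

def recommend(shop, top=2):
    """Shops most often co-occurring with `shop`, best first."""
    scores = []
    for (a, b), n in cooc.items():
        if shop in (a, b):
            scores.append((b if a == shop else a, n))
    scores.sort(key=lambda t: -t[1])
    return [s for s, _ in scores[:top]]
```

In the distributed version each of these stages maps naturally onto a grouping or joining job, which is why several jobs are needed.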
Some numbers
Estimated resources needed for 1 year of data
270 GB of stats to serve
24 large instances
~11 hours of execution
$3,500/month
✓ Optimizations still possible
✓ Cost without the use of reserved instances
✓ Probably cheaper with an in-house Hadoop cluster
Conclusion
It was possible to develop a Big Data solution for a bank
✓ With low resource usage
✓ Quickly
✓ Thanks to technologies like Hadoop, Amazon Web Services and NoSQL databases
The solution is
✓ Scalable
✓ Flexible/agile: improvements are easy to implement
✓ Prepared to withstand human errors
✓ Available at a reasonable cost
Main advantage: always recomputing everything at every execution
Future: Splout
Key/value datastores have limitations
✓ They only accept querying by the key
✓ Aggregations are not possible
✓ In other words, we are forced to pre-compute everything
✓ Not always possible → the data explodes
✓ For this particular case, time ranges are fixed
Splout: like Voldemort, but SQL!
✓ The idea: replace Voldemort with Splout SQL
✓ Much richer queries: real-time aggregations, flexible time ranges
✓ It would allow creating some kind of Google Analytics for the statistics discussed in this presentation
✓ Open sourced!!! https://github.com/datasalt/splout-db