0
MapReduce In The Cloud          Infinispan Distributed Task Execution FrameworkManik SurtaniMay 3rd 2011, JUDCon - Boston
Welcome to the Data Deluge
Background• Emergence of data beyond human scale• Outgrows current platforms in scale, structure, processing time• Abundan...
Big Data• How did Big Data happen?• Exponential changes in storage, bandwidth and data  creation• Data exhaust• Drowning i...
Hard disk cost & capacity history         1000000                                                                         ...
“There was 5 exabytes of                information created between the               dawn of civilization through 2003......
“There was 5 exabytes of                information created between the               dawn of civilization through 2003......
Big Data challenges• Powerful platforms available today, however...• Need to be genius to build systems on top of them• Eq...
“Infinispan withoutMapReduce is like owning a  Ferrari without aDriving License”           - Vladimir Blagojevic
Infinispan Big Data goals• Why waste a Ferrari?• Humble beginnings• Simplicity without sacrificing power• Capitalize on ex...
Infinispan Distributed Execution           Framework• Leverage familiar ExecutorService, Callable abstractions• Expand it ...
So simple it fits on one slidepublic interface DistributedExecutorService extends ExecutorService {      <T, K> Future<T> ...
However,behind the scenes..
Do not forget Gene Amdahl                                                                   Speedup = 1/(p/n)+(1-p)       ...
π approximation
Infinispan MapReduce• We already have a data grid!• Leverages Infinispan’s DIST mode• Cache data is input for MapReduce ta...
MapReduce modelSource: http://labs.google.com/papers/mapreduce.html
Mapper, Reducer, Collatorpublic interface Mapper<KIn, VIn, KOut, VOut> extends Serializable {     void map(KIn key, VIn va...
Roadmap• Improve task execution container• Failover, execution policies• Make sure it scales to terabytes and petabytes• I...
Parting thoughts• Data revolution is here, today!• Profound socio-economic impact• Do not sleep through it• Infinispan as ...
Questions?    http://www.infinispan.org    http://blog.infinispan.orghttp://www.twitter.com/infinispan    #infinispan on F...
MapReduce In The Cloud Infinispan Distributed Task Execution Framework
Upcoming SlideShare
Loading in...5
×

MapReduce In The Cloud Infinispan Distributed Task Execution Framework

2,553

Published on

Having a data grid without an ability to execute distributed computational tasks on that data is like having a Ferrari without a drivers license!

Do you have or plan to use a data grid which contains massive amount of information? Introducing one of the most exciting and long awaited Infinispan 5 features - distributed data processing framework. Being tightly integrated into each node it leverages Infinispan data grid into a distributed large scale parallel data processing system. The two core contributors of the data processing framework show how one can perform various simple and complex processing such as Map-Reduce over a data grid in Infinispan with minimal effort, using various static and dynamic languages such as Scala and Ruby. This presentation includes a live demonstration.

Published in: Technology
0 Comments
5 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,553
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
75
Comments
0
Likes
5
Embeds 0
No embeds

No notes for slide
  • \n
  • Why Vladimir can&amp;#x2019;t make it\n\n
  • Not into surfing, although I believe this picture captures a feeling about the data deluge quite well.\n* Big and scary\n* Inevitable\n* Kinda interesting, challenging\n* And full of opportunity\n\n\n
  • We are in the era of big data, data beyond human scale\nBeyond the scale of most simple systems\nCloud helps, with cheap, on-demand storage and processing\nAnd while social networks are a good example of this, it isn&amp;#x2019;t just tied to social networks. We see aspects of big data crop up everywhere.\n \n\n\n
  • Even prior to cloud and cheap,on-demand storage/processing, the cost of traditional storage has been dropping so fast that you may as well store everything you find out. Just in case. Like GMail. :)\n\n
  • Wikipedia sums this up well, on this exponential scale of how the price of storage has been dropping.\n
  • \n
  • And this quote by Eric Schmidt gives you an idea of the scale and volume of data we&amp;#x2019;re dealing with today.\n
  • We have things that can deal with big data.\nWe have the tools, but they are non-trivial.\nHadoop, for example, is tough to set up, hard to maintain, use, etc.\nM/R: is that enough? Do we need a higher level language to express this? IMO, yes. But for now, this is what we have.\n
  • \n
  • Infinispan and big data\nwrapped around 2 separate but related APIs:\nDistributed Executors and Map/Reduce\n(Map/Reduce implemented over DistExec)\n
  • Start with DistExec. This is a simplistic API, extending familiar JDK abstractions.\nIt encapsulates the use case of wanting to distribute workloads on the grid, potentially with some data affinity. \nLets take a look at the API.\n\n
  • \n
  • This allows you to parallelise tasks across the grid. Essentially it takes JDK7&amp;#x2019;s Fork/Join concepts and distributes it across multi-nodes.\n
  • But this suffers from the same issues that Fork/Join suffers from - at some point it is just too expensive to break up tasks, distribute it, etc.\nSo care must always be taken when considering distributing task execution.\n
  • Example: Pi approximation.\nThis is an old brute-force algorithm using random points on the circle. The more &amp;#x201C;tasks&amp;#x201D; that get kicked off, the more accurate our approximation.\n\nLook at Impl, try with 1 node and try with 6 nodes\n\n
  • \n
  • \n
  • \n
  • That was distexec. So that&amp;#x2019;s cool to some degree, lets now look at Map/Reduce.\nM/R based on Google&amp;#x2019;s paper, allows you to perform a task over a data set in parallel, and pull back &amp;#x201C;reduced&amp;#x201D; results.\nIn the case of Infinispan, data is DIST anyway. So your global data set is disjoint.\n\n
  • Google&amp;#x2019;s Map/Reduce model, from Google&amp;#x2019;s paper\nInput -&gt; Mapped -&gt; Reduced -&gt; Collated.\n
  • 3 interfaces in Infinispan&amp;#x2019;s map/reduce API\n\nExplain word-count demo\nDemo on 1 node\nDemo on 6 nodes\n
  • \n
  • \n
  • * Currently in an experimental phase\n* We need to do failover - behaviour as cluster size changes\n* Test on very large volumes\n* OGM - could make use of this for distributed queries\n* Integration with analysis tools\n* Data language? LINQ?\n
  • \n
  • \n
  • Transcript of "MapReduce In The Cloud Infinispan Distributed Task Execution Framework"

    1. 1. MapReduce In The Cloud Infinispan Distributed Task Execution FrameworkManik SurtaniMay 3rd 2011, JUDCon - Boston
    2. 2. Welcome to the Data Deluge
    3. 3. Background• Emergence of data beyond human scale• Outgrows current platforms in scale, structure, processing time• Abundance of unstructured, machine generated data• Does not fit into current software paradigms• Not confined to Twitter, Facebook and Google only• Need new platforms built for Big Data from ground up
    4. 4. Big Data• How did Big Data happen?• Exponential changes in storage, bandwidth and data creation• Data exhaust• Drowning in data and not knowing what to do with it• Leverage, not discard Big Data• We are not even aware of data revolution around us
    5. 5. Hard disk cost & capacity history 1000000 1000000 100000 100000 10000 10000 1000 1000 100 100 10 1 10 0.1 1 1980 1985 1990 1995 2000 2005 2010 Cost per GB in US$ Capacity in MBSource: http://en.wikipedia.org/wiki/History_of_hard_disk_drives
    6. 6. “There was 5 exabytes of information created between the dawn of civilization through 2003.... Eric Schmidt, August 4th, 2010Source: http://www.readwriteweb.com/archives/google_ceo_schmidt_people_arent_ready_for_the_tech.php
    7. 7. “There was 5 exabytes of information created between the dawn of civilization through 2003.... ... that much information is now created every two days” Eric Schmidt, August 4th, 2010Source: http://www.readwriteweb.com/archives/google_ceo_schmidt_people_arent_ready_for_the_tech.php
    8. 8. Big Data challenges• Powerful platforms available today, however...• Need to be genius to build systems on top of them• Equivalent to programming client/server apps in assembly language• Is there a need for a new language?• Confusion is as big as Big Data• Simpler yet equally powerful solutions needed
    9. 9. “Infinispan withoutMapReduce is like owning a Ferrari without aDriving License” - Vladimir Blagojevic
    10. 10. Infinispan Big Data goals• Why waste a Ferrari?• Humble beginnings• Simplicity without sacrificing power• Capitalize on existing Infinispan infrastructure• Reuse of current programming abstractions• Two frameworks: distributed executors and MapReduce
    11. 11. Infinispan Distributed Execution Framework• Leverage familiar ExecutorService, Callable abstractions• Expand it to distributed, parallel computing paradigm• Looks like a regular ExecutorService• Feels like a regular ExecutorService• The magic that goes on within Infinispan is completely transparent to users• Nevertheless, users can experience it :-)
    12. 12. So simple it fits on one slidepublic interface DistributedExecutorService extends ExecutorService { <T, K> Future<T> submit(Callable<T> task, K... input); <T> List<Future<T>> submitEverywhere(Callable<T> task); <T, K > List<Future<T>> submitEverywhere(Callable<T> task, K... input); } public interface DistributedCallable<K, V, T> extends Callable<T> { void setEnvironment(Cache<K, V> cache, Set<K> inputKeys);}
    13. 13. However,behind the scenes..
    14. 14. Do not forget Gene Amdahl Speedup = 1/(p/n)+(1-p) However, problems that increase the percentage of parallel time with their size are more scalable than problems with fixed percentage of parallel time p = parallel fraction n = number of processorsSource: https://computing.llnl.gov/tutorials/parallel_comp/
    15. 15. π approximation
    16. 16. Infinispan MapReduce• We already have a data grid!• Leverages Infinispan’s DIST mode• Cache data is input for MapReduce tasks• Task components: Mapper, Reducer, Collator• MapReduceTask cohering them together
    17. 17. MapReduce modelSource: http://labs.google.com/papers/mapreduce.html
    18. 18. Mapper, Reducer, Collatorpublic interface Mapper<KIn, VIn, KOut, VOut> extends Serializable { void map(KIn key, VIn value, Collector<KOut, VOut> collector);}public interface Reducer<KOut, VOut> extends Serializable { VOut reduce(KOut reducedKey, Iterator<VOut> iter);}public interface Collator<KOut, VOut, R> { R collate(Map<KOut, VOut> reducedResults);}
    19. 19. Roadmap• Improve task execution container• Failover, execution policies• Make sure it scales to terabytes and petabytes• Integration with Hibernate OGM• Analytics and BI tools• Do we need data analysis language?
    20. 20. Parting thoughts• Data revolution is here, today!• Profound socio-economic impact• Do not sleep through it• Infinispan as a platform for Big Data• Join us in these exciting endeavours
    21. 21. Questions? http://www.infinispan.org http://blog.infinispan.orghttp://www.twitter.com/infinispan #infinispan on FreeNode Rate this talk!http://spkr8.com/t/7384
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×