If the data cannot come to the algorithm...
Session four of my series on many cores turns to data, both big and small. It looks at MapReduce, approaching it sideways from a classic computer science perspective.



  1. If the Data Cannot Come to the Algorithm...
     many cores with java, session four: data locality
     copyright 2013 Robert Burrell Donkin, robertburrelldonkin.name
     this work is licensed under a Creative Commons Attribution 3.0 Unported License
  2. Take Away from Session One: Pre-emptive multi-tasking operating systems use involuntary context switching to provide the illusion of parallel processes even when the hardware supports only a single thread of execution.
  3. Take Away from Session Two: Even on a single core, there's no escaping parallelism.
  4. Take Away from Session Three: Code executing on different cores uses copies held in registers and caches, so shared memory is likely to be incoherent unless the program plays by the rules of the software platform.
  5. Gustafson's Law
     S(p) = p - a(p - 1)
     ● S(p) is the speedup for p processors
     ● a is the non-parallelizable fraction
     "in practice, the problem size scales with the number of processors" (John L. Gustafson)
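As a quick sanity check on the formula, here is a minimal sketch that tabulates S(p) for a few core counts. The 5% serial fraction is an illustrative number, not one from the slides:

```java
// Gustafson's Law: S(p) = p - a * (p - 1),
// where a is the non-parallelizable fraction of the work.
public class GustafsonsLaw {

    static double speedup(int p, double a) {
        return p - a * (p - 1);
    }

    public static void main(String[] args) {
        // With 5% serial work, speedup stays near-linear as cores are added,
        // because the problem size is assumed to grow with the core count.
        for (int p : new int[] {1, 2, 4, 8, 16}) {
            System.out.printf("p = %2d -> speedup %.2f%n", p, speedup(p, 0.05));
        }
    }
}
```

Contrast this with Amdahl's Law, which fixes the problem size: there the same 5% serial fraction caps speedup at 20 no matter how many cores are added.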
  6. Scales and Scaling
     ● Think about Gustafson's Law...
     ● The quantity of data processed...
     ● ...scales linearly as processors are added.
     ● Throwing processors at the problem works...
     ● ...at least sometimes.
  7. Divide and Conquer
     ● Back to the future
     ● Partition the data...
       ○ ...apply the same algorithm to each part, and then
       ○ ...collate the answers.
     ● Natural to parallelise
     ● No contended shared memory
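The partition / apply / collate pattern above maps directly onto Java's fork/join framework (Java 7+). A sketch, summing an array: the threshold and data are illustrative choices, and each subtask works only on its own slice, so there is no contended shared memory:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Divide and conquer with fork/join: partition the data, apply the
// same algorithm to each part in parallel, then collate the answers.
public class DivideAndConquerSum extends RecursiveTask<Long> {

    static final int THRESHOLD = 1_000; // below this, just loop sequentially
    private final long[] data;
    private final int from, to;

    DivideAndConquerSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        DivideAndConquerSum left = new DivideAndConquerSum(data, from, mid);
        DivideAndConquerSum right = new DivideAndConquerSum(data, mid, to);
        left.fork();                     // partition: run left half asynchronously
        long rightSum = right.compute(); // this thread takes the right half
        return left.join() + rightSum;   // collate the answers
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        long sum = new ForkJoinPool().invoke(
                new DivideAndConquerSum(data, 0, data.length));
        System.out.println(sum); // 1 + 2 + ... + 10000 = 50005000
    }
}
```

Fork/join also illustrates the data-locality theme: work stealing keeps each worker busy on its own deque of subtasks, so most of the time a core reuses data it recently touched.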
  8. Data Locality
     ● When the algorithm is small
       ○ it's more efficient
         ■ to bring the algorithm to the data
         ■ than the data to the algorithm
     ● Whether the data is in
       ○ caches on cores in a many core computer, or in
       ○ disc storage in a distributed data store
  9. Map and Reduce
     ● Partition the data
     ● The map algorithm
       ○ works in parallel
       ○ on local data
       ○ independently
     ● The reduce algorithm
       ○ collates output from the map algorithms
     ● More complex systems are built from these blocks
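The map and reduce steps above can be sketched with Java 8 parallel streams, a single-JVM stand-in for a distributed platform. Word count is the classic example; the input lines here are illustrative:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Word count as map-reduce:
// map    - each line is split into words, independently and in parallel;
// reduce - counts for the same word are collated by summing.
public class WordCount {

    static Map<String, Long> count(List<String> lines) {
        return lines.parallelStream()                            // partition the data
                .flatMap(line -> Arrays.stream(line.split("\\s+"))) // map step
                .collect(Collectors.groupingBy(
                        word -> word,
                        Collectors.counting()));                 // reduce step
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "the data cannot come",
                "to the algorithm");
        System.out.println(count(lines).get("the")); // 2
    }
}
```

On a platform such as Hadoop the same two functions run across many machines, with each map task scheduled next to the block of data it reads.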
  10. Map-Reduce As a Query Language
      ● NoSQL
      ● A popular alternative to SQL
        ○ for distributed data stores
      ● Why...?
        ○ Easy to
          ■ read and write
          ■ parallelize
        ○ Rich and full programming model
  11. Map-Reduce Crunching Big Data
      ● Commodity hardware
      ● Scales smoothly up to terabytes and petabytes
        ○ by adding new nodes
      ● Map-Reduce platforms typically provide
        ○ fault tolerance, e.g. retry
        ○ orchestration
        ○ redundant data storage
      ● Statistical resilience
  12. Take Away: When you want to be able to process big data tomorrow by adding cores or computers, adopt an appropriate architecture today.
