Multi-core Parallelization in Clojure - a Case Study
Transcript

  • 1. Multi-core Parallelization in Clojure - a Case Study Johann M. Kraus and Hans A. Kestler AG Bioinformatics and Systems Biology Institute of Neural Information Processing University of Ulm 29.06.2009
  • 2. Outline 1. Concepts of parallel programming 2. Short introduction to Clojure 3. Multi-core parallel K-means - the case study 4. Analysis and Results 5. Summary
  • 3. Parallel Programming Definition: Parallel programming is a form of programming where many calculations are performed simultaneously. • Physical constraints prevent further frequency scaling of processors • This led to an increasing interest in parallel hardware and parallel programming • Multi-core hardware is standard on desktop computers • Parallel software can use this hardware to its full capacity
  • 4. • Large problems are divided into smaller ones and the sub-problems are solved simultaneously • Speedup S is limited by the fraction of parallelizable code P • Amdahl's law: S = 1 / ((1 − P) + P/N), where N is the number of processors [Figure: Amdahl's law — speedup vs. number of processors (1 to 65536) for parallelizable fractions P = 0.50, 0.75, 0.90, and 0.95]
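Amdahl's law above can be evaluated directly; a minimal Java sketch (class and method names are mine, not from the talk):

```java
// Amdahl's law: S = 1 / ((1 - P) + P / N), where P is the parallelizable
// fraction of the code and N the number of processors.
public class Amdahl {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }
    public static void main(String[] args) {
        // Even with 95% parallelizable code, 8 cores give less than 6x speedup.
        System.out.println(speedup(0.95, 8));
        // For N -> infinity the speedup is bounded by 1 / (1 - P) = 20 here.
        System.out.println(speedup(0.95, 1 << 20));
    }
}
```

This makes the slide's point concrete: the serial fraction, not the core count, dominates the achievable speedup.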
  • 5. Concepts of Parallel Programming Explicit vs. implicit parallelization • Explicitly define communication and synchronization details for each task: • MPI • Java Threads • Functional programming allows implicit parallelization: • Parallel processing of functions • Functions are free of side-effects • Data is immutable
  • 6. Distributed vs. local hardware • Master-slave parallelization (e.g. Message Passing Interface) • Shared memory parallelization (e.g. Open Multi-Processing) [Figure: left, a master node sending data to slave nodes and receiving their results; right, several CPUs reading from and writing to a shared memory]
  • 7. Thread programming • Threads are refinements of a process that share the same memory and can be processed separately and simultaneously • Available in many languages, e.g. PThreads (C), Java Threads (Java), OpenMP Threads (C, Fortran) • Execution of threads is handled by a scheduler that manages the available processing time • Communication between threads is faster than communication between processes • Invoking threads is also faster than forking/joining processes [Figure: thread life cycle — new → runnable (start), runnable → running (schedule), running → waiting (block), waiting → runnable (awake), running → terminated (end)]
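As a toy illustration of the thread programming described above (my own example, not from the talk): two Java threads sum the halves of an array, and the caller joins them before combining the partial results.

```java
import java.util.Arrays;

// Two threads process disjoint halves of the data simultaneously; join()
// makes the caller wait for termination, so reading the partial results
// afterwards is safe (join establishes a happens-before edge).
public class ParallelSum {
    static long sum(long[] data) {
        long[] partial = new long[2];
        int mid = data.length / 2;
        Thread lo = new Thread(() -> partial[0] = Arrays.stream(data, 0, mid).sum());
        Thread hi = new Thread(() -> partial[1] = Arrays.stream(data, mid, data.length).sum());
        lo.start();   // moves each thread from "new" to "runnable"
        hi.start();
        try {
            lo.join();   // wait until both threads have terminated
            hi.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return partial[0] + partial[1];
    }
    public static void main(String[] args) {
        long[] data = new long[1000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data));   // 500500
    }
}
```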
  • 8. Concurrency control via locking and synchronizing • Concurrency control ensures that threads can access shared memory without violating data integrity • The most popular approach to concurrency is locking and synchronizing

    public class Counter {
        private int value = 0;
        public synchronized void incr() {
            value = value + 1;
        }
    }

    Counter counter = new Counter();
    counter.incr();

    • Problems might occur when using too many locks, too few locks, the wrong locks, or locks in the wrong order • Using locks can be fatally error-prone, e.g. deadlocks
  • 9. Concurrency control via transactional memory • Transactional memory offers a flexible alternative to lock-based concurrency control • Its functionality is analogous to the control of simultaneous access in database management systems • Transactions ensure the ACID properties: • Atomicity: Either all changes of a transaction occur or none do • Consistency: Only valid changes are committed • Isolation: No transaction sees the effect of other transactions • Durability: Changes from committed transactions are persistent
  • 10. • Software transactional memory applies the transaction concept to concurrency control in parallel programming [Figure: sequence diagram over time — two transactions each get the data, proceed only on consistent data, and send their modified data back]
  • 11. Clojure • Functional programming language hosted on the JVM • Extends the code-as-data paradigm to maps and vectors • Based on immutable data structures • Provides built-in concurrency support via software transactional memory • Fully symbiotic with Java, e.g. easy access to Java libraries • Platform independent
  • 12. • Java interaction

    (import '(cern.jet.random.sampling RandomSamplingAssistant))
    (defn sample [n k]
      (seq (. RandomSamplingAssistant (sampleArray k (int-array (range n))))))

    • Dynamic typing and multi-methods • An object is defined as the sum of what it can do (methods), rather than the sum of what it is (type hierarchy) • Add type hints to speed up code

    (defn da+ [#^doubles as #^doubles bs]
      (amap as i ret (+ (aget as i) (aget bs i))))
  • 13. Transactional references and STM • Transactional references ensure safe coordinated synchronous changes to mutable storage locations • Are bound to a single storage location for their lifetime • Only allow mutation of that location to occur within transactions • Available operations are ref-set, alter, and commute • No explicit locking is required

    (def counter (ref 0))
    (dosync (alter counter inc))
  • 14. Agents • Agents allow independent asynchronous change of mutable locations • Are bound to a single storage location for their lifetime • Only allow mutation of that location to a new state to occur as the result of an action • Actions are functions that are asynchronously applied to the state of an Agent • The return value of an action becomes the new state of the Agent • Agents are integrated with the STM

    (def counter (agent 0))
    (send counter inc)
  • 15. Cluster analysis • Given a data set X compute a partition of X into k disjoint clusters C, such that: (1) C_1 ∪ ... ∪ C_k = X (2) C_i ≠ ∅ and C_i ∩ C_j = ∅ for i ≠ j • How many clusters are in the data set? [Figure: example data sets with 3 clusters and 9 clusters]
  • 16. Cluster algorithms • For all possible partitions evaluate the objective function f and search for the optimum • The cardinality of the set of all possible partitions is given by the Stirling numbers of the second kind: S_N^k = (1/k!) Σ_{i=0}^{k} (−1)^(k−i) (k choose i) i^N [Figure: runtime grows steeply with both the number of data points and the number of clusters] Cluster algorithms provide a heuristic for this search: • Partitional clustering (K-means, Neural gas, SOM, Fuzzy C-means, ...) • Hierarchical clustering (Divisive/agglomerative, Complete linkage, ...) • Graph-based clustering (Spectral clustering, NMF, Affinity propagation, ...) • Model-based clustering, Biclustering, Semi-supervised clustering
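The count above can be verified with a short program; this Java sketch (names are mine) computes the Stirling numbers of the second kind via the standard recurrence S(N, k) = k·S(N−1, k) + S(N−1, k−1) rather than the alternating sum, which is prone to cancellation and overflow:

```java
// Number of ways to partition N items into k non-empty clusters,
// i.e. the Stirling numbers of the second kind, by dynamic programming.
public class Stirling {
    static long partitions(int n, int k) {
        long[][] s = new long[n + 1][k + 1];
        s[0][0] = 1;   // the empty set has exactly one partition into 0 parts
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= Math.min(i, k); j++)
                s[i][j] = j * s[i - 1][j] + s[i - 1][j - 1];
        return s[n][k];
    }
    public static void main(String[] args) {
        // Already 10 points and 3 clusters admit 9330 distinct partitions,
        // which is why exhaustive search is hopeless and heuristics are needed.
        System.out.println(partitions(10, 3));   // 9330
    }
}
```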
  • 17. K-means algorithm

    Function KMeans
    Input:  X = {x_1, ..., x_n} (Data to be clustered)
            k (Number of clusters)
    Output: C = {c_1, ..., c_k} (Cluster centroids)
            m: X -> C (Cluster assignments)
    Initialize C (e.g. random selection from X)
    While C has changed
      For each x_i in X
        m(x_i) = argmin_j distance(x_i, c_j)
      End
      For each c_j in C
        c_j = centroid({x_i | m(x_i) = j})
      End
    End
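The loop above can be sketched sequentially; a minimal Java version on one-dimensional data (my own simplification — the talk's McKmeans parallelizes the assignment step across agents):

```java
// Sequential K-means on 1-D data: alternate between assigning each point to
// its nearest centroid and recomputing each centroid as the mean of its
// assigned points, until the assignments stop changing.
public class KMeans {
    static double[] cluster(double[] x, int k, int maxIter) {
        double[] c = new double[k];
        for (int j = 0; j < k; j++) c[j] = x[j];  // init: first k points
        int[] m = new int[x.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // assignment step: m(x_i) = argmin_j distance(x_i, c_j)
            for (int i = 0; i < x.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (Math.abs(x[i] - c[j]) < Math.abs(x[i] - c[best])) best = j;
                if (m[i] != best) { m[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break;
            // update step: c_j = centroid({x_i | m(x_i) = j})
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < x.length; i++) { sum[m[i]] += x[i]; count[m[i]]++; }
            for (int j = 0; j < k; j++) if (count[j] > 0) c[j] = sum[j] / count[j];
        }
        return c;
    }
    public static void main(String[] args) {
        double[] x = {1.0, 1.2, 0.8, 10.0, 10.2, 9.8};
        double[] c = cluster(x, 2, 100);
        System.out.println(c[0] + " " + c[1]);   // centroids near 1.0 and 10.0
    }
}
```

The assignment loop is embarrassingly parallel, which is exactly what the agent-based design on the following slides exploits.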
  • 18. Cluster Validation • Evaluation requires repeated runs of clustering, e.g.: • Resampled data sets • Different parameters • MCA-index: mean proportion of samples being consistent over different clusterings: MCA = (1/n) max_π Σ_{i=1}^{k} |A_i ∩ B_{π(i)}|
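The MCA index above can be sketched by brute force over all k! relabelings π, which is feasible only for small k (the implementation and names are mine, not from the talk):

```java
// MCA index: fraction of samples whose cluster membership agrees between two
// clusterings a and b (labels in 0..k-1), maximized over all relabelings.
public class MCA {
    static double mca(int[] a, int[] b, int k) {
        int[][] overlap = new int[k][k];          // overlap[i][j] = |A_i ∩ B_j|
        for (int s = 0; s < a.length; s++) overlap[a[s]][b[s]]++;
        int[] perm = new int[k];
        for (int j = 0; j < k; j++) perm[j] = j;
        return (double) best(overlap, perm, 0) / a.length;
    }
    // Maximum of sum_i overlap[i][p(i)] over all permutations p, by swapping.
    static int best(int[][] o, int[] p, int i) {
        if (i == p.length) {
            int sum = 0;
            for (int j = 0; j < p.length; j++) sum += o[j][p[j]];
            return sum;
        }
        int max = 0;
        for (int j = i; j < p.length; j++) {
            int t = p[i]; p[i] = p[j]; p[j] = t;
            max = Math.max(max, best(o, p, i + 1));
            t = p[i]; p[i] = p[j]; p[j] = t;      // undo the swap
        }
        return max;
    }
    public static void main(String[] args) {
        int[] a = {0, 0, 1, 1, 1};
        int[] b = {1, 1, 0, 0, 1};   // same grouping with swapped labels, one flip
        System.out.println(mca(a, b, 2));   // 0.8
    }
}
```

The maximization over π makes the index invariant to the arbitrary numbering of clusters, which is what allows comparing independent clustering runs.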
  • 19. Estimation of the expected value of a validation index • Random label: randomly assign each item to a cluster • Random partition: choose a random partition • Random prototype: assign each item to its nearest prototype • Mean value from 100 runs [Figure: mean MCA index vs. number of clusters (0 to 50) for the three baselines]
  • 20. Multi-core K-means with Clojure • Split the data set into smaller pieces that are handled by agents • Each cluster is represented by an agent • Add a commutative list of cluster members within a transactional reference to accelerate the centroid update step [Figure: data agents 0…n write cluster memberships into member refs 0…k; each cluster agent reads its own member ref]
  • 21. [Figure: simultaneous read — all data agents read the cluster agents at the same time; simultaneous write — all data agents write into the member refs at the same time]
  • 22. read: (nearest-cluster), write: (commute) (assoc)

    (defn assignment []
      (map #(send % update-dataagent) DataAgents))

    (defn update-dataagent [datapoints]
      (map update-datapoint datapoints))

    (defn update-datapoint [datapoint]
      (let [newass (nearest-cluster datapoint)]
        (dosync (commute (nth MemberRefs newass) conj (:data datapoint)))
        (assoc datapoint :assignment newass)))
  • 23. Benchmark results • Large data sets (artificial): • Each data point is sampled from N(0,1) • Summary for 10 runs of K-means [Figure: runtimes for 10,000 cases × 100 dimensions, 20 clusters (ParaKMeans, K-means R, McKmeans; seconds) and 1,000,000 cases × 200 dimensions, 20 clusters (K-means R, McKmeans; minutes)]
  • 24. • Number of computer cores used • Number of data agents used [Figure: runtime in seconds on a 100,000 × 500 data set with 20 clusters, for 1, 4, and 8 computer cores and for 4, 6, 8, and 10 data agents]
  • 25. Large data sets with cluster structure • Data sampled from a multivariate normal distribution • 100,000 samples, 200/500 dimensions, 10/20 clusters [Figure: runtime in seconds of K-means R vs. McKmeans for the combinations 200/10, 200/20, 500/10, and 500/20 of dimensions and cluster counts]
  • 26. Accuracy compared to the known grouping of the data • Measured with the MCA index • Red bars indicate the random-prototype baseline [Figure: MCA index of McKmeans and K-means R on 100,000 × 200 and 100,000 × 500 data sets with 10 and 20 clusters]
  • 27. Real world data set • Microarray data (radiation-induced changes in human gene expression) • 22277 samples (genes) and 465 features (profiles) [Figure: runtime in seconds of K-means R vs. McKmeans for 2, 5, 10, and 20 clusters] Smirnov D, Morley M, Shin E, Spielman R, Cheung V: Genetic analysis of radiation-induced changes in human gene expression. Nature 2009, 459:587–591
  • 28. Application to Cluster Number Estimation • Repeated clustering with different subsets of the data • Repeated for different numbers of clusters k • The most stable clustering is produced for the 'real' cluster number • Jackknife resampling • Evaluation with the MCA index • Data set: 100,000 samples, 100 features, 3 clusters • 10 runs per cluster number • 49.26 minutes on a dual quad-core 3.2 GHz machine [Figure: MCA index vs. number of clusters (2 to 7), peaking at 3]
  • 29. Java GUI

    (import '(javax.swing JFrame JLabel JTextField JButton)
            '(java.awt.event ActionListener)
            '(java.awt GridLayout))

    (let [frame (new JFrame "Hello, World!")
          hello-button (new JButton "Say hello")
          hello-label (new JLabel "")]
      (. hello-button
         (addActionListener
           (proxy [ActionListener] []
             (actionPerformed [evt]
               (. hello-label (setText "Hello, World!"))))))
      (doto frame
        (.setLayout (new GridLayout 1 1 3 3))
        (.add hello-button)
        (.add hello-label)
        (.setSize 300 80)
        (.setVisible true)))
  • 30. Summary • Writing parallel programs usually requires careful software design and deep knowledge of thread-safe programming • Concurrency control via transactional memory circumvents the problems of lock-based concurrency strategies • Immutable data structures play a key role in software transactional memory • Clojure combines Lisp, Java and a powerful STM system • This enables fast parallelization of algorithms, even for rapid prototyping • Our simulations show good performance of the parallelized code
  • 31. Thank you for your attention.
  • 32. Statistical computing library • http://wiki.github.com/liebke/incanter • Clojure-based statistical computing • R-like semantics • COLT library for numerical computation • JFreeChart library for graphics