Multi-core Parallelization in Clojure - a Case Study

Transcript

  • 1. Multi-core Parallelization in Clojure - a Case Study. Johann M. Kraus and Hans A. Kestler, AG Bioinformatics and Systems Biology, Institute of Neural Information Processing, University of Ulm. 29.06.2009
  • 2. Outline
      1. Concepts of parallel programming
      2. Short introduction to Clojure
      3. Multi-core parallel K-means - the case study
      4. Analysis and results
      5. Summary
  • 3. Parallel Programming
      Definition: Parallel programming is a form of programming where many calculations are performed simultaneously.
      • Physical constraints prevent further frequency scaling of processors
      • This has led to increasing interest in parallel hardware and parallel programming
      • Multi-core hardware is standard on desktop computers
      • Parallel software can use this hardware to its full capacity
  • 4. • Large problems are divided into smaller ones and the sub-problems are solved simultaneously
      • Speedup S is limited by the fraction of parallelizable code P
      • Amdahl's law, for N processors: $S = \frac{1}{(1 - P) + \frac{P}{N}}$
      [Figure: Amdahl's law. Speedup versus number of processors (1 to 65536) for fractions of parallelizable code P = 0.95, 0.90, 0.75 and 0.50.]
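Amdahl's law is easy to evaluate directly. The following is a minimal sketch; the function name `amdahl-speedup` is an assumption made for illustration and does not come from the talk.

```clojure
(defn amdahl-speedup
  "Speedup predicted by Amdahl's law for a parallelizable fraction p
   (between 0 and 1) running on n processors."
  [p n]
  (/ 1.0 (+ (- 1.0 p) (/ p n))))

;; Speedup on 8 processors for the four fractions named in the figure legend.
(doseq [p [0.50 0.75 0.90 0.95]]
  (println p (amdahl-speedup p 8)))

;; As n grows, the speedup approaches 1 / (1 - p), i.e. about 20 for p = 0.95.
(println (amdahl-speedup 0.95 65536))
```

Even with 95 % of the code parallelizable, the serial 5 % caps the speedup at roughly 20, however many processors are added.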
  • 5. Concepts of Parallel Programming: explicit vs. implicit parallelization
      • Explicit parallelization defines communication and synchronization details for each task:
          • MPI
          • Java Threads
      • Functional programming allows implicit parallelization:
          • Parallel processing of functions
          • Functions are free of side effects
          • Data is immutable
  • 6. Distributed vs. local hardware
      • Master-slave parallelization (e.g. Message Passing Interface)
      • Shared-memory parallelization (e.g. Open Multi-Processing)
      [Figure: left, a master node sends data to slave nodes 0 to 4 and receives their results; right, CPUs 0 to 4 read from and write to a shared memory.]
  • 7. Thread programming
      • Threads are refinements of a process that share the same memory and can be processed separately and simultaneously
      • Available in many languages, e.g. PThreads (C), Java Threads (Java), OpenMP Threads (C, Fortran)
      • Execution of threads is handled by a scheduler that manages the available processing time
      • Communication between threads is faster than communication between processes
      • Invoking threads is also faster than forking and joining processes
      [Figure: thread life cycle with the states new, runnable, running, waiting and terminated, and the transitions start, schedule, block, awake and end.]
  • 8. Concurrency control via locking and synchronizing
      • Concurrency control ensures that threads can access shared memory without violating data integrity
      • The most popular approach to concurrency control is locking and synchronizing:
          public class Counter {
              private int value = 0;
              // synchronized: only one thread at a time may run incr() on this object
              public synchronized void incr() {
                  value = value + 1;
              }
          }
          Counter counter = new Counter();
          counter.incr();
      • Problems might occur when using too many locks, too few locks, the wrong locks, or locks in the wrong order
      • Using locks is error-prone and can be fatal, e.g. deadlocks
  • 9. Concurrency control via transactional memory
      • Transactional memory offers a flexible alternative to lock-based concurrency control
      • It works analogously to the control of simultaneous access in database management systems
      • Transactions ensure the following properties:
          • Atomicity: Either all changes of a transaction occur or none do
          • Consistency: Only valid changes are committed
          • Isolation: No transaction sees the effect of other transactions
          • Durability: Changes from transactions will be persistent
  • 10. • Software transactional memory maps transactional memory to concurrency control in parallel programming
      [Figure: sequence diagram over time. Transaction 0 and Transaction 1 each get data from a shared Data object, receive consistent data, and send modified data back; a transaction gets the data again and resends when its first attempt cannot be committed.]
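Atomicity can be sketched with Clojure's own STM, using the `ref` and `dosync` constructs that the talk introduces a few slides later. The account names and the `transfer!` function are hypothetical and chosen only for illustration.

```clojure
(def account-a (ref 100))
(def account-b (ref 0))

(defn transfer!
  "Moves amount from one ref to another inside a single transaction."
  [from to amount]
  (dosync
    (alter from - amount)
    ;; Inside the transaction, @from already shows the tentative new value.
    (when (neg? @from)
      (throw (IllegalStateException. "insufficient funds")))
    (alter to + amount)))

(transfer! account-a account-b 30)
(println @account-a @account-b)   ; prints 70 30

;; The exception aborts the transaction, so the tentative withdrawal
;; is discarded and neither ref changes.
(try
  (transfer! account-a account-b 1000)
  (catch IllegalStateException e
    (println "rejected:" (.getMessage e))))
(println @account-a @account-b)   ; still prints 70 30
```

The second call withdraws first and validates afterwards, yet no other thread can ever observe the negative balance: either both `alter` calls commit or neither does.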
  • 11. Clojure
      • Functional programming language hosted on the JVM
      • Extends the code-as-data paradigm to maps and vectors
      • Based on immutable data structures
      • Provides built-in concurrency support via software transactional memory
      • Completely symbiotic with Java, e.g. easy access to Java libraries
      • Platform independent
  • 12. • Java interaction:
          (import '(cern.jet.random.sampling RandomSamplingAssistant))
          ;; Sample k elements from the integers 0 .. n-1 with the COLT library.
          (defn sample [n k]
            (seq (. RandomSamplingAssistant (sampleArray k (int-array (range n))))))
      • Dynamic typing and multi-methods
      • An object is defined as the sum of what it can do (methods), rather than the sum of what it is (type hierarchy)
      • Add type hints to speed up code:
          ;; Element-wise sum of two double arrays.
          ;; #^ is the type-hint syntax of 2009; current Clojure writes ^doubles.
          (defn da+ [#^doubles as #^doubles bs]
            (amap as i ret (+ (aget as i) (aget bs i))))
  • 13. Transactional references and STM
      • Transactional references ensure safe, coordinated, synchronous changes to mutable storage locations
      • Are bound to a single storage location for their lifetime
      • Only allow mutation of that location to occur within transactions
      • Available operations are ref-set, alter, and commute
      • No explicit locking is required
          (def counter (ref 0))
          (dosync (alter counter inc))
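To see that no explicit locking is needed, the counter can be incremented from several threads at once. This is a small sketch; the worker count, the iteration count and the helper `increment-many!` are arbitrary assumptions.

```clojure
(def counter (ref 0))

(defn increment-many!
  "Runs n separate transactions, each incrementing the shared counter."
  [n]
  (dotimes [_ n]
    (dosync (alter counter inc))))

;; Start 8 concurrent workers and wait for all of them to finish.
(let [workers (doall (repeatedly 8 #(future (increment-many! 1000))))]
  (doseq [w workers] @w))

(println @counter)   ; prints 8000
```

Conflicting transactions are retried automatically by the STM, so no update is lost. Replacing `alter` with `commute` would also be valid here, because increments can be applied in any order.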
  • 14. Agents
      • Agents allow independent, asynchronous change of mutable locations
      • Are bound to a single storage location for their lifetime
      • Only allow mutation of that location to a new state to occur as a result of an action
      • Actions are functions that are asynchronously applied to the state of an Agent
      • The return value of an action becomes the new state of the Agent
      • Agents are integrated with the STM
          (def counter (agent 0))
          (send counter inc)
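Because `send` returns immediately, a caller that needs the result has to wait for the queued actions. A minimal sketch of this, with arbitrary numbers chosen for illustration:

```clojure
(def counter (agent 0))

;; Queue 100 asynchronous increments; send does not block.
(dotimes [_ 100] (send counter inc))

;; Extra arguments to send are passed to the action after the state.
(send counter + 10)

;; Block until every action sent so far from this thread has run.
(await counter)
(println @counter)   ; prints 110
```

Actions sent to one agent from the same thread run one at a time and in the order they were sent, which is why the final state is deterministic here.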
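  • 15. Cluster analysis
      • Given a data set X, compute a partition of X into k disjoint clusters C_1, ..., C_k such that:
          (1) $\bigcup_{i=1}^{k} C_i = X$
          (2) $C_i \neq \emptyset$ and $C_i \cap C_j = \emptyset$ for $i \neq j$
      • How many clusters are in the data set?
      [Figure: the same data set partitioned into 3 clusters and into 9 clusters.]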
  • 16. Cluster algorithms
      • For all possible partitions, evaluate the objective function f and search for the optimum.
      • The cardinality of the set of all possible partitions of N data points into k clusters is given by the Stirling numbers of the second kind: $S_N^{(k)} = \frac{1}{k!} \sum_{i=0}^{k} (-1)^{k-i} \binom{k}{i} i^N$
      [Figure: runtime (nanoseconds) plotted against the number of data points and the number of clusters.]
      Cluster algorithms provide a heuristic for this search:
      • Partitional clustering (K-means, Neuralgas, SOM, Fuzzy C-means, ...)
      • Hierarchical clustering (Divisive/agglomerative, Complete linkage, ...)
      • Graph-based clustering (Spectral clustering, NMF, Affinity propagation, ...)
      • Model-based clustering, Biclustering, Semi-supervised clustering
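The explicit formula for the Stirling numbers of the second kind can be evaluated directly to get a feeling for how quickly exhaustive search becomes infeasible. This is a sketch under the assumption that arbitrary-precision integers are wanted; all function names are made up for illustration.

```clojure
(defn factorial [n]
  (reduce *' 1 (range 1 (inc n))))

(defn binomial [n k]
  (/ (factorial n) (*' (factorial k) (factorial (- n k)))))

(defn pow [base exponent]
  (reduce *' 1 (repeat exponent base)))

(defn stirling2
  "Number of ways to partition n items into k non-empty clusters."
  [n k]
  (/ (reduce +' (for [i (range (inc k))]
                  (*' (if (even? (- k i)) 1 -1)   ; the sign (-1)^(k-i)
                      (binomial k i)
                      (pow i n))))
     (factorial k)))

(println (stirling2 4 2))    ; prints 7
(println (stirling2 10 3))   ; prints 9330
(println (stirling2 100 5))  ; a number far too large to enumerate
```

The primed operators `*'` and `+'` promote to big integers instead of overflowing, which matters already for moderately sized data sets.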
  • 17. K-means algorithm
      Function KMeans
        Input:  X = {x_1, ..., x_n}  (data to be clustered)
                k                    (number of clusters)
        Output: C = {c_1, ..., c_k}  (cluster centroids)
                m: X -> C            (cluster assignments)
        Initialize C (e.g. random selection from X)
        While C has changed
          For each x_i in X
            m(x_i) = argmin_j distance(x_i, c_j)
          End
          For each c_j in C
            c_j = centroid({x_i | m(x_i) = j})
          End
        End
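The pseudocode translates fairly directly into sequential Clojure. The sketch below is not the parallel implementation from the talk; it assumes Euclidean distance, takes the initial centroids as an argument instead of sampling them, and keeps the old centroid when a cluster becomes empty. All names are illustrative.

```clojure
(defn distance
  "Euclidean distance between two points given as vectors of numbers."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b))))

(defn nearest-centroid
  "Index j of the centroid closest to point x."
  [centroids x]
  (apply min-key #(distance x (nth centroids %)) (range (count centroids))))

(defn centroid
  "Coordinate-wise mean of a non-empty collection of points."
  [points]
  (let [n (double (count points))]
    (apply mapv (fn [& coords] (/ (reduce + coords) n)) points)))

(defn update-centroids [centroids assignments data]
  (let [groups (group-by first (map vector assignments data))]
    (mapv (fn [j]
            (if-let [members (seq (map second (groups j)))]
              (centroid members)
              (nth centroids j)))          ; empty cluster: keep old centroid
          (range (count centroids)))))

(defn k-means
  "Alternates assignment and centroid update until the centroids stop changing."
  [data initial-centroids]
  (loop [centroids initial-centroids]
    (let [assignments (mapv #(nearest-centroid centroids %) data)
          updated     (update-centroids centroids assignments data)]
      (if (= updated centroids)
        {:centroids centroids :assignments assignments}
        (recur updated)))))

(println (k-means [[0 0] [0 1] [10 10] [10 11]] [[0 0] [10 10]]))
```

For the four toy points the result is the assignment `[0 0 1 1]` with centroids `[0.0 0.5]` and `[10.0 10.5]`. The assignment step is the part that the talk distributes over agents, since each point can be handled independently.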
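  • 18. Cluster Validation
      • Evaluation requires repeated runs of clustering, e.g.:
          • Resampled data sets
          • Different parameters
      • MCA index: mean proportion of samples being consistent over different clusterings, where A and B are two clusterings of n samples and the maximum is taken over all matchings π of the cluster labels, with j = π(i):
          $MCA = \frac{1}{n} \max_{\pi} \sum_{i=1}^{k} |A_i \cap B_{\pi(i)}|$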
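For a small number of clusters the MCA index can be computed by brute force over all label permutations. The sketch below assumes that both clusterings are given as vectors of sets with the same number of clusters; the helper names are hypothetical, and a real implementation would solve the matching with an assignment algorithm instead of enumerating k! permutations.

```clojure
(require '[clojure.set :as set])

(defn permutations
  "All orderings of a collection of distinct, non-nil items."
  [xs]
  (if (empty? xs)
    [[]]
    (for [x xs
          p (permutations (remove #{x} xs))]
      (cons x p))))

(defn mca-index
  "Largest fraction of samples on which clusterings a and b agree,
   maximised over all matchings of the cluster labels."
  [a b]
  (let [n       (reduce + (map count a))
        k       (count a)
        overlap (fn [perm]
                  (reduce + (map (fn [i j]
                                   (count (set/intersection (nth a i) (nth b j))))
                                 (range k) perm)))]
    (/ (apply max (map overlap (permutations (range k))))
       (double n))))

;; Same partition with swapped labels: perfect agreement.
(println (mca-index [#{1 2 3} #{4 5 6}] [#{4 5 6} #{1 2 3}]))   ; prints 1.0

;; Two of six samples change cluster: agreement of 4/6.
(println (mca-index [#{1 2 3} #{4 5 6}] [#{1 2 4} #{3 5 6}]))
```

Maximising over permutations makes the index independent of how the clusters happen to be numbered in each run.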
  • 19. Estimation of the expected value of a validation index
      • Random label: randomly assign each item to a cluster k
      • Random partition: choose a random partition
      • Random prototype: assign each item to its nearest prototype
      [Figure: mean MCA index (0.0 to 1.0) versus number of clusters (0 to 50); mean value from 100 runs.]
  • 20. Multi-core K-means with Clojure
      • Split the data set into smaller pieces that are handled by agents
      • Each cluster is represented by an agent
      • Add a commutative list of cluster members within a transactional reference to accelerate the centroid update step
      [Figure: data agents 0 to n hold the pieces of the data set; cluster agents 0 to k each have a member ref 0 to k; arrows mark reads and writes.]
  • 21. [Figure, two panels. Simultaneous read: data agents 0 to n read from cluster agents 0 to k. Simultaneous write: data agents 0 to n write to the member refs.]
  • 22. read: (nearest-cluster); write: (commute) and (assoc)
          ;; DataAgents, MemberRefs and nearest-cluster are defined elsewhere.
          (defn update-datapoint [datapoint]
            (let [newass (nearest-cluster datapoint)]
              ;; commute: the order in which members are added does not matter
              (dosync (commute (nth MemberRefs newass) conj (:data datapoint)))
              (assoc datapoint :assignment newass)))
          (defn update-dataagent [datapoints]
            ;; doall forces the lazy sequence so the writes happen inside the action
            (doall (map update-datapoint datapoints)))
          (defn assignment []
            ;; dorun forces the lazy sequence so every agent really receives the action
            (dorun (map #(send % update-dataagent) DataAgents)))
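A self-contained toy version of the scheme on slides 20 to 22 might look as follows. The one-dimensional data, the fixed centroids and the names `centroids`, `member-refs` and `data-agents` are assumptions for illustration; the transcript shows only the three functions of the original McKmeans code.

```clojure
;; Hypothetical toy setup: two fixed 1-D centroids and two chunks of data.
(def centroids [0.0 10.0])

;; One transactional member list per cluster.
(def member-refs (vec (repeatedly (count centroids) #(ref []))))

;; One agent per chunk of the data set.
(def data-agents
  (mapv (fn [chunk]
          (agent (mapv (fn [x] {:data x :assignment nil}) chunk)))
        [[0.5 9.0 1.0] [11.0 -1.0 10.5]]))

(defn nearest-cluster [datapoint]
  (apply min-key
         (fn [j] (let [d (- (:data datapoint) (nth centroids j))] (* d d)))
         (range (count centroids))))

(defn update-datapoint [datapoint]
  (let [newass (nearest-cluster datapoint)]
    ;; commute lets concurrent writers append without forcing retries
    (dosync (commute (nth member-refs newass) conj (:data datapoint)))
    (assoc datapoint :assignment newass)))

(defn update-dataagent [datapoints]
  (mapv update-datapoint datapoints))

(defn assignment []
  (doseq [a data-agents] (send a update-dataagent))
  (apply await data-agents))            ; wait until every chunk is processed

(assignment)

;; Centroid update from the member lists (assumes no cluster is empty).
(println (mapv (fn [r] (/ (reduce + @r) (count @r))) member-refs))
```

The order of the entries in each member list depends on thread scheduling, which is harmless because the centroid is a mean. That order independence is what makes `commute` a valid choice over `alter`.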
  • 23. Benchmark results
      Large data sets (artificial):
      • Each data point is sampled from N(0,1)
      • Summary for 10 runs of K-means
      [Figure, two panels. Left: 10,000 cases, 100 dimensions, 20 clusters; runtime (seconds) for ParaKMeans, K-means R and McKmeans. Right: 1,000,000 cases, 200 dimensions, 20 clusters; runtime (minutes) for K-means R and McKmeans.]
  • 24. • Number of computer cores used
      • Number of data agents used
      [Figure, two panels, both for 100,000 x 500 data with 20 clusters. Left: runtime (seconds) versus number of computer cores (1, 4, 8). Right: runtime (seconds) versus number of data agents (4, 6, 8, 10).]
  • 25. Large data sets with cluster structure
      • Data sampled from a multivariate normal distribution
      • 100,000 samples, 200/500 dimensions, 10/20 clusters
      [Figure: runtime (seconds) of K-means R and McKmeans for the settings 200 / 10, 200 / 20, 500 / 10 and 500 / 20; axis label on the slide: "Number of samples / Number of clusters".]
  • 26. Accuracy compared to the known grouping of data
      • Measured with the MCA index
      • Red bars indicate the random-prototype baseline
      [Figure, four panels: 100,000 x 200 with 10 clusters, 100,000 x 200 with 20 clusters, 100,000 x 500 with 10 clusters, 100,000 x 500 with 20 clusters. Each panel shows the MCA index (0.0 to 1.0) for McKmeans and K-means R.]
  • 27. Real-world data set
      • Microarray data (radiation-induced changes in human gene expression)
      • 22277 samples (genes) and 465 features (profiles)
      [Figure: runtime (seconds) of K-means R and McKmeans for 2, 5, 10 and 20 clusters.]
      Smirnov D, Morley M, Shin E, Spielman R, Cheung V: Genetic analysis of radiation-induced changes in human gene expression. Nature 2009, 459:587–591
  • 28. Application to Cluster Number Estimation
      • Repeated clustering with different subsets of data
      • Repeated for different numbers of clusters k
      • The most stable clustering is produced for the 'real' cluster number
      • Jackknife resampling
      • Evaluation with MCA index
      • Data set: 100,000 samples, 100 features, 3 clusters
      • 10 runs per cluster number
      • 49.26 minutes on a dual quad-core 3.2 GHz machine
      [Figure: MCA index (0.0 to 1.0) versus number of clusters (2 to 7).]
  • 29. Java GUI
          (import '(javax.swing JFrame JLabel JTextField JButton)
                  '(java.awt.event ActionListener)
                  '(java.awt GridLayout))
          (let [frame        (new JFrame "Hello, World!")
                hello-button (new JButton "Say hello")
                hello-label  (new JLabel "")]
            ;; proxy creates an ActionListener that updates the label on click
            (. hello-button
               (addActionListener
                 (proxy [ActionListener] []
                   (actionPerformed [evt]
                     (. hello-label (setText "Hello, World!"))))))
            (doto frame
              (.setLayout (new GridLayout 1 1 3 3))
              (.add hello-button)
              (.add hello-label)
              (.setSize 300 80)
              (.setVisible true)))
  • 30. Summary
      • Writing parallel programs usually requires careful software design and deep knowledge of thread-safe programming
      • Concurrency control via transactional memory circumvents the problems of lock-based concurrency strategies
      • Immutable data structures play a key role in software transactional memory
      • Clojure combines Lisp, Java and a powerful STM system
      • This enables fast parallelization of algorithms, even for rapid prototyping
      • Our simulations show good performance of the parallelized code
  • 31. Thank you for your attention.
  • 32. Statistical computing library
      • http://wiki.github.com/liebke/incanter
      • Clojure-based statistical computing
      • R-like semantics
      • COLT library for numerical computation
      • JFreeChart library for graphics