Paper on experimental setup for verifying "Slow Learners are Fast"
This paper contains the findings of implementing an online machine learning algorithm on the Cell Broadband Engine versus an Intel dual-core processor.


Machine Learning on Cell Processor

Submitted by: Robin Srivastava (Uni ID: U4700252)
Supervisor: Dr. Eric McCreath
Course: COMP8740
Abstract

The technique of delayed stochastic gradient descent given in the paper titled "Slow Learners are Fast" shows theoretically how the online learning process can be parallelized. However, with the real experimental setup given in the paper, the parallelization does not improve performance. In this project we implement and evaluate this algorithm on the Cell processor and on an Intel dual-core processor, with the target of obtaining speedup under the paper's outlined real experimental setup. We also discuss the limitations of the Cell processor pertaining to this algorithm, along with suggestions on CPU architectures for which it is better suited.
Contents

1. Introduction
2. Background
  2.1 Machine Learning
  2.2 Algorithm (referenced from [Langford, Smola and Zinkevich, 2009])
  2.3 Possible Templates for Implementation
    a) Asynchronous Optimization
    b) Pipelined Optimization
    c) Randomization
  2.4 Cell Processor
  2.5 Experimental Setup
3. Design and Implementation
  3.1 Pre-processing the TREC Dataset
    3.1.1 Intel Dual Core
    3.1.2 Cell Processor
    3.1.3 Representation of Emails and Labels
  3.2 Implementation of Logistic Regression
  3.3 Implementation of Logistic Regression with Delayed Update
    3.3.1 Implementation on a Dual-Core Intel Pentium Processor
    3.3.2 Implementation on Cell Broadband Engine
4. Results
5. Conclusion and Future Work
Appendix I: Bag-of-Words Representation
Appendix II: Hashing
References
1. Introduction

The inherent properties exhibited by online learning algorithms suggest that they are an excellent way of making machines learn. This type of learning uses the observations either one at a time or in small batches, and discards them before the next set of observations is considered. Online algorithms are a suitable candidate for real-time learning, where data arrives in the form of a stream and predictions are required before the whole dataset has been seen. They are also useful for large datasets, because they do not require the whole dataset to be loaded into memory at once.

On the flip side, this very suitable property of sequentiality turns out to be a curse for performance. The algorithm is inherently sequential, and with the advent of multi-core processors this leads to severe under-utilization of the resources offered by these machines.

In Langford et al. [1], the authors gave a parallel version of an online learning algorithm along with its performance data when run on a machine with eight cores and 32 GB of memory. Their implementation was in Java. The simulation results were promising, and they obtained speedup with an increasing number of threads (Figure 1). However, their efforts to parallelize the exact experiments failed, because the serial implementation was already fast enough to handle over 150,000 examples/second. Based on the facts that the mathematical calculations involved in this algorithm can be accelerated by SIMD operations, and that Java has no programming support for SIMD, we have implemented and evaluated this algorithm on the Cell processor to exploit the SIMD capabilities of its specialized co-processors, with a view to obtaining speedup for the real experimental setup. An implementation was also done for a machine with an Intel dual-core processor and 1.86 GB of RAM.

Figure 1: Speedup with increasing number of threads, from Langford et al. [1]

The Cell processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), having a primary processor of 64-bit IBM PowerPC architecture and eight specialized SIMD-supported co-processors. Communication among these processors, their dedicated local stores and main memory is done through a very high-speed communication channel with a theoretical peak transfer rate of 96 B/cycle. Data communication plays a crucial role in the implementation of this algorithm on Cell, the primary reason being the large gap between the amount of data to be processed (approx. 76 MB) and the memory available to the co-processors (256 KB of local store each). An efficient approach to bridging this gap is discussed in the section on design and implementation, which also describes how the data was pre-processed for implementation on the Intel dual core and the Cell processor. The background section discusses the gradient descent and delayed stochastic gradient descent algorithms, the possible templates for the latter's implementation, an overview of the Cell processor, and the real experimental setup suggested by the designers of this algorithm. The results section gives a comparative study of the algorithm on both machines, and we conclude in the final section on conclusion and future work, which also suggests a CPU architecture for which this algorithm would be better suited, with better expected speedup and reduced coding complexity.
2. Background

2.1 Machine Learning

Machine learning is a technique by which a machine modifies its own behaviour on the basis of past experience and performance. The collection of data on past experience and performance is called the training set. One method of making a machine learn is to pass the entire training set in one go; this is known as batch learning. The generic steps for batch learning are as follows:

Step 1: Initialize the weights.
Step 2: For each batch of training data:
  Step 2a: Process all the training data.
  Step 2b: Update the weights.

A popular batch learning algorithm is gradient descent, in which after every step the weight vector moves in the direction of greatest decrease of the error function. Mathematically this is feasible due to the observation that if a real-valued function $F(x)$ is defined and differentiable in a neighbourhood of a point $a$, then $F(x)$ decreases fastest in the direction of the negative gradient of $F$ at $a$, that is $-\nabla F(a)$. Therefore if $b = a - \eta \nabla F(a)$ for a small $\eta > 0$, then $F(a) \geq F(b)$. The algorithm goes as follows:

Step 1: Initialize the weight vector $w^{(0)}$ with some arbitrary values.
Step 2: Update the weight vector as
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E\!\left(w^{(\tau)}\right)$$
where $\nabla E$ is the gradient of the error function and $\eta$ is the learning rate.
Step 3: Repeat Step 2 for all the batches of data.

This algorithm, however, does not prove to be very efficient (discussed in Bishop and Nabney, 2008). Two major weaknesses of gradient descent are:

1. The algorithm can take many iterations to converge towards a local minimum if the curvature in different directions is very different.
2. Finding the optimal $\eta$ per step can be time-consuming. Conversely, using a fixed $\eta$ can yield poor results.
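As a concrete illustration of the update in Step 2, the following is a minimal C sketch of one gradient-descent step; the gradient callback and the dimension D are illustrative assumptions, not code from this project.

    #include <stddef.h>

    #define D 4  /* dimension of the weight vector (illustrative) */

    /* Hypothetical callback computing the error gradient at w. */
    typedef void (*grad_fn)(const float *w, float *grad);

    /* One step of batch gradient descent: w <- w - eta * gradE(w). */
    static void gd_step(float *w, float eta, grad_fn gradE)
    {
        float g[D];
        gradE(w, g);
        for (size_t i = 0; i < D; i++)
            w[i] -= eta * g[i];
    }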
Some of the other, more robust and faster batch learning algorithms are conjugate gradients and quasi-Newton methods. Gradient-based methods require the algorithm to be run many times to obtain an optimal solution, which proves computationally very costly for large datasets. There exists yet another method of making machines learn: passing records from the training set one at a time (online learning). To overcome the aforementioned weaknesses of batch gradient-based methods, there is an online gradient descent algorithm that has proved useful in practice for training neural networks on large data sets (Le Cun et al. 1989). Also called sequential or stochastic gradient descent, it updates the weight vector of the function based on one record at a time, taking records either in consecutive order or randomly. The steps of stochastic gradient descent are similar to those outlined above for batch gradient descent, with the difference of considering one data point per iteration.

The algorithm given in Section 2.2 is a parallel version of stochastic gradient descent based on the concept of delayed update.

2.2 Algorithm (referenced from [Langford, Smola and Zinkevich, 2009])

Input: feasible space $W \subseteq \mathbb{R}^n$, annealing schedule $\eta_t$ and delay $\tau \in \mathbb{N}$
Initialization: set $w_1, \ldots, w_\tau = 0$ and compute the corresponding $g_t = \nabla f_t(w_t)$
For $t = \tau + 1$ to $T + \tau$ do
  Obtain $f_t$ and incur loss $f_t(w_t)$
  Compute $g_t = \nabla f_t(w_t)$
  Update $w_{t+1} = \operatorname{argmin}_{w \in W} \left\| w - \left( w_t - \eta_t g_{t-\tau} \right) \right\|$
End for

Here each $f_i : \mathcal{X} \to \mathbb{R}$ is a convex function, where $\mathcal{X}$ is a Banach space. The goal is to find a parameter vector $w$ such that the sum over the functions $f_i$ takes the smallest possible value. If $\tau = 0$, the algorithm becomes standard stochastic gradient descent. Instead of updating the parameter vector $w_t$ by the current gradient $g_t$, it is updated by a delayed gradient $g_{t-\tau}$.

2.3 Possible Templates for Implementation

There are three suggested implementation models for delayed stochastic gradient descent. Following any of the three leads to an effective implementation of the algorithm. Each model makes some assumptions about the dataset being used, and a model can be chosen by matching the constraints at hand against the assumptions of a specific model.

a) Asynchronous Optimization
Assume a machine with n cores, and further assume that the time taken to compute the gradient of $f_t$ is at least n times higher than that needed to update the value of the weight vector. We run stochastic gradient descent on all n cores of the machine on different instances of $f_t$ while sharing a common instance of the weight vector. Each core is allowed to update the shared copy of the weight vector in round-robin fashion. This results in a delay of $\tau = n - 1$ between when a core sees $f_t$ and when it gets to update the shared copy of the weight vector. This template is primarily suitable when the computation of $f_t$ takes a long time. It requires explicit synchronization for the update of the weight vector, since the update is an atomic operation; depending on the CPU architecture, a significant amount of bandwidth may be consumed exclusively by this synchronization.

b) Pipelined Optimization
In this form of optimization we parallelize the computation of $f_t$ itself instead of running different instances on different cores. The delay occurs in the second stage of processing: while the second stage is still busy processing the result of the first, the first has already moved on to processing $f_{t+1}$. Even in this case the weight vector is computed with a delay of $\tau$.

c) Randomization
This form of optimization is used when there is high correlation between successive functions $f_t$, so that the data cannot be treated as i.i.d. The observations are de-correlated by a random permutation of the instances. The delay in this case occurs during the update of the model parameters, because the range of de-correlation needs to exceed $\tau$ considerably.
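To make the delayed update concrete, here is a minimal serial C sketch of the loop in Section 2.2 for the unconstrained case $W = \mathbb{R}^n$, where the argmin projection reduces to the plain update; the gradient function, the fixed learning rate standing in for the annealing schedule, and the sizes are illustrative assumptions.

    #include <string.h>

    #define N   4     /* dimensionality of w (illustrative) */
    #define TAU 3     /* delay */
    #define T   1000  /* number of examples */

    /* Hypothetical per-example gradient: g = grad f_t(w). */
    extern void grad_f(int t, const float *w, float *g);

    void delayed_sgd(float *w /* length N, zero-initialized */)
    {
        /* Ring buffer holding the last TAU gradients. */
        float g[TAU][N];
        for (int t = 0; t < TAU; t++)
            grad_f(t, w, g[t]);        /* gradients at w_1..w_tau (all zero) */

        for (int t = TAU; t < T + TAU; t++) {
            float g_now[N];
            grad_f(t, w, g_now);       /* g_t at the current w_t */
            const float eta = 0.1f;    /* annealing schedule stub */
            float *g_old = g[t % TAU]; /* delayed gradient g_{t-tau} */
            for (int i = 0; i < N; i++)
                w[i] -= eta * g_old[i];          /* w_{t+1} = w_t - eta * g_{t-tau} */
            memcpy(g_old, g_now, sizeof g_now);  /* store g_t for later use */
        }
    }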
2.4 Cell Processor

The Cell processor is the first implementation of the Cell Broadband Engine Architecture (CBEA) (Figure 2), which emerged from a joint venture of IBM, Sony and Toshiba. It is a fully compatible extension of the 64-bit PowerPC architecture. The design of the CBEA was based on an analysis of workloads in a wide variety of areas such as cryptography, graphics transform and lighting, physics, fast Fourier transforms (FFT), matrix operations, and scientific workloads. The Cell processor is a multicore, heterogeneous chip carrying one 64-bit power processor element (PPE), eight specialized single-instruction multiple-data (SIMD) co-processors called synergistic processing elements (SPEs), and a high-bandwidth bus interface (the element interconnect bus), all integrated on-chip.

The PPE consists of a power processing unit (PPU) connected to 512 KB of L2 cache. It is the main processor of the Cell and is responsible for running the OS as well as managing the workload among the SPEs. The PPU is a dual-issue, in-order processor with dual-thread support; it can fetch four instructions at a time and issue two. To improve the performance of in-order issue, the PPE utilizes delayed-execution pipelines and allows limited out-of-order execution.

An SPE (Figure 4) consists of a synergistic processing unit (SPU) and a synergistic memory flow controller (SMF). It is used for the data-intensive work found in cryptography, media and high-performance scientific applications. Each SPE runs an independent application thread. The SPE design is optimized for computation-intensive applications; it has SIMD support, as mentioned above, and 256 KB of local store. The memory flow controller consists of a DMA controller along with a memory management unit (MMU) and an atomic unit to facilitate synchronization with other SPEs and with the PPE. The SPU, like the PPU, is a dual-issue, in-order processor.

Figure 2: Cell Broadband Engine Architecture

The SPU works on data that exists in its dedicated local store, which in turn depends on the channel interface for access to main memory and the local stores of other SPEs. The channel interface runs independently of the SPU and resides in the MFC. In parallel, an SPU can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers per cycle. At 3.2 GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6 GFLOPS in single precision.

The PPE and SPEs communicate through an internal high-speed element interconnect bus (EIB) [2] (Figure 3). Apart from these processors, the EIB also allows communication with off-chip memory and external I/O.

The EIB is implemented as a circular ring consisting of four 16-byte-wide unidirectional channels, two rotating clockwise and two anti-clockwise. Each channel is capable of carrying three concurrent transactions. The EIB runs at half the system clock rate and thus has an effective channel rate of 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96 B per clock (12 concurrent transactions × 16 bytes wide / 2 system clocks per transfer). The theoretical peak of the EIB at 3.2 GHz is 204.8 GB/s.

Figure 3: Element Interconnect Bus, from [3]

Figure 4: SPE, from [4]

The memory interface controller (MIC) in the Cell BE chip is connected to external RAMBUS XDR memory through two XIO channels operating at a maximum effective frequency of 3.2 GHz. The MIC has separate read and write request queues for each XIO channel, operating independently. For each channel, the MIC arbiter alternates dispatch between the read and write queues after a minimum of every eight dispatches from each queue, or until the queue becomes empty, whichever is shorter. High-priority read requests are given priority over normal reads and writes. With both XIO channels operating at 3.2 GHz, the peak raw memory bandwidth is 25.6 GB/s; however, normal memory operations such as refresh, scrubbing, and so on typically reduce the bandwidth by about 1 GB/s.

2.5 Experimental Setup

The experiment uses the asynchronous optimization template (Section 2.3); Figure 5 describes it schematically. Each core computes its own error gradient and updates a copy of the weight vector shared among all the cores. This update is carried out in round-robin fashion, giving a delay of $\tau = n - 1$ between the computation of a gradient and the corresponding update of the weight vector. Explicit synchronization is required for the atomic update of the weight vector. The experiment is run on the complete dataset using all the available cores.

Figure 5: Asynchronous optimization (each core computes an error gradient on its data in parallel and updates the shared weight vector)
3. Design and Implementation

There were three stages in the implementation of the project:

1. Pre-processing of the TREC dataset
2. Implementation of the logistic regression algorithm
3. Implementation of logistic regression in accordance with the methodologies suggested by the delayed stochastic gradient technique

3.1 Pre-processing the TREC Dataset

3.1.1 Intel Dual Core

The dataset contains 75,419 emails. These emails were tokenized on a list of symbols: white space ( ), comma (,), backslash (\), period (.), semi-colon (;), colon (:), single (') and double (") quotes, open and close parentheses (( )), braces ({ }) and brackets ([ ]), greater-than (>) and less-than (<) signs, hyphen (-), at sign (@), equals (=), newline (\n), carriage return (\r), and tab (\t). Tokenization with this symbol list yielded 2,218,878 different tokens. A dictionary of tokens, containing each token name along with a unique index, was created and stored in a file.

Figure 6: Pre-processing the TREC dataset (the raw dataset is converted to mail vectors via the complete and condensed dictionaries, producing sets F1 and F2, which are saved to disk)

3.1.2 Cell Processor

Due to the memory limitations of the Cell processor, a condensed form of the dictionary was used, containing the first hundred features of the complete dictionary. On one hand the reduced size affected the accuracy of the algorithm; on the other it became more suitable for implementation on Cell. With the condensed form we transferred 32 mail vectors (the vector form of mail representation is discussed below) per MFC operation, unlike the tens of MFC operations needed to transfer a single mail vector if the complete dictionary is used.
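A minimal C sketch of this tokenization pass, using the delimiter list above (the backslash is an assumption, since that symbol was lost in the original); reading the mail body into a writable buffer and the dictionary insertion are left to a hypothetical dict_insert.

    #include <string.h>

    /* Delimiters from Section 3.1.1. */
    static const char *DELIMS = " ,\\.;:'\"(){}[]<>-@=\n\r\t";

    extern void dict_insert(const char *token);  /* hypothetical */

    /* Split one email body in place and feed each token to the dictionary. */
    static void tokenize_mail(char *body)
    {
        for (char *tok = strtok(body, DELIMS); tok != NULL;
             tok = strtok(NULL, DELIMS))
            dict_insert(tok);
    }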
3.1.3 Representation of Emails and Labels

The emails were represented as linear vectors using a simple bag-of-words representation (Appendix I). Each entry of a mail vector is a struct holding an unsigned int for the index and a short for the weight of that index. Since the dimensionality of the complete dataset is very high, hashing (Appendix II) was used, with $2^{18}$ bins. Constructing the dictionary initially took approximately 3 hours to process ~6,000 emails; with hashing this estimate was drastically reduced, and it finally took approximately half an hour to process all the emails in the dataset. Once the dictionary was in place, along with a working framework for hashing, a second pass over the entire dataset was carried out. In this pass each email was converted to a bag-of-words representation and stored in a separate file (Figure 7).

Figure 7: Email files after pre-processing

The labels were provided separately in an array of short type. A label 1 signified that the email is 'ham' and a label -1 that it is 'spam'.

Since each mail was stored in vector form in a file, on average it took only 0.03 ms (on the Intel dual core at 2 GHz) to parse an email and load it into memory for logistic regression.
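The mail-vector entry described above might look like the following C sketch; the struct and field names are illustrative, since the report does not list its actual declarations.

    /* One non-zero component of a bag-of-words mail vector. */
    struct feature {
        unsigned int index;   /* dictionary/hash-bin index */
        short        weight;  /* occurrence count for that index */
    };

    /* A mail as a sparse vector: its non-zero entries plus their number.
     * With the condensed dictionary a mail has at most 100 features. */
    struct mail_vector {
        unsigned int   count;
        struct feature f[100];
    };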
3.2 Implementation of Logistic Regression

For a two-class problem ($C_1$ and $C_2$), the posterior probability of class $C_1$ given the input data $x$ and a set of fixed basis functions $\phi = \phi(x)$ is defined by the softmax transformation

$$p(C_1 \mid \phi) = y(\phi) = \frac{\exp(a_1)}{\exp(a_1) + \exp(a_2)} \qquad (3.1)$$

where the activation $a_k$ is given by

$$a_k = w_k^T \phi \qquad (3.2)$$

with $p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi)$, $w$ being the weight vector.

The likelihood function for input data $x$ and target data $T$ (coded in the 1-of-K coding scheme) is then

$$p(T \mid w_1, w_2) = \prod_{n=1}^{N} p(C_1 \mid \phi_n)^{t_{n1}} \, p(C_2 \mid \phi_n)^{t_{n2}} = \prod_{n=1}^{N} y_{n1}^{t_{n1}} \, y_{n2}^{t_{n2}} \qquad (3.3)$$

where $y_{nk} = y_k(\phi(x_n))$ and $T$ is the $N \times 2$ matrix of target variables with elements $t_{nk}$. The error function is the negative logarithm of the likelihood, and its gradient can be written as

$$\nabla_{w_j} E(w_1, w_2) = \sum_{n=1}^{N} \left( y_{nj} - t_{nj} \right) \phi_n \qquad (3.4)$$

The weight vector $w_k$ for class $C_k$ is updated as

$$w_k^{(\tau+1)} = w_k^{(\tau)} - \eta \nabla_{w_k} E(w_1, w_2) \qquad (3.5)$$

where $\eta$ is the learning rate.

In this project the first class is an email being 'ham' and the second class is it being 'spam'. The feature map $\phi$ is the identity function, $\phi(x) = x$, and the weight vectors are initialized to zero.

For the purpose of comparison, two versions of logistic regression were implemented: a sequential version and a parallel version. As claimed by the authors of the delayed stochastic gradient technique, the parallel version gave better performance than the sequential version without affecting the correctness of the result. The comparison of performance is given in Section 4.

3.3 Implementation of Logistic Regression with Delayed Update

To incorporate the concept of delayed update, equation (3.5) above was changed according to the algorithm described in Section 2.2. This required computing the error gradient separately on divided sets of the input. The division of input was carried out differently for the Intel dual core and the Cell processor: for the former the division was more direct, with little programming complexity, whereas for the latter it had to be carried out explicitly and involved significant programming complexity. The division of data is explained in detail in the following discussion.

The representation chosen for the mails helps improve the time performance of the algorithm. Since we store the indices of the vector components, when updating a weight vector with the contributions of a specific mail vector we do not need to iterate through the complete dimension of the weight vector and error gradient: a particular mail vector only affects the indices present in it. Figure 8 shows this concept pictorially.
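A minimal C sketch of this sparse update, reusing the mail_vector struct sketched in Section 3.1.3: the per-mail gradient contribution $(y_{nj} - t_{nj})\phi_n$ of equation (3.4) touches only the indices stored in the mail vector. The function and parameter names are illustrative assumptions.

    /* Accumulate one mail's contribution to the error gradient of class j:
     * grad[i] += (y_nj - t_nj) * phi_n[i], for the non-zero indices only. */
    static void accumulate_gradient(float *grad,  /* full dimension D */
                                    const struct mail_vector *m,
                                    float y_nj,   /* predicted posterior */
                                    float t_nj)   /* target, 0 or 1 */
    {
        float coeff = y_nj - t_nj;
        for (unsigned int i = 0; i < m->count; i++)
            grad[m->f[i].index] += coeff * (float)m->f[i].weight;
    }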
Figure 8: Only the indices present in a mail vector (e.g. 1, 6, 13, 73, 88) are touched in the error gradient and weight vector, rather than the full dimension D

3.3.1 Implementation on a Dual-Core Intel Pentium Processor

For the implementation on the Intel dual-core machine (2 GHz with 1.86 GB of main memory), the emails processed with the complete dictionary were used. The mail vectors were created as and when they were required. The first core processed all the odd-numbered emails and the second all the even-numbered ones. Each core computed the error gradient separately and maintained a private copy of the weight vectors; the shared copy of the weight vectors was updated atomically by both cores.

This implementation used OpenMP constructs to parallelize the algorithm. OpenMP helped with the division of emails: the thread number was combined with a counter to determine the mail number, which ensured that no two threads would access the same data.

Figure 9: Implementation on Intel dual core (core 1 and core 2 compute error gradients on alternating mails from set F1 and update the shared weight vector atomically)
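A minimal OpenMP sketch in C of this odd/even division, reusing the mail_vector struct sketched in Section 3.1.3; the helper functions, the fixed learning rate, and the critical section standing in for the atomic update are illustrative assumptions rather than the project's actual code.

    #include <omp.h>

    #define N_MAILS 75419

    extern void load_mail(int n, struct mail_vector *m);    /* hypothetical */
    extern void mail_gradient(const struct mail_vector *m,  /* hypothetical:      */
                              const float *w, float *g);    /* g[i] matches f[i] */

    void train_parallel(float *w /* shared weight vector */)
    {
        #pragma omp parallel num_threads(2)
        {
            int tid = omp_get_thread_num();
            /* One thread takes the odd-numbered mails, the other the even. */
            for (int n = tid; n < N_MAILS; n += 2) {
                struct mail_vector m;
                float g[100];  /* condensed dictionary: at most 100 features */
                load_mail(n, &m);
                mail_gradient(&m, w, g);
                /* Serialized, effectively atomic update of the shared copy. */
                #pragma omp critical
                for (unsigned int i = 0; i < m.count; i++)
                    w[m.f[i].index] -= 0.1f * g[i];  /* eta = 0.1 (stub) */
            }
        }
    }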
3.3.2 Implementation on Cell Broadband Engine

The implementation of the algorithm on the Cell processor used the processed mails generated from the condensed dictionary. The data was divided sequentially into chunks for each SPE. The PPE was responsible for constructing the labels and the array of mail vectors; using MFC operations the data was made available to the SPEs. Each MFC operation transferred data for 32 mails, a value chosen because of the limited capacity (256 KB) of the SPE local store.

The SIMD implementation on Cell could not benefit from the implementation model shown in Figure 10, because a full-scale SIMD implementation involves operations for converting the data into the __vector form specialized for SPEs. Since we store the indices separately, converting the data to an appropriate __vector would require rearranging it according to the indices. This rearrangement would require a large number of load operations and would negate the overall benefit of the SIMD operations. The time complexity of converting the data to __vector form would be $O(N^2)$, where N is the dimension of the mail vector.

Figure 10: Implementation on Cell (the PPE divides set F2 in main memory into chunks; SPE-1 to SPE-6 each compute error gradients and update the weight vector)
For the parallel version of the algorithm, each SPE required a maximum of four weight vectors to be stored in the local store: two owned privately by the SPE and two shared among all the SPEs. Along with the weight vectors, each SPE would also need to store two error gradients. The data type of each of these quantities is float. For a dictionary containing 2,218,878 features, the memory requirement is of the order of MBs. The following two data structures were considered for storing these quantities:

a) Storing the complete data as an array of the required dimension. This data structure is straightforward and easy to implement, but it potentially wastes memory. For the original dimension of 2,218,878 it would require approx. 50 MB of memory for each SPE instance, which is obviously not feasible since the local store of an SPE is only 256 KB.

b) Using a struct with an index and a count value for each entry. Since most of the values in the weight vector and error gradient are not required (refer to the discussion of Figure 8), this data structure significantly reduces the required memory, theoretically to the order of a few MBs (approx. 3). This is also not feasible, given the limited size of the local store.

With the data generated by the condensed dictionary and the latter data structure, the requirement was reduced to 2,400 bytes. The rest of the local store was used for storing the mail vectors and the target labels.

To hide the latency of transferring data from main memory to the local store of an SPE, the technique of double buffering can be used: while the SPU performs computation on one buffer of data, the MFC brings more data from main memory into the other, so the wait for data transfer is reduced and the latency is hidden (either partly or completely). The processing loop with double buffering is as follows:

1. The SPU queues a DMA GET to pull a portion of the problem data set from main memory into buffer #1.
2. The SPU queues a DMA GET to pull a portion of the problem data set from main memory into buffer #2.
3. The SPU waits for buffer #1 to finish filling.
4. The SPU processes buffer #1.
5. The SPU (a) queues a DMA PUT to transmit the contents of buffer #1 and then (b) queues a DMA GETB to execute after the PUT, refilling the buffer with the next portion of data from main memory.
6. The SPU waits for buffer #2 to finish filling.
7. The SPU processes buffer #2.
8. The SPU (a) queues a DMA PUT to transmit the contents of buffer #2 and then (b) queues a DMA GETB to execute after the PUT, refilling the buffer with the next portion of data from main memory.
9. Repeat from step 3 until all data has been processed.
10. Wait for all buffers to finish.
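A minimal SPU-side sketch of this double-buffering loop in C, using the standard mfc_get and tag-wait intrinsics from spu_mfcio.h; the chunk size, the process() function, and the omission of the DMA PUT of results are simplifying assumptions for illustration.

    #include <spu_mfcio.h>

    #define CHUNK 4096  /* bytes per DMA; must be a multiple of 16 */

    extern void process(void *buf, unsigned int bytes);  /* hypothetical */

    /* Stream total_bytes (assumed a multiple of CHUNK) from effective
     * address ea into two alternating local-store buffers. */
    void consume(unsigned long long ea, unsigned int total_bytes)
    {
        static char buf[2][CHUNK] __attribute__((aligned(128)));
        unsigned int off = 0, cur = 0;

        /* Steps 1-2: prime both buffers with DMA GETs on tags 0 and 1. */
        for (unsigned int b = 0; b < 2 && off < total_bytes; b++, off += CHUNK)
            mfc_get(buf[b], ea + off, CHUNK, b, 0, 0);

        for (unsigned int done = 0; done < total_bytes; done += CHUNK) {
            /* Steps 3/6: wait for the current buffer's tag to complete. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();

            process(buf[cur], CHUNK);        /* steps 4/7 */

            if (off < total_bytes) {         /* steps 5/8: refill this buffer */
                mfc_get(buf[cur], ea + off, CHUNK, cur, 0, 0);
                off += CHUNK;
            }
            cur ^= 1;                        /* switch buffers */
        }
    }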
4. Results

The experiments on the Intel dual-core machine were run using the mails processed with the complete dictionary, so the times on this machine are significantly higher than those on Cell. The serial implementation of logistic regression on the Intel dual core takes 36.93 sec and 36.45 sec over two consecutive runs.

The time taken by the parallel implementation using the delayed stochastic gradient method is given in Table 1.

Table 1: Parallel implementation on Intel dual core
Number of threads | Time in seconds (run 1) | Time in seconds (run 2)
1                 | 113.09                  | 47.09
2                 | 20.85                   | 20.92

For a single thread in the first run the time taken is very large compared to any other time, because most of the memory load operations result in a cache miss. Since all these runs were performed consecutively, the time drops drastically afterwards due to the reduction in the cache miss rate. It is also observed that this algorithm performs worse with a single thread than the serial implementation does. These times should theoretically be the same; however, in the delayed stochastic case extra time is spent on dividing the data, which with one thread ends up not being used anywhere.

Table 2 below shows the performance of the algorithm on multiple SPEs. The time performance gets better as the number of SPEs increases; these values are also plotted in the accompanying graph. Although the SPE results show better times, accuracy suffers to a great extent (results not shown here).

Table 2: Performance on Cell
Number of SPEs | Time in microseconds
1              | 47398
2              | 44419
3              | 42407
4              | 42384
5              | 42144
6              | 41966

The use of the condensed dictionary comes with a severe penalty in accuracy. This issue could be solved by using the complete dictionary; however, the memory limitation of the Cell processor constrained us to the condensed one.
5. Conclusion and Future Work

The approach of delayed update shows better time performance as parallelization increases. The improvement was shown for the Intel dual core as well as for the Cell processor. The former machine, being SMP-capable, had less data-division overhead than the latter. The Cell processor posed several limitations on the implementation of this algorithm, the primary one being the memory limitation, which caused extra communication overhead. A dataset with fewer features might be expected to achieve better speedup on this machine. For a dataset with large feature vectors, this algorithm might perform better on a symmetric multiprocessing (SMP) machine. A further study of this algorithm could be done on a more powerful SMP-capable machine with a large amount of main memory, since the amount of memory required to store the data doubles with each unit increase in the level of parallelization.
Appendix I

Bag-of-Words Representation

A bag-of-words representation is a model that represents a sentence in the form of a vector. It is frequently used in natural language processing and information retrieval. The model represents a sentence as an unordered collection of words, without any regard for grammar.

To form the vector for a sentence, first all the distinct words in it are identified, and each distinct word is given a unique identifier called its index. Each index serves as a dimension in a D-dimensional vector space, where D is the total number of unique words. The magnitude of the vector in a particular dimension is the count of words having that index. The process requires two passes through the entire dataset: in the first pass a dictionary containing the unique words along with their unique indices is created, and in the second pass the vectors are formed by reference to the dictionary.

For example, consider the following sentence:

What do you think you are doing?

Word  | Index
what  | 0
do    | 1
you   | 2
think | 3
are   | 4
doing | 5

The resulting vector for the above sentence is:

1(0) + 1(1) + 2(2) + 1(3) + 1(4) + 1(5)

The vector dimension is given in parentheses with the respective magnitude alongside. The magnitude of dimension 2 is 2 because the word "you" appears twice in the sentence; the others are 1 for the same reason.
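A minimal C sketch of the second pass for a single sentence; the dict_lookup helper (returning the index assigned in the first pass) and the dense counts array are illustrative assumptions.

    #include <string.h>

    extern int dict_lookup(const char *word);  /* hypothetical: word -> index */

    /* Build a dense count vector of length D for one sentence. */
    static void bag_of_words(char *sentence, int *counts, int D)
    {
        memset(counts, 0, D * sizeof *counts);
        for (char *tok = strtok(sentence, " ?"); tok != NULL;
             tok = strtok(NULL, " ?"))
            counts[dict_lookup(tok)]++;  /* "you" -> index 2, counted twice */
    }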
Appendix II

Hashing

Hashing is the transformation of a string of characters into a (usually shorter) fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find an item using the shorter hashed key than using the original value. It is also used in many encryption algorithms.

The hashing function used in this project is the same polynomial hash used by Oracle's JVM, reduced modulo the table size SIZE. The code performing the hashing is given below, rewritten in the equivalent Horner form to avoid the floating-point pow() call of the original:

    unsigned int hashCode(char *word, int n)
    {
        unsigned int h = 0;
        int i;
        for (i = 0; i < n; i++)
            h = 31 * h + (unsigned char)word[i];  /* h += word[i] * 31^(n-i-1) */
        return h % SIZE;  /* SIZE: the number of hash bins */
    }
References

[1] John Langford, Alexander J. Smola and Martin Zinkevich. Slow learners are fast. In Advances in Neural Information Processing Systems, 2009.
[2] Michael Kistler, Michael Perrone and Fabrizio Petrini. Cell multiprocessor communication network: Built for speed.
[3] Thomas Chen, Ram Raghavan, Jason Dale and Eiji Iwata. Cell Broadband Engine Architecture and its first implementation.
[4] Jonathan Bartlett. Programming high-performance applications on the Cell/B.E. processor, Part 6: Smart buffer management with DMA transfers.
[5] Introduction to Statistical Machine Learning, 2010 course, assignment 1.
[6] Christopher Bishop. Pattern Recognition and Machine Learning.