This document describes how to parallelize the Berlekamp-Massey algorithm (BMA) using SIMD instructions on the CPU and multiple streams on the GPU. It first gives background on the BMA and its role in finding the shortest linear feedback shift register (LFSR) that generates a given sequence. It then describes a SIMD implementation that uses SSE and AVX instructions to operate on multiple data elements at once. Finally, it describes a GPU implementation that uses multiple streams to execute kernels concurrently, which yields speedups over the serial CPU implementation for long inputs. In the evaluation, the SIMD CPU implementation outperforms the GPU with one stream for inputs shorter than 2^23 bits, while the GPU with 32 streams is the fastest option for longer inputs.
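For orientation, here is a minimal scalar sketch of the BMA over GF(2). It is not the implementation evaluated in the document. The function name `berlekamp_massey_gf2` and the choice to hold each polynomial in a single Python integer (bit j is the coefficient of x^j) are assumptions made for illustration. The update `c ^= b << (i - m)` is a shift followed by a bitwise XOR, which is the kind of operation that can be applied to many bits at once in wide registers.

```python
def berlekamp_massey_gf2(bits):
    """Return (L, coeffs) for the shortest LFSR generating `bits` over GF(2).

    coeffs[j] is the coefficient of x^j in the connection polynomial C(x),
    so that s_i = coeffs[1]*s_{i-1} ^ ... ^ coeffs[L]*s_{i-L} for i >= L.
    """
    c = 1    # current connection polynomial C(x), packed into an int
    b = 1    # copy of C(x) from before the last length change
    L = 0    # current LFSR length (linear complexity so far)
    m = -1   # index at which L last changed

    for i, s_i in enumerate(bits):
        # Discrepancy: does the current LFSR predict bit i correctly?
        d = s_i
        for j in range(1, L + 1):
            d ^= ((c >> j) & 1) & bits[i - j]

        if d:
            t = c
            # C(x) <- C(x) + x^(i-m) * B(x): one shift and one XOR.
            c ^= b << (i - m)
            if 2 * L <= i:
                L = i + 1 - L
                m = i
                b = t

    coeffs = [(c >> j) & 1 for j in range(L + 1)]
    return L, coeffs


if __name__ == "__main__":
    # Bits produced by s_i = s_{i-1} ^ s_{i-3} with seed 1, 0, 0.
    seq = [1, 0, 0]
    for i in range(3, 10):
        seq.append(seq[i - 1] ^ seq[i - 3])
    print(berlekamp_massey_gf2(seq))
```

In this sketch the discrepancy loop is written bit by bit for readability; a word-parallel version would instead AND packed words and reduce the result by parity.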