Muda Proposal

  1. MUDA: MUltiple Data Accelerator language. Project overview, Feb 24, 2008. Syoyo FUJITA
  2. ?
  3. Nikkei 225 index
  4. ?
  5. GPU slumps, CPU soars. GeForce 9800 GX2 rumor: 1 TFLOPS? (3x of G80) or 500 GFLOPS? (+50% of G80)? No update! PS3: 179.2 GFLOPS. Mac Pro octa: 204 GFLOPS. +800%, 2007 to Feb 2008.
  6. Nikkei 225 index
  7. Nikkei 225 index: subprime shock! Credit boom ends! US economy declines! Green IT! The future of the GPU trend?
  8. Accelerated computing: many-core CPU or GPGPU (GPU)?
  9. Accelerated computing: many-core CPU or GPGPU? NO! GPGPU is dead!! The GPU will be dead soon!!
  10. Why GPU -> GPGPU is BAD
     • Larger latency: host <-> PCI Express
     • Internal architecture is a black box; only the GPU maker knows it
     • Larger cost of branching
     • Debugger?
     • Programs only run on a specific GPU maker's GPUs; not portable
  11. Why CPU -> Accelerated computing is GOOD
     • Easy to program
     • CPU maker provides good internal spec documentation
     • Fast execution of branching
     • gdb :-)
     • Portable & versatile
  12. Accelerated computing: many-core CPU, with MUDA
  13. MUDA's goal
     • Extract the CPU's maximum floating-point performance for large data
     • SIMD
     • Cache-optimized computation
  14. MUDA example, in MUDA code (see the note after the slides):
      vec sqrtmu(vec x) {
          vec y0, y0x, y0xhalf;
          vec oneish = bit(0x3f800001);
          y0 = rsqrt(x);
          y0x = y0 * x;
          y0xhalf = 0.5 * y0x;
          return ((oneish - y0 * y0x) * y0xhalf + y0x);
      }
  15. x86/SSE output of the MUDA compiler (see the note after the slides):
      __m128 sqrtmu(const __m128 *x) {
          __m128 y0;
          __m128 y0x;
          __m128 y0xhalf;
          const __m128 t_vec4 = (__m128)_mm_set1_epi32(1065353217);
          __m128 oneish = t_vec4;
          const __m128 t_vec6 = (*x);
          const __m128 t_vec5 = _mm_rsqrt_ps(t_vec6);
          y0 = t_vec5;
          const __m128 t_vec8 = y0;
          const __m128 t_vec9 = (*x);
          const __m128 t_vec7 = _mm_mul_ps(t_vec8, t_vec9);
          y0x = t_vec7;
          const float t_float13 = 0.5;
          const float t_float12 = t_float13;
          const __m128 t_vec10 = _mm_set_ps1(t_float12);
          const __m128 t_vec14 = y0x;
          const __m128 t_vec11 = _mm_mul_ps(t_vec10, t_vec14);
          y0xhalf = t_vec11;
          const __m128 t_vec19 = oneish;
          const __m128 t_vec20 = y0;
          const __m128 t_vec21 = y0x;
          const __m128 t_vec15 = _mm_mul_ps(t_vec20, t_vec21);
          const __m128 t_vec16 = _mm_sub_ps(t_vec19, t_vec15);
          const __m128 t_vec22 = y0xhalf;
          const __m128 t_vec17 = _mm_mul_ps(t_vec16, t_vec22);
          const __m128 t_vec23 = y0x;
          const __m128 t_vec18 = _mm_add_ps(t_vec17, t_vec23);
          return t_vec18;
      }
  16. Why MUDA?
  17. No unified way to describe SIMD ops (see the note after the slides)
     • SSE: _mm_add_ps()
     • AltiVec: vec_add
     • SPE: spu_add
  18. The CPU ISA changes frequently
     • SSE2 (2000), SSE3 (2004), SSE4 (2006)
     • SSE5 and coming new CPU designs(?)
     • 8-element SIMD? No SIMD in future CPUs?
     • Keeping up with them is hard and unproductive: a waste of your time.
  19. MUDA (portable, CPU-independent description) -> MUDA compiler -> SSE2 C code / SSE4 C code / VMX C code / LLVM IR (CPU- or architecture-dependent code)
  20. Status
     • SSE2 backend: 75%
     • SSE4 backend: 0%
     • VMX backend: 20%
     • LLVM IR backend: 30%
     • SIMD math functions for MUDA: 5%
     • Automatic optimizer: TODO (what I'm currently working on)
  21. Future direction
     • Cache-miss analysis and memory access optimization: Valgrind, Cache Miss Equations (CME)
     • Automatic optimization, as FFTW, ATLAS, and Spiral do
     • Automatic error measurement for floating-point computation: interval arithmetic, affine arithmetic, Gappa
  22. Performance gap (chart, SIMD vs. memory): scalar : SIMD = 1 : 4, but cache miss : cache hit = 1 : 100
  23. Performance gap: optimizing memory access is much more important than SIMDization (scalar : SIMD = 1 : 4; cache miss : cache hit = 1 : 100). See the note after the slides.
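
Note on slide 14: sqrtmu() computes sqrt(x) as x * rsqrt(x), then applies one Newton-Raphson refinement step; bit(0x3f800001) injects a constant one ULP above 1.0f. The scalar C sketch below is illustrative only: the helper name approx_rsqrt and the test value are assumptions, and a real rsqrt() maps to the hardware estimate (e.g. _mm_rsqrt_ps), not to 1/sqrtf.

      #include <math.h>
      #include <stdio.h>
      #include <string.h>

      /* Stand-in for the hardware reciprocal-square-root approximation
         (rsqrt() in MUDA, _mm_rsqrt_ps on SSE); plain 1/sqrtf is used here
         only to keep the sketch self-contained. */
      static float approx_rsqrt(float x) { return 1.0f / sqrtf(x); }

      /* Scalar transcription of slide 14's sqrtmu():
         sqrt(x) ~= x * rsqrt(x), plus one Newton-Raphson refinement step. */
      static float sqrtmu_scalar(float x)
      {
          unsigned int bits = 0x3f800001u;   /* "oneish": one ULP above 1.0f */
          float oneish;
          memcpy(&oneish, &bits, sizeof oneish);

          float y0      = approx_rsqrt(x);
          float y0x     = y0 * x;            /* first estimate of sqrt(x) */
          float y0xhalf = 0.5f * y0x;
          return (oneish - y0 * y0x) * y0xhalf + y0x;
      }

      int main(void)
      {
          printf("%f\n", sqrtmu_scalar(2.0f));   /* ~1.414214 */
          return 0;
      }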
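
Note on slide 15: a minimal sketch of how the generated SSE routine might be driven from plain C. The 16-byte alignment attribute, the GCC/Clang-style build, and the test data are assumptions, not part of MUDA's output; sqrtmu() is assumed to be linked in from the code shown on slide 15.

      #include <stdio.h>
      #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_load_ps, ... */

      /* Emitted by the MUDA compiler (slide 15); compiled and linked separately. */
      __m128 sqrtmu(const __m128 *x);

      int main(void)
      {
          /* __m128 loads and stores require 16-byte alignment. */
          float in[4]  __attribute__((aligned(16))) = { 1.0f, 4.0f, 9.0f, 16.0f };
          float out[4] __attribute__((aligned(16)));

          __m128 v = _mm_load_ps(in);   /* pack four floats into one SIMD register */
          __m128 r = sqrtmu(&v);        /* four square roots in one call */
          _mm_store_ps(out, r);

          /* rsqrt plus one refinement step gives roughly 1, 2, 3, 4
             to about 22 bits of accuracy, not an exact sqrt. */
          printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
          return 0;
      }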
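
Note on slide 17: the same 4-wide float add spelled once per ISA, guarded by the usual compiler-defined macros. The wrapper names vec4f and vadd are illustrative assumptions; this per-target #ifdef boilerplate is exactly what a single MUDA description is meant to replace.

      /* One "vector add", three vendor spellings. */
      #if defined(__SSE__)
        #include <xmmintrin.h>
        typedef __m128 vec4f;
        static inline vec4f vadd(vec4f a, vec4f b) { return _mm_add_ps(a, b); }
      #elif defined(__ALTIVEC__)
        #include <altivec.h>
        typedef __vector float vec4f;
        static inline vec4f vadd(vec4f a, vec4f b) { return vec_add(a, b); }
      #elif defined(__SPU__)
        #include <spu_intrinsics.h>
        typedef __vector float vec4f;
        static inline vec4f vadd(vec4f a, vec4f b) { return spu_add(a, b); }
      #else
        #error "No SIMD target detected for this sketch"
      #endif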
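
Note on slides 22-23: an illustrative (not benchmarked) sketch of why memory access dominates. Both loops below perform the same summation and are equally SIMDizable; only the traversal order differs, so for a large matrix the column-major walk misses the cache on nearly every access while the row-major walk streams through it. The matrix size and layout are arbitrary assumptions.

      #include <stdlib.h>

      #define N 4096

      /* Sum a row-major N x N matrix two ways. Same arithmetic,
         only the memory access pattern differs. */
      static double sum_row_major(const float *a)      /* cache friendly */
      {
          double s = 0.0;
          for (size_t i = 0; i < N; i++)
              for (size_t j = 0; j < N; j++)
                  s += a[i * N + j];   /* consecutive addresses */
          return s;
      }

      static double sum_col_major(const float *a)      /* cache hostile */
      {
          double s = 0.0;
          for (size_t j = 0; j < N; j++)
              for (size_t i = 0; i < N; i++)
                  s += a[i * N + j];   /* stride of N floats per access */
          return s;
      }

      int main(void)
      {
          float *a = calloc((size_t)N * N, sizeof *a);
          if (!a) return 1;
          double r = sum_row_major(a);
          double c = sum_col_major(a);
          free(a);
          return (r == c) ? 0 : 1;     /* same result, very different speed */
      }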
