Muda Proposal

MUDA
MUltiple Data Accelerator language

Project Overview
Feb 24, 2008
Syoyo FUJITA

GPU slumps
CPU soars
GeForce 9800 GX2 rumor

1 TFlops? (3x G80)
500 GFlops? (+50% over G80)

No update!

PS3: 179.2 Gflops
Mac Pro (octa-core): 204 Gflops
+800 %

2007 to Feb 2008

Subprime shock!
Nikkei 225 index
Credit boom ends!
US economy declines!
Green IT!

Future of GPU trend

[Diagram: two routes to accelerated computing. CPU -> many-core -> accelerated computing; GPU -> GPGPU -> accelerated computing. The GPU route is marked "NO!"]

GPGPU is dead!!
The GPU will be dead soon!!

Why GPU -> GPGPU is BAD
• Higher latency: host <-> GPU over PCI Express
• The internal architecture is a black box
• Only the GPU maker knows it
• Higher cost of branching
• Debugger?
• A program runs only on a specific GPU maker’s GPUs
• Not portable

Why CPU -> Accelerated computing is GOOD

• Easy to program
• CPU makers provide good documentation of internal specs
• Fast execution of branches
• gdb :-)
• Portable & versatile

[Diagram: CPU -> many-core -> accelerated computing, with MUDA placed on the CPU route]

MUDA’s goal

• Extract the CPU’s maximum floating-point performance for large data
• SIMD
• Cache-optimized computation

MUDA example
MUDA code
vec sqrtmu(vec x)
{
    vec y0, y0x, y0xhalf;
    vec oneish = bit(0x3f800001);

    y0 = rsqrt(x);
    y0x = y0 * x;
    y0xhalf = 0.5 * y0x;

    return ((oneish - y0 * y0x) * y0xhalf + y0x);
}

x86/SSE output
__m128 sqrtmu (const __m128 * x)
{
    __m128 y0 ;
    __m128 y0x ;
    __m128 y0xhalf ;

    /* oneish = bit(0x3f800001) */
    const __m128 t_vec4 = (__m128)_mm_set1_epi32( 1065353217) ;
    __m128 oneish = t_vec4 ;

    /* y0 = rsqrt(x) */
    const __m128 t_vec6 = (*x) ;
    const __m128 t_vec5 = _mm_rsqrt_ps( t_vec6) ;
    y0 = t_vec5 ;

    /* y0x = y0 * x */
    const __m128 t_vec8 = y0 ;
    const __m128 t_vec9 = (*x) ;
    const __m128 t_vec7 = _mm_mul_ps( t_vec8 , t_vec9 ) ;
    y0x = t_vec7 ;

    /* y0xhalf = 0.5 * y0x */
    const float t_float13 = 0.5 ;
    const float t_float12 = t_float13 ;
    const __m128 t_vec10 = _mm_set_ps1( t_float12 ) ;
    const __m128 t_vec14 = y0x ;
    const __m128 t_vec11 = _mm_mul_ps( t_vec10 , t_vec14 ) ;
    y0xhalf = t_vec11 ;

    /* return ((oneish - y0 * y0x) * y0xhalf + y0x) */
    const __m128 t_vec19 = oneish ;
    const __m128 t_vec20 = y0 ;
    const __m128 t_vec21 = y0x ;
    const __m128 t_vec15 = _mm_mul_ps( t_vec20 , t_vec21 ) ;
    const __m128 t_vec16 = _mm_sub_ps( t_vec19 , t_vec15 ) ;
    const __m128 t_vec22 = y0xhalf ;
    const __m128 t_vec17 = _mm_mul_ps( t_vec16 , t_vec22 ) ;
    const __m128 t_vec23 = y0x ;
    const __m128 t_vec18 = _mm_add_ps( t_vec17 , t_vec23 ) ;
    return t_vec18 ;
}
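
The arithmetic in sqrtmu can be read as a reciprocal square root estimate followed by one Newton-Raphson correction: with y0 close to 1/sqrt(x), the result is y0x * (1 + 0.5 * (1 - x * y0^2)). The scalar sketch below models one lane of that computation. The names `float_from_bits` and `refine_sqrt` are hypothetical, and the estimate y0 is passed in by the caller as a stand-in for rsqrt(x); this is an illustration of the formula, not MUDA output.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float, in the spirit of MUDA's bit(). */
static float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Scalar model of one lane of sqrtmu() (hypothetical helper).
 * x  : input, must be positive
 * y0 : estimate of 1/sqrt(x), standing in for rsqrt(x) */
static float refine_sqrt(float x, float y0)
{
    assert(x > 0.0f);

    const float oneish  = float_from_bits(0x3f800001u); /* 1.0f plus one ulp */
    const float y0x     = y0 * x;     /* first estimate of sqrt(x) */
    const float y0xhalf = 0.5f * y0x;

    /* One correction step: y0x * (1 + 0.5 * (1 - x * y0^2)). */
    return (oneish - y0 * y0x) * y0xhalf + y0x;
}
```

For example, calling `refine_sqrt(4.0f, 0.5005f)` starts from an estimate of sqrt(4) that is off by about 0.002 and returns a value much closer to 2.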

No unified way to describe SIMD ops

• SSE: _mm_add_ps()
• AltiVec: vec_add()
• SPE: spu_add()
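
To make the problem concrete, here is a minimal sketch of what hand-written portable code has to do today: pick a header, a vector type, and an add intrinsic per target. The names `vec4`, `VEC4_ADD`, `vec4_load`, `vec4_store`, and `add4` are hypothetical, and the scalar fallback is an assumption added so the sketch builds anywhere.

```c
#include <assert.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 vec4;
#define VEC4_ADD(a, b) _mm_add_ps((a), (b))
#elif defined(__ALTIVEC__)
#include <altivec.h>
typedef vector float vec4;
#define VEC4_ADD(a, b) vec_add((a), (b))
#elif defined(__SPU__)
#include <spu_intrinsics.h>
typedef vector float vec4;
#define VEC4_ADD(a, b) spu_add((a), (b))
#else
/* Scalar fallback for targets without one of the SIMD ISAs above. */
typedef struct { float f[4]; } vec4;
static vec4 vec4_add_scalar(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}
#define VEC4_ADD(a, b) vec4_add_scalar((a), (b))
#endif

/* memcpy-based load/store avoids per-ISA alignment rules in this sketch. */
static vec4 vec4_load(const float *p)
{
    vec4 v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

static void vec4_store(float *p, vec4 v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}

/* out[i] = a[i] + b[i] for four floats, using whichever ISA was selected. */
static void add4(const float *a, const float *b, float *out)
{
    vec4_store(out, VEC4_ADD(vec4_load(a), vec4_load(b)));
}
```

Every additional operation needs the same per-ISA mapping, which is the bookkeeping a single portable description is meant to absorb.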

CPU ISAs change frequently
• SSE2 (2000), SSE3 (2004), SSE4 (2006)
• SSE5 and upcoming new CPU designs(?)
• 8-element SIMD? No SIMD in future CPUs?
• Keeping up with them is hard and unproductive. A waste of your time.

MUDA (portable, CPU-independent description)
-> MUDA compiler
-> CPU- or architecture-dependent code:
• SSE2 C code
• SSE4 C code
• VMX C code
• LLVM IR

Status
• SSE2 backend : 75 %
• SSE4 backend : 0 %
• VMX backend : 20 %
• LLVM IR backend : 30 %
• SIMD math function for MUDA : 5 %
• Automatic optimizer : TODO
= I’m currently working on

Future direction
• Cache miss analysis and memory access optimization
• Valgrind, Cache Miss Equations (CME)
• Automatic optimization
• As FFTW, ATLAS, and SPIRAL do
• Automatic error measurement for floating-point computation
• Interval arithmetic, affine arithmetic, Gappa
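
As a rough illustration of the interval arithmetic idea listed above, the sketch below carries a lower and upper bound through each operation, so the width of the result bounds how far the true value can lie from any point estimate. The `interval` type and its functions are hypothetical names. The sketch also assumes round-to-nearest and skips outward rounding; a rigorous implementation would round the lower bound down and the upper bound up.

```c
#include <assert.h>

/* Closed interval [lo, hi] (hypothetical type for illustration). */
typedef struct { double lo, hi; } interval;

static interval interval_make(double lo, double hi)
{
    assert(lo <= hi);
    interval r = { lo, hi };
    return r;
}

/* [a.lo + b.lo, a.hi + b.hi] */
static interval interval_add(interval a, interval b)
{
    return interval_make(a.lo + b.lo, a.hi + b.hi);
}

/* Min and max over the four endpoint products. */
static interval interval_mul(interval a, interval b)
{
    const double p[4] = { a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi };
    double lo = p[0], hi = p[0];
    for (int i = 1; i < 4; ++i) {
        if (p[i] < lo) lo = p[i];
        if (p[i] > hi) hi = p[i];
    }
    return interval_make(lo, hi);
}

/* Width of the interval: an upper bound on the uncertainty of the value. */
static double interval_width(interval a)
{
    return a.hi - a.lo;
}
```

Multiplying [1, 2] by [-1, 3], for instance, yields [-2, 6], an interval of width 8.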

Performance gap
[Bar chart, vertical axis 0 to 100, higher is better. Bars: SIMD (scalar : SIMD = 1:4) and Memory (cache miss : cache hit = 1:100).]
Optimizing memory access is much more important than SIMDization.
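
One common memory access optimization of the kind this slide argues for is loop tiling. The sketch below transposes a square matrix in small blocks so that reads and writes both stay within a cache-sized working set; the naive version is included for comparison. The function names and the tile size of 16 are assumptions chosen for illustration, not values from MUDA, and the best tile size depends on the target cache.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 16 }; /* hypothetical tile edge, in elements; tune per cache */

/* Straightforward transpose: writes to dst stride through memory by n. */
static void transpose_naive(const float *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled transpose: processes TILE x TILE blocks so both the source rows
 * and the destination rows of a block can stay in cache together. */
static void transpose_tiled(const float *src, float *dst, size_t n)
{
    assert(src != dst); /* out-of-place only */

    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            const size_t i_end = (ii + TILE < n) ? ii + TILE : n;
            const size_t j_end = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < i_end; ++i)
                for (size_t j = jj; j < j_end; ++j)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Both functions produce the same result; only the order of memory accesses differs, which is where the cache-miss gap comes from on large matrices.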
