Lecture 13 Local Optimization on Mobile Devices

M. V. Davydov
Lectures 13-14. Local optimization of programs for
mobile devices
"The First Rule of Program Optimization:
Don't do it.
The Second Rule of Program Optimization
(for experts only!): Don't do it yet." —
Michael A. Jackson
(British computer scientist)
ADVA School

Optimization criteria
From the user's point of view:
1. Ease of use (comfort)
2. Reliability (time running without glitches)
3. Battery life
4. Data volume (locally and in the cloud)
5. Device price

From the product manager's point of view:
1. "Cooler" than the competitors
2. Looks good
3. Does not glitch
4. Lower production cost

From the programmer's point of view:
1. Clarity of the program code
2. Flexible and clear architecture
3. Portability
4. Execution time
5. Memory usage

Means of program optimization
Global:
1. Well-thought-out program architecture
2. Use of cross-platform tools
3. Use of optimal algorithms
Local:
1. Use of compiler optimization flags
2. Parallelization
3. Use of the graphics processor

1. Compiler optimization flags

2. Parallelization
2.1. Pipelined execution
2.2. Data parallelism
2.3. Parallel execution

2.1. Pipelined execution
IF (Instruction Fetch): fetching the instruction
ID (Instruction Decode): decoding the instruction
EX (Execute): execution (address calculation)
MEM (Memory access): reading memory
WB (Register write back): writing the result

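2.2. Data parallelism
A bitwise operation on a 64-bit word processes 64 bits at once:
A:   bit63=0 ... bit0=1
B:   bit63=1 ... bit0=1
A&B: bit63=0 ... bit0=1
Addition of a 64-bit word treated as 8 packed bytes:
A:   byte7=128 ... byte0=1
B:   byte7=127 ... byte0=127
A+B: byte7=255 (?!) ... byte0=128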
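The packed-byte addition on this slide can be tried without any SIMD instructions. The sketch below is not from the lecture: `addBytes` is a hypothetical helper that adds eight bytes stored in one 64-bit word while keeping carries inside each byte, which is the property a plain 64-bit `+` does not guarantee.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the lecture): adds eight unsigned bytes packed
// into 64-bit words, lane by lane, each lane wrapping modulo 256.
inline std::uint64_t addBytes(std::uint64_t a, std::uint64_t b)
{
    const std::uint64_t top = 0x8080808080808080ULL;  // top bit of every byte
    // Add the low 7 bits of every byte; each partial sum fits in 8 bits,
    // so no carry can leave its lane.
    const std::uint64_t low = (a & ~top) + (b & ~top);
    // Restore the top bit of every lane with XOR, which never carries.
    return low ^ ((a ^ b) & top);
}
```

With the slide's values (byte7 = 128 + 127, byte0 = 1 + 127) both lanes stay independent. A lane holding 255 + 1 wraps to 0 here, whereas an ordinary 64-bit addition would carry into the neighbouring byte.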

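Example: mean of an array
template<class E, class S>
void mean(E * p, size_t n, S& res)
{
    S sum = 0;
    for(size_t i=0; i<n; ++i) sum += p[i];
    res = sum/n;
}
// Same loop, but the volatile accumulator forces a memory access on every step
template<class E, class S>
void meanV(E * p, size_t n, S& res)
{
    volatile S sum = 0;
    for(size_t i=0; i<n; ++i) sum = sum + p[i];
    res = sum/n;
}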

NEON for ARM
Supported on:
iPhone 3GS+
iPod Touch gen 3+
iPad 1+
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif
Main types:
[u]int8x8_t, [u]int8x16_t;
[u]int16x4_t, [u]int16x8_t;
float16x4_t, float16x8_t;
float32x2_t, float32x4_t;

NEON implementation
#ifdef __ARM_NEON
void meanNEON(uint8 * p, size_t n, int& res)
{
    int sum = 0; size_t total_n = n;
    // add leading unaligned data (seems it is not necessary)
    while( (uintptr_t(p)&15) && n ) { sum += *p; ++p; --n; }
    uint8_t * addr = (uint8_t *)p;
    while(n>=16)
    {
        // at most 128 vectors per block, so the 16-bit lanes cannot overflow (128*2*255 < 65536)
        size_t ln = std::min( n,size_t(16*128) ) & ~size_t(15); n-=ln;
        uint16x8_t sum16_1 = vdupq_n_u16(0);
        uint16_t res_arr[8];
        while(ln>0)
        {
            uint8x16_t data8 = vld1q_u8(addr); // load aligned data
            sum16_1 = vaddq_u16( sum16_1, vpaddlq_u8(data8) ); // add byte pairs into 16-bit lanes
            addr+=16; ln-=16;
        }
        vst1q_u16( res_arr, sum16_1);
        for(int i=0;i<8;++i) sum += res_arr[i];
    }
    p = (uint8 *)addr; for(size_t i=0; i<n; ++i) sum += p[i]; // add the remaining tail
    res = sum/total_n;
}
#endif

Results (1 thread)
1.3 GHz iPad mini2
meanNEON uint8->int 0.314 ( 4852.794 mpps)
mean uint8->int32 1.063 ( 1435.463 mpps)
mean uint8->int 1.071 ( 1424.135 mpps)
mean int->int 1.856 ( 822.160 mpps)
meanV uint8->int 22.701 ( 67.216 mpps)
5200 mpps = 4 elements/cycle
67 mpps = 20 cycles/element

Intel special instruction sets
#include <immintrin.h>

SSE2 implementation
#ifdef __SSE2__
void meanSSE2(uint8 * p, size_t n, int& res)
{
    int sum = 0;
    size_t total_n = n;
    while( (uintptr_t(p)&15) && n ) { sum += *p; ++p; --n; } // add leading unaligned data
    __m128i * addr = (__m128i *)p;
    while(n>=16)
    {
        // at most 256 vectors per block, so the 16-bit lanes cannot overflow (256*255 < 65536)
        size_t ln = std::min( n,size_t(16*256) ) & ~size_t(15); n-=ln;
        __m128i zero = _mm_setzero_si128();
        __m128i sum16_1 = _mm_setzero_si128(), sum16_2 = _mm_setzero_si128();
        uint16 res_arr[16];
        while(ln>0)
        {
            __m128i data8 = _mm_load_si128(addr); // load aligned data
            ++addr; ln-=16;
            __m128i data16_hi = _mm_unpackhi_epi8(data8, zero); // widen upper 8 bytes to 16 bits
            __m128i data16_lo = _mm_unpacklo_epi8(data8, zero); // widen lower 8 bytes to 16 bits
            sum16_1 = _mm_add_epi16(sum16_1, data16_hi);
            sum16_2 = _mm_add_epi16(sum16_2, data16_lo);
        }
        _mm_storeu_si128( (__m128i*)res_arr, sum16_1);
        _mm_storeu_si128( (__m128i*)(res_arr+8), sum16_2);
        for(int i=0;i<16;++i) sum += res_arr[i];
    }
    p = (uint8 *)addr; for(size_t i=0; i<n; ++i) sum += p[i]; // add the remaining tail
    res = sum/total_n;
}
#endif
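A different way to sum bytes with SSE2, shown only for comparison and not the lecture's method, is the `_mm_sad_epu8` intrinsic: the sum of absolute differences against zero adds each group of 8 bytes into a 64-bit lane, so no block-wise overflow handling is needed. `sumBytes` is a hypothetical name, and the sketch falls back to a scalar loop when SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

// Hypothetical variant (not the lecture's meanSSE2): byte sum via _mm_sad_epu8.
inline std::uint64_t sumBytes(const std::uint8_t* p, std::size_t n)
{
    std::uint64_t sum = 0;
    std::size_t i = 0;
#ifdef __SSE2__
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();  // two 64-bit partial sums
    for (; i + 16 <= n; i += 16)
    {
        // Unaligned load, so no alignment prologue is required.
        const __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        // |data - 0| summed over each group of 8 bytes gives two 64-bit lanes.
        acc = _mm_add_epi64(acc, _mm_sad_epu8(data, zero));
    }
    std::uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    sum = lanes[0] + lanes[1];
#endif
    for (; i < n; ++i) sum += p[i];  // scalar tail, or the whole array without SSE2
    return sum;
}
```

The mean is then `sumBytes(p, n) / n`. Whether this is faster than the unpack-and-add version depends on the processor and was not measured in the lecture.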

Results (1 thread) i5 x 2.4 GHz
meanSSE2 uint8->int 0.178 ( 8574.948 mpps)
mean int->int 0.421 ( 3620.991 mpps)
7200 mpps = 3 elements/cycle
336 mpps = 7 cycles/element

Using threads
pthreads – relatively a lot of code
OpenMP – not supported by Xcode/LLVM for iOS
C++11 threads – flexible
Algotest PARALLEL_FOR (C++11 based)
PARALLEL_FOR(0, repeat, j)
{
    S res=0;
    meanF(&v[0], v.size(), res);
} PARALLEL_END;
// instead of
// for(int j=0;j<repeat;++j) meanF(&v[0],

PARALLEL_FOR implementation
#define PARALLEL_FOR(from, to, local_var_name) \
    PARALLEL_FOR_N(sysutils::KNumThreadsAuto, from, to, local_var_name)
#define PARALLEL_FOR_N(n_threads, from, to, local_var_name) \
    sysutils::runForThreads( n_threads, (from), (to), \
        [&](int local_var_name##parallel_for_beg, int local_var_name##parallel_for_end) \
        { \
            for (int local_var_name = local_var_name##parallel_for_beg; \
                 local_var_name < local_var_name##parallel_for_end; \
                 ++local_var_name)
// the body of the user's loop goes here

runForThreads implementation
template<class Fn>
void runForThreads(int num_parts, int beg, int end, Fn&& Fx)
{
    std::vector<std::thread> threads;
    if (num_parts == KNumThreadsAuto) num_parts = getOptimalParallelThreads();
    if (num_parts <= 1)
    {
        Fx(beg, end);
        return;
    }
    // split [beg, end) into num_parts contiguous ranges, one thread per range
    for (int i = 0; i < num_parts; ++i)
    {
        int begi = beg + (end - beg) *i / num_parts;
        int endi = beg + (end - beg) *(i + 1) / num_parts;
        threads.push_back(std::thread([begi, endi, &Fx](){Fx(begi, endi); }));
    }
    for (auto& thread : threads) thread.join();
}
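The slides do not show how `PARALLEL_END` is defined or what `KNumThreadsAuto` and `getOptimalParallelThreads()` look like. The sketch below puts the pieces together under stated assumptions: `PARALLEL_END` closes the lambda and the call with `})`, `KNumThreadsAuto` is `-1`, and the thread count comes from `std::thread::hardware_concurrency()`. The `squares` function is a made-up usage example.

```cpp
#include <cassert>
#include <thread>
#include <vector>

namespace sysutils {
const int KNumThreadsAuto = -1;  // assumed sentinel value

// Assumed stand-in for getOptimalParallelThreads(): one thread per hardware core.
inline int getOptimalParallelThreads()
{
    const unsigned n = std::thread::hardware_concurrency();  // 0 when unknown
    return n ? static_cast<int>(n) : 1;
}

template<class Fn>
void runForThreads(int num_parts, int beg, int end, Fn&& fx)
{
    if (num_parts == KNumThreadsAuto) num_parts = getOptimalParallelThreads();
    if (num_parts <= 1) { fx(beg, end); return; }
    std::vector<std::thread> threads;
    for (int i = 0; i < num_parts; ++i)
    {
        const int begi = beg + (end - beg) * i / num_parts;
        const int endi = beg + (end - beg) * (i + 1) / num_parts;
        threads.emplace_back([begi, endi, &fx]() { fx(begi, endi); });
    }
    for (auto& t : threads) t.join();
}
} // namespace sysutils

// The macro opens a lambda and a for statement; the user's braces form the loop body.
#define PARALLEL_FOR_N(n_threads, from, to, var) \
    sysutils::runForThreads((n_threads), (from), (to), \
        [&](int var##_beg, int var##_end) { \
            for (int var = var##_beg; var < var##_end; ++var)
#define PARALLEL_FOR(from, to, var) \
    PARALLEL_FOR_N(sysutils::KNumThreadsAuto, from, to, var)
// Assumed definition: closes the lambda and the runForThreads call.
#define PARALLEL_END })

// Usage: every index writes only its own element, so no state is shared.
inline std::vector<int> squares(int n)
{
    std::vector<int> out(n);
    PARALLEL_FOR(0, n, i)
    {
        out[i] = i * i;
    } PARALLEL_END;
    return out;
}
```

Because the lambda captures by reference, the loop body can use local variables of the enclosing function directly. That convenience is also what makes shared accumulators easy to get wrong.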

Problems
// wrong!!! all threads update sum concurrently without synchronization (data race)
template<class E, class S>
void meanThreads(E * p, size_t n, S& res)
{
    S sum = 0;
    PARALLEL_FOR(0,n,i)
    {
        sum += p[i];
    } PARALLEL_END
    res = sum/n;
}

Problems
// Slow: every addition is an atomic operation on a shared variable
template<class E>
void meanThreads(E * p, size_t n, int& res)
{
    std::atomic_int sum(0);
    PARALLEL_FOR(0,n,i)
    {
        sum += p[i];
    } PARALLEL_END
    res = int(sum)/n;
}
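One common way around both problems, sketched here as an illustration rather than the lecture's own code, is to give every thread a private accumulator and combine the partial sums after `join()`. `meanPartial` and its `num_threads` parameter are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical alternative: each thread sums its own slice into a private
// variable, and the per-thread results are combined after join().
template<class E, class S>
void meanPartial(const E* p, std::size_t n, S& res, unsigned num_threads = 2)
{
    if (n == 0) { res = S(0); return; }
    if (num_threads == 0) num_threads = 1;
    std::vector<S> partial(num_threads, S(0));
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t)
    {
        const std::size_t beg = n * t / num_threads;
        const std::size_t end = n * (t + 1) / num_threads;
        threads.emplace_back([=, &partial]() {
            S local = 0;  // thread-private accumulator
            for (std::size_t i = beg; i < end; ++i) local += p[i];
            partial[t] = local;  // a single write per thread, no race
        });
    }
    for (auto& th : threads) th.join();
    S sum = 0;
    for (const S& s : partial) sum += s;
    res = sum / static_cast<S>(n);
}
```

This is the same idea that an OpenMP `reduction(+:sum)` clause expresses declaratively: private copies during the loop, one combine step at the end.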

OpenMP solution
int sum=0;
#pragma omp parallel for reduction(+:sum)
for (int i=0; i < n; i++)
sum += p[i];

Results (2 threads) i5 x 2.4 GHz
meanSSE2 uint8->int 0.094 (15153.367 mpps)
mean uint8->int32 0.144 (12622.245 mpps)
mean uint8->int 0.108 (14120.135 mpps)
mean uint16->int 0.132 (11589.456 mpps)
mean int->int 0.228 ( 6705.375 mpps)
14400 mpps = 6 elements/cycle

Results (2 threads)
1.3 GHz iPad mini2
meanNEON uint8->int 0.186 ( 8192.625 mpps)
mean int->int 1.297 ( 1176.455 mpps)
7800 mpps = 6 elements/cycle

Can it be simpler?
template<class E, class S>
void meanVect4(E * p, size_t n, S& res)
{
    vect4<S> sum(0);
    size_t i=0;
    // full groups of four only, so p[i+3] never reads past the end
    for(; i+4<=n; i+=4) sum += vect4<S>(p[i], p[i+1], p[i+2], p[i+3]);
    for(; i<n; ++i) sum.x += p[i];
    res = (sum.x + sum.y + sum.z + sum.w)/n;
}

Or like this?
template<class E, class S>
void meanVect4_2(E * p, size_t n, S& res)
{
    vect4<S> sum1(0);
    vect4<S> sum2(0);
    size_t i=0;
    // two independent accumulators; full groups of eight only
    for(; i+8<=n; i+=8)
    {
        sum1 += vect4<S>(p[i], p[i+1], p[i+2], p[i+3]);
        sum2 += vect4<S>(p[i+4], p[i+5], p[i+6], p[i+7]);
    }
    for(; i<n; ++i) sum1.x += p[i];
    res = (sum1.x + sum1.y + sum1.z + sum1.w +
           sum2.x + sum2.y + sum2.z + sum2.w)/n;
}
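The `vect4` class itself is not shown on the slides. To make the two functions above concrete, here is a minimal sketch assuming `vect4<T>` is a plain struct of four components with a component-wise `+=`; the real class in the author's library may differ.

```cpp
#include <cassert>
#include <cstddef>

// Assumed minimal stand-in for the lecture's vect4<T>; the real class is not shown.
template<class T>
struct vect4
{
    T x, y, z, w;
    explicit vect4(T v = T(0)) : x(v), y(v), z(v), w(v) {}
    vect4(T x_, T y_, T z_, T w_) : x(x_), y(y_), z(z_), w(w_) {}
    vect4& operator+=(const vect4& o)
    {
        // Four independent additions that a compiler may map to one SIMD add.
        x += o.x; y += o.y; z += o.z; w += o.w;
        return *this;
    }
};

template<class E, class S>
void meanVect4(const E* p, std::size_t n, S& res)
{
    vect4<S> sum;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)  // full groups of four only
        sum += vect4<S>(S(p[i]), S(p[i + 1]), S(p[i + 2]), S(p[i + 3]));
    for (; i < n; ++i) sum.x += S(p[i]);  // leftover 0..3 elements
    res = (sum.x + sum.y + sum.z + sum.w) / S(n);
}
```

The point of the four separate accumulators is that the additions no longer form one long dependency chain, which gives the pipeline and the auto-vectorizer more room.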

Results
iPad2, 2 threads
meanVect4_2 float->float 1.118 (1364.401 mpps)
meanVect4_2 float->double 1.774 ( 860.051 mpps)
meanVect4 float->float 2.970 ( 513.785 mpps)
meanVect4 float->double 2.980 ( 511.982 mpps)
mean float->float 10.873 ( 140.339 mpps)
mean float->double 13.431 ( 113.607 mpps)
i5, 2.4 GHz, 2 threads
meanVect4 float->float 0.248 ( 6160.736 mpps)
meanVect4_2 float->float 0.267 ( 5719.991 mpps)
meanVect4_2 float->double 0.313 ( 4870.904 mpps)
meanVect4 float->double 0.326 ( 4683.014 mpps)
mean float->float 0.461 ( 3310.013 mpps)
mean float->double 0.470 ( 3245.769 mpps)
int 8,16,32 -> int32, 64 - No speedup!!!

Example: Gaussian blur
$$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}$$
$$G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}} = G(x)\,G(y)$$
$$I_o(x, y) = \sum_{dx=-k}^{k} \sum_{dy=-k}^{k} G(dx, dy)\, I(x + dx, y + dy)$$
$$I_1(x, y) = \sum_{dx=-k}^{k} G(dx)\, I(x + dx, y)$$
$$I_2(x, y) = \sum_{dy=-k}^{k} G(dy)\, I_1(x, y + dy)$$
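For k = 3 the separable filter needs only four distinct weights, G(0) to G(3). The sketch below computes them and normalizes the full 7-tap kernel to sum to 1. `gaussCoefs` is a hypothetical helper, and mapping its four outputs onto the four components of a coefficient vector is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical helper: weights G(0)..G(3) for a k = 3 blur, scaled so that the
// full 7-tap kernel g[3],g[2],g[1],g[0],g[1],g[2],g[3] sums to 1.
inline std::array<float, 4> gaussCoefs(float sigma)
{
    std::array<float, 4> g{};
    float total = 0.0f;
    for (int d = 0; d < 4; ++d)
    {
        // The constant 1/(sqrt(2*pi)*sigma) is dropped; normalization replaces it.
        g[d] = std::exp(-float(d * d) / (2.0f * sigma * sigma));
        total += (d == 0) ? g[d] : 2.0f * g[d];  // off-center taps occur twice
    }
    for (float& v : g) v /= total;
    return g;
}
```

Normalizing after truncation keeps the image brightness unchanged even though the tails of the Gaussian beyond distance 3 are cut off.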

Implementation 1 (k=3)
void symmetricBlur2D(PlainImage<uint8>& image, PlainImage<uint8>& temp,
PlainImage<uint8>& out_image,fvect4 koefs,
void (*symmetricBlur)(const vect4<uint8> * data, vect4<uint8> * out, int
n, int di, fvect4 koefs) )
{
assert(image.isSameSize(temp) && image.getNumChannels()==4 &&
temp.getNumChannels()==4);
int w = image.getWidth();
int h = image.getHeight();
for(int y = 0;y<h;++y)
{
symmetricBlur( (const vect4<uint8> *)image.at(0, y),
(vect4<uint8> *)temp.at(0, y), w, 1, koefs );
}
for(int x = 0;x<w;++x)
{
symmetricBlur( (const vect4<uint8> *)temp.at(x, 0),
(vect4<uint8> *)out_image.at(x, 0), h, w, koefs );
}
}

Implementation 1 (k=3), continued
void symmetricBlur2(const vect4<uint8> * data, vect4<uint8> * out, int n, int
di, fvect4 koefs)
{
for(int i=0;i<3;++i)
{
...
}
for(int i=3;i<n-3;++i)
{
fvect4 res =
fvect4(data[i*di]) * koefs.x +
(fvect4(data[(i+1)*di])+fvect4(data[(i-1)*di]))*koefs.y +
(fvect4(data[(i+2)*di])+fvect4(data[(i-2)*di]))*koefs.z +
(fvect4(data[(i+3)*di])+fvect4(data[(i-3)*di]))*koefs.w;
out[i*di] = vect4<uint8>(res);
}
for(int i=n-3;i<n;++i)
{
...
}
}
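The border loops are elided on the slide, so the actual border rule is unknown. One possible rule, shown here on plain floats instead of `vect4<uint8>`, is to clamp out-of-range indices to the nearest valid sample. `symmetricBlur1D` and the `k[4]` weight layout are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical scalar version of the 1D pass with one possible border rule:
// indices outside [0, n) are clamped to the nearest valid sample.
// k[0] is the center weight, k[1..3] are the weights at distance 1..3.
inline void symmetricBlur1D(const float* data, float* out, int n, int di, const float k[4])
{
    auto at = [&](int i) { return data[std::min(std::max(i, 0), n - 1) * di]; };
    for (int i = 0; i < n; ++i)
    {
        out[i * di] = at(i) * k[0]
                    + (at(i + 1) + at(i - 1)) * k[1]
                    + (at(i + 2) + at(i - 2)) * k[2]
                    + (at(i + 3) + at(i - 3)) * k[3];
    }
}
```

Clamping in every iteration is simple but costs two comparisons per tap, which is presumably why the lecture code handles the first and last three elements in separate loops and keeps the middle loop free of checks.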

iPad mini retina – 4.423 mpps
i5 – 8.568 mpps

di, fvect4 koefs)
{
fvect4 dvect[8];
dvect[0] = dvect[1] = dvect[2] = dvect[3] = fvect4(*data); data += di;
dvect[4] = fvect4(*data); data += di; dvect[5] = fvect4(*data);
int bp = 6;
for(int i = 3; i<n;++i, out+=di, bp = (bp+1)&7)
{
dvect[bp] = fvect4(* (data+=di) );
fvect4 res = dvect[(bp-3)&7 ] * koefs.x +
(dvect[(bp-2)&7] + dvect[ (bp-4)&7 ]) * koefs.y +
(dvect[(bp-1)&7] + dvect[ (bp-5)&7 ]) * koefs.z +
(dvect[bp] + dvect[ (bp-6)&7 ]) * koefs.w;
*out = vect4<uint8>(res);
}
for(int i = 0; i<3;++i, out += di, bp = (bp+1)&7)
{ // last N pixels...
}
}

i5 – 14.358 mpps

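template<class TChannelType>
class ImageIndexer
{
    // ... (ChannelType is presumably an alias of TChannelType declared in the elided part)
public:
    // pixel lookup through precomputed tables: row start pointer plus column byte offset
    inline ChannelType* at(int x, int y)
    {
        return (ChannelType*)
            ((char*)(m_row_starts[y]) + m_column_shifts[x]);
    }
    inline const ChannelType* at(int x, int y) const
    {
        return (const ChannelType*)
            ((const char*)(m_row_starts[y])+m_column_shifts[x]);
    }
};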

void symmetricBlur2DImage(const ImageIndexer<uint8>& image, ImageIndexer<uint8>& temp,
ImageIndexer<uint8>& out_image,fvect4 k )
{
assert(image.isSameSize(temp) && image.getNumChannels()==4 &&
temp.getNumChannels()==4);
int w = image.getWidth(); int h = image.getHeight();
for(int y = 0;y<h;++y) for(int x = 0;x<w;++x) {
fvect4 res = fvect4( image.at(x,y) ) * k.x +
( fvect4( image.at(x,y-1) ) + fvect4( image.at(x,y+1) ) )*k.y +
( fvect4( image.at(x,y-2) ) + fvect4( image.at(x,y+2) ) )*k.z +
( fvect4( image.at(x,y-3) ) + fvect4( image.at(x,y+3) ) )*k.w;
vect4<uint8> resu(float(res.x), float(res.y), float(res.z),
float(res.w) );
*(uint32*)temp.at(x,y) = *(uint32*)&resu.x;
}
for(int y = 0;y<h;++y) for(int x = 0;x<w;++x) {
fvect4 res =
fvect4( temp.at(x,y) ) * k.x +
( fvect4( temp.at(x+1,y) ) + fvect4( temp.at(x-1,y) ) )*k.y +
( fvect4( temp.at(x+2,y) ) + fvect4( temp.at(x-2,y) ) )*k.z +
( fvect4( temp.at(x+3,y) ) + fvect4( temp.at(x-3,y) ) )*k.w;
vect4<uint8> resu( float(res.x), float(res.y), float(res.z),
float(res.w) );
*(uint32*)out_image.at(x,y) = *(uint32*)&resu.x;
}
}
i5 – 17.358 mpps
If x and y are swapped, it will be
7 times slower!!!
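The slowdown from swapping x and y comes from the memory access pattern. The sketch below assumes a row-major image (pixel (x, y) at index `y * w + x`) and shows the two loop orders side by side. Both functions are hypothetical and return the same value; only the order of memory accesses differs. The factor of 7 is the lecture's measurement and is not reproduced by this code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: row-major, pixel (x, y) stored at index y * w + x.
// Inner loop over x: consecutive addresses, friendly to caches and prefetchers.
inline long long sumRowOrder(const std::vector<unsigned char>& img, std::size_t w, std::size_t h)
{
    long long s = 0;
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            s += img[y * w + x];
    return s;
}

// Same arithmetic with the loops swapped: the inner loop strides by w bytes,
// so for wide images successive accesses tend to hit different cache lines.
inline long long sumColumnOrder(const std::vector<unsigned char>& img, std::size_t w, std::size_t h)
{
    long long s = 0;
    for (std::size_t x = 0; x < w; ++x)
        for (std::size_t y = 0; y < h; ++y)
            s += img[y * w + x];
    return s;
}
```

Timing the two functions on an image that does not fit in cache is a quick way to see the effect on a particular device.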

Use OpenGL ES to do it faster!
ES = embedded systems

OpenGL features
● OpenGL is a unified API intended solely for
working with graphics
● Platform dependence comes down to the
supported version, the available extensions,
the available memory, and the maximum object
sizes
● It contains no functions for reading files,
saving images, creating windows, handling the
mouse and keyboard, etc. (GLUT, EGL, GLEW,
GLSDK, etc. exist for that)

OpenGL Versions
Year | OpenGL | OpenGL ES (OpenGL for Embedded Systems) | Pipeline
1997 | 1.1 | | fixed
2001 | 1.3 | 1.0 | fixed
2003 | 1.5 | 1.1 | fixed
2004 | 2.0 | | shader
2007 | 2.1 | 2.0 = OpenGL 2.0 without Fixed Functions Pipeline | shader
2009 | 3.1, 3.2 | | shader
2010 | 4.1 (ES 2.0 compatible) | | shader
2012 | 4.3 (ES 3.0 compatible) | 3.0 = 2.0 + occlusion queries, transform feedback + new shader language (v 300 es) etc. | shader
2014 | 4.5 (ES 3.1 compatible) | 3.1 = 3.0 + compute shaders, independent vertex and fragment shaders, indirect draw commands | shader

The OpenGL ES 2.0 Pipeline with
Shaders
Vertex Shader
Rasterize + Interpolate
Fragment Shader
Small programs that run on
the graphics card

Shaders
Input to the vertex shader: vertices with colors, texture coords, normals
Output of the vertex shader: transformed vertices with anything you want here (e.g. normals, colors, texture coords)
Input to the fragment shader: fragments with interpolated values from the previous stage
Output of the fragment shader: fragments with colors
The slides also show a Texture Memory block in the diagram.

How many times are the fragment and vertex
shaders executed on this scene?
Vertex shader runs once per vertex
Fragment shader runs once per pixel

Vertex Shader
void main() {
gl_Position = gl_Vertex;
}
Fragment Shader
void main() {
gl_FragColor = vec4(1, 0, 0, 1);
}

Note:
The vertex shader above ignores the modelview and
projection matrices

GLSL Types
• Float types:
– float, vec2, vec3, vec4, mat2, mat3, mat4
• Bool types:
– bool, bvec2, bvec3, bvec4
• Int types:
– int, ivec2, ivec3, ivec4
• Texture types:
– sampler1D, sampler2D, sampler3D

GLSL Types
• A variable can optionally be:
– const
– uniform
• Can be set from C, outside glBegin/glEnd
– attribute
• Can be set from C, per vertex
– varying
• Set in the vertex shader, interpolated and used in the
fragment shader

Vertex Shader
attribute float vertexGrayness;
varying float fragmentGrayness;
void main() {
    gl_Position = gl_Vertex;
    fragmentGrayness = vertexGrayness;
}
Fragment Shader
uniform float brightness;
varying float fragmentGrayness;
void main() {
    const vec4 red = vec4(1, 0, 0, 1);
    const vec4 gray = vec4(0.5, 0.5, 0.5, 1);
    gl_FragColor = brightness * (red * (1.0 - fragmentGrayness) +
                                 gray * fragmentGrayness);
}

Variable qualifiers in the shaders above
attribute (vertexGrayness): set once per vertex, like color, normal, or texcoord.
uniform (brightness): set once per polygon, outside glBegin/glEnd.
varying (fragmentGrayness): matching declarations in both shaders; set at each vertex in the vertex shader, interpolated in hardware, available in the fragment shader.
OpenGL implementation (k=3)
#ifdef GL_ES
precision mediump float;
#endif
varying vec2 v_tex_coord;
uniform sampler2D s_texture;
uniform vec4 u_koefs;
uniform vec2 u_dir;
void main()
{
vec4 p1 = texture2D(s_texture, v_tex_coord);
vec4 p2 = texture2D(s_texture, v_tex_coord + u_dir) +
texture2D(s_texture, v_tex_coord - u_dir);
vec4 p3 = texture2D(s_texture, v_tex_coord + 2.0*u_dir) +
texture2D(s_texture, v_tex_coord - 2.0*u_dir);
vec4 p4 = texture2D(s_texture, v_tex_coord + 3.0*u_dir) +
texture2D(s_texture, v_tex_coord - 3.0*u_dir);
gl_FragColor = mat4(p1,p2,p3,p4) * u_koefs;
}
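The last line of the shader relies on the fact that GLSL's `mat4(p1,p2,p3,p4)` treats its arguments as columns, so multiplying by `u_koefs` yields `p1*k.x + p2*k.y + p3*k.z + p4*k.w` for every color channel at once. A CPU reference of that single expression, with a hypothetical `Vec4` struct standing in for `vec4`, might look like this:

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };  // hypothetical stand-in for GLSL vec4

// Column-matrix times vector: each output channel is the weighted sum of the
// same channel of p1..p4, with weights k.x..k.w.
inline Vec4 weightedSum(const Vec4& p1, const Vec4& p2, const Vec4& p3,
                        const Vec4& p4, const Vec4& k)
{
    return Vec4{
        p1.x * k.x + p2.x * k.y + p3.x * k.z + p4.x * k.w,
        p1.y * k.x + p2.y * k.y + p3.y * k.z + p4.y * k.w,
        p1.z * k.x + p2.z * k.y + p3.z * k.z + p4.z * k.w,
        p1.w * k.x + p2.w * k.y + p3.w * k.z + p4.w * k.w
    };
}
```

Here `p2`, `p3` and `p4` already hold the sums of the two symmetric neighbours, exactly as in the shader, so the four weights cover all seven taps.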

OpenGL (k=3), continued
m_gaussian_blur_program =
MyGL::createProgramFromResources("single_tex_vert.vsh","gaussian_blur_frag.fsh");
///…..
MyGL::writeRGBATexture(tex_in, tex_w, tex_h, m_image.getData());
MyGL::useProgram(m_gaussian_blur_program);
glUniform2f( m_gaussian_blur_program->getUniformLocator("u_dir"), 1.0f/tex_w, 0 );
glUniform4fv( m_gaussian_blur_program->getUniformLocator("u_koefs"), 1, &koefs.x );
{
MyGL::PushTextureRender tr(m_tex1, tex_w, tex_h);
glBindTexture(GL_TEXTURE_2D, tex_in);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
MyGL::displayRectI(0,0,tex_w,tex_h, 0, tex_in);
}
{
MyGL::PushTextureRender tr(m_tex2, tex_w, tex_h);
glBindTexture(GL_TEXTURE_2D, m_tex1);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glUniform2f( m_gaussian_blur_program->getUniformLocator("u_dir"), 0, 1.0f/tex_h );
MyGL::displayRectI(0,0,tex_w,tex_h, 0, m_tex1);
}
MyGL::useProgram(0); glFinish(); MyGL::readRGBATexture(m_tex2, tex_w, tex_h,
m_result.getData());

OpenGL (k=3) performance
Write
i5 – 526.217 mpps
Compute
i5 – 559.397 mpps
Read
i5 – 176.354 mpps
Total
iPad mini retina – 100 mpps
i5 – 100 mpps

What is Metal?
Metal is Apple’s framework to provide the lowest overhead, highest
performance access to the GPU. Although GPU programming is often
associated with rendering and texturing 3D scenes, Metal supports
compute shaders which allow massively parallel computation over
datasets such as image or bitmap data. This allows developers to
manipulate still and moving images with existing filters such as blur and
convolution or even create their own filters from scratch.
Metal is now available under both iOS and OS X and the same Metal
code can be used across both classes of device. Metal is coded in a
language based on C++ and its shaders, which are the small programs
that are executed on the GPU, are precompiled to offer the highest
performance.

MetalKit
The easiest way to integrate Metal into a Swift project is with Apple’s
MetalKit framework. Its MTKView class is an extension of UIView and
offers a simple API to display textures to the user.
Furthermore, MetalKit includes MTKTextureLoader which allows
developers to generate Metal textures from image assets.
There are a few steps required to get up and running with image
processing in Metal which, briefly, are:

Main Metal Objects
Creating a device
A Metal device (MTLDevice) is the interface to the GPU. It
supports methods for creating objects such as function libraries
and textures.
Creating a library
The library (MTLLibrary) is a repository of kernel functions (in
our case, compute shaders) that are typically compiled into your
application by Xcode.
Creating a Command Queue
The command queue (MTLCommandQueue) is the object that
queues and submits commands to the device for execution.

Vertex, Fragment and Compute
Shaders
GPU programming has typically been based on two different types of
shader or program.
A vertex shader is responsible for taking each vertex of each object in
a 3D scene and translating its position from its 3D universe to a screen
location.
The second type of shader, a fragment shader, takes the data from
the vertex shader and calculates the colour for every on-screen pixel.
It’s responsible for shading, texturing, reflections and refractions and so
on.
However, for image processing, we need to use a third type of shader,
a compute shader. Compute shaders aren’t concerned with 3D space
at all; rather, they accept a dataset of any type (vector, image, tensor,
etc.)

Image editing with Metal Compute
Shaders
The simplest compute shader for image processing would simply pass
through the colour of each pixel from a source image and write it to a
target image. The variant below instead reads each colour from the corner
of its 50x50 block, which pixellates the image:
kernel void customFunction(
    texture2d<float, access::read> inTexture [[texture(0)]],
    texture2d<float, access::write> outTexture [[texture(1)]],
    uint2 gid [[thread_position_in_grid]])
{
    const uint2 pixellatedGid = uint2((gid.x / 50) * 50, (gid.y / 50) * 50);
    const float4 colorAtPixel = inTexture.read(pixellatedGid);
    outTexture.write(colorAtPixel, gid);
}

Metal Performance Shaders (since
iOS 9.0)
There are some very common image processing functions, such as
Gaussian blur and Sobel edge detection, that Apple have written as
Metal Performance Shaders.
These have been optimised to extraordinary lengths. For example their
version of Gaussian blur actually consists of 821 different
implementations that Apple tested for different conditions (e.g. kernel
radius, pixel format, memory hierarchy, etc.) and the shader
dynamically selects to give the best performance.
These shaders are easy to incorporate into existing Swift code that
runs Metal. After instantiating a Metal Performance Shader (here we’ll
use a Gaussian blur), simply invoking encodeToCommandBuffer
creates a new texture based on a filtered version of the source texture.

Gaussian Blur with Metal
Performance Shaders
let blur = MPSImageGaussianBlur(device: device!, sigma: 50)
blur.encodeToCommandBuffer(commandBuffer,
sourceTexture: sourceTexture,
destinationTexture: destinationTexture)

func myBlurTextureInPlace(inTexture: MTLTexture, blurRadius: Float, queue: MTLCommandQueue)
{
// Create the usual Metal objects.
// MPS does not need a dedicated MTLCommandBuffer or MTLComputeCommandEncoder.
// This is a trivial example. You should reuse the MTL objects you already have, if you have them.
let device = queue.device;
let buffer = queue.makeCommandBuffer();
// Create an MPS filter.
let blur = MPSImageGaussianBlur(device: device, sigma: blurRadius)
// Defaults are okay here for other MPSKernel properties (clipRect, origin, edgeMode).
// Attempt to do the work in place. Since we provided a copyAllocator as an out-of-place
// fallback, we don’t need to check to see if it succeeded or not.
// See the "Minimal MPSCopyAllocator Implementation" code listing for a sample myAllocator.
let inPlaceTexture = UnsafeMutablePointer<MTLTexture>.allocate(capacity: 1)
inPlaceTexture.initialize(to: inTexture)
blur.encode(commandBuffer: buffer, inPlaceTexture: inPlaceTexture,
fallbackCopyAllocator: myAllocator)
// The usual Metal enqueue process.
buffer.commit()
buffer.waitUntilCompleted()
}

Basic Categories of Metal
Performance Shader Functions
Working with Convolutional Neural Networks
Image Filter Base Classes
Morphological Image Filters
Convolution Image Filters
Histogram Image Filters
Image Threshold Filters
Image Integral Filters
Converting, Transforming and Transposing Images
Working with Matrices
Working with Images

Metal can also be used for rendering in
a 3D graphics pipeline

Vulkan compatibility
Initial specifications stated that Vulkan will work on
hardware that currently supports OpenGL ES 3.1 or
OpenGL 4.x and up. As Vulkan support requires new
graphics drivers, this does not necessarily imply that
every existing device that supports OpenGL ES 3.1 or
OpenGL 4.x will have Vulkan drivers available.
Android 7.0 Nougat supports Vulkan. The software was
released in August 2016.
Vulkan support for iOS and macOS has not been
announced by Apple, but at least one company
provides a Vulkan implementation that runs on top of
Metal on iOS and macOS devices.

OpenGL vs Vulkan
OpenGL | Vulkan
One single global state machine | Object-based with no global state
State is tied to a single context | All state concepts are localized to a command buffer
Operations can only be executed sequentially | Multi-threaded programming is possible
GPU memory and synchronization are usually hidden | Explicit control over memory management and synchronization
Extensive error checking | Vulkan drivers do no error checking at runtime; there is a validation layer for developers

Optimization and compatibility
Technology | Compatible devices
NEON | Only ARMv7+ devices (almost all Android and all iOS devices)
SSE | Mostly AMD and Intel x86 devices
OpenGL (ES) | All modern Android, iOS, OS X, and Linux-based devices
Vulkan | Android 7.0, new AMD and NVIDIA graphics cards
Metal | iPhone 5S+, iPad Air+, iPad Mini Retina, almost all Macs after 2013

Test
May 11

List of topics for preparing for the final test in the course
"Mobile Device Software"
M. V. Davydov
1. Main mobile operating systems and their versions:
Android, iOS, WatchOS, ROS.
2. Main mobile app stores.
3. Mobile device hardware: screens, processors,
communication facilities, connectors.
4. The mobile app development process: MVP, UX pyramid,
business mission and goals.
5. Human-centered design and UX.
6. Personas, scenarios, product use cases.
7. Users, tasks, context.
8. Key questions to answer before developing a mobile app.
9. Main problems that arise when developing mobile apps.
10. Interface solutions, main navigation models in mobile app
interfaces.
11. How the specifics of working with a touch screen influence the
interface solution.
12. Main tools for prototyping mobile apps.
13. Main components of an Android program.
14. The AndroidManifest.xml file.
15. Android Monitor → LogCat.
16. The adb program.
17. The Gradle program.
18. Android UI controls.
19. Event handling on Android.
20. Program resources on Android.
21. Layout of user interface elements on Android.
22. Creating interface elements from resources on Android.
23. Animating the layout of program elements on Android.
24. Data storage on the Android platform.
25. Working with an SQLite database on Android.
26. Networking facilities on Android.
27. Cloud services for mobile programs: Firebase, Microsoft
Azure Mobile Services.
28. Tools for cross-platform mobile software development (using
Xamarin and Xamarin Forms as examples).
29. The Objective-C programming language: basic concepts, file
types, objects, strings, selectors.
30. The Objective-C programming language: classes, methods,
messages.
31. The Objective-C programming language: specifics of object
reference counting, @autoreleasepool.
32. The Swift programming language: basic concepts and language constructs.
33. The Swift programming language: loops, arrays, dictionaries,
optionals, classes, methods.
34. Interaction between Swift and Objective-C code.
35. Program optimization for mobile applications.
36. NEON, MMX, OpenGL, Metal, Metal Performance Shaders,
Vulkan.
