Optimizing computer vision problems on mobile platforms
Looksery.com
Fedor Polyakov
Software Engineer, CIO
Looksery, INC
fedor@looksery.com
+380 97 5900009 (mobile)
www.looksery.com
Optimize algorithm first
• If your algorithm is suboptimal, “technical” optimizations won’t be as effective as fixing the algorithm itself
• When you change the algorithm, you will probably have to redo your technical optimizations too
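To illustrate the point, here is a minimal sketch that is not from the talk: a hypothetical sliding-window sum (the core of a box blur) written twice. The function names boxSumNaive and boxSumRunning are assumptions made for this example. The second version changes the algorithm from O(n * k) to O(n), a gain that no amount of low-level tuning of the first version would match for large k.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive version: recomputes every window from scratch, O(n * k).
std::vector<int> boxSumNaive(const std::vector<int>& src, std::size_t k) {
    assert(k > 0 && k <= src.size());
    std::vector<int> out(src.size() - k + 1, 0);
    for (std::size_t i = 0; i < out.size(); ++i) {
        for (std::size_t j = 0; j < k; ++j) {
            out[i] += src[i + j];
        }
    }
    return out;
}

// Running-sum version: slides the window by adding the element that
// enters it and subtracting the one that leaves, O(n) for any k.
std::vector<int> boxSumRunning(const std::vector<int>& src, std::size_t k) {
    assert(k > 0 && k <= src.size());
    std::vector<int> out(src.size() - k + 1, 0);
    int window = 0;
    for (std::size_t j = 0; j < k; ++j) {
        window += src[j];
    }
    out[0] = window;
    for (std::size_t i = 1; i < out.size(); ++i) {
        window += src[i + k - 1] - src[i - 1];
        out[i] = window;
    }
    return out;
}
```

Both functions return the same values, so the faster one can be checked against the slower one before any SIMD work starts.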
SIMD operations
• Single instruction, multiple data
• On ARMv7 NEON, 16 × 128-bit registers (each holds up to 4 int32_t/float values or, on AArch64 only, 2 doubles)
• Uses a few more cycles per instruction, but operates on much more data
• Can ideally give a performance boost of up to 4x (in my experience, typically ~2-3x)
• Can be used for many image processing algorithms
• Especially useful for linear algebra problems
Using computer vision/algebra/DSP libraries
• The easiest way: you just use the library and it does everything for you
• Eigen: a great header-only library for linear algebra
• Ne10: a NEON-optimized library for some image processing/DSP on Android
• Accelerate.framework: lots of image processing/DSP on iOS
• OpenCV, unfortunately, is quite weakly optimized for ARM SIMD (though ~40 low-level functions were optimized in OpenCV 3.0)
• There are also some commercial libraries
• + Everything is done without any effort on your part
• - You should still profile and analyze the generated assembly to verify that everything is vectorized as you expect
GCC/clang vector extensions
using v4si = int __attribute__ ((vector_size (VECTOR_SIZE_IN_BYTES)));
v4si x, y;
• All common operations on x and y are now vectorized
• Written once for all architectures
• Supported operations: +, -, *, /, unary minus, ^, |, &, ~, %, <<, >>, comparisons
• Load from memory like this (ptr must be suitably aligned): x = *((v4si*)ptr);
• Store back to memory like this: *((v4si*)ptr) = x;
• The subscript operator gives access to individual elements
• Not all SIMD operations are supported
• May produce suboptimal code
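A minimal sketch of the vector-extension approach, assuming GCC or clang and a 16-byte vector of four ints. The helper names loadV4, storeV4, mulAdd and horizontalSum are made up for this example. It uses memcpy for loads and stores instead of the pointer cast shown on the slide, so it also works for unaligned data; compilers normally turn these calls into plain vector loads and stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Four 32-bit ints packed into one 16-byte vector (GCC/clang extension).
typedef int v4si __attribute__((vector_size(16)));

// Load and store through memcpy: no alignment requirement on p.
inline v4si loadV4(const int* p) {
    v4si v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

inline void storeV4(int* p, v4si v) {
    std::memcpy(p, &v, sizeof v);
}

// out[i] = a[i] * b[i] + a[i], four elements per iteration.
// n must be a multiple of 4 (an assumption of this sketch).
void mulAdd(const int* a, const int* b, int* out, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        v4si x = loadV4(a + i);
        v4si y = loadV4(b + i);
        storeV4(out + i, x * y + x);  // ordinary operators work lane-wise
    }
}

// The subscript operator reads individual lanes.
int horizontalSum(v4si v) {
    return v[0] + v[1] + v[2] + v[3];
}
```

The same source compiles for ARM and x86; which instructions the compiler picks is worth checking in the disassembly.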
SIMD intrinsics
• Provide custom data types and a set of C functions to vectorize code
• Example: float32x4_t vrsqrtsq_f32(float32x4_t a, float32x4_t b);
• Generally similar to the previous approach, but give you better control and the full instruction set
• Cons:
• You have to write separate code for each platform
• In all the approaches above, the compiler may insert instructions that hand-crafted code could avoid
• The compiler might generate code that doesn’t use the pipeline efficiently
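A minimal sketch of the intrinsics approach, assuming a compiler that defines __ARM_NEON and ships arm_neon.h when NEON is available. The function name multiplyAdd is hypothetical. The scalar loop handles the tail and also serves as the fallback on other platforms, which shows the "separate code for each platform" cost in miniature.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// dst[i] = a[i] * b[i] + c[i]
void multiplyAdd(const float* a, const float* b, const float* c,
                 float* dst, std::size_t n) {
    assert(a != nullptr && b != nullptr && c != nullptr && dst != nullptr);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // NEON path: four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vld1q_f32(c + i);
        vst1q_f32(dst + i, vmlaq_f32(vc, va, vb));  // vc + va * vb
    }
#endif
    // Scalar tail, and the whole loop when NEON is not available.
    for (; i < n; ++i) {
        dst[i] = a[i] * b[i] + c[i];
    }
}
```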
Handcrafted ASM code
• Gives you the most control: you know exactly what code will be generated
• If written carefully, can sometimes be up to 2 times faster than the code the compiler generates with the previous approaches (usually 10-15%, though)
• You need to write separate code for each architecture :(
• You need to learn the assembly language
• Harder to write
• To get the maximum possible performance, some additional steps may be required
Some other tricks
• Reduce data types to the smallest size that works
• If you can change double to int16_t, you’ll get more than a 4x performance boost
• Try the pld instruction: it “hints” the CPU to load data that will be needed soon into the cache (available as __builtin_prefetch)
• If you use intrinsics, watch out for extra loads/stores that you may be able to get rid of
• Use loop unrolling
• Interleave load/store instructions with arithmetic operations
• Use proper memory alignment: misaligned accesses can cause crashes or slow things down
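A few of these tricks combined in one sketch, assuming GCC or clang. The type v8hi, the function addOffset and the prefetch distance of 64 elements are choices made for this example, not values from the talk. It works on int16_t (eight lanes per 128-bit register), requires 16-byte aligned data, and issues a prefetch hint for data a few iterations ahead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Eight 16-bit lanes in one 128-bit vector. may_alias lets us read an
// int16_t array through this type without breaking aliasing rules.
typedef std::int16_t v8hi __attribute__((vector_size(16), may_alias));

// Adds offset to every element in place.
// Assumptions of this sketch: n is a multiple of 8 and data is
// 16-byte aligned (the pointer-cast accesses below rely on it).
void addOffset(std::int16_t* data, std::size_t n, std::int16_t offset) {
    assert(n % 8 == 0);
    assert(reinterpret_cast<std::uintptr_t>(data) % 16 == 0);
    const v8hi off = {offset, offset, offset, offset,
                      offset, offset, offset, offset};
    for (std::size_t i = 0; i < n; i += 8) {
        // Hint that data 64 elements ahead will be needed soon.
        // The distance is a guess and should be tuned with a profiler.
        if (i + 64 < n) {
            __builtin_prefetch(data + i + 64);
        }
        v8hi v = *reinterpret_cast<v8hi*>(data + i);  // aligned load
        v += off;
        *reinterpret_cast<v8hi*>(data + i) = v;       // aligned store
    }
}
```

Declaring the buffer with alignas(16), or allocating it with an aligned allocator, satisfies the alignment assumption.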
Some benchmarks
• Sum of matrix rows
• Matrices are 128x128, test is repeated 10^5 times
// Non-vectorized code
for (int i = 0; i < matSize; i++) {
    for (int j = 0; j < matSize; j++) {
        rowSum[j] += testMat[i][j];
    }
}
// Vectorized code
for (int i = 0; i < matSize; i++) {
    for (int j = 0; j < matSize; j += vectorSize) {
        VectorType x = *(VectorType*)(testMat[i] + j);
        VectorType y = *(VectorType*)(rowSum + j);
        y += x;
        *(VectorType*)(rowSum + j) = y;
    }
}
Some benchmarks
Tested on iPhone 5, results on other phones show pretty much the same
[Chart: time in seconds (axis 0 to 10) for the Simple and Vectorized versions, with int, float, and short elements]
Got more than 2x performance boost, mission completed?
Some benchmarks
[Chart: time in seconds (axis 0 to 10) for the Simple, Vectorized, and Loop unroll versions, with int, float, and short elements]
Got another ~15%
for (int i = 0; i < matSize; i++) {
    auto ptr = testMat[i];
    for (int j = 0; j < matSize; j += 4 * xSize) {
        auto ptrStart = ptr + j;
        VT x1 = *(VT*)(ptrStart + 0 * xSize);
        VT y1 = *(VT*)(rowSum + j + 0 * xSize);
        y1 += x1;
        VT x2 = *(VT*)(ptrStart + 1 * xSize);
        VT y2 = *(VT*)(rowSum + j + 1 * xSize);
        y2 += x2;
        VT x3 = *(VT*)(ptrStart + 2 * xSize);
        VT y3 = *(VT*)(rowSum + j + 2 * xSize);
        y3 += x3;
        VT x4 = *(VT*)(ptrStart + 3 * xSize);
        VT y4 = *(VT*)(rowSum + j + 3 * xSize);
        y4 += x4;
        *(VT*)(rowSum + j + 0 * xSize) = y1;
        *(VT*)(rowSum + j + 1 * xSize) = y2;
        *(VT*)(rowSum + j + 2 * xSize) = y3;
        *(VT*)(rowSum + j + 3 * xSize) = y4;
    }
}
Some benchmarks
Let’s take a look at the profiler
Some benchmarks
// Non-vectorized code, loop order swapped
for (int i = 0; i < matSize; i++) {
    for (int j = 0; j < matSize; j++) {
        rowSum[i] += testMat[j][i];
    }
}
// Vectorized, loop-unrolled code
for (int i = 0; i < matSize; i += 4 * xSize) {
    // Load the accumulators once; they stay in registers for the whole inner loop
    VT y1 = *(VT*)(rowSum + i);
    VT y2 = *(VT*)(rowSum + i + xSize);
    VT y3 = *(VT*)(rowSum + i + 2 * xSize);
    VT y4 = *(VT*)(rowSum + i + 3 * xSize);
    for (int j = 0; j < matSize; j++) {
        VT x1 = *(VT*)(testMat[j] + i);
        VT x2 = *(VT*)(testMat[j] + i + xSize);
        VT x3 = *(VT*)(testMat[j] + i + 2 * xSize);
        VT x4 = *(VT*)(testMat[j] + i + 3 * xSize);
        y1 += x1;
        y2 += x2;
        y3 += x3;
        y4 += x4;
    }
    // Store each accumulator back exactly once
    *(VT*)(rowSum + i) = y1;
    *(VT*)(rowSum + i + xSize) = y2;
    *(VT*)(rowSum + i + 2 * xSize) = y3;
    *(VT*)(rowSum + i + 3 * xSize) = y4;
}
Some benchmarks
[Chart: time in seconds (axis 0 to 10) for the Simple and Vect + Loop versions, with int, float, and short elements]
Some benchmarks
[Chart: time in seconds (axis 0 to 10) for the Simple, Vectorized, Vect + Loop, Eigen, SumOrder, and Asm versions, with float elements]
Using GPGPU
• Around 1.5 orders of magnitude more theoretical performance
• On iPhone 5, the CPU delivers ~800 MFLOPS, the GPU 28.8 GFLOPS
• On iPhone 5S, the CPU delivers ~1.5 GFLOPS, the GPU 76.4 GFLOPS!
• Can be very hard to utilize efficiently
• CUDA, obviously, isn’t available on mobile devices
• OpenCL isn’t available on iOS and is barely available on Android
• On iOS, Metal is available for GPGPU, but only starting with the iPhone 5S
• On Android, Google promotes RenderScript for GPGPU
• So the only cross-platform way is to use OpenGL ES (2.0)
Common usage of shaders for GPGPU
[Diagram: image data → Shader 1 → texture containing processed data → Shader 2 → … → result data → display on screen or read back to CPU]
Common problems
• Textures were designed to hold RGBA8 data
• On almost all phones released since 2012, half-float and float textures are supported as input
• Bilinear filtering of float textures may be unsupported or inefficient
• On many devices, writing from a fragment shader to half-float (16-bit) textures is supported
• Emulating fixed-point arithmetic is pretty straightforward
• Emulating floating-point is possible, but a bit tricky and requires more operations
• Changing OpenGL state may be expensive
• For-loops with a non-constant number of iterations are not supported on older devices
• Reading back from GPU to CPU is very expensive
• There are some platform-dependent ways to make it faster
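One way to read the fixed-point bullet is sketched below as a CPU-side C++ model: a value in [0, 1] is stored as 16-bit fixed point split across two 8-bit channels of an RGBA8 texel. The struct Packed16 and the functions packUnorm16 and unpackUnorm16 are hypothetical names; a real implementation would do the same arithmetic in the fragment shader, and the talk does not say which encoding was used.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Two 8-bit channels (for example R and G of an RGBA8 texel).
struct Packed16 {
    std::uint8_t hi;
    std::uint8_t lo;
};

// Encodes value in [0, 1] as 16-bit fixed point: hi holds the top
// 8 bits, lo the bottom 8 bits.
Packed16 packUnorm16(float value) {
    assert(value >= 0.0f && value <= 1.0f);
    const std::uint32_t fixed =
        static_cast<std::uint32_t>(std::lround(value * 65535.0f));
    return {static_cast<std::uint8_t>(fixed >> 8),
            static_cast<std::uint8_t>(fixed & 0xFFu)};
}

// Recombines the two channels and scales back to [0, 1].
float unpackUnorm16(Packed16 p) {
    const std::uint32_t fixed =
        (static_cast<std::uint32_t>(p.hi) << 8) | p.lo;
    return static_cast<float>(fixed) / 65535.0f;
}
```

The round trip loses at most one step of 1/65535, compared with 1/255 for a single 8-bit channel.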
Tasks that can be solved on OpenGL ES
• Image processing
• Image binarization
• Edge detection (Sobel, Canny)
• Hough transform (though some parts can’t be implemented on the GPU)
• Histogram equalization
• Gaussian blur/other convolutions
• Colorspace conversions
• Many more examples in the GPUImage library for iOS
• For other tasks, it depends on many factors
• We tried to implement our tracking on the GPU, but didn’t get the expected performance boost
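The image processing tasks above fit the fragment shader model because each output pixel depends only on a small input neighbourhood. As a CPU-side illustration (not the talk's code), here is a sketch of binarization where binarizePixel plays the role of the per-pixel shader body. The names Rgba8, binarizePixel and binarize, and the integer approximation of the Rec. 601 luma weights, are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rgba8 {
    std::uint8_t r, g, b, a;
};

// Per-pixel kernel: what a fragment shader would compute for one texel.
// Luma uses integer weights 77/150/29 (sum 256) approximating Rec. 601.
inline std::uint8_t binarizePixel(Rgba8 px, std::uint8_t threshold) {
    const unsigned luma = (77u * px.r + 150u * px.g + 29u * px.b) >> 8;
    return luma >= threshold ? 255 : 0;
}

// Applies the kernel to every pixel independently, which is exactly the
// kind of work the GPU runs in parallel. out must match the input size.
void binarize(const std::vector<Rgba8>& image, std::uint8_t threshold,
              std::vector<std::uint8_t>& out) {
    assert(out.size() == image.size());
    for (std::size_t i = 0; i < image.size(); ++i) {
        out[i] = binarizePixel(image[i], threshold);
    }
}
```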
Questions?
Thanks for your attention!

 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 

Fedor Polyakov - Optimizing computer vision problems on mobile platforms

  • 1. Optimizing computer vision problems on mobile platforms Looksery.com
  • 2. Fedor Polyakov Software Engineer, CIO Looksery, INC fedor@looksery.com +380 97 5900009 (mobile) www.looksery.com
  • 3. Optimize algorithm first • If your algorithm is suboptimal, “technical” optimizations won’t be as effective as just algo fixes • When you optimize the algorithm, you’d probably have to change your technical optimizations too
  • 4. SIMD operations • Single instruction, multiple data • On NEON, 16 x 128-bit wide registers (each holds up to 4 int32_t’s/floats or 2 doubles) • Uses a few more cycles per instruction, but operates on a lot more data • Can ideally give a performance boost of up to 4x (typically ~2-3x in my practice) • Can be used for many image processing algorithms • Especially useful for various linear algebra problems
  • 5. Using computer vision/algebra/DSP libraries • The easiest way: you just use a library and it does everything for you • Eigen - a great header-only library for linear algebra • Ne10 - a NEON-optimized library for some image processing/DSP on Android • Accelerate.framework - lots of image processing/DSP on iOS • OpenCV, unfortunately, is quite weakly optimized for ARM SIMD (though ~40 low-level functions were optimized in OpenCV 3.0) • There are also some commercial libraries • + Everything is done without any effort on your part • - You should still profile and analyze the generated assembly to verify that everything is vectorized as you expect
  • 6. GCC/clang vector extensions • using v4si = int __attribute__ ((vector_size (VECTOR_SIZE_IN_BYTES))); v4si x, y; • All common operations on x and y are now vectorized • Written once for all architectures • Supported operations: +, -, *, /, unary minus, ^, |, &, ~, %, <<, >>, comparisons • Loading from memory looks like x = *((v4si*)ptr); • Storing back to memory looks like *((v4si*)ptr) = x; • The subscript operator gives access to individual elements • Not all SIMD operations are supported • May produce suboptimal code
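The vector-extension usage on slide 6 can be sketched as a small self-contained function. This is only an illustration under stated assumptions: the helper name `mulAdd` and the fixed 16-byte vector size are not from the slides, and `memcpy` is used for loads and stores instead of the pointer casts shown on the slide so that the sketch does not depend on the alignment of the input pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Four 32-bit ints packed into one 16-byte vector (GCC/clang extension).
// The 16-byte size is an assumption matching a 128-bit NEON/SSE register.
typedef int v4si __attribute__((vector_size(16)));

// Hypothetical helper: dst[i] = a[i] * b[i] + a[i], four elements at a time.
// n must be a multiple of 4 in this sketch.
void mulAdd(const int* a, const int* b, int* dst, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        v4si x, y;
        std::memcpy(&x, a + i, sizeof x);   // load 4 ints
        std::memcpy(&y, b + i, sizeof y);
        v4si r = x * y + x;                 // element-wise, no intrinsics needed
        std::memcpy(dst + i, &r, sizeof r); // store 4 ints
    }
}
```

Whether the compiler turns this into the instructions you expect still has to be checked in the generated assembly, as slide 5 points out.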
  • 7. SIMD intrinsics • Provide custom data types and a set of C functions to vectorize code • Example: float32x4_t vrsqrtsq_f32(float32x4_t a, float32x4_t b); • Generally similar to the previous approach, but give you better control and the full instruction set • Cons: • You have to write separate code for each platform • In all the above approaches, the compiler may inject instructions that hand-crafted code can avoid • The compiler might generate code that doesn’t use the pipeline efficiently
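To make the "separate code for each platform" point from slide 7 concrete, here is a minimal sketch of an element-wise multiply that uses NEON intrinsics when they are available and falls back to plain scalar code otherwise. The function name `multiplyArrays` is hypothetical; `vld1q_f32`, `vmulq_f32` and `vst1q_f32` are standard `arm_neon.h` intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: dst[i] = a[i] * b[i] for n floats.
void multiplyArrays(const float* a, const float* b, float* dst, std::size_t n) {
    assert(a != nullptr && b != nullptr && dst != nullptr);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // NEON path: four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vmulq_f32(va, vb));
    }
#endif
    // Scalar path: the tail on ARM, everything on other platforms.
    for (; i < n; ++i) {
        dst[i] = a[i] * b[i];
    }
}
```

Supporting another SIMD instruction set would mean adding another `#if` branch with that platform's intrinsics, which is the maintenance cost the slide lists under "Cons".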
  • 8. Handcrafted ASM code • Gives you the most control: you know what code will be generated • If written carefully, it can sometimes be up to 2 times faster than the code the compiler generates with the previous approaches (usually 10-15%, though) • You need to write separate code for each architecture :( • You need to learn it • Harder to create • To get the maximum possible performance, some additional steps may be required
  • 9. Some other tricks • Reduce data types to the smallest that works • If you can change double to int16_t, you’ll get more than a 4x performance boost • Try using the pld intrinsic: it “hints” the CPU to load data that will be used in the near future into the caches (available as __builtin_prefetch) • If you use intrinsics, watch out for extra loads/stores that you may be able to get rid of • Use loop unrolling • Interleave load/store instructions with arithmetic operations • Use proper memory alignment: misaligned accesses can cause crashes or slow down performance
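Two of the tricks from slide 9, narrow data types and prefetch hints, can be combined in a short sketch. The helper name `sumWithPrefetch`, the prefetch distance of 64 elements and the once-per-32-elements cadence are assumptions picked for illustration; useful values depend on the CPU and have to be found by profiling. `__builtin_prefetch` is the GCC/clang builtin mentioned on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sums n int16_t samples into a 64-bit accumulator.
std::int64_t sumWithPrefetch(const std::int16_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    constexpr std::size_t kAhead = 64;  // prefetch distance in elements (a guess)
    std::int64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Roughly once per 64-byte cache line (32 int16_t values), hint that
        // data a bit further ahead will be read soon (0 = read, 3 = keep in cache).
        if (i % 32 == 0 && i + kAhead < n) {
            __builtin_prefetch(data + i + kAhead, 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

For a simple linear scan like this one the hardware prefetcher often does the job already, so the hint is more likely to pay off with irregular or strided access patterns.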
  • 10. Some benchmarks • Sum of matrix rows • Matrices are 128x128, test is repeated 10^5 times
        // Non-vectorized code
        for (int i = 0; i < matSize; i++) {
            for (int j = 0; j < matSize; j++) {
                rowSum[j] += testMat[i][j];
            }
        }
        // Vectorized code
        for (int i = 0; i < matSize; i++) {
            for (int j = 0; j < matSize; j += vectorSize) {
                VectorType x = *(VectorType*)(testMat[i] + j);
                VectorType y = *(VectorType*)(rowSum + j);
                y += x;
                *(VectorType*)(rowSum + j) = y;
            }
        }
  • 11. Some benchmarks • Tested on iPhone 5, results on other phones show pretty much the same • [Bar chart: time in seconds (axis 0-10) for Simple vs. Vectorized code; series: int, float, short] • Got more than 2x performance boost, mission completed?
  • 12. Some benchmarks • [Bar chart: time in seconds (axis 0-10) for Simple, Vectorized and Loop unroll; series: int, float, short] • Got another ~15%
        for (int i = 0; i < matSize; i++) {
            auto ptr = testMat[i];
            for (int j = 0; j < matSize; j += 4 * xSize) {
                auto ptrStart = ptr + j;
                VT x1 = *(VT*)(ptrStart + 0 * xSize);
                VT y1 = *(VT*)(rowSum + j + 0 * xSize);
                y1 += x1;
                VT x2 = *(VT*)(ptrStart + 1 * xSize);
                VT y2 = *(VT*)(rowSum + j + 1 * xSize);
                y2 += x2;
                VT x3 = *(VT*)(ptrStart + 2 * xSize);
                VT y3 = *(VT*)(rowSum + j + 2 * xSize);
                y3 += x3;
                VT x4 = *(VT*)(ptrStart + 3 * xSize);
                VT y4 = *(VT*)(rowSum + j + 3 * xSize);
                y4 += x4;
                *(VT*)(rowSum + j + 0 * xSize) = y1;
                *(VT*)(rowSum + j + 1 * xSize) = y2;
                *(VT*)(rowSum + j + 2 * xSize) = y3;
                *(VT*)(rowSum + j + 3 * xSize) = y4;
            }
        }
  • 13. Some benchmarks • Let’s take a look at the profiler
  • 14. Some benchmarks
        // Non-vectorized code
        for (int i = 0; i < matSize; i++) {
            for (int j = 0; j < matSize; j++) {
                rowSum[i] += testMat[j][i];
            }
        }
        // Vectorized, loop-unrolled code: the accumulators y1..y4 stay in
        // registers for the whole inner loop, so rowSum is loaded and stored
        // only once per group of columns
        for (int i = 0; i < matSize; i += 4 * xSize) {
            VT y1 = *(VT*)(rowSum + i);
            VT y2 = *(VT*)(rowSum + i + xSize);
            VT y3 = *(VT*)(rowSum + i + 2*xSize);
            VT y4 = *(VT*)(rowSum + i + 3*xSize);
            for (int j = 0; j < matSize; j++) {
                VT x1 = *(VT*)(testMat[j] + i);
                VT x2 = *(VT*)(testMat[j] + i + xSize);
                VT x3 = *(VT*)(testMat[j] + i + 2*xSize);
                VT x4 = *(VT*)(testMat[j] + i + 3*xSize);
                y1 += x1;
                y2 += x2;
                y3 += x3;
                y4 += x4;
            }
            *(VT*)(rowSum + i) = y1;
            *(VT*)(rowSum + i + xSize) = y2;
            *(VT*)(rowSum + i + 2*xSize) = y3;
            *(VT*)(rowSum + i + 3*xSize) = y4;
        }
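A compilable variant of the slide-14 idea, for readers who want to try it, might look like the sketch below. It is not the benchmark code from the talk: the flat row-major matrix layout, the function names, the two-accumulator unroll factor and the `memcpy`-based loads are all assumptions made to keep the example short and alignment-safe. Only the 128x128 size and the idea of keeping the accumulators in registers across the inner loop come from the slides.

```cpp
#include <cassert>
#include <cstring>

constexpr int kMatSize = 128;                      // matrix size from the slides
typedef float VT __attribute__((vector_size(16))); // 4 floats per vector
constexpr int kLanes = sizeof(VT) / sizeof(float);
static_assert(kMatSize % (2 * kLanes) == 0, "columns must split into vector pairs");

// Scalar reference: out[c] = sum over all rows of mat[r][c] (row-major storage).
void columnSumsScalar(const float* mat, float* out) {
    for (int c = 0; c < kMatSize; ++c) out[c] = 0.0f;
    for (int r = 0; r < kMatSize; ++r)
        for (int c = 0; c < kMatSize; ++c)
            out[c] += mat[r * kMatSize + c];
}

// Vectorized and unrolled by two: the accumulators live in registers for the
// whole inner loop, so out is written once per group of columns.
void columnSumsVectorized(const float* mat, float* out) {
    assert(mat != nullptr && out != nullptr);
    for (int c = 0; c < kMatSize; c += 2 * kLanes) {
        VT acc0 = {0, 0, 0, 0};
        VT acc1 = {0, 0, 0, 0};
        for (int r = 0; r < kMatSize; ++r) {
            VT x0, x1;
            std::memcpy(&x0, mat + r * kMatSize + c, sizeof x0);
            std::memcpy(&x1, mat + r * kMatSize + c + kLanes, sizeof x1);
            acc0 += x0;
            acc1 += x1;
        }
        std::memcpy(out + c, &acc0, sizeof acc0);
        std::memcpy(out + c + kLanes, &acc1, sizeof acc1);
    }
}
```

Both functions add the rows of each column in the same order, so their results should match exactly; timing them against each other on a target device is the kind of comparison the benchmark slides report.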
  • 15. Some benchmarks • [Bar chart: time in seconds (axis 0-10) for Simple vs. Vect + Loop; series: int, float, short]
  • 16. Some benchmarks • [Bar chart: time in seconds (axis 0-10) for Simple, Vectorized, Vect + Loop, Eigen, SumOrder and Asm; series: float]
  • 17. Using GPGPU • Around 1.5 orders of magnitude higher theoretical performance • On iPhone 5, the CPU has ~800 MFlops, the GPU has 28.8 GFlops • On iPhone 5S, the CPU has ~1.5 GFlops, the GPU has 76.4 GFlops! • Can be very hard to utilize efficiently • CUDA, obviously, isn’t available on mobile devices • OpenCL isn’t available on iOS and is barely available on Android • On iOS, Metal is available for GPGPU, but only starting with iPhone 5S • On Android, Google promotes Renderscript for GPGPU • So, the only cross-platform way is to use OpenGL ES (2.0)
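As a quick check of the "around 1.5 orders of magnitude" figure, using only the throughput numbers quoted on slide 17:

```latex
\frac{28.8\ \text{GFlops}}{0.8\ \text{GFlops}} = 36 \approx 10^{1.56}
\quad\text{(iPhone 5)},
\qquad
\frac{76.4\ \text{GFlops}}{1.5\ \text{GFlops}} \approx 50.9 \approx 10^{1.71}
\quad\text{(iPhone 5S)}
```

These are ratios of theoretical peaks, so they are an upper bound on what a real workload can gain.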
  • 18. Common usage of shaders for GPGPU • [Pipeline diagram: Image / Data -> Shader 1 -> Texture containing processed data -> Shader 2 -> … -> Results -> Display on screen / Read back to CPU]
  • 19. Common problems • Textures were designed to hold RGBA8 data • On almost all phones since 2012, half-float and float textures are supported as input • Bilinear filtering for float textures may be unsupported or inefficient • On many devices, writing from a fragment shader to half-float (16-bit) textures is supported • Emulating fixed-point arithmetic is pretty straightforward • Emulating floating point is possible, but a bit tricky and requires more operations • Changing OpenGL state may be expensive • For-loops with a non-constant number of iterations are not supported on older devices • Reading back from GPU to CPU is very expensive • There are some platform-dependent ways to make it faster
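The fixed-point emulation mentioned on slide 19 comes down to spreading one value across the four 8-bit channels of an RGBA8 texel. The sketch below shows that packing on the CPU side in C++ rather than in a shader; the type alias `Rgba8`, the function names and the choice of a 32-bit fraction for values in [0, 1) are assumptions for illustration, not the scheme used in the talk.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Rgba8 = std::array<std::uint8_t, 4>;  // one texel: R, G, B, A

// Hypothetical encoder: stores v in [0, 1) as a 32-bit fixed-point fraction,
// most significant byte in R, least significant byte in A.
Rgba8 packUnitValue(double v) {
    assert(v >= 0.0 && v < 1.0);
    const std::uint32_t fixed = static_cast<std::uint32_t>(v * 4294967296.0);  // v * 2^32
    return Rgba8{{static_cast<std::uint8_t>(fixed >> 24),
                  static_cast<std::uint8_t>(fixed >> 16),
                  static_cast<std::uint8_t>(fixed >> 8),
                  static_cast<std::uint8_t>(fixed)}};
}

// Matching decoder: reassembles the four bytes and rescales to [0, 1).
double unpackUnitValue(const Rgba8& p) {
    const std::uint32_t fixed = (static_cast<std::uint32_t>(p[0]) << 24) |
                                (static_cast<std::uint32_t>(p[1]) << 16) |
                                (static_cast<std::uint32_t>(p[2]) << 8) |
                                static_cast<std::uint32_t>(p[3]);
    return fixed / 4294967296.0;
}
```

A shader version has to do the same splitting with `floor` and `fract` style arithmetic on normalized channel values, and typically gets fewer usable bits because of limited shader precision on mobile GPUs.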
  • 20. Tasks that can be solved on OpenGL ES • Image processing • Image binarization • Edge detection (Sobel, Canny) • Hough transform (though some parts can’t be implemented on the GPU) • Histogram equalization • Gaussian blur/other convolutions • Colorspace conversions • Many more examples in the GPUImage library for iOS • For other tasks, it depends on many factors • We tried to implement our tracking on the GPU, but didn’t get the expected performance boost