Fedor Polyakov (Looksery) “Real-Time Face Tracking on Mobile Devices”

REAL-TIME FACE TRACKING
NOV 2014

LOOKSERY
VIDEO SELFIES + FACE FILTERS + INTEGRATED CHAT

REAL-TIME FACE TRACKING DEMO

TRACKING ALGORITHM
- The algorithm is based on the Active Appearance Model.
- Algorithm complexity is independent of image size.
- You can control the balance between tracking quality and tracking speed
using only two constants.
- The algorithm is iterative: it solves a least-squares problem at each iteration.
- On average 5 iterations per frame; maximum 10, minimum 1.
- To run at 30 fps you have to perform about 150 iterations per second.

Optimisation flow
—— : Algorithm asymptotic optimisation
3 FPS: First implementation
8 FPS: Memory preallocation
10 FPS: Algorithm parameters optimisation
13 FPS: Matrix storage optimisation and removing OOP code
18 FPS: Rewriting bottleneck code in assembler
24 FPS: Asymptotic optimisation of matrix multiplication
27 FPS: Replacing float operations with int operations
30 FPS: Multithreading

From float to int
G[i][j] = (X[i][j] - Y[i][j]) / d[j];
We had to build the so-called pseudo-inverse of G.
So we have to perform many multiplication operations. Multiplication of two ints
is much faster than multiplication of two floats. Let’s create an int matrix V:
V[i][j] = X[i][j] - Y[i][j];
And a float matrix D:
D[i][j] = (i == j ? 1.0f / d[i] : 0); // diagonal matrix of reciprocals, so that G = V * D
Then G = V * D. From linear algebra:
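One way to read this step, as a sketch rather than the slide's exact formula: assuming the pseudo-inverse meant here is the usual left pseudo-inverse $G^{+} = (G^{T}G)^{-1}G^{T}$, and using that $D$ is diagonal (so $D^{T} = D$), the expensive product can be regrouped so that it runs on the int matrix $V$:

```latex
\begin{aligned}
G &= V D, \\
G^{T} G &= (V D)^{T} (V D) = D^{T} V^{T} V D = D \,(V^{T} V)\, D, \\
(G^{T} G)_{ij} &= D_{ii}\,(V^{T} V)_{ij}\,D_{jj}.
\end{aligned}
```

Under these assumptions $V^{T}V$ needs integer multiplications only, and the float scaling by $D$ touches each entry of the result just once.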

CODE TIME
Demo benchmarks

const int ITERATIONS = 2000000000;
long long sum = 0;
for (int i = 0; i < ITERATIONS; i++)
    sum += i * (long long)i;
cout << sum << endl;
0.00 sec

long long sum = 0;
for (int i = 0; i < ITERATIONS; i++)
    sum += i * (long long)i / 3;
cout << sum << endl;
2.10 sec

float sum = 0;
for (int i = 0; i < ITERATIONS; i++)
    sum += i * (float)i / 3;
cout << sum << endl;
4.29 sec
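To reproduce this kind of comparison, a minimal timing harness could look like the sketch below. The names `secondsTaken`, `intSum` and `floatSum` are hypothetical helpers, not code from the talk, and the timings you get will depend on compiler, optimisation flags and hardware, so they will not necessarily match the numbers above.

```cpp
#include <chrono>

// Hypothetical helper: runs `body` once and returns the elapsed wall time in seconds.
template <typename F>
double secondsTaken(F body) {
    auto start = std::chrono::steady_clock::now();
    body();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Integer variant of the benchmark loop: 64-bit multiply and divide.
long long intSum(int iterations) {
    long long sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += i * (long long)i / 3;
    return sum;
}

// Float variant of the same loop: single-precision multiply and divide.
float floatSum(int iterations) {
    float sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += i * (float)i / 3;
    return sum;
}
```

A call such as `secondsTaken([] { volatile long long r = intSum(2000000000); (void)r; })` times the integer loop; storing the result in a `volatile` keeps the compiler from discarding the work.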

Matrix multiplication optimisations
1) Don’t create a matrix whose size is a power of two. The cache uses a simple hash function to
select the cache line in which a memory address will be cached. This hash is just
some low bits (e.g. 16) of the memory address.
When the matrix size is a power of two, the start of every row has the same
lowest bits, so the rows compete for the same cache line and you keep only one row
in the cache instead of nearly the whole matrix.
2) Change the order of matrix multiplication: to multiply two matrices, n x m and m x s,
you have to perform n * m * s operations. If you want to multiply the matrices
A(n x m) * B(m x s) * C(s x k), you can do it in two ways with the same result:
(A * B) * C with n*m*s + n*s*k operations,
or
A * (B * C) with m*s*k + n*m*k operations.
In general n*m*s + n*s*k != m*s*k + n*m*k, so choose the cheaper order.
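Point 1 above can be illustrated with a small model. The cache geometry here (64-byte lines, 128 sets) is an assumption picked for illustration, not a figure from the talk, and `cacheSet` / `distinctSetsForRowStarts` are hypothetical names. The sketch counts how many different cache sets the row starts of a row-major int matrix fall into for a given row stride.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache geometry, for illustration only.
constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kSets = 128;

// Cache set that a byte offset maps to under the assumed geometry
// (the set index is taken from the low bits of the address).
constexpr std::size_t cacheSet(std::size_t byteOffset) {
    return (byteOffset / kLineBytes) % kSets;
}

// Number of distinct cache sets hit by the first element of each row
// of a row-major int32 matrix with `strideInts` elements per row.
std::size_t distinctSetsForRowStarts(std::size_t rows, std::size_t strideInts) {
    bool seen[kSets] = {false};
    std::size_t distinct = 0;
    for (std::size_t r = 0; r < rows; r++) {
        std::size_t set = cacheSet(r * strideInts * sizeof(std::int32_t));
        if (!seen[set]) {
            seen[set] = true;
            distinct++;
        }
    }
    return distinct;
}
```

With a stride of 2048 ints (8192 bytes, a power of two) all 64 row starts land in one set. Padding the stride to 2064 ints spreads them over 64 different sets.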
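The choice in point 2 above can be made mechanically from the four dimensions. A minimal sketch, using the operation counts given there; `ChainOrder`, `ChainCost` and `chooseOrder` are names introduced only for this example:

```cpp
#include <cstdint>

enum class ChainOrder { LeftFirst, RightFirst };  // (A*B)*C or A*(B*C)

struct ChainCost {
    std::uint64_t leftFirst;   // n*m*s + n*s*k
    std::uint64_t rightFirst;  // m*s*k + n*m*k
    ChainOrder best;
};

// Operation counts for A(n x m) * B(m x s) * C(s x k) in both orders,
// plus the cheaper of the two (ties go to LeftFirst).
ChainCost chooseOrder(std::uint64_t n, std::uint64_t m,
                      std::uint64_t s, std::uint64_t k) {
    ChainCost cost{};
    cost.leftFirst = n * m * s + n * s * k;
    cost.rightFirst = m * s * k + n * m * k;
    cost.best = (cost.rightFirst < cost.leftFirst) ? ChainOrder::RightFirst
                                                   : ChainOrder::LeftFirst;
    return cost;
}
```

For example, with n = 1000, m = 10, s = 1000, k = 10 the left-first order costs 20,000,000 multiplications and the right-first order 200,000, so `best` is `RightFirst`.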

Hello assembler
int *row = GT[i];
for (int j = i, pos = (int)(i * GT.columnCount()); j < GT.rowCount(); j++)
{
    int curr = 0;
    for (int k = 0; k < GT.columnCount(); k++, pos++)
        curr += row[k] * GT.val[pos];
    GTG[i][j] = GTG[j][i] = curr;
}
It looks optimised enough. Is there anything we can improve?
Well, let’s have a look at the ASM code...
0x149ac2: ldr.w lr, [r5, r9, lsl #2]
0x149ac6: add.w r9, r9, #0x1
0x149aca: cmp r9, r2
0x149acc: ldr r8, [r12], #4
0x149ad0: mla r11, lr, r8, r11
0x149ad4: blo 0x149ac2 ;at AppearanceTracker.cpp:555
No SIMD instructions there :(

Let’s add some SIMD
int *row = GT[i];
int *rowInit = row;
int *rowPos = GT.val + i * GT.columnCount();
int *rowEnd = row + processedCnt; // part of the row handled by the SIMD loop, 8 ints per pass
for (int j = i; j < GT.rowCount(); j++)
{
    row = rowInit;
    int accum[8] = {0};
    __asm__ volatile
    (
        "vld1.32 {d8-d11}, [%[accum]] \n\t"
        "L_mulStart%=: \n\t"
        "vld1.32 {d0-d3}, [%[row]]! \n\t"
        "vld1.32 {d4-d7}, [%[val]]! \n\t"
        "vmla.i32 q4, q2, q0 \n\t"
        "vmla.i32 q5, q3, q1 \n\t"
        "cmp %[row], %[rowEnd] \n\t"
        "blo L_mulStart%= \n\t"
        "vst1.32 {d8-d11}, [%[accum]] \n\t"
        : [row] "+r" (row), [val] "+r" (rowPos)
        : [rowEnd] "r" (rowEnd), [accum] "r" (accum)
        : "q0", "q1", "q2", "q3", "q4", "q5", "cc", "memory" // registers and state the asm modifies
    );
    // gather the 8 values from accum
    // process the remainder (mod 8)
}
For comparison, the original scalar loop:
int *row = GT[i];
for (int j = i, pos = (int)(i * GT.columnCount());
     j < GT.rowCount(); j++)
{
    int curr = 0;
    for (int k = 0; k < GT.columnCount();
         k++, pos++)
        curr += row[k] * GT.val[pos];
    GTG[i][j] = GTG[j][i] = curr;
}
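The two steps left as comments in the SIMD version (gathering the 8 accumulators and handling the remainder mod 8) are not shown on the slide. A portable C++ sketch of the same scheme is given below, with 8 independent accumulators standing in for the two 4-lane NEON registers q4 and q5. `dotProduct8` is a hypothetical name and this is not the author's code. Unlike the assembly loop, which always runs at least one pass, it also copes with rows shorter than 8 elements.

```cpp
#include <cstddef>

// Dot product of two int rows using 8 independent partial sums,
// processed 8 elements at a time, followed by a scalar tail.
int dotProduct8(const int *a, const int *b, std::size_t count) {
    const std::size_t processedCnt = count - count % 8;  // multiple of 8
    int accum[8] = {0};

    // Main loop: lane i accumulates elements i, i + 8, i + 16, ...
    for (std::size_t k = 0; k < processedCnt; k += 8)
        for (std::size_t lane = 0; lane < 8; lane++)
            accum[lane] += a[k + lane] * b[k + lane];

    // Gather the 8 partial sums into one value.
    int result = 0;
    for (std::size_t lane = 0; lane < 8; lane++)
        result += accum[lane];

    // Process the remaining count % 8 elements one by one.
    for (std::size_t k = processedCnt; k < count; k++)
        result += a[k] * b[k];

    return result;
}
```

Written this way, the inner lane loop has no dependency between lanes, which is the property the `vmla.i32` instructions exploit.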

Practical difference?
Let’s profile it
[Figure: profiler results, before and after]
Approx. 2-2.5 times faster

Some issue about hardware
Task: crop a square from a CMSampleBuffer (which contains a CVImageBufferRef)
and write it using AVAssetWriterInputPixelBufferAdaptor.
[Diagram: input buffer address, target image address]
BAD: create the CMSampleBuffer by just moving the base address and
setting a new height. O(1) operation.
GOOD: create the CMSampleBuffer by creating a new CVPixelBufferRef
from CVTextureCache and copying the image. O(Height*Width) operation.
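The difference between the two approaches can be sketched on plain memory. This deliberately does not use the Core Video API: `ImageView`, `cropRowsInPlace` and `cropRowsByCopy` are hypothetical stand-ins, and the caller is assumed to pass `top + newHeight <= height`. The slide only labels the O(1) variant as the bad one on real hardware; it does not say why, so the sketch shows the mechanics and not the cause.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a pixel buffer; not a Core Video type.
struct ImageView {
    const std::uint8_t *base;
    std::size_t width;
    std::size_t height;
    std::size_t bytesPerRow;
};

// O(1): reuse the source memory, only move the base address and set a new height.
ImageView cropRowsInPlace(const ImageView &src, std::size_t top, std::size_t newHeight) {
    return ImageView{src.base + top * src.bytesPerRow, src.width, newHeight, src.bytesPerRow};
}

// O(height * width): copy the selected rows into newly allocated storage.
std::vector<std::uint8_t> cropRowsByCopy(const ImageView &src, std::size_t top, std::size_t newHeight) {
    std::vector<std::uint8_t> out(newHeight * src.bytesPerRow);
    std::memcpy(out.data(), src.base + top * src.bytesPerRow, out.size());
    return out;
}
```

The in-place view still aliases the original buffer, while the copy owns its pixels independently of the source.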

iOS 8 strikes back
iPhone 5S iOS 7.1 - 30 FPS
iPhone 5S iOS 8.0 - 15 FPS O_o
Possible reasons:
1) Memory corruption in the C++ core code
2) iOS 8 QoS:
Wrong queue priority: QOS_CLASS_BACKGROUND instead of QOS_CLASS_USER_INITIATED
3) Blinking of this guy

CONTACT INFORMATION
FEDOR POLYAKOV
Mobile: +38 097 59 0000 9
E-Mail: fedor@looksery.com
YURII MONASTYRSHYN
Mobile: +38 067 482 60 97
E-Mail: yurii@looksery.com
VICTOR SHABUROV, FOUNDER
Mobile: +1 650 575 9359
Fax: +1 866 626 9582
E-Mail: victor@looksery.com
WEB
looksery.com
facebook.com/looksery
twitter.com/looksery
