# Fedor Polyakov (Looksery) “Real-Time Face Tracking on Mobile Devices.”

1,001 views

Published on November 1, 2014

Published in: Mobile


1. REAL-TIME FACE TRACKING, NOV 2014
2. LOOKSERY: VIDEO SELFIES + FACE FILTERS + INTEGRATED CHAT
3. REAL-TIME FACE TRACKING DEMO
4. TRACKING ALGORITHM
   - The algorithm is based on the Active Appearance Model.
   - Its complexity is independent of image size.
   - The balance between tracking quality and tracking speed is controlled by just two constants.
   - The algorithm is iterative: it solves a least-squares problem at each iteration.
   - It averages 5 iterations per frame (maximum 10, minimum 1).
   - To run at 30 fps, you have to perform about 150 iterations per second.
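
The per-frame loop described on this slide can be sketched roughly as follows. This is a minimal sketch, not Looksery's tracker: `trackFrame`, `step`, `tolerance` and `iterationsPerSecond` are hypothetical names, and the `step` callable merely stands in for one least-squares update. Only the iteration bounds (1 and 10) and the 30 fps / 5 iterations figures come from the slide.

```cpp
#include <cassert>
#include <functional>

// Iteration bounds quoted on the slide.
const int kMinIterations = 1;
const int kMaxIterations = 10;

// Runs the iterative fit for one frame. `step` performs a single update
// (in the talk, one least-squares solve) and returns the magnitude of the
// parameter change. The loop stops once the change drops below `tolerance`
// (a hypothetical stopping rule) or the iteration cap is reached.
// Returns the number of iterations actually performed.
int trackFrame(const std::function<double()>& step, double tolerance) {
    int iterations = 0;
    while (iterations < kMaxIterations) {
        const double change = step();
        ++iterations;
        if (iterations >= kMinIterations && change < tolerance) break;
    }
    return iterations;
}

// Iterations per second needed to sustain a given frame rate.
int iterationsPerSecond(int fps, int avgIterationsPerFrame) {
    return fps * avgIterationsPerFrame;
}
```

With the slide's numbers, `iterationsPerSecond(30, 5)` gives 150, which leaves roughly 6.7 ms for each iteration.
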
5. Optimisation flow
   - (no FPS figure): Algorithm asymptotic optimisation
   - 3 FPS: First implementation
   - 8 FPS: Memory preallocation
   - 10 FPS: Algorithm parameter optimisation
   - 13 FPS: Matrix storage optimisation and removal of OOP code
   - 18 FPS: Rewriting bottleneck code in assembler
   - 24 FPS: Asymptotic optimisation of matrix multiplication
   - 27 FPS: Replacing float operations with int operations
   - 30 FPS: Multithreading
6. From float to int

   We had to build the so-called pseudo-inverse of the matrix

       G[i][j] = (X[i][j] - Y[i][j]) / d[j];

   (the pseudo-inverse formula shown on the slide is not preserved in this transcript), so we have to perform many multiplication operations. Multiplying two ints is much faster than multiplying two floats.

   Let's create an int matrix V:

       V[i][j] = X[i][j] - Y[i][j];

   and a float matrix D:

       D[i][j] = (i == j ? 1.0f / d[i] : 0); // diagonal matrix

   Then G = V * D. From linear algebra: (formula not preserved in this transcript).
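
The two formulas on this slide did not survive in the transcript. One derivation consistent with the slide's definitions is given below. It assumes that the pseudo-inverse meant here is the usual left pseudo-inverse $G^{+} = (G^{T}G)^{-1}G^{T}$, that $D$ is diagonal and invertible, and that $V^{T}V$ is invertible. It is a reconstruction, not necessarily what the slide showed.

$$
G = VD, \qquad D^{T} = D
$$

$$
G^{T}G = (VD)^{T}(VD) = D\,V^{T}V\,D
$$

$$
G^{+} = (G^{T}G)^{-1}G^{T} = D^{-1}(V^{T}V)^{-1}D^{-1}\,D\,V^{T} = D^{-1}(V^{T}V)^{-1}V^{T}
$$

Under this reading, the large product $V^{T}V$ involves only integers, and floating-point work is limited to inverting the small matrix $V^{T}V$ and scaling by the diagonal $D^{-1}$.
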
7. Demo benchmarks

   Code:

       const int ITERATIONS = 2000000000;
       long long sum = 0;
       for (int i = 0; i < ITERATIONS; i++)
           sum += i * (long long)i;
       cout << sum << endl;

   Time: 0.00 sec

   Code:

       const int ITERATIONS = 2000000000;
       long long sum = 0;
       for (int i = 0; i < ITERATIONS; i++)
           sum += i * (long long)i / 3;
       cout << sum << endl;

   Time: 2.10 sec

   Code:

       const int ITERATIONS = 2000000000;
       float sum = 0;
       for (int i = 0; i < ITERATIONS; i++)
           sum += i * (float)i / 3;
       cout << sum << endl;

   Time: 4.29 sec
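
The slide does not show how the timings were taken. A minimal sketch of one way to measure such loops with `std::chrono` follows; `timeIt`, `sumIntSquaresOverThree` and `sumFloatSquaresOverThree` are hypothetical names, not code from the talk. A reading of 0.00 sec for two billion iterations most likely means the optimiser replaced the first loop with a closed-form expression, so figures of this kind are best treated as rough indications.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <utility>

// Runs `f` once and returns {result of f, elapsed wall-clock seconds}.
template <typename F>
auto timeIt(F&& f) -> std::pair<decltype(f()), double> {
    const auto start = std::chrono::steady_clock::now();
    auto result = f();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return std::make_pair(result, elapsed.count());
}

// Integer variant of the slide's second loop: sum of i*i/3 (integer division).
std::int64_t sumIntSquaresOverThree(int iterations) {
    std::int64_t sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += i * static_cast<std::int64_t>(i) / 3;
    return sum;
}

// Float variant of the slide's third loop: sum of i*i/3 in single precision.
float sumFloatSquaresOverThree(int iterations) {
    float sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += i * static_cast<float>(i) / 3;
    return sum;
}
```

A call such as `timeIt([] { return sumIntSquaresOverThree(1000000); })` returns the sum together with the elapsed time; printing the sum afterwards keeps the compiler from discarding the loop entirely.
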
8. Matrix multiplication optimisations

   1) Don't create a matrix whose size is a power of two. The cache uses a simple hash function to select the cache line in which a piece of memory will be cached. This hash is just some low bits (e.g. 16) of the memory address. When the matrix size is a power of two, every row has the same low bits, so the cache holds only one row instead of nearly the whole matrix.

   2) Choose the order of matrix multiplication. Multiplying an n x m matrix by an m x s matrix takes n * m * s operations. To compute A(n x m) * B(m x s) * C(s x k), there are two ways that give the same result:
      - (A * B) * C, with n*m*s + n*s*k operations, or
      - A * (B * C), with m*s*k + n*m*k operations.

      In general n*m*s + n*s*k != m*s*k + n*m*k, so choose the smaller one.
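
The second point can be checked with a few lines of arithmetic. The sketch below only evaluates the two operation counts given on the slide and picks the cheaper order; `ChainCost`, `chainCost` and `multiplyLeftFirst` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Scalar multiplication counts for A(n x m) * B(m x s) * C(s x k).
struct ChainCost {
    std::int64_t leftFirst;   // (A * B) * C: n*m*s + n*s*k
    std::int64_t rightFirst;  // A * (B * C): m*s*k + n*m*k
};

inline ChainCost chainCost(std::int64_t n, std::int64_t m,
                           std::int64_t s, std::int64_t k) {
    return ChainCost{n * m * s + n * s * k, m * s * k + n * m * k};
}

// True when (A * B) * C needs no more multiplications than A * (B * C).
inline bool multiplyLeftFirst(std::int64_t n, std::int64_t m,
                              std::int64_t s, std::int64_t k) {
    const ChainCost c = chainCost(n, m, s, k);
    return c.leftFirst <= c.rightFirst;
}
```

For example, with n = 1000, m = 10, s = 1000, k = 10 the left-first order costs 20,000,000 multiplications and the right-first order 200,000, a hundredfold difference for the same result.
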
9. Hello assembler

       int *row = GT[i];
       for (int j = i, pos = (int)(i * GT.columnCount()); j < GT.rowCount(); j++) {
           int curr = 0;
           for (int k = 0; k < GT.columnCount(); k++, pos++)
               curr += row[k] * GT.val[pos];
           GTG[i][j] = GTG[j][i] = curr;
       }

   It looks optimised enough. Is there anything we can improve? Well, let's have a look at the ASM code:

       0x149ac2: ldr.w lr, [r5, r9, lsl #2]
       0x149ac6: add.w r9, r9, #0x1
       0x149aca: cmp   r9, r2
       0x149acc: ldr   r8, [r12], #4
       0x149ad0: mla   r11, lr, r8, r11
       0x149ad4: blo   0x149ac2 ; at AppearanceTracker.cpp:555

   No SIMD instructions there :(
10. Let's add some SIMD

        int *row = GT[i];
        int *rowInit = row;
        int *rowPos = GT.val + i * GT.columnCount();
        // processedCnt is defined elsewhere; presumably the column count
        // rounded down to a multiple of 8.
        int *rowEnd = row + processedCnt;
        for (int j = i; j < GT.rowCount(); j++) {
            row = rowInit;
            int accum[8] = {0};
            __asm__ volatile (
                "vld1.32 {d8-d11}, [%[accum]]  \n\t"
                "L_mulStart%=:                 \n\t"
                "vld1.32 {d0-d3}, [%[row]]!    \n\t"
                "vld1.32 {d4-d7}, [%[val]]!    \n\t"
                "vmla.i32 q4, q2, q0           \n\t"
                "vmla.i32 q5, q3, q1           \n\t"
                "cmp %[row], %[rowEnd]         \n\t"
                "blo L_mulStart%=              \n\t"
                "vst1.32 {d8-d11}, [%[accum]]  \n\t"
                : [row] "+r" (row), [val] "+r" (rowPos)
                : [rowEnd] "r" (rowEnd), [accum] "r" (accum)
                // registers, flags and memory modified by the asm
                : "q0", "q1", "q2", "q3", "q4", "q5", "cc", "memory"
            );
            // sum the 8 partial values from accum
            // process the remainder (column count mod 8)
        }

    The slide shows the scalar loop from slide 9 next to this version for comparison.
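
The structure of the NEON loop (eight independent partial sums, a final gather, then a scalar tail) can be illustrated in portable C++. This is a sketch of the idea only, under the assumption that the two elided steps in the slide's comments work as described; `dot8` is a hypothetical name, and the code uses no SIMD intrinsics, so any speed-up depends on the compiler vectorising it.

```cpp
#include <cassert>
#include <cstddef>

// Dot product of two int arrays using 8 independent partial sums,
// mirroring the two 4-lane accumulators (q4 and q5) in the NEON loop.
int dot8(const int* a, const int* b, std::size_t n) {
    int accum[8] = {0};
    const std::size_t processed = n - n % 8;  // largest multiple of 8 <= n

    // Main loop: 8 multiply-accumulates per step, one per lane.
    for (std::size_t k = 0; k < processed; k += 8)
        for (std::size_t lane = 0; lane < 8; ++lane)
            accum[lane] += a[k + lane] * b[k + lane];

    // Gather the 8 partial sums.
    int sum = 0;
    for (std::size_t lane = 0; lane < 8; ++lane)
        sum += accum[lane];

    // Scalar tail for the remaining n mod 8 elements.
    for (std::size_t k = processed; k < n; ++k)
        sum += a[k] * b[k];
    return sum;
}
```

Keeping the partial sums independent is what lets the hardware process several lanes per instruction; the scalar tail handles column counts that are not a multiple of 8.
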
11. Practical difference?

    Let's profile it. (The "Before" and "After" profiler screenshots are not preserved in this transcript.)

    Approx. 2-2.5 times faster.
12. A hardware-related issue

    Task: crop a square from a CMSampleBuffer (which contains a CVImageBufferRef) and write it using AVAssetWriterInputPixelBufferAdaptor.

    (Diagram labels: input buffer address, target image address.)

    - BAD: Create the CMSampleBuffer by just moving the base address and setting a new height. An O(1) operation.
    - GOOD: Create the CMSampleBuffer by creating a new CVPixelBufferRef from CVTextureCache and copying the image. An O(Height*Width) operation.
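
The slide does not say why the O(1) approach failed, only that copying worked. To make the difference between the two approaches concrete, here is a minimal sketch of the copying variant on plain memory buffers. It does not use the Core Video API; `cropSquare` and its parameters are hypothetical, and the image is assumed to be stored row by row with `srcBytesPerRow` bytes per row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies a side x side square whose top-left pixel is (x0, y0) out of a
// row-major source image into a new, tightly packed buffer.
// Cost is proportional to the number of bytes copied.
std::vector<std::uint8_t> cropSquare(const std::uint8_t* src,
                                     std::size_t srcBytesPerRow,
                                     std::size_t bytesPerPixel,
                                     std::size_t x0, std::size_t y0,
                                     std::size_t side) {
    const std::size_t dstBytesPerRow = side * bytesPerPixel;
    std::vector<std::uint8_t> dst(dstBytesPerRow * side);
    for (std::size_t row = 0; row < side; ++row) {
        // Start of the cropped row inside the source image.
        const std::uint8_t* from =
            src + (y0 + row) * srcBytesPerRow + x0 * bytesPerPixel;
        std::memcpy(dst.data() + row * dstBytesPerRow, from, dstBytesPerRow);
    }
    return dst;
}
```

The O(1) variant would instead reuse the source memory, offsetting the base pointer to `src + y0 * srcBytesPerRow + x0 * bytesPerPixel` and keeping the original row stride, so the result is a view into the input buffer rather than an independent image.
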
13. iOS 8 strikes back

    - iPhone 5S, iOS 7.1: 30 FPS
    - iPhone 5S, iOS 8.0: 15 FPS O_o

    Possible reasons:
    1) Memory corruption in the C++ core code
    2) iOS 8 QoS: wrong queue priority, QOS_CLASS_BACKGROUND instead of QOS_CLASS_USER_INITIATED
    3) Blinking of this guy
14. CONTACT INFORMATION
    - FEDOR POLYAKOV. Mobile: +38 097 59 0000 9. E-Mail: fedor@looksery.com
    - YURII MONASTYRSHYN. Mobile: +38 067 482 60 97. E-Mail: yurii@looksery.com
    - VICTOR SHABUROV, FOUNDER. Mobile: +1 650 575 9359. Fax: +1 866 626 9582. E-Mail: victor@looksery.com
    - WEB: looksery.com, facebook.com/looksery, twitter.com/looksery