Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

507 views

Published on

By Adrián Pérez

The MIPS processor cores are widely used in embedded platforms, including TVs and set-top-boxes. In most of those platforms dedicated graphics hardware exists but it may be specialized for its use in audio and video signal processing: rendering of web content has to be done in software. We implemented optimizations for the software-based QPainter renderer to improve the performance of Qt —including QtWebKit— in MIPS processors. The target platform was the modern 74kf cores, which include new SIMD instructions suitable for graphics operations (alpha blending, color space conversion and JPEG image decoding), and also for non-graphics operations: string functions were also improved. Our figures estimate that web pages are rendered up to 30% faster using hand-coded assembler fast-paths for those operations.

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
507
On SlideShare
0
From Embeds
0
Number of Embeds
11
Actions
Shares
0
Downloads
6
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

  1. 1. IMPROVING $PORT PERFORMANCE ON $ARCH PLATFORM-BASED PERFORMANCE TUNING OF WEBKIT (PORT=QT ARCH=MIPS74KF) Embedded Linux Conference April 29 — May 1, 2014 Adrián Pérez de Castro
  2. 2. WHOAMI aperez@igalia.com +AdrianPerezDeCastro @aperezdc
  3. 3. THE CHALLENGE MAKE A QTWEBKIT-BASED BROWSER USEABLE ON LIMITED HARDWARE MIPS 74Kf @500 MHz RAM: 256 MB No GPU
  4. 4. MIPS74KF “Classic” MIPS32 + FPU + MMU + DSP
  5. 5. DSP? No. Not really a DSP. SIMD instructions suitable for signal processing.
  6. 6. CAN WE USE THIS TO IMPROVE PERFORMANCE? CHALLENGE ACCEPTED
  7. 7. THE PLAN PROFILE →OPTIMIZE →VALIDATE
  8. 8. WHAT TO OPTIMIZE Video/audio decoding. Image operations.
  9. 9. WHERE TO OPTIMIZE? Can we improve the platform overall, not just WebKit? Yes! QtWebKit uses the Qt drawing functions. A/V decoding uses GStreamer, which uses Orc. Good candidates for SIMD code.
  10. 10. LIMITATIONS No Valgrind. No GDB. No perf. No performance counters. ↓ qemu + gdbserver. gperftools. CLOCK_PROCESS_CPUTIME_ID
  11. 11. ROLL YOUR OWN TOOLS (WITH HELP FROM EXISTING ONES)
  12. 12. GNU HAMMER^WTIME! #Usefullpathtoavoidusingtheshell'stimebuiltin #Onelineperrunwithuser/systemtimeandpagefaults /usr/bin/time-a-otimings.txt -f'%U%S%F%x%C'$COMMAND #Forexample,measuringtheqtdemuxGStreamercomponent /usr/bin/time-a-otimings.txt -f'%U%S%F%x%C'gst-launch-q filesrc=file.mp4!qtdemux!video/x-h264!fakesink
  13. 13. TIMING Beware of CLOCK_PROCESS_CPUTIME_ID's resolution! #defineCLOCK_MAX_RESOLUTION_DELTA (10000.0*1e-9) boolusePosixClock(){ staticboolchecked=false; staticbooluseposix; if(!checked){ if(posixClockAvailable()){ doubleres_theorical=posixClockTheoricalResolution(); doubleres_empirical=posixClockEmpiricalResolution(); useposix=fabs(res_theorical-res_empirical) <=CLOCK_MAX_RESOLUTION_DELTA; }else{ useposix=false; } checked=true; } returnuseposix; } clock.cc
  14. 14. WEBSNAP %g++-DMAIN-oclockclock.cc %./clock CLOCK_PROCESS_CPUTIME_IDissupported Resolution(advertised/empirical):0.0000000010/0.0000002460s Sampledresolution:0.0000005470s Printingthelinesabovetook0.0000483550s %LD_PRELOAD=/usr/lib/libprofiler.so ./websnaphttp://igalia.com1000pprof Loading100%Layoutcompleted Loadsuccessful libprofile.sodetected(0x7f77468e8f90,0x7f77468e8fd0),output'pprof' Profilingstarted,code:0x1,timeout:0 PROFILE:interrupts/evictions/bytes=634/537/22168 http://igalia.com10006.2709987870s %mkdirout&&./runtests1000<urls.txt github.com/aperezdc/websnap
  15. 15. ...AND BEYOND Ad-hoc Python/Bash scripts: Fixlibrarypathsinprofileroutput. Datamunging. Measurementscomparison. GenerateCSVfiles. Reportgeneration. …
  16. 16. SOME RESULTS (DETAILED)
  17. 17. LATIN-1→UTF16: V0 //"dst"array(uint16_t*) //"src"array(uin8_t*) while(len--) *dst++=(uchar)*src++;
  18. 18. LATIN-1→UTF16: V1 ;a0:"dst"array(uint16_t*) ;a1:"src"array(uint8_t*) ;a2:"len" 1: lbu t1,0(a1) addiu a2,a2,-1 ;len-- sh t1,0(a0) addiu a0,a0,2 ;dst++ bnez a2,1b addiua1,a1,1 ;src++
  19. 19. LATIN-1→UTF16: V2 1: lw t1,(a1) ;t1=ABCD ; ;TODO:extractbytesfromt1tot2/t3,padding ; themwithzeroes:t2=0A0B,t3=0C0D ; addiu a1,a1, 4 ;src++ addiu a2,a2,-4 ;len-- sw t2,0(a0) sw t3,4(a0) bnez a2,1b addiua0,a0,8 ;dst+=2
  20. 20. LATIN-1→UTF16: V3 1: lw t1,(a1) ;t1=ABCD srl t2,t1,24 ;t2=000A sll t2,t2,16 ;t2=0A00 sll t1,t1, 8 ;t1=BCD0 srl t4,t1,24 ;t4=000B or t2,t2,t4 ;t2=0A0B sll t1,t1, 8 ;t1=CD00 srl t1,t1,16 ;t1=00CD andi t4,t1,0xFF00;t4=00C0 sll t4,t4, 8 ;t4=0C00 or t3,t1,t4 ;t3=0C0D addiu a1,a1, 4 ;src++ addiu a2,a2,-4 ;len-- sw t2,0(a0) sw t3,4(a0) bnez a2,1b addiua0,a0,8 ;dst+=2
  21. 21. LATIN-1→UTF16: V4 ;DSPinstructionscanunpackbytesdirectly :-) 1: lw t1,(a1) ;t1=ABCD preceu.ph.qblt2,t1 ;t2=0A0B preceu.ph.qbrt3,t1 ;t3=0C0D addiu a1,a1, 4 ;src++ addiu a2,a2,-4 ;len-- sw t2,0(a0) sw t3,4(a0) bnez a2,1b addiua0,a0,8 ;dst+=2
  22. 22. LATIN-1 →UTF-16
  23. 23. ALPHA BLENDING
  24. 24. UTF-16 STRICMP()
  25. 25. RESULTS Speedup histogram
  26. 26. UP TO 30% FASTER RENDERING Thanksto: Orc backend using MIPS DSP instructions QImagecomposition operations Color conversion (RGB16/888→ARGB32) Alpha premultiplication and blending String conversions and comparisons
  27. 27. UPSTREAM STATUS Orc backend complete upstream Initial work based on Qt 4.8 Most of the code is already in Qt 5.2 Rest in the next release No backport to Qt 4.8
  28. 28. THANK YOU FOR YOUR ATTENTION perezdecastro.org +AdrianPerezDeCastro @aperezdc

×