• Like
LCE12: big.LITTLE TC2 update
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

LCE12: big.LITTLE TC2 update

  • 1,012 views
Published

Resource: LCE12 …

Resource: LCE12
Name: big.LITTLE TC2 update
Date: 29-10-2012
Speaker: Morten Rasmussen

Published in Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
1,012
On SlideShare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
23
Comments
0
Likes
1

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. 1 Update on big.LITTLE on TC2 Morten Rasmussen Technology Researcher
  • 2. 2 Agenda  big.LITTLE Software solutions overview  ARM's Test Chip 2 overview  Benchmarking Methodology and Use Cases  IKS status update  big.LITTLE MP status update
  • 3. 3 big.LITTLE overview  Performance and power efficiency in one system: Cortex-A15 vs Cortex-A7 Performance Cortex-A7 vs Cortex-A15 Energy Efficiency Dhrystone 1.9x 3.5x FDCT 2.3x 3.8x IMDCT 3.0x 3.0x MemCopy L1 1.9x 2.3x MemCopy L2 1.9x 3.4x
  • 4. 4 IKS solution – Basics  In-Kernel Switcher (IKS):  Targeted first generation big.LITTLE products. Cortex-A7 Cortex-A15 Kernel scheduler IKS Task 1 Task 2 Logical CPU ?
  • 5. 5 MP solution Cortex-A7 Cortex-A15 Kernel scheduler Task 1 Task 2 ?
  • 6. 6 ARM’s Test Chip 2 (TC#2): An Overview  A Versatile Express core tile publically available:  Capabilities  2 x A15 (r2p1) @ up to 1.2 Ghz  3 x A7 (r0p1) @ up to 1Ghz  CCI/DMC/GIC/ADB (r0p0)  DMA (PL330)  2GB external DDR2 memory @ 400Mhz  64k internal SRAM  Coresight debug (including JTAG and ITM trace but no STM)  No GPU  cpufreq support: Independent for each cluster with limited voltage scaling  cpuidle support: Cluster power gating TC2
  • 7. 7 Benchmarking Methodology Results Performance Power Configurable: - CCI - ftrace - streamline CSV config: - Use case - Scheduling model - Numbers of cores to use - Scaling governors  Automated system for running user workloads on target device Choose workload Choose CPU mode: Cortex-A7, Cortex-A15, Migration (cluster or CPU), or MP Choose active cores in each cluster TC2: 1-2 big, 1-3 LITTLE Choose DVFS governor: Interactive, performance, powersave, ondemand Extensible – parameterisation
  • 8. 8 IKS solution  Targeted first generation big.LITTLE products. Cortex-A7 Cortex-A15 Kernel scheduler IKS Task 1 Task 2 Logical CPU ?
  • 9. CONFIDENTIAL9 IKS: CPU Migration  big.LITTLE extends DVFS  DVFS algorithm monitors load on each CPU  When load is low it can be handled on a LITTLE processor  When load is high the context is transferred to a big processor  The unused processor can be powered down  When all processors in a cluster are inactive the cluster and its L2 cache can be powered down
  • 10. CONFIDENTIAL10 IKS: CPU Migration  big.LITTLE extends DVFS  DVFS algorithm monitors load on each CPU  When load is low it can be handled on a LITTLE processor  When load is high the context is transferred to a big processor  The unused processor can be powered down  When all processors in a cluster are inactive the cluster and its L2 cache can be powered down
  • 11. 11 IKS: OPP mapping to A7 / A15 on TC2  Virtual Frequency maps OPPs to big or LITTLE cores Virtual OPP Physical OPP A7 Physical OPP A15 Voltage A7 350000 350000 V1 400000 400000 V1 ... X X V1 800000 800000 V1 900000 900000 V2 1000000 1000000 V3 A15 1200000 600000 V1 1400000 700000 V1 ... X 2X V1 2000000 1000000 V1 2200000 1100000 V2 2400000 1200000 V3
  • 12. 12 IKS: Results for Audio on TC2  Power compared to executing the use case on A15  IKS does not use A15s during Audio run 70% saving TC2: A15 up to 1.2 GHz A7 up to 1 GHz Better results expected on representative silicon.
  • 13. 13 IKS: Results for BBench + Audio on TC2  Performance is measured as from page loading times of BBench  Results normalised to power and performance consumed on same use case run on A15 only BBench page + Audio TC2: A15 up to 1.2 GHz A7 up to 1 GHz Better results expected on representative silicon.
  • 14. 14 IKS: OPPs on TC2
  • 15. 15 IKS: Interactive governor on TC2 if (cpu_load >= go_hispeed_load){ ... new_freq = max_freq * cpu_load / 100; ... } else { ... new_freq = hispeed_freq*cpu_load/100; ... }  For A15 on TC2 with a go_highspeed at 85% (default) this algorithm only uses overdrive section of A15  Approach is to introduce a second point of inflection:highspeed2
  • 16. 16 IKS: Hispeed2
  • 17. 17 IKS: Results: Bbench + Audio  Power improves with no performance cost BBench page + Audio TC2: A15 up to 1.2 GHz A7 up to 1 GHz Better results expected on representative silicon.
  • 18. 18 MP solution Cortex-A7 Cortex-A15 Kernel scheduler Task 1 Task 2 ?
  • 19. 19 MP solution – more details  Scheduler modifications:  Treat big and LITTLE cpus as separate scheduling domains.  Use PJT's load-tracking patches to track individual task load.  Migrate tasks between the big and the LITTLE domains based on task load.  Patch set available through Linaro. L BB L Load balance Load balance Load-based task migration Task load Task state Executing Sleep Load decay
  • 20. 20 MP: Experimental Implementation  Scheduler modifications:  Apply PJTs’ load-tracking patch set.  Set up big and little sched_domains with no load-balancing between them.  select_task_rq_fair() checks task load history to select appropriate target CPU for tasks waking up.  Add forced migration mechanism to push of the currently running task to big core similar to the existing active load balancing mechanism.  Periodically check (run_rebalance_domains()) current task on little runqueues for tasks that need to be forced to migrate to a big core. L BB L load_balance load_balance select_task_rq_fair()/ Forced migration
  • 21. 21 MP: ARM TC2: Audio  Workload: Audio (mp3 playback)  Performance/Energy target:  A7 energy  Status:  Audio related task do not use A15s, but the power consumption is still significantly more than A7 alone.  MP not as power efficient as IKS yet  Todo:  Target spurious wake-ups on A15. All the extra power comes from the A15's which shouldn't be used at all. Energy A7 30.79% MP 39.86% 0 10 20 30 40 50 60 70 80 90 100 Audio A15 A7 2CPU IKS MP Energy TC2: A15 up to 1.2 GHz A7 up to 1 GHz Better results expected on representative silicon.
  • 22. 22 MP: Audio workload analysis  Where is the extra energy spent with MP?  Need a look at why A15's consume power when they are not necessary. A7 MP 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 Audio energy breakdown A15 cluster A7 cluster Energy hrtimer functions cpu0 cpu1 cpu2 cpu3 cpu4 hrtimer_wakeup 2 2 1212 417 190 tick_sched_timer 404 58 483 507 779 WQ functions cpu0 cpu1 cpu2 cpu3 cpu4 vmstat_update 30 2 27 25 28 cache_reap 15 2 14 13 14 phy_state_machine 31 0 0 0 0 Enter idle cpu0 cpu1 cpu2 cpu3 cpu4 0 6 2 2379 260 423 1 801 807 8316 9373 9652 TC2: A15 up to 1.2 GHz A7 up to 1 GHz Better results expected on representative silicon.
  • 23. 23 Scale invariant load  Load accumulation rate does not scale with available compute capacity (frequency, big/LITTLE cpu)  Currently, there is no link between cpufreq and the scheduler  Tasks may be migrated away from a cpu at low frequency by the scheduler before cpufreq has increased the frequency to match the cpu load.  Scaling the tracked load accumulation to match the current frequency mitigates this issue.  Tasks cannot accumulate enough load at low frequency to trigger migration and must wait for cpufreq to react first. Freq = x Freq = 2x
  • 24. 24 Scale invariant load 76782.1 76782.2 76782.3 76782.4 76782.5 76782.6 0 200 400 600 800 1000 76332.95 76333.05 76333.15 76333.25 76333.35 76333.45 0 200 400 600 800 1000 Original Frequency invariant
  • 25. 25 Load accumulation rate  For some workloads tracked load saturates too fast and leads to unnecessary task migrations.  Extending the tracked load history reduces tracked load variations due to sudden changes in the load characteristics.  Increasing the y factor in the load expression decreases the load accumulation and decay rates. load= u0+u1⋅y+u2⋅y 2 +…+un⋅y n 1024+y+ y 2 +…+ y n +1 1 21 41 61 81 101 6 11 16 26 31 36 46 51 56 66 71 76 86 91 96 106 111 116 121 126 131 136 141 146 151 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 y=0.9785 Time [ms] y<1,0⩽u<1024
  • 26. 26 Load accumulation rate  Increasing y leads to a more conservative tracked load  Should lead to less up/down migrations  Increases up/down migrations delay for tasks that needs to be migrated. 1 7 13 19 25 31 37 43 49 55 61 67 73 79 85 91 97 4 10 16 22 28 34 40 46 52 58 64 70 76 82 88 94 100 103 106 109 112 115 118 121 124 127 130 133 136 139 142 145 148 151 154 157 160 163 166 169 172 175 178 181 184 187 190 193 196 199 Load accumulation rate Task y=0.9785 y=0.9844 y=0.9922 Time [ms] Trackedload
  • 27. 27 MP – Top Issues  Spurious wakeups  A15s are woken up by scheduler ticks (mainly)  Workqueues  Timers  RCU  cpu wakeup prioritisation  Pick the cheapest target cpu  Global balancing  Spread load to A7s when A15s are overloaded  Pack vs. spread  Cluster aware cpufreq governors
  • 28. 28 Questions?