This document summarizes a presentation about dark silicon in mobile devices and possible open source solutions. It discusses how power and thermal constraints are more severe for mobile devices due to limited battery progress and no fans. It also covers big.LITTLE scheduling, thread-level parallelism challenges, and user-level threading libraries like AsyncTask. Finally, it notes that while some open source parallel programming frameworks exist, fully utilizing parallelism on mobile and addressing dark silicon remain challenges with no widely adopted solutions.
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
1. Dark Silicon, Mobile Devices, and Possible Open Source Solutions
Koan-Sin Tan
freedom@computer.org
COSCUP 2013, Aug. 3rd, TICC, Taipei
Friday, August 23, 13
2. • Software engineer, veteran open-source user
• Learned something about light-weight processes (LWP) on SunOS 4.x in the early 1990s
• Wrote a user-level thread library on 386BSD with a classmate in 1992
• Was involved in big.LITTLE scheduling work recently
3. Samsung “optimization” for benchmarks
http://www.anandtech.com/show/7187/looking-at-cpugpu-benchmark-optimizations-galaxy-s-4
6. • “Dark Silicon refers to the exponentially increasing number of a chip's transistors that must remain passive, or "dark", in order to stay within a chip's power budget”
7. Figure from the textbook. We know we are in the CMP (chip multiprocessor) era.
“Since 2003, the limits of power and available instruction-level parallelism have slowed uniprocessor performance.”
8. Dennard scaling hits the wall
• Dennard Scaling
• When voltages are scaled along with all dimensions, a device’s electric fields remain constant, and most device characteristics are preserved
• scaling maintains constant power density
• logic area and power are scaled down by alpha^2
• energy per transition is scaled down by alpha^3, but frequency is scaled up by 1/alpha, resulting in an alpha^2 decrease in power per gate
• ........
• Search for “Dennard scaling” to find more information, e.g., http://www1.cs.columbia.edu/~cs4824/lectures/csee4824_f12_lec22.pdf
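The scaling factors on this slide can be checked with a few lines of arithmetic. The sketch below assumes alpha is the linear scaling factor with alpha < 1 (so "scaled down by alpha^2" means "multiplied by alpha^2"); the struct and function names are made up for illustration.

```cpp
#include <cassert>

// Relative per-gate quantities after one ideal Dennard scaling step.
// Assumption: alpha is the linear scaling factor, alpha < 1 shrinks the device.
struct DennardScaling {
    double area;           // gate area: alpha^2
    double energy;         // energy per transition (C * V^2): alpha^3
    double frequency;      // clock frequency: 1 / alpha
    double power;          // power per gate = energy * frequency: alpha^2
    double power_density;  // power / area: stays at 1
};

DennardScaling dennard_scale(double alpha) {
    assert(alpha > 0.0);
    DennardScaling s;
    s.area = alpha * alpha;
    s.energy = alpha * alpha * alpha;  // capacitance scales by alpha, voltage^2 by alpha^2
    s.frequency = 1.0 / alpha;
    s.power = s.energy * s.frequency;
    s.power_density = s.power / s.area;
    return s;
}
```

With alpha = 0.7, power per gate comes out at about 0.49 of the original while power density stays at 1. Once voltage can no longer be scaled with the dimensions, the energy term shrinks more slowly and power density rises, which is where dark silicon comes from.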
9. Mobile Devices
• Both power and thermal constraints are more severe than on desktop devices
• Battery technology progresses relatively slowly
• You don’t want to put a fan into your smartphone
• Heat dissipation: conduction, convection, radiation
10. Yes, modern high-end mobile processors have serious thermal problems. Tegra 4 game console figure from iFixit
11. Nexus 10 Thermal Throttling
• Antutu 3.0.2
• Unit for X axis is 200 ms
• It reaches 80 ˚C in 20 seconds
• Throttling starts at 80 ˚C; stops at 78 ˚C
• Throttling works by decrementing the maximum frequency value of cpufreq
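The start/stop thresholds above form a hysteresis band. A minimal sketch of such a controller follows; only the 80 ˚C and 78 ˚C thresholds come from the slide. The class name, the frequency table, the one-step-per-sample policy and the full restore at 78 ˚C are assumptions for illustration, not the actual Nexus 10 implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr double kStartC = 80.0;  // throttling starts (from the slide)
constexpr double kStopC = 78.0;   // throttling stops (from the slide)

// Hypothetical throttle: caps the maximum frequency, like lowering cpufreq's max.
class ThermalThrottle {
public:
    // freqs_khz: available frequencies in ascending order (hypothetical table).
    explicit ThermalThrottle(std::vector<int> freqs_khz)
        : freqs_(std::move(freqs_khz)), cap_(0) {
        assert(!freqs_.empty());
        cap_ = freqs_.size() - 1;  // start uncapped
    }

    // Feed one temperature sample; returns the current maximum frequency in kHz.
    int update(double temp_c) {
        if (temp_c >= kStartC) {
            if (cap_ > 0) --cap_;      // too hot: lower the cap by one step
        } else if (temp_c <= kStopC) {
            cap_ = freqs_.size() - 1;  // cool enough: lift the cap (assumed policy)
        }                              // between 78 and 80: hold the current cap
        return freqs_[cap_];
    }

private:
    std::vector<int> freqs_;
    std::size_t cap_;
};
```

The gap between the two thresholds keeps the cap from flapping when the temperature hovers around a single trip point.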
14. Introducing big.LITTLE
Figure 28-3 Processor DVFS curves
In a big.LITTLE system these operating points are applied both to the Cortex-A15 and Cortex-A7 processors. When the Cortex-A7 processor is executing the OS can tune the operating points as it would for an existing platform with a single applications processor. When the Cortex-A7 processor is at its highest operating point (Figure 28-3), if more performance is required a switch is invoked that transfers the OS and applications to the Cortex-A15 processor. Further DVFS tuning takes place on the Cortex-A15 processor if required, as the operating load increases.
Migration requires rapid context switching capability. Coherency is clearly a critical enabler in achieving a fast task migration time as it allows the state that has been saved on the outbound (migrated from) processor to be snooped and restored on the inbound (migrated to) processor rather than going via main memory. Additionally, for Cluster migration, (or for CPU migration when all processors have been switched) because the L2 cache of the outbound processor is coherent it can remain powered up after a task migration to improve the cache warming time of
ARM big.LITTLE
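The migration rule quoted above (tune operating points on the Cortex-A7, then switch to the Cortex-A15 once the A7 is at its highest point and more performance is needed) can be sketched as a small state machine. Everything named here is hypothetical; in particular, the downward migration and landing on the lowest big operating point are assumptions, since the excerpt only describes the upward switch.

```cpp
#include <cassert>
#include <cstddef>

enum class Cluster { Little, Big };  // Cortex-A7 and Cortex-A15 clusters

struct State {
    Cluster cluster;
    std::size_t opp;  // index of the current operating point within the cluster
};

// One governor step: move one operating point up or down, migrating between
// clusters at the ends of the tables.
State step(State s, bool need_more, std::size_t little_opps, std::size_t big_opps) {
    assert(little_opps > 0 && big_opps > 0);
    if (need_more) {
        std::size_t top = (s.cluster == Cluster::Little ? little_opps : big_opps) - 1;
        if (s.opp < top) return {s.cluster, s.opp + 1};              // ordinary DVFS step
        if (s.cluster == Cluster::Little) return {Cluster::Big, 0};  // switch to A15
        return s;                                                    // already at the top
    }
    if (s.opp > 0) return {s.cluster, s.opp - 1};
    if (s.cluster == Cluster::Big) return {Cluster::Little, little_opps - 1};  // assumed
    return s;
}
```

A real switcher also has to save and restore processor state across the migration, which is the part the excerpt says cache coherency makes fast.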
15. Thread-Level Parallelism
• Thread-level parallelism (TLP) is a metric; you can treat it as the number of threads running concurrently
• a table from an ISCA ’10 paper named “Evolution of thread-level parallelism in desktop applications”
• 2000, 2010
• mobile devices are worse
• http://dl.acm.org/citation.cfm?id=1816000
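As a rough illustration of how such a metric can be computed, the sketch below assumes the definition commonly used in this line of work: with c[i] the fraction of time exactly i threads run concurrently, TLP is the average number of running threads over non-idle time, i.e. sum(i * c[i]) / (1 - c[0]). The function name and input format are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// c[i] = fraction of time exactly i threads are running (c[0] is idle time).
// Returns the average number of concurrently running threads, ignoring idle time.
double tlp(const std::vector<double>& c) {
    assert(!c.empty() && c[0] < 1.0);
    double weighted = 0.0;
    for (std::size_t i = 1; i < c.size(); ++i) {
        weighted += c[i] * static_cast<double>(i);
    }
    return weighted / (1.0 - c[0]);
}
```

For example, a system that is idle half the time, runs one thread 30% of the time and two threads 20% of the time gets (0.3 + 0.4) / 0.5 = 1.4.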
16. Parallel Programming Could Help a Bit
• Parallel computing/programming has been around for a long time
• You know pthread and OpenMP are available, and C++11 came with concurrency support
• Java uses threads and its own synchronization model
• “Why Threads Are A Bad Idea”, by John Ousterhout, http://www.cc.gatech.edu/classes/AY2009/cs4210_fall/papers/ousterhout-threads.pdf
• Threads are “easy: to describe; to use; to get wrong”, to quote Andrew Birrell, http://www.cs.princeton.edu/courses/archive/spr07/cos598A/lectures/Birrell.pdf
• For a more theoretical explanation, see “The Problems with Threads” by Edward Lee, http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf
• And you know that besides the shared-memory model, there is the message-passing model. And more, e.g., actors, data-flow, systolic arrays, etc.
17. Threads are Bad Ideas?
• “Why Threads Are A Bad Idea”, John Ousterhout, 1995, http://www.cc.gatech.edu/classes/AY2009/cs4210_fall/papers/ousterhout-threads.pdf
• Yes, it’s a bit dated. Some of those points are no longer valid; many of them stand the test of time
• Threads:
• Too hard for most programmers to use
• Even for experts, development is painful
18. Some of Ousterhout’s arguments remain valid
• Synchronization
• manual setting of mutexes/locks
• deadlock: yes, deadlock
• hard to debug
• threads break modularity
• callbacks don’t work with locks
19. Threads are easy to get wrong
• Manual selection of mutual exclusion:
• Default is too little (and hence races)
• Easy fix is too much (deadlocks or
blank stares)
• Projects don’t create hierarchical
abstractions
• Can’t decide and/or maintain acyclic
locking order
• “Composition” requires entire new
abstractions
• “Clever” optimizations aren’t maintainable
• .....
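One common way to keep an acyclic locking order, which the list above says projects fail to decide on or maintain, is to rank locks by a fixed key and always acquire them in that order. The sketch below does this with POSIX mutexes, ranking by address. `struct account`, `account_init` and `transfer` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct account {
    pthread_mutex_t lock;
    long balance;
};

void account_init(struct account *a, long balance)
{
    pthread_mutex_init(&a->lock, NULL);
    a->balance = balance;
}

/*
 * Move `amount` from one account to another.
 * Both locks are always taken in address order, so two threads doing
 * transfer(a, b) and transfer(b, a) at the same time cannot deadlock.
 * Returns 1 on success, 0 if the transfer was rejected.
 */
int transfer(struct account *from, struct account *to, long amount)
{
    assert(amount >= 0);

    if (from == to)
        return 0;   /* locking the same mutex twice would self-deadlock */

    struct account *first  = (uintptr_t)from < (uintptr_t)to ? from : to;
    struct account *second = (first == from) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);

    int ok = from->balance >= amount;
    if (ok) {
        from->balance -= amount;
        to->balance   += amount;
    }

    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
    return ok;
}
```

The ordering rule lives inside `transfer`, so callers never choose an order themselves. This works for a single helper, but it does not compose once locks are spread over several modules, which is the point the slide makes.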
20. User-level libraries,
frameworks
• Android AsyncTask
• a class to help perform background operations and publish results on the UI
thread without having to manipulate threads and/or handlers
• http://developer.android.com/reference/android/os/AsyncTask.html
• Intel Threading Building Blocks (TBB)
• http://threadingbuildingblocks.org/, http://en.wikipedia.org/wiki/Intel_Threading_Building_Blocks
• works on Android x86 and ARM
• Apple Grand Central Dispatch (GCD)
• http://developer.apple.com/library/ios/#documentation/Performance/Reference/GCD_libdispatch_Ref/
• Software Transactional Memory
• http://gcc.gnu.org/wiki/TransactionalMemory
21. Language extension
• Intel Cilk Plus
• http://cilkplus.org/, http://en.wikipedia.org/wiki/Intel_Cilk_Plus
• open sourced, trying to get into GCC and LLVM
• Apple blocks
• http://developer.apple.com/library/ios/#documentation/cocoa/Conceptual/Blocks/
22. OpenCL Related
• OpenCL
• pocl, http://pocl.sourceforge.net/
• OpenCL and Java
• Aparapi, https://code.google.com/p/aparapi/
• Sumatra, http://openjdk.java.net/projects/sumatra/
• RenderScript
• in AOSP
• ThorScript
• will be open-sourced
23. Cilk Plus: simple language extensions that originated from Charles Leiserson’s work
24. Simple Cilk Plus Example
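/* serial version */
int fib(int n) {
    if (n < 2) return n;
    int x = fib(n-1);
    int y = fib(n-2);
    return x + y;
}
/* Cilk Plus version: cilk_spawn lets fib(n-1) run in parallel with fib(n-2) */
#include <cilk/cilk.h>
int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n-1);
    int y = fib(n-2);
    cilk_sync;  /* wait for the spawned call before reading x */
    return x + y;
}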
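Cilk Plus needs compiler support, so as a rough analogue here is the same fork-join shape written with OpenMP tasks, which an earlier slide lists as available. This is a swapped-in technique, not Cilk: `#pragma omp task` plays the role of `cilk_spawn` and `#pragma omp taskwait` the role of `cilk_sync`. The names `fib_task` and `fib_parallel` are made up for this sketch.

```c
#include <assert.h>

/* Recursive step: the first call may run as a separate task. */
long fib_task(int n)
{
    assert(n >= 0);
    if (n < 2)
        return n;

    long x, y;
    #pragma omp task shared(x)      /* comparable to cilk_spawn */
    x = fib_task(n - 1);
    y = fib_task(n - 2);
    #pragma omp taskwait            /* comparable to cilk_sync */
    return x + y;
}

/* Entry point: start a team of threads and let one of them seed the tasks. */
long fib_parallel(int n)
{
    long result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n);
    return result;
}
```

When OpenMP is not enabled at build time, a compiler typically ignores the pragmas and the code runs serially with the same result. As with the Cilk version, spawning a task per call costs far more than the addition it parallelizes, so this shows the construct rather than a fast fib().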
25. simple GCD+blocks
dispatch_group_t group = dispatch_group_create();
fib = ^() {
    if (n < 2) {
        result = n;
        return;
    }
    __block int x, y;
    int m = n;
    n = m - 1;
    dispatch_group_async(group, a_queue, ^{fib(); x = result;});
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
    n = m - 2;
    dispatch_sync(a_queue, ^{fib(); y = result;});
    n = m;
    result = x + y;
    return;
};
26. data parallel fib() looks
more reasonable
int fib(int n) {
    if (n < 2) return n;
    int p = 0, q = 1, result = 0;
    cilk_for (int i = 2; i <= n; i++) {
        result = p + q;
        p = q; q = result;
    }
    return result;
}
n.b.: in case you didn’t notice, this may produce wrong results because of the loop-carried dependency: each iteration reads the p and q written by the previous one
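For comparison, a plain serial loop computes the same recurrence correctly, because each iteration is guaranteed to see the p and q left by the one before it. The sketch below is only a baseline; `fib_iter` is an illustrative name, and the 64-bit unsigned result limits it to n <= 93.

```c
#include <assert.h>

/* Serial iterative Fibonacci: the loop-carried dependency is harmless here. */
unsigned long long fib_iter(unsigned n)
{
    assert(n <= 93);    /* fib(94) no longer fits in 64 bits */

    if (n < 2)
        return n;

    unsigned long long p = 0, q = 1;
    for (unsigned i = 2; i <= n; i++) {
        unsigned long long next = p + q;   /* depends on the previous iteration */
        p = q;
        q = next;
    }
    return q;
}
```

A loop is a good candidate for cilk_for or dispatch_apply only when its iterations do not depend on each other in this way.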
27. parallel fib() with GCD
and blocks
int(^fib)(int);
fib = ^(int n){
    if (n < 2) return n;
    __block int p = 0, q = 1, result = 0;
    /* n.b.: same loop-carried dependency as the cilk_for version, so results may be wrong */
    dispatch_apply(n-1, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t i) {
        result = p + q;
        p = q; q = result;
    });
    return result;
};
28. GCD can be used with OpenCL
• That’s what is available on Mac OS X and iOS
• Well, not quite: iOS hasn’t opened up OpenCL yet, but it is easy to find out how to use OpenCL for ARM on iOS
29. What is available
• Task-parallel and data-parallel constructs, libraries or languages
• Lambda, closure, continuation, etc.
• Queue, queue management: load balance, work stealing, etc.
• Data structures, e.g., TBB
• Lock-free synchronization
30. Lock-free synchronization
• In case you didn’t know, no, it’s not new at all
• Linux has been using RCU (Read-Copy-Update) for several years
• In fact, the idea has been around since the 1970s; see Kung’s 1980 paper, which proposed an RCU-like mechanism
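The read-copy-update idea can be sketched in user space with C11 atomics. This is not the Linux kernel RCU API; `struct config`, `config_init`, `config_read` and `config_update_max` are hypothetical names. Readers take no lock at all, while a writer copies the current version, changes the copy, and publishes it with a compare-and-swap. The sketch deliberately leaves out the hard part of RCU, which is deciding when the old version may be freed (the grace period): it simply hands the old pointer back to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct config {
    int min_freq_khz;
    int max_freq_khz;
};

/* Pointer to the currently published version. */
static _Atomic(struct config *) current_config;

/* Publish the first version. Returns 0 on success, -1 if out of memory. */
int config_init(int min_khz, int max_khz)
{
    struct config *c = malloc(sizeof *c);
    if (c == NULL)
        return -1;
    c->min_freq_khz = min_khz;
    c->max_freq_khz = max_khz;
    atomic_store_explicit(&current_config, c, memory_order_release);
    return 0;
}

/* Reader side: a single atomic load, no lock, never blocks. */
const struct config *config_read(void)
{
    return atomic_load_explicit(&current_config, memory_order_acquire);
}

/*
 * Writer side: copy, update, publish.
 * Returns the replaced version (or NULL if out of memory). The caller may
 * free it only once no reader can still be using it.
 */
struct config *config_update_max(int new_max_khz)
{
    struct config *fresh = malloc(sizeof *fresh);
    if (fresh == NULL)
        return NULL;

    struct config *old =
        atomic_load_explicit(&current_config, memory_order_acquire);
    assert(old != NULL);    /* config_init() must have been called */

    do {
        *fresh = *old;                       /* copy   */
        fresh->max_freq_khz = new_max_khz;   /* update */
    } while (!atomic_compare_exchange_weak_explicit(
                 &current_config, &old, fresh,
                 memory_order_acq_rel, memory_order_acquire));

    return old;
}
```

A reader that loaded the pointer before the update keeps seeing a complete, consistent old version, and a reader that loads it afterwards sees the new one; no reader ever observes a half-written struct.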
31. Kernel
• big.LITTLE
• IKS: in-kernel-switcher
• related code is being upstreamed after 3.10
• Global Task Scheduling (GTS), Heterogeneous Multi-Processor (HMP)
• Ingo Molnar, the current CFS maintainer, didn’t like the power-saving part of GTS
• Power Management
• So many mechanisms: cpufreq, cpuidle, runtime PM, CCF, etc.
• Linaro has a wiki page on how to/what to enable/implement for a new SoC
• Thermal Management
• Throttling, i.e., asking related components to slow down so that less heat is generated
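As an illustration of throttling, the sketch below steps a CPU frequency cap down one level whenever the temperature is at or above a trip point, and back up once it has cooled below the trip point minus a hysteresis margin. Everything here is an assumption made for the example: the frequency table, the function names and the thresholds do not come from the slides or from any particular kernel governor.

```c
#include <assert.h>

/* Hypothetical frequency caps, from least throttled to most throttled. */
static const int freq_khz[] = { 1600000, 1200000, 800000, 400000 };
enum { NUM_STATES = sizeof freq_khz / sizeof freq_khz[0] };

/*
 * Pick the next cooling state from the current one.
 * Temperatures are in millidegrees Celsius.
 */
int next_cooling_state(int state, int temp_mc, int trip_mc, int hyst_mc)
{
    assert(state >= 0 && state < NUM_STATES);

    if (temp_mc >= trip_mc && state < NUM_STATES - 1)
        return state + 1;               /* too hot: throttle one step more */
    if (temp_mc < trip_mc - hyst_mc && state > 0)
        return state - 1;               /* cooled down: relax one step */
    return state;                       /* inside the hysteresis band */
}

/* Frequency cap that corresponds to a cooling state. */
int state_to_freq_khz(int state)
{
    assert(state >= 0 && state < NUM_STATES);
    return freq_khz[state];
}
```

The hysteresis margin keeps the cap from flapping between two levels when the temperature hovers around the trip point.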
34. Much remains to be done
• No widely used open-source power or thermal management framework available?
• Some problems are fundamentally hard to parallelize, e.g.,
• parsing in browsers: nowadays, WebKit and Firefox use LALR(1) or similar parsing algorithms
• No full-featured open-source OpenCL implementation for GPGPU
35. Wrap-up
• “dark silicon” is a reality on mobile devices
• power wall and thermal wall
• parallel/concurrent code isn’t popular on mobile devices (yet)
• discussed some possible free and open source solutions
• much remains to be done