A Multi-core Software/Hardware Co-debug Platform with ARM CoreSight™,
On-chip Test Architecture and AXI/AHB Bus Monitor
Alan P. Su, Jiff Kuo, †Kuen-Jong Lee, ‡Ing-Jer Huang, §Guo-An Jian,
§Cheng-An Chien, §Jiun-In Guo, and ‡Chien-Hung Chen
Global Unichip Corp., †EE Dept. National Cheng Kung University,
‡CSE Dept. National Sun Yat-Sen University,
§CS Dept. National Chung Cheng University
{alan.su, jiff.kuo}@globalunichip.com, †kjlee@mail.ncku.edu.tw,
‡ijhuang@cse.nsysu.edu.tw, §jiguo@cs.ccu.edu.tw
Abstract
Multi-core systems are becoming the next-generation embedded
design platform. Multiple Instruction Multiple Data (MIMD)
Systems-on-a-Chip (SoCs) that integrate heterogeneous and
homogeneous processor cores to provide complex services,
e.g. smart phones, are on the horizon.
However, distributed programming is a difficult problem in
such systems. Today, only very few MIMD SoC designs offer
a comprehensive multi-core software/hardware
co-debug capability that can stop at not only software but
also hardware breakpoints to inspect data and system status
for identifying bugs. In this work we have integrated various
debug mechanisms so that the entire multi-core SoC can
break at software and hardware breakpoints an unlimited
number of times for data and status inspection, and then step
forward or resume execution until the next breakpoint. This debug
mechanism is realized with a chip with four ARM1176 cores
and the ARM CoreSight™ on-chip debug and trace system, a
Field Programmable Gate Array (FPGA) loaded with the
on-chip test architecture and bus monitor, and a software
debug platform that downloads system trace and processor core
data for inspection and debug control.
Key contributions of this work are (1) the development of a
multi-clock multi-core software/hardware co-debug platform
and (2) a multi-core program debugging exercise that
visualizes the physical behavior of race conditions.
1. Multi-core Programming and Debugging
Multi-core systems are becoming the next-generation embedded
design platform. Heterogeneous or homogeneous processor
cores integrated in a System-on-a-Chip (SoC) to build small
form factor platforms that provide complex services, e.g. smart
phones, are on the horizon. Smart phones run applications from
various domains in the fashion of distributed
computing, and thus the multi-core architecture is generally a
Multiple Instruction Multiple Data (MIMD) type design [1] that
runs different software with a wide range of resource and
performance requirements. However, unlike parallel
programming on homogeneous, Single Instruction Multiple
Data (SIMD) architectures [1], where the same program runs
on multiple processor cores to process different sets of data,
distributed programming is an extremely difficult problem
in MIMD architectures. Today, very few MIMD SoC
designs offer a comprehensive multi-core
software/hardware co-debug capability [2]. Ideally, the
architecture needs to support not only software but also
hardware breaks and visibility.
Figure 1 gives a simple example of SoC described in [3]
to illustrate software complexity faced in multi-core MIMD
designs. Figure 1(a) is the target system specification described
in a task graph. AP1 fetches encoded image data from an input
source, pre-processes the data, delivers to AP2 for
post-processing and then sends the decoded image for display.
Figure 1(b) is a MIMD dual core implementation of the given
task graph. By design, AP1 runs on an ARM core and fetches
the encoded image data stored on a USB flash memory
through the USB port. After the pre-processing, AP1 stores the
data to a shared SRAM, then sends the DSP core a “data write
complete” message, by issuing an interrupt through an OS
system call, so that the DSP core executes AP2. The DSP core
receives the interrupt, which triggers the Interrupt Service
Routine (ISR) to initiate AP2 to read the data from the shared
SRAM, post-process it and send it to the frame buffer of the
LCD display.
(a) Task Graph
(b) A Dual Core Implementation
Figure 1, An example of dual core system implementation
However, let us consider a race condition scenario.
Assume the data passing between AP1 and AP2 is not
controlled by a mutual exclusion mechanism that guarantees the
AP1-write-before-AP2-read order. In this scenario AP1
writes the last block of data to the shared SRAM and
immediately issues the “data write complete” interrupt to the
DSP core. Let us also assume that bus AHB0 has a lower
priority than bus AHB1 on the Inter Connection Module (ICM)
arbiter and the data write from AP1 is blocked by other
AHB1 requests issued by the DSP core. The interrupt request
is served at the highest priority by the DSP core, and thus the
“data write complete” interrupt triggers AP2 before the
last AP1 data block is stored in the shared SRAM. Since there is
no mutual exclusion mechanism in place to prevent AP2 from
reading data before it is ready, the race condition occurs.
To debug the problem, the programmer needs
debug controllability and visibility into the ARM and
DSP cores to track program execution, visibility into the
ICM to see how it serves AHB0 and AHB1, and a timing
view showing that the interrupt happens before the AP1 data
is written to the shared memory, in order to identify the root
cause of the race condition.
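As an illustration only, the ordering problem above can be replayed in a small cycle-based model. The sketch below is not part of the platform: the function name `run`, the flag `wait_for_write_ack` and all latencies (in cycles) are assumptions made up for this example, and delaying the interrupt until the write is acknowledged is just one possible way to enforce the AP1-write-before-AP2-read order.

```python
# Deterministic toy model of the AP1/AP2 race. All names and latencies
# are illustrative assumptions, not values taken from the real SoC.

def run(write_latency, irq_latency, wait_for_write_ack):
    """Return what AP2 reads from the shared SRAM slot ("stale" or "fresh")."""
    sram = {"last_block": "stale"}   # content before AP1's final write lands
    write_done_at = write_latency    # AP1 issues the write at cycle 0
    # AP1 raises the interrupt at once, or only after the write is acknowledged.
    irq_sent_at = write_done_at if wait_for_write_ack else 0
    ap2_reads_at = irq_sent_at + irq_latency

    for cycle in range(max(write_done_at, ap2_reads_at) + 1):
        if cycle == write_done_at:
            sram["last_block"] = "fresh"   # the blocked write finally completes
        if cycle == ap2_reads_at:
            return sram["last_block"]      # AP2, started by the ISR, reads the block


print(run(write_latency=5, irq_latency=2, wait_for_write_ack=False))  # stale
print(run(write_latency=1, irq_latency=2, wait_for_write_ack=False))  # fresh
print(run(write_latency=5, irq_latency=2, wait_for_write_ack=True))   # fresh
```

The first call corresponds to the blocked AHB0 write: the interrupt overtakes the data. The second call shows why such a bug can stay hidden when the bus is not congested.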
To support the debug capability needed to identify the
race condition described above, the multi-core SoC in question
needs to have a way to set a breakpoint at the end of the AP1
data write to break the complete SoC. The debug mechanism
then allows the designer to
1. inspect AP1 and AP2 programs,
2. view ARM and DSP cores status and data,
3. check components AHB0, AHB1, ICM and shared
SRAM status and data, and
4. step the SoC through the execution to see the
interactions among programs, cores, components and
busses to visualize the race condition
In this work we integrated various sub-systems into
a multi-core software/hardware co-debug platform that
delivers the features listed above. We realized the platform by
implementing a quad-ARM1176 SoC with CoreSight™ [4]
hooked up with an on-chip test mechanism and an AHB/AXI bus
monitor. To validate the platform, a multi-core
programming exercise was also conducted to develop a 3D
image application on this co-debug platform.
In Section 2 we discuss ARM CoreSight™. Section 3
describes a multi-clock on-chip test architecture that can
set hardware breakpoints and break, view
functional unit control register data, cycle step and resume.
Section 4 illustrates an AHB/AXI bus monitor, a
Verification Intellectual Property (VIP) capable of flagging
erroneous AHB/AXI transactions and dumping traces.
Section 5 shows the integration of the multi-core
software/hardware co-debug platform with ARM CoreSight™,
the multi-clock on-chip test architecture and the AHB/AXI bus
monitor. Section 6 illustrates the exercise in multi-core 3D
image application programming and debugging using the
co-debug platform, and Section 7 concludes this work
and outlines future research.
2. ARM CoreSight™
Figure 2, ARM CoreSight™ debugging environment
ARM CoreSight™ is an on-chip component developed by
ARM to support multi-core cross triggering, which allows a
core that hits a breakpoint to break all other cores. This is done
by a general Cross Trigger Matrix (CTM) and an individual Cross
Trigger Interface (CTI) on each core. ARM has developed CTIs
for the ARM9, ARM11 and Cortex families. The CTI is used for
debug control and for viewing ARM core status and registers.
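The cross-triggering idea can be pictured with a toy model. The sketch below is a conceptual illustration under our own assumptions: the classes `Core` and `CrossTriggerMatrix` and their methods are hypothetical names, and the model ignores CTI channel programming, register interfaces and timing of the real CoreSight™ components.

```python
class Core:
    """A processor core that is either running or halted in debug state."""

    def __init__(self, name):
        self.name = name
        self.halted = False


class CrossTriggerMatrix:
    """Toy CTM: forwards a trigger from one core's CTI to all attached cores."""

    def __init__(self, cores):
        self.cores = list(cores)

    def breakpoint_hit(self, source):
        # The source core stops itself; its CTI then raises a trigger that
        # the matrix broadcasts to the CTI of every other core.
        source.halted = True
        for core in self.cores:
            if core is not source:
                core.halted = True

    def resume_all(self):
        for core in self.cores:
            core.halted = False


cores = [Core(f"arm1176_{i}") for i in range(4)]
ctm = CrossTriggerMatrix(cores)
ctm.breakpoint_hit(cores[2])          # core 2 hits a breakpoint
print([c.halted for c in cores])      # [True, True, True, True]
```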
CoreSight™ also supports trace dump. Each core dumps
its trace through its own Embedded Trace Macrocell (ETM)
onto the Advanced Microcontroller Bus Architecture (AMBA)
Trace Bus (ATB) and on to the trace port through the Trace Port
Interface Unit (TPIU). The trace dump can further provide
complete core information for debug purposes.
DS-5, the ARM debugging tool for CoreSight™ and
beyond, controls program debug and trace dump through
DSTREAM, the In-Circuit Emulator (ICE) of CoreSight™, via
the Joint Test Action Group (JTAG) port, the Debug Access Port
(DAP) and the Debug APB, and into the ETM and CTI.
CoreSight™ does not restrict its support to ARM
core families. By following the ETM and CTI protocols, one can
also develop an ETM and a CTI for other cores such as the DSP in
Figure 2. Such an external core is then controlled by DS-5 and
integrated into the debug environment. This is how we hook
the on-chip test architecture introduced in the next section up
with CoreSight™.
3. On-Chip Debug Architecture
Following Moore’s Law, integrated circuit (IC)
technology doubles its gate density every eighteen months. At
the 28nm node the gate density has reached 4.2M gates per
mm². With such a high capacity we can start to consider
putting self-testing capability onto the chip. The development of
an on-chip test architecture is studied in [5, 6]. In this work we
leverage this on-chip test architecture for debug purposes as well.
The side band test bus and test port can be used for component
core register inspection [7]. By adding a multiple clock gating
and stepping mechanism, we can implement hardware breaks,
component register data viewing and cycle stepping to support
hardware debug capability.
3.1 Overall On-Chip Debug Architecture
Figure 3, Overall architecture of SoC debug platform
Figure 3 shows the overall architecture of the on-chip
debug platform, which consists of both software and hardware
components. The embedded processor (ARM1176)
executes the software program under the control of the
debugger tool on the PC-host, through the JTAG port and ICE.
The instruction memory stores the instructions to be
executed, while the CUD (Core Under Debug) data memory
stores the data required by the Intellectual Property (IP)
application and the operational results of the IP.
The IP cores are wrapped with the IEEE 1500 wrappers
[8] that support core-level testing with parallel scan capability.
The Test Access Mechanism Controller (TAM Controller or
TAMC) generates debug control signals to control the debug
procedure for IP cores. It also buffers the traced data and stores
them to a local memory. The dedicated test bus connects the
wrapped CUDs with the TAMC for the transfer of the control
signals and the traced data.
To integrate the debug platform with the ARM CoreSight
on-chip debug and trace architecture, a customized CTI
module and an AHB-APB bridge are added to this platform.
The CTI module can deliver a debug request signal (DBGRQ)
to let the TAM Controller enter into the debug mode. During
the debug mode, the TAM Controller stops the CUD when
hitting the break point and dumps the contents of the CUD to
the local memory. It can also compare the obtained data with
the golden data retrieved from external or embedded memory.
These operations are controlled by a debug tool called
DASTEP which is stored in the PC-host. The user can thus
examine the test results immediately. After finishing the debug
function, the TAM Controller delivers an acknowledge signal
(DBGACK) to the CTI module. The bus bridge is needed
because the CTI module is compliant with AMBA APB
protocol.
The software is composed of a user-provided application
program and a debug program. The application program
executes the functional operation of the system. The debug
program contains the setup data to initialize the TAM
Controller, which is also generated by DASTEP.
3.2 Multiple Clock Gating and Stepping
The main issue in gating and stepping multiple clocks is clock
synchronization. Take two cores with 100MHz
and 125MHz clocks as an example: their periods are 10ns and
8ns, so synchronous positive edges occur only once every 40ns,
the least common multiple of the two periods. If we gate at the
first synchronous positive edge, every synchronous step has to
be 40ns, i.e. four and five cycles of the respective clocks, away.
Too many events can happen in this
period of time and the resolution is too low for meaningful
hardware/software debug. We have investigated this problem
by carefully examining the relationship between the clock rates
of interacting cores and are now able to identify many more
cycles that can be “safely” stopped and resumed. Thus, instead
of breaking at synchronous positive edges and stepping by the
least common multiple of the clock periods, we gate the clocks at
the identified safe instants, which are usually the same as or just
a few cycles away from the breakpoints. With this breaking
mechanism, even though each clock may have a different
phase shift relative to its last positive edge, all clocks can be
resumed synchronously and continue correctly without any glitches.
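The spacing of coinciding positive edges can be checked with a short calculation. The sketch below assumes ideal, phase-aligned, jitter-free clocks whose frequencies are integers in Hz; real clock trees with phase offsets may behave differently, and the function names are ours. It does not reproduce the safe-instant identification described above, only the baseline it improves on.

```python
from fractions import Fraction
from math import gcd


def coincidence_interval_ns(freq_a_hz, freq_b_hz):
    """Time in ns between coinciding rising edges of two ideal clocks."""
    common_hz = gcd(freq_a_hz, freq_b_hz)   # rate at which both clocks tick together
    return Fraction(10**9, common_hz)


def cycles_between_coincidences(freq_a_hz, freq_b_hz):
    """Number of cycles each clock completes between coinciding edges."""
    common_hz = gcd(freq_a_hz, freq_b_hz)
    return freq_a_hz // common_hz, freq_b_hz // common_hz


print(coincidence_interval_ns(100_000_000, 125_000_000))      # 40
print(cycles_between_coincidences(100_000_000, 125_000_000))  # (4, 5)
```

Under these assumptions, stepping only at coinciding edges skips several cycles in each clock domain per step, which is the resolution problem that gating at safe instants is meant to avoid.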
3.4 DASTEP and Debug Procedure
DASTEP is the control Graphical User Interface (GUI) of
the on-chip debug architecture. It can set hardware breakpoints
on components, view control register values and cycle step
the system. DASTEP can also dump and view component trace
data.
To run a debugging procedure with DASTEP, we first
determine which CUD in which clock domain is to be
observed. Then we set the cycle-based hardware breakpoint
and wait for the traced data. The traced data is transferred
to the PC-host through the UART when the
breakpoint matches. The traced data can be displayed and
compared to the golden data. In the following, the debug
procedure using the GUI is described in detail.
Figure 4 shows the overview of the graphic user interface
in DASTEP. By clicking the MCD hardware Breakpoint item
under the Debug Platform entry, the setup window as shown
in Figure 5 appears, which allows the user to select the cores to
be debugged (①), to set the first breakpoint (②), and to select
the master clock domain for the reference of the breakpoint
cycle (③). The user can then click “Apply” to enter the debug
information, “Run” to start the debug session or “Cancel” to
cancel the setup information.
Figure 4, Overview of graphic user interface
Figure 5, Setup window of MCD hardware breakpoint
insertion
After the debug session starts, the PC-host waits to
receive the traced data. Once the breakpoint is hit, the
traced data is displayed in a control and
display window as shown in Figure 6. The user can now examine
the traced data in the window (①). Four operations are
supported here (③): “Browse & Save” to store the traced
results, “Select Register” to select the registers to be displayed,
“Terminate Debug Mode” to continue the functional
operation of the CUD, and “Cancel” to cancel the control and
display window.
After examining the information of the current breakpoint,
the user can continue with the next breakpoint set-up (②)
by entering the next breakpoint cycle in the “Next Breakpoint
Cycle” column, clicking “Run” to resume the system and let it
stop at the next breakpoint, clicking “Receive” to wait for and
receive the traced data, or clicking “Single Step” to continue
the debug session in a cycle-by-cycle manner.
It is worth mentioning that the open source software
GTKWave is employed to show the trace data in a
waveform-based display, as shown in Figure 7.
Figure 6, Control and display window
Figure 7, Waveform-based displays of traced data
4. AHB/AXI Bus Monitor
The bus monitor consists of a protocol checker, a bus tracer
and a trace memory. Figure 8 shows the bus monitor
architecture in the red block modules. The protocol checker
detects real-time bus protocol error or inefficiency. The bus
tracer captures on-chip bus signals at many levels of
abstraction and performs real time compression. The trace
memory is used to store compressed traces. The protocol checker
and the bus tracer can collaborate with each other. For
example, when the protocol checker detects a bus protocol
error, it triggers the bus tracer to start/stop monitoring and
store the trace data into the trace memory.
In [9] an AHB bus monitor is developed. Later the
technology evolved and the AXI bus is also supported. These
two works developed a hardware VIP to help verify
components with AHB and/or AXI interfaces. Figure 9 shows
the AXI trace verification method. In the simulation
environment, the AXI VIP produces cycle-accurate AXI
interconnect behavior. The AXI tracer passively captures signals
from the VIP and compresses the trace data, which is stored in
the trace memory. The bus analyzer decompresses the trace. We
compared the data dumped directly from simulation with the
decompressed trace to verify the AXI tracer. Similarly,
the AXI protocol checker verification is also based on the AXI
VIP [10]. The monitor performed AXI rule checking and reported
an error message when an AXI violation was found. A circular
buffer then dumped its data, a bus trace of around 1000
cycles before the violation.
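The interplay between rule checking and the circular buffer can be sketched in software. The example below is a minimal model under our own assumptions: `CircularTraceBuffer`, `monitor` and the `is_violation` callback are hypothetical names, the rule passed in is a placeholder rather than a real AXI rule, and the trace compression performed by the hardware tracer is left out.

```python
from collections import deque


class CircularTraceBuffer:
    """Keeps only the most recent `depth` bus samples."""

    def __init__(self, depth=1000):
        self.samples = deque(maxlen=depth)   # oldest samples drop off automatically

    def record(self, cycle, signals):
        self.samples.append((cycle, signals))

    def dump(self):
        return list(self.samples)


def monitor(bus_cycles, is_violation, depth=1000):
    """Trace every cycle; on the first violation return (cycle, dumped trace)."""
    buf = CircularTraceBuffer(depth)
    for cycle, signals in enumerate(bus_cycles):
        buf.record(cycle, signals)
        if is_violation(signals):
            return cycle, buf.dump()
    return None, []


# Placeholder stimulus: the value -1 stands for an illegal bus sample.
stimulus = [0] * 2500 + [-1] + [0] * 10
hit_cycle, trace = monitor(stimulus, lambda s: s == -1)
print(hit_cycle, len(trace), trace[0][0])   # 2500 1000 1501
```

A bounded `deque` gives the same "last N cycles" behavior as a hardware circular buffer without explicit read and write pointers.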
The bus monitor has been available for both AHB and AXI
buses [9-10]. Figure 8 shows the AXI monitor integrated into
the SoC debug platform. The AXI bus is the center of the SoC.
The AHB bus is used as a debug bus for the SoC. The PCI
interface is a communication channel between the debug bus
of the SoC and the debug software running on a PC. The debug
commands from the debug software are translated into AHB
master commands by a PCI2AHB transactor to configure and
access the AXI monitor components. Once an error has been
signaled, the debug architecture notifies CoreSight™ and
breaks the whole SoC. The user can then cycle step and view
program, processor core, component and bus data and status to
identify the problem.
[Block diagram labels: PCI Interface of PC-host, PCI2AHB Transactor, Debug Bus (based on AHB bus), ARM 926EJ-S, AXI Checker, AXI Tracer, Trace Memory, SMC, LMC, SRAM, ROM, AXI Interconnect, ARM 1126, CUD, Trigger Event Memory, TAM Controller, UART]
Figure 8, Integration of the bus monitor in the SoC debug platform
Figure 9 is the bus monitor analyzer software running on
the debug PC to configure and access the bus monitor. There
are four windows showing (1) a multi-resolution waveform
viewer, (2) an access control signal analyzer, (3) an
address/data timing distribution analyzer, and (4) a bus state
transition analyzer.
(1) Multi-Resolution Waveform Viewer (2) Access Control Signals Analyzer
(3) Address/Data Timing Distribution (4) Bus Transition Model Analyzer
Figure 9, Bus monitor analyzer
5. Multi-core Software/Hardware Co-debug Platform
The multi-core software/hardware co-debug platform
described in the previous sections was only the hardware side of
the system; integration with the software debugger is also
required. Our first step was to connect to CoreSight™ using
DS-5 and DSTREAM. We then set breakpoints in ARM1176
programs and broke all four ARM1176 cores and the on-chip test
architecture once a breakpoint was hit. With DS-5 we can
inspect program data. With DASTEP we can view component
control register data, and with the AHB/AXI bus monitor we can
dump and view bus trace data. The final step is to use
DASTEP to set hardware breakpoints that also break the
program execution on the four ARM1176 cores for debug
purposes. A user-friendly multi-core software/hardware
co-debug platform is thus completed.
6. Experiment
We use a 3D depth map generation project to verify the
co-debug platform. The front end of the 3D depth map
generation is a High Profile H.264 decoder [11, 12]. The 3D
depth map generation [13, 14] transforms 2D video images
into a 3D view with a depth map for 3D video viewing. The
development started with single-threaded C++ code and
moved to MIMD programs. With the help of the
multi-core co-debug platform we realized the need for a
hardware-implemented H.264 decoder to cope with 3D depth
map generation using all four ARM1176 cores to play a High
Profile video in real time.
6.1 Algorithm of 3D Depth Map Generation
Figure 10 shows the 3D depth map generation algorithm.
It generates the depth maps in good quality for most 2D
images. In addition, the processing steps in the proposed
algorithm have been optimized for reducing its complexity
while preserving good quality. The encapsulated low
complexity techniques are introduced below.
Figure 10, Proposed depth map generating algorithm
We first use a Sobel mask to extract the edge information of the
input image for detecting vanishing lines in the next step. We
optimize the Sobel mask formula to reduce the computational
complexity by about 65% while preserving result quality. Then, we
use the 5×5 Hough transform to detect vanishing lines. After the
Hough transform, we classify the input images into three types:
Normal (with vanishing point), Scenery (with
sky/mountain), and Close-up. Based on this classification,
we use a different method for each type to generate a depth map
of good quality.
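For readers unfamiliar with the edge-detection step, the following is a sketch of the textbook 3×3 Sobel operator with the common |Gx| + |Gy| magnitude approximation. It is not the optimized formula used in this work, which is not reproduced here; the function name and the list-of-lists image format are assumptions made for the example.

```python
def sobel_magnitude(image):
    """Approximate gradient magnitude |Gx| + |Gy| of a 2D list of gray levels.

    Border pixels are left at 0 because the 3x3 mask does not fit there.
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out


# A vertical step edge between columns 1 and 2.
step = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_magnitude(step)
print(edges[1])   # [0, 1020, 1020, 0]
```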
For the Normal type, we calculate the intersection points of
the vanishing lines. After calculating all intersection points,
we use an 8×8 region to group the nearest
points in the image; this region is called the vanishing region
(VR). We then generate the
Gradient Depth Map (GDM) according to the distance between
every pixel and the VR. For
the Scenery type, we define the top of the image as the
VR and assign a static
GDM, since the sky or mountain is always at the top of the
image. For the Close-up type, we only adopt a block-based contrast
filtering to separate the background and foreground objects.
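One way to picture a distance-based GDM is given below. This is a sketch under explicit assumptions that the text does not state: the VR is reduced to a single pixel coordinate (for example the center of the 8×8 region), depth is an 8-bit value, the VR is taken as the farthest point (value 0), and depth grows linearly with Euclidean distance from it. The function name is hypothetical.

```python
from math import hypot


def gradient_depth_map(width, height, vr_x, vr_y):
    """Return an 8-bit depth map that grows linearly with distance from the VR."""
    # The largest distance from the VR is always reached at one of the corners.
    max_dist = max(hypot(cx - vr_x, cy - vr_y)
                   for cx in (0, width - 1) for cy in (0, height - 1))
    if max_dist == 0:                      # degenerate 1x1 image
        return [[0] * width for _ in range(height)]
    return [[round(255 * hypot(x - vr_x, y - vr_y) / max_dist)
             for x in range(width)]
            for y in range(height)]


gdm = gradient_depth_map(5, 5, vr_x=2, vr_y=2)
print(gdm[2][2], gdm[0][0])   # 0 255
```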
Finally, Joint Bilateral Filtering (JBF) is used to
post-process the merged depth map by strengthening the edge
information of the objects related to the original image in the
proposed algorithm. We optimize each step in the proposed 3D
depth map generation algorithm and achieve about 90% of
complexity reduction in terms of execution time as compared
to the original ones.
6.2 Parallelization of 3D Depth Map Generation
To realize the 3D depth map generation on
the multi-core platform, we propose a parallel 3D video
playing system as shown in Figure 11. In this system, we use
one thread to decode the H.264 video, three threads
to generate 3D depth maps, and one thread to collect
the depth map from each 3D depth map generator. At the front
end of the proposed system, the H.264 decoder decodes the
bit-stream and produces video. The decoded
video is then delivered to the 3D depth map generators frame by
frame. Finally, the pseudo display collects all the depth maps
in order.
[Block diagram labels: File (H.264 Bitstream), H.264 Decoder, FIFOs, 3D Depth Map Generators, Pseudo Display]
Figure 11, Proposed parallel 3D video playing system
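The thread structure of such a system can be sketched as follows. This is an illustration, not the code of this work: Python's `queue.Queue` stands in for the FIFOs, the round-robin dispatch and collection order is our assumption about how frame order is preserved, `depth_fn` is a placeholder for depth map generation, and all names are hypothetical. Python threads show the structure only, not the speed-up obtained on four cores.

```python
import queue
import threading

STOP = object()   # sentinel marking the end of the stream


def decoder(frames, inputs):
    for i, frame in enumerate(frames):
        inputs[i % len(inputs)].put(frame)   # round-robin dispatch
    for q in inputs:
        q.put(STOP)


def generator(inp, out, depth_fn):
    while (frame := inp.get()) is not STOP:
        out.put(depth_fn(frame))
    out.put(STOP)


def collector(outputs, result):
    i = 0
    while True:
        item = outputs[i % len(outputs)].get()   # same order as the dispatch
        if item is STOP:
            break
        result.append(item)
        i += 1


def play(frames, depth_fn, n_generators=3, fifo_depth=4):
    """Run one decoder, n generators and one collector; return ordered results."""
    inputs = [queue.Queue(fifo_depth) for _ in range(n_generators)]
    outputs = [queue.Queue(fifo_depth) for _ in range(n_generators)]
    result = []
    threads = [threading.Thread(target=decoder, args=(frames, inputs))]
    threads += [threading.Thread(target=generator, args=(i, o, depth_fn))
                for i, o in zip(inputs, outputs)]
    threads.append(threading.Thread(target=collector, args=(outputs, result)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


print(play(list(range(7)), lambda f: f * 10))   # [0, 10, 20, 30, 40, 50, 60]
```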
To ensure correct execution of the
proposed parallel 3D video playing system, we establish a
synchronization mechanism among the threads. We use a
synchronized FIFO to connect any two threads
when one of them has to deliver data to the other. As shown in
Figure 12, the proposed synchronized FIFO is essentially a
circular FIFO based on a producer-consumer
mechanism. The front end of the synchronized FIFO is
connected to the thread that plays the producer, while the rear
end is connected to the thread that
plays the consumer. At the producer end, data can be written to
the FIFO at any time unless the FIFO is full. Similarly, data can
be read from the FIFO at the consumer end at any time unless the
FIFO is empty. A thread that is not permitted to access the
FIFO has to wait until it gets permission. With such a
mechanism, we can easily realize the
synchronization in the proposed parallel 3D video playing
system.
In the following, we describe how the
proposed synchronized FIFO achieves synchronization
between two threads. Figure 13 shows the pseudo code for the
synchronization at the producer end of the synchronized FIFO.
First, the thread at the producer end checks whether the FIFO is
full. When the FIFO is full, the thread has to wait until the
FIFO is not full. After getting access permission, the thread
writes its data to the FIFO. Finally, the thread calls a
confirmation function that updates the information
recorded in the FIFO and issues a notification signal to the
thread at the consumer end of the FIFO. Similarly, Figure 14
shows the pseudo code for the synchronization at the consumer
end of the synchronized FIFO. The thread at the consumer end
performs almost the same steps as the producer
end, except that it checks whether the FIFO is empty rather than full.
Figure 12, Proposed synchronized FIFO
Figure 13, Pseudo code for the synchronization at the producer
end of the synchronized FIFO
Figure 14, Pseudo code for the synchronization at the
consumer end of the synchronized FIFO
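Since the pseudo code of Figures 13 and 14 is not reproduced in this text, the following sketch shows one way the described producer and consumer steps could look. It is our own reconstruction under assumptions: a condition variable provides the wait and notification signals, the class and method names are hypothetical, and the "confirmation" step is folded into the bookkeeping update at the end of each method.

```python
import threading


class SynchronizedFifo:
    """Bounded circular FIFO: writers wait while full, readers wait while empty."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0          # next slot to read
        self.tail = 0          # next slot to write
        self.count = 0         # number of stored items
        self.cond = threading.Condition()

    def put(self, item):
        with self.cond:
            while self.count == len(self.buf):   # full: wait for permission
                self.cond.wait()
            self.buf[self.tail] = item
            self.tail = (self.tail + 1) % len(self.buf)
            self.count += 1                       # confirm: update FIFO state
            self.cond.notify_all()                # notify the consumer end

    def get(self):
        with self.cond:
            while self.count == 0:                # empty: wait for permission
                self.cond.wait()
            item = self.buf[self.head]
            self.head = (self.head + 1) % len(self.buf)
            self.count -= 1                       # confirm: update FIFO state
            self.cond.notify_all()                # notify the producer end
            return item


fifo = SynchronizedFifo(capacity=2)
received = []
consumer = threading.Thread(
    target=lambda: received.extend(fifo.get() for _ in range(5)))
consumer.start()
for value in range(5):
    fifo.put(value)        # blocks whenever the two slots are occupied
consumer.join()
print(received)            # [0, 1, 2, 3, 4]
```

Checking the full and empty conditions in a `while` loop rather than an `if` guards against a thread being woken when the condition it waits for does not hold.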
6.3 Performance Evaluation
In this section, we discuss the performance
improvement of the proposed parallel 3D video playing
system. Figure 15 shows the performance for different
configurations. The test video is in CIF resolution and
contains 300 frames in total. With the configuration of
one thread for H.264 decoding and three threads for 3D depth
map generation, the proposed system
achieves 27.75 fps 3D video display in CIF resolution.
Figure 15, Performance evaluation for the proposed parallel
3D video playing system
7. Conclusion and Future Work
In this work we have built a generic multi-core
software/hardware co-debug platform, which is a framework
for designing future multi-core SoCs with MIMD programming and
debug support. The hardware system designed in this work is a
prototype and not a stand-alone multi-core SoC. We would like to
deploy the platform as an IP on a commercial multi-core
SoC. With this platform we would also like to uncover physical
multi-core programming issues, such as race conditions and
deadlocks, that have not been observed before. The outcome of
this research can be
applied to validate many distributed computation theories and
to enhance algorithms to solve many more problems.
References
[1] M. Flynn, "Some Computer Organizations and Their
Effectiveness". IEEE Trans. Computer. C-21: 948, 1972
[2] A. Mayer, H. Siebert and C. Lipsky, “Multicore Debug
Solution IP,” an IPextreme white paper, 2007,
http://www.ip-extreme.com/downloads/MCDS_whitepaper_070523.pdf
[3] A. Su, “Application of ESL Synthesis on GSM Edge
algorithm for base station,” Proc. ASP-DAC’10, January
2010, pp. 732~737
[4] CoreSight™ Components Technical Reference Manual,
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0314h/DDI0314H_coresight_components_trm.pdf
[5] K-J Lee, C-Y Chu and Y-T Hong, “An Embedded
Processor Based SOC Test Platform,” Proc. International
Symposium on Circuits and Systems, pp. 2983~2986,
2005
[6] W-C Huang, C-Y Chang and K-J Lee, “Toward
Automatic Synthesis of SoC Test Platform,” Proc. VLSI
Design, Automation and Test, pp. 1~4, 2007
[7] K-J Lee, S-Y Liang and A. Su, “A Low-Cost SoC Debug
Platform Based on On-Chip Test Architecture,” Proc.
SOC Conference, pp. 161~164, 2009
[8] IEEE, “1500-2005 IEEE Standard Testability Method for
Embedded Core-Based Integrated Circuits,” E-ISBN
0-7381-4694-3, print ISBN 0-7381-4693-5, IEEE 2005.
[9] Y-T Lin, W-C Shiue and I-J Huang, “A Multi-resolution
AHB Bus Tracer for Real-time Compression of
Forward/Backward Traces in a Circular Buffer,” Proc.
DAC'08, pp. 862~865, 2008
[10] C-H Chen, J-C Ju, and I-J Huang, “A Synthesizable AXI
Protocol Checker for SoC Integration,” IEEE
International SoC Design Conference (ISOCC'10),
Incheon, Korea, Nov. 2010.
[11] Y-C Yang, and J-I Guo, “A High Throughput H.264/AVC
High Profile CABAC Decoder for HDTV Applications,”
IEEE Transactions on Circuits and Systems for Video
Technology, vol. 19, no. 9, pp. 1395-1399, September
2009
[12] K Xu, T-M Liu, J-I Guo, and C-S Choy, “Methods for
Power/Throughput/Area Optimization of H.264/AVC
Decoding,” Journal of Signal Processing Systems, Vol. 60,
No. 1, pp. 131-145, July 2010
[13] C-A Chien, C-Y Chang, J-S Lee, J-H Chang, and J-I Guo,
“Low Complexity 3D Depth Map Generation for Stereo
Applications,” Proc. 2010 VLSI Design/CAD
Symposium, Kaohsiung, Taiwan, August 3-6, 2010.
[14] C-A Chien, C-Y Chang, J-S Lee, J-H Chang and J-I
Guo, “Low Complexity 3D Depth Map Generation for
Stereo Applications,” Proc. ICCE’11, Jan. 9-12, Las
Vegas, USA, 2011
 
Data cache design itanium 2
Data cache design itanium 2Data cache design itanium 2
Data cache design itanium 2
 
System on chip buses
System on chip busesSystem on chip buses
System on chip buses
 

Similar to AMulti-coreSoftwareHardwareCo-DebugPlatform_Final

Designing of telecommand system using system on chip soc for spacecraft contr...
Designing of telecommand system using system on chip soc for spacecraft contr...Designing of telecommand system using system on chip soc for spacecraft contr...
Designing of telecommand system using system on chip soc for spacecraft contr...IAEME Publication
 
Designing of telecommand system using system on chip soc for spacecraft contr...
Designing of telecommand system using system on chip soc for spacecraft contr...Designing of telecommand system using system on chip soc for spacecraft contr...
Designing of telecommand system using system on chip soc for spacecraft contr...IAEME Publication
 
Avionics Paperdoc
Avionics PaperdocAvionics Paperdoc
Avionics PaperdocFalascoj
 
Chapter7_InputOutputStorageSystems.pptx
Chapter7_InputOutputStorageSystems.pptxChapter7_InputOutputStorageSystems.pptx
Chapter7_InputOutputStorageSystems.pptxJanethMedina31
 
UNIT 1 SONCA.pptx
UNIT 1 SONCA.pptxUNIT 1 SONCA.pptx
UNIT 1 SONCA.pptxmohan134666
 
Ec8791 unit 5 processes and operating systems
Ec8791 unit 5 processes and operating systemsEc8791 unit 5 processes and operating systems
Ec8791 unit 5 processes and operating systemsRajalakshmiSermadurai
 
A novel implementation of
A novel implementation ofA novel implementation of
A novel implementation ofcsandit
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processorscsandit
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
 
An area and power efficient on chip communication architectures for image enc...
An area and power efficient on chip communication architectures for image enc...An area and power efficient on chip communication architectures for image enc...
An area and power efficient on chip communication architectures for image enc...eSAT Publishing House
 
Developing a Windows CE OAL.ppt
Developing a Windows CE OAL.pptDeveloping a Windows CE OAL.ppt
Developing a Windows CE OAL.pptKundanSingh887495
 
HIGH PERFORMANCE ETHERNET PACKET PROCESSOR CORE FOR NEXT GENERATION NETWORKS
HIGH PERFORMANCE ETHERNET PACKET PROCESSOR CORE FOR NEXT GENERATION NETWORKSHIGH PERFORMANCE ETHERNET PACKET PROCESSOR CORE FOR NEXT GENERATION NETWORKS
HIGH PERFORMANCE ETHERNET PACKET PROCESSOR CORE FOR NEXT GENERATION NETWORKSijngnjournal
 
Security Enhancement in Networked Embedded System
Security Enhancement in Networked Embedded System Security Enhancement in Networked Embedded System
Security Enhancement in Networked Embedded System IJECEIAES
 
Engineer new post -hangzhou wumu technology co.,ltd.The Design of Human-Mach...
Engineer new post  -hangzhou wumu technology co.,ltd.The Design of Human-Mach...Engineer new post  -hangzhou wumu technology co.,ltd.The Design of Human-Mach...
Engineer new post -hangzhou wumu technology co.,ltd.The Design of Human-Mach...Stephanie hu
 

Similar to AMulti-coreSoftwareHardwareCo-DebugPlatform_Final (20)

Designing of telecommand system using system on chip soc for spacecraft contr...
Designing of telecommand system using system on chip soc for spacecraft contr...Designing of telecommand system using system on chip soc for spacecraft contr...
Designing of telecommand system using system on chip soc for spacecraft contr...
 
Designing of telecommand system using system on chip soc for spacecraft contr...
Designing of telecommand system using system on chip soc for spacecraft contr...Designing of telecommand system using system on chip soc for spacecraft contr...
Designing of telecommand system using system on chip soc for spacecraft contr...
 
Avionics Paperdoc
Avionics PaperdocAvionics Paperdoc
Avionics Paperdoc
 
Chapter7_InputOutputStorageSystems.pptx
Chapter7_InputOutputStorageSystems.pptxChapter7_InputOutputStorageSystems.pptx
Chapter7_InputOutputStorageSystems.pptx
 
UNIT 1.docx
UNIT 1.docxUNIT 1.docx
UNIT 1.docx
 
UNIT 1 SONCA.pptx
UNIT 1 SONCA.pptxUNIT 1 SONCA.pptx
UNIT 1 SONCA.pptx
 
Ec8791 unit 5 processes and operating systems
Ec8791 unit 5 processes and operating systemsEc8791 unit 5 processes and operating systems
Ec8791 unit 5 processes and operating systems
 
Embedded Systems
Embedded SystemsEmbedded Systems
Embedded Systems
 
A novel implementation of
A novel implementation ofA novel implementation of
A novel implementation of
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
An area and power efficient on chip communication architectures for image enc...
An area and power efficient on chip communication architectures for image enc...An area and power efficient on chip communication architectures for image enc...
An area and power efficient on chip communication architectures for image enc...
 
Developing a Windows CE OAL.ppt
Developing a Windows CE OAL.pptDeveloping a Windows CE OAL.ppt
Developing a Windows CE OAL.ppt
 
chameleon chip
chameleon chipchameleon chip
chameleon chip
 
HIGH PERFORMANCE ETHERNET PACKET PROCESSOR CORE FOR NEXT GENERATION NETWORKS
HIGH PERFORMANCE ETHERNET PACKET PROCESSOR CORE FOR NEXT GENERATION NETWORKSHIGH PERFORMANCE ETHERNET PACKET PROCESSOR CORE FOR NEXT GENERATION NETWORKS
HIGH PERFORMANCE ETHERNET PACKET PROCESSOR CORE FOR NEXT GENERATION NETWORKS
 
EC8791-U5-PPT.pptx
EC8791-U5-PPT.pptxEC8791-U5-PPT.pptx
EC8791-U5-PPT.pptx
 
Co question 2008
Co question 2008Co question 2008
Co question 2008
 
Co question 2006
Co question 2006Co question 2006
Co question 2006
 
Security Enhancement in Networked Embedded System
Security Enhancement in Networked Embedded System Security Enhancement in Networked Embedded System
Security Enhancement in Networked Embedded System
 
Engineer new post -hangzhou wumu technology co.,ltd.The Design of Human-Mach...
Engineer new post  -hangzhou wumu technology co.,ltd.The Design of Human-Mach...Engineer new post  -hangzhou wumu technology co.,ltd.The Design of Human-Mach...
Engineer new post -hangzhou wumu technology co.,ltd.The Design of Human-Mach...
 

AMulti-coreSoftwareHardwareCo-DebugPlatform_Final

1. Multi-core Programming and Debugging

Multi-core systems are becoming the next-generation embedded design platform. Heterogeneous or homogeneous processor cores integrated in a System-on-a-Chip (SoC) to build small form factor platforms and provide complex services, e.g. smart phones, are on the horizon. Smart phones serve applications from various domains in the fashion of distributed computing, and thus the multi-core architecture is generally a Multiple Instruction Multiple Data (MIMD) type design [1] that runs different software with a wide range of resource and performance requirements. However, unlike parallel programming on homogeneous and Single Instruction Multiple Data (SIMD) architectures [1], where the same program runs on multiple processor cores to process different sets of data, distributed programming is an extremely difficult problem in MIMD architectures. Today, only very few MIMD SoC designs offer comprehensive multi-core software/hardware co-debug capability [2]. Ideally, the architecture needs to support not only software but also hardware breaks and visibility.

Figure 1 gives a simple example of an SoC described in [3] to illustrate the software complexity faced in multi-core MIMD designs. Figure 1(a) is the target system specification described as a task graph. AP1 fetches encoded image data from an input source, pre-processes the data, and delivers it to AP2, which post-processes it and then sends the decoded image for display. Figure 1(b) is a MIMD dual-core implementation of the given task graph. By design, AP1 runs on an ARM core and fetches the encoded image data stored on a USB flash memory through the USB port. After the pre-processing, AP1 stores the data to a shared SRAM, then notifies the DSP core with a “data write complete” message to execute AP2 by issuing an interrupt using an OS system call.
The DSP core receives the interrupt, which triggers the Interrupt Service Routine (ISR) to initiate AP2 to read data from the shared SRAM, post-process it and send it to the frame buffer of the LCD display.

(a) Task Graph (b) A Dual Core Implementation
Figure 1, An example of dual core system implementation

However, let us consider a race condition scenario. Assume the data passing between AP1 and AP2 is not controlled by a mutual exclusion mechanism that guarantees the AP1-write-before-AP2-read order. In this scenario AP1 writes the last block of data to the shared SRAM and immediately issues the “data write complete” interrupt to the DSP core. Let us also assume that bus AHB0 has a lower priority than bus AHB1 on the Inter Connection Module (ICM) arbiter, and that the data write from AP1 is blocked by other AHB1 requests issued by the DSP core. The DSP core serves the interrupt request at the highest priority, and thus the “data write complete” interrupt triggers AP2 before the last AP1 data is stored in the shared SRAM. Since there is no mutual exclusion mechanism in place to prevent AP2 from reading data before it is ready, the race condition occurs. To debug this problem, the programmer needs debug controllability and visibility into the ARM and DSP cores to track program execution, visibility into the ICM to see how it serves AHB0 and AHB1, and a timing view showing that the interrupt happens before the AP1 data is written to the shared memory, in order to identify the root cause of the race condition.

To support the debug capability needed to identify the race condition described above, the multi-core SoC in question needs a way to set a breakpoint at the end of the AP1 data write that breaks the complete SoC. The debug mechanism then allows the designer to
1. inspect AP1 and AP2 programs,
2. view ARM and DSP core status and data,
3. check the status and data of components AHB0, AHB1, ICM and the shared SRAM, and
4. step the SoC through the execution to see the interactions among programs, cores, components and busses and thereby visualize the race condition.

In this work we integrated various sub-systems into a multi-core software/hardware co-debug platform that delivers the above features. We realized the platform by implementing a quad-ARM1176 SoC with CoreSight™ [4] hooked up to an on-chip test mechanism and an AHB/AXI bus monitor. To validate the implemented platform, a multi-core programming exercise was also conducted to develop a 3D image application on this co-debug platform.

In Section 2 we discuss ARM CoreSight™. Section 3 describes a multi-clock on-chip test architecture that can set hardware breakpoints and break, view functional unit control register data, cycle step and resume. Section 4 illustrates an AHB/AXI bus monitor, a Verification Intellectual Property (VIP) capable of flagging erroneous AHB/AXI transactions and performing trace dumps. Section 5 shows the integration of the multi-core software/hardware co-debug platform with ARM CoreSight™, the multi-clock on-chip test architecture and the AHB/AXI bus monitor. Section 6 illustrates the exercise in multi-core 3D image application programming and debugging using the co-debug platform, and Section 7 closes this work with conclusions and future research.

2. ARM CoreSight™

Figure 2, ARM CoreSight™ debugging environment

ARM CoreSight™ is an on-chip component developed by ARM to support multi-core cross triggering, which allows a core that hits a breakpoint to break all other cores. This is done by a general Cross Trigger Matrix (CTM) and an individual Cross Trigger Interface (CTI) on each core. ARM has developed CTIs for the ARM9, ARM11 and Cortex families. The CTI is used for debug control and for viewing ARM core status and registers.

CoreSight™ also supports trace dump. Each core dumps its trace through its own Embedded Trace Macrocell (ETM) onto the Advanced Microcontroller Bus Architecture (AMBA) Trace Bus (ATB) and on to the trace port through the Trace Port Interface Unit (TPIU). The trace dump can further provide complete core information for debug purposes. DS-5, the ARM debugging tool for CoreSight™ and beyond, controls program debug and trace dump through DSTREAM, the In-Circuit Emulator (ICE) of CoreSight™, via the Joint Test Action Group (JTAG) port, the Debug Access Port (DAP) and the Debug APB, and into the ETM and CTI.

CoreSight™ does not restrict its support to ARM core families. By following the ETM and CTI protocols, one can also develop an ETM and CTI for other cores such as the DSP in Figure 2. Such an external core is then controlled by DS-5 and integrated into the debug environment. This is how we hook the on-chip test architecture introduced in the next section to CoreSight™.

3. On-Chip Debug Architecture

Following Moore’s Law, integrated circuit (IC) technology doubles its gate density every eighteen months. At the 28nm node the gate density has reached 4.2M gates per mm². With such a high capacity we can start to consider putting self-testing ability onto the chip. The development of on-chip test architecture is studied in [5, 6]. In this work we leverage this on-chip test architecture for debug purposes as well. The side band test bus and test port can be used for component core register inspection [7].
By adding a multiple clock gating and stepping mechanism, we can implement hardware break, component register data viewing and cycle stepping to support hardware debug capability.

3.1 Overall On-Chip Debug Architecture

Figure 3, Overall architecture of SoC debug platform

Figure 3 shows the overall architecture of the on-chip debug platform, which consists of both software and hardware components. The embedded processor (ARM 1176) executes the software program, controlled through the JTAG port and ICE by the debugger tool on the PC-host. The instruction memory stores the instructions to be executed, while the CUD (Core Under Debug) data memory stores the data required by the Intellectual Property (IP) application and the operational results of the IP. The IP cores are wrapped with IEEE 1500 wrappers [8] that support core-level testing with parallel scan capability. The Test Access Mechanism Controller (TAM Controller or TAMC) generates debug control signals to control the debug procedure for the IP cores. It also buffers the traced data and stores them in a local memory. The dedicated test bus connects the wrapped CUDs with the TAMC to transfer the control signals and the traced data.

To integrate the debug platform with the ARM CoreSight™ on-chip debug and trace architecture, a customized CTI module and an AHB-APB bridge are added to this platform. The CTI module can deliver a debug request signal (DBGRQ) that makes the TAM Controller enter debug mode. In debug mode, the TAM Controller stops the CUD when the breakpoint is hit and dumps the contents of the CUD to the local memory. It can also compare the obtained data with golden data retrieved from external or embedded memory. These operations are controlled by a debug tool called DASTEP, which resides on the PC-host. The user can thus examine the test results immediately. After finishing the debug function, the TAM Controller delivers an acknowledge signal (DBGACK) to the CTI module. The bus bridge is needed because the CTI module is compliant with the AMBA APB protocol.

The software is composed of a user-provided application program and a debug program. The application program executes the functional operation of the system. The debug program contains the setup data to initialize the TAM Controller and is also generated by DASTEP.

3.2 Multiple Clock Gating and Stepping

The main issue in gating and stepping multiple clocks is clock synchronization. As an example, for two cores with 100MHz and 125MHz clocks respectively, we can find synchronous positive edges every 0.5 milliseconds.
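To make the notion of synchronous positive edges concrete, the following minimal sketch computes how often the rising edges of two clocks coincide. It assumes idealized, phase-aligned clocks whose frequencies are exact integers in Hz, in which case the edges coincide once every 1/gcd(f1, f2) seconds. The function names are hypothetical, and a real design with derived clocks, drift or additional gating constraints may see a much longer interval between usable synchronous edges than this idealized figure.

```python
from fractions import Fraction
from math import gcd


def coincidence_period_ns(f1_hz: int, f2_hz: int) -> Fraction:
    """Interval, in ns, between coincident rising edges of two ideal,
    phase-aligned clocks with integer frequencies (an assumption)."""
    return Fraction(10**9, gcd(f1_hz, f2_hz))


def cycles_per_interval(f_hz: int, interval_ns: Fraction) -> Fraction:
    """Number of cycles a clock of frequency f_hz completes in interval_ns."""
    return interval_ns * f_hz / 10**9


if __name__ == "__main__":
    f_a, f_b = 100_000_000, 125_000_000  # the two example clocks
    period = coincidence_period_ns(f_a, f_b)
    print("coincidence interval (ns):", period)
    print("cycles of clock A per interval:", cycles_per_interval(f_a, period))
    print("cycles of clock B per interval:", cycles_per_interval(f_b, period))
```

Whatever the interval is in a given design, every core advances by several cycles between two synchronous edges, which is why stepping only at those edges gives coarse debug resolution.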
When we gate at the first synchronous positive edge, every synchronous step has to be 0.5 milliseconds away. Too many events can happen in this period of time and the resolution is too low for meaningful hardware/software debug. We have investigated this problem by carefully examining the relationship between the clock rates of interacting cores and are now able to identify many more cycles at which the clocks can be “safely” stopped and resumed. Thus, instead of breaking at synchronous positive edges and stepping by the least common multiple of the clock periods, we gate the clocks at the identified safe instances, which are usually the same as or just a few cycles away from the breakpoints. With this breaking mechanism, even though the clocks may have different phase shifts relative to their last positive edges, they can be resumed synchronously and continue correctly without any glitches.

3.4 DASTEP and Debug Procedure

DASTEP is the control Graphical User Interface (GUI) of the on-chip debug architecture. It can set hardware breakpoints on components, view control register values and cycle step the system. DASTEP can also dump and view component trace data.

To run a debugging procedure with DASTEP, we first determine which CUD in which clock domain is to be observed. Then we set the cycle-based hardware breakpoint and wait for the traced data. When the breakpoint matches, the traced data are transferred to the PC-host through the UART. The traced data can then be displayed and compared to the golden data. In the following, the debug procedure using the GUI is described in detail.

Figure 4 shows an overview of the graphical user interface of DASTEP. Clicking the MCD hardware Breakpoint item under the Debug Platform entry opens the setup window shown in Figure 5, which allows the user to select the cores to be debugged (①), to set the first breakpoint (②), and to select the master clock domain used as the reference for the breakpoint cycle (③). The user can then click “Apply” to enter the debug information, “Run” to start the debug session or “Cancel” to cancel the setup information.

Figure 4, Overview of graphic user interface
Figure 5, Setup window of MCD hardware breakpoint insertion

After the debug session starts, the PC-host waits to receive the traced data. Once the breakpoint occurs, the traced data are displayed in a control and trace window as shown in Figure 6. The user can now examine the traced data in the window (①). Four operations are supported here (③): “Browse & Save” to store the traced results, “Select Register” to select the registers to be displayed, “Terminate Debug Mode” to continue the functional operation of the CUD, and “Cancel” to cancel the control and display window. After examining the information of the current breakpoint, the user can continue with the next breakpoint set-up (②) by entering the next breakpoint cycle in the “Next Breakpoint Cycle” column, clicking “Run” to resume the system and let it stop at the next breakpoint, clicking “Receive” to wait for and receive the traced data, or clicking “Single Step” to continue the debug session in a cycle-by-cycle manner. It is worth mentioning that an open source tool, GTKWave, is employed to show the trace data in a waveform-based display, as shown in Figure 7.
Figure 6, Control and display window
Figure 7, Waveform-based displays of traced data

4. AHB/AXI Bus Monitor

The bus monitor consists of a protocol checker, a bus tracer and a trace memory. Figure 8 shows the bus monitor architecture in the red block modules. The protocol checker detects bus protocol errors or inefficiencies in real time. The bus tracer captures on-chip bus signals at many levels of abstraction and performs real-time compression. The trace memory stores the compressed traces. The protocol checker and the bus tracer can collaborate with each other. For example, when the protocol checker detects a bus protocol error, it triggers the bus tracer to start or stop its monitoring activity and store the trace data in the trace memory.

In [9] an AHB bus monitor is developed. Later the technology evolved and the AXI bus is also supported. These two works developed a hardware VIP to help verify components with AHB and/or AXI interfaces. Figure 9 shows the AXI trace verification method. In the simulation environment, the AXI VIP produces cycle-accurate AXI interconnect behavior. The AXI tracer passively captures signals from the VIP and compresses the trace data, which are stored in the trace memory. The bus analyzer decompresses the trace result. To verify the AXI tracer, we compared the data dumped directly from simulation with the decompressed trace result.

Similarly, the verification of the AXI protocol checker is also based on the AXI VIP [10]. The monitor performed AXI rule checking and reported an error message when an AXI violation was found. A circular buffer then dumped its contents, a bus trace of around 1000 cycles before the violation. The bus monitor is available for both AHB and AXI buses [9-10].

Figure 8 shows the AXI monitor integrated into the SoC debug platform. The AXI bus is the center of the SoC. The AHB bus is used as a debug bus for the SoC.
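The circular trace buffer mentioned above can be pictured with a small software model. This is only an illustrative sketch and not the hardware design of [9, 10]: the class and function names are hypothetical, the depth of 1000 entries mirrors the approximate figure quoted in the text, the rule check is passed in as a plain predicate, and the real tracer additionally compresses the trace, which is omitted here.

```python
from collections import deque


class TraceBuffer:
    """Fixed-depth circular buffer that keeps only the most recent bus cycles."""

    def __init__(self, depth=1000):
        # A deque with maxlen silently discards the oldest entry when full.
        self._buf = deque(maxlen=depth)

    def record(self, cycle, signals):
        self._buf.append((cycle, signals))

    def dump(self):
        """Return the buffered trace, oldest cycle first."""
        return list(self._buf)


def monitor(bus_cycles, is_violation, depth=1000):
    """Record every cycle; on the first violation, return its cycle number
    together with the trace leading up to and including it."""
    buf = TraceBuffer(depth)
    for cycle, signals in enumerate(bus_cycles):
        buf.record(cycle, signals)
        if is_violation(signals):
            return cycle, buf.dump()
    return None, []


if __name__ == "__main__":
    cycles = ["ok"] * 1500 + ["bad"] + ["ok"] * 10
    hit, trace = monitor(cycles, lambda s: s == "bad")
    print("violation at cycle", hit, "with", len(trace), "cycles of history")
```

Because the buffer overwrites its oldest entries, the memory cost stays constant no matter how long the system runs before a violation occurs.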
The PCI interface is a communication channel between the debug bus of the SoC and the debug software running on a PC. The debug commands from the debug software are translated into AHB master commands by a PCI2AHB transactor to configure and access the AXI monitor components. Once an error has been signaled, the debug architecture notifies CoreSight™ and breaks the whole SoC. The user can then cycle step and view program, processor core, component and bus data and status to identify the problem.

Figure 8, Integration of the bus monitor in the SoC debug platform

Figure 9 shows the bus monitor analyzer software, which runs on the debug PC to configure and access the bus monitor. It has four windows: (1) a multi-resolution waveform viewer, (2) an access control signal analyzer, (3) an address/data timing distribution analyzer, and (4) a bus state transition analyzer.

Figure 9, Bus monitor analyzer

5. Multi-core Software/Hardware Co-debug Platform

The multi-core software/hardware co-debug platform described in the previous sections was only the hardware side of the system; integration with the software debugger is also required. Our first step was to connect to CoreSight™ using DS-5 and DSTREAM. We then set breakpoints in the ARM1176 programs and broke all four ARM1176 cores and the on-chip test architecture once a breakpoint was hit. With DS-5 we can inspect program data. With DASTEP we can view component control register data, and with the AHB/AXI bus monitor we can dump and view bus trace data. The final step was to use DASTEP to set hardware breakpoints that also break the program execution on all four ARM1176 cores for debug purposes. A user-friendly multi-core software/hardware co-debug platform is thus completed.

6. Experiment

We use a 3D depth map generation project to verify the co-debug platform. The front end of the 3D depth map generation is a high profile H.264 decoder [11, 12]. The 3D depth map generation [13, 14] transforms 2D video images into a 3D view with a depth map for 3D video viewing. The development started with single-threaded C++ code and moved to MIMD programs. With the help of the multi-core co-debug platform we realized that a hardware-implemented H.264 decoder is needed alongside 3D depth map generation on all four ARM1176 cores to play a high profile video in real time.

6.1 Algorithm of 3D Depth Map Generation

Figure 10 shows the 3D depth map generation algorithm. It generates depth maps of good quality for most 2D images. In addition, the processing steps in the proposed algorithm have been optimized to reduce complexity while preserving good quality. The low complexity techniques employed are introduced below.

Figure 10, Proposed depth map generating algorithm

We first use a Sobel mask to obtain the edge information of the input image, which is used to detect vanishing lines in the next step. We optimize the Sobel mask formula to reduce the computational complexity by about 65% while maintaining the quality of the results.
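For readers unfamiliar with the edge detection step, the sketch below applies the standard 3×3 Sobel masks to a grayscale image stored as a list of rows. It is not the optimized formula of this work, which is not reproduced here. As an assumption for illustration, it uses the common low-cost magnitude approximation |Gx| + |Gy| in place of the square root of the sum of squares, and it leaves border pixels at zero. The function name is hypothetical.

```python
def sobel_edges(img):
    """Return the |Gx| + |Gy| edge strength for each interior pixel of img,
    a list of equally long rows of integer gray levels."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column, weights 1-2-1.
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]) \
               - (img[y - 1][x - 1] + 2 * img[y][x - 1] + img[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row, weights 1-2-1.
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]) \
               - (img[y - 1][x - 1] + 2 * img[y - 1][x] + img[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out


if __name__ == "__main__":
    # A vertical step edge between the second and third columns.
    step = [[0, 0, 10, 10] for _ in range(4)]
    for row in sobel_edges(step):
        print(row)
```

Replacing the square root with a sum of absolute values is one typical way to cut the per-pixel cost of Sobel filtering on embedded cores, at the price of a slightly anisotropic response.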
Then, we use the 5×5 Hough transform to detect vanishing lines. After the Hough transform, we classify the input images into three types: Normal (with a vanishing point), Scenery (with sky or mountain), and Close-up. Based on this classification, we use a different method for each type to generate a depth map of good quality.

For the Normal type, we calculate the intersection points of the vanishing lines. After calculating all intersection points, we use an 8x8 region to group the nearest points in the image; this region is called the vanishing region (VR). We then generate the Gradient Depth Map (GDM) according to the distance between every pixel and the VR. For the Scenery type, we define the top of the image as the VR and assign a static GDM, since the sky or mountain is always at the top of the image. For the Close-up type, we only apply block-based contrast filtering to separate the background and foreground objects. Finally, Joint Bilateral Filtering (JBF) is used to post-process the merged depth map by strengthening the edge information of the objects with reference to the original image.

We optimize each step in the proposed 3D depth map generation algorithm and achieve about a 90% reduction in complexity, measured in execution time, compared with the original steps.

6.2 Parallelization of 3D Depth Map Generation

To realize the 3D depth map generation on the multi-core platform, we propose a parallel 3D video playing system as shown in Figure 11. In this system, we use one thread to decode the H.264 video, three threads to generate 3D depth maps, and one thread to collect the depth map from each 3D depth map generator. At the front end of the proposed system, the H.264 decoder decodes the bit-stream and produces video.
Next, the decoded video is delivered to the 3D depth map generators frame by frame. Finally, the pseudo display collects all the depth maps in order.

Figure 11, Proposed parallel 3D video playing system

To ensure correct execution of the proposed parallel 3D video playing system, we establish a synchronization mechanism among the threads. We use a synchronized FIFO to connect any two threads whenever one of them has to deliver data to the other. As shown in Figure 12, the proposed synchronized FIFO is essentially a circular FIFO based on a producer-consumer mechanism. The front end of the synchronized FIFO is connected to the thread that plays the producer, while the rear end is connected to the thread that plays the consumer. At the producer end, data can be written to the FIFO at any time unless the FIFO is full. Similarly, data can be read from the FIFO at the consumer end at any time unless the FIFO is empty. A thread that is not permitted to access the FIFO has to wait until it obtains permission. With such a mechanism, we can easily realize the synchronization in the proposed parallel 3D video playing system. In the following, we describe how the proposed synchronized FIFO achieves the synchronization between two threads.

Figure 13 shows the pseudo code for the synchronization at the producer end of the synchronized FIFO. First, the thread at the producer end checks whether the FIFO is full. If the FIFO is full, the thread has to wait until it is no longer full. After getting the access permission, the thread writes its data to the FIFO. Finally, the thread calls a confirmation function that updates the information recorded in the FIFO and issues a notification signal to the thread at the consumer end of the FIFO. Similarly, Figure 14 shows the pseudo code for the synchronization at the consumer end of the synchronized FIFO. The thread at the consumer end performs almost the same steps as the producer end, except that it checks whether the FIFO is empty rather than full.

Figure 12, Proposed synchronized FIFO
Figure 13, Pseudo code for the synchronization at the producer end of the synchronized FIFO
Figure 14, Pseudo code for the synchronization at the consumer end of the synchronized FIFO

6.3 Performance Evaluation

In this section, we discuss the performance improvement of the proposed parallel 3D video playing system. Figure 15 shows the performance for different configurations. The test video is in CIF resolution and contains 300 frames in total. With one thread for H.264 decoding and three threads for 3D depth map generation, the proposed system achieves 27.75 fps 3D video display in CIF resolution.

Figure 15, Performance evaluation for the proposed parallel 3D video playing system

7. Conclusion and Future Work

In this work we have built a generic multi-core software/hardware co-debug platform, which is a framework for designing future multi-core SoCs with MIMD programming and debug support. The hardware system designed in this work is a prototype and not a stand-alone multi-core SoC. We would like to deploy the platform as an IP on a commercial multi-core SoC. With this platform we would also like to uncover physical multi-core programming issues that have not been observed before, such as race conditions and deadlocks.
The outcome of this research can be applied to validate many distributed computation theories and to enhance algorithms to solve many more problems.

Reference
[1] M. Flynn, “Some Computer Organizations and Their Effectiveness,” IEEE Trans. Computer, C-21: 948, 1972
[2] A. Mayer, H. Siebert and C. Lipsky, “Multicore Debug Solution IP,” an IPextreme white paper 2007, http://www.ip-extreme.com/downloads/MCDS_whitepaper_070523.pdf
[3] A. Su, “Application of ESL Synthesis on GSM Edge algorithm for base station,” Proc. ASP-DAC’10, January 2010, pp. 732~737
[4] CoreSight™ Components Technical Reference Manual, http://infocenter.arm.com/help/topic/com.arm.doc.ddi0314h/DDI0314H_coresight_components_trm.pdf
[5] K-J Lee, C-Y Chu and Y-T Hong, “An Embedded Processor Based SOC Test Platform,” Proc. International Symposium on Circuits and Systems, pp. 2983~2986, 2005
[6] W-C Huang, C-Y Chang and K-J Lee, “Toward Automatic Synthesis of SoC Test Platform,” Proc. VLSI Design, Automation and Test, pp. 1~4, 2007
[7] K-J Lee, S-Y Liang and A. Su, “A Low-Cost SoC Debug Platform Based on On-Chip Test Architecture,” Proc. SOC Conference, pp. 161~164, 2009
[8] IEEE, “1500-2005 IEEE Standard Testability Method for Embedded Core-Based Integrated Circuits,” E-ISBN 0-7381-4694-3, print ISBN 0-7381-4693-5, IEEE 2005
[9] Y-T Lin, W-C Shiue and I-J Huang, “A Multi-resolution AHB Bus Tracer for Real-time Compression of Forward/Backward Traces in a Circular Buffer,” Proc. DAC’08, pp. 862~865, 2008
[10] C-H Chen, J-C Ju and I-J Huang, “A Synthesizable AXI Protocol Checker for SoC Integration,” IEEE International SoC Design Conference (ISOCC’10), Incheon, Korea, Nov. 2010
[11] Y-C Yang and J-I Guo, “A High Throughput H.264/AVC High Profile CABAC Decoder for HDTV Applications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 9, pp. 1395-1399, September 2009
[12] K Xu, T-M Liu, J-I Guo and C-S Choy, “Methods for Power/Throughput/Area Optimization of H.264/AVC Decoding,” Journal of Signal Processing Systems, vol. 60, no. 1, pp. 131-145, July 2010
[13] C-A Chien, C-Y Chang, J-S Lee, J-H Chang and J-I Guo, “Low Complexity 3D Depth Map Generation for Stereo Applications,” Proc. 2010 VLSI Design/CAD Symposium, Kaohsiung, Taiwan, August 3-6, 2010
[14] C-A Chien, C-Y Chang, J-S Lee, J-H Chang and J-I Guo, “Low Complexity 3D Depth Map Generation for Stereo Applications,” Proc. ICCE’11, Jan. 9-12, Las Vegas, USA, 2011