NIOS II Processor.ppt
1.
2. Outline
What is a “Soft” Processor?
What is the NIOS II?
Architecture for NIOS II, what are the
implications
• TigerSHARC VS. NIOS II
• Pipeline Issues
• Issues related to FIR
Hardware acceleration, using FPGA
logic
3. What is a “Soft”
Processor?
Processor implemented in VHDL, Verilog,
etc., and downloaded onto FPGA hardware
Can implement many parallel processors
on one FPGA
Can use additional FPGA resources on the
same chip that are not part of the processor
core.
NIOS II is a “Soft” Processor
4. Why “Soft” Processor?
Higher level of design reuse
Reduced obsolescence risk
Simplified design update or change
Increased design implementation
options
Lower latency between processor and
FPGA components
5. What is NIOS II?
Software-defined processor
The processor core is loaded onto
FPGA
Programmed using ‘normal’
programming tools (C, asm), not
hardware description languages
Can use the rest of the FPGA hardware
for accelerating parts of the code
6. How Is NIOS II
Implemented
The custom FPGA logic that interacts
with the processor is implemented in
Altera Quartus II
The Avalon Interface bus (common
instruction/data bus) is implemented in
Quartus II
The architecture is generated in Quartus
II and used for programming in Eclipse
IDE
7.
8. NIOS II IDE
Coding is implemented in Eclipse rather than
VisualDSP.
9. The Different NIOS II Cores
There are 3 cores available from Altera
NIOSII/e: Economical Core
NIOSII/s: Standard Core
NIOSII/f: Fast Core
10. What’s the Difference between
the Cores?
An LE is equivalent to an 8-1 NAND gate + 1 D flip-flop
An ALM is equivalent to 2 LEs
13. NIOS II Architecture
-thirty two 32-bit general registers, six 32-bit control registers
-variable cache based on how much FPGA space you have
-ALU: 32-bit, two inputs to one output; does shifts, logic and arithmetic. The shifter is
not separate as in TigerSHARC
14. Avalon Interface
-separate address, data and control lines
-up to 1024-bit data width transfer, can be set to any width (not power of 2)
-one transfer per clock cycle.
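As a rough illustration of what "one transfer per clock cycle" implies, peak throughput is simply the configured data width times the bus clock. The sketch below assumes the best case (no wait states); the function name and the example figures in the usage note are illustrative, not taken from the Avalon specification.

```c
#include <stdint.h>

/* Peak Avalon throughput in bytes per second, ASSUMING one transfer
 * completes on every clock cycle with no wait states (best case).
 * width_bits: configured data width (the slide gives 1024 as the maximum).
 * clock_hz:   bus clock frequency in Hz. */
uint64_t avalon_peak_bytes_per_sec(uint32_t width_bits, uint64_t clock_hz)
{
    return (uint64_t)width_bits * clock_hz / 8u;
}
```

For example, a hypothetical 32-bit interface clocked at 100 MHz would peak at 400,000,000 bytes per second under these assumptions.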
15. NIOS II/f pipeline
Six stages
One instruction can be dispatched and/or
retired per cycle
Dynamic branch prediction: 2-bit branch
history table (no BTB like in TigerSHARC)
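To make the 2-bit history idea concrete, here is a minimal sketch of a single 2-bit saturating counter, the usual building block of such a table. It is a generic textbook model, not the documented NIOS II/f implementation: the state encoding, the initial state and all names are assumptions.

```c
#include <stdbool.h>

/* One 2-bit saturating counter (ASSUMED encoding):
 * states 0 and 1 predict "not taken", states 2 and 3 predict "taken". */
typedef struct { unsigned state; } counter2_t;

bool counter2_predict(const counter2_t *c)
{
    return c->state >= 2u;
}

void counter2_update(counter2_t *c, bool taken)
{
    if (taken) {
        if (c->state < 3u) c->state++;
    } else {
        if (c->state > 0u) c->state--;
    }
}

/* Count mispredictions for a loop-closing branch that is taken
 * (iterations - 1) times and then falls through once. */
unsigned loop_mispredictions(unsigned iterations, unsigned initial_state)
{
    counter2_t c = { initial_state };
    unsigned misses = 0;
    for (unsigned i = 0; i < iterations; i++) {
        bool taken = (i + 1u < iterations);
        if (counter2_predict(&c) != taken) misses++;
        counter2_update(&c, taken);
    }
    return misses;
}
```

Starting from state 0 (strongly "not taken"), an 8-iteration loop gives `loop_mispredictions(8, 0) == 3`: two misses while the counter warms up and one on the final fall-through.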
16. NIOS II/f pipeline
The pipeline stalls for:
• Multi-cycle instructions
• Cache misses
• Data dependencies (2 cycles between
calculating and using result)
Mispredicted branch penalty: 3 cycles
17.
18. Hardware multiply
Can use different options for multiplier
(at the processor design stage)
No h/w multiply (saves FPGA gates)
○ Speed depends on algorithm
Use embedded multipliers (if FPGA has
those)
○ 1-5 cycles (depends on FPGA)
Implement multipliers on FPGA gates
○ 11 cycles
Division 4-66 cycles on hardware
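For the "no h/w multiply" option, the multiplication has to be done in software, which is why its speed depends on the algorithm. A minimal sketch of one classic choice, shift-and-add, is below; it is only an illustration and is not claimed to be the routine the NIOS II toolchain actually uses. The function name is hypothetical.

```c
#include <stdint.h>

/* Shift-and-add multiply returning the low 32 bits of a * b.
 * The loop ends as soon as the remaining multiplier bits are zero,
 * so the run time depends on the value of b (at most 32 iterations). */
uint32_t soft_mul32(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0u) {
        if (b & 1u) {
            result += a;   /* add the current shifted multiplicand */
        }
        a <<= 1;           /* next bit weighs twice as much */
        b >>= 1;
    }
    return result;
}
```

Because the iteration count tracks the position of the highest set bit in `b`, small multipliers finish quickly while large ones take the full 32 passes.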
19. Compare to TigerSHARC
No support for parallel instructions
No support for SIMD operations
Multicycle instructions stall the pipeline
All the above limitations can be overcome
by using FPGA space unoccupied by the
processor itself
22. Speed analysis
0        movi r4,8               i = 8
1  Loop: ldw  r2,0(r6)           load data
2        ldw  r3,0(r7)           load coefficient
3        addi r4,r4,-1           i--
4        addi r6,r6,4            coeffPt++
5        mul  r2,r2,r3           data = data * coeff
6        addi r7,r7,-4           dataPt--
7        stall                   data stall: waiting for multiplication result
8        add  r5,r5,r2           output += data
9        bne  r4,zero,0x10002a0  will mispredict 2 times in the beginning and 1 time at the end of the loop (wastes 3 cycles each time)
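For readers more comfortable with C, the loop above corresponds roughly to the multiply-accumulate below. This is a sketch, not the original lab code: the listing's comments disagree about which register points at data and which at coefficients, so the two arrays are given the neutral, hypothetical names `fwd` (walked upward, like r6) and `rev` (walked downward, like r7).

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of fwd[k] * rev[taps - 1 - k] for k = 0 .. taps - 1.
 * fwd is read front to back, rev is read back to front.
 * Arithmetic is done in uint32_t so it wraps like the 32-bit
 * mul/add instructions instead of overflowing a signed int. */
int32_t fir_mac(const int32_t *fwd, const int32_t *rev, size_t taps)
{
    uint32_t acc = 0;
    for (size_t k = 0; k < taps; k++) {
        acc += (uint32_t)fwd[k] * (uint32_t)rev[taps - 1u - k];
    }
    return (int32_t)acc;
}
```

With `taps` set to 8 this performs the same eight multiply-accumulate steps as the listing, one per loop iteration.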
23. Speed analysis
9 cycles per iteration except the first two
(branch predicted not taken) and the last
(branch predicted taken) – those will be
9+3=12 cycles
1 data stall – can be removed by moving the
instruction from line 4 to line 7
Speed: 8 cycles * (N-3) + 11 cycles * 3 =
8*(N-3)+33 cycles
For 1024-tap FIR: 8201 cycles
Clock cycle is 3 times longer (200MHz vs
600MHz)
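The arithmetic on this slide can be checked with a few lines of C. The sketch below just encodes the slide's own formula and the 200 MHz vs 600 MHz clock ratio; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Slide estimate: 8 cycles per iteration once the data stall is removed,
 * except for 3 mispredicted iterations that cost 8 + 3 = 11 cycles each. */
uint64_t nios2_fir_cycles(uint64_t taps)
{
    assert(taps >= 3u);   /* the formula assumes at least 3 iterations */
    return 8u * (taps - 3u) + 11u * 3u;
}

/* Express a cycle count from a slow clock in cycles of a faster clock. */
uint64_t equivalent_cycles(uint64_t cycles, uint64_t slow_mhz, uint64_t fast_mhz)
{
    return cycles * fast_mhz / slow_mhz;
}
```

`nios2_fir_cycles(1024)` gives 8201, and `equivalent_cycles(8201, 200, 600)` gives 24603 cycles at 600 MHz.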
24. Speed comparison
• 8201 NIOS II cycles equivalent to 24603
TigerSHARC cycles
• Lab3 timing:
– 56000 cycles Debug mode
– 13000 unoptimized ASM
– 4000 Optimized ASM
Worse than unoptimized assembly, but no
hardware acceleration used, so this is not
that bad
25. Hardware Acceleration
Profiling tool in Eclipse can show how
long each function takes
If function takes too long, it can be sped
up by
Custom instructions
Hardware Acceleration
Hardware acceleration means taking the
function and transforming it into FPGA
circuitry
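When deciding which profiled function is worth moving into hardware, a common back-of-the-envelope estimate is Amdahl's law. It does not appear in the original slides; the sketch below is offered only as one way to read profiler output, and the function name is hypothetical.

```c
#include <assert.h>

/* Amdahl's law: overall speed-up when a function that takes `fraction`
 * of total run time (0.0 to 1.0, e.g. from the profiler) is made
 * `factor` times faster by hardware acceleration. */
double overall_speedup(double fraction, double factor)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(factor > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}
```

For instance, accelerating a function that accounts for half of the run time by 2x yields only about a 1.33x overall gain, which is why profiling first matters.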
26. Hardware Acceleration
Can be done using C2H compiler from Altera
Trades off Logic Size for Speed up.
Table 1. User Application Results Example
Algorithm                     Speed Increase        System fMAX   System Resource
                              (vs. Nios II CPU)     (MHz)         Increase (1)
Autocorrelation               41.0x                 115           124%
Bit Allocation                42.3x                 110           152%
Convolution Encoder           13.3x                 95            133%
Fast Fourier Transform (FFT)  15.0x                 85            208%
High Pass Filter              42.9x                 110           181%
Matrix Rotate                 73.6x                 95            106%
RGB to CMYK                   41.5x                 120           84%
RGB to YIQ                    39.9x                 110           158%
27. Conclusion
“Soft” Processors such as the NIOSII
offer another alternative in the
embedded system scene.
The NIOSII offers the advantage of
added configurability and customization
that blur the line between FPGAs and
DSPs
28. References
[1] http://www.fpgajournal.com/articles/behere.htm
Describes an FPGA-DSP project based on Altera Nios
[2] http://www.altera.com/products/ip/processors/nios2/ni2-index.html
Official Nios II page
[3] http://www.hunteng.co.uk/dsp-fpga.htm
DSP or FPGA? What is better when?
[4] http://www.hunteng.co.uk/pdfs/tech/DSP1736FPGA.pdf
Article from Xilinx about FPGA DSPs
[5] http://www.niosforum.com
Community forum for NIOS
[6] http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf
NIOSII Processor Handbook –Altera Corporation
[7] http://www.altera.com/literature/manual/mnl_avalon_spec.pdf
Avalon Memory-Mapped Interface Specifications – Altera Corporation
[8] http://www.analog.com/en/prod/0,2877,ADSP%252DTS201S,00.html
ADSP-TS201S 500/600 MHz TigerSHARC Processor with 24 Mbit on-chip embedded
DRAM
Editor's Notes
Intro: Traditionally we have a DSP, and it interacts with other modules, usually other ASICs. Then we have SoCs, which integrate other logic to improve latency. Now we have FPGAs, which add reconfigurability. Well, we want to integrate that too. SOPC: system on a programmable chip. This is what the NIOS II is supposed to do. What happens when we want to integrate a DSP on an SOPC system? (we have a thing called a hard processor)
Yay outline! Basically, the concept, and what it looks like in software
Similar to how a Verilog circuit can be put on an FPGA to allow for high configurability, a soft processor is a processor implemented on an FPGA. This is different from a hard processor, which is a processor implemented in hardware. A soft processor is a logical schematic (software) that can be loaded onto any FPGA. So a soft processor isn’t really a processor, just a schematic (or code, like software). This gives it all the advantages of software, such as delivering updates and improving the development cycle.
Well, why do you want to do this? Isn’t an fpga slower clocked, high power consumption…
No, not more power hungry because it can be better customized for the application, slower clocked doesn’t mean slower, it means more has to be done in a cycle, and an fpga allows the developer to customize it to make instructions finish in one cycle. Plus you get all the other advantages.
It is a special schematic designed by altera that interacts very well with other altera IP mega blocks.
Well, if the processor is in software, how do you write programs for it? So are you basically writing software for software? Doesn’t this seem somewhat redundant? Yes, exactly, it does seem a bit redundant. But it is the current model for soft processors right now; perhaps there will be a better programming environment for it later. What you need to do is write the processor (bus and FPGA logic) in software first using Quartus, make an emulation file, and use that to write your DSP program in Eclipse. (there is no hardware optimizer, like an assembler optimizer)
Here is what it looks like for quartus. You need to define the schematic. At the top you have your clock source. The middle is your avalon interface, and the bottom is your FPGA logic.
Here is your NIOS II IDE environment. Now you take your emulated file and program for it like VDSP. So if the processor is in software, does that mean you can do simulation analysis, and not hardware like in the labs? No… you can run the generated processor on an FPGA and have this connect to the FPGA when it runs.
So exactly, what does altera give you as the basic architecture for you to customize?
3 cores of different features. Here are the specs…
Notice it is very similar to a MIPS processor we learned in other classes.
Print off sheet to list the architecture features
Print sheet to list of architecture
All the ports on the right actually share one bus, the Avalon architecture.
-separate address, data and control lines. No need to decode data for address.
-up to 1024-bit data width transfer, can be set to any width (not power of 2)
-synchronous operation
-dynamic bus sizing: this means no extra design consideration when addressing items that have different bus widths.
-one transfer per clock cycle.
-The Avalon Interface is basically an interface that creates a common interface from the different interfaces of all the memory and peripheral components of the system.
Are there bus issues because it’s one common interface? No… it’s a special interface, with dedicated memory ports.
Cost Vs. Performance:
NIOS II: package is $495 for a year + $150 for Cyclone II FPGA; C2H is $3000/computer
TigerSHARC: VDSP is $3500/computer + $750 for TigerSHARC evaluation board