VLSI DESIGN PROJECT IN VHDL
PIPELINE STALLING WITH CLOCK
KOSURU SAI MALLESWAR
CONTENTS
1. Objectives
2. Pipelining – definition
3. Modules of the project
4. Coding technique
5. VHDL files
   i. Register.vhd
   ii. Multiplexer.vhd
   iii. Pipelinestalling.vhd
   iv. Alu.vhd
   v. Pipelinedmultiplier.vhd
6. User constraints file for FPGA
7. Simulation waveforms – Performance Analysis
8. Applications
9. Conclusions
1. Objectives
1. To design the “pipeline stalling system” used in computers and other digital electronic devices to increase their instruction throughput, i.e., to program a series of registers that move data from one stage to the next on a common clock.
2. To program an ALU that fetches the opcode and operands in a pipelined sequence and executes the operations.
3. To program a pipelined multiplier for the ALU that multiplies 32-bit numbers using the “partial multiply, shift and add” algorithm.

2. Pipelining - definition
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. The computer pipeline is divided into stages. Each stage completes a part of an instruction in parallel with the other stages. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline refers to the fact that each step is carrying data at once, and each step is connected to the next.

The transfer of data from one stage to the next is scheduled by a “clock”. Most modern CPUs are driven by a clock. The CPU consists internally of logic and registers (flip-flops). When the clock signal arrives, the flip-flops take their new values and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives and the flip-flops again take their new values, and so on.

Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction throughput. The throughput of the instruction pipeline is determined by how often an instruction exits the pipeline. Because the pipe stages are hooked together, all the stages must be ready to proceed at the same time.
The performance of a pipelined processor may vary widely between different programs.
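
The throughput argument above can be made concrete with an idealized timing model. The symbols k (number of stages), n (number of instructions), t_i (delay of stage i) and t_r (register overhead) are assumptions introduced here for illustration, and stalls are ignored:

```latex
\begin{align*}
\tau &= \max_{1 \le i \le k} t_i + t_r
  && \text{(clock period set by the slowest stage)} \\
T_{\text{unpipelined}} &= n \sum_{i=1}^{k} t_i \\
T_{\text{pipelined}} &= (k + n - 1)\,\tau
  && \text{($k$ cycles for the first result, then one per cycle)} \\
S &= \frac{n \sum_{i=1}^{k} t_i}{(k + n - 1)\,\tau}
  \;\longrightarrow\; \frac{\sum_{i=1}^{k} t_i}{\tau} \le k
  \quad \text{as } n \to \infty
\end{align*}
```

Under this model, a three-stage pipeline (k = 3) with perfectly balanced stages and negligible register overhead would reach S = 300/102 ≈ 2.94 for n = 100 instructions. Each stall cycle adds one τ to the pipelined time, which is one reason the achieved speedup differs from program to program.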
3. Modules of the project
1. Stalling pipeline architecture with registers accepting data on rising edges
   [Figure: Logic diagram of the stalling pipeline architecture with registers accepting data on rising edges]
2. ALU that can fetch the opcode and operands in a pipelined sequence
3. Pipelined multiplier that multiplies two 32-bit numbers using the partial multiply, shift and add algorithm
4. Coding technique
When storage elements accept data on a rising clock edge, initialize the clock to 0 so that a transition does not occur at time zero. The three registers R1, R2 and R3 sit in the three stages of the processor, namely the fetch unit, the decode unit and the execute unit. The registers point to the location from which program code is being read. The stall clock is the “OR” of the clock and the stall signal.

While stall is low, the stall clock follows the clock. At each rising edge R1 is updated and points to the next location, the data in R1 moves to R2, and the data in R2 moves to R3.

While stall is high, the stall clock in the listing of section 5 stays high, so R1 and R2 see no rising edge and hold their contents. R3 is clocked by the master clock and still updates, but it does not receive an instruction: it receives zeros from the multiplexer. This is useful for the execution of instructions involving a forward jump.

The ALU is programmed by making use of the pipelined increment of the pointed memory locations. The code is stored in memory such that the contents of the first location specify the operation to be performed, followed by the locations that contain the operands. The ALU fetches the contents of 3 memory locations at a time. The arithmetic or logical operation is selected by the 16 most significant bits of the instruction, which is presented by the ir register. The operands are held temporarily in the ar and br registers of the ALU while calculations are performed. The output of the ALU is given by alu_out.

In the pipelined multiplier, the inputs a and b are two unsigned 32-bit numbers. On each rising edge a and b are multiplied and the output y is updated. Starting from the right end, a is multiplied by the 8 least significant bits of b; b is then shifted right by 8 bits and a is again multiplied by the 8 least significant bits, and so on until the multiplication is complete. The 4 partial products, each shifted to its own weight, are added to produce the output.
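
The stall scheme described above gates the register clock with an OR. A commonly used alternative on FPGAs is to leave the clock untouched and give each register a synchronous enable, which avoids creating a second clock from combinational logic. The following is a minimal sketch of that alternative, not the design used in this project; the entity name regist_en and the port en are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical variant of the project's register with a load enable.
-- "regist_en" and "en" are illustrative names, not part of the project.
entity regist_en is
  port(clk   : in  std_logic;
       clear : in  std_logic;                      -- asynchronous clear, active high
       en    : in  std_logic;                      -- '1' = load ip, '0' = hold (stalled)
       ip    : in  std_logic_vector(31 downto 0);
       op    : out std_logic_vector(31 downto 0));
end entity regist_en;

architecture beh of regist_en is
begin
  reg: process(clk, clear)
  begin
    if clear = '1' then
      op <= (others => '0');
    elsif rising_edge(clk) then
      if en = '1' then
        op <= ip;                                  -- normal operation: take the new value
      end if;                                      -- en = '0': keep the stored value
    end if;
  end process reg;
end architecture beh;
```

With such a register, R1 and R2 could be clocked directly by the master clock and enabled from a signal such as run <= not stall, so that both hold their contents during a stall while R3 continues to load the multiplexer output.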
5. VHDL files

1. Register.vhd:

library IEEE;
use IEEE.std_logic_1164.all;

entity regist is
  port(clk   : in  std_logic;
       clear : in  std_logic;
       ip    : in  std_logic_vector(31 downto 0);
       op    : out std_logic_vector(31 downto 0));
end entity regist;

architecture beh of regist is
  signal temp : std_logic_vector(31 downto 0);
begin
  reg: process(clk, clear)
  begin
    if clear = '1' then
      temp <= (others => '0');
    -- elsif rising_edge(clk) then
    -- level test: with this sensitivity list the register loads when clk
    -- changes to '1' (and when clear is released while clk is '1')
    elsif clk = '1' then
      temp <= ip;
    end if;
  end process reg;
  op <= temp;
end architecture beh;

2. Multiplexer.vhd:

library IEEE;
use IEEE.std_logic_1164.all;

entity mux is
  port(in0    : in  std_logic_vector(31 downto 0);
       in1    : in  std_logic_vector(31 downto 0);
       ctl    : in  std_logic;
       result : out std_logic_vector(31 downto 0));
end entity mux;

architecture beh of mux is
begin
  -- the 1 ns delay applies only to the in0 branch
  result <= in1 when ctl = '1' else in0 after 1 ns;
end architecture beh;

3. Pipelinestalling.vhd:

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_textio.all;
use IEEE.std_logic_arith.all;

entity pipe is
  port(reg1, reg2, reg3 : inout std_logic_vector(31 downto 0));
end entity pipe;

architecture beh of pipe is
  signal clk   : std_logic := '0';  -- master clock
  signal stall : std_logic := '0';  -- stall signal
  signal sclk  : std_logic := '0';  -- stall clock
  signal clear : std_logic := '1';  -- one shot clear
  subtype word is std_logic_vector(31 downto 0);
  signal zeros  : word := (others => '0');
  signal R1     : word;
  signal R1_a   : word;
  signal R2     : word;
  signal R2_mux : word;
  signal R3     : word;
  signal cnt    : word := x"00000000";
begin
  clock: process(clk)
  begin
    if clear = '1' then
      clear <= '0' after 500 ps;
    end if;
    clk <= not clk after 5 ns;
  end process clock;

  -- location counter, advanced on each rising edge of the stall clock
  cnt <= unsigned(cnt) + unsigned'(x"00000001") after 1 ns
         when sclk'event and sclk = '1';

  -- stall for one cycle when R2 holds 2 and R3 holds 1
  stall <= '1' after 1 ns when R2 = x"00000002" and R3 = x"00000001"
           else '0' after 1 ns;
  sclk <= clk or stall after 1 ns;

  -- pipeline stages
  R1_reg: entity work.regist port map(sclk, clear, cnt, R1);
  R1_a <= R1 or x"00000000" after 1 ns;  -- logic
  R2_reg: entity work.regist port map(sclk, clear, R1_a, R2);
  A2_mux: entity work.mux port map(R2, zeros, stall, R2_mux);
  R3_reg: entity work.regist port map(clk, clear, R2_mux, R3);

  reg1 <= R1;
  reg2 <= R2;
  reg3 <= R3;
end beh;

4. ALU.vhd:

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;
use IEEE.std_logic_arith.all;

entity alu is
  port(clk        : in    std_logic;
       ir, ar, br : inout std_logic_vector(31 downto 0);
       alu_sig    : out   std_logic;
       alu_out    : out   std_logic_vector(63 downto 0));
end alu;

architecture beh of alu is
  signal alu_st     : std_logic;
  signal alu_output : std_logic_vector(63 downto 0);
  type mem is array (0 to 31) of std_logic_vector(31 downto 0);
  constant content : mem := (0 => x"00000001",
                             1 => x"00000004",
                             2 => x"00000006",
                             others => x"FFFFFFFF");
  -- assumption: the memory is preloaded from content
  signal temp_mem : mem := content;
  -- upper half of alu_output for results that are only 32 bits wide
  constant zero32 : std_logic_vector(31 downto 0) := (others => '0');
  component pipe is
    port(reg1, reg2, reg3 : inout std_logic_vector(31 downto 0));
  end component;
begin
  reg: pipe port map(reg1 => br, reg2 => ar, reg3 => ir);

  clocked_alu: process(clk, ir)
    variable opa, opb : std_logic_vector(31 downto 0);
  begin
    if rising_edge(clk) then
      -- assumption: the low 5 bits of ar and br address the 32-word memory
      opa := temp_mem(conv_integer(ar(4 downto 0)));
      opb := temp_mem(conv_integer(br(4 downto 0)));
      alu_output <= (others => '0');
      alu_st <= '1';
      case ir(31 downto 16) is
        when x"0000" => alu_output <= zero32 & (opa + opb);
        when x"0001" => alu_output <= opa * opb;  -- 64-bit product
        when x"0002" => alu_output <= zero32 & (opa - opb);
        when x"0003" => alu_output <= zero32 & (opb - opa);
        when x"0004" => alu_output <= zero32 & (opa and opb);
        when x"0005" => alu_output <= zero32 & (opa or opb);
        when x"0006" => alu_output <= zero32 & (opa xor opb);
        when x"0007" => alu_output <= zero32 & (opa nand opb);
        when x"0008" => alu_output <= zero32 & (opa nor opb);
        when x"0009" => alu_output <= zero32 & (not opa);
        when others  => null;
      end case;
    end if;
    alu_sig <= alu_st;
    alu_out <= alu_output;
  end process clocked_alu;
end beh;

5. Pipelinedmultiplier.vhd:

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

entity pipemult is
  port(clk1 : in  std_logic;
       a, b : in  unsigned(31 downto 0);
       y    : out unsigned(63 downto 0));
end pipemult;

architecture rtl3 of pipemult is
  signal y1, y2, y3, y4, y5 : unsigned(39 downto 0);
  constant z : unsigned(63 downto 0) := (others => '0');
begin
  process(clk1)
  begin
    if rising_edge(clk1) then
      -- stage 1: four 32 x 8 partial products, 40 bits each
      y1 <= a * b( 7 downto  0);
      y2 <= a * b(15 downto  8);
      y3 <= a * b(23 downto 16);
      y4 <= a * b(31 downto 24);
      -- stage 2: pad each partial product to its weight and add
      y <= (z(63 downto 40) & y1)
         + (z(63 downto 48) & y2 & z( 7 downto 0))
         + (z(63 downto 56) & y3 & z(15 downto 0))
         + (y4 & z(23 downto 0));
    end if;
  end process;
end rtl3;
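
The partial multiply, shift and add scheme used by the multiplier can be checked in simulation against a direct multiplication. The sketch below is a stand-alone check written with ieee.numeric_std rather than the std_logic_arith package used in the project; the entity name partial_mult_check, the function mult_by_slices and the operand values are hypothetical and chosen only for illustration. It models the arithmetic only, not the two-edge latency of the clocked design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical simulation-only check; it has no ports and is not part of the project.
entity partial_mult_check is
end entity partial_mult_check;

architecture sim of partial_mult_check is
  -- Multiply a by each 8-bit slice of b, shift the partial product
  -- to the weight of that slice, and accumulate.
  function mult_by_slices(a, b : unsigned(31 downto 0)) return unsigned is
    variable acc : unsigned(63 downto 0) := (others => '0');
    variable pp  : unsigned(39 downto 0);
  begin
    for i in 0 to 3 loop
      pp  := a * b(8*i + 7 downto 8*i);               -- 32 x 8 -> 40-bit partial product
      acc := acc + shift_left(resize(pp, 64), 8*i);   -- weight 2**(8*i)
    end loop;
    return acc;
  end function mult_by_slices;
begin
  check: process
    -- arbitrary example operands
    constant a : unsigned(31 downto 0) := x"DEADBEEF";
    constant b : unsigned(31 downto 0) := x"12345678";
  begin
    assert mult_by_slices(a, b) = a * b
      report "slice-wise product differs from direct product"
      severity failure;
    report "slice-wise product matches direct product";
    wait;  -- stop the process after a single check
  end process check;
end architecture sim;
```

Running this in a VHDL-93 simulator should stop with a failure only if the slice-wise sum and the direct product disagree. The same comparison could be applied to the y output of pipemult two rising edges after a and b are presented, since the partial products and their sum are registered in successive cycles.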

6. User constraints file for FPGA

NET "clk" LOC = "AJ15";
NET "clear" LOC = "AC11"; #SW0
NET "R3(31)" LOC = "T7";
NET "R3(30)" LOC = "T8";
NET "R3(29)" LOC = "U4";
NET "R3(28)" LOC = "U5";
NET "R3(27)" LOC = "V2";
NET "R3(26)" LOC = "W2";
NET "R3(25)" LOC = "T9";
NET "R3(24)" LOC = "U9";
NET "R3(23)" LOC = "V3";
NET "R3(22)" LOC = "V4";
NET "R3(21)" LOC = "W1";
NET "R3(20)" LOC = "Y1";
NET "R3(19)" LOC = "U7";
NET "R3(18)" LOC = "U8";
NET "R3(17)" LOC = "V5";
NET "R3(16)" LOC = "V6";
NET "R3(15)" LOC = "W3";
NET "R3(14)" LOC = "W4";
NET "R3(13)" LOC = "AA1";
NET "R3(12)" LOC = "AB1";
NET "R3(11)" LOC = "W5";
NET "R3(10)" LOC = "W6";
NET "R3(9)" LOC = "Y4";
NET "R3(8)" LOC = "Y5";
NET "R3(7)" LOC = "AA3";
NET "R3(6)" LOC = "AA4";
NET "R3(5)" LOC = "W7";
NET "R3(4)" LOC = "W8";
NET "R3(3)" LOC = "AB3";
NET "R3(2)" LOC = "AB4";
NET "R3(1)" LOC = "AB2";
NET "R3(0)" LOC = "AC2";

7. Simulation waveforms – Performance Analysis
Output waveforms of the simulations in ModelSim:
[Figure: Pipeline simulation]
[Figure: Pipelined multiplier simulation]

8. Applications

Pipelining for multicore computers:
Using a pipeline architecture is a common and effective method of increasing throughput and reducing loop execution times on multicore computers. Pipelining can be used when data must go through multiple processes that can be broken into stages. Pipelining is a type of task parallelism that can be implemented for a series of serial tasks that have data dependencies.

Operating systems design:
In Unix-like computer operating systems (and, to some extent, Windows), a pipeline is the original software pipeline: a set of processes chained by their standard streams, so that the output of each process (stdout) feeds directly as input (stdin) to the next one. Each connection is implemented by an anonymous pipe. Filter programs are often used in this configuration.

Superscalar pipelining:
Superscalar pipelining involves multiple pipelines in parallel. Internal components of the processor are replicated so it can launch multiple instructions in some or all of its pipeline stages. The RISC System/6000 has a forked pipeline with different paths for floating-point and integer instructions. If there is a mixture of both types in a program, the processor can keep both forks running simultaneously. Both types of instructions share two initial stages (Instruction Fetch and Instruction Dispatch) before they fork. Often, however, superscalar pipelining refers to multiple copies of all pipeline stages (in terms of laundry, this would mean four washers, four dryers, and four people who fold clothes). Many of today's machines attempt to find two to six instructions that they can execute in every pipeline stage. If some of the instructions are dependent, however, only the first instruction or instructions are issued.

Pipelining to firmware:
Pipelining at the firmware level of machine organization can provide significant execution time benefits for certain types of instructions. The essential concept involved with this approach is the pipelining of operations within the hardware under direct control of the firmware, rather than the pipelining of microinstructions.

Dynamic pipelining:
Dynamic pipelines have the capability to schedule around stalls. A dynamic pipeline is divided into three units: the instruction fetch and decode unit, five to ten execute or functional units, and a commit unit. Each execute unit has reservation stations, which act as buffers and hold the operands and operations.

9. Conclusions
To summarize, pipelining is a technique that programmers can use to gain a performance increase in inherently serial applications (on multicore machines). The CPU industry trend of increasing cores per chip means that strategies such as pipelining will become essential to application development in the near future. In order to gain the most performance increase possible from pipelining, individual stages must be carefully balanced so that no single stage takes much longer to complete than the other stages.

The project has been done on a pipelined execution unit, a pipelined multiplier and a pipelined ALU. Handling of structural, data and control hazards can also be programmed to improve design efficiency, since it is very important for physical implementation of the design. Cache miss handling and exception handling are also required for improving the performance of pipelining in RISC-like systems.