Upcoming SlideShare
Loading in...5







Total Views
Views on SlideShare
Embed Views



0 Embeds 0

No embeds



Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

    19 19 Document Transcript

    • Execution Models for Processors and Instructions Florian Brandner Viktor Pavlu and Andreas Krall COMPSYS, LIP, ENS de Lyon Institute of Computer Languages UMR 5668 CNRS – ENS de Lyon – UCB Lyon – INRIA Vienna University of Technology Email: florian.brandner@ens-lyon.fr Email: {vpavlu,andi}@complang.tuwien.ac.at Abstract—Modeling the execution of a processor and its in- example, the Pentium 4 allows up to 128 instructions to bestructions is a challenging problem, in particular in the presence in-flight simultaneously [1].of long pipelines, parallelism, and out-of-order execution. A naive We thus propose a partitioning of the resources of the pro-approach based on finite state automata inevitably leads to anexplosion in the number of states and is thus only applicable to cessor’s datapath, where each resource is assigned a dedicatedsimple minimalistic processors. sub-automaton that is (largely) independent of the automata During their execution, instructions may only proceed forward of other resources. This allows the reduction of the overallthrough the processor’s datapath towards the end of the pipeline. number of automaton states, but requires a coordinated proce-The state of later pipeline stages is thus independent of potential dure in order to model state transitions among the individualhazards in preceding stages. This also applies for data hazards,i. e., we may observe data by-passing from a later stage to an automata. However, for most pipelined processors this parti-earlier one, but not in the other direction. tioning can be derived readily from the pipeline organization Based on this observation, we explore the use of a series based on the following observation: During execution the stateof parallel finite automata to model the execution states of of a resource depends only on the current instruction it isthe processor’s resources individually. The automaton model executing, its successors in the datapath where the currentcaptures state updates of the individual resources along withthe movement of instructions through the pipeline. A highly- instruction will proceed after finishing its computations atflexible synchronization scheme built into the automata enables the resource, as well as the next instruction that the resourcean elegant modeling of parallel computations, pipelining, and will receive from one of its predecessors in the datapath. Aeven out-of-order execution. An interesting property of our similar interaction pattern can be observed for data by-passing,approach is the ability to model a subset of a given processor where register values are forwarded from instructions that areusing a sub-automaton of the full execution model. executed late in the pipeline to instructions in early pipeline stages. It is thus possible to model the state of the processor I. I NTRODUCTION by a series of parallel finite automata (PFA) [2], which are Formal models of the dynamic behavior of processors and synchronized in order to perform state updates according toinstructions during their execution find applications in a wide the datapath and the processor’s pipeline structure.range of development, design, validation, and verification Furthermore, our approach allows to derive execution mod-tasks. For example, cycle-accurate instruction set simulators els of individual instructions or groups of instructions usingrely on an execution model in order to efficiently emulate sub-automata that include only those resources that are re-the processor’s behavior during the execution of a program. quired for the execution of the considered instructions. ThisDuring the simulation pipeline effects, resource-, control-, and is particularly interesting for analysis and verification tasksdata-hazards need to be detected and faithfully modeled in where the reduced size of the sub-automata is expected to beorder to guarantee correctness. The validation and verification beneficial.of a processor poses similar requirements. Here the processor The rest of this paper is organized as follows. First, wedesigner seeks for a proof that the behavior of the processor will present previous work on execution models in Section II,actually reflects the properties stated by a formal specification. followed by an introduction to the basic concepts of parallelIn addition, processor models can help to build software devel- finite automata in Section III. Section IV introduces our exe-opment and analysis tools, such as worst-case execution time cution model for processors based on parallel finite automata.(WCET) analysis tools, code profiling tools, code certification Next, we present how execution models for a subset of thetools, and compilers. processor’s instruction set can be derived down to the level of Naive approaches based on finite state automata (FSA) individual instructions. Finally, we will conclude by shortlyinevitably lead to a rapid growth in the number of states. discussing the merits our new approach in Section VI.In particularly when multiple instructions are executed inparallel, as is the case for pipelined, superscalar, or explicitly II. R ELATED W ORKparallel processors such as VLIW machines. Stalls due to Execution models for processors are often developed invarious kinds of data hazards, long memory latencies, shared combination with processor description languages (PDL) [3].resources, etc. further increase the number of states, rendering Lisa [4] models pipeline effects using extended reservationthis approach impractical for many application scenarios. For tables called L-charts that contain operations and the activation978-1-4244-8971-8/10$26.00 c 2010 IEEE
    • a multiple source nodes are termed synchronization transitions. A transition can be applied, or is enabled, if all its source n4 nodes are active in the current state and the current input start n1 symbol matches the transition’s label. The new state after the execution of a transition is derived by first deactivating all c source nodes and then activating all its target nodes. At the same time the current input symbol is consumed, unless the λ label represents the special symbol ‘λ‘. In that case the current b n2 n3 input symbol is not consumed, nevertheless the transition is still considered enabled. The automaton accepts its input when all the input symbols have been consumed and a final stateFig. 1. A parallel finite automaton accepting a∗ (b∗ c) - see Example 1. has been reached, i.e., when all final nodes in F are active.relations between them to support dynamic scheduling for A formal definition of PFAs and their execution is presentedsimple in-order pipelined processors. MADL uses extended in [2].finite state machines with tokens [5]. The token manager, A PFA can be represented as a labeled directed hyper-however, is implemented in C/C++, so the formal model graph [8] H = (V, E), consisting of a set of vertices Vremains incomplete. The approach presented in this paper and hyperedges e ∈ E ⊆ P(V ) × P(V ) × (Σ ∪ {λ}). Thewas also designed in conjunction with a PDL [6], [7]. The automaton’s nodes are represented by vertices of the graph,language relies on hypergraphs [8] in order to specify the i. e., V = N , and the transitions in µ correspond to labeledhardware structure of a processor. Due to the close relationship hyperedges.between hypergraphs and PFAs it is straightforward to derive The language accepted by a PFA can be represented usingan execution model from a processor specification. an extended form of regular expressions, with the common Furthermore, processor models are used in software and operators ?, ∗, +, and |. An additional operator representshardware verification tools, analysis tools, and instruction set the interleaving of the languages specified by the respectivesimulators. Reshadi and Dutt use Reduced Colored Petri Nets sub-expressions.to formally model the pipeline of processors. They are able Example 1. Consider the PFA depicted in Figure 1. The nodeto synthesize efficient instruction set simulators [9] for a wide n1 is the start node, while nodes n2 and n4 are final nodes. Therange of processor architectures. Thesing [10] uses finite state figure shows a hypergraph representation, where the hyperedgemachines in order to capture the execution state of a processor labeled ‘λ‘ is a parallel transition that does not consume anyfor worst-case execution time (WCET) analysis. Synchronous symbol. The two sub-automata represented by n2 as well asTransition Systems are used by Damm and Pnueli [11] to verify n3 and n4 may proceed independently. The PFA thus acceptsmachines with out-of-order execution by showing that they the language a∗ (b∗ c), where the sub-expressions b∗ and creach the same final state as purely sequential machines. are processed independently from each other. The automaton rejects the word abb , because the non- III. PARALLEL F INITE AUTOMATA final node n3 is activated after consuming the input symbol Parallel finite automata (PFA) are related to traditional finite ‘a‘ but never deactivated thereafter. Final node n4 is neverstate automata, but allow multiple nodes of the automaton activated and thus no final state is reached. The word abbcto be active simultaneously. Stotts and Pugh showed that is, however, accepted. In terms of the automaton’s executionPFAs are equivalent to finite state automaton (FSA) [2], i.e., this word is equivalent to any other interleaving, e. g., abcbit is possible to construct a regular FSA from any given or acbb .PFA. However, this transformation may lead to an exponentialgrowth in the number of states. PFAs thus allow a more IV. P ROCESSOR E XECUTION M ODELINGcompact representation of certain classes of regular languages. Initially, lets assume that the processor, for which an ex-Definition. A parallel finite automaton is defined by a ecution model is to be developed, is composed by a set oftuple M = (N, Q, Σ, µ, q0 , F ), where N is a set of nodes, resources, each of which is able to process one instruction atQ ⊆ P(N ) is a finite set of states, Σ a finite input alphabet, a time. Furthermore, assume that instructions traverse a subsetµ : P(N )×(Σ∪{λ}) → P(N ) is a transition function, q0 ∈ Q of the resources during their execution - note that it is possiblea start state, and F ⊆ N a set of final nodes. that one instruction uses several resources at the same time. In the following, we will show how parallel automata can be used Very similar to traditional automata, the execution of a to model the execution state of the processor, its resources, andPFA is described by repeated transitions from one state to its instructions.another. The difference is that multiple nodes can be active oneach state before and after every transition. Consequently, the A. Modeling Resourcestransition function µ provides an extended semantics covering A resource is modeled using four kinds of nodes in themultiple nodes of the automaton. Transitions with multiple automaton: (1) a ready node, (2) an accept node, (3) atarget nodes are termed parallel transitions, while those having complete node, and (4) two synchronization nodes. A local
    • cycle end start rdyb rdy synco rdya acptb synci τ enter ω cycle start cmplta acptc acpt ρ cmplt rdyc Fig. 3. Nodes and transition modeling a connection among multiple resources leave of the processor.Fig. 2. Nodes and transitions representing the state of a resource in theparallel automaton. model the movement of instructions through the datapath and in addition perform arbitration of the involved resources.view of the nodes and transitions of a resource is shown in Figure 3 depicts an inter-resource transition modeling anFigure 2, the interaction with other automata of the execution instruction that after completing its execution at resourcemodel is indicated by the gray transitions. a moves on to resources b and c, which then process the The synchronization nodes separate transitions logically instruction in parallel. The transition is enabled only when thebelonging to one execution cycle from those of the next cycle. complete node of the source resource is active and at the sameThe synchronization node at the center of Figure 2 serves as time both ready nodes of the targets are active too. When thea token that ensures that only a single action is performed transition is executed, the source resource becomes ready andby a resource on every cycle. The activation of the second at the same time the target resources leave the ready node andsynchronization node on the other hand indicates that the transition into the accept nodes. This effectively models howresource has performed its action for the current cycle and an instruction advances through the datapath from resourceis ready to proceed to the next cycle. The gray transactions to resource. Furthermore, mutual exclusive use of resourcesleading to and from those nodes represent a synchronization is guaranteed, since the ready nodes of the target resourcesamong all the resources of the processor. are properly deactivated when an instruction is accepted for The other nodes represent the current status of the resource execution.itself. The ready node is initially active and indicates that theresource is currently ready to accept instructions for execution.The node is deactivated whenever an instruction is accepted for C. Cycle Synchronizationexecution by the resource, and may only be reactivated when Every resource is allowed to perform a single resource-localthe currently executing instruction has completed. When the transition per cycle, since all resource-local transitions requireresource is idle a transition may immediately re-activate the the synci node to be active. The synci node thus logicallyready node – as indicated by transition . If an instruction marks the beginning of a cycle. Similarly, all resource-localhas been accepted, it proceeds by performing the computation transitions activate the synco node, which logically marks therepresented by the transition labeled ρ leading from the accept end of a cycle in the view of a resource. Once all synco nodesnode to the complete node. Finally, when an instruction is of all resources are active a global synchronization transitionblocked, i. e., it cannot proceed to another resource due to is enabled, which re-activates all synci nodes and signals thea hazard, it has to stall. This is represented by transition ω, transition from one execution cycle to the next.leading from the complete node back to itself. Note that all Furthermore, advanced arbitration schemes can be realizedthese resource-local transitions disable the synci node and in using the synchronization nodes by keeping the synci nodes ofturn enable synco . certain resources inactive and selectively enabling them when appropriate. This can be particularly valuable in pipelinedB. Instruction Execution processors where resources late within the pipeline control the The sub-automata of the individual processor resources are behavior of other resources that appear earlier in the pipeline,chained together using inter-resource transitions as indicated as for example for data by-passing and other forms of databy the enter and leave transitions in Figure 2. These transitions and control hazards.
    • D. Transition Guards associating its transitions with simulation functions that update The model presented so far is already able to express the emulated processor state. In comparison to interpretersdatapaths with structural hazards. In the presence of data or based on FSAs this will most likely induce a performancecontrol dependencies, the transitions have to be augmented hit, since multiple transitions are required in order to sim-with guard functions. Similarly, dynamic scheduling of mod- ulate a single execution cycle. However, in the context ofern superscalar processors and shared resources might require compiling simulators, which are orders of magnitudes fasteradditional guard functions. The guard functions inspect vari- than interpreting simulators, this additional information hasables that are associated with the sub-automata of individual the potential to unveil optimization opportunities, e.g., byresources. These variables represent control decisions, data, eliminating unused code via tracing [12].or input operands processed by the instructions currently ex- In addition to instruction set simulation, PFA-based execu-ecuted by the respective resources. The structure of the guard tion models might also prove valuable for the developmentfunctions is highly dependent on the application scenario an of verification and analysis tools. With PFAs it is possibleexecution model is applied to. However, our model generally to derive sub-automata of individual instructions, pairs ofsimplifies the guard functions since the synchronization nodes instructions or groups of instructions. These sub-automata areand transitions provide a consistent view of the currently active much smaller than the original automaton of the processor’sresources and instructions. full execution model. This opens the possibility for formal verification methods at the instruction level to proceed in two V. I NSTRUCTION - LEVEL M ODELS steps. The analysis first operates on the smaller instruction- An interesting property of the approach presented in the level sub-automaton in order to detect possible fault condi-previous section is that an instruction is represented as a set tions, such as data loss or race conditions. During the secondof active nodes in the automaton throughout its execution. The phase the partial results obtained during the first phase aremovement of an instruction through the datapath is resembled applied to the full execution model and further extended toby the deactivation and activation of nodes by corresponding cover the state of the complete processor. This reduces thetransitions. number of processor states that have to be examined during It is thus possible to deduce a sub-automaton consisting the analysis considerably.of only those nodes that are potentially activated during the R EFERENCESexecution of an instruction. The result is an execution model [1] J. L. Hennessy and D. A. Patterson, Computer Architecture - A Quan-of the given instruction, which is considerably smaller than the titative Approach, 4th ed. Morgan Kaufmann, 2006.full processor model. This idea can also be extended to pairs [2] P. D. Stotts and W. Pugh, “Parallel finite automata for modelingof instructions or more general groups of instructions, such as concurrent software systems,” J. Syst. Softw., vol. 27, no. 1, pp. 27– 43, 1994.all arithmetic instructions or all floating-point instructions of [3] P. Mishra and N. Dutt, Processor Description Languages. Morgana processor. Kaufmann Publishers Inc., 2008, vol. 1. [4] A. Hoffmann, H. Meyr, and R. Leupers, Architecture Exploration for VI. D ISCUSSION AND C ONCLUSION Embedded Processors with Lisa. Kluwer Academic Publishers, 2002. [5] W. Qin and S. Malik, “Flexible and formal modeling of microprocessors PFAs offer a number of advantages for the modeling of with application to retargetable simulation,” in DATE ’03: Proceedingsprocessors. The processor itself is inherently parallel, so the of the Conference on Design, Automation and Test in Europe. IEEEmodeling with PFAs is quite natural and easy to understand. Computer Society, 2003, pp. 556–561. [6] F. Brandner, D. Ebner, and A. Krall, “Compiler generation fromPFA-based processor models are very visual and can be structural architecture descriptions,” in CASES ’07: Proceedings of theorganized in order to match common graphical representations International Conference on Compilers, Architecture, and Synthesis forof processors such as pipeline diagrams. The movement of Embedded Systems. ACM, 2007, pp. 13–22. [7] F. Brandner, “Compiler backend generation from structural processorinstructions through the pipeline is explicitly visible, which models,” Ph.D. dissertation, Institut f¨ r Computersprachen, Technische usimplifies the development and debugging of PFA models. Universit¨ t Wien, 2009. a A major advantage of PFAs over FSAs is the decoupling of [8] J. A. Bondy and U. S. R. Murty, Graduate texts in mathematics - Graph theory. Springer, 2007, vol. 244.states and nodes. PFAs thus consist of fewer nodes compared [9] M. Reshadi and N. Dutt, “Generic pipelined processor modeling andto equivalent regular FSA modeling the same processor. Still, high performance cycle-accurate simulator generation,” in DATE ’05:the PFA implicitly encodes the same number of states. The Proceedings of the Conference on Design, Automation and Test in Europe. IEEE Computer Society, 2005, pp. 786–791.expressive power does not change since PFAs and FSAs are [10] S. Thesing, “Safe and precise WCET determination by abstract inter-equivalent. The implicit encoding of all possible processor pretation of pipeline models,” Ph.D. dissertation, Naturwissenschaftlich-states also simplifies the construction of PFA models, i.e., Technische Fakult¨ t der Universit¨ t des Saarlandes, 2004. a a [11] W. Damm and A. Pnueli, “Verifying out-of-order executions,” in Pro-it is no longer required to pre-compute all possible states to ceedings of the IFIP WG 10.5 International Conference on Correctconstruct the automaton. Hardware Design and Verification Methods. Chapman & Hall, Ltd., 1997, pp. 23–47.A. Application Scenarios [12] F. Brandner, “Precise simulation of interrupts using a rollback mecha- nism,” in SCOPES ’09: Proceedings of the 12th International Workshop The presented approach was initially intended as a flexible on Software and Compilers for Embedded Systems. ACM, 2009, pp.and generic execution model for instruction set simulators. 71–80.Interpreting simulators can immediately reuse the PFA by