Streams on Wires — A Query Compiler for FPGAs

Transcript

  • 1. Streams on Wires — A Query Compiler for FPGAs
Rene Mueller, Jens Teubner, Gustavo Alonso
rene.mueller@inf.ethz.ch · jens.teubner@inf.ethz.ch · alonso@inf.ethz.ch
Systems Group, Department of Computer Science, ETH Zurich, Switzerland
2010.06.17

The Avalanche project at ETH Zurich addresses the challenging questions raised by these scenarios. In Avalanche, we study the impact of modern computer architectures on data processing. As part of the efforts around Avalanche, we assess the potential of using FPGAs as additional cores in many-core machines and how they could be exploited for data processing purposes. In this paper, we report on the work we have done developing Glacier, a library of components and a basic compiler for continuous queries implemented on top of an FPGA. The ultimate goal of this line of work is to develop a hybrid data stream processing engine where an optimizer distributes query workloads across a set of CPUs (general-purpose or specialized) and FPGA chips. Here, we focus on how conventional streaming operators can be mapped to circuits on an FPGA.
  • 2. Abstract

Taking advantage of many-core, heterogeneous hardware for data processing tasks is a difficult problem. In this paper, we consider the use of FPGAs for data stream processing as co-processors in many-core architectures. We present Glacier, a component library and compositional compiler that transforms continuous queries into logic circuits by composing library components on an operator-level basis. In the paper we consider selection, aggregation, grouping, as well as windowing operators, and discuss their design as modular elements. We also show how significant performance improvements can be achieved by inserting the FPGA into the system's data path (e.g., between the network interface and the host CPU). Our experiments show that queries on the FPGA can process streams at more than one million tuples per second and that they can do this directly from the network, removing much of the overhead of transferring the data to a conventional CPU.
  • 3. Contributions

The paper makes the following contributions:

1. We describe Glacier, a component library and compiler for FPGA-based data stream processing. Besides classical streaming operators, Glacier includes specialized building blocks needed in the FPGA context. With the operators and specialized building blocks, we show how Glacier can be used to produce FPGA circuits that implement a wide variety of streaming queries.

2. Since FPGAs behave very differently than software, we provide an in-depth analysis of the complexity and performance of the resulting circuits. We discuss latency and issue rates as the relevant metrics that need to be considered by an optimizing plan generator.

3. Finally, we evaluate the end-to-end performance of an FPGA inserted between the network and the CPU, running a query compiled with Glacier. Our results show that the FPGA can process streams at a rate beyond one million tuples per second, far more than the CPU could.
  • 4. Motivating Applications

A financial trading application receives data from a set of streams with up-to-date market information from different stock exchanges; bursts in these feeds typically indicate a turbulent market situation. At the same time, latency is critical and measured in units of microseconds (µs).

To abstract from the real application, we assume an input stream that contains a reduced set of information about each trade handled by Eurex (the actual streams are implemented as a compressed representation of the feature-rich FIX protocol [4]). Expressed in the syntax of StreamBase [16], the schema of our data would read:

CREATE INPUT STREAM Trades (
  Seqnr  int,        -- sequence number
  Symbol string(4),  -- valor symbol
  Price  int,        -- stock price
  Volume int)        -- trade volume

To keep matters simple, we look at queries that process a single data stream only. In particular, we disallow the use of joins. To facilitate the allocation of resources on the FPGA, we restrict ourselves to queries with a predictable space requirement. We do allow aggregation queries and windowing; in fact, we particularly look at such functionality in the second half of Section 4. These restrictions can be lifted with techniques that are no different to those applied in software-based systems. The necessary FPGA circuitry, however, would introduce additional complexity and only distract from the FPGA-inherent considerations that are the main focus of this work.
  • 5. Example Queries

Our first set of example queries is designed to illustrate a hardware-based implementation for the most basic operators in stream processing. Queries Q1 and Q2 use simple field projections as well as selections and compound predicates:

SELECT Price, Volume
FROM Trades
WHERE Symbol = "UBSN"                          (Q1)
INTO UBSTrades

SELECT Price, Volume
FROM Trades
WHERE Symbol = "UBSN" AND Volume > 100000      (Q2)
INTO LargeUBSTrades

Financial analytics often depend on statistical information from the data stream. Using sliding-window and grouping functionality, Query Q3 counts the number of trades of UBS shares over the last 10 minutes (600 seconds) and returns the aggregate every minute:

SELECT count () AS Number
FROM Trades [ SIZE 600 ADVANCE 60 TIME ]
WHERE Symbol = "UBSN"                          (Q3)
INTO NumUBSTrades

In Query Q4, we assume the presence of an aggregation function wsum that computes the weighted sum over the prices seen in the last four trades of UBS stocks (similar functionality is used, e.g., to implement finite-impulse response filters):

SELECT wsum (Price, [ .5, .25, .125, .125 ]) AS Wprice
FROM (SELECT * FROM Trades
      WHERE Symbol = "UBSN")
     [ SIZE 4 ADVANCE 1 TUPLES ]               (Q4)
INTO WeightedUBSTrades

Finally, Query Q5 determines the average trade prices for each stock symbol over the last ten-minutes window:

SELECT Symbol, avg (Price) AS AvgPrice
FROM Trades [ SIZE 600 ADVANCE 60 TIME ]
GROUP BY Symbol                                (Q5)
INTO PriceAverages

We use these five queries in the following to demonstrate various features as well as the compositionality of the Glacier compiler.
  • 6. Algebraic Plans

  π a1,...,an (q)        projection
  σ a (q)                select tuples where field a contains true
  ○ a:(b1,b2) (q)        arithmetic/Boolean operation a = b1 ⊙ b2
  q1 ∪ q2                union
  agg b:a (q)            aggregate agg using input field a,
                         agg ∈ {avg, count, max, min, sum}
  q1 grp x|c q2 (x)      group output of q1 by field c, then
                         invoke q2 with x substituted by the group
  q1 wndᵗ x|k,l q2 (x)   sliding window with size k, advance by l;
                         apply q2 with x substituted on each window;
                         t ∈ {time, tuple}: time- or tuple-based
  q1 ⧺ q2                concatenation; position-based field join

Table 1: Supported streaming algebra (a, b, c: field names; q, qi: sub-plans; x: parameterized sub-plan input).
  • 7. Algebraic Plans

Input to our compiler is a query representation in an algebra for streaming queries. Our compiler currently supports the algebra dialect listed in Table 1, whereby operators can be composed in an arbitrary fashion.

Our algebra uses an "assembly-style" representation that breaks down selection, arithmetics, and predicate evaluation into separate algebra operators. In earlier work, we found a similar representation also suited to express, e.g., the semantics of XQuery [11]. In the context of the current work, this notation turns out to have nice correspondences to the data flow in a hardware circuit and helps detecting opportunities for parallel evaluation.

Figure 1 illustrates how our streaming algebra can be used to express the semantics of Queries Q1 through Q5. The plans make each processing step explicit: in Figure 1(a), e.g., the operator ○= extends each input tuple with the result of the comparison Symbol = "UBSN"; its output, the new column a, is then used by the selection σa to filter out non-qualifying tuples.

Figure 1: Algebraic query plans for the five example queries Q1 to Q5: (a) Query Q1, (b) Query Q2, (c) Query Q3, (d) Query Q4, (e) Query Q5.

The concatenate operator is also called a "join by position": tuples from both inputs are combined into a wide result tuple in the order they arrive. The operator is necessary, for instance, to evaluate and return different aggregation functions over the same input stream.

  6-to-1 lookup tables          69,120
  flip-flops (1-bit registers)  69,120
  block RAM                     296 × 18 kbit
  25×18-bit multipliers         64
  typical clock rate            100 MHz

Table 2: Xilinx XC5VLX110T characteristics.

At its very heart, every FPGA chip consists of three main types of components. A large number of lookup tables (LUTs) provides a programmable type of logic gates; each lookup table can implement an arbitrary 6 bit → 1 bit logic function. Lookup tables are wired through an interconnect network that can route signals across the chip. Finally, flip-flops (also called registers) provide 1-bit storage units that can directly be wired into the remaining logic. The behavior of the lookup tables, the wiring of the interconnect, and the initial state of the flip-flops can all be configured by software. Typically, the configuration is described using a hardware description language (such as VHDL or Verilog) and loaded into the FPGA.

In addition, lookup tables can be programmed at runtime and thus used as distributed memory; block RAM cells are distributed over the chip and tightly integrated with the programmable logic. As such, the on-chip storage resources of the FPGA can be accessed in a truly parallel fashion. A particular use of this potential is the implementation of content-addressable memory (CAM), where content is resolved by data values rather than by explicit memory addresses: given a data value, the CAM returns the address it has been stored at, typically within two cycles. More generally, this functionality can implement an arbitrary key-value mapping.
  • 8. FPGAs for Stream Processing

In setup (a), the FPGA is directly connected to the physical network interface (NIC), with parts of the network controller implemented inside the FPGA fabric. The FPGA writes incoming data to system main memory, then informs the host CPU about the arrival of new data (e.g., by raising an interrupt). This architecture fits a pattern that is highly common in data stream applications.

Alternatively, the FPGA can also be used in a traditional co-processor setup, as illustrated in Figure 2(b). Here, the CPU hands over data to the FPGA either by writing directly into FPGA registers (so-called slave registers) or, if data volumes become high, by writing the data into (external) memory, where logic on the FPGA picks it up autonomously after it has received a work request from the host CPU.

Figure 2: System architectures: (a) stream engine between network interface and CPU; (b) stream engine as a co-processor to the CPU.
  • 9. Query Compilation

algebraic query plans → Glacier → VHDL code → Xilinx Synthesizer → circuit on FPGA

Figure 3: Compilation of abstract query plans into hardware circuits for the FPGA.

The compiler applies compilation rules (Section 4) and optimization heuristics (Section 6), then emits the description of a logic circuit that implements the input plan. Every compiled sub-plan adheres to the same well-defined wiring interface.

4.1 Wiring Interface. As in data streaming engines, our processing model is entirely push-based. Each n-bit-wide tuple is represented as a set of n parallel wires in the FPGA fabric. On a set of wires, a new tuple can be propagated in every cycle of the FPGA's system clock (i.e., 100 million tuples per second). An additional data valid line signals the presence of a tuple in a given clock cycle; tuples are only considered part of the data stream if their data valid flag is set, i.e., if the data valid line carries an electrical "high".
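As a concrete illustration of this wiring interface, here is a minimal VHDL sketch (our own, not from the paper; the entity name, port names, and tuple width are illustrative): an operator receives an n-bit tuple bus plus a data valid line and registers its output into flip-flops, as every Glacier library operator does.

-- Hypothetical identity operator on the wiring interface: an n-bit
-- tuple travels on parallel wires next to a data valid flag, and the
-- result is written into a flip-flop register every clock cycle.
library ieee;
use ieee.std_logic_1164.all;

entity stream_operator is
  generic (TUPLE_WIDTH : natural := 96);  -- e.g. Symbol|Price|Volume
  port (
    clk       : in  std_logic;
    in_tuple  : in  std_logic_vector(TUPLE_WIDTH - 1 downto 0);
    in_valid  : in  std_logic;   -- '1' iff a tuple is present this cycle
    out_tuple : out std_logic_vector(TUPLE_WIDTH - 1 downto 0);
    out_valid : out std_logic
  );
end entity;

architecture rtl of stream_operator is
begin
  process (clk) begin
    if rising_edge(clk) then
      out_tuple <= in_tuple;   -- pass-through: latency 1, issue rate 1
      out_valid <= in_valid;
    end if;
  end process;
end architecture;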
  • 10. Wiring Interface

We use arrows to denote the wiring between hardware components. Wherever appropriate, we identify those lines that correspond to a specific tuple field by a label at the respective input/output port; the label ∗ stands for "all remaining fields". We do not represent the order of fields within a hardware tuple. Rectangles represent logic components; gray-shaded boxes stand for compiled sub-plans that we treat as black boxes.

Latency corresponds to an operator's response time, whereas the issue rate determines its throughput. Both properties also need to be considered during query compilation: all compilation rules ensure that a generated circuit will never try to push tuples in successive cycles into an operator that has an issue rate of less than one.

Rule (Proj): π a1,...,an (q) compiles to a circuit that simply forwards the wires of fields a1, ..., an from the compiled sub-plan q. There is no actual work to do at runtime (though fields are propagated into a new set of registers); latency and issue rate of this implementation for projection are both 1. The implementation has an interesting side effect: our compiler emits the description of a hardware circuit that is passed into a synthesizer to generate the actual hardware configuration for the FPGA, and the synthesizer optimizes out "dangling wires", effectively implementing projection pushdown for free.

Rule (Sel): the hardware plan for the expression σa(q) is a logical 'and' gate that invalidates the output tuple whenever field a contains false; the output's data valid line is the conjunction of the input's data valid line and field a.

4.3 Arithmetics and Boolean Operations. As indicated in Table 1, we use the generic ○ a:(b1,b2) operator to represent arithmetic computations, value comparisons, or Boolean connectives in relational plans. Rule (Equals): ○= a:(b1,b2) (q) will emit all fields in q, extended by a new field a that contains the outcome of b1 = b2; this semantics directly translates into an implementation. Most simple arithmetic or Boolean operations will run within a single clock cycle. More complex tasks, such as multiplication/division or floating-point arithmetics, may require additional latency; delay operators z^-n can block data items for a fixed number of cycles n and can be used, e.g., to properly synchronize the output of slow arithmetic operators with the remaining tuple flow. In general, circuits may need more than one cycle until their computation can be picked up; they then have a latency larger than one. Due to their semantics, circuits that implement windowing cannot produce output before the last tuple of the respective query window has been seen.

4.4 Union. From a data flow point of view, the task of the algebraic union operator ∪ is to accumulate the output of several source streams into a single output stream. Since, in our case, all source streams operate truly in parallel, a hardware implementation for ∪ needs to ensure proper synchronization. We do so by buffering all input ports using FIFOs; a round-robin state machine inside the union component then picks up tuples from the input FIFOs. FIFO queues act as short-term buffers for streams with varying data rate; they emit data at a predictable rate, typically the issue rate of an upstream sub-circuit. Note that, at runtime, the average rate of all input streams together must not exceed what is achievable as the output rate. In most practical cases, the depth of the FIFO can be kept very low; this not only implies a small resource footprint, but also means that the impact on the overall latency is typically small.
  • 11. Synchronization

Aggregation. To implement count a(q), we use a standard counter component and wire its trigger input to the data valid signal of the input stream (Rule Count). Once we reach the end of the current stream, indicated by the end-of-stream (eos) symbol, we emit the counter value to the downstream data path, forward the eos symbol, and reset the counter to zero to prepare for the next input stream. The counter constructs a new output field without reading any particular input value and emits no other field but the aggregate (we handle grouping separately, see next). Algebraic aggregate functions can be evaluated with a constant amount of state [9]; for all the algebraic aggregates we consider (count, sum, avg, min, and max), the latency is one cycle, and a tuple can be applied at every clock cycle (the issue rate is 1).

Holistic aggregate functions. For some aggregate functions, the state required is not within constant bounds. They need to batch (parts of) their input until the aggregate can be computed when the end of the stream is seen. The prototype examples for such operators are the computation of medians or frequent items.

Concatenation. Rule (Concat) combines the output fields of two sub-plans q1 and q2 into one wide result tuple, a "join by position". The result is only meaningful if both input tuples are valid; hence, we use a logical 'and' gate to combine their data valid lines. Again, the 'and' gate easily finishes within a single cycle, so latency and issue rate are both 1.

4.5 Windowing. The concept of windowing bridges the gap between stream processing and relational-style semantics. The windowing operator consumes the output of its left-hand sub-plan (q1) and slices it into a set of windows; for each window, it invokes a parameterized sub-plan q2(x), with each occurrence of x replaced by the current window. In the compiled circuit, a 'counter' component counts incoming tuples and sends the adv signal as specified by the query's ADVANCE clause. Tuples may belong to several windows at once: in Query Q4 (tuple-based window, SIZE 4 ADVANCE 1 TUPLES), several windows are open together at any point in time, so the circuit needs a copy of the windowed sub-plan for each of them.

Because ADVANCE = 1 in Q4, we could simplify our circuit: our weighted sum operation wsum behaves similarly to the algebraic aggregates, but needs to remember only the last four input tuples. Flip-flops are a good choice to hold such small quantities of data; here we can use them in shift register mode, such that the buffer always contains the last four input values.
  • 12. Examples

Figure 4: Compiled hardware execution plan for Query Q1. The plan chains the comparison ○= a:(Symbol,"UBSN") (comparing the Symbol field against the constant "UBSN"), the selection σa (an 'and' gate on the data valid line), and the projection π Price,Volume behind the Trades input. Latency of this circuit is 3, issue rate 1. All sub-plans have an issue rate of one, so this is also the rate of the complete plan.

Figure 5: Compilation rule for the windowing operator (Rule Wind), shown for an instance with at most three windows open in parallel. The circuit instantiates one copy of the parameterized sub-plan q2 per concurrently open window; the registers labeled CSR1 and CSR2, together with the eos and adv signals, route tuples into the windows that are currently open. Sub-plan q2 thus sees a finite input for every execution and may, e.g., use aggregation in a semantically sound manner. Our compiler implements this semantics by wrapping q2 into the circuitry shown in Figure 5.
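To make Figure 4 concrete, here is a hypothetical VHDL sketch of the complete Q1 pipeline as produced by the rules above (our own code; the field widths and the ASCII encoding of "UBSN" are assumptions). Three register stages give the latency of 3 and an issue rate of 1.

-- Hypothetical three-stage pipeline for Query Q1:
--   stage 1: o=_{a:(Symbol,"UBSN")} adds Boolean field a
--   stage 2: sigma_a folds a into the data valid line
--   stage 3: pi_{Price,Volume} keeps only the projected fields
library ieee;
use ieee.std_logic_1164.all;

entity q1_plan is
  port (
    clk        : in  std_logic;
    in_symbol  : in  std_logic_vector(31 downto 0);  -- string(4)
    in_price   : in  std_logic_vector(31 downto 0);
    in_volume  : in  std_logic_vector(31 downto 0);
    in_valid   : in  std_logic;
    out_price  : out std_logic_vector(31 downto 0);
    out_volume : out std_logic_vector(31 downto 0);
    out_valid  : out std_logic
  );
end entity;

architecture rtl of q1_plan is
  constant UBSN : std_logic_vector(31 downto 0) := x"5542534E";  -- "UBSN"
  signal a1, v1, v2     : std_logic := '0';
  signal p1, p2, w1, w2 : std_logic_vector(31 downto 0);
begin
  process (clk) begin
    if rising_edge(clk) then
      -- stage 1: comparison against the constant
      if in_symbol = UBSN then a1 <= '1'; else a1 <= '0'; end if;
      v1 <= in_valid;  p1 <= in_price;  w1 <= in_volume;
      -- stage 2: selection ('and' gate on the valid line)
      v2 <= v1 and a1;  p2 <= p1;  w2 <= w1;
      -- stage 3: projection to Price and Volume
      out_price <= p2;  out_volume <= w2;  out_valid <= v2;
    end if;
  end process;
end architecture;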
  • 13. Examples

Figure 6: Hardware execution plan for Query Q4. Behind the Trades input, the comparison ○= a:(Symbol,"UBSN") and the selection σa filter the stream; a counter drives the adv signal for the tuple-based window; five copies of the wsum WPrice:(Price,[···]) sub-plan serve the concurrently open windows; and a 5-way union, steered by the registers CSR1 and CSR2 and the eos signals, merges their results into a single output stream.
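Since Q4 advances its window by one tuple, the plan can keep the last four prices in a shift register, as described two slides earlier. The sketch below (our own; the fixed-point treatment is an assumption) exploits that the weights .5, .25, .125, .125 are powers of two, so the weighted sum reduces to shifts and additions.

-- Hypothetical simplified wsum for Q4 (SIZE 4 ADVANCE 1 TUPLES):
-- flip-flops in shift-register mode hold the previous three prices;
-- weights 1/2, 1/4, 1/8, 1/8 become right-shifts by 1, 2, 3, 3.
-- (Windows at stream start are partially zero-filled in this sketch.)
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity wsum4 is
  port (
    clk       : in  std_logic;
    in_price  : in  unsigned(31 downto 0);
    in_valid  : in  std_logic;
    out_wsum  : out unsigned(31 downto 0);
    out_valid : out std_logic
  );
end entity;

architecture rtl of wsum4 is
  type buf_t is array (0 to 2) of unsigned(31 downto 0);
  signal buf : buf_t := (others => (others => '0'));
begin
  process (clk) begin
    if rising_edge(clk) then
      out_valid <= '0';
      if in_valid = '1' then
        -- weighted sum over the newest price and the three buffered ones
        out_wsum <= shift_right(in_price, 1) + shift_right(buf(0), 2)
                  + shift_right(buf(1), 3) + shift_right(buf(2), 3);
        out_valid <= '1';
        -- shift the new price into the buffer
        buf(1 to 2) <= buf(0 to 1);
        buf(0)      <= in_price;
      end if;
    end if;
  end process;
end architecture;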
  • 14. Examples

Figure 7: Compilation rule to implement the 'group by' operator grp (Rule GrpBy). The grouping field c of each tuple emitted by q1 is looked up in a content-addressable memory (CAM); the resulting group number steers a de-multiplexer (DEMUX) that dispatches the tuple to the copy of sub-plan q2 responsible for its group; an n-way union then collects the outputs of all groups into a single result stream. Replicating q2 for every group causes a noticeable effect on the overall chip space consumption.
  • 15. Streaming De-Multiplexing

Actual implementations may depend on specialized functionality that would be inefficient to express using standard algebra components. In our use case, algorithmic trading, input data is received as a multiplexed stream, encoded in a compressed variant of the FIX protocol [4]. Expressed using the StreamBase syntax, the multiplex stream contains actual streams like:

CREATE INPUT STREAM NewOrderStream (
  MsgType byte,       -- 68: new order
  ClOrdId int,        -- unique order identifier
  OrdType char,       -- 1:market, 2:limit, 3:stop
  Side char,          -- 1:buy, 2:sell, 3:buy minus
  TransactTime long)  -- UTC Timestamp

CREATE INPUT STREAM OrderCancelRequestStream (
  MsgType byte,       -- 70: order cancel request
  ClOrdId int,        -- unique order identifier
  OrigClOrdId int,    -- previous order
  Side char,          -- 1:buy, 2:sell, 3:buy minus
  TransactTime long)  -- UTC Timestamp

We have implemented a stream de-multiplexer that interprets the MsgType field (first field in every stream) and dispatches the tuple to the proper plan part.
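A de-multiplexer of this kind is straightforward on the wiring interface. The following VHDL sketch (our own; names and widths are illustrative) inspects the MsgType byte and raises exactly one per-stream valid line.

-- Hypothetical stream de-multiplexer: the MsgType byte steers each
-- tuple to the sub-plan for its stream by raising one valid line.
library ieee;
use ieee.std_logic_1164.all;

entity stream_demux is
  generic (WIDTH : natural := 160);
  port (
    clk              : in  std_logic;
    in_msgtype       : in  std_logic_vector(7 downto 0);
    in_tuple         : in  std_logic_vector(WIDTH - 1 downto 0);
    in_valid         : in  std_logic;
    out_tuple        : out std_logic_vector(WIDTH - 1 downto 0);
    new_order_valid  : out std_logic;  -- MsgType 68: NewOrderStream
    cancel_req_valid : out std_logic   -- MsgType 70: OrderCancelRequestStream
  );
end entity;

architecture rtl of stream_demux is
begin
  process (clk) begin
    if rising_edge(clk) then
      out_tuple        <= in_tuple;  -- payload wires are shared
      new_order_valid  <= '0';
      cancel_req_valid <= '0';
      if in_valid = '1' then
        if in_msgtype = x"44" then      -- 68: new order
          new_order_valid <= '1';
        elsif in_msgtype = x"46" then   -- 70: order cancel request
          cancel_req_valid <= '1';
        end if;
      end if;
    end if;
  end process;
end architecture;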
  • 16. Optimization Heuristics

- Reducing clock synchronization
- Increasing parallelism
- Trading unions for multiplexers
- Group by/windowing unnesting
  • 17. Evaluation

Figure 8: Number of packages successfully processed for two given input loads (300,000 pkt/s and 1,000,000 pkt/s), comparing the FPGA implementation with a software implementation (Linux 2.6). The hardware implementation is able to sustain both package rates and processes 100 % of the packets; the software implementation falls behind under high packet load, processing only 36 % of the packets at 1,000,000 pkt/s.
  • 18. Related Work

Hardware support for database processing:
- The "database machine" [3]: tailor-made hardware for database processing.
- The Netezza Performance Server system [2]: SPUs, each combining a disk, a NIC, a CPU, and an FPGA.
- The SQL Chip [14].

In Avalanche, we aim at exploiting the configurability of FPGAs. With Glacier, we present a compiler that compiles queries from an arbitrary workload into a tailor-made hardware circuit.

Processing model:
- Our execution model resembles that of the MonetDB/X100 system [13].

Specialized processors for use in a database context:
- GPUTeraSort [8]
- Stream joins for the Cell processor [6]