9. Trivial example
Regular expression: ACCGTGGA

Opcode  Reference
&       ACCG
&       TGGA
NOP     -

Input string: ACACCGTGGA

Instructions  Clock cycles               Data
Opcode  Ref   #1  #2  #3  #4  #5  #6
&       ACCG  FD  EX                     ACAC
&       ACCG      FD  EX                 CACC
&       ACCG          FD  EX             ACCG
&       TGGA              FD  EX         TGGA
NOP                           FD  EX     -
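
The data column of this trace can be reproduced with a few lines of Python. This is only a sketch of the visible behavior, not the hardware: the function name `trace_literal_match` is hypothetical, and the rule "on a mismatch, restart from the first instruction one character later" is an assumption read off the slide.

```python
def trace_literal_match(program, text):
    """Slide each reference over the text and record (reference, window) per step.

    `program` is a list of reference strings, one per '&' instruction.
    Assumption: on a mismatch the match restarts one character further on.
    """
    trace = []
    for start in range(len(text)):
        pos = start
        matched = True
        for ref in program:
            window = text[pos:pos + len(ref)]
            trace.append((ref, window))
            if window != ref:
                matched = False
                break
            pos += len(ref)
        if matched:
            return True, trace
    return False, trace


if __name__ == "__main__":
    ok, steps = trace_literal_match(["ACCG", "TGGA"], "ACACCGTGGA")
    for ref, window in steps:
        print(f"& {ref}  data={window}")
    print("match" if ok else "no match")
```

Run on the slide's input, the recorded windows are ACAC, CACC, ACCG and TGGA, in that order.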
10. Example: Kleene operators
Regular expression: (TTTT)+CT

Opcode  Reference
(       -
&)+     TTTT
&       CT
NOP     -

Input string: TCTTTTCT

Instructions  Clock cycles                       Data
Opcode  Ref   #1  #2  #3  #4  #5  #6  #7  #8
(             FD  EX                             -
&)+     TTTT      FD  EX                         TCTT
&)+     TTTT          FD  EX                     CTTT
&)+     TTTT              FD  EX                 TTTT
&)+     TTTT                  FD  EX             CT--
&       CT                        FD  EX         CT--
NOP                                   FD  EX     -
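
The same kind of trace can be sketched for the `&)+` instruction. The code below is an illustration under assumptions, not the core's actual logic: the 4-character data window, the `-` padding, the greedy repetition without backtracking, and the names `window_at` and `trace_plus_then_literal` are all hypothetical choices made to mirror the slide.

```python
WIDTH = 4   # assumed data-window width, taken from the 4-character data column
PAD = "-"   # padding shown on the slide when the input runs out


def window_at(text, pos):
    """Return the WIDTH-character window at pos, padded with PAD."""
    return text[pos:pos + WIDTH].ljust(WIDTH, PAD)


def trace_plus_then_literal(group, tail, text):
    """Trace (group)+tail over text; greedy, no backtracking (assumption)."""
    if not group:
        raise ValueError("group must be non-empty")
    trace = []
    for start in range(len(text)):
        pos, reps = start, 0
        while True:
            window = window_at(text, pos)
            trace.append(("&)+", group, window))
            if not window.startswith(group):
                break            # leave the loop on the first failed repetition
            reps += 1
            pos += len(group)
        if reps == 0:
            continue             # '+' needs at least one repetition
        window = window_at(text, pos)
        trace.append(("&", tail, window))
        if window.startswith(tail):
            return True, trace
    return False, trace


if __name__ == "__main__":
    ok, steps = trace_plus_then_literal("TTTT", "CT", "TCTTTTCT")
    for opcode, ref, window in steps:
        print(f"{opcode:<4}{ref:<5}data={window}")
```

Because the repetition is greedy and never backtracks, this sketch would reject inputs that a full regex engine accepts when the tail shares a prefix with the group; whether the real core behaves the same way is not stated in the slides.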
11. Flex vs TiReX

Regular expression              Time Flex [µs]  Time TiReX [µs]  File size  Speedup factor
ACCGTGGA                        271             39.9             16 kB      x6
(TTTT)+CT                       121             81.35            16 kB      x1.5
(CAGT)|(GGGG)|(TTGG)TGCA(C|G)+  263             173.835          16 kB      x1.5
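
As a quick check of the speedup column, the factor is simply the Flex time divided by the TiReX time. The sketch below reads the TiReX timings as 39.9, 81.35 and 173.835 µs (decimal values); the helper name `speedup` is hypothetical.

```python
# (regular expression, Flex time in µs, TiReX time in µs), as reported on the slide
RESULTS = [
    ("ACCGTGGA", 271.0, 39.9),
    ("(TTTT)+CT", 121.0, 81.35),
    ("(CAGT)|(GGGG)|(TTGG)TGCA(C|G)+", 263.0, 173.835),
]


def speedup(flex_us, tirex_us):
    """Speedup factor of TiReX over Flex for one measurement."""
    return flex_us / tirex_us


if __name__ == "__main__":
    for regex, flex_us, tirex_us in RESULTS:
        print(f"{regex:<32} x{speedup(flex_us, tirex_us):.2f}")
```

The ratios come out at roughly 6.8, 1.5 and 1.5, in line with the rounded factors reported on the slide.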
18. A further step: Multicore
• Dark silicon problem: why FPGA
Editor's Notes
Hi everyone, I'm Davide, a master's student in Computer Science and Engineering at Politecnico di Milano, and I will now present TiReX, a tiled regular expression matching architecture.
Our focus is on regular expressions, which have several application domains, ranging from signature-based detection for antivirus and network intrusion detection systems to genomic data analysis for personalized medicine and diagnostics.
I'd like to introduce two application scenarios… that have the common task of finding
Pattern matching is a compute-intensive task; moreover, it has high-speed requirements and must handle huge amounts of data. For example, human DNA is composed of billions of characters.
Disadvantages with respect to software solutions.
Taking a further step, we can see an RE as a set of instructions over a stream of data. For example, we can see & and | the way we see a plus or a minus. A processor, however, has a fixed instruction set.
But we don't want a fixed ISA, so we design the core to have a reconfigurable ISA. Whenever a user wants to match a new RE, they write the RE and pass it to the compiler, which translates it into the machine code that drives the computation of the processor.
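
To make the compilation step concrete, here is a minimal sketch that handles only plain literal patterns. It is not the TiReX compiler: the function name `compile_literal`, the 4-character reference width and the tuple representation are assumptions, chosen so that ACCGTGGA yields the instruction list shown in the trivial example (& ACCG, & TGGA, NOP).

```python
def compile_literal(pattern, width=4):
    """Split a literal pattern into '&' instructions of at most `width` characters.

    Hypothetical sketch: operators such as (, ), | and + are not handled here.
    """
    if not pattern or not pattern.isalpha():
        raise ValueError("only non-empty literal patterns are supported")
    program = [("&", pattern[i:i + width]) for i in range(0, len(pattern), width)]
    program.append(("NOP", "-"))  # end-of-program marker, as on the slides
    return program


if __name__ == "__main__":
    for opcode, reference in compile_literal("ACCGTGGA"):
        print(opcode, reference)
```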
Our architecture has a two-stage pipeline, composed of the FD (fetch and decode) and EX (execute) stages, and a control path that synchronizes the computation and keeps track of the RE status.
First, we fetch an instruction from the instruction memory; then the decode unit produces three signals: the opcode of the instruction; the reference, that is, the characters that have to be matched; and the valid reference, that is, the number of characters present in the instruction.
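
The three decode outputs can be pictured with a small sketch. The real instruction encoding is not given in the slides, so the textual form "opcode reference" used here, the `Decoded` record and the `decode` function are purely hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoded:
    opcode: str      # e.g. "&", "&)+", "(", "NOP"
    reference: str   # characters to be matched
    valid: int       # number of characters present in the reference


def decode(instruction):
    """Decode a textual instruction such as '& CT' (assumed format)."""
    opcode, _, reference = instruction.partition(" ")
    reference = reference.strip()
    if reference == "-":          # '-' marks an empty reference on the slides
        reference = ""
    return Decoded(opcode, reference, len(reference))


if __name__ == "__main__":
    for line in ["( -", "&)+ TTTT", "& CT", "NOP -"]:
        print(decode(line))
```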
Afterwards comes the execute phase. The data are fetched from the data buffer and passed to the clusters. Each cluster is a set of comparators that produces a result signal. Each cluster takes as input a chunk of data, each one shifted by one position (example?). Each intermediate result is then processed by the engine, which, depending on the opcode and the valid reference, produces a global result.
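
A software picture of the shifted clusters, as a sketch only: each cluster compares the reference against the buffer at its own offset, and a simple engine picks the first hit. The number of clusters (4), the function names and the "first hit wins" rule for the & opcode are assumptions, not details taken from the design.

```python
def cluster_results(buffer, reference, valid, n_clusters=4):
    """One boolean per cluster; cluster i sees the buffer shifted by i positions."""
    ref = reference[:valid]  # only the valid characters take part in the comparison
    return [buffer[i:i + valid] == ref for i in range(n_clusters)]


def engine_and(results):
    """Assumed engine rule for '&': offset of the first matching cluster, or None."""
    for offset, hit in enumerate(results):
        if hit:
            return offset
    return None


if __name__ == "__main__":
    results = cluster_results("ACACCGT", "ACCG", valid=4)
    print(results, engine_and(results))
```

With the buffer ACACCGT and the reference ACCG, only the cluster at offset 2 reports a match.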
Lastly, we have the control unit, composed of an FSM that synchronizes the pipeline and produces the control signals. We also have a status register and a stack to manage the context switch at an open parenthesis.
Example with Kleene operators; comparison between Flex and our core. Missing.
Even though we have just a simple prototype, we compare our core to Flex. As the table shows, TiReX outperforms Flex in all three kinds of examples.
The other result I want to show you is the area utilization. We implemented TiReX on a VC707 board powered by a Virtex-7. The table shows that we are underutilizing the FPGA resources.
Considering this factor and the huge amount of data we have to deal with, we are moving from a single-core architecture to a multicore architecture able to manage this brontosaurus-sized data and reach high performance.
This kind of architecture can operate in two different modes. The first is SIMD, where we have the same RE for each core and we divide the stream of data to achieve a parallel computation.
The other mode is MISD: in the security application field we have the problem of having many REs to be matched over the same stream of data. Thus we let the user decide which mode of operation to use, depending also on the application scenario.
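
The two modes can be sketched in software, with Python's `re` module standing in for a core and a plain loop standing in for the parallel cores. The function names and the `overlap` parameter are assumptions: splitting a stream can cut a match in two, so this sketch lets each chunk read a few extra characters, which is enough for fixed-length patterns but not for unbounded ones.

```python
import re


def simd_search(pattern, stream, n_cores, overlap):
    """Same RE on every core; the stream is split into n_cores chunks.

    `overlap` extra characters are read past each chunk boundary (assumption)
    so that matches straddling two chunks are not lost.
    """
    size = -(-len(stream) // n_cores)  # ceiling division
    hits = set()
    for core in range(n_cores):
        lo = core * size
        chunk = stream[lo:lo + size + overlap]
        for m in re.finditer(pattern, chunk):
            hits.add(lo + m.start())   # translate back to stream coordinates
    return sorted(hits)


def misd_search(patterns, stream):
    """One RE per core, all cores reading the same stream."""
    return {p: re.search(p, stream) is not None for p in patterns}


if __name__ == "__main__":
    print(simd_search("ACCGTGGA", "ACACCGTGGA", n_cores=2, overlap=7))
    print(simd_search("ACCGTGGA", "ACACCGTGGA", n_cores=2, overlap=0))
    print(misd_search(["ACCGTGGA", "(TTTT)+CT"], "ACACCGTGGA"))
```

With no overlap the match that spans the chunk boundary is missed, which is the main design question a real SIMD split would have to answer.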
In conclusion, I've presented a single-core pattern matching architecture that, in its current implementation, outperforms Flex. As future work, we are working on the multicore architecture to push performance and the amount of data we can process.