RFID Data Management
Kamlesh Laddhad (05329014)
Guide: Prof. Bernard Menezes
• Introduction to RFID Technology.
• Issues with RFID Technology.
• RFID Data Characteristics.
• Data Warehousing.
– Expressive Temporal Model: Dynamic Relationship ER Model
– RFID - Cuboids.
– Use of Bitmap Datatype.
• Data Cleaning.
– Extensible Sensor stream Processing (ESP)
– Statistical sMoothing for Unreliable RFid data (SMURF)
• Future Plans.
• Radio Frequency Identification:
– It is an Automatic Identification and Data Capture Technology.
– No contact or line of sight.
– Uses radio-frequency waves to transfer data
– Tag: small, low-cost device that can hold a limited amount of data.
• Associated with objects, such as pallets, cases, and even individual items.
– Reader: Recognizes the presence of a tag and reads the information stored on it.
• Unique electronic product code (EPC) associated with a tag.
• By placing RFID tag readers at various locations, one can track the
movement of objects through supply chain networks.
Applications and Adoptions
• Supply Chain Management: real-time inventory
– US Department Of Defense: shipments to armed forces
• Retail: Active shelves monitor product availability
– Wal-Mart, Albertsons: major retail stores
• Access control: toll collection, transportation.
– Airline luggage management:
• British Airways: 20 million bags a year
• Implemented to reduce lost/misplaced luggage
• Anti-counterfeiting and security:
– Food and Drug Administration: to reduce counterfeiting in the
pharmaceutical supply chain
Prospects for RFID Research
• The physics of building tags and readers
– Tags have few gates: apart from basic operation, very little computing power.
– Radio-frequency signals have issues operating in certain physical media.
• The privacy and safety issues:
– Complex encryption schemes are not possible on RFID tags.
– Counterfeiting by means of either illegitimate readers or spoofed tags is possible.
– Reader-tag communication is wireless: Third parties can eavesdrop on signals.
• Software architecture to collect, filter, organize, and answer online queries
– The number of tags is proportional to the number of items being serviced/tracked.
– The number of readers is proportional to the number of traceable strategic locations/areas.
• Each Reader picks up tag signals on continuous basis.
• Data generated by RFID systems is enormous:
• E.g. Wal-Mart is expected to generate 7 terabytes of RFID data per day.
• Our Focus: Third Stream.
Data Management Challenges
• Data Explosion : Example
– A retailer with 3,000 stores, selling 10,000 items a
day per store.
– Each item moves 10 times on average before being sold.
• Movement recorded as (EPC, location, second)
– Data volume: 300 million tuples per day.
– Example OLAP Query: “Average time for items to
move from warehouse to checkout counter in March
• Costly to answer if there are a billion tuples for March
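The daily volume in the example above is a plain multiplication. A minimal sketch of the arithmetic follows; the monthly figure assumes that every day of March (31 days) looks like the example day, which is our simplification and not a number from the slides.

```python
# Back-of-the-envelope data volume for the retailer example.
STORES = 3_000
ITEMS_PER_STORE_PER_DAY = 10_000
MOVES_PER_ITEM = 10  # each move is recorded as one (EPC, location, second) tuple

tuples_per_day = STORES * ITEMS_PER_STORE_PER_DAY * MOVES_PER_ITEM

# Assumption: constant daily volume over the 31 days of March.
tuples_in_march = tuples_per_day * 31

print(f"{tuples_per_day:,} tuples per day")
print(f"{tuples_in_march:,} tuples for March")
```

Under these assumptions a single month already runs into billions of tuples, which is why an OLAP query over raw readings is costly.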
• Temporal and history oriented
– Applications dynamically generate observations (readings).
– Object locations and containment relationships among objects change over time.
– Need: Expressive data model.
• Inaccurate data and implicit semantics
– False positive: Non-existing tag incorrectly read.
– False Negative: Reader missed a tag which was in its vicinity.
– Noisy data & duplicate readings (redundancy): Same tag read more than once.
– Need: Automated data filtering and transformation.
• Streaming and large volume
– Objects stay in place for long durations: readers record them
periodically, so data keeps accumulating.
– We need to preserve this data for tracking and monitoring.
– Need: Scalable storage scheme, compression techniques to reduce data.
• Data Granularity
– Data collection granularity needs to be decided
– Differs across applications.
• Lossless compression
– Remove redundancy: (r1,l1,t1) (r1,l1,t2) ... (r1,l1,t10) => (r1,l1,t1,t10)
– Group objects that move and stay together.
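The redundancy-removal rule above can be sketched as follows. This is an illustrative Python version, not code from the cited papers; the function name `compress_readings` and the tuple layout are assumptions. Note that the result is only lossless if the reader's polling period is known, so the individual timestamps can be regenerated.

```python
from itertools import groupby

def compress_readings(readings):
    """Collapse runs of (tag, location, time) into (tag, location, t_first, t_last)."""
    # Sort by tag, then time, so each tag's history is contiguous and ordered.
    ordered = sorted(readings, key=lambda r: (r[0], r[2]))
    stays = []
    # groupby only merges *consecutive* equal keys, so a tag that returns
    # to an earlier location later produces a separate stay record.
    for (tag, loc), run in groupby(ordered, key=lambda r: (r[0], r[1])):
        times = [t for _, _, t in run]
        stays.append((tag, loc, times[0], times[-1]))
    return stays

readings = [("r1", "l1", t) for t in range(1, 11)]
readings += [("r1", "l2", 11), ("r1", "l2", 12), ("r1", "l1", 13)]
stays = compress_readings(readings)
print(stays)
```

Thirteen raw readings shrink to three stay records here; the saving grows with how long objects remain in place.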
• Data cleaning: Multi-reading, missed-reading, error-reading, bulky movement.
• Data mining: Find trends, outliers, frequent, sequential, flow patterns.
• Multi-dimensional summary: product, location, time, …
– Store manager: Check item movements from the backroom to different shelves
in his store
– Region manager: Collapse intra-store movements and look at distribution
centers, warehouses, and stores
• Query Processing
– Support for OLAP: roll-up, drill-down, slice, and dice
– Path query: New to RFID-Warehouses, about the structure of paths
• What products that go through quality control have shorter paths?
• What locations are common to the paths of a set of defective auto-parts?
• Identify containers at a port that have deviated from their historic paths
Dynamic Relationship ER Model
• Proposed by Wang and Liu from Siemens.
• RFID entities are static and are not altered.
• RFID relationships: dynamic and change all the time.
• Two types of dynamic relationships added:
– Event-based dynamic relationship. A timestamp
attribute added to represent the occurring timestamp
of the event.
– State-based dynamic relationship. tstart and tend
attributes added to represent the lifespan of a state.
• Missing RFID Object Detection:
– Find when and where the object holding EPC = 'MEPC' was last seen
• select location_id, tstart, tend from objectlocation
where epc='MEPC' and tstart = ( select max(o.tstart) from
objectlocation o where o.epc='MEPC' )
– Check if there are missing objects at current location C,
knowing that all objects were complete at previous
location L at time T.
• select l.epc from objectlocation l where l.location_id =
'L' and l.tstart <= 'T' and l.tend >= 'T' and l.epc not
in ( select c.epc from objectlocation c where
c.location_id = 'C' )
• RFID Object Moving Time Inquiry:
– Time it takes to supply 'OEPC' from location S to location E
• select (e.tstart-s.tstart) as supplying_time from
objectlocation e, objectlocation s where e.epc =
'OEPC' and s.epc='OEPC' and s.location_id ='S' and
e.location_id = 'E'
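The queries above can be tried against a toy state-based table. The sketch below uses Python's built-in `sqlite3`; the sample rows, integer timestamps, and the destination filter `e.location_id = 'E'` in the last query are assumptions made for illustration, not data or SQL taken verbatim from the original model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# State-based dynamic relationship: tstart/tend bound the lifespan of each state.
conn.execute(
    "CREATE TABLE objectlocation "
    "(epc TEXT, location_id TEXT, tstart INTEGER, tend INTEGER)"
)
conn.executemany(
    "INSERT INTO objectlocation VALUES (?, ?, ?, ?)",
    [
        ("MEPC", "L", 0, 10),
        ("MEPC", "C", 12, 20),
        ("XEPC", "L", 0, 10),   # seen at L, never arrives at C
        ("OEPC", "S", 0, 5),
        ("OEPC", "E", 9, 15),
    ],
)

# Where and when was 'MEPC' last seen?
last_seen = conn.execute(
    "SELECT location_id, tstart, tend FROM objectlocation "
    "WHERE epc = ? AND tstart = "
    "(SELECT MAX(o.tstart) FROM objectlocation o WHERE o.epc = ?)",
    ("MEPC", "MEPC"),
).fetchone()

# Which objects were at L at time T = 5 but never show up at C?
missing = conn.execute(
    "SELECT l.epc FROM objectlocation l "
    "WHERE l.location_id = ? AND l.tstart <= ? AND l.tend >= ? "
    "AND l.epc NOT IN "
    "(SELECT c.epc FROM objectlocation c WHERE c.location_id = ?)",
    ("L", 5, 5, "C"),
).fetchall()

# Moving time of 'OEPC' from S to E (destination filter is our assumption).
supplying_time = conn.execute(
    "SELECT e.tstart - s.tstart FROM objectlocation e, objectlocation s "
    "WHERE e.epc = ? AND s.epc = ? AND s.location_id = ? AND e.location_id = ?",
    ("OEPC", "OEPC", "S", "E"),
).fetchone()[0]

print(last_seen, missing, supplying_time)
```

Parameter placeholders are used instead of string literals so the same statements work for any EPC, location, or timestamp.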
• Bulky object movements
– Objects often move and stay together through the supply chain.
– If 1000 packs of product P stay together at the distribution center,
register a single record.
– (GID, distribution center, time_in, time_out).
– GID is a generalized identifier that represents the 1000 packs that stayed
together at the distribution center
• Analysis usually takes place at a much higher level of abstraction
than the one present in raw RFID data
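The idea of registering one record for items that stay together can be sketched as below. This is a simplified illustration: the function name, the `g1`, `g2`, ... identifiers, and the flat GID-to-EPC mapping are assumptions, and a real warehouse would link GIDs hierarchically instead of storing full EPC lists.

```python
from collections import defaultdict

def group_stays(item_stays):
    """Group item-level stays (epc, location, time_in, time_out) by identical stay.

    Returns (stay_records, members): one (gid, location, time_in, time_out)
    record per group, plus a mapping from gid to the EPCs it stands for.
    """
    groups = defaultdict(list)
    for epc, location, time_in, time_out in item_stays:
        groups[(location, time_in, time_out)].append(epc)

    stay_records, members = [], {}
    for n, (key, epcs) in enumerate(sorted(groups.items()), start=1):
        gid = f"g{n}"                      # hypothetical generalized identifier
        stay_records.append((gid, *key))
        members[gid] = sorted(epcs)
    return stay_records, members

# 1000 packs that stayed together, plus one straggler that left later.
item_stays = [(f"epc{i}", "dc", 0, 5) for i in range(1000)]
item_stays.append(("epcX", "dc", 0, 7))
stay_records, members = group_stays(item_stays)
print(stay_records)
```

The 1001 item-level rows reduce to two stay records, which is the compression the bulky-movement observation relies on.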
• Fact Table: (EPC, location, time_in, time_out).
• In supply chain: Items travel through a series of locations.
• Query: what is the average time that product P stays at store in
• Traditional cubes miss the path structure of the data
• Stay Table: (GIDs, location, time_in, time_out: measures):
– Records information on items that stay together at a given location
– If using record transitions: difficult to answer queries, lots of
• Map Table: (GID, <GID1,..,GIDn>)
– Links together stages that belong to the same path. Provides additional
compression and query-processing efficiency
– High level GID points to lower level GIDs
– If saving complete EPC Lists: high costs of IO to retrieve long lists,
costly query processing
• Information Table: (EPC list, attribute 1,...,attribute n)
– Records path-independent attributes of the items, e.g., color,
• Electronic product code
– Standard naming scheme, proposed by Auto-Id Center.
– An EPC uniquely identifies an item.
– Format: <Header, Manager_No., Object Class, Serial No.>
• Header: Identifies the length, type, structure, version and generation
• Manager Number: Identifies an organizational entity.
• Object Class: Identifies a “class”, or type of thing.
• Serial Number: Specific instance of the Object Class being tagged.
– We will refer to
• <Header, Manager No, Object Class>: Prefix
• <Serial No.>: Suffix
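The prefix/suffix split can be illustrated with bit operations. The sketch below assumes a 64-bit EPC with field widths of 2, 21, 17, and 24 bits; the class and function names are hypothetical and chosen only for this example.

```python
from typing import NamedTuple

# Assumed 64-bit layout: header, manager number, object class, serial number.
HEADER_BITS, MANAGER_BITS, CLASS_BITS, SERIAL_BITS = 2, 21, 17, 24

class Epc(NamedTuple):
    header: int
    manager: int
    object_class: int
    serial: int

    @property
    def prefix(self) -> int:
        """<Header, Manager No, Object Class> packed into one 40-bit integer."""
        head = (self.header << MANAGER_BITS) | self.manager
        return (head << CLASS_BITS) | self.object_class

    @property
    def suffix(self) -> int:
        """<Serial No.>"""
        return self.serial

def parse_epc(value: int) -> Epc:
    """Split a 64-bit EPC integer into its four fields, lowest bits first."""
    serial = value & ((1 << SERIAL_BITS) - 1)
    value >>= SERIAL_BITS
    object_class = value & ((1 << CLASS_BITS) - 1)
    value >>= CLASS_BITS
    manager = value & ((1 << MANAGER_BITS) - 1)
    header = (value >> MANAGER_BITS) & ((1 << HEADER_BITS) - 1)
    return Epc(header, manager, object_class, serial)

epc = parse_epc((0x4AA890001F << 24) | 0x62C160)
print(hex(epc.prefix), hex(epc.suffix))
```

All items of the same product share the prefix, so only the 24-bit suffix distinguishes individual items.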
Use of Bitmap Datatype
• Observation: Items move together.
– Groups of items in the same proximity - e.g. on a shelf, on a
– Groups of items with same property - e.g. Same product
• Use a bitmap type for modeling a collection of EPCs
that can occur in item tracking applications.
– Instead of storing a tuple per item, store one tuple for all the
items having the same prefix.
– New fields replace epc:
• <Len, Suffix_length, Prefix, Suffix_start, Suffix_end, bitmap>
Use of Bitmap Datatype (contd.)
EPC-64 layout:
Header    EPC_Manager    Object_Class    Serial_Number
2 bits    21 bits        17 bits         24 bits
Bitmap representation:
Len    Suff_len    Prefix          Suff_start    Suff_end    bitmap
64     24          0x4AA890001F    0x62C160      0xA0B38E    101001…00010
• To use this datatype in SQL, we need
operations on such bitmaps.
• Conversion and counting operations: epc2Bmap,
bmap2Epc and bmap2Count
• Pairwise Logical Operations: bmapAnd, bmapOr,
bmapMinus, and bmapXor
• Maintenance Operations: bmapInsert and bmapDelete
• Membership Testing Operation: bmapExists
• Comparison Operation: bmapEqual
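To make the operation families above concrete, here is a minimal sketch that uses a Python integer as the bitmap. The function names echo the operations listed, but the signatures, the bit order (bit k stands for suffix `suffix_start + k`), and the requirement that both operands share the same prefix and `suffix_start` are assumptions of this sketch, not the database's actual API.

```python
def epc2bmap(suffixes, suffix_start):
    """Build a bitmap with one bit set per EPC suffix."""
    bmap = 0
    for s in suffixes:
        bmap |= 1 << (s - suffix_start)
    return bmap

def bmap2epc(bmap, suffix_start):
    """List the suffixes whose bits are set, in increasing order."""
    return [suffix_start + k for k in range(bmap.bit_length()) if (bmap >> k) & 1]

def bmap2count(bmap):
    """Number of items represented by the bitmap."""
    return bin(bmap).count("1")

def bmap_and(a, b):
    return a & b

def bmap_or(a, b):
    return a | b

def bmap_minus(a, b):
    """Items in a that are not in b."""
    return a & ~b

def bmap_xor(a, b):
    return a ^ b

def bmap_exists(bmap, suffix, suffix_start):
    """Membership test for a single suffix."""
    return bool((bmap >> (suffix - suffix_start)) & 1)

# Shelf contents at two points in time (suffixes are made-up values).
at_t1 = epc2bmap([100, 101, 105], suffix_start=100)
at_t2 = epc2bmap([100, 105, 107], suffix_start=100)
added = bmap2epc(bmap_minus(at_t2, at_t1), suffix_start=100)
print(added)
```

Set difference, intersection, and counting each become a single bitwise operation, which is what makes the collection-level representation attractive.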
Use of these operations in SQL
• Items added to a given shelf between time t1 and t2.
– SELECT bmap2Epc(bmapMinus(s2.item_bmap,
s1.item_bmap)) FROM Shelf_Inventory s1, Shelf_Inventory
s2 WHERE s1.shelf_id = <sid1> AND s1.shelf_id =
s2.shelf_id AND s1.time = <t1> AND s2.time = <t2>;
• Book store categorizes books in various categories.
– Following query determines the shelves where the books with
property ’Adventure’ and ’Romance’, are currently present in
– SELECT s.shelf_id FROM Shelf_Inventory s WHERE
bmap2Count(bmapAnd(s.item_bmap, (SELECT
bmapAnd(p.Adventure, p.Romance) FROM
Property_Inventory p))) > 0 AND s.time = <current_date>;
• Extension to bitmap proposal:
– Bitmap datatype is more appropriate for initial bulk-load & batch updates.
– It performs badly for incremental updates.
– A ‘hybrid Scheme’ for incremental Updates:
• Maintain periodic inventory checkpoints using bitmaps.
• For changes occurring between checkpoints, maintain traditional item-level data.
• Answer queries by merging the latest checkpoint bitmap with the
corresponding duration’s item-level data.
• The epc_suffix in the collection may not be contiguous
– The bitmap will be sparse: lots of zeros.
– Compress this using some encoding scheme
• Good for initial bulk loading and batch updates
• May reduce efficiency of bitmap operations.
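The hybrid scheme for incremental updates can be sketched as a checkpoint bitmap plus a log of item-level changes that is replayed at query time. The function name, the `("add" | "remove")` change format, and the integer-as-bitmap representation are assumptions made for this illustration.

```python
def current_inventory(checkpoint_bmap, changes, suffix_start):
    """Merge the latest checkpoint bitmap with item-level changes logged since.

    changes: list of (suffix, op) in time order, with op "add" or "remove".
    Bit k of the bitmap stands for suffix suffix_start + k.
    """
    bmap = checkpoint_bmap
    for suffix, op in changes:
        bit = 1 << (suffix - suffix_start)
        if op == "add":
            bmap |= bit
        elif op == "remove":
            bmap &= ~bit
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return bmap

# Checkpoint holds suffixes 100, 101, 103; then 102 arrives and 100 leaves.
checkpoint = 0b1011
inventory = current_inventory(
    checkpoint, [(102, "add"), (100, "remove")], suffix_start=100
)
print(bin(inventory))
```

Updates only append to the small change log, and the bitmap itself is rewritten only when the next checkpoint is taken.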
• Efficient methods for data mining problems
– Trend analysis
– Outlier detection
– Path clustering
• We will try exploring data mining applications to
Issues in Data Cleaning
• Lack of Completeness
– RFID readers capture only 60-70% of all tags that are in the vicinity
– Smoothing of data is done to rectify the loss of intermediate
• Temporal Nature of data or tag dynamics
– RFID tags are in motion and that is what makes them more
difficult to handle
– But motion of a tag causes dropping of messages
• RFID data streams are very fast and are huge in volume
– Hence filtering is important before sending them to database
• Temporal Granule:
– Based on the fact that tag data do not differ much
over a small time period
– Data can be clubbed on a small time frame
• Spatial Granule:
– Similarly, data from physically close readers are also
Stages of ESP
• Point: operates over a single value in a sensor
stream, filtered by a predicate in the WHERE clause
• Smooth: granularity defined by applications to
correct for missed readings temporally (over one
input only); uses aggregate function over the
• Merge: granularity specified by the application
to correct for missed readings spatially; grouped
by the specified spatial granule.
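The first three stages can be emulated in a few lines. ESP expresses its stages as declarative queries; the Python functions below are only an illustrative stand-in, and the reading format `(reader, tag, epoch)`, the fill-forward smoothing rule, and all names are assumptions of this sketch.

```python
from collections import defaultdict

def point(readings, predicate):
    """Point: keep individual readings that satisfy a predicate (like WHERE)."""
    return [r for r in readings if predicate(r)]

def smooth(readings, window):
    """Smooth: treat a tag as present at a reader for `window` epochs after
    each read, which papers over missed readings within the temporal granule."""
    seen = defaultdict(set)
    for reader, tag, epoch in readings:
        seen[(reader, tag)].add(epoch)
    out = set()
    for (reader, tag), epochs in seen.items():
        for e in epochs:
            for t in range(e, e + window):
                out.add((reader, tag, t))
    return sorted(out)

def merge(readings, granule_of):
    """Merge: group readers into spatial granules and drop duplicates."""
    return sorted({(granule_of[reader], tag, epoch) for reader, tag, epoch in readings})

raw = [("r1", "tagA", 1), ("r1", "tagA", 4), ("r2", "tagA", 2), ("r1", "bad", 1)]
clean = point(raw, lambda r: r[1] != "bad")
smoothed = smooth(clean, window=3)
merged = merge(smoothed, {"r1": "shelf0", "r2": "shelf0"})
print(merged)
```

After the pipeline, the gap between epochs 1 and 4 is filled and the two readers report as one shelf.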
Stages of ESP (contd.)
• Arbitrate: deals with conflicts between different spatial granules; grouped by
spatial granule first and then uses HAVING construct to determine those conflicts
• Virtualize: used for combining data streams from different sources, could also
be different devices; join construct is used to combine the different data streams
and then filtered using some
• False Positives: (erroneous readings) reporting objects
that are not actually present
• False Negatives: (missed readings) not reporting objects
that actually are present
False positives and False Negatives [Jeff06]
• The reader has an internal table called the Tag List.
• An epoch is the smallest unit of interaction between the reader
and the middleware.
• Every epoch consists of certain number of Interrogation cycles
• Interrogation Cycle is one run of the reader protocol to
determine all tags
• At every epoch the reader sends the tag list to the middleware.
Tag ID      Responses    Timestamp
12341234    6            t1
12347890    1            t2
SMURF – Per tag Cleaning
• SMURF uses statistical methods to reduce the false
negatives and false positives in the RFID data
• The goal here is twofold: first, to determine the
statistical window size, and second, to ensure that
transitions of the tags are detected.
• To determine the window size we need to fit a
probability distribution to the sample size
• And to determine the transition of the tag out of the
reader's vicinity, we define a 98% confidence interval
within that probability distribution function on the
sample size |Si|.
SMURF – Per tag Cleaning (contd.)
• Using the tag list, the per-epoch sampling probability $p_{i,t}$ is determined:
$p_{i,t}$ = (number of times tag $i$ was read in epoch $t$) / (interrogation cycles per epoch)
• We average this over the sample $S_i$ to get
the average read rate $p_i^{avg}$ for tag $i$.
• If the same probability $p_i^{avg}$ is assumed for each
epoch throughout the window, then each
observation is like a Bernoulli trial.
SMURF – Per tag Cleaning (contd.)
• So $|S_i|$ is a binomial random variable for a sample $S_i$,
with mean $w_i p_i^{avg}$ and variance $w_i p_i^{avg} (1 - p_i^{avg})$
• Now using this we can express the window size as a limit,
• If the current window size is less than the calculated one
then the window size is adjusted accordingly.
• Similarly, using the central limit theorem, a transition is detected when
$||S_i| - \mu| > 2\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of $|S_i|$
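The per-tag steps can be put together in a short sketch. The window bound used here is our own derivation from the binomial model, since the exact expression is not recoverable from the slide: a tag with read rate p is missed in all w epochs with probability (1 - p)^w <= e^(-p w), so requiring this to be at most a tolerance delta gives w >= ln(1/delta) / p. The function names, the tolerance `delta`, and the choice of 10 interrogation cycles per epoch are assumptions.

```python
import math

def read_rate(responses, cycles_per_epoch):
    """Per-epoch sampling probability: responses / interrogation cycles."""
    return responses / cycles_per_epoch

def required_window(p_avg, delta):
    """Smallest window (in epochs) with w >= ln(1/delta) / p_avg, so that a
    present tag is missed in every epoch with probability at most delta."""
    return math.ceil(math.log(1 / delta) / p_avg)

def transition_detected(num_observed, w, p_avg):
    """Flag a transition when |S_i| deviates from its binomial mean by
    more than two standard deviations."""
    mu = w * p_avg
    sigma = math.sqrt(w * p_avg * (1 - p_avg))
    return abs(num_observed - mu) > 2 * sigma

# Tag with 6 responses in an epoch of (assumed) 10 interrogation cycles.
p = read_rate(6, 10)
w = required_window(p, delta=0.05)
print(p, w, transition_detected(0, w, p), transition_detected(3, w, p))
```

A reliable tag needs only a short window, while a weakly read tag pushes the window size up, which is the adaptive behaviour the per-tag cleaning aims for.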
Normal Sliding window….
• Epoch based mid-point sliding window
• Emits a reading with an epoch value corresponding to
the middle of the window
• In the first window, $p_i^{avg}$ demands a larger window
• Thus window size is increased
• In the first window the number of readings decreases
significantly (and statistically)
• Thus a transition is likely to have occurred; so window
SMURF – Multi-tag aggregate
• Similar to per-tag cleaning, the window for multi-tag cleaning is
Here, pavg is the average per-epoch sampling probability over all
• To detect a transition in the population count, we estimate the
population count of two windows $[t - w_i, t]$ and $[t - w_i/2, t]$, with
true populations $N_w$ and $N_{w'}$
• Thus, a transition is flagged when the difference
between the two estimates exceeds the limit:
$|\hat{N}_w - \hat{N}_{w'}| > 2(\sigma_w + \sigma_{w'})$
SMURF – Multi-tag aggregate
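• To estimate the population count, we use π-estimators. The estimated
population count is given by $\hat{N}_w = \sum_{i \in S_w} 1/\pi_i$,
where $S_w$ is the set of tags observed in the window
• Similarly, by π-estimators and assuming independence across different
tags, the variance of the estimate is estimated by
$\hat{\sigma}_w^2 = \sum_{i \in S_w} (1 - \pi_i)/\pi_i^2$
• Here $\pi_i$ is the probability of reading tag $i$ at least once
during the whole window, given by $\pi_i = 1 - (1 - p_i^{avg})^{w_i}$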
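A minimal numerical sketch of the multi-tag estimate follows. It assumes the standard π-estimator (Horvitz-Thompson) form, in which each observed tag contributes 1/π_i to the count and (1 - π_i)/π_i² to the variance estimate, with π_i = 1 - (1 - p_i)^w; the function names and sample read rates are illustrative.

```python
import math

def pi_estimate(read_rates, w):
    """Estimate the population count from the tags observed in a window.

    read_rates: average per-epoch read rate of each observed tag.
    w: window size in epochs.
    Returns (estimated count, estimated standard deviation).
    """
    n_hat = 0.0
    var_hat = 0.0
    for p in read_rates:
        pi = 1 - (1 - p) ** w          # chance of at least one read in the window
        n_hat += 1 / pi
        var_hat += (1 - pi) / pi ** 2  # assumes tags are read independently
    return n_hat, math.sqrt(var_hat)

def count_transition(est_w, sd_w, est_half, sd_half):
    """Flag a transition when the full-window and half-window estimates
    differ by more than 2 * (sd_w + sd_half)."""
    return abs(est_w - est_half) > 2 * (sd_w + sd_half)

n_hat, sd = pi_estimate([0.5, 0.5], w=1)
print(n_hat, sd)
```

Two tags that are each seen only half the time stand for about four tags in the population, with a correspondingly wide uncertainty.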
The Road ahead…
• Applications in RFID do not accept any delays in the
• Data is either present in the cache or the database; data
in the database increases processing time, and data in the
cache cannot be queried with SQL-like queries
• Anomaly detection is also an important part of object tracking
• Issues like untraceability, forward security, and database
desynchronization are still not completely resolved.
• One more serious problem with RFID is counterfeiting
• In the next stage we expect to look into some of these
References
• Xiaolei Li, Hector Gonzalez, Jiawei Han and Diego Klabjan. Warehousing and analyzing massive RFID data sets. ICDE, 2006.
• Fusheng Wang and Peiya Liu. Temporal management of RFID data. VLDB, 2005.
• Timothy Chorma, Ying Hu, Seema Sundara and Jagannathan Srinivasan. Supporting RFID-based item tracking applications in Oracle DBMS using a bitmap datatype. VLDB, 2005.
• Minos Garofalakis, Shawn R. Jeffery and Michael J. Franklin. Adaptive cleaning for RFID data streams.
• Michael J. Franklin, Wei Hong, Shawn R. Jeffery, Gustavo Alonso and Jennifer Widom. Declarative support for sensor data cleaning. In Pervasive, 2006.
• Sridhar Ramachandran, Sudarshan S. Chawathe, Venkat Krishnamurthy and Sanjay E. Sarma. Managing RFID data. VLDB, 2004.