On the Impact of Data Collection on the Quality of Signal Strength Signatures in
                           Wi-Fi Indoor Localization
                     John Nicholson and Vladimir Kulyukin
               Computer Science Assistive Technology Laboratory
                        Department of Computer Science
                              Utah State University
Logan, UT 84322-4205
ABSTRACT

Wi-Fi signals can be used to localize navigators at topological landmarks in indoor and
outdoor environments. A major issue with Wi-Fi topological localization is calibration.
This paper describes the impact of data collection on the quality of signal strength
signatures.

KEYWORDS

Visual impairment; blindness; assisted navigation; indoor localization; Wi-Fi; 802.11.

BACKGROUND

Using Wi-Fi 802.11 signals for localization is growing in popularity [1,2,3] due to their
wide deployment and affordability. Some projects, such as PlaceLab [1], use Wi-Fi to
replace or supplement GPS in outdoor environments. The objective is to minimize
calibration time by pairing single GPS readings with the Wi-Fi signal strengths available
at known locations. However, the reported localization accuracy ranges from 13 to 40
meters, which may be too inaccurate for the indoor wayfinding needs of the visually
impaired.

Another method for Wi-Fi indoor localization is to create a topological map of the
environment and develop signal signatures of selected landmarks through data collection
and pre-processing. The quality of signal signatures is critical, because Wi-Fi signals vary
over time and are susceptible to interference that comes from other wireless devices in
the same frequency range, solid objects, human bodies, and multi-path issues [2].

HYPOTHESIS

We hypothesize that the quality of signal signatures depends on the time of day when
data collection occurs.

METHOD

--------------------------------
Insert Figures 1 and 2 here
--------------------------------
Data collection was done with a wearable multi-sensor wayfinding test bed called the
Wayfinder (see Figure 1). The device uses a wireless card to collect signal strengths
from five wireless access points placed at different locations in the USU Computer
Science Department (see Figure 2). Data were collected in two ways: statically and
dynamically.
---------------------------
Insert Figure 3 here
---------------------------
Static data were collected over a period of a month and a half. Data collection for
locations 1 through 5 was completed first, and then data for the remaining locations were
collected. Data were collected once per day for each location in a group (1-5 and 6-12),
with the locations in a group sampled one immediately after another. Each location had
at least two collection positions. A collection position (see Figure 3) is where the data
collector stood while gathering signal strength readings. As in other systems [2, 3], the
user's orientation was taken into consideration: at each collection position, data were
collected for 2 minutes with the collector facing each direction of the hall. For example,
if a collection position was in a hall running north/south, data were collected for 2
minutes facing north and then 2 minutes facing south. Directionality was taken into
consideration because of the effect of the human body on the signal strength. Data
collection was performed on 10 different days, so 20 minutes of data were collected at
each collection position for each direction of the position's hall, for a total of 40 minutes
of data per collection position. A minimal sketch of this collection loop appears below.
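
To make the protocol concrete, the following minimal Python sketch illustrates the
static collection loop. The scan_access_points() helper and the 1-second sampling
interval are assumptions for illustration; actual RSSI scanning is platform-specific,
and the paper does not state a sampling rate.

    import time

    def scan_access_points():
        """Hypothetical helper: return {ap_id: rssi_dbm} for all visible APs.
        Real scanning is platform-specific (e.g., parsing `iwlist scan` output
        on Linux)."""
        raise NotImplementedError("platform-specific Wi-Fi scan goes here")

    def collect_static(location, position, direction, duration_s=120):
        """Log labeled RSSI samples for one collection position and one facing
        direction (2 minutes per direction, as described above)."""
        samples = []
        end = time.time() + duration_s
        while time.time() < end:
            reading = scan_access_points()
            # Each sample carries the landmark label it will later train on.
            samples.append((location, position, direction, time.time(), reading))
            time.sleep(1.0)  # assumed sampling interval
        return samples

    # Example: a position in a north/south hall gets 2 minutes per direction.
    # data = collect_static(location=3, position="A", direction="north")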

Dynamic data were collected by walking a series of four routes around the CS
Department. Each route was walked 15 times in each direction, yielding eight directed
routes in total. Figure 2 shows the path of one route. To record when the user was at the
locations, masking tape was placed on the floor 0.5 meters before and after each
collection position. During a walk, the user pressed a key on the system whenever they
passed over a piece of tape. All walks for all routes were completed on the same day in a
single data collection session. Note that although the static data were collected over
multiple days, all dynamic data were collected on a single day.
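
The marker-logging step can be sketched in the same way. The helper below simply
timestamps each keypress so that the RSSI stream can later be segmented into
on-location and off-location stretches; the function name and the use of the Enter key
are assumptions.

    import time

    def log_tape_markers(num_expected):
        """Record a timestamp each time the walker crosses a piece of tape
        (0.5 m before and after each collection position) and presses a key."""
        marks = []
        for _ in range(num_expected):
            input("Press Enter when crossing tape: ")
            marks.append(time.time())
        return marks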

Naive Bayes and C4.5 were used to pre-process the collected data; signal signatures
were thus the numerical classes created by these two algorithms. To measure the impact
of data collection on the quality of signal signatures, three types of validation were
attempted at run time: 1) static on static, 2) static on dynamic, and 3) dynamic on
dynamic. For the static-on-static validation, signatures were created from each day of
static data and validated against the other static datasets. For the static-on-dynamic
validation, signatures were created from each day of static data and validated against the
dynamic data. Finally, for the dynamic-on-dynamic test, signatures were created from
the dynamic data for each route and validated against the dynamic data from all the
routes. If the training and validation datasets were the same, e.g. the same day of static
data, the dataset was split so that 66% of the data were used for training and 33% for
validation. Otherwise, 100% of the training dataset and 100% of the validation dataset
were used.
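
This validation scheme can be sketched as follows, under stated assumptions:
scikit-learn's GaussianNB stands in for Naive Bayes, and DecisionTreeClassifier (a
CART variant) approximates C4.5, since the paper does not name an implementation.
The feature matrices are assumed to hold one RSSI vector (one value per access point)
per sample, with the labels holding location classes.

    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    def validate(train_X, train_y, valid_X=None, valid_y=None):
        """Same-dataset case: 66/33 split of one dataset; otherwise train on
        100% of the training set and score on 100% of the validation set."""
        if valid_X is None:  # training and validation datasets are the same
            train_X, valid_X, train_y, valid_y = train_test_split(
                train_X, train_y, train_size=0.66, random_state=0)
        results = {}
        for name, clf in (("Bayes", GaussianNB()),
                          ("C4.5-like", DecisionTreeClassifier())):
            clf.fit(train_X, train_y)
            results[name] = clf.score(valid_X, valid_y)  # fraction correct
        return results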

RESULTS
---------------------------
Insert Tables 1, 2, 3, and 4 here
---------------------------
The static-on-static and the dynamic-on-dynamic tests with identical training and
validation datasets tend to give accurate results for both classifiers, Bayes and C4.5:
94% accuracy or higher. However, this performance does not carry over to training and
validation on different datasets: when signal signatures are created from a dataset
collected on a day different from that of the validation dataset, localization accuracy
varies. Although the dynamic data were collected on only one day, it is reasonable to
conjecture that a walk over the same route on a different day would exhibit the same
problems as static data across different days, but this remains to be verified. The quality
of signal signatures thus appears to depend on the time of day when data collection
occurs. How the quality of signal signatures depends on the amount of collected data
remains to be investigated.
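
For example, the per-training-day summaries in Table 2 could be derived from the
validate() sketch above by scoring each training day against every other day and taking
the mean, maximum, and minimum (shown here for the Bayes column only; the
datasets dictionary is an assumed structure):

    def summarize_cross_day(datasets):
        """datasets: {day: (X, y)}. Returns {train_day: (avg, max, min)} of
        Bayes accuracy over all other days, as in Table 2."""
        summary = {}
        for train_day, (Xt, yt) in datasets.items():
            scores = [validate(Xt, yt, Xv, yv)["Bayes"]
                      for day, (Xv, yv) in datasets.items() if day != train_day]
            summary[train_day] = (sum(scores) / len(scores),
                                  max(scores), min(scores))
        return summary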

REFERENCES

1. Cheng, Y.-C., Chawathe, Y., LaMarca, A., and Krumm, J. (2005). Accuracy
characterization for metropolitan-scale Wi-Fi localization. In Proceedings of the 3rd
International Conference on Mobile Systems, Applications, and Services (MobiSys '05),
Seattle, Washington.
2. Ladd, A., Bekris, K., Rudys, A., Wallach, D., and Kavraki, L. (2004). On the
feasibility of using wireless Ethernet for indoor localization. IEEE Transactions on
Robotics and Automation, 20(3).
3. Seshadri, V., Zaruba, G. V., and Huber, M. (2005). A Bayesian sampling approach to
in-door localization of wireless devices using received signal strength indication. In
Proceedings of the Third IEEE International Conference on Pervasive Computing and
Communications (PerCom 2005), Kauai Island, Hawaii.

ACKNOWLEDGMENTS

The study was funded by two Community University Research Initiative (CURI) grants
from the State of Utah (2003-04 and 2004-05) and NSF Grant IIS-0346880. The authors
would like to thank Mr. Sachin Pavithran, a visually impaired training and development
specialist at the USU Center for Persons with Disabilities, for his feedback on the
localization experiments.

Author Contact Information:

Vladimir Kulyukin, Ph.D., Assistive Technology Laboratory, Department of Computer
Science, Utah State University, 4205 Old Main Hill, Logan, UT 84322-4205, Office
Phone (435) 797-8163. EMAIL: vladimir.kulyukin@usu.edu.
---------------------------
Figure 1: Wayfinder System
---------------------------




Alternative Text Description for Figure 1.
Figure 1 shows a photograph of the current Wayfinder system prototype. The system is
mounted on a vest and does not require the user to carry anything in their hands. It has
a GPS unit on one shoulder and a compass on the other shoulder. The computation unit
sits on the front of the user's chest. A numeric keypad in front allows the user to
respond to system prompts. The system produces speech output through an attached
headphone.




---------------------------
Figure 2: Map of the USU Computer Science Department with an example route shown.
Black circles represent access points. Circled numbers represent locations. The route
covers locations 1 to 5.
---------------------------
Alternative Text Description for Figure 2.
Figure 2 shows a map of the Utah State University Computer Science Department. It
shows the locations of five access points in the department, as well as the locations
used for localization. The twelve locations are the intersections of halls where a person
can turn. The figure also shows an example route that takes a path from location 1 to
location 5.




------------------------------------------------
Figure 3: Collection positions at a corner location. Black dots represent collection
positions. Collection positions were 1.5 meters from the actual location. Narrow halls
had one collection position; wide halls had two.
------------------------------------------------




Alternative Text Description for Figure 3.
Figure 3 shows an example of collection positions for a corner where a narrow hall and a
wide hall intersect. There is one collection position in the middle of the narrow hall 1.5
meters from the corner. There are two collection positions in the wide hall which divide
the width of the hall into thirds. They are also 1.5 meters from the corner.
---------------------------
Table 1: Static-on-static results when the training and validation datasets are from the
same day. The Bayes and C4.5 columns give the fraction of validation samples
correctly classified by each algorithm.
---------------------------
Dataset Day                          Bayes                       C4.5
2005-01-03                           0.982892                    0.999602
2005-01-04                           0.979402                    0.999830
2005-01-05                           0.970853                    0.999678
2005-01-11                           0.988162                    0.999830
2005-01-12                           0.968064                    0.999016
2005-01-13                           0.992520                    0.999792
2005-01-20                           0.954853                    0.998335
2005-01-26                           0.943970                    0.998845
2005-02-01                           0.974570                    0.999659
2005-02-02                           0.957489                    0.997861
---------------------------
Table 2: Static-on-static results when the training and validation datasets are from
different days. The data from the training day were used to classify all the other days.
The Average columns give the mean accuracy over the validation days; the Max and
Min columns give the highest and lowest fraction correct achieved on any validation
day.
---------------------------
Training Day   Bayes Average   Bayes Max   Bayes Min   C4.5 Average   C4.5 Max   C4.5 Min
2005-01-03 0.963299 0.985881 0.938110 0.947295 0.985117 0.900284
2005-01-04 0.936516 0.979617 0.873530 0.928698 0.979009 0.856595
2005-01-05 0.946472 0.986248 0.871559 0.920312 0.957741 0.869446
2005-01-11 0.949799 0.978475 0.930835 0.952160 0.976834 0.925226
2005-01-12 0.940417 0.986158 0.854863 0.929767 0.991057 0.859771
2005-01-13 0.937436 0.973018 0.856570 0.920374 0.972020 0.845039
2005-01-20 0.912735 0.978265 0.827666 0.887077 0.956317 0.753148
2005-01-26 0.936857 0.979829 0.843126 0.943860 0.975973 0.915822
2005-02-01 0.935468 0.989429 0.865968 0.931121 0.976527 0.867669
2005-02-02 0.934006 0.989783 0.846740 0.930395 0.981845 0.836240




---------------------------
Table 3: Static-on-dynamic results. The Route column is the route used for validation;
the route "both" means data from both routes 1 and 5 were used. The Bayes and C4.5
columns give the fraction of validation samples correctly classified by each algorithm.
---------------------------
Training Day               Route   Bayes      C4.5
2005-01-03                 1       0.916771   0.872834
2005-01-03                 5       0.889491   0.867207
2005-01-03                 both    0.903284   0.870052
2005-01-04                 1       0.891945   0.885039
2005-01-04                 5       0.860265   0.866719
2005-01-04                 both    0.876283   0.875982
2005-01-05                 1       0.885456   0.789962
2005-01-05                 5       0.894362   0.878775
2005-01-05                 both    0.889859   0.833870
2005-01-11                 1       0.840686   0.896172
2005-01-11                 5       0.868668   0.865136
2005-01-11                 both    0.854520   0.880828
2005-01-12                 1       0.888492   0.879324
2005-01-12                 5       0.876279   0.860753
2005-01-12                 both    0.882454   0.870142
2005-01-13                 1       0.804310   0.785378
2005-01-13                 5       0.904165   0.880723
2005-01-13                 both    0.853677   0.832516
2005-01-20                 1       0.790439   0.865750
2005-01-20                 5       0.874452   0.856612
2005-01-20                 both    0.831974   0.861232
2005-01-26                 1       0.879086   0.908912
2005-01-26                 5       0.890283   0.877679
2005-01-26                 both    0.884621   0.893471
2005-02-01                 1       0.799131   0.879145
2005-02-01                 5       0.908427   0.879688
2005-02-01                 both    0.853165   0.879414
2005-02-02                 1       0.826874   0.891647
2005-02-02                 5       0.891805   0.892840
2005-02-02                 both    0.858975   0.892237
---------------------------
Table 4: Dynamic-on-dynamic results. The route "both" means data from both routes 1
and 5 were used. The Bayes and C4.5 columns give the fraction of validation samples
correctly classified by each algorithm.
---------------------------
Training Route             Validation Route   Bayes             C4.5
1                          1                  0.944843          0.996148
1                          5                  0.603385          0.713833
1                          both               0.776406          0.856777
5                          1                  0.554921          0.642496
5                          5                  0.982274          0.999463
5                          both               0.767105          0.818549
both                       1                  0.882062          0.991129
both                       5                  0.899476          0.991963
both                       both               0.890315          0.991856
