Information-Statistical Approach for a Strategic Planning of a ...


Transcript

  • 1. Information-Statistical Approach for a Strategic Planning of a Community-Based Wireless Project

Wei-Tsu Yang1 and Bon K. Sy2
1 Queens College/CUNY, Computer Science Department, Flushing NY 11367, U.S.A. weyang@scils.rutgers.edu
2 Queens College/CUNY, Computer Science Department, Flushing NY 11367, U.S.A. bon@bunny.cs.qc.edu

Abstract. The objective of this paper is to apply an information-statistical data mining technique to achieve two specific goals related to building a community-based wireless infrastructure. The first goal is to discover from data the characteristics behind a successful project for building a community-based wireless infrastructure; e.g., NYCwireless. The second goal is to estimate the node distribution of a wireless infrastructure if such a community-based wireless project were funded and expanded to Queens County in New York City. The first step is to discover statistically significant event patterns that help explain the characteristics of the NYCwireless project. An information-statistical approach is then applied to discover an optimal probability model, based on the Shannon entropy criterion, for estimating the node distribution of a projected wireless infrastructure in Queens County.

1 Introduction

NYCwireless [1] is a non-profit organization that provides free wireless Internet service over radio connections to mobile users in public spaces, such as coffee shops and parks, throughout the New York City metropolitan area. Each node is operated independently by volunteers using their own equipment. By the end of April 2002, over ninety network nodes were listed in the NYCwireless database for the New York City metropolitan area. More than half of them (49) are located in Manhattan. In contrast, Queens County accounts for less than 10% of the nodes. Yet according to the population survey conducted by the U.S. Census Bureau [2], Queens County has a larger population than Manhattan.
In this paper, we attempt to answer two specific questions that may provide valuable information for strategic planning and assist the decision-making process behind building a community-based wireless infrastructure in Queens County:
1. What are the characteristics, quantified by statistically significant event patterns, of the active participants who contributed to the community-based wireless infrastructure in Manhattan?
  • 2.
2. Based on the information revealed by statistically significant event patterns, what is the optimal probability model, with respect to the Shannon entropy criterion, that can be used as a basis for projecting the spatial distribution of wireless nodes in Queens County if resources identical to those NYCwireless invested in Manhattan are applied to Queens?

2 Information Statistical Approach

Since the primary goal of our project is to discover the relationship between event patterns and the distribution of wireless nodes, an information-statistical approach is chosen. Such an approach is well suited to uncovering unknown event patterns that are statistically significant [3]. The choice of representation and characteristics for capturing the behavior of wireless node owners is still an open issue. In this research, we choose a representation framework based on multi-valued variables, and we apply an information-statistical approach to reveal the information embedded in the event patterns.

The information-statistical approach to data mining is built upon the concept of patterns, which is common in the data mining community [4]. Grenander has discussed extensively a general concept of patterns from the perspective of applied mathematics, with application to understanding the relationship between image set patterns and statistical geometry [5,6,7]. One notion of patterns that we have explored is to capture the meaning and the quality of the information embedded in data. In comparison to the concept of patterns discussed by Grenander, one interesting aspect we have found is the possibility of interpreting joint events of discrete random variables that survive a statistical hypothesis test of interdependency as statistically significant association patterns.
In doing so, well-established previous work [8,9,10,11,12,13] may be used to provide a unified framework for linking information theory with statistical analysis. The significance of such a linkage is that it provides a basis not only for using statistical approaches to reveal hidden significant association patterns, but also for using information theory as a measurement instrument to determine the quality of the information obtained from statistical analysis. For further details on the application of statistical techniques for analyzing and discovering statistical patterns, and of information theory for interpreting the meaning behind the statistical analysis, readers are referred to a report elsewhere [3]. A specific example will be shown later to illustrate the process of discovering significant event patterns using the mutual information measure and chi-square analysis.

In our proposed information-statistical data mining technique, the statistically significant event patterns just discovered are then used to identify an optimal probability model that maximizes Shannon entropy. An optimal model maximizing Shannon entropy has the property of minimizing the undesirable bias contributed by unknown information. As discussed elsewhere [14], identifying an optimal probability model based on the marginal and joint frequency information of discrete random variables is an optimization problem. Specifically, the optimization problem consists of a set of linear probability constraints, and a non-linear objective function
  • 3. due to the Shannon entropy criterion. In the next section, we describe how the proposed information-statistical approach can be applied to (1) identify the characteristics of active participants who contributed to the community-based wireless infrastructure in Manhattan, and (2) project the spatial distribution of wireless nodes in Queens County based on the findings in (1).

3 Application of Data Mining to Wireless Project Feasibility Study

According to the mission statement of NYCwireless, all of the wireless access points are managed by independent volunteers who have a broadband Internet connection. An interesting question to ask, but a difficult one to answer, is why they are willing to share their broadband Internet connection with the public at their own cost. Since each of them may have his or her own reasons, we approach the question from a slightly different perspective: we attempt to find out what characteristics these volunteers have in common. According to the surveys of U.S. households with Internet access obtained from the National Telecommunications and Information Administration (NTIA) and from the Economics and Statistics Administration (ESA), which use the U.S. Census Bureau Current Population Survey Supplements [15,16], family income, age, and educational attainment are three main factors affecting Internet use in America. These three factors are therefore chosen as a basis for understanding the NYCwireless access point distribution.

3.1 Discovering the characteristics

Using these three parameters (family income, age, and educational attainment), we proceed to characterize the active participants who contributed to the community-based wireless infrastructure in Manhattan. The data set used in this study is drawn from the 1990 Census Data Lookup server [2].
The survey was conducted by the U.S. Census Bureau in 1990. We are aware of the issue of "synchronizing" data sets with different time frames (see Discussion for detail). The data set used in this study will be referred to as DS1990. A total of 92 ZIP codes are listed in the 1990 Census Data for Manhattan. However, only 38 of them actually have valid data. Thus, only these 38 zip codes in DS1990 are included in this study. Using zip code as a basis to partition the 1990 Census Data for Manhattan, four simple frequency counts were performed on each partition: (1) the number of individuals aged between 25 and 49, (2) the number of individuals with an educational attainment at or above a college bachelor degree, (3) the number of individuals with a personal income level of more than $25,000, and (4) the number of wireless access points. We then computed the mean μ and the standard deviation σ for each of the four frequency counts using the tallies of all zip codes. Four parameters are defined for the frequency information as relevant to zip codes. They are AG: individuals aged between 25 and 49 from all zip codes, ED: educational
  • 4. attainment at or above a college bachelor degree from all zip codes, IN: personal income level of more than $25,000 from all zip codes, and NO: the number of wireless access points from all zip codes. To convert these four parameters into discrete random variables, each parameter is quantized into four possible states {1, 2, 3, 4}, with each state representing an interval bounded by one standard deviation σ from the mean. For example, AG = 1 means that the frequency count of individuals aged between 25 and 49 in a zip code is more than one standard deviation below the mean; i.e., below μ-σ. Pr(AG=1) is the percentage a/b * 100% of zip code partitions that fall in this state. Within a zip code (say, indexed by i), we count the number of individuals aged between 25 and 49, denoted by t_i. We then calculate the mean $\mu = \frac{1}{b}\sum_{i=1}^{b} t_i$ (and similarly the standard deviation σ), where b is the number of zip codes. Next we determine, for each zip code, whether t_i is less than μ-σ. The number of zip codes for which t_i is less than μ-σ is a. The percentage a/b * 100% defines Pr(AG=1) (see Appendix 1 for detail).

3.2 Results

An S-PLUS function was written for detecting event associations based on the following statistical hypothesis:
(1) [equation not preserved in this transcript]
where the mutual information analysis is represented by:
(2) [equation not preserved in this transcript]
The function returns every pattern found, as a list, whenever the test condition is satisfied [condition not preserved in this transcript]. An instantiation of all four variables is an event pattern. In other words, an event pattern is a 4-tuple of variable-value pairs. Out of 256 (4*4*4*4) possible event patterns, our search function detects 26 events as significant event patterns (see Appendix 2 for detail). We then cross-validated the results by an alternative approach discussed in [3]. 18 of the 26 significant patterns found were chosen based on the union of the two result sets. The next step is to derive the probability of each pattern in order to formulate probability constraints.
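The quantization of Section 3.1 and a per-pattern association measure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' S-PLUS function: the counts are hypothetical, the use of the population standard deviation and the handling of values that fall exactly on a cut point are assumptions, and `pattern_mutual_information` uses the standard pointwise mutual information (log2 of the joint probability over the product of the marginals), which may differ from the paper's equation (2).

```python
import math
from statistics import mean, pstdev


def quantize(count, mu, sigma):
    """Map a frequency count to a state in {1, 2, 3, 4}.

    Cut points are mu - sigma, mu, and mu + sigma (see Appendix 1).
    Assumption: a count exactly on a cut point goes to the higher state.
    """
    if count < mu - sigma:
        return 1
    if count < mu:
        return 2
    if count < mu + sigma:
        return 3
    return 4


def quantize_all(counts):
    """Quantize one variable (e.g. AG) across all zip codes."""
    mu = mean(counts)
    sigma = pstdev(counts)  # assumption: population standard deviation
    return [quantize(c, mu, sigma) for c in counts]


def state_probability(states, state):
    """Pr(variable = state) = a / b, with b the number of zip codes."""
    return states.count(state) / len(states)


def pattern_mutual_information(records, pattern):
    """Pointwise mutual information of one event pattern, in bits.

    records: list of state tuples, one per zip code.
    pattern: the state tuple whose association is measured.
    """
    n = len(records)
    joint = sum(1 for r in records if r == pattern) / n
    if joint == 0:
        return float("-inf")
    marginal_product = 1.0
    for k, value in enumerate(pattern):
        marginal_product *= sum(1 for r in records if r[k] == value) / n
    return math.log2(joint / marginal_product)


# Hypothetical counts of individuals aged 25 to 49 in six zip codes.
ag_counts = [0, 40, 50, 50, 60, 100]
ag_states = quantize_all(ag_counts)
print(ag_states)
print(state_probability(ag_states, 1))
```

A pattern whose joint frequency exceeds what independent marginals would predict gets a positive score; in the paper this measure is combined with a chi-square test before a pattern is declared significant.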
Since X1:1 in our data set represents no wireless node, it does not provide a useful answer for our prediction. Hence, we exclude all discovered significant patterns with X1:1. In addition, three marginal probability terms, Pr(X1:3) = 0.18, Pr(X2:4) = 0.16, and Pr(X3:4) = 0.26, are also left out of our probability constraints, because an optimal probability model will automatically produce values close to the actual ones anyway. The optimal model is derived by applying the probability model discovery algorithm discussed elsewhere [14]. Out of 256 joint probability terms of the joint
  • 5. probability model, there are twenty-two non-zero probability terms (as listed in Appendix 3). Since we are interested in projecting the spatial distribution of wireless nodes in Queens County, we focus on the most probable event patterns that are also statistically significant. Note that (X1:3, X2:3, X3:4, X4:4) is the most probable event pattern that is also statistically significant. Together with the other significant event patterns involving X1:3, namely (X1:3, X2:3, X3:3, X4:3), (X1:3, X2:3, X3:3, X4:4), and (X1:3, X2:4, X3:4, X4:4), these event patterns reveal information that can be stated in the following sentence: In an area characterized by its zip code, if the frequency count of individuals aged between 25 and 49 is equal to or above the overall mean μ, and the frequency count of individuals with an educational attainment at or above a college bachelor degree is equal to or above the overall mean μ, and the frequency count of individuals with a personal income level of more than $25,000 is equal to or above the overall mean μ, then we can project that there will be multiple wireless network nodes in that zip area.

3.3 Evaluation of our prediction

Applying the above model to the adjusted (see Discussion for detail) 1990 census data for Queens County, 14 zip areas (11104, 11354, 11355, 11357, 11365, 11367, 11368, 11372, 11373, 11374, 11375, 11377, 11385, 11435) are projected to have one or more wireless network nodes. As of 01/13/2003, among the 63 zip areas in Queens County, there are more than thirty wireless network nodes, located in the following 13 zip areas: 11103, 11104, 11354, 11355, 11359, 11364, 11365, 11367, 11373, 11374, 11375, 11416, 11428. The number of unique zip areas in the two lists combined is 19.
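The overlap between the projected and the observed zip areas can be checked with a few set operations. The sketch below uses only the two lists given above; the variable names are illustrative, and the error rates follow the paper's convention of dividing by the number of distinct zip areas in the two lists combined, which is an assumption about the intended denominator rather than the usual definition of these rates.

```python
# Zip areas projected by the model to have one or more wireless nodes.
projected = {
    11104, 11354, 11355, 11357, 11365, 11367, 11368,
    11372, 11373, 11374, 11375, 11377, 11385, 11435,
}

# Zip areas with wireless nodes actually present as of 01/13/2003.
observed = {
    11103, 11104, 11354, 11355, 11359, 11364, 11365,
    11367, 11373, 11374, 11375, 11416, 11428,
}

hits = projected & observed             # projected and present
false_negatives = observed - projected  # present but not projected
false_positives = projected - observed  # projected but not present
all_areas = projected | observed        # distinct zip areas in both lists

# Rates relative to the number of distinct zip areas (the paper's convention).
fn_rate = len(false_negatives) / len(all_areas)
fp_rate = len(false_positives) / len(all_areas)

print(len(hits), len(all_areas))
print(f"false negative rate: {100 * fn_rate:.1f}%")
print(f"false positive rate: {100 * fp_rate:.1f}%")
```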
Eight of the 13 actual zip areas (those with wireless nodes present) are among the 14 projected zip areas. Therefore, there are five (13 - 8) false negative cases and six (14 - 8) false positive cases. Hence, the false negative error rate is 5/19 = 26.3% and the false positive error rate is 6/19 = 31.6%.

4 Discussion

The Census 1990 data set used for data mining has the following data quality issues:
1. For both Manhattan and Queens Counties, there is an integrity issue; e.g., the sum of individuals does not equal the reported total count in some cases. This occurred in each of the three selected factors. A normalization factor has been applied to correct the problem.
2. The data on the Census Data Lookup server was collected in 1990; however, the wireless network node distribution data drawn from NYCwireless is the most up-to-date available. These two data sets are not in the same time frame, so we have a data inconsistency problem.
  • 6. In order to calibrate our data sets, we estimated the 2001 population data based on the population percent change from 1990 to 2000 and applied the adjustment accordingly.

In addition to data calibration in terms of time, we believe data calibration in terms of population density is essential as well. The population of sub-areas in Manhattan varies considerably. As a result, if we simply used frequency counts drawn from the Census Data (with time adjusted) without proper calibration, the data mining would be subject to bias. To calibrate the frequency counts for each zip code in Manhattan, we multiply the time-adjusted value by the ratio of the total population of New York County to the sub-population of that zip code. The data set pertinent to Queens County is calibrated in the same manner.

Over 30% of the ZIP code data for Manhattan is missing from the 1990 census data set. Although this affects data quality, the data set is still by far the most comprehensive one in terms of the socioeconomic and demographic variables available. We chose to eliminate sub-areas with missing data from our analysis. In light of the discussion above, we are aware of the potential bias that might exist in our data analysis due to data quality issues. In the future, when Census 2000 data or other reliable data sets on socioeconomic and demographic variables are available for public lookup, follow-up research based on the new data set will be conducted.

5 Conclusion

Several useful discoveries regarding the distribution of NYCwireless network nodes were made in this research. In particular, the numbers of individuals aged between 25 and 49, with an educational attainment at or above a college bachelor degree, and with a personal income level of more than $25,000 are strongly associated with the distribution of wireless access points in Manhattan.
Based on the optimal model found by the information-statistical approach, 14 zip areas in Queens County are projected to have one or more wireless nodes. Although our projection has a relatively high false positive rate, its false negative rate is lower. Given that the characteristics of the participants who are willing to contribute to the community-based wireless infrastructure are very difficult to define, we believe our projection is valuable for predicting the spatial distribution of wireless nodes in Queens County. The performance of the information-statistical approach will be the focus of our follow-up research. It will be interesting to see how other data mining algorithms perform against it in terms of false negative rate and false positive rate.

Some other parameters, such as political party affiliation, profession, and the business density of sub-areas, are not covered in this paper. Political party affiliation is a sociological factor that may provide insights into a person's behaviors and decisions. A public-minded person might adopt the idea of the community-based NYCwireless project more quickly than others. A person's profession may also have a decisive influence on his or her ability to run a wireless access point. The density of coffee shops and bars in a sub-area is vital for evaluating the maximum number of beneficiaries of wireless access points. It is also a factor that may influence the existence of a wireless access point. A person might
  • 7. be reluctant to join the community if he or she does not expect anyone in the neighborhood to use the access point. These factors will also be a focus of our future study.

6 Acknowledgement

This work is supported in part by NSF DUE CCLI grant #0088778 and a PSC-CUNY Research Award.

References

1. NYCwireless. (WWW http://www.nycwireless.net/)
2. 1990 Census Data. (WWW http://homer.ssd.census.gov/cdrom/lookup)
3. Sy B.K., "Information-Statistical Pattern Based Approach for Data Mining," Journal of Statistical Computation and Simulation, Gordon and Breach Publishing Group, NJ, 69(2), 2001.
4. Fayyad U.M., Piatetsky-Shapiro G., and Smyth P., "From Data Mining to Knowledge Discovery: An Overview," in Advances in Knowledge Discovery and Data Mining (editors: Fayyad U.M., Piatetsky-Shapiro G., Smyth P., and Uthurusamy R.), chapter 1, pp. 1-34, AAAI Press / MIT Press, 1996.
5. Grenander U., Chow Y., Keenan K.M., HANDS: A Pattern Theoretic Study of Biological Shapes, Springer-Verlag, New York, 1991.
6. Grenander U., General Pattern Theory, Oxford University Press, Oxford, 1993.
7. Grenander U., Elements of Pattern Theory, The Johns Hopkins University Press, ISBN 0-8018-5187-4, 1996.
8. Chen J. and Gupta A.K., "Information Criterion and Change Point Problem for Regular Models," Technical Report No. 98-05, Department of Math. and Stat., Bowling Green State U., Ohio.
9. Cover T.M. and Thomas J.A., Elements of Information Theory, Wiley, 1991.
10. Good I.J., "Weight of Evidence, Correlation, Explanatory Power, Information, and the Utility of Experiments," Journal of the Royal Statistical Society, Ser. B, 22:319-331, 1960.
11. Haberman S.J., "The Analysis of Residuals in Cross-classified Tables," Biometrics, 29:205-220, 1973.
12. Kullback S. and Leibler R., "On Information and Sufficiency," Ann. Math. Statistics, 22:79-86, 1951.
13. Kullback S., Information and Statistics, Wiley and Sons, New York, 1959.
14. Sy B.K., "Probability Model Selection Using Information-Theoretic Optimization Criterion," Journal of Statistical Computation and Simulation, Gordon and Breach Publishing Group, NJ, 69(3), 2001.
15. A Nation Online: How Americans Are Expanding Their Use Of The Internet. (WWW http://www.ntia.doc.gov/ntiahome/dn/html/Chapter2.htm)
16. U.S. households with Internet access. (WWW http://www.ntia.doc.gov/ntiahome/dn/hhs/HHSchartsindex.html)
17. Information-Statistical Pattern based Approach for Data Mining. (WWW http://bonnet2.geol.qc.edu/jscs9901.html)
  • 8. Appendix 1. The summarization of symbols and variables

Variable  Symbol  State
NO        X1      {1,2,3,4}
AG        X2      {1,2,3,4}
IN        X3      {1,2,3,4}
ED        X4      {1,2,3,4}

State description:

X1:
1 = 'The number of wireless network nodes is below μ-σ'
2 = 'The number of wireless network nodes is between μ-σ and μ'
3 = 'The number of wireless network nodes is between μ and μ+σ'
4 = 'The number of wireless network nodes is above μ+σ'

X2:
1 = 'Sub-population of those aged between 25 and 49 is below μ-σ'
2 = 'Sub-population of those aged between 25 and 49 is between μ-σ and μ'
3 = 'Sub-population of those aged between 25 and 49 is between μ and μ+σ'
4 = 'Sub-population of those aged between 25 and 49 is above μ+σ'

X3:
1 = 'Total population of personal income level more than $25,000 is below μ-σ'
2 = 'Total population of personal income level more than $25,000 is between μ-σ and μ'
3 = 'Total population of personal income level more than $25,000 is between μ and μ+σ'
4 = 'Total population of personal income level more than $25,000 is above μ+σ'

X4:
1 = 'Total population of educational attainment more than college is below μ-σ'
2 = 'Total population of educational attainment more than college is between μ-σ and μ'
3 = 'Total population of educational attainment more than college is between μ and μ+σ'
4 = 'Total population of educational attainment more than college is above μ+σ'

Appendix 2. The result of significant event patterns discovery

Event pattern            Mutual Information  Chisquare/2N
X1:1, X2:1, X3:1, X4:1   3.583911            0.2651224
X1:1, X2:1, X3:2, X4:1   1.883471            0.02579801
X1:1, X2:2, X3:1, X4:1   2.261983            0.03953932
X1:1, X2:1, X3:2, X4:2   2.21362             0.03755223
X1:1, X2:2, X3:4, X4:4   1.592132            0.01771876
  • 9.
X1:1, X2:3, X3:3, X4:3   2.84367             0.06997035
X1:1, X2:3, X3:4, X4:4   2.106705            0.06682213
X1:1, X2:4, X3:2, X4:2   3.950586            0.3559456
X1:1, X2:4, X3:2, X4:3   3.172978            0.09381503
X1:1, X2:4, X3:3, X4:3   4.066063            0.1948605
X1:2, X2:1, X3:1, X4:1   4.284351            0.6922545
X1:2, X2:1, X3:2, X4:2   2.651025            0.05842556
X1:2, X2:2, X3:1, X4:1   2.37746             0.04458763
X1:2, X2:2, X3:2, X4:1   3.261983            0.3038388
X1:2, X2:3, X3:2, X4:3   2.066063            0.03192413
X1:2, X2:3, X3:3, X4:4   2.736755            0.1267294
X1:2, X2:4, X3:4, X4:4   2.444575            0.0477283
X1:3, X2:3, X3:3, X4:3   3.736755            0.1500842
X1:3, X2:3, X3:3, X4:4   2.514363            0.05116419
X1:3, X2:3, X3:4, X4:4   3.99979             0.7400093
X1:3, X2:4, X3:4, X4:4   3.222182            0.09788331
X1:4, X2:1, X3:1, X4:2   4.351465            0.24293
X1:4, X2:2, X3:2, X4:2   4.329097            0.4776154
X1:4, X2:2, X3:2, X4:3   3.55149             0.1290799
X1:4, X2:3, X3:3, X4:4   2.736755            0.06336469
X1:4, X2:3, X3:4, X4:4   2.222182            0.03789873

Appendix 3. Probability Model for Pattern Inference

Table 1. Probability model for pattern inference

Index  X1  X2  X3  X4  Pr(X1,X2,X3,X4)
3      1   1   1   3   0.108
4      1   1   1   4   0.0174
18     1   2   1   2   0.0046
24     1   2   2   4   0.0462
33     1   3   1   1   0.0888
38     1   3   2   2   0.0452
65     2   1   1   1   0.079
69     2   1   2   1   0.0386
70     2   1   2   2   0.026
  • 10.
76     2   1   3   4   0.0444
85     2   2   2   1   0.079
108    2   3   3   4   0.053
152    3   2   2   3   0.026
171    3   3   3   3   0.026
172    3   3   3   4   0.026
176    3   3   4   4   0.105
192    3   4   4   4   0.026
194    4   1   1   2   0.026
214    4   2   2   2   0.053
215    4   2   2   3   0.026
236    4   3   3   4   0.026
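The Index column of Table 1 is not explained in the text. It appears consistent with a base-4 enumeration of the 256 joint events in which X4 varies fastest; this is an inference from the listed rows, not something the paper states. Under that assumed formula every listed row matches except the one printed with index 152, whose states (3, 2, 2, 3) map to 151. The function names below are illustrative.

```python
def pattern_index(x1, x2, x3, x4):
    """Assumed 1-based index of the joint event (X1, X2, X3, X4).

    Each state is in {1, 2, 3, 4}; X4 varies fastest.
    """
    return 64 * (x1 - 1) + 16 * (x2 - 1) + 4 * (x3 - 1) + x4


def pattern_from_index(index):
    """Inverse of pattern_index: recover the four states from an index."""
    i = index - 1
    x4 = i % 4 + 1
    i //= 4
    x3 = i % 4 + 1
    i //= 4
    x2 = i % 4 + 1
    x1 = i // 4 + 1
    return (x1, x2, x3, x4)


print(pattern_index(3, 3, 4, 4))
print(pattern_from_index(176))
```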