
A Method to Estimate the Extent of Functional Urban Areas at the Global Level



This document describes a method to define the boundaries of urban economic agglomerations, or functional urban areas (FUAs), at the global scale, covering countries where the official EC-OECD method cannot be applied because commuting flow data are lacking. The method builds on the EC-OECD definition of FUAs, which are composed of urban centres of at least 50 thousand people surrounded by interconnected commuting zones.



Preliminary version. Please do not quote.
January 2018

Ana I. Moreno Monroy, OECD
Marcello Schiavina, Joint Research Centre, European Commission
Paolo Veneri, OECD

This document describes a method to define the boundaries of urban economic agglomerations, or functional urban areas (FUAs), at the global scale, covering countries where the official EC-OECD method cannot be applied because commuting flow data are lacking. The method builds on the EC-OECD definition of FUAs, which are composed of urban centres of at least 50 thousand people surrounded by interconnected commuting zones. The estimation of commuting zone borders relies on gridded population data with regular cells of one km2 across the globe. The probability that a one-km2 cell with at least 300 inhabitants is part of a FUA is estimated with a logit model that uses characteristics at both cell and country level. Results show that around 55% of the world population live in 9 840 FUAs, out of which 12% live in their commuting zones.

Introduction

A commonly accepted definition of what a city is requires identifying agglomerations of people in space. High density of residential population is generally the key information required to identify such agglomerations. However, people move daily across space from their place of residence for several reasons, such as work, consumption and recreation. The space encompassed by these daily movements often crosses local and regional administrative boundaries and extends beyond the immediate boundaries of the highest density areas. As a result, neither administrative nor morphological definitions are fully adequate to detect the actual geography of the interactions that define a city from a socio-economic (i.e. functional) point of view.
In order to overcome this challenge, many countries have started to adopt functional definitions of cities, giving policy makers, analysts and citizens a benchmark geography for designing urban policy effectively and making sound international comparisons. Among the existing functional definitions of cities are functional urban areas (FUAs), also referred to as urban economic agglomerations from now on. FUAs consist of high density places called urban centres, which are surrounded by commuting zones. Commuting zones encompass the area of influence of an urban centre from the point of view of the labour market. Each urban centre is associated with a single FUA, while several urban centres can belong to the same (polycentric) FUA when their labour market areas overlap. Commuting zones are identified using data on commuting flows between small local administrative or statistical enumeration units and provide an approximation of the spatial extent of the labour market. The sum of the urban centres and their surrounding commuting zones determines the total extent of FUAs. FUAs are used as units of analysis in the OECD Metropolitan Database as well as in the Urban Audit Database published by Eurostat for European countries.

The method to consistently identify FUAs across countries was developed by the OECD jointly with the European Commission (OECD, 2012). The main advantages of FUAs are twofold: they constitute a benchmark geography to account for the socio-economic extent of cities and to design urban policy, and they allow robust comparisons of FUA-level information across different countries. Up to now, the EC-OECD FUA definition has been applied only in most OECD countries and Colombia, mostly because data on commuting flows are not usually available in non-OECD countries.

The work summarized in this document aims at extending the concept of FUAs to cover countries where the EC-OECD FUA definition cannot be applied due to lack of commuting flow data. The document outlines a method to approximate the boundaries of FUAs, based on the idea that the extent of commuting zones around any urban centre can be inferred from the properties of known functional urban areas.

The proposed method is fully reproducible and applicable to any country, as it relies only on two publicly available one-km2 gridded global datasets. The first is the satellite-derived Global Human Settlements Layer (GHSL) (Pesaresi et al, 2016), which provides built-up and population distribution grids and a classification of settlements according to their density. The second is a global travel impedance matrix (Weiss et al, 2018), which makes it possible to estimate travel times consistently between any given cell and its closest urban centre. Compared to other alternatives, data derived from satellite imagery to classify urban patterns have the advantages of being low cost, comparable and available in low-income settings (Burchfield et al, 2006; Pesaresi et al, 2016). The proposed method uses urban centres of at least 50 thousand people identified using the GHSL.

Note: Any views expressed herein are those of the authors and do not necessarily reflect those of the OECD, the European Union or its member countries. Any errors remain the authors'. Email: (Contact author).
It then approximates commuting zones by relying on the estimated probability that a cell with a certain population density is part of a FUA, conditional on the distance of the cell to the closest urban centre, the size of the urban centre and the size of the cell. The parameters used to calculate this probability are the outcome of a logit model in which each one-km2 cell with at least 300 inhabitants in OECD countries is coded as one if it falls within a FUA border and zero otherwise.

The method presented here allows the estimation of 9 840 FUA borders based on 11 223 urban centres in 136 countries. The results show that in 2015, about 3.63 billion people, or 55% of the total population of the selected countries, lived in FUAs, out of which 12% lived in commuting zones. North America hosts the largest percentage of people in FUAs across regions of the world (69%), followed by Latin America (59%). In more developed countries, a larger share of people living in FUAs are in commuting zones (23%) compared to least developed (10%) and less developed countries (9%).

The document is organized as follows. Section 2 outlines the empirical approach, including subsampling choices and the method to draw borders. Section 3 presents the results. Section 4 concludes. Technical details on the estimation and validation can be found in Annex B.

Estimation method

Sub-selecting cells above a certain population threshold

The empirical approach uses information contained in EC-OECD Functional Urban Area (baseline) boundaries to predict boundaries in countries where there is no information on commuting flows. This section explains some methodological choices regarding which information is used for the estimation.

A FUA is composed of an urban centre and a commuting zone, the latter made up of cells with varying populations but with no observed contiguity of high density cells. Any existing FUA can be broken down into one-km2 cells.
Figure 1 summarizes the hierarchy of cell aggregations within and outside FUAs.
Figure 1. Hierarchy of spatial aggregations of grid cells
  Cell (1 km2)
  Urban centre: area containing 50 000 people or more, made up of contiguous cells each with at least 1 500 people
  Compact suburb: area containing 5 000 people or more, made up of contiguous cells each with more than 300 people
  Low density/rural area: sum of cells each with a population lower than 300 people
Source: Elaboration based on Dijkstra and Poelman (2014).

Generally, the method to construct FUA boundaries involves two steps, summarized in Figure 2:
Step 1: Assign every cell to a unique urban centre, to ensure that no two FUAs overlap.
Step 2: Demarcate the dividing line between the sub-group of cells that belong to a FUA and those that do not.

Figure 2. Simplified outline of method to estimate FUA borders

With no information on commuting flows available for all countries, the boundaries of commuting zones can be approximated by modelling the probability of commuting from each cell to the closest urban centre and establishing a threshold value for this probability that separates cells that belong to a FUA from those that do not.
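The urban centre category of Figure 1 can be illustrated with a small sketch. This is not the GHSL or authors' implementation: the function names, the toy grid and the use of 4-connectivity (cells sharing an edge) to define contiguity are assumptions made for illustration only.

```python
from collections import deque


def clusters(grid, min_cell_pop):
    """Group edge-adjacent cells whose population is >= min_cell_pop.

    Contiguity by shared edges (4-connectivity) is an assumption here.
    """
    rows, cols = len(grid), len(grid[0])
    seen = set()
    found = []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or grid[r][c] < min_cell_pop:
                continue
            queue, members = deque([(r, c)]), []
            seen.add((r, c))
            while queue:  # breadth-first flood fill of one cluster
                y, x = queue.popleft()
                members.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and (ny, nx) not in seen and grid[ny][nx] >= min_cell_pop:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            found.append(members)
    return found


def urban_centres(grid, min_cell_pop=1500, min_total_pop=50000):
    """Keep clusters of dense cells whose total population reaches the minimum."""
    return [
        members
        for members in clusters(grid, min_cell_pop)
        if sum(grid[y][x] for y, x in members) >= min_total_pop
    ]


# Toy population grid (people per one-km2 cell), purely hypothetical.
toy_grid = [
    [20000, 20000, 0],
    [15000, 100, 0],
    [0, 0, 2000],
]
centres = urban_centres(toy_grid)
print(len(centres), sorted(centres[0]))
```

In the toy grid, the three dense cells in the upper-left corner form one cluster of 55 000 people and qualify as an urban centre, while the isolated cell with 2 000 people does not reach the total population minimum.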
For step 1 it is assumed that, among all urban centres, the closest one has the highest probability of attracting commuters from a given cell. This is a simplifying assumption, given the lack of commuting data to properly assign cells to urban centres based on commuting intensities. A dummy is then defined that equals one if the cell falls within FUA borders and zero if it falls outside. In this way, every cell in the country is uniquely assigned to an urban centre based on proximity and takes a value of 1 (0) if it falls within (outside) FUA borders.

A question that arises at this point is whether steps 1 and 2 above should be implemented for all populated cells or for a subgroup. Consider the example of Greater Adelaide (Australia), illustrated in Figure 3. Panel a) shows the baseline FUA borders over a terrain map. Panels b) and c) show a close-up of a typical compact suburban area outside the urban centre. Cells in suburban areas have higher density than cells in outer suburban areas and are consequently easier to distinguish from the lower density rural cells right outside FUA borders. If boundaries rely on higher density population cells, only a share of all low density cells beyond the estimated border (i.e. those corresponding to outer suburbs) are left out of FUAs. Additionally, these areas may have been included in the baseline FUA borders because they were part of a small administrative unit, so that excluding these low density cells partly corrects for a previous aggregation bias.

Figure 3. Greater Adelaide (AUS) FUA borders, example of a compact suburb and border based on cells with 300 people or more (panels a, b and c)
Source: ESRI World imagery, GHSL provided by the Joint Research Centre, European Commission; FUA Boundaries provided by OECD.

To further verify that 300 persons per km2 is a suitable threshold, it is worth checking how population density behaves below and above this threshold as distance from the city centre increases. Urban economic theory predicts that the proportion of higher density cells should fall as distance to the city centre increases (Clark, 1951; Brueckner, 1987). This decline should be more pronounced for higher thresholds (e.g. 300 persons per km2) than for lower thresholds (e.g. 100 persons per km2).

A faster decline implies that, for a given distance from the city centre, it is more likely that a cell with higher population (density) is correctly predicted as belonging to a FUA (i.e. there will be a higher proportion of true positives). On the other hand, including only cells with relatively high population values makes it more likely that lower density cells that belong to FUAs are left out (that is, there will be a higher proportion of false negatives).

To obtain an average density profile across all FUAs, the following strategy was adopted. First, for each FUA, the cell with the highest population (a unique "city centre" cell) is identified. The distance from each cell (of any population size above zero) to the city centre of its FUA is then estimated. For each FUA, the number of cells above a certain population threshold p̄ (100 or 300 inhabitants) is counted, which is equivalent to summing the area covered by those cells.
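The density profile strategy can be sketched for a single FUA as follows. This is a simplified illustration rather than the authors' procedure: the function name, the equal-count binning into distance deciles and the toy data are assumptions, and the sketch reports the share of cells at or above the threshold within each decile instead of the share of total city area plotted in the paper.

```python
def share_above_by_decile(cells, threshold=300, bins=10):
    """Share of cells with population >= threshold in each distance bin.

    cells: list of (distance_to_city_centre, population) tuples for one FUA.
    Bins hold (approximately) equal numbers of cells, ordered by distance.
    """
    ordered = sorted(cells)  # sort by distance to the city centre
    n = len(ordered)
    shares = []
    for b in range(bins):
        chunk = ordered[b * n // bins:(b + 1) * n // bins]
        if not chunk:
            shares.append(None)  # fewer cells than bins
            continue
        above = sum(1 for _, pop in chunk if pop >= threshold)
        shares.append(above / len(chunk))
    return shares


# Hypothetical FUA: population falls by 100 people per km from the centre.
toy_cells = [(d, 1000 - 100 * d) for d in range(10)]
profile_300 = share_above_by_decile(toy_cells, threshold=300)
profile_100 = share_above_by_decile(toy_cells, threshold=100)
print(profile_300)
print(profile_100)
```

With the toy data, the profile for the 300 threshold drops to zero in the two outermost deciles while the profile for the 100 threshold stays flat, which mirrors the argument that a higher threshold produces a sharper decline with distance.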
Figure 4 plots the average percentage of cells above and below two population thresholds at each decile of distance from the city centre across all existing FUAs. The figure is broadly representative of the profile observed across individual FUAs and suggests that the assumption of distance decay holds, as the percentage of cells above p̄ decreases as distance from the city centre increases.

Figure 4. Average percentage of cells in each distance to city centre decile by cell type, baseline FUAs
Source: GHSL provided by the Joint Research Centre, European Commission; FUA Boundaries provided by OECD.

Given that the aim is to determine which cells fall within FUA boundaries (i.e. to determine the edge of FUAs), predicting on the basis of distance whether a cell is inside or outside a FUA is more likely to succeed for cells above p̄=300 than for all cells. As anticipated, in both the inner and outer parts of the edge of the FUA there is a large percentage of low density cells. This means that the probability of finding a higher density cell is smaller at the outer end of the edge. As the line corresponding to p̄=300 decays more rapidly as distance from the centre increases, it is preferred over p̄=100. The choice of p̄=300 has the further advantage of considerably lowering the number of computations. In addition, it corresponds to the threshold officially applied by Eurostat to define 'urban clusters' in the Degree of Urbanisation taxonomy (Dijkstra and Poelman, 2014). From here onwards, the term "cell" refers to a cell with a population of at least 300 people, unless otherwise stated.

On the other hand, the cost of choosing p̄=300 over p̄=100 is that people who live in low density areas in outer suburbs and actually commute to an urban centre are left out. However, once the estimated border is drawn, this is likely to be only a small share of the rural population within FUAs.
[Figure 4 plot: vertical axis, average % of total city area; horizontal axis, decile of distance to city centre; series: p < 300, p > 300, p < 100, p > 100]

Urban centres definition and border drawing

The urban centres used as the targets of commuting areas are based on the one-km2 population grid (GHS-SMOD) (see Annex A for more details on sources). In line with the Degree of Urbanisation taxonomy (Dijkstra and Poelman, 2014), these urban areas are referred to as "urban centres" throughout the document (see Figure 1 for a full description of cell aggregation categories). To simplify the procedure, these areas are split by country to avoid trans-boundary urban centres, dropping the parts that do not respect the total population constraint. To avoid having unrealistically large urban centres in terms of surface and population (an issue in some highly dense Asian countries such as Bangladesh), two simple rules are imposed: first, a FUA can only have one urban centre of half a million inhabitants or more; and second, urban centres with more than 20 million inhabitants and more than 2 500 km2 are split if they have at least two hypercores. Hypercores are areas within large urban centres containing 1 000 000 people or more, made up of contiguous cells each with more than 3 000 people. This splitting applies to the six largest FUAs and makes it possible to consider very high density areas within these FUAs as independent urban centres.

Step 1 described above is implemented by assigning each populated cell to the closest urban centre. The distance and the selection of the closest urban centre within the country are computed using the global travel impedance grid (see Annex A for more details). The implementation involves clipping the global travel impedance matrix to the country extent and calculating the distance (in minutes) between each cell and the edge of the closest urban centre. Calculating the distance to the edge of urban centres increased the computational burden compared to calculating distances to a single point in each FUA (e.g. the cell with highest density), but was preferred for two reasons. First, urban centre centroids might fall by chance in very remote areas, artificially inflating travel times and inducing the selection of a less suitable destination. Second, the same population cell could be far away from the centroid of an adjacent urban centre, if that centre is very large, and closer to the centroid of another small but distant urban centre.

All cells selected through step 2 are combined into a polygon. FUAs with touching boundaries that belong to urban centres within 5 km of each other are consolidated into polycentric FUAs. Further aggregations of multiple urban centres into complex polycentric structures or mega-regions (Nelson and Rae, 2016) are in principle possible but fall outside the scope of this work.
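Step 1 can be illustrated with a multi-source shortest-path search over a friction grid. The reference list cites Dijkstra's algorithm, but the source does not spell out the implementation, so the sketch below is an assumption-laden illustration: the function name, the 4-neighbour moves, the step cost (mean friction of the two cells, for a one-km move) and the toy data are all hypothetical.

```python
import heapq


def assign_to_closest_centre(friction, centre_cells):
    """Travel time (minutes) and closest urban centre for every grid cell.

    friction[r][c]: minutes needed to cross one km in that cell.
    centre_cells: {(row, col): centre_id} for cells on urban centre edges.
    """
    rows, cols = len(friction), len(friction[0])
    minutes = [[float("inf")] * cols for _ in range(rows)]
    owner = [[None] * cols for _ in range(rows)]
    heap = []
    for (r, c), centre_id in centre_cells.items():
        minutes[r][c] = 0.0
        owner[r][c] = centre_id
        heapq.heappush(heap, (0.0, r, c, centre_id))
    while heap:
        t, r, c, centre_id = heapq.heappop(heap)
        if t > minutes[r][c]:
            continue  # stale queue entry, a faster route was already found
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                # Assumed cost of moving one km between adjacent cells.
                step = (friction[r][c] + friction[nr][nc]) / 2.0
                if t + step < minutes[nr][nc]:
                    minutes[nr][nc] = t + step
                    owner[nr][nc] = centre_id
                    heapq.heappush(heap, (t + step, nr, nc, centre_id))
    return minutes, owner


# Toy one-row country: centre "A" on the left, centre "B" on the right.
toy_friction = [[1, 1, 1, 4, 6]]
toy_minutes, toy_owner = assign_to_closest_centre(
    toy_friction, {(0, 0): "A", (0, 4): "B"}
)
print(toy_minutes[0])
print(toy_owner[0])
```

In the toy example the fourth cell is adjacent to centre B but is assigned to centre A, because travel through the slow cells next to B takes longer than the route from A. This is the kind of case where travel time and straight-line distance disagree.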
Results

A total of 9 840 FUA borders based on 11 223 urban centres in 136 countries are obtained and validated based on the predicted probabilities from a logistic regression on about half a million observations on distance to the closest centre, centre size and country characteristics. Details of the econometric procedure are described in Annex B.

Table 1. Estimated FUA area and population across development classes and regions, 2015

Group | # FUAs | FUA area (km2) | FUA pop. (thousands of persons) | FUA pop. % of total pop. | Commuting pop. (thousands of persons) | Commuting pop. % of FUA pop.
Development class
Least developed | 603 | 59 794 | 195 363 | 55.23 | 20 105 | 10.29
Less developed, excluding least developed | 7 898 | 852 553 | 2 709 535 | 54.21 | 251 176 | 9.27
More developed | 1 339 | 833 788 | 725 319 | 58.00 | 167 098 | 23.04
World region
Africa | 1 203 | 98 468 | 351 204 | 55.23 | 23 788 | 6.77
Asia | 6 478 | 781 324 | 2 316 730 | 54.44 | 248 062 | 10.71
Europe | 930 | 254 145 | 355 189 | 48.14 | 68 021 | 19.15
Latin America & Caribbean | 882 | 103 688 | 338 907 | 58.59 | 19 082 | 5.63
North America | 296 | 479 814 | 246 504 | 68.94 | 74 583 | 30.26
Oceania | 51 | 28 696 | 21 685 | 57.69 | 4 843 | 22.34
World total | 9 840 | 1 746 135 | 3 630 218 | 54.98 | 438 379 | 12.08

Source: See Annex A.

These estimated FUAs cover a total area of 1 746 135 km2, out of which 622 670 km2 correspond to urban centre areas (Figure 5). The United States is the country with the largest FUA area coverage (450 611 km2).
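As a quick consistency check on the world totals in Table 1, the commuting shares can be recomputed from the population figures. The sketch assumes that the population columns are expressed in thousands of persons (the world total of 3 630 218 then matches the 3.63 billion people reported in the text); the variable names are illustrative.

```python
# World totals from Table 1 (population columns read as thousands of persons).
fua_pop = 3_630_218          # people living in FUAs
commuting_pop = 438_379      # of which in commuting zones
fua_share_of_total = 54.98   # % of total population living in FUAs

# Share of FUA residents who live in commuting zones.
commuting_share_of_fua = 100 * commuting_pop / fua_pop

# Share of the total population who live in commuting zones.
commuting_share_of_total = commuting_share_of_fua * fua_share_of_total / 100

print(round(commuting_share_of_fua, 2))    # 12.08
print(round(commuting_share_of_total, 1))  # 6.6
```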
Figure 5. FUA boundaries in the world and detail for North America (panels: World; Detail: North America)
In 2015, about 3.63 billion people, or 55% of the total population of the selected countries, lived in FUAs, out of which 12% (or 6.6% of the world's population) lived in commuting zones. From a regional perspective, North America has the highest share of population in urban agglomerations (69%), followed by Latin America and the Caribbean (59%). Across development classes, the share of people living in the commuting zones of FUAs is by far the largest in more developed countries (23%). In absolute numbers, less developed countries host the largest commuting zone population (about 251 million people). Interestingly, while Latin America and the Caribbean and Africa have relatively high shares of population living in FUAs, they have the smallest shares of FUA population in commuting zones. This result suggests that in these two regions, urban growth has concentrated mostly at the fringes of existing urban centres.

India, with 754 million people in 2 886 FUAs, and China, with 642 million people in 1 641 FUAs, are the top two countries in the world in terms of number of FUAs and proportion of national population living within them. The largest FUA in the world is Greater Tokyo, with a population of over 36 million (of which 8% live in the commuting zone), followed by Greater Delhi, with around 28.5 million people in urban centres and 2.4 million in commuting zones.

In terms of area coverage, the estimated FUAs cover a significantly smaller area than the baseline FUAs. This does not come as a surprise, since the baseline method relies on administrative aggregations that can include areas that are partially unpopulated or without potential commuters. As such, the estimated FUAs are likely to give a more realistic approximation of the extent of commuting zone area coverage. Figure 6 illustrates this for the case of Bogotá, Colombia. The baseline FUA borders (in white) partly aggregate natural reserve areas in the east and the south where there are no roads or settlements.
The estimated FUAs (in grey) demarcate a tighter southern and eastern border.

Figure 6. Baseline and Estimated FUA borders, Bogotá
Source: Open Street Maps, GHSL provided by the Joint Research Centre, European Commission; FUA Boundaries provided by OECD.

In line with expectations, commuting zones cover a proportionally larger area in more developed countries: the average FUA-to-centre area ratio for more developed countries is 2 times higher than for less developed countries and 2.5 times higher than for least developed countries. In 12% of the cases, estimated FUAs correspond to urban centres (i.e. they do not have a commuting zone). Almost all (99%) FUAs with small or no commuting zones are located in least (10%) and less (89%) developed countries.

Contemporary large urban agglomerations are often polycentric. Even if they have more than one urban centre within the FUA area, they should be considered as a single daily urban system. In order to adapt the method presented here to the existence of polycentric FUAs, the merging procedure was applied in 695 cases. In 74% of the cases this procedure involved merging two urban centres, and in the largest case it merged 31 urban centres. With the exception of London, FUAs with more than 10 urban centres are located in Asia (10) and Africa (3). The consolidation procedure possibly underestimates the number of polycentric FUAs, as it cannot account for cases where urban centres are located more than 5 km away from each other (e.g. the separate FUAs for Greater Bogotá and Chía-Cajicá in Figure 6).

Conclusions

Any assessment of global trends in urban development needs to rely on a consistent definition of what a city is and of its economic area of influence. This document proposed a simple method to define urban economic agglomerations around the world in countries where there are no commuting flow data, using the concept of FUAs as defined by the OECD in collaboration with the European Commission (OECD, 2012). By using global grids of population and travel impedance at one-km2 resolution, the method uses the information available on commuting zones in 31 OECD countries to predict the extent of commuting zones all around the world. The estimated FUA boundaries in OECD countries turned out to be satisfactorily accurate in terms of population with respect to the original FUAs, which were identified by aggregating local administrative units to the high density urban centres based on the intensity of commuting flows.

The major contribution of this work is to provide estimated boundaries that make it possible to compare urban agglomerations, as well as suburban population, in a meaningful and robust way that was not possible before. The estimated boundaries rely on the view that cities are defined as agglomerations of people, not of buildings or lights.
In this sense, the estimated FUA boundaries can help assess the possible bias generated by the shape of such administrative boundaries, which often include areas that are poorly connected to the high density urban centre or even uninhabited.

The method proposed in this document relies on a training set built on a sample of countries which are mostly developed. This might limit the capacity to predict with sufficient accuracy the extent of FUAs in the least developed countries. However, the global impedance grid used to compute the distance of each cell to its closest high density urban centre embeds the different costs of moving from different points in space according to the level of infrastructure and the geographical characteristics of the specific area under study. In addition, controls on the number of vehicles and income level should further mitigate a possible bias of the training sample used in the prediction.

References

Bates, D., Mächler, M., Bolker, B. and Walker, S., 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67, 1.
Brueckner, J. K., 1987. The structure of urban equilibria: A unified treatment of the Muth-Mills model. In: E.S. Mills (Ed.) Handbook of Regional and Urban Economics, 2, 821-845.
Burchfield, M., Overman, H.G., Puga, D. and Turner, M.A., 2006. Causes of sprawl: A portrait from space. The Quarterly Journal of Economics, 121(2), 587-633.
Clark, C., 1951. Urban population densities. Journal of the Royal Statistical Society. Series A (General), 114(4), 490-496.
Cormen, T.H., Leiserson, C.E., Rivest, R.L. and Stein, C., 2001. "Section 24.3: Dijkstra's algorithm". Introduction to Algorithms (Second ed.). MIT Press and McGraw–Hill. pp. 595–601.
Dijkstra, L. and Poelman, H., 2014. A harmonised definition of cities and rural areas: the new degree of urbanisation. Regional Policy Working Papers No. 01, European Commission.
Edelsbrunner, H. and Mücke, E.P., 1994. Three-dimensional alpha shapes. ACM Transactions on Graphics, 13(1), 43–72.
Freire, S., MacManus, K., Pesaresi, M., Doxsey-Whitfield, E. and Mills, J., 2016. Development of New Open and Free Multi-Temporal Global Population Grids at 250 m Resolution. In: Proceedings of the 19th AGILE conference on geographic information science, June 14–17, Helsinki.
Freire, S., Schiavina, M., Florczyk, A.J., MacManus, K., Pesaresi, M., Corbane, C., Bokovska, O., Mills, J., Pistolesi, L., Sqires, J. and Sliuzas, R., 2018. Enhanced data and methods for improving open and free global population grids: putting 'leaving no one behind' into practice. International Journal of Digital Earth, in press.
Nelson, G.D. and Rae, A., 2016. An economic geography of the United States: From commutes to megaregions. PloS One, 11(11), p.e0166083.
OECD, 2012. Redefining 'urban'. A new way to measure metropolitan areas. OECD Publishing, Paris.
Pesaresi, M., Ehrlich, D., Ferri, S., Florczyk, A., Freire, S., Halkia, M., ... & Syrris, V., 2016. Operating procedure for the production of the Global Human Settlement Layer from Landsat data of the epochs 1975, 1990, 2000, and 2014. Publ. Off. Eur. Union.
Weiss, D.J., Nelson, A., Gibson, H.S., Temperley, W., Peedell, S., Lieber, A., Hancher, M., Poyart, E., Belchior, S., Fullman, N. and Mappin, B., 2018. A global map of travel time to cities to assess inequalities in accessibility in 2015. Nature, 553(7688), 333.

ANNEX A
Data sources and description

Global Human Settlement Layers (GHSL): The Global Human Settlement Population (GHS-POP) and settlement model (GHS-SMOD) grids, containing information on population by 1 km2 grid cell and on cell types, are provided by the Joint Research Centre of the European Commission and are available for download. Technical details can be found in Pesaresi et al (2016).
Global travel impedance grid: This grid represents the time associated with moving through grid cells, quantified as a movement speed within a "friction" grid (30 arcsec resolution). The unit of measurement in the grid is the minutes required to travel one kilometre. Information on roads (the fastest type in a grid cell takes precedence over others, with speeds given by OSM tables), railroads, water bodies and movement over land is used to characterize each grid cell. Technical details can be found in Weiss et al (2018). The grid is freely available for download.

Country borders: The GADM country boundaries are used as country borders.

Country-level additional information: Vehicles per capita data are included (source not stated); GDP per capita was downloaded from the World Bank Open Data website.

Night-time lights: For the sum of night-time lights in urban centres, the source is the Version 1 Nighttime VIIRS (Visible Infrared Imaging Radiometer Suite) Day/Night Band Composites suite produced by the Earth Observations Group (EOG) at NOAA/NCEI. These grids span the globe from 75N latitude to 65S and have a resolution of 15 arc-seconds in WGS84 geographic coordinates (EPSG 4326). The yearly "vcm-orm-ntl" (VIIRS Cloud Mask - Outlier Removed - Nighttime Lights) layer was selected, showing the cloud-free average radiance emitted by Earth (expressed as nW cm^-2 sr^-1), with an outlier removal process to filter out fires and other ephemeral lights. This layer has been warped to the GHS-SMOD grid by oversampling at 50 m in Mollweide projection (EPSG 54009) with the nearest-neighbour method, then aggregated to one km2 by averaging values.

ANNEX B
Model specification, selection and validation

Model specification

The data used in this work have a nested or block structure, since each cell is assigned to an urban centre and urban centres belong to countries. In this case, the independence assumption across cells may not hold. The first specification chosen is therefore a mixed-effects logit model with random intercepts at two levels, where cells are nested in urban centres, which are in turn nested in countries:

$$FUA_{ijC} = \gamma_0 + \sum_{k=1}^{p} \gamma_k \, t_{ijC}^{k} + \gamma_3 \, pop_{ijC} + \epsilon_{ijC} \qquad (1)$$

where $FUA_{ijC}$ is a dummy variable that takes the value of one if cell i in country C, linked to urban centre j, falls within FUA borders and zero otherwise; $t_{ijC}$ is the travel time from i to j; and $pop_{ijC}$ is the population of cell i. This specification includes random intercepts clustered by urban centre and country, so that each urban centre has its own random intercept varying within each country. In this and the following specifications, a second degree orthogonal polynomial of distance is specified after verifying its statistical significance and its contribution to variance explanation as captured by the Bayesian Information Criterion (BIC). The estimation is done via a Maximum Likelihood estimator.
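To make the role of specification (1) concrete, the sketch below turns a linear predictor of that form into a probability through the logistic function. It is an illustration only: the coefficient values are made up, the random intercepts are omitted, and a raw second degree polynomial in travel time stands in for the orthogonal polynomial used in the paper.

```python
import math

# Hypothetical coefficients, chosen only to illustrate the functional form.
toy_coef = {"intercept": 2.0, "t": -0.05, "t2": 0.0001, "pop": 0.0002}


def fua_probability(travel_minutes, cell_pop, coef=toy_coef):
    """Predicted probability that a cell belongs to a FUA (fixed effects only)."""
    z = (
        coef["intercept"]
        + coef["t"] * travel_minutes
        + coef["t2"] * travel_minutes ** 2
        + coef["pop"] * cell_pop
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic link


p_near = fua_probability(travel_minutes=10, cell_pop=500)
p_far = fua_probability(travel_minutes=60, cell_pop=500)
print(p_near, p_far)
```

With these made-up coefficients, a cell 10 minutes away from its closest urban centre gets a much higher predicted probability than an otherwise identical cell 60 minutes away, which is the distance decay the model is meant to capture.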
A possible caveat of using a model specification such as the one described by (1) is that the purpose of the estimation is to make out-of-sample predictions, in which case the random intercept estimates may carry little useful information. In particular, outside the sample of OECD countries there could be countries with larger or smaller urban centres, and country-level variables such as the level of economic development would help scale possible country-level effects. A general specification including these variables and their polynomial terms is:

$$FUA_{ijC} = \delta_0 + \sum_{k=1}^{p} \tau_k \, t_{ijC}^{k} + \delta_3 \, pop_{ijC} + \delta_4 \, size_{jC} \cdot dist_{ijC} + \sum_{k=1}^{p} \varphi_k \, size_{jC}^{k} + \sum_{k=1}^{p} \omega_k \, GDP_{C}^{k} + \sum_{k=1}^{p} \psi_k \, cars_{C}^{k} + \epsilon_{ijC}$$

where $size_{jC}$ is the size of urban centre j, proxied by area, population or night-time lights; $pop_{ijC}$ is the cell's population; $dist_{ijC}$ is the distance between cell i and urban centre j, measured as the travel time $t_{ijC}$; $GDP_{C}$ is GDP per capita; and $cars_{C}$ is the number of vehicles per 1 000 inhabitants.

Estimation, cross validation and performance

After excluding cells with distances to the closest urban centre above 300 minutes, the sample size is 498 702 observations across 31 countries: Colombia plus all OECD countries except those with no available FUAs (New Zealand, Israel, Turkey and Lithuania) and those with only one FUA (Luxembourg and Iceland). As desirable, the proportion of 1s and 0s in the sample is almost balanced: the proportion of cells within FUAs (1s) is 48.4%.

The BIC supports the inclusion of an interaction between size of the urban centre and distance, and k = 1, 2 for distance, size of urban centre, GDP per capita and vehicles incidence. To verify the statistical significance of the variables, a logistic regression with clustered errors was performed. The inclusion of other country-level variables such as population was not supported by p-values. Figure B.1 summarizes the variable importance score (absolute value of the z-statistics) of the most general model. Distance from each cell to the urban centre is the most powerful predictor of the probability of belonging to a FUA, followed by the size of the urban centre.

Figure B.1. Variable importance score

Validation involves predicting values on a test set based on the model fitted on a training set. Training and test sets were defined in two ways. First, to get information about possible performance biases against countries with certain characteristics such as lower income or large surface area, 31 training/test sets were built in which one country was excluded/included at a time. Second, to guide the model choice, 100 training and test sets were built based on random samples (without replacement) of the 1 413 urban centres in the dataset, and median performance measures were computed over these 100 samples. The probability threshold for the validation exercise is set at 0.75, whereas for the implementation the optimal threshold was determined by minimizing the error of positive and negative predictions against the actual positives and negatives (i.e. the sum of squared errors) in the training set, and it is slightly lower (0.7192372).

The main objective when estimating borders is to predict positives (1s) correctly, more so than to predict negatives (0s) correctly.
Given this priority on predicting 1s correctly, the model choice was based on the Area Under the Relative Operating Characteristic (AUROC)[2].

[2] The Relative Operating Characteristic (ROC) plots the proportion of actual 1s that are correctly predicted as 1s (true positive rate) against the proportion of actual 0s that are mistakenly predicted as 1s (false positive rate).

[Figure B.1 chart: absolute z value (axis from 0 to 250) for the variables pop. cell; GDP per capita; cars x 1 000 inh.^2; pop. urban centre*distance; GDP per capita^2; cars x 1 000 inh.; size urban centre; size urban centre^2; distance cell to urban centre^2; distance cell to urban centre.]

The AUROC measures the probability that the classifier will rank a randomly chosen 1 higher than a randomly chosen 0, with a minimum value of
0.5 (pure chance) and a maximum value of 1 (perfect agreement between predicted and observed values); values higher than 0.7 are considered acceptable. As a robustness check, three further performance criteria are reported: balanced accuracy (the share of correct predictions within each class, averaged over the two classes, 0s and 1s); misclassification error (the share of observations, 1s and 0s, not correctly predicted); and specificity (the share of 0s correctly predicted as 0s).

Based on performance statistics and the p-values of individual variables, the comparison was narrowed down to three models: the nested specification (nested); a specification with k=2 for distance, urban centre size and vehicles incidence, k=1 for GDP per capita, and 𝛿4 = 0 (spec. 1); and a specification with k=2 for distance, urban centre size, vehicles incidence and GDP per capita (spec. 2). In these specifications, urban centre size was measured by total population. To aid convergence, the values of distance, urban centre and cell population, GDP per capita, and vehicles incidence were log-transformed for the nested models. In the random samples case, the performance of this proxy (population) is compared against two alternatives: area and night-time lights.

Figure B.2 displays the AUROC values for the country-level predictions. The means weighted by the number of urban centres in each country are 0.85495, 0.86957 and 0.87089 for the nested, spec. 1 and spec. 2 cases, respectively, making spec. 2 preferable on this criterion. For this specification, the country values range from 0.76576 (Portugal) to 0.98651 (Latvia). Notably, the model's predictive ability shows no evident pattern with respect to country characteristics such as land size, population density or level of economic development, and all country values are above the acceptable level of 0.7.
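The performance criteria described above can be computed directly from observed labels and predicted probabilities. The sketch below is illustrative only: the function names, the toy labels and scores, the convention that a cell is classified as 1 when its probability is at or above the threshold, and the half-credit given to tied scores in the AUROC are assumptions, not details taken from the source.

```python
def auroc(y_true, scores):
    """Share of (1, 0) pairs in which the 1 gets the higher score (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(y_true, scores, threshold=0.75):
    """Balanced accuracy, specificity and misclassification error at a threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of 1s predicted as 1s
    specificity = tn / (tn + fp)  # share of 0s predicted as 0s
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "specificity": specificity,
        "misclassification_error": (fp + fn) / len(y_true),
    }


def weighted_mean(values, weights):
    """Mean of country-level values weighted, e.g., by number of urban centres."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy example: six cells with observed FUA membership and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.7, 0.4, 0.2]

area = auroc(y_true, scores)
metrics = threshold_metrics(y_true, scores, threshold=0.75)
print(area, metrics)
```

Note that the AUROC depends only on how the cells are ranked, whereas the other three measures depend on the chosen probability threshold.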
These results also confirm that, on average, the nested model is not superior to alternative models that include country-level variables as predictors.

Figure B.2. AUROC by country and specification
[Chart: AUROC (axis from 0.7 to 1) by country for the nested, spec. 1 and spec. 2 specifications.]

The additional performance metrics reveal that the proposed specifications are all better at predicting 1s than 0s. The weighted balanced accuracy average for the three cases is 0.71517, 0.71248 and 0.71380, and ranges from 0.55072 (Slovakia) to 0.79992 (Switzerland). The lower values are mostly explained by low specificity rates, with weighted averages of 0.52535 (nested), 0.49145 (spec. 1) and 0.48796 (spec. 2), and a minimum value of 0.10301 (Slovakia). The weaker performance in terms of specificity is expected, as distance is a less powerful predictor for differentiating cells outside FUAs than it is for cells inside FUAs. Nevertheless, the weighted misclassification error is 0.23591, 0.21889 and 0.21713, still within an
acceptable range for the three proposed specifications.

Using the AUROC as the selection criterion, the performance of several specifications of 𝛿4 and k is compared for the country-level variables (these results are available upon request). All the metrics favour spec. 2, although differences in average performance across specifications are small. Table B.1 shows the median values over 100 runs for spec. 2 for three different proxies for urban centre size: population, area and night-time lights.

Table B.1. Median of performance metrics over 100 random samples

Proxy              AUROC     Balanced accuracy  Specificity  Misclassification error
Population         0.862924  0.709489           0.4629413    0.2125
Area               0.861503  0.703376           0.4499832    0.2145
Night-time lights  0.862459  0.704844           0.4476015    0.2137

While the median results favour population over the alternatives, the differences are small, so the model is robust to the choice of size proxy. The preferred specification is then:

$$FUA_{ijC} = \delta_0 + \sum_{k=1}^{2} \tau_k t_{ijC}^{k} + \delta_3 \, pop_{ijC} + \delta_4 \, pop_{jC} \cdot dist_{ijC} + \sum_{k=1}^{2} \varphi_k \, pop_{jC}^{k} + \sum_{k=1}^{2} \omega_k \, GDP_{C}^{k} + \sum_{k=1}^{2} \varphi_k \, cars_{C}^{k} + \epsilon_{ijC} \qquad (3)$$
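To make the structure of specification (3) concrete, the sketch below evaluates its right-hand side for a single cell and converts it into a probability with the logistic function, as in a logit model. Everything numeric here is hypothetical: the coefficient values are invented for illustration and are not estimates from the source, the input units are arbitrary rescalings, and treating `t` and `dist` as two separate inputs simply mirrors the notation of equation (3). Only the threshold 0.7192372 is taken from the text.

```python
import math

# Classification threshold reported in the text for the implementation stage.
THRESHOLD = 0.7192372


def linear_predictor(cell, coef):
    """Right-hand side of specification (3), without the error term."""
    eta = coef["delta0"]
    for k in (1, 2):  # polynomial terms of order 1 and 2
        eta += coef["tau"][k - 1] * cell["t"] ** k
        eta += coef["phi_pop"][k - 1] * cell["pop_centre"] ** k
        eta += coef["omega"][k - 1] * cell["gdp"] ** k
        eta += coef["phi_cars"][k - 1] * cell["cars"] ** k
    eta += coef["delta3"] * cell["pop_cell"]
    eta += coef["delta4"] * cell["pop_centre"] * cell["dist"]
    return eta


def fua_probability(cell, coef):
    """Logistic transform of the linear predictor (numerically stable form)."""
    eta = linear_predictor(cell, coef)
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def in_fua(cell, coef, threshold=THRESHOLD):
    """Assign the cell to a FUA when its probability reaches the threshold."""
    return fua_probability(cell, coef) >= threshold


# Hypothetical coefficients, chosen only so that the example runs.
coef = {
    "delta0": 2.0, "delta3": 0.2, "delta4": -0.001,
    "tau": (-0.08, 0.0002), "phi_pop": (0.4, -0.02),
    "omega": (0.1, -0.01), "phi_cars": (0.05, -0.005),
}

# Two hypothetical cells that differ only in distance to the urban centre.
near = {"t": 10, "dist": 10, "pop_cell": 0.8, "pop_centre": 1.5, "gdp": 3.0, "cars": 4.0}
far = dict(near, t=100, dist=100)

print(in_fua(near, coef), in_fua(far, coef))
```

With these invented coefficients the nearby cell is classified inside the FUA and the distant one outside, which reflects the role of distance as the strongest predictor but says nothing about the size of the estimated effects.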