Transcript of "Dc+big+data+exploration+final+report"
World Bank Group Finances
finances.worldbank.org
@WBOpenFinances

DC Big Data Exploration
Final Report
March 15-17, 2013
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
EXECUTIVE SUMMARY
DETAILED PROJECT REPORTS
PREDICTING SMALL-SCALE POVERTY MEASURES FROM NIGHT ILLUMINATION
  BACKGROUND AND PROBLEM STATEMENT
  DATASETS AVAILABLE
  KEY FINDINGS
  METHODS AND ANALYSIS
  RECOMMENDATIONS AND NEXT STEPS
  ADDITIONAL RESOURCES
SCRAPING WEBSITES TO COLLECT CONSUMPTION AND PRICE DATA
  BACKGROUND AND PROBLEM STATEMENT
  DATASETS AVAILABLE
  KEY FINDINGS
  METHODS AND ANALYSIS
  RECOMMENDATIONS AND NEXT STEPS
  ADDITIONAL RESOURCES
LATIN AMERICA POVERTY ANALYSIS FROM MOBILE SURVEYS
  PROBLEM STATEMENT
  DATASETS AVAILABLE
  KEY FINDINGS
  METHODS AND ANALYSIS
  RECOMMENDATIONS AND NEXT STEPS
  ADDITIONAL RESOURCES
MEASURING SOCIOECONOMIC INDICATORS IN ARABIC TWEETS
  BACKGROUND AND PROBLEM STATEMENT
  DATASETS AVAILABLE
  KEY FINDINGS
  METHODS AND ANALYSIS
  RECOMMENDATIONS AND NEXT STEPS
  ADDITIONAL RESOURCES
ANALYZING WORLD BANK DATA FOR SIGNS OF FRAUD AND CORRUPTION
  DETAILED PROBLEM STATEMENT
  DATASETS AVAILABLE
  KEY FINDINGS
  METHODS AND ANALYSIS
  RECOMMENDATIONS AND NEXT STEPS
  ADDITIONAL RESOURCES
UNDP RESOURCE ALLOCATION
  BACKGROUND AND PROBLEM STATEMENT
  DATASETS AVAILABLE
  KEY FINDINGS
  METHODS AND ANALYSIS
  RECOMMENDATIONS AND NEXT STEPS
  ADDITIONAL RESOURCES
OTHER PROJECTS: A HEURISTIC TOOL FOR AUDITING AND SOCIAL NETWORK ANALYSIS FOR RISK MEASUREMENT
NEXT STEPS
ACKNOWLEDGEMENTS

The World Bank Group Finances team is grateful to all the partners that supported the DC DataDive on March 15-17:
• UN Global Pulse
• Qatar Computing Research Institute
• UNDB
• UNDP

The following groups from inside the Bank also contributed:
• AFR
• CTR
• DEC
• EXT
• IEG
• INT
• LAC
• OPCS
• PREM
• TWICT
• WBI

We are also grateful to DataKind, the data ambassadors that it assembled, and the volunteers who participated. The DC Big Data Exploration would not have occurred without them.

The DC Big Data Exploration was preceded by a similar event that the UNDP organized in Vienna in February.
EXECUTIVE SUMMARY

Dr. Jim Kim in a recent speech asked ‘what will it take for the World Bank Group to be at its best on every project, for every client, every day?’ His own prescription was that ‘We must… support our clients in applying evidence-based, non-ideological solutions to development challenges… This is the next frontier for the World Bank Group… we need to continue investing in data and analytic tools, building on the success of the Open Data initiative. Data are crucial to setting priorities, making sound policy, and tracking results.’

The age of big data contains the tantalizing promise of reshaping international development. There is already overwhelming evidence from the private sector that big data can be transformative. UPS used sensor data to cut 30 million miles from its drivers’ routes. Reports claim that predictive analytics has been worth about 23 billion to Target over 8 years. Services like Farecast use vast amounts of seemingly unconnected data to create new information services that would not have been possible a few years ago.

The question then is whether, and how, big data has a role in international development. The UN-led Global Pulse initiative seeks to ‘harness today’s new world of digital data and real-time analytics to gain a better understanding of changes in human well-being’. The World Bank too sees big data as a promising area, but one that needs further exploration.

On March 15-17 DataKind, in partnership with the World Bank and its partners from UNDP, UNDB, UN Global Pulse, and the Qatar Computing Research Institute, held the DC Big Data Exploration to explore new ways of using big data to fight poverty and corruption. The event drew more than 120 pro bono data scientists from Washington DC and across the nation to the World Bank’s Preston Auditorium. Working alongside Bank experts on the Poverty and Fraud & Corruption teams, the data scientists uncovered new ways of collecting, exploring, and visualizing data to maximize their impact. The collaboration between the two communities yielded new insights from World Bank data, devised new ways of using existing big data sources for monitoring poverty and corruption, and created entirely new streams of data that the Bank and its partners can use in future research.

Prior to the event, DataKind and the World Bank’s Poverty and Fraud & Corruption teams identified six key projects to tackle over the weekend. The projects were designed to address the Bank’s needs and generate tangible insights within a 24-48 hour period:

Predicting Small-Scale Poverty Measures from Night Illumination: The team explored whether nighttime light imagery could be used to estimate sub-national poverty levels. Over the weekend, the team created software to overlay lighting information with other geospatial indicators (e.g. population, change in poverty) and performed a statistical analysis showing that lighting levels in satellite images were predictive of poverty levels in 2001 in Bangladesh. The Bank can use these findings to carry out more sophisticated experiments relating nighttime lighting to poverty and to build software to monitor poverty in real time from remote sensing.
Scraping Websites to Collect Consumption and Price Data: To combat the lack of price data, this team wrote software to scrape food prices from supermarket and cost-of-living websites, and other sources of food data. The results yielded real-time food price monitoring data for early alerting to food crises, better information for battling inflation, and a richer perspective on food data.

Measuring Socioeconomic Indicators in Arabic Tweets: This project analyzed more than 25 GB of Arabic tweets to see whether socioeconomic conditions could be estimated from what people were saying on social media. The team wrote code to track key socioeconomic terms over time (e.g., “bankrupt” or “food”) and estimated the time zones, locations, and gender of the authors from their messages alone. These findings could be used in future work to design proper experiments to test for socioeconomic differences in regions or demographics based on passively collected social media data.

Latin America Poverty Analysis from Mobile Surveys: This project analyzed Listening to Latin America (L2L) mobile survey data to understand the socioeconomic conditions in Peru. The team analyzed basic survey results and also discovered patterns in the survey responses, e.g. mobile responses yield very different answers to socially sensitive questions, and economic incentives don’t seem to affect response rates. The Bank can use these findings to plan future surveys so that they collect more accurate information.

Analyzing World Bank Data for Signs of Fraud and Corruption: This project combined the World Bank’s internal supplier, contractor, and bidder data with external data to gain a richer perspective on how firms that had been bidding on Bank-financed contracts behaved. The team created new unified databases that make analysis easier for the Bank, and identified interesting patterns in debarred-to-non-debarred organization relationships, co-bidder patterns, and Bank lending patterns over time.

UNDP Resource Allocation: The UNDP provided capacity and project data in order to understand what skills and budgets created the best program results. The team explored expenditures by project and identified the types of projects and regions that hit their budget goals compared to those that did not.

Heuristic Auditing Tool and Supplier Social Network Analysis: Two other projects outside of DataKind were developed over the weekend as well. The first was a tool for automated auditing of bids; the second was a social network analysis tool for understanding the relationships between suppliers.

The weekend in mid-March was just the start. The World Bank Group Open Finances team is currently working on the following steps for the program:
• Holding an online competition to address an operational question designed to improve the delivery of Bank projects;
• Considering a partnership with DataKind to:
  o Work with DataKind’s vetted DataCorps consultants on larger Bank efforts with tangible deliverables;
  o Engage the data science community to add innovation capacity and data science expertise to the Bank’s ongoing efforts. The DC community was invigorated by the DC Big Data Exploration; the Bank and DataKind will continue to engage the groups that participated to collaborate on these issues.
• Collaborating with partners inside and outside the Bank to create an analytics program based on big data techniques and tools like those that were used so successfully during the DataDive weekend.

Finally, it is worth noting that the promise of big data comes with numerous challenges, especially those related to privacy, data quality, attribution, and legal frameworks. The findings from the DataDive are provisional: a number of methodological issues still need to be addressed, e.g. sample size, selection bias and validity of sources. The promise may, however, outweigh the perils, and the Bank and its partners need to quickly build on the momentum achieved over the DataDive weekend.
DETAILED PROJECT RESULTS

PREDICTING SMALL-SCALE POVERTY MEASURES FROM NIGHT ILLUMINATION

Background and Problem Statement

Poverty data collection is expensive and slow. To complement this effort, it would be helpful to find ways to measure poverty that are cheaper and more frequent, even if potentially less accurate. The team looked for patterns that could help build leading indicators of poverty by comparing existing national poverty maps to other geospatial indicators to see if they were correlated or, even better, predictive. Nighttime illumination recorded from satellite imagery was used as the indicator for this project. The team sought to identify whether light levels correlated with poverty levels and, more importantly, whether changes in light intensity could predict changes in poverty level. If so, then light maps could potentially be used as proxies for poverty data.

Datasets Available

• The 2001 and 2005 poverty levels of Bangladesh at every upazila (county administration level), of which there are about 500.
• Average nighttime illumination levels for every year from 2001 to 2005 from NOAA.

Both datasets were available as raw GIS data. The team also built accessible CSV files containing histograms of light per region that could be used for statistical analysis. See the Additional Resources section for other datasets the team collected.

Key Findings

• Using regression models, the team found that lighting and census data from 2001 were predictive of poverty levels that year. This finding holds promise for being able to predict poverty levels using satellite imagery of nighttime lighting.
• The team did not find similar results for the 2005 data, nor for predicting the change in poverty between 2001 and 2005. No statistically significant relationships were found over the weekend.
• Further exploration should be done of the 2005 data before ruling out any predictive ability from lighting maps.
• Geospatial poverty data can be combined with other geospatial data and image-based data (such as remote imagery) to explore the relationships between different variables. The team wrote code to aid in extracting shapefile (.shp) regions from TIFF files (.tiff), as well as interactive maps for researchers to use.
• Census data combined with light intensity data does a slightly better job of predicting poverty levels than light data alone.

Methods and Analysis

The team took two approaches to exploring the relationship of light intensity to poverty levels.
The first involved creating interactive maps that would allow researchers to overlay geographic data with existing poverty maps. With these maps, researchers could visually understand which indicators correlate with poverty. The second approach entailed building statistical models of the relationships between light intensity and poverty levels.

Interactive Poverty Maps

The team collected poverty maps from Bangladesh from 2001 to 2005 as well as maps of the change in poverty, light intensity, literacy, and total population. After converting all of the data into a common format (not a simple task, as some of the data was graphical and other data was in geo format), the team created three major interactive visualizations, available at this site:

1. Descriptive information: This interactive displays descriptive information about each region of Bangladesh for each of the datasets described above. For example, one can view the change in poverty from 2001 to 2005 by region on the map. Figure 1 shows an example of these descriptive maps.
2. Overlaid information: This interactive allows the user to overlay various indicators on top of illumination data from satellite imagery. For example, one could compare light intensity to a map of access to electricity or to the change in poverty levels between 2001 and 2005. Figure 2 shows an example of an overlaid map.
3. Timeline maps: The final interactive shows illumination changes for every year from 2001 to 2005. By navigating across this timeline, one can see where regions have increased or decreased in illumination levels over time. Figure 3 shows a screenshot of the Timeline maps tool.

With these tools, researchers can visually inspect the relationships between poverty and other geospatial datasets. The code used here could be adapted to take in other geospatial or raster image data to add to the tool.

Figure 1: A descriptive map highlighting the change in poverty in Bangladesh, shown with the upazila “Dharmapasha.”
Statistical Analysis of Light Intensity Levels and Poverty Data

Using the standardized data from both poverty maps of Bangladesh and satellite imagery of nighttime illumination, the team set out to answer the following questions:
• Is poverty correlated with light intensity?
• Are changes in poverty correlated with changes in light intensity?

Figure 2: An overlaid map of poverty in 2001 and light intensity.
To explore the relationship between poverty levels and light intensity, the team first needed a measure of light intensity at the same scale as the poverty data. Because the poverty data is on a regional level, the team created an “average light intensity” measure for each region. This measure was computed for each region by first getting an “average” intensity score using the equation:

$$\text{score} = \sum_{i=1}^{255} (\%\ \text{of pixels at intensity}\ i) \cdot i$$

Once this score was computed for each region, a value was assigned to the region based on the percentage of other regions that had a lower score during the same year. For example, if a region’s average lighting was higher than that of 80% of the other regions in 2001, it would be assigned an "average light intensity" of 80 for 2001.

Figure 3: An interactive timeline map of light intensity from 2001 to 2005.
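The two-step measure described above (a weighted intensity score per region, then a rank against the other regions for the same year) can be sketched in a few lines of Python. This is a minimal illustration, not the team's code: the function names, the histogram layout (intensity value mapped to pixel count), and the choice to count intensity-0 pixels in the denominator are all assumptions.

```python
from bisect import bisect_left


def intensity_score(histogram):
    """Sum over i = 1..255 of (% of pixels at intensity i) * i.

    `histogram` maps an intensity value to a pixel count. Pixels at
    intensity 0 count toward the total (an assumption) but add nothing.
    """
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    return sum(
        100.0 * count / total * i
        for i, count in histogram.items()
        if 1 <= i <= 255
    )


def percentile_ranks(scores):
    """Map each region to the percentage of *other* regions with a lower score."""
    ordered = sorted(scores.values())
    others = len(scores) - 1
    ranks = {}
    for region, score in scores.items():
        lower = bisect_left(ordered, score)  # count of scores strictly below
        ranks[region] = 100.0 * lower / others if others else 0.0
    return ranks


# Hypothetical histograms for three made-up regions in a single year.
histograms = {
    "region_a": {0: 50, 10: 50},
    "region_b": {0: 100},
    "region_c": {255: 10},
}
scores = {name: intensity_score(h) for name, h in histograms.items()}
ranks = percentile_ranks(scores)
print(ranks)
```

Because only the rank is kept, the scale of the raw score (percentages versus fractions) does not affect the final "average light intensity" value.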
Looking at the plot of light intensity vs. poverty in 2001 for each upazila (county administration) in Figure 4, we can see a strong correlation between light intensity and poverty levels. The trend is less pronounced in the average light intensity plot for 2005, but still apparent.

The team next built a number of models in an attempt to predict actual poverty levels from light intensity and census data.

Figure 4: Poverty vs. light intensity in 2001 and poverty vs. average light intensity in 2005.
Poverty vs. Light Intensity in 2001

The first model the team built was a linear regression predicting poverty level from light intensity alone using 2001 data. Figure 5 shows a plot of the predicted poverty levels in 2001 and the actual poverty levels using light data alone. All hyper-parameters of the model are selected based on cross-validation on 80% of the data. The model is then fit to the same 80% and used to predict the remaining 20%. The team set the alpha parameter to 10.0 for these computations, which is recommended when using 80% of the data for cross-validation.

The Root Mean Squared Error (RMSE) for the model using only an intercept term and the 2001 average light intensity is 0.076982. The RMSE is a measure of the model’s accuracy where lower numbers indicate higher-performing models. This low RMSE indicates a good fit, a promising sign that 2001 light intensities predicted 2001 poverty levels.

The team also included features from the 2001 census to see if the fit improved. In this case, the RMSE on the test data was 0.067650. This slightly lower score means that the census data combined with light intensity data does a slightly better job of predicting poverty levels than light data alone. Figure 6 shows the predicted poverty levels vs. true poverty levels using this model.

The team also looked at how well the census features (i.e. lights in 2001, lights in 2005, and poverty in 2001) could predict the change in poverty between 2001 and 2005. This RMSE was 0.129603 for the model without 2001 census data (Figure 9) and 0.143962 for the model with 2001 census data (Figure 8).

Based on the results from these models, the team concluded that the combination of lighting and census data from 2001 was predictive of poverty that same year.

Figures 5 (top) and 5 (bottom): Predicted poverty vs. true poverty in 2001 for the model fit using light intensity and census data from 2001. (Figure 5: RMSE = 0.067650; Figure 6: RMSE = 0.076982).

Figure 6: Predicted poverty vs. true poverty in 2005 for the model fit using light intensity alone. (RMSE = 0.129964)
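The fit-and-score procedure described above (fit on 80% of the regions, predict the remaining 20%, report the RMSE) can be sketched as follows. The report does not name the estimator; the mention of an alpha parameter suggests a regularized linear model, so this sketch assumes ridge regression with a single feature and an unpenalized intercept. The data is synthetic and the function names are hypothetical.

```python
import math
import random


def fit_ridge_1d(xs, ys, alpha):
    """Single-feature ridge regression with an unpenalized intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope toward zero
    intercept = y_bar - slope * x_bar
    return intercept, slope


def rmse(actual, predicted):
    """Root mean squared error; lower values mean a closer fit."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them into train and held-out parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Synthetic stand-in for (average light intensity, poverty rate) per region.
rng = random.Random(42)
rows = []
for _ in range(500):
    light = rng.uniform(0, 100)
    poverty = 0.6 - 0.004 * light + rng.gauss(0, 0.05)
    rows.append((light, poverty))

train_rows, test_rows = split_rows(rows)
intercept, slope = fit_ridge_1d(
    [light for light, _ in train_rows],
    [poverty for _, poverty in train_rows],
    alpha=10.0,
)
predictions = [intercept + slope * light for light, _ in test_rows]
test_rmse = rmse([poverty for _, poverty in test_rows], predictions)
print(f"slope={slope:.5f} held-out RMSE={test_rmse:.4f}")
```

Scoring on rows the model never saw is what makes the RMSE a measure of predictive ability rather than of how well the model memorized its training data.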
Figure 7: Predicted poverty vs. true poverty in 2005 for the model fit using light intensity data from 2005 and census data from 2001. (RMSE = 0.144068)

Figure 8: Change in poverty from 2001 to 2005 predicted against the actual change in poverty using lights in 2001, lights in 2005, poverty in 2001, and census data in 2001. (RMSE = 0.143962)
Recommendations and Next Steps

The World Bank team should re-run these analyses using the most recent and detailed poverty maps available. If there are additional geospatial indicators, they should be included in the analysis.

Based on the preliminary results and findings of the weekend, there are two clear areas for further exploration:
• The World Bank can reproduce the analysis prototyped at the event using the most recent and detailed poverty data available in different countries. Testing the approach on richer poverty maps, as well as using supplemental geographic and census data, could help uncover a deeper connection between light and poverty.
• The second area of research could be to identify additional geospatial data sources to incorporate into future versions. Initial sources identified, like available roads and night illumination, could contribute to a leading proxy for poverty.

Further projects could add more granular data and modeling techniques to this prototype. Additional geospatial data, along with traditional indicators, can supplement the project for stronger correlations.

By identifying new sources, the World Bank has an opportunity to build on this and other geospatial-data research efforts to provide timely and granular poverty measurements.

Figure 9: Change in poverty from 2001 to 2005 predicted against the actual change in poverty using lights in 2001, lights in 2005, and poverty in 2001. (RMSE = 0.129603)

ADDITIONAL RESOURCES

Available on the Team’s Hackpad:
o A 2006 NOAA paper on using night-lights and satellites to measure national poverty levels. They cite and address the measurement challenges that the World Bank faces in particular. http://www.ngdc.noaa.gov/dmsp/pubs/Poverty_index_20061227_a.pdf
o Additional geospatial data and external resources (a detailed list).
o UN studies on parking lot density and cellphone coverage predicting poverty levels:
  o Parking Lot Study
  o Cellphone Coverage Study
o Code to convert GeoTIFFs to shapefiles
o Two academic papers created by the team:
  o Luminosity
  o Lighting Up Poverty
o Interactive Visualizations
SCRAPING WEBSITES TO COLLECT CONSUMPTION AND PRICE DATA

Background and Problem Statement

Collecting detailed food price data is not just important for poverty monitoring, but also critical for the economic management of a country. In 2009, Kenya had an official inflation rate of 25% but a lending rate of below 20%. If these numbers were true, banks would have gone bankrupt. Banks thus had to “guess” inflation levels to set their interest rates.

What’s more, existing food price data often takes a huge amount of time to collect and exists only at the national level. Differences at the regional level are difficult to come by. In most developing countries, food makes up the largest share of inflation (often up to half).

This project team, dubbed “Team Ndizi”, decided to supplement the World Bank’s current food price data by scraping new, more real-time food price data from other sources. Their goal was to identify whether:
• Other data sources could be scraped to create more real-time estimates of food prices and therefore estimate inflation rates and poverty.
• There existed data sources that captured price data at regional and sub-national levels so that prices could be compared across the country.

Datasets Available

The team did not begin with any data but instead scraped a multitude of websites to create detailed datasets of price information.

Key Findings

Team Ndizi showed how useful scraping can be for the World Bank, collecting food prices from a number of sources.

As with all other projects from the DC Big Data Exploration, the findings should be considered provisional, as there are a number of methodological issues that still need to be addressed, e.g. sample size, selection bias and validity of sources.

• Global food prices could be scraped from humuch.com.
• Daily crop prices in Kenya could be scraped from mobile price providers like mFarm going back 1,000 days. This could be extended to other countries.
• Prices for multiple common goods could be scraped from grocery store sites to create a healthy “food basket.” The team found prices at the national and sub-national level and verified them against cost-of-living sites.
• The team used the Wayback Machine to scrape historical data, and demonstrated that pulling historical rice prices showed evidence of the Indonesian rice crisis before global food prices did.
Methods and Analysis

Team Ndizi identified dozens of websites to be scraped and dozens of sources for official price data. They showed that it was feasible to scrape global and regional data from existing sources that could then be used by the World Bank to track price changes and monitor inflation.

Scraping Global Banana Prices

Team Ndizi first turned their sights on http://www.humuch.com, a global price repository. While the site allows users to look up specific commodity prices and map them, the data is not in a raw machine-readable format. Team Ndizi trained four of its members to scrape banana prices from the site and converted the data to a machine-readable format. Figure 11 shows a comparison of banana prices by continent as a proof of concept.

Figure 11: A bar graph of banana prices by continent created using data scraped from humuch.com.

Figure 10: A line plot of 2012 monthly average dry maize prices in five different regions in Kenya. This plot was created from data the team scraped from mFarm.
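The core of this kind of scraping, turning a rendered price table into machine-readable rows, can be sketched with Python's standard library alone. The HTML below is a made-up placeholder, not the markup of any site named in this report, and the prices in it are illustrative only. A real scraper would first fetch the page (for example with urllib.request) and would need to match each site's actual structure.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical page fragment; region names and prices are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Item</th><th>Price (USD/kg)</th></tr>
  <tr><td>Region A</td><td>Bananas</td><td>1.20</td></tr>
  <tr><td>Region B</td><td>Bananas</td><td>0.85</td></tr>
</table>
"""


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableRowParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

Writing the result as CSV keeps the scraped prices usable in spreadsheets and statistical tools without any further parsing.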
Scraping Farm Data in Kenya

The team next looked to the site mFarm, which forwards crop price information to farmers in Kenya via cellphone so they can make more informed decisions. By harvesting this data “exhaust” from the site, the team gleaned monthly price data for a range of crops across regions of Kenya. This data could be used to follow sudden changes in food prices or as a comparison to prices in other parts of the world. Figure 10 shows a graph of dry maize prices for five regions in Kenya across all of 2012. The team also began working on interactive line graphs and maps of this data so that the World Bank could easily access the data in a readable format.

Scraping Grocery Prices from South Africa

The team also scraped prices of a wide range of goods from Pick ‘n Pay’s grocery store websites in South Africa. These sites provide food prices for all of their products but, again, the prices are locked in the website and are not available for download and analysis. Team Ndizi freed this data using web scraping, and focused on 11 essential food types that contribute to a balanced daily diet. With this data, the team could price a typical food basket for someone surviving on a 2,000 calorie diet. They compared these prices across different countries and looked at breakdowns of how much each good contributed to the total cost of the basket. Figure 12 shows the daily cost per person for a balanced 2,000 calorie diet across four African countries and the US. Figure 13 shows the cost of a balanced 2,000 calorie diet in South Africa and what proportion of that cost is attributable to each good. The team also replicated this data for a few countries in Africa as well as sub-regions within each country, as shown in Figure 14. These results are significant as they show that, with regular scraping, the World Bank can create real-time measures of food basket prices around the world, even at the sub-national level.

Figure 12: The daily cost per person for a balanced 2,000 calorie diet across five countries (four African, one USA). The prices of 11 staple foods are included and visualized proportionally.
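Pricing a daily food basket from scraped per-kilogram prices comes down to a short calculation. The sketch below assumes each food is assigned a share of the 2,000 daily calories and has a known energy density. The report does not describe how the team composed its basket, so the allocation rule, the field names, and every number here are hypothetical placeholders.

```python
DAILY_KCAL = 2000


def basket_cost(items, daily_kcal=DAILY_KCAL):
    """Price a daily basket and break the total down by item.

    Each item needs: kcal_share (fraction of daily calories it supplies),
    kcal_per_kg (energy density) and price_per_kg (the scraped price).
    Returns (total cost, cost per item, share of total per item).
    """
    costs = {}
    for name, item in items.items():
        kg_needed = item["kcal_share"] * daily_kcal / item["kcal_per_kg"]
        costs[name] = kg_needed * item["price_per_kg"]
    total = sum(costs.values())
    shares = {name: cost / total for name, cost in costs.items()}
    return total, costs, shares


# Placeholder basket; none of these values come from the team's data.
basket = {
    "item_a": {"kcal_share": 0.50, "kcal_per_kg": 4000, "price_per_kg": 2.0},
    "item_b": {"kcal_share": 0.25, "kcal_per_kg": 2000, "price_per_kg": 4.0},
    "item_c": {"kcal_share": 0.25, "kcal_per_kg": 500, "price_per_kg": 0.5},
}

total, costs, shares = basket_cost(basket)
print(f"daily cost: {total:.2f}")
for name, share in shares.items():
    print(f"  {name}: {share:.0%} of the basket")
```

Re-running the same calculation on each new scrape, per country or per sub-region, would yield the kind of comparable basket price series the section describes.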
Perhaps most interestingly, Team Ndizi also validated their price scraping by comparing their price estimates against known cost-of-living calculators across the Web. Figure 15 shows Team Ndizi’s scraped prices for 11 basic food items alongside the prices reported by three well-known cost-of-living websites: Numbeo, Xpatulator, and Expatistan. From this plot we can see two compelling findings. First, the price that Team Ndizi found from scraping Pick ‘n Pay grocery store websites in South Africa is never more than a few cents away from the average estimate from the other cost-of-living sites. Second, only Numbeo actually measures all of these items, while the other sites were lacking at least one product that Ndizi was able to measure.

This data could be used in a number of different ways. Prices could be tracked over time, or the cost of living could be computed using different food products. A full view of food prices across the country (depending on the coverage of the grocery websites) could be provided to the World Bank.

Figure 13: Proportional breakdown of costs for a healthy balanced diet in South Africa.
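The validation step described above, checking each scraped price against the average of the reference sites that list the item, can be sketched as follows. The function name, the site labels, and all prices are hypothetical. The sketch also handles the case the text mentions, where a reference site does not cover an item at all.

```python
def compare_to_references(scraped, references):
    """Compare each scraped price with the mean of the available references.

    `scraped` maps item -> price; `references` maps site -> {item: price}.
    Items that no reference site covers map to None.
    """
    report = {}
    for item, price in scraped.items():
        quotes = [prices[item] for prices in references.values() if item in prices]
        if not quotes:
            report[item] = None
            continue
        mean = sum(quotes) / len(quotes)
        report[item] = {
            "reference_mean": mean,
            "difference": price - mean,
            "n_sources": len(quotes),
        }
    return report


# Placeholder prices; not taken from any site named in this report.
scraped_prices = {"item_a": 1.5, "item_b": 2.0, "item_c": 3.0}
reference_prices = {
    "site_a": {"item_a": 1.25, "item_b": 2.5},
    "site_b": {"item_a": 1.75},
}

report = compare_to_references(scraped_prices, reference_prices)
for item, row in report.items():
    print(item, row)
```

Recording how many sources back each comparison makes it easy to see which items only the scraper covers.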
Indonesian Rice Crisis Tracking

Team Ndizi’s final project looked at rice prices in Indonesia over time. The data was scraped from Carrefour Indonesia, a popular supermarket chain. The team also used the Wayback Machine to go back to historical versions of the website to collect data. Some experimentation was done with pulling prices from Twitter data as well, but there was not enough time to create a full-fledged “universal” scraper from all sources.

Figure 15: Food prices for 11 items that Team Ndizi scraped (orange) compared to prices from known cost-of-living websites.

Figure 16: World food prices, as reported by major monitoring agencies (green and yellow) vs. prices of two brands of rice in Indonesia (blue and red).
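Collecting historical prices through the Wayback Machine starts with a list of archived-page addresses to request, one per period of interest. The sketch below builds one address per month using the Wayback Machine's public address pattern (`https://web.archive.org/web/<timestamp>/<original URL>`), which serves the capture closest to the requested timestamp. The target address is a placeholder because the report does not give the page the team scraped, and the function names are hypothetical. Fetching and parsing each page is left out.

```python
from datetime import date

WAYBACK_PREFIX = "https://web.archive.org/web"


def monthly_dates(start, end):
    """Yield the first day of each month from `start` through `end`."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def snapshot_url(target_url, day):
    """Build a Wayback Machine address for the capture nearest to `day`."""
    return f"{WAYBACK_PREFIX}/{day:%Y%m%d}/{target_url}"


# Placeholder target; the real product page address is not in the report.
TARGET = "http://example.com/rice"

urls = [
    snapshot_url(TARGET, day)
    for day in monthly_dates(date(2010, 1, 1), date(2013, 3, 1))
]
print(len(urls), urls[0])
```

Since the archive returns the nearest capture rather than an exact match, a careful scraper would record the timestamp of the page actually served and drop duplicates.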
What was most interesting about the Indonesian price-scraping project was that it showed the importance of having more real-time food data in anticipating and managing crises. The team scraped the price of rice per kg for two different brands of rice (Si Pulen Crystal and Topi Koki Setra Ramos) in Indonesia from January 2010 to March 2013. Figure 16 shows these prices plotted over time against World Food Prices and the Food and Agricultural Organization’s (FAO) data. What is striking from this plot is that the rice prices in Indonesia rose roughly $1 (USD) per kg above the world’s food prices. This shows that early signs of the rice crisis may have been observed in this data that wouldn’t have been observed by looking at global data. Moreover, the team scraped data that extends beyond the estimates that the FAO was able to provide. FAO data stopped in October 2012.

Recommendations and Next Steps

Team Ndizi showed how easily food price data can be scraped and collected and has provided code and data to replicate their work. Some areas for future work and more careful analysis include:
• Using web-scraping techniques, as prototyped in the event, to create granular, near real-time measures of food prices at sub-national levels;
• Further examining the difference between scraped and official data as a tool for filling in gaps where the current measures of price data are aggregated or infrequent. The Indonesian rice price exploration from the DataDive may serve as an example of the useful perspective that could be gained from granular, local, near real-time data;
• More robust comparisons and correlations of the scraped price data with known economic metrics and historical data. During the event, the team conducted some basic validation of the scraped price data by comparing their price estimates against known cost-of-living calculators across the Web. The deeper dive will help the Bank determine how useful this granular view of price data can be;
• If the methods outlined prove to be useful, then a more basic ‘universal’ scraper could be built to automate the sampling of price data from sources around the Web.

Additional Resources

The team’s HackPad contains an in-depth list of examples of related projects, the datasets they collected, and the code needed to recreate their work.
LATIN AMERICA POVERTY ANALYSIS FROM MOBILE SURVEYS

Problem Statement

The World Bank’s poverty team is interested in finding new ways to measure welfare in Latin American countries. Two pilot surveys of household well-being, called Listening to Latin America (L2L), used cell phones. In this project, teams used the L2L data from the Peru survey to address the following question:
• Is it possible to draw inferences about changes in welfare at the national level using data collected with cellphones, or a combination of this data and the national household survey (ENAHO) data?

Datasets Available

• L2L Survey Data (SPSS Format, also CSV, currently)
• Peru: Initial F2F Questionnaire
• Honduras F2F Questionnaire*
• Peru: Mobile Questionnaire
• Honduras Mobile Questionnaire*
• L2L Final Report
• ENAHO data*

*The data was not analyzed at the event.

Key Findings

• More “Yes” answers were given to personal and negative questions when follow-up surveys were delivered by SMS or phone call than in person. This finding may indicate that people answer more honestly through impersonal channels than face-to-face. Mobile data collection may therefore be more accurate, in some contexts, than face-to-face surveys.
• Monetary incentives did not appear to influence response rates, regardless of the technologies involved.
• The data from these surveys provides very rich detail about the Peruvian people and could be used at a broad level to learn more about socioeconomic conditions in this country.

As with all other projects from the DC Big Data Exploration, the findings should be considered provisional, as there are a number of methodological issues that still need to be addressed, e.g. sample size, selection bias and validity of sources.

Methods and Analysis

Initial Exploration of the Survey Data

The team first dove into understanding the questions asked in each survey, before looking at the basic statistics of Peruvian respondents in aggregate.
The data on the Peruvian people was so rich that it could (and perhaps should) be the subject of a research project in its own right. With just a quick look, the team found the following basic statistics:
• Family sizes of approximately three to six members;
• In about half of the cases, neither of the respondent’s parents was educated;
• The number of hours of work each respondent reported was normally distributed around a mean of 40 hours with a standard deviation of about 15 hours;
• Most respondents worked in farming and related fields, with small business and housework being next most popular. The remaining occupations comprised the long tail of the rest of the data;
• 40% of respondents had soil floors;
• 8% of respondents had someone in the household lose their job in the last month;
• 8% of respondents had someone in their household find a new job in the last month;
• 7% of respondents had someone in their household miss school for lack of money in the last month;
• 13% of respondents had someone miss school due to sickness in the last month;
• 11% of respondents had someone robbed in their household in the last month;
• 9% of respondents had moved in the last three years.
These findings begin to paint the picture of socioeconomic conditions in Peru. They may already be known to the World Bank Poverty team, but the DataKind team felt it was worth bringing these figures up in case any were new or surprising. In either case, it is quite easy to repeat this analysis for future survey data.

Response Rate Analysis
The team’s next analysis was designed to confirm the results in the published study, which focused primarily on response rates in the follow-up studies, and to analyze which factors of the survey correlated with higher and lower response rates.
The team first looked at response rates regardless of follow-up technology.
Of the 1,600 people surveyed, about 1,000 didn’t respond to any follow-ups, while the other 600 responded to about half of the follow-ups. Given that households had agreed to participate in the six-month survey beforehand, the response rate seemed particularly low to the team. The team wondered if phone network connectivity was playing a role here and recommends that the World Bank explore rural versus urban response rates.

Effects of Survey Technology
The team next looked at the effects of follow-up technology on response rates. The overall goal of this study was to evaluate the effectiveness of mobile-phone based surveying. Answering this question could have far-reaching implications for the Bank’s ability to collect accurate information with the global reach and ease of using mobile technology.
The data the team was working with contained responses to an initial survey and then six follow-up surveys. The first survey was conducted in the traditional way: face-to-face and in person. The subsequent surveys were performed using one of three methods: human telephone conversation (CATI), pre-recorded voice mobile phone interview (IVR), and text-based (SMS).
The breakdown of response rates by technology is shown in the table below:

Method    Response Rate
CATI      50%
SMS       30%
IVR       25%

Here again we see fairly low response rates. The aforementioned question of cellphone reliability would be important to address. Interestingly, person-to-person telephone interviews (CATI) yielded the highest response rates, which may be due to the respondent’s feeling of responsibility to another person that they do not feel when ignoring a pre-recorded phone call or an SMS.

Incentives for Responses
The team also explored whether monetary incentives affected response rates (see Figure 17). The incentives, in this case, seemed to have little or no impact on response rate. Offering a small incentive instead of none produced no major increase, and raising the incentive from $1 to $5 appeared to have no impact on the response rate.
Figure 17: Response rates over six months broken out by technology type and monetary incentive offered (none, $1, $5).
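
The breakdown behind the response-rate table and Figure 17 can be sketched in a few lines. The team's own R code is not reproduced in this report, so the following Python is only an illustration: the record layout, the `response_rates` helper and the sample rows are hypothetical and do not reproduce the L2L figures.

```python
from collections import defaultdict

# Hypothetical follow-up records: (technology, incentive in USD, responded?).
# These rows are made up for illustration and are NOT the L2L data.
follow_ups = [
    ("CATI", 0, True), ("CATI", 1, False), ("CATI", 5, True), ("CATI", 0, False),
    ("SMS", 0, False), ("SMS", 1, True), ("SMS", 5, False), ("SMS", 1, False),
    ("IVR", 0, False), ("IVR", 5, True), ("IVR", 1, False), ("IVR", 5, False),
]


def response_rates(records, key):
    """Return {group: share of follow-ups answered}, grouping by key(record)."""
    answered = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = key(record)
        total[group] += 1
        answered[group] += record[2]  # True counts as 1, False as 0
    return {group: answered[group] / total[group] for group in total}


by_technology = response_rates(follow_ups, key=lambda r: r[0])
by_incentive = response_rates(follow_ups, key=lambda r: r[1])
by_both = response_rates(follow_ups, key=lambda r: (r[0], r[1]))

for technology, rate in sorted(by_technology.items()):
    print(f"{technology}: {rate:.0%}")
```

Grouping by the `(technology, incentive)` pair gives the cells that a chart such as Figure 17 would plot for a single survey round.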
The lack of incentive motivation may be particular to this survey, so it is recommended that the same analysis be performed on the Honduras data to see if findings similar to those in Peru are observed. Moreover, a more rigorous analysis should be performed to see if there are truly no statistically significant increases in responses based on incentives. A larger dataset may be needed to answer this question confidently. While the financial incentives did not appear to increase response rate on average, the team also thought it would be interesting to explore whether respondents’ answers to the questions about financial hardship affect the impact of the monetary incentive.

Effects of CATI vs. Technology-Intermediated Follow-ups
The team next looked to see if the presence of humans in the follow-up process (CATI) affected the responses in ways that purely technological follow-ups (SMS) didn’t. To address this question, the team looked at the ratio of “Yes” to “No” answers to very personal questions about negative outcomes, i.e. “did something bad happen to you in the last X months?”
In CATI follow-up surveys, the Yes:No ratio varies between 1:7 and 1:16, indicating that a fairly low proportion of people are reporting bad incidents happening to them (5.9% to 12.5% ‘yes’ responses). When looking at responses to the identical questions on text-mobile surveys, however, the Yes:No ratio rises to between 1:3 and 1:6 (14%-25% ‘yes’ responses). In other words, about twice as many people report something bad happening to them via an impersonal SMS follow-up survey than report it in a CATI follow-up.
There are many theories as to why this could be happening, not least of which is that the group surveyed via mobile may in fact have had more bad things happen to them. The experiment should be repeated with other respondents to see if the trend is observed again.
Other theories include:
• Selection bias: people who have had a bad event happen to them are more likely to respond to the surveys in the first place.
• Shame/Privacy concern: People under-report bad events happening to them in a face-to-face interview. There is precedent for this behavior: there is a body of literature on physician illness that shows serious under-reporting of depression in in-person interviews vs. anonymous interviews (~1% vs. 15% prevalence).
Researchers should begin by exploring not just the ratio between “Yes” and “No” answers, but also the response rates for different types of questions to determine whether selection bias is at work. The team was interested in knowing whether respondents who had something bad happen to them were more likely to respond because they had something to report.
It would also be interesting to repeat this study to see if the same effect is observed for questions that are considered “neutral” or “good”, e.g. did you find a job recently? A related question would be whether certain topics of questions (e.g. education, finance, household) were affected by the type of technology used in the surveys.
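
The percentages quoted in the CATI versus SMS comparison follow from the ratios by simple arithmetic: a 1:n ratio means one answer of one kind for every n of the other, so that kind makes up 1/(1+n) of all answers. A minimal sketch of the conversion, with a hypothetical helper name:

```python
from fractions import Fraction


def share_from_ratio(part, rest):
    """Convert a part:rest ratio into the part's share of all answers."""
    return Fraction(part, part + rest)


# Ratios quoted in the text for each follow-up mode.
ratios = {
    "CATI": [(1, 16), (1, 7)],
    "SMS": [(1, 6), (1, 3)],
}

for mode, pairs in ratios.items():
    shares = [share_from_ratio(part, rest) for part, rest in pairs]
    formatted = " to ".join(f"{float(share):.1%}" for share in shares)
    print(f"{mode}: {formatted}")
```

The upper ends of the two ranges (1:7 and 1:3) give 12.5% and 25%, which is where the "about twice as many" comparison comes from.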
The team explored these ideas graphically as well. Figure 18 shows the percentage of “Yes” answers to each of seven questions in the Peru Mobile Questionnaire, where each line represents the responses for the face-to-face, first follow-up, and second follow-up phases of the interviews. From this figure, we see a general upward trend on almost all questions over time. Without a doubt the percentage of “Yes” answers increases during follow-up 1 after the face-to-face. The team wondered what could account for such a pronounced increase in “Yes” answers during the follow-up sessions.
Figure 18: Percentage of “Yes” answers at each stage of the process for seven different questions.
The team dug in to see if talking to a human correlates with an under-reporting of answers. The team compared the percentage of “Yes” answers given to a very personal and negative question (whether the respondent had been robbed in the last month) across respondents that received follow-ups via SMS, IVR, and CATI. The results in Figure 19, which graphs the percentage of “Yes” responses given during the face-to-face, first follow-up, and second follow-up broken down by technology, could indicate that the more impersonal the mechanism, the higher the reported incidence of robbery. We can see that the IVR recipients reported the highest rate of robbery during their first follow-up, while they seemed about equal to the SMS group during the face-to-face and second follow-up. What is striking is that the CATI (human-voice interview) responses are consistently lower than the other two technologies.
Figure 19: Percentage of “Yes” responses given at each stage of the interview process, broken out by technology.
Again, these results could be due to legitimate differences in the groups (e.g. the SMS group, by chance, really did get robbed more often than the CATI group).

Recommendations and Next Steps
As the data is so rich, the team felt it would be useful to analyze differences in responses to questions, broken out by demographic groups, e.g. wealth, geographic location, educational attainment, age, etc. Presuming that the target population is the marginalized and rural poor, analyses could shed light on ways to modify future mobile surveys to target this population.
Some ideas for deeper analysis of the L2L survey:
• On monetary incentives: the Bank could do a deeper dive into the data to determine whether or not there is a relationship between the financial questions of the survey and the impact of the monetary incentive as well as retention in the survey.
• On the use of mobile for follow-up surveys: the team questioned whether wireless connectivity could play a role in the response rate of the follow-up surveys, especially where respondents live in rural areas. The Bank should also consider exploring more factors that may impact survey response rates.
• The analyses in this report should be repeated on the Honduras data and compared to Peru to see if similar trends emerge.
• Lastly, the World Bank should conduct a similar survey in Peru to see if the results can be duplicated.
There could be good evidence that mobile survey responses are more reliable and real-time than in-person surveys. If so, the World Bank would not only reach more people around the world through mobile surveys than it could reasonably support in person, but would also obtain results that are more accurate and complete than those provided by in-person surveys.

Additional Resources
• Charles F. Turner’s work on mode effect on the collection of data regarding sensitive or risky behavior. In particular, T-ACASI reduces bias in STD Measurements: The National STD and Behavior Measurement Experiment
• Mick Couper’s work on mode comparison
• Dr. Edith de Leeuw’s research regarding mode comparison
• Eleanor Singer has a comprehensive article on the impact of incentives on response rates in household surveys.
• CPS is a longitudinal survey that uses mixed methods that might be useful for survey methodology.
• R code to generate response rates by incentive and technology

MEASURING SOCIOECONOMIC INDICATORS IN ARABIC TWEETS

Background and Problem Statement
Conventional poverty measures are time-consuming and expensive to collect. The World Bank is interested in exploring alternative data sources for measuring poverty that are easier to collect and less expensive to update. It was hypothesized that monitoring conversations on Twitter may shed light on socioeconomic conditions based on what people talk about and attributes of their conversations. The team’s goal was to research what information could be drawn from these tweets to inspire future experts. They explored questions such as:
• Do the frequencies of key socioeconomic keywords (e.g. “broke” or “need money”) change over time and, if so, do those changes reveal anything interesting?
• Can we learn about the social network structure of people tweeting to each other and does that teach us anything about their socioeconomic conditions?
• What can we learn about someone tweeting from just their text or other aspects of their tweets?
• Can we correlate any of the activity on Twitter with standard poverty indicators?

Datasets Available
• 25GB of Arabic tweets spanning a six-week period from November 2011 to January 2012. Qatar Computing Research Institute (QCRI) delivered the tweets, and described them as a nationally representative sample. The dataset was so large that the team stored the data in a database on Amazon Web Services. The team used samples of the data to study their questions.
• An English to Arabic translation of key socioeconomic terms.

Key Findings
The World Bank could monitor tweets and other social media channels to potentially learn more about a range of socioeconomic indicators:
• The team found clear periodic cycles in features of the Twitter data. These could be correlated against existing poverty indicators;
• The team was able to identify a user’s location using only their message text and the times of day they tweeted;
• Gender can likely be detected from language patterns in text and could thus be used as input to socioeconomic modeling;
• It is possible to infer a measure of social connectedness from the network of tweets. This measure could be correlated with socioeconomic conditions.
As with all other projects from the DC Big Data Exploration, the findings should be considered provisional, as there are a number of methodological issues that still need to be addressed, e.g. sample size, selection bias and validity of sources.

Methods and Analysis
Keyword Usage
The team began their exploration by looking at the frequency of keywords over time. Using the English to Arabic translation, the team came up with three categories of terms they would track: everyday item terms, economic terms, and positive sentiment terms.

“Everyday” item mentions:
Gasoline نيزنبلورتب
Bread شيع جاصةنوبطةزبخزبخ
Rice زر زرأ
Meat محل ةمحل موحل
Milk نبلبيلح
Butter دبز ةدبز
Beans ايبولةيبوللوفايلوصاف
Cigarettes رئاجس ورراج
Car/Auto ةرايس ةيبرع

“Economic” mentions:
Price رعس راعسا ةنمثأنمث
Money دوقنلامسولف
Fees/Bills تاقفنفيراصم ةرتكافريتاوف
Purchasing/Buying ءارتشاءارشعفد
Credit/Loan قوست ةراعإفلسفيلستدصقيديركضرق
Salary/Pension بتارتابترمبترمشاعم
Work/Job لغشلمع
Rain اتشةيرطمتاطقاستراطمأرطم
Day(s), Week(s), Month(s) روهشنيرهشرهشعيباسانيعوبساعوبسأمايأنيمويموي

“Positive sentiment” words:
طوسبم ةطوسبم ديعس سيوك شوخ بيط زاتمم ♥ ♡ ةعئار ةولح ةفحت ليمج

The team wrote code to count the occurrences of phrases in each category to see if they changed significantly over time. Figure 20 shows these frequencies. While there may have been a slight increase in the total number of mentions over time, no significant trend was seen, nor was it possible for the team to identify sudden increases or decreases of phrases in this graph. With another poverty indicator, it could be possible to find correlations between the two datasets.
Next, the team turned to ways of controlling for different variables in the data once an indicator was determined for future projects. If they could write code to extract features of the tweets, such as periodic trends in tweet frequency or the gender of a user, those facts could help World Bank researchers correlate the Twitter data with other poverty measures, or researchers could block for these variables in an experiment.
Figure 20: Mentions of everyday terms (red), economic terms (green), and positive terms (blue) over time.
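
The team's counting code is not included in this report. As a minimal sketch of how category mentions per day might be tallied, the Python below uses hypothetical English placeholder keywords and made-up tweets in place of the Arabic term lists; the function and variable names are assumptions for illustration only.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical keyword sets; English placeholders stand in for the Arabic terms.
CATEGORIES = {
    "everyday": {"bread", "rice", "milk"},
    "economic": {"price", "salary", "loan"},
    "positive": {"happy", "great"},
}

# Hypothetical sample of (day, tweet text) pairs.
tweets = [
    (date(2011, 12, 1), "bread price went up again"),
    (date(2011, 12, 1), "great day, got my salary"),
    (date(2011, 12, 2), "no milk and no bread at the shop"),
]

WORD = re.compile(r"\w+")  # in Python 3 this also matches Arabic letters


def count_mentions(tweets, categories):
    """Return a Counter keyed by (day, category) of whole-word keyword hits."""
    counts = Counter()
    for day, text in tweets:
        # Whole-word matching avoids "rice" being counted inside "price".
        tokens = WORD.findall(text.lower())
        for category, keywords in categories.items():
            hits = sum(1 for token in tokens if token in keywords)
            if hits:
                counts[(day, category)] += hits
    return counts


mentions = count_mentions(tweets, CATEGORIES)
for (day, category), n in sorted(mentions.items()):
    print(day.isoformat(), category, n)
```

Multi-word phrases such as “need money” would need phrase matching rather than single tokens, which this sketch does not attempt.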

Timing of Tweets
Figure 21 shows the frequency of tweets by the day of the week while Figure 22 shows the percentage of tweets by time of day. The team observed clear cyclic trends in when people were tweeting, so time of day could be accounted for when performing a real experiment.
Figure 21: Number of tweets by the day of the week. 28% of tweets occur on the weekends. The most tweets occur on Monday. All times are in Greenwich Mean Time.
Figure 22: Percentage of tweets by time of day. All times are in Greenwich Mean Time.
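
Counts like those in Figures 21 and 22 can be produced by bucketing tweet timestamps by weekday and hour. The sketch below is an illustration under assumptions: the timestamps are invented, and the `utc_offset_hours` parameter is a hypothetical way of shifting from GMT to a user's local time.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical tweet timestamps, all recorded in GMT/UTC.
tweet_times = [
    datetime(2011, 12, 1, 9, 15, tzinfo=timezone.utc),   # a Thursday
    datetime(2011, 12, 1, 22, 30, tzinfo=timezone.utc),
    datetime(2011, 12, 2, 9, 45, tzinfo=timezone.utc),
    datetime(2011, 12, 4, 14, 0, tzinfo=timezone.utc),
]


def timing_profile(times, utc_offset_hours=0):
    """Count tweets by weekday and by hour after shifting to a fixed UTC offset."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    by_weekday = Counter()
    by_hour = Counter()
    for moment in times:
        local = moment.astimezone(local_zone)
        by_weekday[WEEKDAYS[local.weekday()]] += 1
        by_hour[local.hour] += 1
    return by_weekday, by_hour


gmt_weekday, gmt_hour = timing_profile(tweet_times)
local_weekday, local_hour = timing_profile(tweet_times, utc_offset_hours=3)
```

Shifting by three hours moves the late-evening GMT tweet into the early hours of the next day, which is why the time zone matters when comparing daily cycles across countries.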
While the plot in Figure 23 covers the entire dataset, we could isolate an individual user to their respective time zone. Knowing the user’s time zone would allow future researchers to account for time when running experiments. This data may also be available directly from Twitter.

Country References
The team next looked to see if they could determine the origin location of each tweet based on mentions of countries.
Figure 24 shows the number of times that each major Arab country was mentioned with a hashtag in a tweet. Bahrain was the most frequently mentioned country, followed by Syria. Many of the Bahrain tweets referred to the upcoming anniversary of the February 14 protests.
Identifying the location of a tweet is important because it could help researchers infer the socioeconomic conditions in that region. In producing Figure 24, the team also established methods for determining the origin country of a tweet using only the message text. While Twitter may provide this information to us automatically, other forms of social media may not. This code can be adapted to estimate locations of messages so that researchers can account for regional effects when running experiments.
Figure 23: Number of tweets by hour of day. All times are in Greenwich Mean Time.
Figure 24: Number of mentions of each country, computed by counting country hashtags.
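
Counting country hashtags, as described above, can be sketched as follows. The hashtag-to-country mapping and the sample tweets are hypothetical, and the choice to count each country at most once per tweet is an assumption rather than the team's documented method.

```python
import re
from collections import Counter

# Hypothetical mapping from lower-cased hashtag text to a country name.
COUNTRY_HASHTAGS = {
    "bahrain": "Bahrain", "البحرين": "Bahrain",
    "syria": "Syria", "سوريا": "Syria",
    "egypt": "Egypt", "مصر": "Egypt",
}

HASHTAG = re.compile(r"#(\w+)")  # \w also matches Arabic letters in Python 3

# Hypothetical sample tweets.
tweets = [
    "#Bahrain #feb14 anniversary is coming",
    "news from #Syria and #bahrain",
    "#سوريا",
    "#البحرين #Bahrain",
    "no hashtag here",
]


def country_mentions(tweets):
    """Count, per country, the number of tweets carrying one of its hashtags."""
    counts = Counter()
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        countries = {COUNTRY_HASHTAGS[tag] for tag in tags if tag in COUNTRY_HASHTAGS}
        counts.update(countries)  # each country counted once per tweet
    return counts


mentions = country_mentions(tweets)
most_mentioned = mentions.most_common(1)[0][0]
```

A mention of a country is not the same as a tweet's origin, so counts like these are only a rough proxy for location.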

Gender Identification
Knowing the gender of the author could be very beneficial to understanding the author’s socioeconomic conditions. It could also be another variable to account for when running experiments. The team identified resources that could be used to infer the gender of the author of each tweet from the message text alone. Using the suffixes of words in Arabic may be a straightforward way of determining gender.

Social Connectedness
Research suggests that people who are more socially connected are more affluent than those who are not. To pursue this idea, the team explored frequency of tweeting and social connectedness of tweeters in the dataset.
The team first counted the number of times each person tweeted during the three-month window (Figure 25). Most people tweeted once or twice, with very few people tweeting more. Two accounts tweeted five and six times, one of which appears to be a news source. With these results, Bank experts could try to determine the socioeconomic status of the Twitter accounts involved to see if there is a correlation between tweet frequency and affluence.
Figure 25: Groups of tweeters grouped by the number of times they tweeted in this three-month period.
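
The grouping shown in Figure 25 amounts to two counting passes: tweets per user, then users per tweet count. A minimal sketch with made-up account names:

```python
from collections import Counter

# Hypothetical author of each tweet in a sample.
authors = ["@a", "@b", "@a", "@c", "@d", "@d", "@d", "@e"]

tweets_per_user = Counter(authors)                    # user -> number of tweets
users_per_count = Counter(tweets_per_user.values())   # number of tweets -> number of users

for n_tweets, n_users in sorted(users_per_count.items()):
    print(f"{n_users} user(s) tweeted {n_tweets} time(s)")
```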
The team searched for every tweet in which a user “@” mentioned another user, and used these connections to build the social graph of all of the users. Each node represents a Twitter user and each line indicates that one of the two users it connects mentioned the other in a tweet. Larger nodes have more connections. Colors indicate social groups, e.g. nodes connected with green lines have more friends in common with other nodes connected with green lines than with nodes connected with red lines.
From this graph, we can see that there are a few large nodes, namely the large red one toward the upper left corner, which are highly socially connected. It would be interesting to look at these individual accounts to see if their socioeconomic status can be determined and whether there is a correlation between connectedness of nodes and socioeconomic conditions.
Figure 26: A connectedness graph showing Twitter users who mentioned one another. Each node represents a Twitter user and each line indicates that one of the two users it connects mentioned the other in a tweet. Larger nodes have more connections.
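
A mention graph of the kind shown in Figure 26 can be assembled from “@” mentions alone. The sketch below is an assumption-laden illustration: user names and tweets are invented, and the grouping of nodes into colored social groups is not reproduced.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")

# Hypothetical (author, tweet text) pairs.
tweets = [
    ("amal", "@badr did you see this?"),
    ("badr", "@amal yes, also ask @cyrine"),
    ("dina", "@amal hello"),
    ("amal", "thanks everyone"),
]


def build_mention_graph(tweets):
    """Return {user: set of users linked to them by a mention in either direction}."""
    graph = defaultdict(set)
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            if mentioned != author:  # ignore self-mentions
                graph[author].add(mentioned)
                graph[mentioned].add(author)
    return graph


graph = build_mention_graph(tweets)

# Node degree (number of distinct connections) is what node size encodes.
degree = {user: len(neighbours) for user, neighbours in graph.items()}
```

Users who never mention anyone and are never mentioned do not appear in the graph at all, so degree counts only describe the connected part of the sample.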

Recommendations and Next Steps
None of the assumptions about frequency of keywords or locations mentioned can be verified without a good indicator to measure against. The World Bank experts should identify key indicators that can be matched with the trends in the data or use their expertise to verify that patterns in the data track with some known qualitative measure of poverty:
• Subject matter experts in poverty should team with data scientists to help advise on the problem formulation beforehand as well as during the project;
• More detailed information about the tweets should be secured. We suspect the data we received was trimmed of GPS data, free-text locations, and more.

Additional Resources
• The team’s HackPad page
• The team’s project page on GitHub
• Final presentation
• UNGP projects on mining tweets for unemployment and crisis-related issues:
  o Study on monitoring crisis and stress (video)
  o Study on unemployment statistics
• Additional information about Twitter dataset from Vienna Open Data Day
• QCRI Permission to use data
• Male/Female language differences, from Deborah Tannen
• Stopwords in social signaling
• Kate Niederhoffer
• Jamie Pennebaker’s group at UT-Austin

ANALYZING WORLD BANK DATA FOR SIGNS OF FRAUD AND CORRUPTION

Detailed Problem Statement
The World Bank’s Fraud and Corruption team is faced with the weighty task of detecting individuals and companies that misuse or misappropriate funds on Bank-financed projects. Corruption can occur at almost any stage of the project pipeline, from design to bidding and final execution. It can be difficult for the Fraud and Corruption team to gain a full view of potential grievances because data about bidders, contractors, and contracts often live in different datasets around the Bank and are not consolidated. Moreover, a large amount of data about bidders and projects exists outside of the Bank, in areas such as project implementation units, that could be used to gain more insights about the bidders and contractors involved. The Fraud and Corruption team is often faced with the insurmountable task of tracking suspicious companies by hand.
The team explored how they could help strengthen and scale the World Bank’s methods using new data and analytical tools, focusing on the following main tasks:
• Creating contractor profiles containing external corporate data such as location, chief personnel, date incorporated, etc. Consolidating this information would help identify undisclosed relationships between firms, and hopefully lead to a method to discover shell corporations.
• Consolidating existing World Bank datasets and producing datasets from unstructured sources within the Bank. Using this data the team explored and built tools to highlight contractor behavior and activity, such as bidder relationships.

Datasets Available
• World Bank Project API
• OpenCorporates
• Major Contracts Awarded
• World Bank Project Pages
• Debarment Documents
For detailed lists of datasets used, see the Datasets sections of the team’s two HackPads: HackPad 1 and HackPad 2.

Key Findings
• Debarment data can be scraped to create a full list of all debarred companies, which can then be analyzed. The team created ranked lists of countries by number of debarments, based on debarment type, and changes in debarments over time;
• Using external corporation data, the team was able to measure relationships between “similar” suppliers. The team built network graphs that showed relationships between debarred and non-debarred firms that shared similar addresses, phone numbers, officers, or names;
• The team was able to scrape co-bidder information from the Web and used that data to build social networks of co-bidders. This code could be used by the Bank to identify suspicious activities between co-bidders;
• The team showed it was feasible to combine disparate Bank datasets into more unified supplier profiles. They wrapped this unified data into an API so that the Bank could have consolidated supplier information;
• The team analyzed project approval trends over time and found increases in the number of projects approved toward the end of each month and in the spring and early summer, specifically. The team did not draw any conclusions from this, but it could prove quite interesting to explore further.
As with all other projects from the DC Big Data Exploration, the findings should be considered provisional, as there are a number of methodological issues that still need to be addressed, e.g. sample size, selection bias and validity of sources.

Methods and Analysis
The major goal of the weekend project was to provide new datasets and algorithms that could automatically identify organizations, either bidders or contractors, as potential risks to the Bank. To this end, the team first created a number of datasets that they then analyzed for suspicious patterns.

Historical Debarment Data
The first task the team tackled was compiling a list of historical debarred firms. With this list, one could compare incoming bidders and contractors against debarred companies to see if they share suspicious similarities, e.g. same address or phone number.
The dataset was compiled with the help of the Wayback Machine, which allowed the team to see the Bank’s list of debarred firms over time. This approach showed that scraping the Web for data could be used to create a constantly updated list of debarred firms. The Bank, however, likely has this information internally. Digitizing it could sidestep the need for this approach.
Figure 27 shows the average time of debarment by country for firms that are not banned permanently, color-coded by whether countries are borrowing or non-borrowing. Greece tops the list for longest debarments and is a non-borrowing country.
Figure 27: Ranked list of countries by number of firms, along with proportions of firms permanently debarred.
Figure 28 shows countries as they are ranked by percentage of permanently debarred firms. Here Ireland and the United Arab Emirates top the list, with 100% of debarments in these countries permanent. However, they each have only a few firms debarred, so this is not completely surprising. The UK, in contrast, has a higher number of debarred firms than either, but only two-thirds of the debarments against UK firms are permanent. These patterns may be interesting to investigate further.
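
The two per-country measures discussed above (average length of non-permanent debarments and share of permanent debarments) can be computed from a flat list of debarment records. The sketch below assumes a hypothetical record layout in which a permanent debarment is stored as `None`; the firms and countries are placeholders, not data from the Bank's list.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (firm, country, debarment length in years, None if permanent).
debarments = [
    ("Firm A", "Country X", 3.0),
    ("Firm B", "Country X", 5.0),
    ("Firm C", "Country X", None),
    ("Firm D", "Country Y", None),
    ("Firm E", "Country Z", 2.0),
]


def summarize(debarments):
    """Return per-country firm count, mean non-permanent length, and permanent share."""
    lengths = defaultdict(list)
    permanent = defaultdict(int)
    total = defaultdict(int)
    for _firm, country, years in debarments:
        total[country] += 1
        if years is None:
            permanent[country] += 1
        else:
            lengths[country].append(years)
    return {
        country: {
            "firms": total[country],
            "mean_years": mean(lengths[country]) if lengths[country] else None,
            "share_permanent": permanent[country] / total[country],
        }
        for country in total
    }


summary = summarize(debarments)

# Rank countries by share of permanent debarments, highest first.
ranking = sorted(summary, key=lambda c: summary[c]["share_permanent"], reverse=True)
```

Reporting the firm count next to each share makes it easier to spot countries whose 100% figure rests on only one or two firms.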
Figure 28: Ranked list of countries by proportion of permanently debarred firms vs. temporarily debarred.
The team also looked at the change in the average number of non-permanent debarments in countries before and after 2006. Figure 29 shows the changes in debarment rates as sloping lines, colored by whether they’re increasing or decreasing. From this graph, we can see Bangladesh, China, and the US increased the most between the two time periods, while the UK, Indonesia, and Sweden decreased the most. Figures like these might be interesting to the Bank team in understanding where concentrations of debarments are changing over time. These may be indicative of areas that are becoming more corrupt or that are improving over time.
Figure 29: Changes in number of debarments by country before and after 2006. Each line connects a country’s debarment number between the two time periods. Red lines indicate decreases, green lines indicate increases.
Lastly, the team performed a visual analysis of the grounds for debarment for each country. A Tableau report on grounds for debarment can be viewed here (Windows required), while a full Tableau report on the debarred data can be downloaded here. Two highlights from the reports are shown below:
Figure 30: World map of debarments by type in each country.
In Figure 30, the size of each pie chart is proportional to the number of debarred firms. In Figure 31, there appear to be few consistent trends across countries, each being unique in its composition of reasons for debarment.
Figure 31: Ranked list of countries by most debarred firms, broken out by reason for debarment.

Debarred and Non-Debarred Firm Relationships
The team next looked at relationships between debarred and non-debarred firms. To analyze the similarities between these firms (e.g. similar addresses, phone numbers, names), the team first had to supplement the contractor and bidder profiles with identifying information. OpenCorporates, a freely available database of company registrations, was merged with the Bank’s list of non-debarred firms to add addresses, phone numbers, and officers.
The team now had two lists of firms with identifying information such as address and officers included: one for debarred firms and one for non-debarred firms. They built a network visualization to understand the relationships between debarred and non-debarred firms using a simple matching measure. Firms are represented as nodes: debarred nodes are red and non-debarred nodes are green. Nodes share an edge if they are considered “similar,” in the sense that they share an address, a phone number, officers, or a similar spelling of their names. Figure 32 shows one example of a network of connections between a major debarred company (the left, large red node) and all other firms. Note the high number of non-debarred firms connected to it.
Figure 32: Network diagram of connections between debarred (red) and non-debarred (green) firms. Edges exist between firms if they share common attributes like addresses, phone numbers, officers, or similar names.
This figure indicates that suspicious relationships may exist between the companies.
Further study should be done on the debarred/non-debarred groups that share edges to understand why they are linked and what that means.
For future work, the team suggested developing an automated system that flags a contractor when:
• the firm’s geodesic distance to a debarred firm falls below a certain threshold;
• j of its k nearest neighbors have debarred histories;
• it is classified as a debarred firm using a supervised classification algorithm trained on a carefully vetted sample of the data.
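
The first of the suggested flagging rules, geodesic distance to a debarred firm, can be illustrated with a small similarity graph and a breadth-first search. This is a sketch under stated assumptions, not the team's tool: the firm records, the `similar` rule (exact match on address, phone, or a shared officer; name similarity is left out) and the `max_distance` threshold are all hypothetical.

```python
from collections import deque
from itertools import combinations

# Hypothetical firm profiles; none of these are real companies.
firms = {
    "Debarred Co": {"debarred": True, "address": "1 Main St",
                    "phone": "555-0100", "officers": {"R. Roe"}},
    "Alpha Ltd": {"debarred": False, "address": "1 Main St",
                  "phone": "555-0111", "officers": {"J. Doe"}},
    "Beta Ltd": {"debarred": False, "address": "9 Side Rd",
                 "phone": "555-0122", "officers": {"J. Doe"}},
    "Gamma Ltd": {"debarred": False, "address": "7 Far Ave",
                  "phone": "555-0133", "officers": {"P. Poe"}},
}


def similar(a, b):
    """Two firms are linked if they share an address, a phone number, or an officer."""
    return (a["address"] == b["address"]
            or a["phone"] == b["phone"]
            or bool(a["officers"] & b["officers"]))


def build_graph(firms):
    """Return an undirected adjacency map over all firm names."""
    graph = {name: set() for name in firms}
    for left, right in combinations(firms, 2):
        if similar(firms[left], firms[right]):
            graph[left].add(right)
            graph[right].add(left)
    return graph


def distance_to_debarred(firms, graph):
    """Multi-source BFS: hops from each reachable firm to its nearest debarred firm."""
    distance = {name: 0 for name, firm in firms.items() if firm["debarred"]}
    queue = deque(distance)
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return distance


def flag(firms, max_distance=1):
    """List non-debarred firms within max_distance hops of a debarred firm."""
    distance = distance_to_debarred(firms, build_graph(firms))
    return sorted(name for name, hops in distance.items()
                  if not firms[name]["debarred"] and hops <= max_distance)
```

With the sample data, `flag(firms)` returns only the firm sharing an address with the debarred company, while `flag(firms, max_distance=2)` also picks up the firm that shares an officer with it. The nearest-neighbor and classifier rules would need a distance measure and labeled training data, which this sketch does not cover.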
Supplier Profile Database

Having looked at the debarred companies specifically, the team next turned to building a full Supplier Profile Database. Before the DC Big Data Exploration, much of the data on suppliers was scattered across different datasets. The team created a unified database enabling users to drill down on supplier information. This database was formed by combining results from search APIs to obtain supplier and Bank data. Code to link suppliers with the Bank’s projects can be found here and the full database code can be found on Cam Cook’s GitHub page.

The Bank can use this project as a framework to develop a tool that both fraud and corruption examiners and implementing agencies can use to analyze contractors or potential contractors. Users can manually examine relationships between firms. If the database is supplemented with data on debarred firms, examiners could identify relationships to known debarred firms or individuals.

Mapping Bidder Relationships

The team next looked for interesting patterns in the relationships between bidders. To begin, the team gathered the URLs of all award notices from the Bank. They then scraped these pages to produce data about each award and all the bidders involved, and generated a network of relationships between bidders. Figure 33 shows networks of co-bidders, where each node is a firm and groups of nodes all bid on contracts together. The visualization only includes firms that had bid on three or more contracts. Each edge indicates that the two connected firms bid on a contract together, with darker, thicker edges indicating more co-bids. The node size is based on the number of bids, and the bluer a node is, the more centrally connected it is.

Figure 33: Clusters of common co-bidders. Only companies that bid on three or more awards are included. Node sizes are proportional to number of bids and nodes are more blue the more central they are. Edge widths and colors are proportional to number of co-bids.
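As a rough illustration of how a co-bidder network like the one in Figure 33 could be assembled from scraped award notices, the sketch below counts bids per firm and co-bids per pair of firms. It is not the team's scraper or network code: the input format (one list of bidder names per award), the firm names and the function name are assumptions, and node centrality (the blue shading in the figure) is not computed here.

```python
from collections import Counter
from itertools import combinations


def build_cobid_network(awards, min_bids=3):
    """Build node sizes and weighted edges from per-award bidder lists.

    awards: list of lists, each holding the bidder names on one award notice.
    Returns (nodes, edges): nodes maps firm -> number of bids; edges maps a
    sorted (firm_a, firm_b) pair -> number of awards both firms bid on.
    """
    # Count each firm once per award, even if it is listed twice on a notice.
    bid_counts = Counter(firm for bidders in awards for firm in set(bidders))
    # Keep only firms with at least min_bids bids, as in the figure.
    kept = {firm for firm, n in bid_counts.items() if n >= min_bids}

    edges = Counter()
    for bidders in awards:
        firms = sorted(set(bidders) & kept)
        for pair in combinations(firms, 2):
            edges[pair] += 1

    nodes = {firm: bid_counts[firm] for firm in kept}
    return nodes, dict(edges)


# Hypothetical award notices, for illustration only.
awards = [
    ["Alpha", "Beta", "Gamma"],
    ["Alpha", "Beta"],
    ["Alpha", "Beta", "Delta"],
    ["Gamma", "Alpha"],
    ["Gamma", "Delta"],
]
nodes, edges = build_cobid_network(awards)
```

Here "Delta" is dropped because it bid only twice, and the Alpha-Beta edge carries the largest weight because those two firms bid together on three awards.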
From the figure we can see some very interesting patterns. G3 appears to be a set of bidders who have all bid on a contract with one another the same number of times. G1 has a highly irregular pattern in which one central node co-bids with a few other partners, each of whom has their own network of co-bidding. Again, a more in-depth analysis of the data needs to be undertaken to ascertain the relevance and accuracy of the findings. Code and data to build these networks can be found in the Additional Resources section.

Having shown that mapping bidder relationships is possible, the Bank can extend this method to conduct analyses of the impact of project factors on the bidding process. Adding the debarment data and/or contractor profiles would greatly help identify whether collusion is likely occurring. Fraud and Corruption staff could examine the degree of separation between firms; bidding firms with common addresses, officers, etc. might be more likely to be involved in collusion.

Lastly, the team examined trends in the project approval process, specifically the number of approvals made by the World Bank per year. The team acquired data on all projects approved by the World Bank between 1947 and 2012 and ran an analysis of the trends over time (Figures 34-35).

Project Approval Trends

Figure 34: Total number of project approvals made by the Board per year.
Figure 35: The left chart shows approvals per month, where we see an increase in the number of approvals toward the end of each month. The right shows the aggregate of all approvals.
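One way to quantify the approval trends discussed in this section is to compute, for each calendar month, its share of that year's approvals and then fit a least-squares line to the shares over time. The sketch below is a minimal version of that idea, not the team's analysis: the input format (a list of (year, month) pairs, one per approval), the sample data and the function names are assumptions.

```python
from collections import Counter


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def monthly_share_trends(approvals):
    """Return {month: slope of that month's share of yearly approvals}.

    approvals: list of (year, month) tuples, one per approved project.
    A positive slope means the month's share is rising over the years.
    Requires at least two distinct years.
    """
    approvals = list(approvals)
    per_year = Counter(year for year, _ in approvals)
    per_year_month = Counter(approvals)
    years = sorted(per_year)
    slopes = {}
    for month in range(1, 13):
        shares = [per_year_month[(year, month)] / per_year[year] for year in years]
        slopes[month] = ols_slope(years, shares)
    return slopes


# Hypothetical approvals: June's share rises while January's share falls.
approvals = (
    [(2010, 1)] * 3 + [(2010, 6)] * 1
    + [(2011, 1)] * 2 + [(2011, 6)] * 2
    + [(2012, 1)] * 1 + [(2012, 6)] * 3
)
slopes = monthly_share_trends(approvals)
```

Months with a positive slope correspond to rising trend lines and months with a negative slope to falling ones.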
Figure 36: Project approvals by month over time. June approvals seem to increase most.

Figure 36 shows the proportion of approvals by month of year over time along with trend lines fit by linear regression. Green lines indicate an increasing trend while red lines indicate a decreasing one. We can see that the spring and summer months have been rising in the proportion of approvals over time, meaning the Bank is approving proportionally more projects in late spring and early summer.

Recommendations and Next Steps

For future projects on debarment, it would be useful to have the following:
• A chronology of company/individual actions that resulted in debarment
• For the companies/individuals that have been debarred:
  o Did the projects request extensions?
  o Did the projects request additional funding?
  o Were they “problem” projects?

The Bank needs to create and maintain a formal "data warehouse" of its data that is cleansed, organized and well cataloged. The Bank should consider creating unified profiles for:
• Countries
• Suppliers
• Project types
• Projects
• Evaluation types
• Project activities
• Project activity outcome types
• Time periods
To enable a proper, useful data warehouse, rigorous data cleansing and ETL (Extract, Transform and Load) processes will need to be implemented.

When studying contractor relationships, the team found that there are companies that may not have the capacity to do the projects and therefore most likely hire subcontractors. Finding better data on subcontractors and individual consultants could greatly improve the quality of the results.

The World Bank may want to supply governments with standardized forms or questionnaires to accompany RFPs, asking for the experience of the individuals who will potentially be working on the projects as well as the backgrounds of management teams. This form should be uniform for all projects and ask specific questions about the capability of the contractors and subcontractors and their past experience and results on similar projects. Subcontractors should fill out the same form. The Bank may also want to pick a few subcontractors at random and interview them confidentially to find out more about the work being subcontracted.

Note: Even though the governments are the ones who grant/award projects, the Bank can provide this form to the governments and make it a requirement in the RFPs. The team was not sure how much the Bank is involved in the RFPs or whether RFPs are standardized.

This weekend’s results point to a rich set of future projects. High potential topics include:
• Creating automated algorithms to flag suspicious firms and companies as they come into the Bank’s pipeline. For example, building on the analysis at the DC Big Data Exploration, the Bank may test methods of flagging contractors when: a firm’s geodesic distance to a debarred firm falls below a certain threshold; j of its k nearest neighbors have debarred histories; or it is classified as an “at risk” firm using a supervised clustering algorithm trained on a carefully vetted sample of the data.
• Further analyzing the distribution of debarred firms over time using factors such as country of origin, firm size, firm industry, etc. The Bank may find it useful to investigate trends such as locally high concentrations of debarred firms, or concentrations of certain types of misconduct over time.
• Building a unified set of profiles for major Bank entities (e.g. suppliers, countries, etc.) beyond what the current APIs allow; developing a tool that both fraud and corruption examiners and implementing agencies can use to analyze existing or potential contractors. For example: automating the process of cross-checking suppliers against debarred firms, and alerting users to known relationships to debarred firms or individuals.
• A deeper analysis of co-bidder relationships to automatically flag suspicious behavior; the World Bank can extend this method to conduct analyses of the impact of project factors on the bidding process. Adding the debarment data and/or contractor profiles would greatly help identify whether collusion is likely occurring. Fraud and Corruption staff could examine the degree of separation between firms; bidding firms with common addresses, officers, etc. might be more likely to be involved in collusion.
• Factoring in new data streams in the vein of “civil witness” for understanding corruption during project execution.

Additional Resources
• Team HackPads
  o HackPad 1
  o HackPad 2
• API for Supplier Profiles GitHub
• Data visualization of debarred firms and individuals (Excel file)
• Code to generate network graphs of similarities between debarred and non-debarred firms
• Code to scrape bidder information
• The cleansed co-bidder data with co-bid groups included
• Python code used to scrape and parse award notices
• Excel file used to create network diagram of co-bidders
UNDP RESOURCE ALLOCATION

Background and Problem Statement

The UNDP Capacity and Performance team was trying to improve UNDP’s ability to fund development by examining the relationship between its staffing and its expenditure across programs.

Over the last several years, UNDP has increasingly focused on measuring and improving its performance. Its objective is to make sure that all resources that UNDP brings to developing countries are used as effectively as possible, produce maximum value and lead to tangible and sustained improvements in people’s lives. To reach this objective, it is critical for UNDP to be able to monitor how well its offices are performing, especially in implementing concrete programs and projects. UNDP must be able to identify, and ideally predict, weaknesses and potential setbacks, and to take timely action to correct course.

The DataKind team joined with UNDP experts to use their data to understand how well their projects have been performing. The team chose expenditure as the measure of performance for this analysis. They addressed the following questions:
• Are women or men more likely to work in specific program areas?
• What mix of workforce characteristics is associated with the greatest performance?
• Can workforce characteristics accurately predict a downturn in performance?

Datasets Available

The team compiled a dataset on the UNDP workforce, budget and expenditure from programs and projects that took place between 2008 and 2012. Each observation in the data represents an employee, a description of that employee, and the project he or she worked on between 2008 and 2012. Only employees who worked on projects listed in the Budget and Expenditure data were retained in this dataset.
The dataset is available here.

Key Findings
• Looking purely at budget and expenditure data, the team was able to classify UNDP projects into four broad categories of efficiency and analyze the breakdown of each type of project by country, region, time, and type.
• Key drivers of efficiency were mostly related to characteristics of the project and not staff characteristics. Among the variables in the data, the team found that the average number of years of service, the total number of staff, and how recent the project was were indicators of whether a project is more likely to spend more than budgeted.

As with all other projects from the DC Big Data Exploration, the findings should be considered provisional, as there are a number of methodological issues that still need to be addressed, e.g. sample size, selection bias and validity of sources.
Methods and Analysis

The team’s first challenge was to define a clear metric for the "success" of a project. UNDP did not appear to have an internal measure of success, so the team came up with measures that could be used to define "efficiency." In this analysis, an efficient project spends the budgeted money without excessive overhead.

The team first combined several data sets to get at these measures. First, they calculated the amount of overhead spent by country and year by adding the estimated salaries of personnel not associated with any program. Then they allocated this overhead back to the programs, proportional to the amount expended on each program.

Second, the team calculated the ratio of expended to budgeted money. Ideally this ratio meets or exceeds projected figures; in the worst case, a project is unable to put its money to use.

Figure 37: Plot of (overhead/expended) vs. (expended/budget). Each point is a project, and large (million dollar or greater) projects are indicated by red dots.

Figure 38: Each point is a project, and large (million dollar or greater) projects are indicated by red dots. The four regions of project types are shown as well.
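The two measures described above can be sketched as follows. This is a hedged illustration rather than the team's code: the function names, the sample figures, and in particular the cut-off values used to assign a project to one of the four types are assumptions, since the report does not state the boundaries of the regions.

```python
def allocate_overhead(expended_by_program, overhead_total):
    """Split a country-year overhead total across programs in proportion to spending."""
    total_expended = sum(expended_by_program.values())
    return {
        program: overhead_total * expended / total_expended
        for program, expended in expended_by_program.items()
    }


def classify_project(budget, expended, overhead,
                     low=0.8, high=1.0, max_overhead=0.3):
    """Assign one of four project types from the two ratios.

    low, high and max_overhead are hypothetical cut-offs chosen for
    illustration; budget and expended are assumed to be positive.
    """
    spend_ratio = expended / budget        # expended / budget
    overhead_ratio = overhead / expended   # overhead / expended
    if spend_ratio > high:
        return "spent more than allocated"
    if spend_ratio < low:
        return "couldn't spend all their money"
    if overhead_ratio > max_overhead:
        return "had high overhead"
    return "near target"


# Hypothetical programs in one country and year.
expended = {"P1": 600, "P2": 300, "P3": 100}
overhead = allocate_overhead(expended, overhead_total=200)
labels = {
    "P1": classify_project(1000, expended["P1"], overhead["P1"]),
    "P2": classify_project(250, expended["P2"], overhead["P2"]),
    "P3": classify_project(105, expended["P3"], overhead["P3"]),
}
```

Note that allocating overhead in proportion to expenditure gives every program in the same country and year the same overhead/expended ratio, so differences in that ratio arise only across countries and years.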
The four project types shown in Figure 38 are projects that:
• Couldn’t spend all their money (yellow)
• Had high overhead (purple)
• Spent more than allocated (red)
• Were near target (green)

These categories were then annotated back into the data files containing properties of programs, as described below. The team also created an interactive visualization of these ratios by year, available here.

Properties of Programs

The team then sought to understand basic properties of program spending over time. The team analyzed 324 projects from 2012 and explored their ratio of expenditure to budget. The table below shows the number of projects that fell into each type of expenditure ratio:

Expenditure Ratio    # of Projects with that Ratio
NA                   45
>1                   25
=1                   11
<1                   206
=0                   24
<0                   13

The largest category was projects with an expenditure ratio <1. That means that nearly two-thirds of the projects (206 of 324) underspent their budget. Breaking that category down further, the team explored what proportions of projects fell into more specific expenditure categories:

Expenditure Ratio    # of Projects with that Ratio
0.95 – 0.99          47
0.80 – 0.94          62
0.50 – 0.79          53
0.00 – 0.50          44

Here we see that 44 / 312, or about 14% of projects, spent less than half their budget. Dr. Harris, a consultant to the team, had noted that almost a third of projects in 2012 spent less than 80% of their budgets. From this table, we can see that the number is 97 / 312, or 31%. This is an issue that UNDP should explore more thoroughly, as it seems troubling that about 1/3 of projects can’t spend their budgets.

Programs by Year

Using updated versions of the staff and program files, the team then studied program performance by year. Figure 39 shows program expenditure in each of the six regions from 2008 to 2012. This figure shows a large dip in expenditure in the Oceania region from 2009 to 2011. Oceania spent less than the other regions during these years. It turns out that the majority of the almost one third of projects mentioned in Dr. Harris’s summary above are concentrated in this region. The team conjectured that reshuffling of funds could be one contributing factor. UNDP should consider comparing other similar programs in the regions to help determine whether UNDP projects overspent or underspent significantly.

Performance Measures of Programs

The team took some initial steps toward identifying indicators for performance in the program data. Trying to correlate staff expertise with performance proved difficult, as there were so many projects with 10 or fewer assigned staff.

Turning back to rate of expenditure, the team then explored the ratio of expenditure/budget for each individual bureau. For some bureaus, the histogram of expenditure/budget for their programs peaked sharply near 1, indicating good performance for the majority of projects coming out of that bureau. In other bureaus, the distribution was more spread out between 0 and 1, indicating large variation in how efficiently funds are disbursed. Deeper analysis of the bureaus with more varied distributions should be performed. The team also felt it would be interesting to look at this observation alongside country development data.

Figure 39: Program expenditure by year in each of the six major sites.
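The per-bureau comparison of expenditure/budget described above could be summarized numerically as well as with histograms. The sketch below is one hedged way to do so, assuming a hypothetical input of (bureau, budget, expended) records; the bureau names, figures and function name are invented for illustration, and the 0.8 cut-off mirrors the 80% threshold quoted in the text.

```python
from collections import defaultdict
from statistics import mean, pstdev


def ratio_spread_by_bureau(projects, threshold=0.8):
    """Summarize expenditure/budget per bureau.

    projects: iterable of (bureau, budget, expended) records.
    Projects with a missing or zero budget (the "NA" rows) are skipped.
    """
    ratios = defaultdict(list)
    for bureau, budget, expended in projects:
        if budget:
            ratios[bureau].append(expended / budget)
    return {
        bureau: {
            "mean": mean(values),
            # A small spread suggests a histogram sharply peaked around the mean.
            "stdev": pstdev(values),
            "share_below_threshold": sum(v < threshold for v in values) / len(values),
        }
        for bureau, values in ratios.items()
    }


# Hypothetical records: Bureau X is tightly clustered near 1, Bureau Y is not.
projects = [
    ("Bureau X", 100, 98), ("Bureau X", 100, 100), ("Bureau X", 100, 96),
    ("Bureau Y", 100, 20), ("Bureau Y", 100, 90), ("Bureau Y", 100, 50),
    ("Bureau Y", None, 10),
]
summary = ratio_spread_by_bureau(projects)
```

Bureaus with a large standard deviation or a large share of projects below the threshold would be natural candidates for the deeper analysis suggested above.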
Geographic Breakdowns of Program Efficiency

The team also looked at geographic breakdowns of programs. Figure 40 shows the ratio of expenditure / budget by country. From this figure alone it is not clear if there are surprising trends, but the team would like UNDP to weigh in on what they see in this plot.

Figure 40: Expenditure of projects by country.

Recommendations and Next Steps

First of all, a better measure of program efficiency and impact will help focus future analysis. Looking purely at expenditure and budget does not speak to the actual performance of the project. Much can be learned by looking at the variables that affect program efficiency as distinct from program performance.

Some of the factors that UNDP could investigate:
• Correlations between staff expertise and project performance using additional personal information about the team members as variables. Some variables that they might consider include: education level, ambition, and income level.
• Using more granular data about the projects, such as what was accomplished in a given year, the scope of the project, and progress towards outcome metrics.
• The team observed that in some regions, the expenditure-to-budget ratio varied more widely than in others. UNDP could further analyze the variance among expenditure-to-budget ratios by region.
• The team observed that 1/3 of projects spent less than 80% of their budget. UNDP should consider comparing similar programs across the regions to identify correlations across programs (size, sector and project).

Additional Resources
• Team HackPad
ADDITIONAL PROJECTS

There were two additional projects at the DC Big Data Exploration that teams worked on:
• Social networking analysis for risk measurement: Can you forecast project risk using social networking analysis tools?
• Can you use simple heuristic auditing to sniff out discrepancies in expenditure data: What do you do when you have the information but don’t know if it contains signals about potential fraud and corruption related risk?

Because these projects were not set up through DataKind, we unfortunately do not have detailed information about them. However, the link above leads to the hackpad contributed by the authors. Their involvement in the event shows that innovation can come from a wide community of innovators and technologists.

Next Steps

Thanks to the Bank’s willingness to team with DataKind prior to this event, the volunteer teams were able to deliver a huge amount of work to the Bank and its partners in a short amount of time. The major takeaways across all projects were:
• Huge amounts of data exist outside the Bank in the form of mobile, social media, and open data that must be brought to bear on the Bank’s problems. Data scientists could be brought in to fill the capacity gap in using and understanding this type of data.
• Greater effort needs to be made within the Bank to reconcile its data across departments for reusability and advanced analytics. One application could be creating unified profiles for entities like suppliers.

Events like DataDives, competitions, and startup weekends raise visibility for the Bank, unite the community and generate ideas; however, they will not lead to sustainable change unless the Bank commits serious resources to continuing the work and supporting it.

Additional Reading

Blogs that may be relevant:
1. Short recap blog - with links to raw project hackpads
2. Chris Kreutz’s recap of the DataDive in Vienna
3. Max Richman on scraping pricing data to measure poverty
4. Francis Gagnon on better data and the power of data visualization
5. Ben Ranoust on using visual analytics to probe risk factors influencing project outcomes
6. Marc Maxson on auditing the world - the sequel
7. Dennis McDonald on learning from data explorations
8. Giulio Quaggiotto and Prasanna Lal Das on personal data philanthropy
9. Milica Begovic, Giulio Quaggiotto, and Ben Ranoust on social networking analysis for development
10. Giulio Quaggiotto, Anoush Tatevossian and Prasanna Lal Das set the stage.