Big data, democratized analytics and deep context,


Published on

Paper analyzes how big data, democratized analytics, deep context are changing how we think and do development. Outlines key new technologies, analysis techniques and tools that will have a major impact on development research. Classifies into data, analytics and feedback layer.

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Big data, democratized analytics and deep context,

  1. 1. Big Data, Democratized Analytics and Deep Context will Change How We Think and Do Development Aniket Bhushan, Senior Researcher, The North-South Institute, Dec 21, 2011International development as a field of research and practice has been more of a laggard than aleader in using big data and powerful analytics. Much of the data is often dated or of poorquality. Huge areas including those of the greatest interest remain entirely unmapped, data pooror otherwise poorly understood.This situation is changing faster than anyone predicted and the set of tools driving this evolutionrepresents one of the most important trends in international development. The proliferation ofmobile technologies, computing power and democratization of analytics within an open-source,open-data chapeau will fundamentally change the way we think about and do development. Iprovide a synopsis of the most impactful developments in three key areas: the base (data) layer,the analysis layer, and the feedback (data) layer. Together these advances are changing how wethink about data in development, how we develop a deeper contextual understanding at a verymicro level without losing the ability to aggregate and generalize, and how we bring it alltogether in a meaningful way to make better decisions. 1
  2. 2. International institutions: Public sector World Bank, data at IMF, UN Data various levels Priavte sector (proprietary) Base layer: Big and Open Visualization Virtualization Analytics layer Anonymous - bigPurposive - Feedback push layer "crowds" 2
  3. 3. The Base Layer: Open Data and Big DataHow do we know what we know in the field of international development? What is theinformation, or evidence base, who generates it and how? There are at least three maincollectors, sorters and repositories of development information: international institutions (UN,World Bank, IMF etc), national and sub-national official public sector institutions and the privatesector. Change is afoot in each.Take for instance the open data push byinternational institutions such as the World Bankand African Development Bank, or the UN’s GlobalPulse initiative. Opening up the World Bank’sdatabank has made a huge amount of informationavailable to a wider range of stakeholders than everbefore, similarly Global Pulse is creating a platformto harness new data streams. Groups such asDevelopment Gateway and Aid Data have sized theopportunity to push openness further. A good example of what is made possible by this openingis a tool like Development Loop which not only plots all World Bank and African DevelopmentBank projects at precise geographic locations across Africa but also overlays the same withfeedback sourced from the intended beneficiaries of the project or initiative. This full circle or loop is a powerful reminder of the importance of data transparency and universal standards. Opening up aid data in a standardized format will make geocoding a potent tool for real transparency and accountability. To appreciate what a game-changer open data can be examine the current state of affairs. Research conducted by UK based Publish What You Fund recently in Uganda aimed at simply tallying up financial resources available for development,found that the government was unaware of the amount donors planned to spend in that year(2006-07). The planned expenditure was more than double what the government was aware of;indeed financial resources flowing into the country were far higher than had been estimated. Ortake another example, what the World Bank’s Chief Economist for Africa calls a “statisticaltragedy”. The majority of Africa’s population lives in countries that still use an outdated (1960s)method of national income accounting used to generate fundamental data points such as GrossDomestic Product (GDP). Ghana for instance only shifted to the 1993 UN system of national 3
  4. 4. accounts last year. When they did so they found their GDP was 62 percent higher thanpreviously thought, catapulting the country to ‘middle income’ status.If even the most basic information often taken for granted is riddled with problems then what dowe really know? Data reliability is one issue but time-lag is another. Most of the data used ininternational development is stale by the time it is called upon in decision making. Theinformation base we rely on in international development needs to be bolstered by buildingbridges with new sources and data-streams.Opening up proprietary private sector data and exposing it to the concerns in internationaldevelopment in the coming years will be a game changer. To date international institutions havemade the most progress towards data openness, select public sector authorities (for instanceunder the purview of the Open Government Partnership) are also making progress, but it is theprivate sector – the main repository of “big data” – that is the holy grail. If you total all the datacollected by the US Library of Congress (one of the largest public sector repositories) it wouldbe about 235 terabytes as of April 2011. Wal-Mart processes and stores about 2500 terabytes perhour! The big data revolution is changing business models fundamentally, and like capital andlabour data itself has become a commercial driver and firms like Google, Facebook, Twitters arethe first “data factories”, pioneers much like their antecedents in the industrial revolution. This is truly big data and its growth has exploded off the charts thanks in large part to the explosion of mobile sensors (the most important of which is the mobile phone but think also credit cards, laptops, GPS and everything from radio-frequency to QR codes), and the rapid democratization of high-power analytics. Whether it is geocoded mobile phone data modeled to track slum development or predict microfinance loan defaults or provide weather- indexed insurance to small farmers big data is already a game-changer in development and we have only begun to scratch the surface.The Analytics Layer: Virtualization, Visualization are driving DemocratizationAnalytics is simply the collection of tools and techniques used to make sense of data. High-power analytics proved a game-changer in the commercial sector and can now do the same in thesocial sector. At its core analytics is about unearthing and understanding relationships andpatterns. Analytics helped retailers discover unlikely trends, most famously that customers whocame in to buy diapers also tended to buy beer! It can do the same for complex social systems. 4
  5. 5. Developments in analytics have kept pace with the speed with which big data has grown.Bringing this capacity to bear on development challenges such as food security and urbanizationis just getting started. Before we look at some examples let us put in perspective what we mean by the explosion of mobile sensors. The mobile phone is already the most celebrated example; anyone who has visited Kenya (or really any African country) has seen vividly the power of mobile money. This is old news. Anyone who has followed elections in Kenya probably also heard of Ushahidi, a locally developed open-source platform to map reports of post-election violence which went live in 2008. This is also old news.What is new is what is now made possible by bringing to bear requisite analytics on the hugeproliferation of mobile sensors. Mobile phones have grown from under 750million with morethan two thirds in developed countries to over 5billion with about four times as many indeveloping countries as in the developed world. Of the 5billion about 1billion live on less than$5 a day. The developing world is the leading driver of mobile big data. This includes voice,text, financial, locational and positional information, which is now possible to overlay with thebase data layer described earlier (income, health, education and other indicators generated byofficial sources) to produce new insights into real behaviour and complex incentive structures.Sticking with Kenya, take the example of theEngineering Social Systems lab. Couplingterabytes of mobile phone data with Kenyancensus information ESS is modeling the growthof slums to inform urban planners about where tolocate services such as water pumps and publictoilets. In Uganda the same group is developingcausal structures of food security, in Rwandathey collected a sample of every phone-call overa four year period, coupled with a randomsurvey, to analyze how different people react to the same economic shock. What is reallyinteresting is the way experimental initiatives are being brought out of the lab into real world 5
  6. 6. application. The shift that is taking place is fundamental; away from models governing theory tomodels informed and built on real networks (see Reality Mining).For long development analysis at best has been limited to correlations and inferences based oncorrelations, for the first time big data coupled with high-power analytics is opening up thepossibility if not of entirely causal dynamics then at least more robust inferences. Our traditionalmethods of inquiry have conditioned us to think in terms of generalizing on the basis of randomsampling, for the first time the proliferation of mobile sensors is making possible highly targetedyet nonintrusive and anonymous inquiry.And this is not another story about mobile phones alone. The point is the rapid emergence ofaltogether new data-streams, in step with the development of analytical capacity to draw usefulinference out of them. Take Twitter for instance which generates information about the size ofthe entire US Library of Congress in two weeks and together with Facebook has already shownits efficacy during the Arab uprisings. At the heart of this evolution are open-source softwaresystems and tools that allow the simultaneous collection, categorization, and analysis of variousdata types from Twitter hashtags to videos to positional data and machine IDs. Swift Riverdeveloped by Ushahidi is an example of a free open-source platform that enables rapidsimultaneous filtering and verification of real-time data from channels like Twitter, SMS, Emailand others. It also visualizes the information in dashboards that the average user can understand.This is particularly powerful for monitoring immediate post-crisis developments when theinformation flow suddenly increases but is also only useful if immediately analyzed. Democratization of analytics driven by a commitment to open-source is furthered by virtualization of platforms and visualization of information (to make it engaging for the average user). An aspect of virtualization is community or crowd driven problem solving. Take for instance Data Without Borders, a pro-bono data scientist exchange. DWB organizes ‘data dives’ to help NGOs, civil society organizations and other who might not have the time, capacity or inclination but may be sitting on information useful forpurposes beyond their imagination, to make sense of their own information. At a recent data diveDWB helped a human rights group that allows users to anonymously upload information aboutviolations get a better look at who was using the system and plot trends, without compromising 6
  7. 7. anonymity. Using open-source tools (such as R) they also visualized the information to make iteasy to understand.Here are some in a fast growing toolkit worth following: Social network analysis: with the growing penetration of web 2.0 technologies, social media is becoming the dominant communication channel for rapid exchange. Network analysis includes a set of techniques used to characterize relationships among discrete nodes in a graph or a network. In social network analysis, connections between individuals in a community or organization are analyzed, e.g., how information travels, or who has the most influence over whom. Examples of applications include identifying key opinion leaders, and identifying bottlenecks in enterprise information flows. Two important techniques within network theory and analysis are exploratory data analysis (EDA) and link analysis. EDA is an approach for analyzing large datasets in summarized formats according to their main characteristics. EDA came about in part as a reaction to overdue emphasis in statistical fields on hypothesis testing or “confirmatory analysis”. EDA emphasizes using the data to suggest hypotheses to test. It emphasizes testing assumptions on which inference is based. Similarly link analysis focuses on analyzing relationships among nodes through visualization methods, including network diagrams and association matrices. Gephi is an interactive open source (java based) platform for complex systems and network analyses. Automated web-scraping: almost everything is on the web today, which means almost everything has a hypertext (HTTP) related identity. Web scraping is an automated technique for collecting information from the web. Web scraping transforms typically unstructured data in HTML format into structured data that can be centralized in a spreadsheet. Simply using MS Excel it is possible for instance to scrape tables and other data in a variety of unstructured formats (including real-time, e.g. weather information on world clock, or stock price info), and save them as a spreadsheet for further use. Web scraping, though somewhat mired in legal and privacy issues, has the potential to massively increase the volume of accessible information. For example UN Global Pulse is running a project with Price Stats and the Billion Prices Project at MIT, investigating daily bread prices across Latin American countries by scraping the web for online prices. The end product is a demonstration, an e-bread index which tracks bread price inflation real-time (daily), and can be compared contrasted or can complement the traditional consumer price index (which is only published on a monthly basis). UN Global Pulse is also pioneering Hunch Works, the first social network for hypothesis formation, evidence gathering and decision making. Hunch Works allows researchers to connect with other experts with complementary resources so that together they could quickly determine if data signals are indications of deepening crisis and warrant further investigation. 7
  8. 8. World Bank – Adept software platform: ADePT is a free (STATA based) platform developed by the World Bank that automates and standardizes analysis. ADePT allows complex statistical analyses, and direct pre-configured access to a range of micro level data from the Bank’s and other sources. It is particularly useful for economic analysis and particularly useful in developing countries where researchers may not have ready access to otherwise expensive statistical packages. Hadoop – distributed parallel analytics: is big data is the latest buzz word in IT then Hadoop is a large part of the story behind it. Hadoop is a platform for distributing problems, tasks, analyses across a number of servers, speeding up analysis and shrinking the distance between data, analysis and result. Hadoop works behind things you know well but you have never seen it or are likely to know it. For example one of the most well-known implementations is Facebook, which brings in core data stored by you into Hadoop clusters where it is reflected against your friends, their interest to suggest recommendations back to you. High power distributed parallel analytics solutions like Hadoop help square the big data circle, make it small, in real-time.The Feedback Layer: Deep Context, Complex Microsystems, Real-Time LoopsThe efficacy of the feedback layer is also new. This layer has two key aspects: the purposive orpush driven response and the big anonymous response (discussed above). Targetedcrowdsourcing has already come a long way. The Ushahidi experience in Kenya for instance alsoworked for monitoring elections in Afghanistan and tracking emergencies including the choleraoutbreak during the Haiti earthquake. Mobile phone SMS platforms have been adapted to makeparticipatory budgeting more inclusive in hard to reach areas such as conflict-affected South-Kivu in the Democratic Republic of Congo and results have been encouraging.To understand how powerful the feedback layer can be consider the experience of the MobileAccord, which at the initiative of the World Bank’s World Development Report 2011, ran GeoPoll an SMS based targeted polling in the DRC. The poll asked 10 sensitive questions includingabout topics such as rape and violence against women in conflict zone. The survey produced1.2million text responses and the outputs were turned into a video “DRC Speaks” whichcaptured people’s responses to questions about their experiences in their own words. This endedup being one of the largest surveys ever conducted in the country. 8
  9. 9. There is a pattern here. In the base layer more and more (new and old) data is opening every day.In the analytics layer experimental ideas are leaving the lab for real world application;virtualization and visualization are helping foster new communities geared towards collaborationand collective problem solving. Similarly in the feedback layer the tools are also democratizing.Ushahidi has created a very easy to use version of their implementation called CrowdMap.Anyone who knows how to set up an email account will be able to set up their own incidentmapping of whatever trend, alert or issue they are interested in getting feedback from the crowdon. The service is already being used to track everything from citizen report cards assessingcorruption in India to the Syrian uprising to national emergencies on the tiny island of Samoa.The feedback layer is also innovating in reverse – backwards from new to old media and formats– something essential in international development where so much remains unmapped or in non-digital formats. Take mapping. In the developing world there are still vast areas beyond GPScoverage. But people in those communities have intimate knowledge of their surroundings. Ifonly there was a way to tap into that knowledge and build a bridge between that data source andan open-source digital resource like Open Street Map. Walking Papers does precisely that.Contributors can simply draw, by hand, on simple paper, say a map of their neighbourhood andupload to Walking Papers, where a community specializes in taking that information and using itto deepen Open Street Maps.Looking AheadInternational development as a field of practice and research has tended to be a laggard in usingbig data, powerful analytics and innovative sourcing techniques, and with good reasons.However this is changing faster than anyone predicted and faster than most organizations that‘do development’ are prepared for. The open data movement has widened access to a broadrange of basic contextual information. A similar push is needed to open private sector data in theservice of social good. Big data is beginning to have a big impact on how we think aboutdevelopment challenges and this is because we have the ability to make it understandable likenever before. Powerful analytical tools and collaborative platforms are dramatically changing 9
  10. 10. what is possible for even the most intractable challenges like understanding socioeconomic risksand responses, dealing with urban planning, and better preparing for emergencies. For the firsttime we have a feedback layer which has made possible deep and near real-time awareness ofwhat is working or not working where and why. Together big data, democratized analytics andthe ability to tap deep contexts will change the way we think and do development in the comingyears. 10
  11. 11. BibliographyAid Data, Development Gateway, African Development Bank - Development Loop. Development Loop. n.d. (accessed Dec 21, 2011).Courtney, Alexa, and David Kilcullen. "Big data, small wars, local insights: Designing for development with conflict-affected communities." What Matters. McKinsey & Company, December 2, 2011.Crowd Map. n.d. Without Borders. n.d., Steve, and Jonathan Bays. "Harnessing big data to address the world’s problems." What Matters. McKinsey & Company, November 2, 2011.DRC Speaks (Geo Poll). n.d., Nathan. Reality Mining. Dataset:, Massachusetts: Massachusetts Institute of Technology, 2009.Engineering Social Systems. "Big Data for Social Good." Engineering Social Systems. n.d. (accessed December 21, 2011).Jay Chen, Trishank Karthik, Lakshminaryanan Subramanian. Contextual Information Portals. Online:, Association for the Advancement of Artificial Intelligence, n.d.McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity. McKinsey & Company, 2011.Metha, Abhishek. Big Data: Powering the Next Industrial Revolution. White paper:, Seattle: Tableau Software, n.d.Priebatsch, Seth. The game layer on top of the world. Video online at:, Boston: Ted Talks, 2010.Publish What You Fund. "Aid Budgets in Uganda." Publish what you fund. n.d. (accessed December 21, 2011).Shantayanan Devarajan. "Africa’s statistical tragedy." Africa Can End Poverty (blog of World Bank, Africa Chief Economist). October 6, 2011. (accessed October 6, 2011).Swift River. n.d. Global Pulse. n.d. n.d. Papers. n.d. 11