Big data for development

Junaid Qadir, Professor, Information Technology University (ITU), Lahore, Pakistan
Feb. 9, 2016

Editor's Notes

  1. An exabyte is a billion billion (10^18) bytes.
  2. Figure Credit: http://www.slideshare.net/cloudera/cloudera-for-internet-of-things
  3. From Data Driven by DJ Patil and Hilary Mason http://www.oreilly.com/data/free/files/data-driven.pdf Democratizing Data The democratization of data is one of the most powerful ideas to come out of data science. Everyone in an organization should have access to as much data as legally possible. While broad access to data has become more common in the sciences (for example, it is possible to access raw data from the National Weather Service or the National Institutes for Health), Facebook was one of the first companies to give its employees access to data at scale. Early on, Facebook realized that giving everyone access to data was a good thing. Employees didn’t have to put in a request, wait for prioritization, and receive data that might be out of date. This idea was radical because the prevailing belief was that employees wouldn’t know how to access the data, incorrect data would be used to make poor business decisions, and technical costs would become prohibitive. While there were certainly challenges, Facebook found that the benefits far outweighed the costs; it became a more agile company that could develop new products and respond to market changes quickly. Access to data became a critical part of Facebook’s success, and remains something it invests in aggressively. All of the major web companies soon followed suit. Being able to access data through SQL became a mandatory skill for those in business functions at organizations like Google and LinkedIn. And the wave hasn’t stopped with consumer Internet companies. Nonprofits are seeing real benefits from encouraging access to their data—so much so that many are opening their data to the public. They have realized that experts outside of the organization can make important discoveries that might have been otherwise missed. For example, the World Bank now makes its data open so that groups of volunteers can come together to clean and interpret it. It’s gotten so much value that it’s gone one step further and has a special site dedicated to public data. Governments have also begun to recognize the value of democratizing access to data, at both the local and national level. The UK government has been a leader in open data efforts, and the US government created the Open Government Initiative to take advantage of this movement. As the public and the government began to see the value of making the data more open, governments began to catalog their data, provide training on how to use the data, and publish data in ways that are compatible with modern technologies. In New York City, access to data led to new Moneyball-like approaches that were more efficient, including finding “a five-fold return on the time of building inspectors looking for illegal apartments” and “an increase in the rate of detection for dangerous buildings that are highly likely to result in firefighter injury or death.” International governments have also followed suit to capitalize on the benefits of opening their data. One challenge of democratization is helping people find the right data sets and ensuring that the data is clean. As we’ve said many times, 80% of a data scientist’s work is preparing the data, and users without a background in data analysis won’t be prepared to do the cleanup themselves. To help employees make the best use of data, a new role has emerged: the data steward. 
The steward’s mandate is to ensure consistency and quality of the data by investing in tooling and processes that make the cost of working with data scale logarithmically while the data itself scales exponentially.
  4. MORE DATA BEATS A CLEVERER ALGORITHM Suppose you’ve constructed the best set of features you can, but the classifiers you’re getting are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data (more examples, and possibly more raw features, subject to the curse of dimensionality). Machine learning researchers are mainly concerned with the former, but pragmatically the quickest path to success is often to just get more data. As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.) Part of the reason using cleverer algorithms has a smaller payoff than you might expect is that, to a first approximation, they all do the same. This is surprising when you consider representations as different as, say, sets of rules and neural networks. But in fact propositional rules are readily encoded as neural networks, and similar relationships hold between other representations. All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of “nearby.” With non-uniformly distributed data, learners can produce widely different frontiers while still making the same predictions in the regions that matter (those with a substantial number of training examples, and therefore also where most test examples are likely to appear). This also helps explain why powerful learners can be unstable but still accurate. Figure 3 illustrates this in 2-D; the effect is much stronger in high dimensions. In machine learning, is more data always better than better algorithms? https://www.quora.com/In-machine-learning-is-more-data-always-better-than-better-algorithms No. There are times when more data helps, there are times when it doesn't. Probably one of the most famous quotes defending the power of data is that of Google's Research Director Peter Norvig claiming that "We don’t have better algorithms. We just have more data.". This quote is usually linked to the article on "The Unreasonable Effectiveness of Data", co-authored by Norvig  himself (you should probably be able to find the pdf on the web although the original is behind the IEEE paywall). The last nail on the coffin of better models is when Norvig is misquoted as saying that "All models are wrong, and you don't need them anyway" (read here for the author's own clarifications on how he was misquoted). The effect that Norvig et. al were referring to in their article, had already been captured years before in the famous paper by Microsoft Researchers Banko and Brill [2001] "Scaling to Very Very Large Corpora for Natural Language Disambiguation". In that paper, the authors included the plot below. That figure shows that, for the given problem, very different algorithms perform virtually the same. however, adding more examples (words) to the training set monotonically increases the accuracy of the model. So, case closed, you might think. Well... not so fast. The reality is that both Norvig's assertions and Banko and Brill's paper are right... in a context. But, they are now and again misquoted in contexts that are completely different than the original ones. But, in order to understand why, we need to get slightly technical.  (I don't plan on giving a full machine learning tutorial in this post. If you don't understand what I explain below, read my answer to How do I learn machine learning?) Variance or Bias? 
The basic idea is that there are two possible (and almost opposite) reasons a model might not perform well. In the first case, we might have a model that is too complicated for the amount of data we have. This situation, known as high variance, leads to model overfitting. We know that we are facing a high variance issue when the training error is much lower than the test error. High variance problems can be addressed by reducing the number of features, and... yes, by increasing the number of data points. So, what kind of models were Banko & Brill's, and Norvig dealing with? Yes, you got it right: high variance. In both cases, the authors were  working on language models in which roughly every word in the vocabulary makes a feature. These are models with many features as compared to the training examples. Therefore, they are likely to overfit. And, yes, in this case adding more examples will help.
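The bias/variance point above can be made concrete with a learning-curve experiment. Below is a minimal sketch (Python with scikit-learn, on synthetic data rather than the Banko and Brill corpus) comparing a simple, high-bias model with a flexible, high-variance one as the training set grows; the dataset and model choices are illustrative assumptions, not taken from the talk.

```python
# Learning curves: test accuracy as a function of training-set size for a
# high-bias model (Gaussian naive Bayes) and a high-variance model (an
# unpruned decision tree) on a synthetic classification problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=50, n_informative=10,
                           random_state=0)

for name, model in [("naive Bayes (high bias)", GaussianNB()),
                    ("deep decision tree (high variance)", DecisionTreeClassifier())]:
    sizes, _, test_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.05, 1.0, 6), cv=5, n_jobs=-1)
    print(name)
    for n, s in zip(sizes, test_scores.mean(axis=1)):
        print(f"  {n:6d} training examples -> test accuracy {s:.3f}")
```

Typically the high-variance model keeps improving as more data arrives while the simpler model saturates early, which is the regime in which "more data beats a cleverer algorithm".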
  5. http://www.amazon.com/The-Big-Data-Driven-Business-Competitors/dp/1118889800
  6. http://www.amazon.com/The-Big-Data-Driven-Business-Competitors/dp/1118889800
  7. Big Data Brandwashing.
  8. The following text is excerpted from “Keeping Up with the Quants”: “What Are Analytics? By analytics, we mean the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and add value. Analytics can be classified as descriptive, predictive, or prescriptive according to their methods and purpose. Descriptive analytics involve gathering, organizing, tabulating, and depicting data and then describing the characteristics about what is being studied. This type of analytics was historically called reporting. It can be very useful, but doesn’t tell you anything about why the results happened or about what might happen in the future. Predictive analytics go beyond merely describing the characteristics of the data and the relationships among the variables (factors that can assume a range of different values); they use data from the past to predict the future. They first identify the associations among the variables and then predict the likelihood of a phenomenon—say, that a customer will respond to a particular product advertisement by purchasing it—on the basis of the identified relationships. Although the associations of variables are used for predictive purposes, we are not assuming any explicit cause-and-effect relationship in predictive analytics. Optimization, another prescriptive technique, attempts to identify the ideal level of a particular variable in its relationship to another. For example, we might be interested in identifying the price of a product that is most likely to lead to high profitability. Similarly, optimization approaches could identify the level of inventory in a warehouse that is most likely to avoid stock-outs (no product to sell) in a retail organization. Analytics can be classified as qualitative or quantitative according to the process employed and the type of data that are collected and analyzed. Qualitative analysis aims to gather an in-depth understanding of the underlying reasons and motivations for a phenomenon. Usually unstructured data is collected from a small number of nonrepresentative cases and analyzed nonstatistically. Qualitative analytics are often useful tools for exploratory research—the earliest stage of analytics. Quantitative analytics refers to the systematic empirical investigation of phenomena via statistical, mathematical, or computational techniques. Structured data is collected from a large number of representative cases and analyzed statistically.
There are various types of analytics that serve different purposes for researchers:
Statistics: The science of collection, organization, analysis, interpretation, and presentation of data
Forecasting: The estimation of some variable of interest at some specified future point in time as a function of past data
Data mining: The automatic or semiautomatic extraction of previously unknown, interesting patterns in large quantities of data through the use of computational algorithmic and statistical techniques
Text mining: The process of deriving patterns and trends from text in a manner similar to data mining
Optimization: The use of mathematical techniques to find optimal solutions with regard to some criteria while satisfying constraints
Experimental design: The use of test and control groups, with random assignment of subjects or cases to each group, to elicit the cause and effect relationships in a particular outcome
Although the list presents a range of analytics approaches in common use, it is unavoidable that considerable overlaps exist in the use of techniques across the types. For example, regression analysis, perhaps the most common technique in predictive analytics, is a popularly used technique in statistics, forecasting, and data mining. Also, time series analysis, a specific statistical technique for analyzing data that varies over time, is common to both statistics and forecasting.
a. Thomas Davenport and Jeanne G. Harris, Competing on Analytics (Boston: Harvard Business School Press, 2007).” Excerpt from: Davenport, Thomas H. “Keeping Up with the Quants: Your Guide to Understanding and Using Analytics.” iBooks.
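As a concrete illustration of the three purposes described above, here is a small, hypothetical Python sketch: summary statistics for descriptive analytics, a fitted trend for predictive analytics, and a toy price optimization for prescriptive analytics. The sales numbers and the demand/profit model are invented for illustration only.

```python
# Descriptive vs. predictive vs. prescriptive analytics on a toy monthly-sales series.
import numpy as np

sales = np.array([120, 135, 150, 160, 172, 181, 195, 210])  # hypothetical units sold per month

# Descriptive analytics: characterize the data we already have.
print("mean:", sales.mean(), "std:", round(sales.std(), 1), "growth:", sales[-1] - sales[0])

# Predictive analytics: fit a linear trend and forecast month 9.
months = np.arange(1, len(sales) + 1)
slope, intercept = np.polyfit(months, sales, 1)
print("forecast for month 9:", round(slope * 9 + intercept, 1))

# Prescriptive analytics (optimization): choose the price that maximizes profit
# under an assumed linear demand curve demand(p) = 400 - 20 * p and unit cost 4.
prices = np.linspace(4, 20, 161)
profit = (prices - 4) * np.maximum(400 - 20 * prices, 0)
print("price that maximizes modeled profit:", prices[np.argmax(profit)])
```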
  9. Big Data Brandwashing.
  10. Big Data Brandwashing.
  11. http://static.squarespace.com/static/538cea80e4b00f1fad490c1b/54668a77e4b00fb778d22a34/54668d85e4b00fb778d281f9/1367513685000/NLP-of-Big-Data-using-NLTK-and-Hadoop7.png?format=original
Credit: http://devblogs.nvidia.com/parallelforall/wp-content/uploads/sites/3/2014/09/nn_example-624x218.png
Deep learning is:
1. a collection of statistical machine learning techniques
2. used to learn feature hierarchies
3. often based on artificial neural networks
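To make the three points above concrete, here is a minimal sketch of an artificial neural network trained with plain gradient descent (NumPy only, on the XOR toy problem). It is only meant to show a hidden layer learning intermediate features; it is not a realistic deep learning setup.

```python
# Tiny two-layer neural network trained on XOR with full-batch gradient descent.
# The hidden layer learns intermediate "features" that make the non-linearly
# separable XOR problem solvable by the output unit.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights: 2 inputs -> 8 hidden units -> 1 output.
W1 = rng.normal(scale=1.0, size=(2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(scale=1.0, size=(8, 1)); b2 = np.zeros((1, 1))

lr = 2.0
for step in range(20000):
    h = sigmoid(X @ W1 + b1)        # hidden features (forward pass)
    out = sigmoid(h @ W2 + b2)      # prediction
    d_out = (out - y) * out * (1 - out)           # gradient at the output
    d_h = (d_out @ W2.T) * h * (1 - h)            # gradient at the hidden layer
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0, keepdims=True)

# Should approach [[0], [1], [1], [0]]; if not, increase steps or reseed,
# since a random initialization can occasionally land in a poor local minimum.
print(np.round(out, 2))
```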
  12. Sources of Big Crisis Data As depicted in a detailed crisis analytics taxonomy shown in Figure 1, there are six important sources of big crisis data. 1) Data exhaust refers to the digital trail that we etch behind as we go about performing our everyday online activities with digital devices. The most important example of data exhaust for big crisis data analytics is the mobile “call detail records” (CDRs), which are generated by mobile telecom companies to capture various details related to any call made over their network. Data exhaust also includes transaction data (e.g., banking records and credit card history) and usage data (e.g., access logs). Most of the data exhaust is owned by private organizations (such as mobile service operators) where it used mostly in-house for troubleshooting; data exhaust is seldom shared publicly due to legal and privacy concerns. 2)  Online activity encompasses all types of user generated data on the Internet (e.g., emails, SMS, blogs, comments); search activity using a search engine (such as Google search queries); and activities on social networks (such as Facebook comments, Google+ posts, and Twitter tweets). It has been shown in literature that online activities on different platforms can provide unique insights to crisis development: as an example, the short message services Twitter and SMS are used differently in crisis situations—SMS is used mostly on the ground by the affected community, while Twitter is used mostly by the international aid community [4]. The advantage of online data is that it is often publicly available, and thus it is heavily used by academics in big crisis data research. 3)  Sensing technologies use various cyber-physical sensing systems—such as ground, aerial, and marine vehicles; mobile phones; wireless sensor nodes—to actively gather information about environmental conditions. There are a number of sensing technologies such as (1) remote sensing (in which a satellite or high-flying aircraft scans the earth in order to obtain information about it); (2) networked sensing (in which sensors can perform sensing and can communicate with each other—as in wireless sensor networks); and (3) participatory sensing (in which everyday entities—such as mobile phones, buses, etc.—are fit with sensors). With the emergence of the Internet of Things (IoT) architecture, it is anticipated that sensor data will become one of the biggest sources of big crisis data. Sensing data is usually (but not always) publicly available. 4)  Small data and MyData: With big data, the scope of sampling and analysis can be vastly dissimilar (e.g., the unit of sampling is at the individual level, while the unit of analysis is at the country level), but with “small data”, the unit of analysis is similarly scoped as the unit of sampling. When the unit of sampling and analysis is a single person, we call such personal-data-based analysis “MyData”. There is emerging interest in using small data and MyData for personalized solutions, focused on applications like health (e.g., Cornell’s mhealth project led by Deborah Estrin) and sustainable development (e.g., the Small Data lab at the United Nations University). Today individuals rarely own, or even have access to, all of their personal data; but this has started to change (e.g., some hospitals now make individual medical records data accessible to patients). 
5) A lot of public-related data—that can be very valuable in the case of a crisis—is already being collected by various public, governmental, or municipal offices. This includes census data, birth and death certificates, and other types of personal and socio-economic data. Typically, such data has been painstakingly collected using paper-based traditional survey methods. In recent times, advances in digital technology have led people to develop mobile-phone-based data-collection tools that can easily collect, aggregate, and analyze data. Various open-source tools such as the Open Data Kit (ODK) make it trivial for such data to be collected. While public-related data is not always publicly accessible, increasingly governments are adopting the Open Data trend to open up public-related data. 6) Finally, the method of crowdsourcing is an active data collection method in which applications actively involve a wide user base to solicit their knowledge about particular topics or events. Crowdsourcing combines a) digital technology, b) human skills, and c) human generosity and utilizes the cognitive surplus of digital human samaritans—the volunteer open-source coders; the citizens who provide data, or help complete a task—to create a volunteer workforce that can be put to work on large global projects. Crowdsourced data is usually publicly available and is widely used by big crisis data practitioners.
  13. The following is an excerpt from the Crisis Analytics paper.
Mobile Phones
The rapid adoption of mobile technology has been unprecedented. Smartphones are rapidly becoming the central computer and communication devices in the lives of people around the world. Modern phones are not restricted to only making and receiving calls—current off-the-shelf smartphones can be used to detect, among other things, physical activity (via accelerometers); speech and auditory context (via microphones); location (via GPS) and co-location with others (via Bluetooth and GPS). This transforms the modern crisis response since modern smartphones can now act as general-purpose sensors and individuals can directly engage in the disaster response activities through cloud-, crowd-, and SMS-based technologies. This participatory trend in which the aid efforts are centered on and driven by people—and the fact that aid workers have to work with large amounts of diverse data—makes modern disaster response totally different from traditional approaches. Mobile phone technology is ubiquitously deployed, both in developed countries as well as in underdeveloped countries. CDR-based mobile analytics presents a great opportunity to obtain insights (at a very low cost) about mobility patterns, traffic information, and sociological networks—information that can be profitably utilized during various stages of disaster response (e.g., in epidemic control, and in tracking population dynamics). CDRs have been used by digital humanitarians during various crises (such as the non-profit FlowMinder’s work with anonymous mobile operator data during the Haiti earthquake to follow the massive population displacements) to not only point out the current locations of populations, but also predict their future trajectory [5].
From the WDR 2016 Report: More households in developing countries own a mobile phone than have access to electricity or improved sanitation (figure O.4, panel a). Mobile phones, reaching almost four-fifths of the world’s people, provide the main form of internet access in developing countries. But even then, nearly 2 billion people do not own a mobile phone, and nearly 60 percent of the world’s population has no access to the internet. On average, 8 in 10 individuals in the developing world own a mobile phone, and the number is steadily rising. Even among the bottom fifth of the population, nearly 70 percent own a mobile phone. The lowest mobile penetration is in Sub-Saharan Africa (73 percent), against 98 percent in high-income countries. But internet adoption lags behind considerably: only 31 percent of the population in developing countries had access in 2014, against 80 percent in high-income countries. China has the largest number of internet users, followed by the United States, with India, Japan, and Brazil filling out the top five. The world viewed from the perspective of the number of internet users looks more equal than when scaled by income (map O.1)—reflecting the internet’s rapid globalization. Digital finance has promoted financial inclusion, providing access to financial services to many of the 80 percent of poor adults estimated to be excluded from the regulated financial sector. It has boosted efficiency, as the cost of financial transactions has dropped and speed and convenience have increased. And it has led to major innovations in the financial sector, many of which have emerged in developing countries (box S2.1). The benefits pervade almost all areas discussed in this Report.
Digital finance makes businesses more productive, allows individuals to take advantage of opportunities in the digital world, and helps streamline public sector service delivery. Like all great opportunities, digital finance also comes with risks. What makes online financial systems easy to use for customers also makes them susceptible to cybercrime. The entry of nontraditional players poses new challenges for policy, regulation, and supervision. And the ease of transferring funds across the globe—often anonymously, using means such as cryptocurrencies—might increase illicit financial flows.
  14. https://projects.vrac.iastate.edu/REU2011/wp-content/uploads/2011/05/Harnessing-the-Crowdsourcins-Power-of-Social-Media-for-Disaster-Relief.pdf Figure 2 illustrates the food requests on Ushahidi-Haiti, and Figure 3 shows the most affected locations during the Japanese tsunami based on the number of reports mapping on Ushahidi’s crisis map. Using these maps, relief organizations can coordinate resource distribution and make better decisions based on their analysis of crowdsourced data. Fallback plans can be further developed for the top events or to cover the majority of events. Third, providers can include geo-tag information for messages sent from some platforms (such as Twitter) and devices (including handheld smart phones). Such crowdsourced data can help relief organizations accurately locate specific requests for help. Furthermore, visualizing this type of data on a crisis map offers a common disaster view and helps organizations intuitively ascertain the current status.
  15. Wisdom of the Crowds: http://www.amazon.com/Wisdom-Crowds-James-Surowiecki/dp/0385721706/ The following is excerpted from https://en.wikipedia.org/wiki/The_Wisdom_of_Crowds
The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations, published in 2004, is a book written by James Surowiecki about the aggregation of information in groups, resulting in decisions that, he argues, are often better than could have been made by any single member of the group. The book presents numerous case studies and anecdotes to illustrate its argument, and touches on several fields, primarily economics and psychology. The book relates to diverse collections of independently deciding individuals, rather than crowd psychology as traditionally understood. Its central thesis, that a diverse collection of independently deciding individuals is likely to make certain types of decisions and predictions better than individuals or even experts, draws many parallels with statistical sampling; however, there is little overt discussion of statistics in the book.
Failures of crowd intelligence: Surowiecki studies situations (such as rational bubbles) in which the crowd produces very bad judgment, and argues that in these types of situations their cognition or cooperation failed because (in one way or another) the members of the crowd were too conscious of the opinions of others and began to emulate each other and conform rather than think differently. Although he gives experimental details of crowds collectively swayed by a persuasive speaker, he says that the main reason that groups of people intellectually conform is that the system for making decisions has a systematic flaw. Surowiecki asserts that what happens when the decision making environment is not set up to accept the crowd, is that the benefits of individual judgments and private information are lost and that the crowd can only do as well as its smartest member, rather than perform better (as he shows is otherwise possible).[4] The book gives detailed case histories of such failures.
Applications: Surowiecki is a very strong advocate of the benefits of decision markets and regrets the failure of DARPA's controversial Policy Analysis Market to get off the ground. He points to the success of public and internal corporate markets as evidence that a collection of people with varying points of view but the same motivation (to make a good guess) can produce an accurate aggregate prediction. According to Surowiecki, the aggregate predictions have been shown to be more reliable than the output of any think tank. He advocates extensions of the existing futures markets even into areas such as terrorist activity and prediction markets within companies. To illustrate this thesis, he says that his publisher is able to publish a more compelling output by relying on individual authors under one-off contracts bringing book ideas to them. In this way they are able to tap into the wisdom of a much larger crowd than would be possible with an in-house writing team. Will Hutton has argued that Surowiecki's analysis applies to value judgments as well as factual issues, with crowd decisions that "emerge of our own aggregated free will [being] astonishingly... decent".
He concludes that "There's no better case for pluralism, diversity and democracy, along with a genuinely independent press."[8] Applications of the wisdom-of-crowds effect exist in three general categories: prediction markets, Delphi methods, and extensions of the traditional opinion poll.
The following is excerpted from the Crisis Analytics paper:
Leveraging the Wisdom and the Generosity of the Crowd
Broadly speaking, there are only a few ways we can go about problem solving or predicting something: (1) experts, (2) crowds, and (3) machines (working on algorithms; or learning from data). While experts possess valuable experiences and insights, they may also suffer from biases. The benefit of crowds accrues from their diversity: it is typically the case that due to a phenomenon known as “the wisdom of the crowds” [10], the collective opinion of a group of diverse individuals is better than, or at least as good as, the opinion of experts. Crowds can be useful in disaster response in at least two different ways: firstly, crowdsourcing, in which disaster data is gathered from a broad set of users and locations [6]; and secondly, crowdcomputing, in which crowds help process and analyze the data through collaboratively solving “microtasks” [1].
1) Crowdsourcing: Crowdsourcing is the outsourcing of a job traditionally performed by a designated agent (usually an employee) to an undefined, generally large, group of people in the form of an open call. In essence, crowdsourcing is the application of the open-source principles (used to develop products such as Linux, Wikipedia, etc.) to fields outside of software. Crowdsourcing has been used in the context of disaster response in multiple ways [11]: including crowdsearching, microtasking, citizen science, rapid translation, data cleaning and verification, developing ML classifiers, and election monitoring [6].
2) Crowdcomputing: Crowdcomputing is a technique that utilizes crowds for solving complex problems. A notable early use of crowdcomputing was the use of crowdsearching by MIT’s team at the 2009 DARPA Network Challenge. The MIT team solved a time-critical problem in the least time using an integration of social networking, the Internet, and some clever incentives to foment crowd collaboration. In contemporary times, a number of “microtasking” platforms have emerged as smarter ways of crowdsearching. A familiar example is “Amazon Mechanical Turk”, the commercial microtasking platform that allows users to submit tasks of a large job (that is too large for a single person or small team to perform) for distribution to a global crowd of volunteers (who are remunerated in return for performing these microtasks). A number of free and open-source microtasking platforms have also been developed, including generic microtasking platforms such as CrowdCrafting—which was used by Digital Humanitarian Network (DHN) volunteers in response to Typhoon Pablo in the Philippines—as well as humanitarian-response-focused platforms such as MicroMappers. MicroMappers, developed at the Qatar Computing Research Institute (QCRI), was conceived as a fully customized microtasking platform for humanitarian response—a platform that would be on standby and available within minutes of the DHN being activated. MicroMappers can facilitate the microtasking of translation and classification of online user-generated multimedia content (in various formats such as text, images, videos) through tagging.
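A quick simulation makes the wisdom-of-the-crowds argument tangible: when many people make independent, noisy guesses of the same quantity, the average of the crowd is usually far closer to the truth than a typical individual. The numbers below are invented purely for illustration.

```python
# Simulate 500 independent, noisy guesses of a quantity and compare the
# crowd's average against typical individual performance.
import numpy as np

rng = np.random.default_rng(42)
true_value = 1000.0                                   # e.g., jelly beans in a jar
guesses = true_value + rng.normal(0.0, 250.0, size=500)

individual_errors = np.abs(guesses - true_value)
crowd_error = abs(guesses.mean() - true_value)

print(f"median individual error      : {np.median(individual_errors):.1f}")
print(f"error of the crowd mean guess: {crowd_error:.1f}")
print(f"the crowd mean beats {100 * np.mean(individual_errors > crowd_error):.0f}% of individuals")
```

The effect relies on the independence and diversity that Surowiecki stresses; if the guesses were correlated (everyone copying the same loud voice), averaging would not cancel the errors.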
  16. Principles:
1. Do not harm
2. Use data to help create peaceful coexistence
3. Use data to help vulnerable people and people in need
4. Use data to preserve and improve the natural environment
5. Use data to help create a fair world without discrimination
  17. http://www.amazon.com/The-Big-Data-Driven-Business-Competitors/dp/1118889800
  18. From our Crisis Analytics paper: To be sure, the use of mapping technology to combat crisis is not new. A classic example of Crisis Mapping Analytics is John Snow’s Cholera Map. Snow studied the severe outbreak of cholera in 1854 near the Broad Street in London, England. Contrary to the prevailing mindset (that believed that cholera was spread through polluted air or “miasma”), Snow showed—through the ingenious use of spatial analytics (comprising mapping and detailed statistical analysis)—that the cholera cases were all clustered around the pump in a particular street, and that contaminated water, not air, spread cholera. We can also turn to Florence Nightingale—a nurse assigned to the old Barrack Hospital in Scutari during the Crimean War in 1850s—for another vintage example of data-based crisis analytics. Nightingale combined data analytics and striking visualizations to highlight the importance of proper healthcare and hygiene for checking the spread of disease amongst soldiers. In the work of Snow and Nightingale, we already see the vestiges of crisis analytics: they were already testing for spatial effects (such as autocorrelation; clustering/ dispersion) and testing hypotheses (about proposed correlations and relationships).
  19. The Internet, Open Source, and Open Data In a wide variety of fields, the new Internet-based economy is spawning a paradigm shift in how institutions work. The “open source” culture (the paradigm underlying Internet projects such as Linux and Wikipedia) has ushered in a new era that relies more on collaboration and volunteerism (than on formal organizations). Instead of a rigid structure constrained by scarcity of human resources, the new paradigm is driven by abundance and cognitive surplus (due to technology-driven pooling of volunteer human resources). This open source trend is now also visible in humanitarian development in various manifestations such as digital humanitarianism, user generated information, participatory mapping, volunteered geographic information, open-source software (such as OpenStreetMap) and open data. Many V&TCs have exploited these open standards to link data from disparate sources and create mashups (which are defined as a web page/application that uses or combines data or functionality from multiple existing sources to create new services). Another important trend emerging in the modern era is the “Open Data” that has resulted in unprecedented commoditization and opening up of data. A number of countries from all over the world (more than 40 countries in 2015) have established open data initiatives to open numerous kinds of datasets to the public for greater transparency. Open data can also lead to improved governance through the involvement of the public and better collaboration between public and private organizations. As an example, in the aftermath of the Haiti crisis, volunteers across the world cobbled together data from various sources—including data from satellite maps and mobile companies along with information about health facilities from the maps of the World Health Organization, and police facilities from the Pacific Disaster Center—and plotted them on open-source platforms such as OpenStreetMap. The importance of OpenStreetMap can be gauged from the fact that soon after the earthquake, OpenStreetMap had become the de facto source of Haiti map data for most of the United Nations (UN) agencies.
SaTScan™ is free software that analyzes spatial, temporal and space-time data using the spatial, temporal, or space-time scan statistics. It is designed for any of the following interrelated purposes:
• Perform geographical surveillance of disease, to detect spatial or space-time disease clusters, and to see if they are statistically significant.
• Test whether a disease is randomly distributed over space, over time or over space and time.
• Evaluate the statistical significance of disease cluster alarms.
• Perform repeated time-periodic disease surveillance for early detection of disease outbreaks.
The software may also be used for similar problems in other fields such as archaeology, astronomy, botany, criminology, ecology, economics, engineering, forestry, genetics, geography, geology, history, neurology or zoology.
Data Types and Methods
SaTScan uses either a Poisson-based model, where the number of events in a geographical area is Poisson-distributed, according to a known underlying population at risk; a Bernoulli model, with 0/1 event data such as cases and controls; a space-time permutation model, using only case data; an ordinal model, for ordered categorical data; an exponential model for survival time data with or without censored variables; or a normal model for other types of continuous data.
The data may be either aggregated at the census tract, zip code, county or other geographical level, or there may be unique coordinates for each observation. SaTScan adjusts for the underlying spatial inhomogeneity of a background population. It can also adjust for any number of categorical covariates provided by the user, as well as for temporal trends, known space-time clusters and missing data. It is possible to scan multiple data sets simultaneously to look for clusters that occur in one or more of them.
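The scan-statistic idea behind SaTScan can be sketched in a few lines: slide a window over the study region and score each window by a Poisson log-likelihood ratio comparing the observed case count with the count expected from the underlying population. The code below is a simplified illustration of that idea on synthetic data, not a reimplementation of SaTScan, which also runs Monte Carlo replications to assess statistical significance.

```python
# Simplified Poisson spatial scan over synthetic sites with one planted hot spot.
import numpy as np

rng = np.random.default_rng(1)
n_sites = 200
coords = rng.uniform(0, 10, size=(n_sites, 2))        # site coordinates
population = rng.integers(200, 2000, size=n_sites)    # population at risk per site
rate = np.full(n_sites, 0.01)                          # baseline disease rate
rate[np.linalg.norm(coords - [2.0, 2.0], axis=1) < 1.0] = 0.05   # planted cluster
cases = rng.poisson(population * rate)                 # observed case counts

C, P = cases.sum(), population.sum()

def poisson_llr(c, e):
    # Kulldorff-style log-likelihood ratio for a window with c observed and
    # e expected cases; zero when the window shows no excess.
    if e <= 0 or c <= e or C - c <= 0:
        return 0.0
    return c * np.log(c / e) + (C - c) * np.log((C - c) / (C - e))

best_llr, best_zone = 0.0, None
for i in range(n_sites):
    # Candidate windows: the k nearest sites around site i, for growing k.
    order = np.argsort(np.linalg.norm(coords - coords[i], axis=1))
    cum_cases = np.cumsum(cases[order])
    cum_pop = np.cumsum(population[order])
    for k in range(1, n_sites // 2):
        llr = poisson_llr(cum_cases[k - 1], C * cum_pop[k - 1] / P)
        if llr > best_llr:
            best_llr, best_zone = llr, order[:k]

print(f"most likely cluster: LLR={best_llr:.1f}, "
      f"centred near {np.round(coords[best_zone[0]], 2)}, {len(best_zone)} sites")
```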
  20. As the United Nations launches a 17-point agenda for helping the world's poor, 267 economists from 44 countries on Friday published a declaration advocating one particular way: Make people healthier. "In terms of how much better you can make the world per dollar you spend, it’s very difficult to beat a set of strategic investments in health care," Harvard University economist Larry Summers, who organized the manifesto, said in an interview. "Ours is the unique generation that has the prospect of convergence across the world in health," making the poor as healthy as the rich, Summers said. What the economists are advocating is already included in the United Nations' new Sustainable Development Goals for the next 15 years. To be specific, it's part 8 of goal 3: "Achieve universal health coverage, including financial risk protection, access to quality essential health-care services and access to safe, effective, quality and affordable essential medicines and vaccines for all." The economists clearly feel that health deserves to be called out for special consideration. In their declaration, they "call on global policymakers to prioritize a pro-poor pathway to universal health coverage as an essential pillar of development.” Giving high priority to health implicitly means giving a lower priority to some of the other items on the UN agenda. Summers said he doesn't want to discuss which ones. "I’m not going to get into that game." (Summers, a former Treasury secretary, National Economic Council chief, and Harvard University president, knows a politically radioactive question when he hears it.) Friday's declaration grew out of the Global Health 2035 report by the Lancet Commission on Investing in Health. That report found that each dollar invested in health in poor countries can have a payback of $9 or more. World Bank Chief Economist Kaushik Basu, another signer of the declaration, said that having grown up in India made him especially aware of the importance of good health. "In India if you are poor, one health episode spins you into a trap you’ll never get out of," he said.
  21. https://www.youtube.com/watch?v=-L-WFukOARU Steven Keating also has a TEDx tallk on this: https://www.youtube.com/watch?v=U5SafKJgqPM http://news.mit.edu/2015/student-profile-steven-keating-0401 In 2007, Steven Keating had his brain scanned out of sheer curiosity. Keating had joined a research study that included an MRI scan, and he asked that the scan’s raw data be returned to him. The scan revealed only a slight abnormality, near his brain’s smell center, which he was advised to have re-evaluated in a few years. A second scan, in 2010, showed no change, suggesting that the abnormality was most likely benign. While the second scan provided reassurance, Keating’s knowledge of the abnormality — as a result of having access to the raw data from these scans — ultimately led to the detection of a baseball-sized tumor that was removed this past August. Now a graduate student in the Department of Mechanical Engineering and based at the MIT Media Lab, Keating says that his curiosity saved his life — and that his experience with cancer has fueled a strong interest in advocating for open health data. Discovering a baseball-sized brain tumor Keating arrived at MIT in fall 2010 as the first student to join the Media Lab’s Mediated Matter Group. Under his advisors — Neri Oxman, the group’s director and the Sony Corporation Career Development Associate Professor of Media Arts and Sciences, and David Wallace, a professor of mechanical engineering and engineering systems — Keating studies digital construction and biologically inspired design. He is pursuing a PhD in mechanical engineering with a minor in synthetic biology. Last July, Keating noticed that he was experiencing a phantom vinegar smell for about 30 seconds every day. Knowing that his 2007 and 2010 research scans showed an abnormality near his smell center, he requested an MRI scan through MIT Medical. The scan revealed that the abnormality had grown into a tumor that needed to be removed as soon as possible. Keating went to Brigham and Women’s Hospital (BWH) in Boston on Aug. 19 for surgery, accompanied and supported by his family and his girlfriend; Oxman; and Yoel Fink, a professor of materials science and director of MIT’s Research Laboratory of Electronics. The surgery was performed by neurosurgeon E. Antonio Chiocca, and Keating, though sedated, was kept awake while the tumor was removed. This was so doctors could ask him questions while they were probing and cutting brain tissue to ensure they were not damaging the brain’s language center. The 10-hour surgery was captured on video, which, at Keating’s request, was shared with him. His recovery was quick: Keating was out of the hospital after two days, and he was back on the MIT campus within a week. A tissue biopsy confirmed that his tumor was an IDH1-mutant malignant astrocytoma. In this type of brain cancer, which was only first identified by researchers in 2009, the mutated IDH enzyme leads to the production of 2HG, a novel, oncogenic metabolite. Through the Bridge Project — a collaboration between MIT’s Koch Institute for Integrative Cancer Research and the Dana-Farber/Harvard Cancer Center — a cross-institutional research team is exploring how to use 2HG as a biomarker to detect and monitor IDH-mutant cancers. Ovidiu Andronesi, a radiologist at Massachusetts General Hospital (MGH) and a collaborator on this research, applied this monitoring technology via MRI spectroscopy imaging to scan Keating’s brain before and after his surgery. 
These scans show the reduction of 2HG after doctors removed the tumor; the scans were also shared with Keating, at his request. “As a cancer scientist, hearing Steven talk about 2HG spectroscopy screening as part of his clinical care is remarkable,” says Matthew Vander Heiden, the Eisen and Chang Career Development Associate Professor of Biology and a member of the Koch Institute, who is a leader on this research project. “IDH’s role in these cancers was only discovered six years ago, and it is incredible, as well as humbling, that Steven could benefit from some of the basic science done in this short time period since IDH mutations were recognized.”   Diving deeper into the data Since the surgery, Keating’s curiosity has only become more acute. This has been fueled, in large part, by his close connection with his doctors and the data they were able to provide. “Because of that connection, I had new options,” he says. “I asked for the surgery to videotaped, for my genome to be sequenced, and for the raw data from scans.” With this abundance of data, Keating is able to apply his own research interests to develop an intimate understanding of his brain and his tumor. In Oxman’s Mediated Matter Group, Keating’s research explores how to leverage 3-D printing and other fabrication methods to print everything from living organisms to entire buildings. With the resources available to him at the Media Lab, he and colleagues James Weaver and Ahmed Hosny at Harvard University’s Wyss Institute for Biologically Inspired Engineering have pored over his health data and created digital and 3-D-printed models of his tumor, brain, and surgically repaired skull. To share his experiences as a patient-scientist, Keating gave a talk at the Koch Institute on Oct. 22 as part of a public event on IDH-mutant cancers. He returned on Nov. 21 to share his story with the Koch Institute’s cancer researchers. MIT graduate student Steven Keating shares how innate curiosity helped him discover a baseball-sized tumor in his brain. Video: Koch Institute “Steven’s story is so inspiring in part because he is approaching his own cancer as a scientific problem, and he is actively seeking the data he needs to solve that problem,” says Tyler Jacks, director of the Koch Institute and the David H. Koch Professor in MIT’s Department of Biology. “After hearing his story, I think all of us were motivated to get back into the lab.” “Steven’s insatiable curiosity is what science is all about,” adds Nancy Hopkins, a professor emerita of biology, and member of the Koch Institute, who attended both talks. “He addresses even his own cancer as if it were the latest fascinating experiment and as an opportunity to advance knowledge and help others.” Advocating for opening health data Given his up-close-and-personal experience with his health, Keating says he is now a strong believer in open sourcing and allowing patients to have easy access to their own health data. He says he was fortunate that his doctors were willing to share his data, but he did notice many small barriers along the way. “My doctors are incredible for sharing my data and encouraging me to learn more from it,” Keating says. “However, the process raised some questions for me, as I received my data on 30 CDs, without easy tools to understand, learn, or share, and there was no genetic data included. Why CDs? Why limited access for patients to their own data? Can we have a simple, standardized share button at the hospital? 
Where is the Google Maps, Facebook, or Dropbox for health? It needs to be simple, understandable, and easy, as small barriers add up quickly.” Keating says this cause has personal importance because having access to his health data not only led him to discover his tumor in the first place, but it also helped find the doctors and medical care he needed. “Imagine having your whole medical record that you could not only share with doctors and scientists but also with friends and family, too,” he says. “Patients could get second opinions very easily, and doctors can follow what leaders in the field are doing.” He says there are also huge mutual benefits when patients decide to share their health data with researchers, because it provides them with an actual case to study. The same is true when data is shared within patient communities, as those with precisely similar conditions are able to connect with one another. Critics of open-source health data largely point to privacy considerations. This is especially true with regard to patients’ genetic data, which inherently reveals information about their family members. Many also worry about patients making medical decisions based on their own interpretation, against the advice of doctors. Furthermore, people say doctors might second-guess every one of their decisions to the point where the standard of care would decrease. While Keating recognizes and respects these concerns, he says that the landscape of health care is changing — mentioning the rise of wearable technologies that collect personal health data, such as smart watches, as an example. “I’m a strong believer in privacy, but if a patient wants to share, they should be able to,” he says. “Your personal being is your personal property, and you should have the right to share that data if you want to.” This is an area where Keating is leading by example. He has open-sourced his health data on his personal website, where his MRI scans and tumor model are available for download, and he has been meeting with government and hospital officials and leaders in the open-source health data field. He also has been exploring how links can be made between hospitals and open patient data repositories, such as Sage Bionetworks, the Personal Genome Project,Cancer Commons, and Patients Like Me. As a result of his advocacy for open-health data, the White House invited Keating to President Barack Obama’s unveiling of the Precision Medicine Initiative in January. Obama's proposal calls for increased federal investment in patient-powered research that accounts for individual differences in genes, environments, and lifestyles. One of the initiative’s primary objectives is accelerating design and testing of tailored cancer treatments through the National Cancer Institute. Having completed proton therapy at MGH with radiation oncologist Helen Shih, Keating is now undergoing chemotherapy at BWH. All the while, his spirits remain high. In an email he sent his friends and family before his surgery, Keating described life as a “wild ride.” However, as wild as it can be, he says that being an MIT student armed with data and a sense of curiosity can make all the difference. “The benefit of MIT is that we can know it’s a ride, but it’s a scary ride unless you have information to make it a curious problem,” he says. “And if it’s a curious problem, it becomes an exciting ride.”
  22. Currently, even in the advanced countries like the US and UK, not all office-based physicians have electronic medical records. There is an emerging trend towards electronic health records, which will then lead towards some sort of health information liquidity. Big data is a disruptive force in healthcare and thus is resisted by health industry in many scenarios (as documented by Eric Topol in his excellent book). ----- “It’s your blood, your DNA, and your money; shouldn’t the images, records, and data belong to you, too? Dr. Topol’s deeply researched, powerfully presented arguments will ruffle feathers in the medical establishment—but he maintains that the new era of smartphones, apps, and tiny sensors is putting the patient in charge for the first time. And he’s right.” —DAVID POGUE, FOUNDER OF YAHOO TECH AND HOST OF PBS’ “NOVA” Excerpt From: Eric Topol. “The Patient Will See You Now: The Future of Medicine is in Your Hands.” iBooks. The Avatar Will See You Now Medical centers are testing new, friendly ways to reduce the need for office visits by extending their reach into patients’ homes. The avatar, Molly, interviews them in Spanish or English about the levels of pain they feel as a video guides them through exercises, while the 3-D cameras of a Kinect device measure their movements. Because it’s a pilot project, Paul Carlisle, the director of rehabilitation services, looks on. But the ultimate goal is for the routine to be done from a patient’s home. 
“It would change our whole model,” says Carlisle, who is running the trial as the public hospital looks for creative ways to extend the reach of its overtaxed budget and staff. “We don’t want to replace therapists. But in some ways, it does replace the need to have them there all the time.”
The Robot Will See You Now
IBM's Watson—the same machine that beat Ken Jennings at Jeopardy—is now churning through case histories at Memorial Sloan-Kettering, learning to make diagnoses and treatment recommendations. This is one in a series of developments suggesting that technology may be about to disrupt health care in the same way it has disrupted so many other industries. Are doctors necessary? Just how far might the automation of medicine go?
The crowd will see you now
That people scour the pages of the world wide web searching for answers to medical problems is well known. Indeed, doctors label the most diligent seekers of online medical information “cyber-chondriacs”. Some frustrated individuals have even set up their own websites, replete with data about their conditions or those of family members, to encourage strangers to help solve “mum’s medical mystery”, or offer a cure for a particular brain cancer. The need for a “crowdsourced” service like this comes from the number of rare diseases around. The National Institutes of Health, America’s medical agency, recognises 7,000—defined as those that each affect fewer than 200,000 people. A general practitioner cannot possibly recognise all of these. Moreover, it may not be clear to him, even when he knows he cannot help, what sort of specialist the patient should be referred to. Research published in 2013, in the Journal of Rare Disorders, says about 8% of Americans—some 25m people—are affected by rare diseases, and that it takes an average of 7½ years to get a diagnosis. Even in Britain, with all the resources of the country’s National Health Service at a GP’s disposal, rare-disease diagnosis takes an average of 5½ years. Also, doctors often get it wrong. A survey of eight rare diseases in Europe found that around 40% of patients received an erroneous diagnosis at first. This is something that can lead to life-threatening complications. CrowdMed, though, brings numerous pairs of eyeballs, each with different knowledge behind them, to every problem. Patients submit their cases and may offer a reward of a few hundred dollars to lubricate the process. The volunteer diagnosticians are students, retired doctors, nurses and even laymen and women who enjoy pitting their wits against a good medical mystery. Besides the cash, successful volunteers also get the kudos of rising in the website’s ranking system—and that ranking system is, in turn, used to filter the feedback given to patients, to try to avoid mistakes.
  23. http://www.wsj.com/articles/big-data-cuts-buildings-energy-use-1411937794
  24. While at first glance it is difficult to assess the value of this rather rudimentary data, remarkably useful information on human behavior may be derived from large sets of de-identified CDRs. There are at least three dimensions that can be measured: (1) Mobility: as mobile phone users send and receive calls and messages through different cell towers, it is possible to “connect the dots” and reconstruct the movement patterns of a community. This information may be used to visualize daily rhythms of commuting to and from home, work, school, markets or clinics, but also has applications in modeling everything from the spread of disease to the movements of a disaster-affected population. (2) Social interactions: the geographic distribution of one’s social connections may be useful both for building demographic profiles of aggregated call traffic and for understanding changes in behavior. Studies have shown that men and women tend to use their phones differently, as do different age groups. Frequently making and receiving calls with contacts outside of one’s immediate community is correlated with higher socio-economic class. (3) Economic activity: mobile network operators use monthly airtime expenses to estimate the household income of anonymous subscribers in order to target appropriate services to them through advertising. When people in developing economies have more money to spend, they tend to spend a significant portion of it on topping off their mobile airtime credit. Monitoring airtime expenses for trends and sudden changes could prove useful for detecting the early impact of an economic crisis, as well as for measuring the impact of programmes designed to improve livelihoods.
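As a concrete illustration of the first dimension, here is a minimal sketch (hypothetical column names, toy data) of how a table of de-identified CDRs can be turned into aggregate tower-to-tower movement flows. It is not any operator's actual pipeline, just the "connect the dots" idea in code.

```python
# Minimal sketch: reconstruct coarse movement flows from toy de-identified CDRs.
# Each call/SMS event is logged against a cell tower, so ordering a user's events
# in time gives a rough trajectory; aggregating tower-to-tower transitions across
# users gives a community-level mobility picture.
import pandas as pd

# Toy CDR events: hashed subscriber id, event time, serving tower (all invented).
cdr = pd.DataFrame({
    "user_hash": ["a1", "a1", "a1", "b2", "b2", "b2"],
    "timestamp": pd.to_datetime([
        "2012-03-01 08:05", "2012-03-01 09:10", "2012-03-01 18:40",
        "2012-03-01 07:55", "2012-03-01 12:30", "2012-03-01 19:05",
    ]),
    "tower_id": ["T_home_A", "T_market", "T_home_A",
                 "T_home_B", "T_clinic", "T_home_B"],
})

cdr = cdr.sort_values(["user_hash", "timestamp"])
cdr["prev_tower"] = cdr.groupby("user_hash")["tower_id"].shift()

# Tower-to-tower transition counts approximate aggregate movement flows.
flows = (cdr.dropna(subset=["prev_tower"])
            .groupby(["prev_tower", "tower_id"])
            .size()
            .sort_values(ascending=False))
print(flows)
```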
  25. http://www.wsj.com/articles/big-data-cuts-buildings-energy-use-1411937794
  26. The following text is excerpted from: http://en.wikipedia.org/wiki/Empirical The word empirical denotes information gained by means of observation or experiments. Empirical data is data produced by an experiment or observation. A central concept in modern science and the scientific method is that all evidence must be empirical, or empirically based, that is, dependent on evidence or consequences that are observable by the senses. It is usually differentiated from the philosophic usage of empiricism by the use of the adjective empirical or the adverb empirically. The term refers to the use of working hypotheses that are testable using observation or experiment. In this sense of the word, scientific statements are subject to, and derived from, our experiences or observations.
  27. The scale of Punjab: 120 million people; if it were a country, it would be the tenth most populous country in the world. The scale of the crowdsourced picture collection: 4.6 million pictures taken to date with $100 smartphones by 3,000 to 4,000 field workers.
  28. The mobile applications used by the government of Punjab are powered by DataPlug, a platform developed at ITU over time. DataPlug, which is based on the University of Washington's Open Data Kit, makes it trivial to create new customized mobile applications through a drag-and-drop GUI. Using it, non-technical persons can create mobile applications in five minutes without writing a line of code. An application can be changed even after it has been deployed by simply making the changes on the DataPlug website, and the update is then pushed to all users. These applications support various field types, including text boxes, picture capture, location recording, and so on.
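DataPlug's internal form format is not documented in these notes, so the following is an illustrative sketch only: a hypothetical, ODK-style form definition using the field types mentioned above (text box, picture, location), plus a tiny check that a submitted record matches the form.

```python
# Illustration only: a hypothetical form specification (not DataPlug's real format)
# with the field types mentioned above, and a small validator for submissions.
FORM = {
    "name": "larvae_inspection",
    "fields": [
        {"id": "site_name",  "type": "textbox",  "required": True},
        {"id": "site_photo", "type": "picture",  "required": True},
        {"id": "site_gps",   "type": "location", "required": True},
        {"id": "remarks",    "type": "textbox",  "required": False},
    ],
}

def validate(record: dict, form: dict = FORM) -> list[str]:
    """Return a list of problems with a submitted record (empty if it is OK)."""
    problems = []
    for field in form["fields"]:
        value = record.get(field["id"])
        if field["required"] and value in (None, ""):
            problems.append(f"missing required field: {field['id']}")
    return problems

print(validate({"site_name": "Shadman block C", "site_gps": (31.54, 74.33)}))
# -> ['missing required field: site_photo']
```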
  29. http://www.un.org/millenniumgoals/reports.shtml
  30. http://www.un.org/millenniumgoals/reports.shtml
  31. (Same note as slide 28: the applications are powered by the DataPlug platform described above.)
  32. Challenges and considerations: incentives to open and share big data; the need to develop cultural and institutional capabilities. But even easily accessible data such as Facebook “Likes” can predict sensitive characteristics including “sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender.” Data collectors often sell the data to others. One data broker assembled an average of 1,500 pieces of information about more than half a billion consumers worldwide from information people provided voluntarily on various websites. Statistics: for every person connected to high-speed broadband, five are not. Worldwide, some 4 billion people do not have any internet access, nearly 2 billion do not use a mobile phone, and almost half a billion live outside areas with a mobile signal. More households in developing countries own a mobile phone than have access to electricity or improved sanitation (figure O.4, panel a). The massive data volumes collected by internet platforms have created a whole new branch of economics—nano-economics—which studies individual, computer-mediated transactions. The main benefit to the user is that services can be tailored to individual needs and preferences, although at the cost of giving up privacy. For the seller, it allows more targeted advertising and even price discrimination, when automated systems can analyze user behavior to determine willingness to pay and offer different prices to different users. (Kosinski, Stillwell, and Graepel 2013, via WDR 2016.) And smartphone sensors can infer a user’s “mood, stress levels, personality type, bipolar disorder, demographics (e.g., gender, marital status, job status, age), smoking habits, overall wellbeing, progression of Parkinson’s disease, sleep patterns, happiness, levels of exercise, and types of physical activity or movement.” (See Peppet 2014 for individual references.)
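To make the "Likes predict sensitive traits" point concrete, here is a toy sketch in the spirit of Kosinski, Stillwell, and Graepel (2013), using entirely synthetic data and standard scikit-learn components (SVD plus logistic regression); it is not their actual model or dataset.

```python
# Toy sketch (synthetic data) of predicting a sensitive attribute from a sparse
# user-by-"Like" matrix: reduce the matrix with SVD, then fit a simple classifier.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_users, n_likes = 500, 40
likes = rng.integers(0, 2, size=(n_users, n_likes))   # 1 = user "Liked" page j
# Synthetic sensitive attribute, loosely driven by a handful of pages.
logit = likes[:, :5].sum(axis=1) - 2.5 + rng.normal(0, 1, n_users)
trait = (logit > 0).astype(int)

X = TruncatedSVD(n_components=10, random_state=0).fit_transform(likes)
X_tr, X_te, y_tr, y_te = train_test_split(X, trait, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("held-out accuracy:", round(clf.score(X_te, y_te), 2))
```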
  33. Main vectors and diseases they transmit Vectors are living organisms that can transmit infectious diseases between humans or from animals to humans. Many of these vectors are bloodsucking insects, which ingest disease-producing microorganisms during a blood meal from an infected host (human or animal) and later inject it into a new host during their subsequent blood meal. Mosquitoes are the best known disease vector. Others include ticks, flies, sandflies, fleas, triatomine bugs and some freshwater aquatic snails.
  34. The lesson that was learned: you are best served by predicting such an epidemic; once it has happened, you cannot really contain it. The dengue project was implemented at the scale of Punjab, a population of 120 million people (for reference, the population of the UK is only 64 million). 2011: 20,000 infections in Punjab, 17,000 in Lahore alone (an epidemic by any scale); more than 250 died. 2012: 234 infections and no deaths. Project highlights: data capture on the move; verification via GPS coordinates; real-time data entry into a central server; consolidated online dashboards accessible by all stakeholders involved; spatio-temporal analysis (SaTScan) to identify intersecting areas between dengue-larvae breeding hotspots and reported dengue patients; a built-in early disease detection/warning system with geographical illustrations. Current standing: the system has been successfully operational Punjab-wide, spanning 36 districts and more than 25 departments; 4,000 Android phone users; more than 5,021,373 anti-dengue surveillance activities have been submitted via Android mobiles. -------------------- In 2011, Lahore, Pakistan, was hit by the worst outbreak of dengue fever in its history. The outbreak infected 16,000 people and took more than 350 lives. The Punjab IT Board (PITB) mobilized its response using big data, averting any deaths in 2012 and limiting infections to only 234. While the magnitude of the disease naturally varies from year to year, the big-data-driven approach adopted by PITB should be given some credit for averting a bigger tragedy. The key was to develop a tracking system that localized troubled areas and contained the disease, thereby preventing its spread. This was aided by mobile phone technology: government workers were provided with 1,500 Android phones to track the location and timing of confirmed dengue cases. The workers photologged the performance of more than 67,000 prevention activities. Based on the tagging, the troubled areas could be localized, and invariably the troubled areas contained standing water pools (a breeding ground for dengue mosquitoes). Countries like Pakistan typically do not have an elaborate setup for disease surveillance. PITB researchers adapted the FluBreaks project, which processed data from Google Flu Trends, for dengue fever outbreaks.
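The project itself used SaTScan for the spatio-temporal analysis; as a greatly simplified stand-in, the sketch below bins made-up geotagged reports into coarse grid cells and flags the cells where larvae-breeding sites and confirmed patients coincide, i.e. the "intersecting areas" mentioned above.

```python
# Greatly simplified stand-in for the SaTScan analysis: bin geotagged reports
# into grid cells and flag cells where larvae-breeding sites and confirmed
# patients coincide. All coordinates below are made up.
GRID = 0.01  # roughly 1 km in latitude/longitude around Lahore

def cell(lat: float, lon: float) -> tuple[int, int]:
    return (int(lat / GRID), int(lon / GRID))

larvae_reports = [(31.5204, 74.3587), (31.5301, 74.3512), (31.5210, 74.3581)]
patient_reports = [(31.5208, 74.3590), (31.5622, 74.3436)]

larvae_cells = {cell(*p) for p in larvae_reports}
patient_cells = {cell(*p) for p in patient_reports}

hotspots = larvae_cells & patient_cells
print("cells needing targeted larviciding and case response:", hotspots)
```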
  35. (Same note as slide 34.)
  36. The investigating officer is provided with an Android phone and is responsible for taking a picture at the crime spot and entering the details of the FIR (First Information Report). The geomap provides information on different types of crimes. Different patterns emerge from recording and analyzing the crimes. For example, it can be seen that mobile phone snatching typically happens in unlit alleys and streets, so one way to contain such crimes would be to install more lighting. Similarly, various other crimes can be filtered, and this provides a way to analyze the various patterns of crime and develop an appropriate strategy for crime prevention. This system is based on a similar system called CompStat. CompStat—or COMPSTAT—(short for COMPuter STATistics) is a combination of management philosophy and organizational management tools for police departments, named after the New York City Police Department's accountability process, and has since been implemented in many other departments. Because it often relies on underlying software tools, CompStat has sometimes been confused with a software program in itself. This is a fundamental misconception. CompStat often does, however, incorporate crime mapping systems and a commercial or internally developed database collection system. In some cases, police departments have started offering information to the public through their own websites. In other cases, police departments can either create their own XML feed or use a third party to display data on a map. The largest of these is CrimeReports.com, used by thousands of agencies nationwide. History of CompStat: in 1994, Police Commissioner William Bratton introduced a data-driven management model in the New York City Police Department called CompStat, which has been credited with decreasing crime and increasing quality of life in New York City over the last eighteen years (Bratton, 1998; Kelling & Bratton, 1998; Shane, 2007). Due to its success in New York, CompStat has diffused quickly across the United States and has become a widely embraced management model focused on crime reduction. The CompStat process is guided by four principles, which are summarized as follows (see McDonald, 2002; Shane, 2007; and Godown, 2009): Accurate and timely intelligence (i.e., "Know what is happening." (Godown, 2009)): in this context, crime intelligence relies on data primarily from official sources, such as calls for service, crime, and arrest data. This data should be accurate and available as close to real time as possible. This crime and disorder data is used to produce crime maps, trends, and other analysis products. Subsequently, command staff uses these information products to identify crime problems to be addressed. Effective tactics (i.e., "Have a plan." (Godown, 2009)): relying on past successes and appropriate resources, command staff and officers plan tactics that will respond fully to the identified problem. These tactics may include law enforcement, government, and community partners at the local, state, and federal levels. A CompStat meeting provides a collective process for developing tactics as well as accountability for developing these tactics. Rapid deployment (i.e., "Do it quickly." (Godown, 2009)): contrary to the reactive policing model, the CompStat model strives to deploy resources to where there is a crime problem now, as a means of heading off the problem before it continues or escalates. As such, the tactics should be deployed in a timely manner. Relentless follow-up and assessment (i.e., "If it works, do more. If not, do something else." (Godown, 2009)): the CompStat meeting provides the forum to check in on the success of current and past strategies in addressing identified problems. Problem-focused strategies are normally judged a success by a reduction in or absence of the initial crime problem. This success, or lack thereof, provides knowledge of how to improve current and future planning and deployment of resources.
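A minimal sketch, with hypothetical FIR records, of the kind of filtering behind the geomap analysis described in this note: group geotagged incidents by crime type and an attribute such as street lighting to surface actionable patterns.

```python
# Minimal sketch with hypothetical FIR records: filter incidents by crime type
# and tally an attribute (street lighting) to surface an actionable pattern.
from collections import Counter

firs = [
    {"crime": "mobile snatching", "area": "Gulberg",    "lit_street": False},
    {"crime": "mobile snatching", "area": "Anarkali",   "lit_street": False},
    {"crime": "mobile snatching", "area": "Model Town", "lit_street": True},
    {"crime": "burglary",         "area": "Gulberg",    "lit_street": True},
]

snatching = [f for f in firs if f["crime"] == "mobile snatching"]
by_lighting = Counter("unlit" if not f["lit_street"] else "lit" for f in snatching)
print(by_lighting)  # Counter({'unlit': 2, 'lit': 1}) -> prioritise street lighting
```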
  37. http://www.amazon.com/The-Big-Data-Driven-Business-Competitors/dp/1118889800
  38. http://www.amazon.com/The-Big-Data-Driven-Business-Competitors/dp/1118889800
  39. Human behavior is complex to model, influence, and predict. While it is becoming increasingly clear that there is a value proposition of big data for tackling complex technical and business problems, it is not obvious how well big data can tackle complex social problems. This is not a surprise: it has to do with people and institutions! So, how can (big) data be used for social good? On paper, it is simple :-) In practice, it is not. We as data scientists and data engineers may employ the same tools as when we work for business or science, but our motivation needs to stem from the desire to help alleviate some of the world's most pressing problems: poverty, disease, ecological harm, war and famine. Think, for example, of what is happening with refugees these days. 2) Health-related decisions especially relate to a person’s social network; smokers typically have smokers in their social networks. Similar network effects can be seen for other problems such as alcohol and depression. There is strong evidence that obesity spreads through social networks. (This is based on a study by the authors of the book Connected, Christakis and Fowler, who studied the spread of obesity in a large social network over 32 years.)
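The network effects mentioned above can be illustrated with a toy threshold model of a behaviour spreading over a small, invented friendship graph; the graph and the adoption threshold are made up for illustration and do not come from the Christakis and Fowler study.

```python
# Toy illustration (invented graph and parameters) of a network effect: a person
# adopts a behaviour once enough of their friends have adopted it.
friends = {
    "A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"],
    "D": ["B", "E"], "E": ["D"],
}
adopted = {"A"}          # initial adopter
THRESHOLD = 0.5          # adopt if at least half of your friends have adopted

changed = True
while changed:
    changed = False
    for person, contacts in friends.items():
        if person not in adopted:
            share = sum(c in adopted for c in contacts) / len(contacts)
            if share >= THRESHOLD:
                adopted.add(person)
                changed = True
print(sorted(adopted))   # the behaviour cascades through the network
```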
  40. (Same note as slide 39.)
  41. Credit: Demography, meet Big Data; Big Data, meet Demography: Reflections on the Data-Rich Future of Population Science By Emmanuel Letouzé Director & Co-Founder, Data-Pop Alliance
  42. The movements and locations of individuals are, of course, traditionally regarded as part of one of the most sensitive areas of privacy. Companies such as Google and Apple are increasingly collecting such data. In the wake of revelations that the National Security Agency (NSA) has accessed information from major Internet companies — including Google, Microsoft, Facebook, Skype, Apple and Yahoo — a debate has begun to unfold. How important might small bits of data or “metadata” be, from phone numbers and GPS tracking data to even just the location “pings” recorded by cellular telecommunications towers? A 2013 study in Scientific Reports (a Nature-group journal), “Unique in the Crowd: The Privacy Bounds of Human Mobility,” is one of the latest research efforts to show how humans can be tracked and identified based on databases that, in principle, contain anonymous data. Researchers from MIT, Harvard and Université Catholique de Louvain in Belgium analyze what they call “mobility traces,” or data that can “approximate [the] whereabouts of individuals and can be used to reconstruct individuals’ movements across space and time.” They point out that “a simply anonymized dataset does not contain name, home address, phone number or other obvious identifier. Yet if an individual’s patterns are unique enough, outside information can be used to link the data back to an individual.” The study analyzes 15 months of mobile phone data relating to about 1.5 million individuals in a small European country during 2006-2007. (A toy sketch of this re-identification risk follows this note.) 1) Ensuring privacy and preventing abuse: the field of big data promises great opportunities but also entails some great risks of abuse and misuse. With big crisis data, there is always the danger of the wrong people getting hold of sensitive data, something that can easily lead to disastrous consequences. The development of appropriate policy can help manage this tension between the opportunities and risks of big data. Some of the big questions that big data policies should address are: (1) what data to open up? (2) who should be able to access which data? (3) which data should be publicly accessible? and (4) how can the data be used, reused, repurposed, and linked? The devised policies must also include prescriptive steps that ensure that data is used ethically (and not misused by malevolent actors and crisis profiteers). In particular, we should take steps to ensure that crisis victims do not unwittingly expose themselves or others to further harm (e.g., in countries beset by civil war or sectarian violence, a request for help that contains personal information may be exploited by malevolent actors for violent purposes). 2) Ethical big crisis data analytics: it is also important that the big crisis data and digital humanitarian communities emphasize value-based and ethical humanitarian service. In this regard, these communities can leverage the collective knowledge of the existing humanitarian organizations, available in the form of the “humanitarian principles” that define a set of universal principles for humanitarian action based on international humanitarian law. These principles are widely accepted by humanitarian actors and are even binding for the UN agencies. The guiding humanitarian principles are: (1) humanity: the humanitarian imperative comes first; aid has to be given in accordance with need. 
The purpose of humanitarian action is to protect life and health and ensure respect for human beings; (2) neutrality: the humanitarian actors must not take sides in hostilities or engage in controversies of a political, racial, religious or ideological nature; (3) impartiality: aid should be delivered without discrimination as to nationality, race, religious beliefs, class or political opinions; and (4) independence: the humanitarian action must be autonomous from the political, economic, military or other objectives that any actor may hold with regard to areas where humanitarian action is being implemented. The big crisis data analytics community also needs to adopt these, or similar, principles to guide their own work.
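To make the re-identification risk discussed above concrete, here is a toy "unicity" check on synthetic mobility traces, loosely in the spirit of the "Unique in the Crowd" study (which reported that a handful of spatio-temporal points is often enough to single out an individual). The trace generator and all parameters are invented.

```python
# Toy unicity check on synthetic mobility traces: how many users are pinned down
# uniquely by K randomly chosen (hour, tower) points from their own trace?
import random

random.seed(0)
TOWERS, HOURS, N_USERS, K = 30, 24, 200, 4

# Each user's trace is a set of (hour, tower) points (entirely synthetic).
traces = {u: {(random.randrange(HOURS), random.randrange(TOWERS))
              for _ in range(20)} for u in range(N_USERS)}

unique = 0
for u, trace in traces.items():
    sample = set(random.sample(sorted(trace), K))   # K points known to an outsider
    matches = [v for v, t in traces.items() if sample <= t]
    unique += (matches == [u])
print(f"{unique}/{N_USERS} users uniquely identified by {K} points")
```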
  43. In 2008, researchers from Google explored this potential, claiming that they could “nowcast” the flu based on people’s searches. The essential idea, published in a paper in Nature, was that when people are sick with the flu, many search for flu-related information on Google, providing almost instant signals of overall flu prevalence. The paper demonstrated that search data, if properly tuned to the flu-tracking information from the Centers for Disease Control and Prevention (CDC), could produce accurate estimates of flu prevalence two weeks earlier than the CDC’s data, turning the digital refuse of people’s searches into potentially life-saving insights. And then GFT failed, and failed spectacularly, missing the peak of the 2013 flu season by 140 percent. When Google quietly euthanized the program, called Google Flu Trends (GFT), it turned the poster child of big data into the poster child of the foibles of big data. But GFT’s failure doesn’t erase the value of big data. What it does do is highlight a number of problematic practices in its use, what we like to call “big data hubris.” The value of the data held by entities like Google is almost limitless, if used correctly. That means the corporate giants holding these data have a responsibility to use them in the public’s best interest. In a paper published in 2014 in Science, our research teams documented and deconstructed the failure of Google to predict flu prevalence. Our team from Northeastern University, the University of Houston, and Harvard University compared the performance of GFT with very simple models based on the CDC’s data, finding that GFT had begun to perform worse. Moreover, we highlighted a persistent pattern of GFT performing well for two to three years and then failing significantly and requiring substantial revision. --- [DAVID HAND] https://www.youtube.com/watch?v=C1zMUjHOLr4 All data sets have problems: distortion, missing values. Data quality is an important issue: it has been well investigated for small data sets, but it may be an even bigger problem for large data sets. --- Holmes famously solves the case by focusing on a critical piece of evidence: a guard dog that doesn’t bark during the commission of the crime. He concludes that “the midnight visitor was someone the dog knew well”, ultimately leading to the determination that the horse’s trainer was the guilty party. The story is often used as an example of the importance of expanding the search for clues beyond the obvious and visible. Caveat emptor: beware of the big noise. If big-sized data were not challenging enough, crisis analysts have to deal with another formidable challenge: big false data. The presence of false data dilutes the signal-to-noise ratio, making the task of finding the right information at the right time even more challenging. This problem of big noise is particularly acute for crowdsourced data, in which noise may be injected intentionally or unintentionally. Intentional sources of noise may come from pranksters or, more sinisterly, through cyber-attacks (this is a particular risk during man-made disasters, such as coordinated terrorist attacks). Unintentional sources of noise also creep into disaster data (e.g., through the spreading of false rumors on social networks, or through the circulation of stale information about some time-critical matter). The data may also be false due to bias. 
Much like the “dog that didn’t bark” that tipped off Sherlock Holmes in one of his investigations, the data that is not captured is sometimes more important than what was captured. This sampling bias is always present in social media and must be investigated using sound statistical analysis (the need for which is not obviated by the large size of the data). As an example of the inherent bias in big data, we note that Google Flu Trends overestimated the size of the 2013 influenza season by 50%, and predicted double the actual number of flu-related doctor visits. ---- A prime example that demonstrates the limitations of big data analytics is Google Flu Trends, a machine-learning algorithm for predicting the number of flu cases based on Google search terms. To predict the spread of influenza across the United States, the Google team analyzed the top fifty million search terms for indications that the flu had broken out in particular locations. While, at first, the algorithm appeared to create accurate predictions of where the flu was more prevalent, it generated highly inaccurate estimates over time. This could be because the algorithm failed to take into account certain variables. For example, the algorithm may not have taken into account that people would be more likely to search for flu-related terms if the local news ran a story on a flu outbreak, even if the outbreak occurred halfway around the world. As one researcher has noted, Google Flu Trends demonstrates that a “theory-free analysis of mere correlations is inevitably fragile.” Summary of research considerations (from the Federal Trade Commission (FTC) report): in light of this research, companies already using or considering engaging in big data analytics should: (1) consider whether their data sets are missing information from particular populations and, if they are, take appropriate steps to address this problem; (2) review their data sets and algorithms to ensure that hidden biases are not having an unintended impact on certain populations; (3) remember that just because big data found a correlation, it does not necessarily mean that the correlation is meaningful; as such, they should balance the risks of using those results, especially where their policies could negatively affect certain populations, and it may be worthwhile to have human oversight of data and algorithms when big data tools are used to make important decisions, such as those implicating health, credit, and employment; and (4) consider whether fairness and ethical considerations advise against using big data in certain circumstances, and whether big data can instead be used in ways that advance opportunities for previously underrepresented populations. Correlations are a way of catching a scientist’s attention, but we need models and mechanisms to explain and predict in a way that advances science and creates practical applications. Concluding something from a single source of data (even if voluminous) is problematic. Sometimes inconvenient data is shrugged away as an outlier, but is there an objective way of knowing what an outlier is? There might be a lot of information in a statistical anomaly that is inadvertently filtered away.
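The Science paper's comparison of GFT against "very simple models based on the CDC's data" can be mimicked with a sketch like the following, using entirely synthetic numbers: a naive lagged-surveillance baseline is scored against a noisier hypothetical nowcast by out-of-sample error.

```python
# Sketch (entirely synthetic numbers): compare a naive lagged-surveillance
# baseline against a hypothetical, noisier nowcast using mean absolute error.
import numpy as np

rng = np.random.default_rng(1)
weeks = 150
ili = 2 + np.sin(np.arange(weeks) * 2 * np.pi / 52) + rng.normal(0, 0.1, weeks)

baseline = ili[:-2]                                  # "CDC-style" value two weeks ago
nowcast = ili[2:] + rng.normal(0, 0.6, weeks - 2)    # hypothetical noisy nowcast
truth = ili[2:]

mae = lambda pred: np.mean(np.abs(pred - truth))
print("lagged-surveillance baseline MAE:", round(mae(baseline), 3))
print("hypothetical nowcast MAE:", round(mae(nowcast), 3))
```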
  44. Chris Anderson, “The End of Theory”, Wired: http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory http://k38.kn3.net/F1558A3F3.jpg Correlation, as any first-year statistics student knows, is not causation. For causation analysis, one needs models, theories, and experiments. For many business applications (such as collaborative filtering for recommendations and personalization), correlations are often enough to do interesting things. How useful correlations can be for BD4D still needs to be investigated. To really gain insight, BD4D science needs to aim at understanding, awareness, or forecasting.
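A tiny synthetic example of the correlation-versus-causation point: a hidden common cause produces a strong correlation between two variables that do not cause each other (all numbers are invented).

```python
# Tiny synthetic example of correlation without causation: a hidden common cause
# (hot weather) drives both ice-cream sales and mosquito activity, producing a
# strong correlation between two variables that do not cause each other.
import numpy as np

rng = np.random.default_rng(7)
temperature = rng.normal(30, 5, 1000)                 # hidden confounder
ice_cream   = 2 * temperature + rng.normal(0, 5, 1000)
mosquitoes  = 3 * temperature + rng.normal(0, 5, 1000)

r = np.corrcoef(ice_cream, mosquitoes)[0, 1]
print("correlation between ice-cream sales and mosquito counts:", round(r, 2))
```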
  45. Excerpt from: Siegel, Eric, “Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die.” And of course, in such scenarios, not all false positives are equally bad: a false positive about who is going to commit a crime is much worse than a false positive about where a crime will be committed.
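A toy illustration, with invented costs and error counts, of why not all false positives are equally bad: the same confusion matrix implies very different expected harm once errors are weighted by their consequences.

```python
# Toy illustration (invented costs and counts): weight errors by consequence
# rather than just counting them.
def expected_cost(false_pos, false_neg, cost_fp, cost_fn):
    return false_pos * cost_fp + false_neg * cost_fn

fp, fn = 50, 10   # same error counts in both scenarios

# Predicting WHERE crime will occur: a false positive mainly wastes a patrol.
print("place-based prediction cost:", expected_cost(fp, fn, cost_fp=1, cost_fn=20))

# Predicting WHO will commit a crime: a false positive harms an innocent person,
# so it is weighted far more heavily.
print("person-based prediction cost:", expected_cost(fp, fn, cost_fp=100, cost_fn=20))
```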
  46. Data acquisition and sharing is difficult. In countries where there is no open data policy, doing BD4D is very difficult. I think the existing works have only scratched the surface of what is possible with big data for development. With BD4D, we can reinforce a positive loop in which a good government uses big data for development, and big-data behavioral insights are used to improve social behavior. Existing tools are quite basic: they make use of crowdsourcing and traditional simple analytics. 1) The real promise of BD4D comes into the picture when we are able to analyze many modes of data---government data, personal data, open data, online data (such as social media), text, video, audio---all in concert. 2) The use of AI becomes more important, especially for NLP. 3) Predictive BD4D analytics is the frontier. With BD4D, we can embed cognitive robotics into the fray, but that opens up a Pandora's box and involves many ethical-use questions.
  47. http://www.amazon.com/The-Big-Data-Driven-Business-Competitors/dp/1118889800 “The Big Data-Driven Business: How to Use Big Data to Win Customers, Beat Competitors, and Boost Profits”, by Russell Glass and Sean Callahan, 2014 Figures from Book: Data Science for Dummies, Lillian Pierson. Big Data for Dummies, Judith Hurwitz Figure from data.gov Figure from data.gov.uk Figure from https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1 From Harvard Data Science course Picture Credit (Data Scientist):
  48. There is a lot to like about the altruism and idealism of BD4D, but implementing BD4D systems that are useful in practice is going to be challenging.