Big data for development

Junaid Qadir

Talk at the Computer Lab, University of Cambridge

1. Big data for development (BD4D). Junaid Qadir, Information Technology University (ITU), Pakistan

2. Session Outline: Part 0, The big data revolution (What’s driving big data? What’s new about big data?); Part 1, Big data for development (BD4D): BD4D techniques, BD4D data sources, and BD4D applications (big data for health; mobile-based BD4D applications); Part 2, BD4D research at ITU/Punjab IT Board (main ideas pursued; some success stories); Part 3, BD4D challenges and pitfalls (data ownership, privacy, bias, causation, and false positives)

3. Part 0: The big data revolution

4. Big interest in big data

5. Applications of big data: business, recommendations, sports, people operations, transport, smart buildings and energy analytics

6. what’s driving big data?

7. Digital life and Data exhaust

8. What’s driving the data deluge The Internet of Things

9. Democratization of data

10. what’s new about big data?

11. More data trumps clever algorithms

12. Part 1: Big data for development (BD4D)

13. Traditional use of data “Today the data is siloed off and unavailable. When data is in silos you can't make use of it either for evil or for the public good, and we need the public good. We need to stop pandemics. We need to make a greener world. We need to make a fairer world.” Alex “Sandy” Pentland

14. What are we using big data for?

15. How can we use data, and our ability to store and process it, for social good and human development (especially in underdeveloped countries)? Can we use big data for development? Under revision at the Big Data Analytics (BDAN) journal.

16. BD4D techniques: big data analytics/data science for development

17. Various faces of data analytics. “Although the buzzwords describing the field have changed - from ‘Knowledge Discovery’ to ‘Data Mining’ to ‘Predictive Analytics’, and now to ‘Data Science’, the essence has remained the same - discovery of what is true and useful in the mountains of data.” (Gregory Piatetsky-Shapiro, data mining pioneer)

18. Types of Analytics The BD4D research frontier

19. Data science techniques for BD4D Visual Analytics

20. Data science techniques for BD4D: Machine Learning. (Diagram labels: Data, Output, Computer, Program/Model.) Supervised learning: predicting a quantity (regression methods) or predicting a category (classification: trees, forests, kNN, etc.). Unsupervised learning: clustering to self-organize data into categories (k-means, etc.).
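The two learning modes named on this slide can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy points, the labels "low" and "high", and the helper names `knn_predict` and `kmeans` are hypothetical and do not come from the talk. The first function classifies by majority vote among the k nearest labeled points (supervised); the second runs a plain Lloyd-style k-means on the same points with the labels removed (unsupervised).

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Supervised: predict a category by majority vote of the k nearest labeled points."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def kmeans(points, centroids, iterations=10):
    """Unsupervised: group unlabeled points around the given starting centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; leave it alone if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical toy data: (features, label) pairs for two made-up categories.
labeled = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high"),
]
prediction = knn_predict(labeled, (4.5, 4.7))

# Drop the labels and let k-means find the two groups on its own.
unlabeled = [features for features, _ in labeled]
final_centroids, final_clusters = kmeans(unlabeled, centroids=[(0.0, 0.0), (6.0, 6.0)])
print(prediction, final_centroids)
```

On real BD4D data one would normally reach for an established library rather than hand-rolled loops; the sketch only makes the supervised/unsupervised distinction concrete.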

21. Data science techniques for BD4D: Time Series Analysis and Deep Learning

22. BD4D data sources

23. Mobile phones: Mobile phones reach almost four-fifths of the world’s people. More households in developing countries own a mobile phone than have access to electricity or improved sanitation. The “leapfrogging” effect: Kenya’s M-Pesa reached 80% of households in 4 years!

24. Crowdsourcing through social media (for data generation and big data processing): Food requests after the Haiti earthquake. The Ushahidi-Haiti crisis map helps organizations intuitively ascertain where supplies are most needed.

25. Crowdsourcing (crowd computing) (for data generation and big data processing; video: www.youtube.com/watch?v=Z82B1zsvyZU): The aggregation of information in groups results in decisions that are often better than could have been made by any single member of the group.
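A small simulation can make the aggregation claim concrete. This is a sketch under stated assumptions: the true value, the noise level, and the crowd size are hypothetical, and the guesses are modelled as unbiased Gaussian noise around the truth. It compares the error of the crowd's average with the average error of individual members.

```python
import random
from statistics import mean


def crowd_vs_individuals(truth, guesses):
    """Return (error of the averaged guess, average error of individual guesses)."""
    crowd_error = abs(mean(guesses) - truth)
    typical_individual_error = mean(abs(g - truth) for g in guesses)
    return crowd_error, typical_individual_error


rng = random.Random(42)   # fixed seed so the run is repeatable
true_value = 1200.0       # hypothetical quantity being estimated
guesses = [true_value + rng.gauss(0, 200) for _ in range(500)]

crowd_error, individual_error = crowd_vs_individuals(true_value, guesses)
print(f"crowd error: {crowd_error:.1f}, typical individual error: {individual_error:.1f}")
```

By the triangle inequality the averaged guess is never worse than the average individual error, although a particular member may still beat the crowd. Averaging also does not remove an error that the whole crowd shares, which is why the slide says "often" rather than "always".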

26. BD4D applications 1.  Emergency/ Crisis Response 2.  Healthcare 3.  Better governance 4.  Education 5.  Agriculture, Hunger, Food

27. Big data for development Humanitarian emergencies/ crisis response

28. Use of technology in Haiti (2010) The dawn of digital humanitarianism

29. The precursor of big data geo-analytics: A classic example of crisis mapping analytics is John Snow’s cholera map. Snow studied the severe outbreak of cholera in 1854 near Broad Street in London, England.

30. What’s new: big crisis data analytics. The Internet, open source, and open data; mobilization of the wisdom of the crowds; crowd mapping; crowd computing

31. Big data for health: healthcare is the killer development ‘app’. As the United Nations launches a 17-point agenda for helping the world's poor, 267 economists from 44 countries on Friday published a declaration advocating one particular way: Make people healthier.

32. What’s wrong with diagnosis today? It is hit and miss, and done on intuition. MIT graduate student Steven Keating, whose life was saved by his curiosity and by the medical health data that he curated. How medical selfies can save your life!

33. Since the beginning of time, most people have been isolated, without information about or access to the best health practices. But in just the last decade, this situation has changed completely: through the spread of cell phone networks, the vast majority of humanity now has a two-way digital connection that can send voice, text, and most recently, images and digital sensor data. Big data will revolutionize healthcare

34. Mobile-phone-based development analytics: call detail records (CDRs)

35. Mobile big data dimensions. Mobility: As mobile phone users send and receive calls and messages through different cell towers, it is possible to “connect the dots” and reconstruct the movement patterns of a community. Social interaction: The geographic distribution of one’s social connections may be useful both for building demographic profiles of aggregated call traffic and understanding changes in behavior. Economic activity: Mobile network operators use monthly airtime expenses to estimate the household income of anonymous subscribers in order to target appropriate services to them through advertising.
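The “connect the dots” idea for mobility can be sketched as follows. The record layout `(subscriber_id, timestamp, cell_tower_id)` is a simplified, hypothetical schema, not the format of any real operator, and the rows are made up. The sketch orders each subscriber's events in time and counts tower-to-tower moves, which yields a crude aggregate flow table.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified CDR rows: (subscriber_id, timestamp, cell_tower_id).
records = [
    ("u1", 1, "A"), ("u1", 2, "A"), ("u1", 3, "B"), ("u1", 4, "C"),
    ("u2", 1, "B"), ("u2", 2, "C"),
    ("u3", 1, "A"), ("u3", 5, "B"),
]


def tower_transitions(rows):
    """Count moves between consecutive, distinct towers across all subscribers."""
    by_user = defaultdict(list)
    for user, timestamp, tower in rows:
        by_user[user].append((timestamp, tower))

    flows = Counter()
    for events in by_user.values():
        events.sort()  # order each subscriber's events by time
        towers = [tower for _, tower in events]
        for origin, destination in zip(towers, towers[1:]):
            if origin != destination:  # staying on the same tower is not a move
                flows[(origin, destination)] += 1
    return flows


flows = tower_transitions(records)
for (origin, destination), count in sorted(flows.items()):
    print(f"{origin} -> {destination}: {count}")
```

Only the aggregated counts leave the function, not individual trajectories. That is a deliberate choice, given the re-identification risk the deck raises under privacy.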

36. Some CDR-based applications: disaster response, migration analysis, digital epidemiology, proxy census map, socioeconomic indicators, optimizing transportation

37. Mobile-based reality mining Unobtrusive Personal Analytics

38. Computational social science. “What was observed by us … may be observed so well that all the disputes that for so many generations have vexed philosophers are destroyed by visible certainty, and we are liberated from wordy arguments.” “Because of the success of science, there is a kind of a pseudo-science. Social science is an example of a science which is not a science.”

39. Part 2: BD4D research at ITU/PITB. ~120 million people

40. Smartphones and good governance https://www.youtube.com/watch?v=ZcGDj33a8WE

41. Promoting open data (open.punjab.gov.pk)

42. The use of crowdsourcing (e.g., for open education data accountability)

43. Creating new mobile apps: DATAPLUG

44. Some Pakistani success stories

45. Greater transparency

46. Dengue disease epidemic (Lahore, 2011): 21,000 confirmed patients in Punjab; 17,000 patients in Lahore alone; 352 deaths. Disease prevention

47. Dengue activity tracking system (DATs) http://www.pitb.gov.pk/dats

48. Crime analytics

49. Part 3: BD4D challenges/pitfalls: data ownership, privacy, bias, causation, and false positives

50. What can we do with big data? (the optimistic view) “With enough data and the ability to crunch it, virtually any challenge facing humanity today can be solved.” (Eric Schmidt et al., How Google Works, 2014)

51. What can we do with big data? (the pragmatic view) “Technology amplifies human intent and capacity; it doesn't substitute for them.” (Kentaro Toyama) Digital dividends depend on key “analog complements” that include appropriate policies, regulatory frameworks, accountability, and a capable human workforce.

52. Social networks and development: People's behavior is massively influenced by their social networks. Connections can completely transform you! Dev. Insight!

53. Human development is complex … Human behavior is complex to model, influence, and predict. What drives human/societal development is the right mindset, attitude, and intent, which is not necessarily dependent on data. Should governments shape individual choices?

54. Personal data & data ownership: “Personal data is the new oil of the internet and the new currency of the digital world.” (Meglena Kuneva, European Consumer Commissioner)

55. Privacy and big data: In a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier’s antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals. Many questions about ethical data use, transparency, and privacy remain open.
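The kind of measurement behind this claim can be sketched on a toy dataset. This is an illustrative sketch, not the method of the study the slide summarises. The traces, the `(hour, tower)` point format, and the function name `unicity` are assumptions. For each user it draws k points from that user's trace and checks whether any other trace also contains all of them.

```python
import random


def unicity(traces, k, rng):
    """Fraction of users whose k randomly drawn points match no other user's trace."""
    unique = 0
    for points in traces.values():
        # Draw k known points for this user (all of them if the trace is shorter than k).
        sample = set(rng.sample(sorted(points), min(k, len(points))))
        # Count every trace, including the user's own, that contains all sampled points.
        matches = sum(1 for other_points in traces.values() if sample <= other_points)
        if matches == 1:
            unique += 1
    return unique / len(traces)


# Hypothetical traces: each point is (hour, cell_tower_id).
traces = {
    "a": {(1, "T1"), (2, "T2")},
    "b": {(1, "T1"), (2, "T3")},
    "c": {(1, "T1"), (2, "T3")},
}

share_unique = unicity(traces, k=2, rng=random.Random(0))
print(f"share of users pinned down by 2 points: {share_unique:.2f}")
```

In this toy case only user "a" is unique, because "b" and "c" share an identical trace. Real mobility traces are far more varied, which is why a handful of points can single out most people even when names and numbers have been removed.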

56. The (big?) bias/noise of big data. “Is there any point to which you would wish to draw my attention?” “To the curious incident of the dog in the night-time.” “The dog did nothing in the night-time.” “That was the curious incident,” remarked Sherlock Holmes. Sometimes what is not in the data is more important; i.e., the data may be an unrepresentative sample. A second risk is mistaking the noise in big data for the signal. Another self-introduced bias in big data can be outlier filtering/deletion.
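A worked toy example shows how an unrepresentative sample distorts an estimate. All numbers are hypothetical: two groups, a census share for each, and a phone-derived sample in which urban users are over-represented. The sketch compares the naive sample mean with a post-stratified mean that reweights each group to its census share. Post-stratification is a standard survey correction swapped in here for illustration. The talk does not prescribe it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical census shares of the two groups in the whole population.
census_share = {"urban": 0.4, "rural": 0.6}

# Hypothetical phone-derived sample of (group, income): urban users are over-represented
# because phone ownership is assumed to be higher in cities.
phone_sample = [("urban", 300.0)] * 36 + [("rural", 100.0)] * 18


def naive_mean(sample):
    """Average income over the sample as-is, ignoring who is missing from it."""
    return mean(income for _, income in sample)


def post_stratified_mean(sample, shares):
    """Average each group separately, then weight the groups by their census shares."""
    by_group = defaultdict(list)
    for group, income in sample:
        by_group[group].append(income)
    return sum(shares[group] * mean(values) for group, values in by_group.items())


naive = naive_mean(phone_sample)
corrected = post_stratified_mean(phone_sample, census_share)
print(f"naive: {naive:.1f}, post-stratified: {corrected:.1f}")
```

The correction only works for groups that appear in the sample at all and whose population shares are known. People who never show up in the data, the dog that did not bark, cannot be reweighted back in.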

57. Correlation vs. causation. Is induction the new big-data era mode of science? We should not use the same information to both construct and test a hypothesis. Correlation-based analysis often suffices for traditional big data applications, but will it for BD4D?
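The warning against constructing and testing a hypothesis on the same information can be demonstrated with pure noise. This is a sketch under stated assumptions: the data are random numbers with no real relationship, and the sizes (40 rows, 200 candidate features) are arbitrary. The hypothesis, "this feature predicts the target", is formed on the first half of the rows and then checked on the held-out second half.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)
n_rows, n_features = 40, 200
half = n_rows // 2

# Pure noise: by construction no feature is related to the target.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]


def explore_score(feature):
    """Strength of the correlation on the exploration half only."""
    return abs(pearson(feature[:half], target[:half]))


# Construct the hypothesis on the first half: pick the most correlated feature.
best = max(features, key=explore_score)
explore_r = explore_score(best)

# Test it on the second half, which played no part in the selection.
holdout_r = abs(pearson(best[half:], target[half:]))
print(f"exploration |r| = {explore_r:.2f}, hold-out |r| = {holdout_r:.2f}")
```

Searching many candidates makes the exploration correlation look sizeable even though nothing real is there. Only the hold-out figure is an honest test, and on noise it has no reason to stay high.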

58. Machine predictions going awry! In the movie Minority Report, the cop tackles and handcuffs individuals who have committed no crime (yet), proclaiming stuff like: “By mandate of the District of Columbia Precrime Division, I’m placing you under arrest for the future murder of Sarah Marks and Donald Dubin.” The arrested person confronts Cruise and asks: “You ever get any false positives?” In fact, it is very easy to design a pre-crime criminal-catching algorithm that will catch ALL the criminals!
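The closing remark can be made concrete: a "predictor" that flags everyone catches every criminal, at the cost of an enormous number of false positives. The population size and the number of offenders below are hypothetical, chosen only to make the arithmetic visible.

```python
def evaluate(predictions, truths):
    """Confusion-matrix summary for boolean predictions against boolean ground truth."""
    tp = sum(p and t for p, t in zip(predictions, truths))
    fp = sum(p and not t for p, t in zip(predictions, truths))
    fn = sum(not p and t for p, t in zip(predictions, truths))
    return {
        "true_positives": tp,
        "false_positives": fp,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


population = 10_000                            # hypothetical city
truths = [i < 10 for i in range(population)]   # hypothetical: 10 future offenders
flag_everyone = [True] * population            # the "catch ALL the criminals" algorithm

result = evaluate(flag_everyone, truths)
print(result)
```

Recall is perfect, yet only one flagged person in a thousand is an offender. Any evaluation of a predictive BD4D system therefore needs to report precision or the false-positive count alongside recall.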

59. Open research challenges: 1. Mitigating the highlighted pitfalls/challenges 2. Multimodal BD4D analytics 3. Predictive BD4D analytics 4. Combining humans, crowds, and AI 5. Unsupervised BD4D analytics

60. Credits/ Acknowledgments Figures from various sources: Digital life and data exhaust: “Demography, meet Big Data; Big Data, meet Demography”, Emmanuel Letouzé Mobile CDR-based applications from: “Big Data in Action For Development” (The World Bank). Cover pages of various magazines and research papers Various PITB managed websites and applications. Wikipedia/ Online Pictures of Feynman, Galileo, Galton, Pentland, Shapiro. Snapshots from Youtube Videos and Online Websites. Stock photos from various places; the respective owners are the rightful owners of the content. Books Connected (Christakis & Fowler), Inside the Nudge Unit (David Halpern), Geek Heresy (Toyama), The Patient Will See You Now (Eric Topol), Work Rules! (Bock), World Bank Annual Reports. These resources have been used in these lecture slides for educational purpose under the fair use doctrine. The ownership of these resources, if copyrighted, is retained by their respective copyright owners. Various presentations at slideshare, including: http://www.slideshare.net/cloudera/cloudera-for-internet-of-things

61. Concluding remarks: The area of BD4D offers an opportunity to do good research that also leads to tangible human impact and development. We would love to hear your ideas on how we can use (big data & networking) technology for the development of Pakistan (and other underdeveloped countries). junaid.qadir@itu.edu.pk

Editor's Notes

An exabyte is a billion billion (10^18) bytes.
Figure Credit: http://www.slideshare.net/cloudera/cloudera-for-internet-of-things
From Data Driven by DJ Patil and Hilary Mason http://www.oreilly.com/data/free/files/data-driven.pdf Democratizing Data The democratization of data is one of the most powerful ideas to come out of data science. Everyone in an organization should have access to as much data as legally possible. While broad access to data has become more common in the sciences (for example, it is possible to access raw data from the National Weather Service or the National Institutes for Health), Facebook was one of the first companies to give its employees access to data at scale. Early on, Facebook realized that giving everyone access to data was a good thing. Employees didn’t have to put in a request, wait for prioritization, and receive data that might be out of date. This idea was radical because the prevailing belief was that employees wouldn’t know how to access the data, incorrect data would be used to make poor business decisions, and technical costs would become prohibitive. While there were certainly challenges, Facebook found that the benefits far outweighed the costs; it became a more agile company that could develop new products and respond to market changes quickly. Access to data became a critical part of Facebook’s success, and remains something it invests in aggressively. All of the major web companies soon followed suit. Being able to access data through SQL became a mandatory skill for those in business functions at organizations like Google and LinkedIn. And the wave hasn’t stopped with consumer Internet companies. Nonprofits are seeing real benefits from encouraging access to their data—so much so that many are opening their data to the public. They have realized that experts outside of the organization can make important discoveries that might have been otherwise missed. For example, the World Bank now makes its data open so that groups of volunteers can come together to clean and interpret it. 
It’s gotten so much value that it’s gone one step further and has a special site dedicated to public data. Governments have also begun to recognize the value of democratizing access to data, at both the local and national level. The UK government has been a leader in open data efforts, and the US government created the Open Government Initiative to take advantage of this movement. As the public and the government began to see the value of making the data more open, governments began to catalog their data, provide training on how to use the data, and publish data in ways that are compatible with modern technologies. In New York City, access to data led to new Moneyball-like approaches that were more efficient, including finding “a five-fold return on the time of building inspectors looking for illegal apartments” and “an increase in the rate of detection for dangerous buildings that are highly likely to result in firefighter injury or death.” International governments have also followed suit to capitalize on the benefits of opening their data. One challenge of democratization is helping people find the right data sets and ensuring that the data is clean. As we’ve said many times, 80% of a data scientist’s work is preparing the data, and users without a background in data analysis won’t be prepared to do the cleanup themselves. To help employees make the best use of data, a new role has emerged: the data steward. The steward’s mandate is to ensure consistency and quality of the data by investing in tooling and processes that make the cost of working with data scale logarithmically while the data itself scales exponentially.
MORE DATA BEATS A CLEVERER ALGORITHM Suppose you’ve constructed the best set of features you can, but the classifiers you’re getting are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data (more examples, and possibly more raw features, subject to the curse of dimensionality). Machine learning researchers are mainly concerned with the former, but pragmatically the quickest path to success is often to just get more data. As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.) Part of the reason using cleverer algorithms has a smaller payoff than you might expect is that, to a first approximation, they all do the same. This is surprising when you consider representations as different as, say, sets of rules and neural networks. But in fact propositional rules are readily encoded as neural networks, and similar relationships hold between other representations. All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of “nearby.” With non-uniformly distributed data, learners can produce widely different frontiers while still making the same predictions in the regions that matter (those with a substantial number of training examples, and therefore also where most test examples are likely to appear). This also helps explain why powerful learners can be unstable but still accurate. Figure 3 illustrates this in 2-D; the effect is much stronger in high dimensions. In machine learning, is more data always better than better algorithms? https://www.quora.com/In-machine-learning-is-more-data-always-better-than-better-algorithms No. There are times when more data helps, there are times when it doesn't. 
Probably one of the most famous quotes defending the power of data is that of Google's Research Director Peter Norvig claiming that "We don’t have better algorithms. We just have more data." This quote is usually linked to the article on "The Unreasonable Effectiveness of Data", co-authored by Norvig himself (you should probably be able to find the pdf on the web although the original is behind the IEEE paywall). The last nail in the coffin of better models is when Norvig is misquoted as saying that "All models are wrong, and you don't need them anyway" (the author has published his own clarifications on how he was misquoted). The effect that Norvig et al. were referring to in their article had already been captured years before in the famous paper by Microsoft Researchers Banko and Brill [2001] "Scaling to Very Very Large Corpora for Natural Language Disambiguation". In that paper, the authors included a plot showing that, for the given problem, very different algorithms perform virtually the same. However, adding more examples (words) to the training set monotonically increases the accuracy of the model. So, case closed, you might think. Well... not so fast. The reality is that both Norvig's assertions and Banko and Brill's paper are right... in a context. But they are now and again misquoted in contexts that are completely different from the original ones. In order to understand why, we need to get slightly technical. (I don't plan on giving a full machine learning tutorial in this post. If you don't understand what I explain below, read my answer to How do I learn machine learning?) Variance or Bias? The basic idea is that there are two possible (and almost opposite) reasons a model might not perform well. In the first case, we might have a model that is too complicated for the amount of data we have. This situation, known as high variance, leads to model overfitting. 
We know that we are facing a high variance issue when the training error is much lower than the test error. High variance problems can be addressed by reducing the number of features, and... yes, by increasing the number of data points. So, what kind of models were Banko & Brill's, and Norvig dealing with? Yes, you got it right: high variance. In both cases, the authors were working on language models in which roughly every word in the vocabulary makes a feature. These are models with many features as compared to the training examples. Therefore, they are likely to overfit. And, yes, in this case adding more examples will help.
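The variance check described above can be written down as a small rule of thumb. The following Python sketch is only illustrative: the function name `diagnose_fit` and both thresholds are assumptions made for this example, not values from the quoted answer. It encodes the idea that a large gap between training and test error points to high variance (where more data tends to help), while a high error on both points to high bias (where more data alone usually does not).

```python
def diagnose_fit(train_error, test_error, gap_threshold=0.10, error_threshold=0.20):
    """Rough rule of thumb; both thresholds are hypothetical defaults."""
    gap = test_error - train_error
    if gap > gap_threshold:
        # Training error much lower than test error: the model overfits.
        return "high variance: more training data or fewer features may help"
    if train_error > error_threshold:
        # The model cannot even fit the data it was trained on.
        return "high bias: more data alone is unlikely to help"
    return "no obvious variance or bias problem"


print(diagnose_fit(0.02, 0.30))
print(diagnose_fit(0.35, 0.37))
```

In practice the thresholds depend on the problem and on the noise level in the data, so the numbers here should be treated as placeholders.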
http://www.amazon.com/The-Big-Data-Driven-Business-Competitors/dp/1118889800
Big Data Brandwashing.
The following text is excerpted from: “Keeping up with the Quants” “What Are Analytics? By analytics, we mean the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and add value. Analytics can be classified as descriptive, predictive, or prescriptive according to their methods and purpose. Descriptive analytics involve gathering, organizing, tabulating, and depicting data and then describing the characteristics about what is being studied. This type of analytics was historically called reporting. It can be very useful, but doesn’t tell you anything about why the results happened or about what might happen in the future. Predictive analytics go beyond merely describing the characteristics of the data and the relationships among the variables (factors that can assume a range of different values); they use data from the past to predict the future. They first identify the associations among the variables and then predict the likelihood of a phenomenon—say, that a customer will respond to a particular product advertisement by purchasing it—on the basis of the identified relationships. Although the associations of variables are used for predictive purposes, we are not assuming any explicit cause-and-effect relationship in predictive analytics. In fact, the presence of causal relationships is Optimization, another prescriptive technique, attempts to identify the ideal level of a particular variable in its relationship to another. For example, we might be interested in identifying the price of a product that is most likely to lead to high profitability for a product. Similarly, optimization approaches could identify the level of inventory in a warehouse that is most likely to avoid stock-outs (no product to sell) in a retail organization. Analytics can be classified as qualitative or quantitative according to the process employed and the type of data that are collected and analyzed. 
Qualitative analysis aims to gather an in-depth understanding of the underlying reasons and motivations for a phenomenon. Usually unstructured data is collected from a small number of nonrepresentative cases and analyzed nonstatistically. Qualitative analytics are often useful tools for exploratory research—the earliest stage of analytics. Quantitative analytics refers to the systematic empirical investigation of phenomena via statistical, mathematical, or computational techniques. Structured data is collected from a large number of representative cases and analyzed statistically. There are various types of analytics that serve different purposes for researchers: Statistics: The science of collection, organization, analysis, interpretation, and presentation of data Forecasting: The estimation of some variable of interest at some specified future point in time as a function of past data Data mining: The automatic or semiautomatic extraction of previously unknown, interesting patterns in large quantities of data through the use of computational algorithmic and statistical techniques Text mining: The process of deriving patterns and trends from text in a manner similar to data mining Optimization: The use of mathematical techniques to find optimal solutions with regard to some criteria while satisfying constraints Experimental design: The use of test and control groups, with random assignment of subjects or cases to each group, to elicit the cause and effect relationships in a particular outcome Although the list presents a range of analytics approaches in common use, it is unavoidable that considerable overlaps exist in the use of techniques across the types. For example, regression analysis, perhaps the most common technique in predictive analytics, is a popularly used technique in statistics, forecasting, and data mining. 
Also, time series analysis, a specific statistical technique for analyzing data that varies over time, is common to both statistics and forecasting. a. Thomas Davenport and Jeanne G. Harris, Competing on Analytics (Boston: Harvard Business School Press, 2007).” Excerpt from: Davenport, Thomas H. “Keeping Up with the Quants: Your Guide to Understanding and Using Analytics.” iBooks.
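To make the distinction between descriptive and predictive analytics in the excerpt above concrete, here is a minimal Python sketch. The data (advertising spend and units sold) is made up for illustration, and the helper name `fit_line` is an assumption of this example. It uses simple least-squares regression, which the excerpt names as the most common predictive technique.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up data: advertising spend (x) and units sold (y) for four past periods.
spend = [1.0, 2.0, 3.0, 4.0]
units = [2.0, 4.0, 6.0, 8.0]

# Descriptive: summarise what happened.
average_units = mean(units)

# Predictive: use the past association to estimate a future value.
slope, intercept = fit_line(spend, units)
forecast = slope * 5.0 + intercept

print(average_units, slope, intercept, forecast)
```

As the excerpt notes, the fitted association is used for prediction only; nothing in the calculation establishes that spend causes sales.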
http://static.squarespace.com/static/538cea80e4b00f1fad490c1b/54668a77e4b00fb778d22a34/54668d85e4b00fb778d281f9/1367513685000/NLP-of-Big-Data-using-NLTK-and-Hadoop7.png?format=original
Credit: http://devblogs.nvidia.com/parallelforall/wp-content/uploads/sites/3/2014/09/nn_example-624x218.png
Deep learning is: 1. a collection of statistical machine learning techniques 2. used to learn feature hierarchies 3. often based on artificial neural networks
Sources of Big Crisis Data As depicted in a detailed crisis analytics taxonomy shown in Figure 1, there are six important sources of big crisis data. 1) Data exhaust refers to the digital trail that we leave behind as we go about performing our everyday online activities with digital devices. The most important example of data exhaust for big crisis data analytics is the mobile “call detail records” (CDRs), which are generated by mobile telecom companies to capture various details related to any call made over their network. Data exhaust also includes transaction data (e.g., banking records and credit card history) and usage data (e.g., access logs). Most of the data exhaust is owned by private organizations (such as mobile service operators), where it is used mostly in-house for troubleshooting; data exhaust is seldom shared publicly due to legal and privacy concerns. 2) Online activity encompasses all types of user-generated data on the Internet (e.g., emails, SMS, blogs, comments); search activity using a search engine (such as Google search queries); and activities on social networks (such as Facebook comments, Google+ posts, and Twitter tweets). It has been shown in the literature that online activities on different platforms can provide unique insights into crisis development. For example, the short message services Twitter and SMS are used differently in crisis situations: SMS is used mostly on the ground by the affected community, while Twitter is used mostly by the international aid community [4]. The advantage of online data is that it is often publicly available, and thus it is heavily used by academics in big crisis data research. 3) Sensing technologies use various cyber-physical sensing systems (such as ground, aerial, and marine vehicles; mobile phones; wireless sensor nodes) to actively gather information about environmental conditions. 
There are a number of sensing technologies such as (1) remote sensing (in which a satellite or high-flying aircraft scans the earth in order to obtain information about it); (2) networked sensing (in which sensors can perform sensing and can communicate with each other—as in wireless sensor networks); and (3) participatory sensing (in which everyday entities—such as mobile phones, buses, etc.—are fit with sensors). With the emergence of the Internet of Things (IoT) architecture, it is anticipated that sensor data will become one of the biggest sources of big crisis data. Sensing data is usually (but not always) publicly available. 4) Small data and MyData: With big data, the scope of sampling and analysis can be vastly dissimilar (e.g., the unit of sampling is at the individual level, while the unit of analysis is at the country level), but with “small data”, the unit of analysis is similarly scoped as the unit of sampling. When the unit of sampling and analysis is a single person, we call such personal-data-based analysis “MyData”. There is emerging interest in using small data and MyData for personalized solutions, focused on applications like health (e.g., Cornell’s mhealth project led by Deborah Estrin) and sustainable development (e.g., the Small Data lab at the United Nations University). Today individuals rarely own, or even have access to, all of their personal data; but this has started to change (e.g., some hospitals now make individual medical records data accessible to patients). 5) A lot of public-related data—that can be very valuable in the case of a crisis—is already being collected by various public/ governmental/ or municipal offices. This includes census data, birth and death certificates, and other types of personal and socio-economic data. Typically, such data has been painstakingly collected using paper-based traditional survey methods. 
In recent times, advances in digital technology have led people to develop mobile-phone-based data-collection tools that can easily collect, aggregate, and analyze data. Various open-source tools such as the Open Data Kit (ODK) make it trivial for such data to be collected. While public-related data is not always publicly accessible, governments are increasingly adopting the Open Data trend to open up public-related data. 6) Finally, crowdsourcing is an active data collection method in which applications actively involve a wide user base to solicit their knowledge about particular topics or events. Crowdsourcing combines a) digital technology, b) human skills, and c) human generosity and utilizes the cognitive surplus of digital human samaritans (the volunteer open-source coders; the citizens who provide data, or help complete a task) to create a volunteer workforce that can be put to work on large global projects. Crowdsourced data is usually publicly available and is widely used by big crisis data practitioners.
The following is an excerpt from the Crisis Analytics paper. Mobile Phones The rapid adoption of mobile technology has been unprecedented. Smartphones are rapidly becoming the central computer and communication devices in the lives of people around the world. Modern phones are not restricted to only making and receiving calls: current off-the-shelf smartphones can be used to detect, among other things, physical activity (via accelerometers); speech and auditory context (via microphones); location (via GPS) and co-location with others (via Bluetooth and GPS). This transforms the modern crisis response since modern smartphones can now act as general-purpose sensors and individuals can directly engage in the disaster response activities through cloud-, crowd-, and SMS-based technologies. This participatory trend in which the aid efforts are centered on and driven by people (and the fact that aid workers have to work with large amounts of diverse data) makes modern disaster response totally different from traditional approaches. Mobile phone technology is ubiquitously deployed in both developed and underdeveloped countries. CDR-based mobile analytics presents a great opportunity to obtain insights (at a very low cost) about mobility patterns, traffic information, and sociological networks; this information can be profitably utilized during various stages of disaster response (e.g., in epidemic control, and in tracking population dynamics). CDRs have been used by digital humanitarians during various crises (such as the non-profit FlowMinder’s work with anonymous mobile operator data during the Haiti earthquake to follow the massive population displacements) to not only point out the current locations of populations, but also predict their future trajectory [5]. From the WDR 2016 Report: More households in developing countries own a mobile phone than have access to electricity or improved sanitation (figure O.4, panel a). 
Mobile phones, reaching almost four-fifths of the world’s people, provide the main form of internet access in developing countries. But even then, nearly 2 billion people do not own a mobile phone, and nearly 60 percent of the world’s population has no access to the internet. On average, 8 in 10 individuals in the developing world own a mobile phone, and the number is steadily rising. Even among the bottom fifth of the population, nearly 70 percent own a mobile phone. The lowest mobile penetration is in Sub-Saharan Africa (73 percent), against 98 percent in high-income countries. But internet adoption lags behind considerably: only 31 percent of the population in developing countries had access in 2014, against 80 percent in high-income countries. China has the largest number of internet users, followed by the United States, with India, Japan, and Brazil filling out the top five. The world viewed from the perspective of the number of internet users looks more equal than when scaled by income (map O.1)—reflecting the internet’s rapid globalization. Digital finance has promoted financial inclusion, providing access to financial services to many of the 80 percent of poor adults estimated to be excluded from the regulated financial sector. It has boosted efficiency, as the cost of financial transactions has dropped and speed and convenience have increased. And it has led to major innovations in the financial sector, many of which have emerged in developing countries (box S2.1). The benefits pervade almost all areas discussed in this Report. Digital finance makes businesses more productive, allows individuals to take advantage of opportunities in the digital world, and helps streamline public sector service delivery. Like all great opportunities, digital finance also comes with risks. What makes online financial systems easy to use for customers also makes them susceptible to cybercrime. 
The entry of nontraditional players poses new challenges for policy, regulation, and supervision. And the ease of transferring funds across the globe—often anonymously, using means such as cryptocurrencies—might increase illicit financial flows.
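The CDR-based mobility analysis mentioned in the Crisis Analytics excerpt in this note can be illustrated with a short Python sketch. The record layout `(user_id, timestamp, tower_id)` and the function name `tower_flows` are assumptions made for this example; real call detail records carry more fields and vary by operator. The sketch counts how often users move from one cell tower to another, which is one simple way to approximate population movement.

```python
from collections import Counter, defaultdict


def tower_flows(records):
    """Count user movements between consecutive distinct towers.

    records: iterable of (user_id, timestamp, tower_id) tuples (hypothetical layout).
    """
    by_user = defaultdict(list)
    for user, ts, tower in records:
        by_user[user].append((ts, tower))

    flows = Counter()
    for events in by_user.values():
        events.sort()  # order each user's events by time
        for (_, src), (_, dst) in zip(events, events[1:]):
            if src != dst:
                flows[(src, dst)] += 1
    return flows


# Made-up, already anonymised records.
records = [
    ("u1", 1, "A"), ("u1", 2, "A"), ("u1", 3, "B"),
    ("u2", 1, "A"), ("u2", 5, "B"),
    ("u3", 2, "B"), ("u3", 4, "C"),
]
flows = tower_flows(records)
print(flows.most_common())
```

Aggregating to tower-to-tower counts, rather than keeping per-user trajectories, is also one modest step toward the privacy concerns raised elsewhere in the talk, although it is not a substitute for proper anonymisation.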
https://projects.vrac.iastate.edu/REU2011/wp-content/uploads/2011/05/Harnessing-the-Crowdsourcins-Power-of-Social-Media-for-Disaster-Relief.pdf Figure 2 illustrates the food requests on Ushahidi-Haiti, and Figure 3 shows the most affected locations during the Japanese tsunami based on the number of reports mapping on Ushahidi’s crisis map. Using these maps, relief organizations can coordinate resource distribution and make better decisions based on their analysis of crowdsourced data. Fallback plans can be further developed for the top events or to cover the majority of events. Third, providers can include geo-tag information for messages sent from some platforms (such as Twitter) and devices (including handheld smart phones). Such crowdsourced data can help relief organizations accurately locate specific requests for help. Furthermore, visualizing this type of data on a crisis map offers a common disaster view and helps organizations intuitively ascertain the current status.
Wisdom of the Crowds: http://www.amazon.com/Wisdom-Crowds-James-Surowiecki/dp/0385721706/ The following is excerpted from https://en.wikipedia.org/wiki/The_Wisdom_of_Crowds The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations, published in 2004, is a book written by James Surowiecki about the aggregation of information in groups, resulting in decisions that, he argues, are often better than could have been made by any single member of the group. The book presents numerous case studies and anecdotes to illustrate its argument, and touches on several fields, primarily economics and psychology. The book relates to diverse collections of independently deciding individuals, rather than crowd psychology as traditionally understood. Its central thesis, that a diverse collection of independently deciding individuals is likely to make certain types of decisions and predictions better than individuals or even experts, draws many parallels with statistical sampling; however, there is little overt discussion of statistics in the book. Failures of crowd intelligence: Surowiecki studies situations (such as rational bubbles) in which the crowd produces very bad judgment, and argues that in these types of situations their cognition or cooperation failed because (in one way or another) the members of the crowd were too conscious of the opinions of others and began to emulate each other and conform rather than think differently. Although he gives experimental details of crowds collectively swayed by a persuasive speaker, he says that the main reason that groups of people intellectually conform is that the system for making decisions has a systematic flaw. 
Surowiecki asserts that what happens when the decision making environment is not set up to accept the crowd is that the benefits of individual judgments and private information are lost and that the crowd can only do as well as its smartest member, rather than perform better (as he shows is otherwise possible). Applications: Surowiecki is a very strong advocate of the benefits of decision markets and regrets the failure of DARPA's controversial Policy Analysis Market to get off the ground. He points to the success of public and internal corporate markets as evidence that a collection of people with varying points of view but the same motivation (to make a good guess) can produce an accurate aggregate prediction. According to Surowiecki, the aggregate predictions have been shown to be more reliable than the output of any think tank. He advocates extensions of the existing futures markets even into areas such as terrorist activity and prediction markets within companies. To illustrate this thesis, he says that his publisher is able to publish a more compelling output by relying on individual authors under one-off contracts bringing book ideas to them. In this way they are able to tap into the wisdom of a much larger crowd than would be possible with an in-house writing team. Will Hutton has argued that Surowiecki's analysis applies to value judgments as well as factual issues, with crowd decisions that "emerge of our own aggregated free will [being] astonishingly... decent". He concludes that "There's no better case for pluralism, diversity and democracy, along with a genuinely independent press." Applications of the wisdom-of-crowds effect exist in three general categories: Prediction markets, Delphi methods, and extensions of the traditional opinion poll. 
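The aggregation idea behind the wisdom-of-crowds thesis above can be shown with a toy calculation. The Python sketch below uses made-up guesses, and the function name `crowd_vs_individuals` is an assumption of this example. It compares the error of the averaged guess with the average error of the individual guessers.

```python
from statistics import mean


def crowd_vs_individuals(estimates, truth):
    """Return (error of the averaged estimate, average error of individuals)."""
    crowd_error = abs(mean(estimates) - truth)
    typical_individual_error = mean(abs(e - truth) for e in estimates)
    return crowd_error, typical_individual_error


# Made-up guesses of a quantity whose true value is 100.
guesses = [80, 95, 110, 120, 105]
crowd_error, individual_error = crowd_vs_individuals(guesses, truth=100)
print(crowd_error, individual_error)
```

By the triangle inequality the averaged guess can never have a larger absolute error than the average individual. How much smaller it is depends on how diverse and independent the individual errors are, which is the condition the excerpt emphasises: if everyone copies the same mistaken opinion, the errors do not cancel.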
The following is excerpted from the Crisis Analytics paper: Leveraging the Wisdom and the Generosity of the Crowd Broadly speaking, there are only a few ways we can go about problem solving or predicting something: (1) experts, (2) crowds, and (3) machines (working on algorithms; or learning from data). While experts possess valuable experiences and insights, they may also suffer from biases. The benefit of crowds accrues from their diversity: it is typically the case that due to a phenomenon known as “the wisdom of the crowds” [10], the collective opinion of a group of diverse individuals is better than, or at least as good as, the opinion of experts. Crowds can be useful in disaster response in at least two different ways: firstly, crowdsourcing, in which disaster data is gathered from a broad set of users and locations [6]; and secondly, crowdcomputing, in which crowds help process and analyze the data through collaboratively solving “microtasks” [1]. 1) Crowdsourcing: Crowdsourcing is the outsourcing of a job traditionally performed by a designated agent (usually an employee) to an undefined, generally large group of people in the form of an open call. In essence, crowdsourcing is the application of the open-source principles (used to develop products such as Linux, Wikipedia, etc.) to fields outside of software. Crowdsourcing has been used in the context of disaster response in multiple ways [11], including crowdsearching, microtasking, citizen science, rapid translation, data cleaning and verification, developing ML classifiers, and election monitoring [6]. 2) Crowdcomputing: Crowdcomputing is a technique that utilizes crowds for solving complex problems. A notable early use of crowdcomputing was the use of crowdsearching by MIT’s team at the 2009 DARPA Network Challenge. The MIT team solved a time-critical problem in the least time using an integration of social networking, the Internet, and some clever incentives to foment crowd collaboration. 
In contemporary times, a number of “microtasking” platforms have emerged as smarter ways of crowdsearching. A familiar example is “Amazon Mechanical Turk”, the commercial microtasking platform that allows users to submit tasks of a large job (that is too large for a single person or small team to perform) for distribution to a global crowd of workers (who are remunerated in return for performing these microtasks). A number of free and open-source microtasking platforms have also been developed, including generic microtasking platforms such as CrowdCrafting (which was used by Digital Humanitarian Network (DHN) volunteers in response to Typhoon Pablo in the Philippines) as well as humanitarian-response-focused platforms such as MicroMappers. MicroMappers, developed at the Qatar Computing Research Institute (QCRI), was conceived as a fully customized microtasking platform for humanitarian response: a platform that would be on standby and available within minutes of the DHN being activated. MicroMappers can facilitate the microtasking of translation and classification of online user-generated multimedia content (in various formats such as text, images, videos) through tagging.
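Microtasking platforms of the kind described above typically show the same item to several volunteers and then combine their tags. The Python sketch below shows one simple way to do that, a majority vote with an agreement threshold. It is not the actual logic of MicroMappers or any other named platform; the function name `majority_label`, the threshold, and the example tags are all assumptions made for illustration.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.5):
    """Return the most common label if its share exceeds min_agreement, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Hypothetical tags from volunteers for two messages.
tags = {
    "msg-1": ["damage", "damage", "not relevant"],
    "msg-2": ["damage", "request for help"],
}
consensus = {msg: majority_label(v) for msg, v in tags.items()}
print(consensus)
```

Items without a clear majority (here `msg-2`) come back as `None`, so they can be sent to more volunteers or to an expert instead of being labelled on weak evidence.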
Principles: 1. Do no harm 2. Use data to help create peaceful coexistence 3. Use data to help vulnerable people and people in need 4. Use data to preserve and improve the natural environment 5. Use data to help create a fair world without discrimination
From our Crisis Analytics paper: To be sure, the use of mapping technology to combat crisis is not new. A classic example of Crisis Mapping Analytics is John Snow’s Cholera Map. Snow studied the severe outbreak of cholera in 1854 near Broad Street in London, England. Contrary to the prevailing mindset (which held that cholera was spread through polluted air or “miasma”), Snow showed, through the ingenious use of spatial analytics (comprising mapping and detailed statistical analysis), that the cholera cases were all clustered around the pump in a particular street, and that contaminated water, not air, spread cholera. We can also turn to Florence Nightingale, a nurse assigned to the old Barrack Hospital in Scutari during the Crimean War in the 1850s, for another vintage example of data-based crisis analytics. Nightingale combined data analytics and striking visualizations to highlight the importance of proper healthcare and hygiene for checking the spread of disease amongst soldiers. In the work of Snow and Nightingale, we already see the beginnings of crisis analytics: they were already testing for spatial effects (such as autocorrelation; clustering/ dispersion) and testing hypotheses (about proposed correlations and relationships).
The Internet, Open Source, and Open Data In a wide variety of fields, the new Internet-based economy is spawning a paradigm shift in how institutions work. The “open source” culture (the paradigm underlying Internet projects such as Linux and Wikipedia) has ushered in a new era that relies more on collaboration and volunteerism (than on formal organizations). Instead of a rigid structure constrained by scarcity of human resources, the new paradigm is driven by abundance and cognitive surplus (due to technology-driven pooling of volunteer human resources). This open source trend is now also visible in humanitarian development in various manifestations such as digital humanitarianism, user generated information, participatory mapping, volunteered geographic information, open-source software (such as OpenStreetMap) and open data. Many volunteer and technical communities (V&TCs) have exploited these open standards to link data from disparate sources and create mashups (which are defined as a web page/ application that uses or combines data or functionality from multiple existing sources to create new services). Another important trend emerging in the modern era is “Open Data”, which has resulted in unprecedented commoditization and opening up of data. A number of countries from all over the world (more than 40 countries in 2015) have established open data initiatives to open numerous kinds of datasets to the public for greater transparency. Open data can also lead to improved governance through the involvement of the public and better collaboration between public and private organizations. As an example, in the aftermath of the Haiti crisis, volunteers across the world cobbled together data from various sources (including data from satellite maps and mobile companies along with information about health facilities from the maps of the World Health Organization, and police facilities from the Pacific Disaster Center) and plotted them on open-source platforms such as the OpenStreetMap. 
The importance of OpenStreetMap can be gauged from the fact that soon after the earthquake, OpenStreetMap had become the de facto source of Haiti map data for most of the United Nations (UN) agencies. SaTScan™ is a free software that analyzes spatial, temporal and space-time data using the spatial, temporal, or space-time scan statistics. It is designed for any of the following interrelated purposes: • Perform geographical surveillance of disease, to detect spatial or space-time disease clusters, and to see if they are statistically significant. • Test whether a disease is randomly distributed over space, over time or over space and time. • Evaluate the statistical significance of disease cluster alarms. • Perform repeated time-periodic disease surveillance for early detection of disease outbreaks. The software may also be used for similar problems in other fields such as archaeology, astronomy, botany, criminology, ecology, economics, engineering, forestry, genetics, geography, geology, history, neurology or zoology. Data Types and Methods SaTScan uses either a Poisson-based model, where the number of events in a geographical area is Poisson-distributed, according to a known underlying population at risk; a Bernoulli model, with 0/1 event data such as cases and controls; a space-time permutation model, using only case data; an ordinal model, for ordered categorical data; an exponential model for survival time data with or without censored variables; or a normal model for other types of continuous data. The data may be either aggregated at the census tract, zip code, county or other geographical level, or there may be unique coordinates for each observation. SaTScan adjusts for the underlying spatial inhomogeneity of a background population. It can also adjust for any number of categorical covariates provided by the user, as well as for temporal trends, known space-time clusters and missing data. 
It is possible to scan multiple data sets simultaneously to look for clusters that occur in one or more of them.
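To give a feel for the Poisson-based model described above, the following Python sketch scores candidate zones with a Kulldorff-style log-likelihood ratio: each zone's observed case count is compared with the count expected from its share of the population at risk. This is a simplified illustration under stated assumptions, not SaTScan's implementation. It scans single zones only (no moving windows), omits significance testing and covariate adjustment, assumes every zone has a positive population, and uses hypothetical names (`poisson_llr`, `most_likely_zone`) and made-up counts.

```python
import math


def poisson_llr(cases_in, expected_in, total_cases):
    """Log-likelihood ratio for an excess of cases inside a candidate zone."""
    c, e, C = cases_in, expected_in, total_cases
    if c <= e:
        return 0.0  # no excess inside the zone
    inside = c * math.log(c / e)
    outside = (C - c) * math.log((C - c) / (C - e)) if c < C else 0.0
    return inside + outside


def most_likely_zone(zones):
    """zones: dict name -> (cases, population). Scans single zones only."""
    total_cases = sum(c for c, _ in zones.values())
    total_pop = sum(p for _, p in zones.values())
    scores = {}
    for name, (cases, pop) in zones.items():
        # Expected cases if risk were the same everywhere.
        expected = total_cases * pop / total_pop
        scores[name] = poisson_llr(cases, expected, total_cases)
    best = max(scores, key=scores.get)
    return best, scores


# Made-up counts: three equally sized zones, one with an excess of cases.
zones = {"north": (30, 1000), "south": (10, 1000), "east": (10, 1000)}
best, scores = most_likely_zone(zones)
print(best, scores)
```

A high score only flags a candidate cluster. Deciding whether it is statistically significant, which the description above lists as one of the software's purposes, requires a further step that this sketch leaves out.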
As the United Nations launches a 17-point agenda for helping the world's poor, 267 economists from 44 countries on Friday published a declaration advocating one particular way: Make people healthier. "In terms of how much better you can make the world per dollar you spend, it’s very difficult to beat a set of strategic investments in health care," Harvard University economist Larry Summers, who organized the manifesto, said in an interview. "Ours is the unique generation that has the prospect of convergence across the world in health," making the poor as healthy as the rich, Summers said. What the economists are advocating is already included in the United Nations' new Sustainable Development Goals for the next 15 years. To be specific, it's part 8 of goal 3: "Achieve universal health coverage, including financial risk protection, access to quality essential health-care services and access to safe, effective, quality and affordable essential medicines and vaccines for all." The economists clearly feel that health deserves to be called out for special consideration. In their declaration, they "call on global policymakers to prioritize a pro-poor pathway to universal health coverage as an essential pillar of development.” Giving high priority to health implicitly means giving a lower priority to some of the other items on the UN agenda. Summers said he doesn't want to discuss which ones. "I’m not going to get into that game." (Summers, a former Treasury secretary, National Economic Council chief, and Harvard University president, knows a politically radioactive question when he hears it.) Friday's declaration grew out of the Global Health 2035 report by the Lancet Commission on Investing in Health. That report found that each dollar invested in health in poor countries can have a payback of $9 or more. 
World Bank Chief Economist Kaushik Basu, another signer of the declaration, said that having grown up in India made him especially aware of the importance of good health. "In India if you are poor, one health episode spins you into a trap you’ll never get out of," he said.
https://www.youtube.com/watch?v=-L-WFukOARU
Steven Keating also has a TEDx talk on this: https://www.youtube.com/watch?v=U5SafKJgqPM
http://news.mit.edu/2015/student-profile-steven-keating-0401

In 2007, Steven Keating had his brain scanned out of sheer curiosity. Keating had joined a research study that included an MRI scan, and he asked that the scan’s raw data be returned to him. The scan revealed only a slight abnormality, near his brain’s smell center, which he was advised to have re-evaluated in a few years. A second scan, in 2010, showed no change, suggesting that the abnormality was most likely benign. While the second scan provided reassurance, Keating’s knowledge of the abnormality, a result of having access to the raw data from these scans, ultimately led to the detection of a baseball-sized tumor that was removed this past August. Now a graduate student in the Department of Mechanical Engineering and based at the MIT Media Lab, Keating says that his curiosity saved his life, and that his experience with cancer has fueled a strong interest in advocating for open health data.

Discovering a baseball-sized brain tumor
Keating arrived at MIT in fall 2010 as the first student to join the Media Lab’s Mediated Matter Group. Under his advisors (Neri Oxman, the group’s director and the Sony Corporation Career Development Associate Professor of Media Arts and Sciences, and David Wallace, a professor of mechanical engineering and engineering systems), Keating studies digital construction and biologically inspired design. He is pursuing a PhD in mechanical engineering with a minor in synthetic biology. Last July, Keating noticed that he was experiencing a phantom vinegar smell for about 30 seconds every day. Knowing that his 2007 and 2010 research scans showed an abnormality near his smell center, he requested an MRI scan through MIT Medical. The scan revealed that the abnormality had grown into a tumor that needed to be removed as soon as possible.
Keating went to Brigham and Women’s Hospital (BWH) in Boston on Aug. 19 for surgery, accompanied and supported by his family and his girlfriend; Oxman; and Yoel Fink, a professor of materials science and director of MIT’s Research Laboratory of Electronics. The surgery was performed by neurosurgeon E. Antonio Chiocca, and Keating, though sedated, was kept awake while the tumor was removed. This was so doctors could ask him questions while they were probing and cutting brain tissue to ensure they were not damaging the brain’s language center. The 10-hour surgery was captured on video, which, at Keating’s request, was shared with him. His recovery was quick: Keating was out of the hospital after two days, and he was back on the MIT campus within a week. A tissue biopsy confirmed that his tumor was an IDH1-mutant malignant astrocytoma. In this type of brain cancer, which was only first identified by researchers in 2009, the mutated IDH enzyme leads to the production of 2HG, a novel, oncogenic metabolite. Through the Bridge Project — a collaboration between MIT’s Koch Institute for Integrative Cancer Research and the Dana-Farber/Harvard Cancer Center — a cross-institutional research team is exploring how to use 2HG as a biomarker to detect and monitor IDH-mutant cancers. Ovidiu Andronesi, a radiologist at Massachusetts General Hospital (MGH) and a collaborator on this research, applied this monitoring technology via MRI spectroscopy imaging to scan Keating’s brain before and after his surgery. These scans show the reduction of 2HG after doctors removed the tumor; the scans were also shared with Keating, at his request. “As a cancer scientist, hearing Steven talk about 2HG spectroscopy screening as part of his clinical care is remarkable,” says Matthew Vander Heiden, the Eisen and Chang Career Development Associate Professor of Biology and a member of the Koch Institute, who is a leader on this research project. 
“IDH’s role in these cancers was only discovered six years ago, and it is incredible, as well as humbling, that Steven could benefit from some of the basic science done in this short time period since IDH mutations were recognized.”

Diving deeper into the data
Since the surgery, Keating’s curiosity has only become more acute. This has been fueled, in large part, by his close connection with his doctors and the data they were able to provide. “Because of that connection, I had new options,” he says. “I asked for the surgery to be videotaped, for my genome to be sequenced, and for the raw data from scans.” With this abundance of data, Keating is able to apply his own research interests to develop an intimate understanding of his brain and his tumor. In Oxman’s Mediated Matter Group, Keating’s research explores how to leverage 3-D printing and other fabrication methods to print everything from living organisms to entire buildings. With the resources available to him at the Media Lab, he and colleagues James Weaver and Ahmed Hosny at Harvard University’s Wyss Institute for Biologically Inspired Engineering have pored over his health data and created digital and 3-D-printed models of his tumor, brain, and surgically repaired skull. To share his experiences as a patient-scientist, Keating gave a talk at the Koch Institute on Oct. 22 as part of a public event on IDH-mutant cancers. He returned on Nov. 21 to share his story with the Koch Institute’s cancer researchers. “Steven’s story is so inspiring in part because he is approaching his own cancer as a scientific problem, and he is actively seeking the data he needs to solve that problem,” says Tyler Jacks, director of the Koch Institute and the David H. Koch Professor in MIT’s Department of Biology.
“After hearing his story, I think all of us were motivated to get back into the lab.” “Steven’s insatiable curiosity is what science is all about,” adds Nancy Hopkins, a professor emerita of biology, and member of the Koch Institute, who attended both talks. “He addresses even his own cancer as if it were the latest fascinating experiment and as an opportunity to advance knowledge and help others.” Advocating for opening health data Given his up-close-and-personal experience with his health, Keating says he is now a strong believer in open sourcing and allowing patients to have easy access to their own health data. He says he was fortunate that his doctors were willing to share his data, but he did notice many small barriers along the way. “My doctors are incredible for sharing my data and encouraging me to learn more from it,” Keating says. “However, the process raised some questions for me, as I received my data on 30 CDs, without easy tools to understand, learn, or share, and there was no genetic data included. Why CDs? Why limited access for patients to their own data? Can we have a simple, standardized share button at the hospital? Where is the Google Maps, Facebook, or Dropbox for health? It needs to be simple, understandable, and easy, as small barriers add up quickly.” Keating says this cause has personal importance because having access to his health data not only led him to discover his tumor in the first place, but it also helped find the doctors and medical care he needed. “Imagine having your whole medical record that you could not only share with doctors and scientists but also with friends and family, too,” he says. “Patients could get second opinions very easily, and doctors can follow what leaders in the field are doing.” He says there are also huge mutual benefits when patients decide to share their health data with researchers, because it provides them with an actual case to study. 
The same is true when data is shared within patient communities, as those with precisely similar conditions are able to connect with one another. Critics of open-source health data largely point to privacy considerations. This is especially true with regard to patients’ genetic data, which inherently reveals information about their family members. Many also worry about patients making medical decisions based on their own interpretation, against the advice of doctors. Furthermore, people say doctors might second-guess every one of their decisions to the point where the standard of care would decrease. While Keating recognizes and respects these concerns, he says that the landscape of health care is changing — mentioning the rise of wearable technologies that collect personal health data, such as smart watches, as an example. “I’m a strong believer in privacy, but if a patient wants to share, they should be able to,” he says. “Your personal being is your personal property, and you should have the right to share that data if you want to.” This is an area where Keating is leading by example. He has open-sourced his health data on his personal website, where his MRI scans and tumor model are available for download, and he has been meeting with government and hospital officials and leaders in the open-source health data field. He also has been exploring how links can be made between hospitals and open patient data repositories, such as Sage Bionetworks, the Personal Genome Project,Cancer Commons, and Patients Like Me. As a result of his advocacy for open-health data, the White House invited Keating to President Barack Obama’s unveiling of the Precision Medicine Initiative in January. Obama's proposal calls for increased federal investment in patient-powered research that accounts for individual differences in genes, environments, and lifestyles. 
One of the initiative’s primary objectives is accelerating design and testing of tailored cancer treatments through the National Cancer Institute. Having completed proton therapy at MGH with radiation oncologist Helen Shih, Keating is now undergoing chemotherapy at BWH. All the while, his spirits remain high. In an email he sent his friends and family before his surgery, Keating described life as a “wild ride.” However, as wild as it can be, he says that being an MIT student armed with data and a sense of curiosity can make all the difference. “The benefit of MIT is that we can know it’s a ride, but it’s a scary ride unless you have information to make it a curious problem,” he says. “And if it’s a curious problem, it becomes an exciting ride.”
Currently, even in the advanced countries like the US and UK, not all office-based physicians have electronic medical records. There is an emerging trend towards electronic health records, which will then lead towards some sort of health information liquidity. Big data is a disruptive force in healthcare and thus is resisted by health industry in many scenarios (as documented by Eric Topol in his excellent book). ----- “It’s your blood, your DNA, and your money; shouldn’t the images, records, and data belong to you, too? Dr. Topol’s deeply researched, powerfully presented arguments will ruffle feathers in the medical establishment—but he maintains that the new era of smartphones, apps, and tiny sensors is putting the patient in charge for the first time. And he’s right.” —DAVID POGUE, FOUNDER OF YAHOO TECH AND HOST OF PBS’ “NOVA” Excerpt From: Eric Topol. “The Patient Will See You Now: The Future of Medicine is in Your Hands.” iBooks. The Avatar Will See You Now Medical centers are testing new, friendly ways to reduce the need for office visits by extending their reach into patients’ homes. The avatar, Molly, interviews them in Spanish or English about the levels of pain they feel as a video guides them through exercises, while the 3-D cameras of a Kinect device measure their movements. Because it’s a pilot project, Paul Carlisle, the director of rehabilitation services, looks on. But the ultimate goal is for the routine to be done from a patient’s home.  “It would change our whole model,” says Carlisle, who is running the trial as the public hospital looks for creative ways to extend the reach of its overtaxed budget and staff. “We don’t want to replace therapists. But in some ways, it does replace the need to have them there all the time.” The Robot Will See You Now IBM's Watson—the same machine that beat Ken Jennings at Jeopardy—is now churning through case histories at Memorial Sloan-Kettering, learning to make diagnoses and treatment recommendations. 
This is one in a series of developments suggesting that technology may be about to disrupt health care in the same way it has disrupted so many other industries. Are doctors necessary? Just how far might the automation of medicine go?

Your smartphone will see you now

The crowd will see you now
That people scour the pages of the world wide web searching for answers to medical problems is well known. Indeed, doctors label the most diligent seekers of online medical information “cyber-chondriacs”. Some frustrated individuals have even set up their own websites, replete with data about their conditions or those of family members, to encourage strangers to help solve “mum’s medical mystery”, or offer a cure for a particular brain cancer. The need for a “crowdsourced” service like this comes from the number of rare diseases around. The National Institutes of Health, America’s medical agency, recognises 7,000, defined as those that each affect fewer than 200,000 people. A general practitioner cannot possibly recognise all of these. Moreover, it may not be clear to him, even when he knows he cannot help, what sort of specialist the patient should be referred to. Research published in 2013, in the Journal of Rare Disorders, says about 8% of Americans (some 25m people) are affected by rare diseases, and that it takes an average of 7½ years to get a diagnosis. Even in Britain, with all the resources of the country’s National Health Service at a GP’s disposal, rare-disease diagnosis takes an average of 5½ years.
Also, doctors often get it wrong. A survey of eight rare diseases in Europe found that around 40% of patients received an erroneous diagnosis at first. This is something that can lead to life-threatening complications. CrowdMed, though, brings numerous pairs of eyeballs, each with different knowledge behind them, to every problem. Patients submit their cases and may offer a reward of a few hundred dollars to lubricate the process. The volunteer diagnosticians are students, retired doctors, nurses and even laymen and women who enjoy pitting their wits against a good medical mystery. Besides the cash, successful volunteers also get the kudos of rising in the website’s ranking system—and that ranking system is, in turn, used to filter the feedback given to patients, to try to avoid mistakes.
http://www.wsj.com/articles/big-data-cuts-buildings-energy-use-1411937794
While at first glance it is difficult to assess the value of this rather rudimentary data, remarkably useful information on human behavior may be derived from large sets of de-identified call detail records (CDRs). There are at least three dimensions that can be measured:
1. As mobile phone users send and receive calls and messages through different cell towers, it is possible to “connect the dots” and reconstruct the movement patterns of a community. This information may be used to visualize daily rhythms of commuting to and from home, work, school, markets or clinics, but also has applications in modeling everything from the spread of disease to the movements of a disaster-affected population.
2. The geographic distribution of one’s social connections may be useful both for building demographic profiles of aggregated call traffic and understanding changes in behavior. Studies have shown that men and women tend to use their phones differently, as do different age groups. Frequently making and receiving calls with contacts outside of one’s immediate community is correlated with higher socio-economic class.
3. Mobile network operators use monthly airtime expenses to estimate the household income of anonymous subscribers in order to target appropriate services to them through advertising. When people in developing economies have more money to spend, they tend to spend a significant portion of it on topping off their mobile airtime credit. Monitoring airtime expenses for trends and sudden changes could prove useful for detecting the early impact of an economic crisis, as well as for measuring the impact of programmes designed to improve livelihoods.
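The "connect the dots" idea for movement patterns can be sketched in a few lines. This is an illustrative sketch only: the record layout (user, timestamp, tower) and the function name `od_flows` are assumptions, not a real operator's CDR schema. It orders each de-identified user's events in time and counts every change of cell tower as one trip between the two towers.

```python
from collections import Counter, defaultdict


def od_flows(records):
    """Count tower-to-tower movements in de-identified CDR-like records.

    records: iterable of (user, timestamp, tower) tuples (hypothetical layout).
    Returns a Counter mapping (from_tower, to_tower) to the number of moves.
    """
    by_user = defaultdict(list)
    for user, timestamp, tower in records:
        by_user[user].append((timestamp, tower))

    flows = Counter()
    for events in by_user.values():
        events.sort()  # order each user's events in time
        for (_, origin), (_, destination) in zip(events, events[1:]):
            if origin != destination:  # ignore consecutive events on one tower
                flows[(origin, destination)] += 1
    return flows


# Toy records: two anonymous users moving between towers A and B.
toy_records = [
    ("u1", 1, "A"), ("u1", 2, "A"), ("u1", 3, "B"), ("u1", 9, "A"),
    ("u2", 1, "A"), ("u2", 5, "B"),
]
toy_flows = od_flows(toy_records)
```

Aggregating the counts over many users gives an origin-destination picture of a community without looking at any single person's full trajectory.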
The following text is excerpted from: http://en.wikipedia.org/wiki/Empirical The word empirical denotes information gained by means of observation or experiments. Empirical data is data produced by an experiment or observation. A central concept in modern science and the scientific method is that all evidence must be empirical, or empirically based, that is, dependent on evidence or consequences that are observable by the senses. It is usually differentiated from the philosophic usage of empiricism by the use of the adjective empirical or the adverb empirically. The term refers to the use of working hypotheses that are testable using observation or experiment. In this sense of the word, scientific statements are subject to, and derived from, our experiences or observations.
The scale of Punjab: 120 million people; if it were a country, it would be the tenth most populous in the world. The scale of the crowdsourced picture collection: 4.6 million pictures taken to date by 3,000 to 4,000 field workers using $100 smartphones.
The mobile applications used by the government of Punjab are powered by DataPlug, a platform developed by ITU over time. DataPlug, which is based on the University of Washington’s Open Data Kit, makes it trivial to create new customized mobile applications through a drag-and-drop GUI. Using it, non-technical users can create a mobile application in 5 minutes without writing a line of code. An application can be changed even after it has been deployed: the changes are made on DataPlug’s website and are then pushed to all users. The applications support various input options (including text boxes, taking pictures, and recording location).
http://www.un.org/millenniumgoals/reports.shtml
Challenges and considerations:
•  Incentives to open and share big data
•  Need to develop cultural and institutional capabilities

But even easily accessible data such as Facebook “Likes” can predict sensitive characteristics including “sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender” (Kosinski, Stillwell, and Graepel 2013, via WDR 2016). And smartphone sensors can infer a user’s “mood, stress levels, personality type, bipolar disorder, demographics (e.g., gender, marital status, job status, age), smoking habits, overall wellbeing, progression of Parkinson’s disease, sleep patterns, happiness, levels of exercise, and types of physical activity or movement” (see Peppet (2014) for individual references). Data collectors often sell the data to others. One data broker assembled an average of 1,500 pieces of information about more than half a billion consumers worldwide from information people provided voluntarily on various websites.

Statistics: For every person connected to high-speed broadband, five are not. Worldwide, some 4 billion people do not have any internet access, nearly 2 billion do not use a mobile phone, and almost half a billion live outside areas with a mobile signal. More households in developing countries own a mobile phone than have access to electricity or improved sanitation (figure O.4, panel a).

The massive data volumes collected by internet platforms have created a whole new branch of economics, nano-economics, which studies individual, computer-mediated transactions. The main benefit to the user is that services can be tailored to individual needs and preferences, although at the cost of giving up privacy. For the seller, it allows more targeted advertising and even price discrimination, when automated systems can analyze user behavior to determine willingness to pay and offer different prices to different users.
Main vectors and diseases they transmit Vectors are living organisms that can transmit infectious diseases between humans or from animals to humans. Many of these vectors are bloodsucking insects, which ingest disease-producing microorganisms during a blood meal from an infected host (human or animal) and later inject it into a new host during their subsequent blood meal. Mosquitoes are the best known disease vector. Others include ticks, flies, sandflies, fleas, triatomine bugs and some freshwater aquatic snails.
The lesson that was learned: you are best served by predicting such a pandemic; once it has happened, you cannot really contain it. The dengue project was implemented at the scale of Punjab, which has a population of 120 million people. (For reference, the population of the UK is only 64 million.)
2011: 20,000 infections in Punjab, 17,000 in Lahore alone (this was a pandemic by any scale); more than 250 died.
2012: 234 infections and no deaths.

Project Highlights
•  Data capture on the move
•  Verification via GPS coordinates
•  Real-time data entry into central server
•  Consolidated online dashboards, accessible by all stakeholders involved
•  Spatial-temporal analysis (SaTScan) to identify intersecting areas with dengue larvae breeding hotspots and dengue patients reported
•  Built-in early disease detection/warning system with geographical illustrations

Current Standing
•  The system has been successfully operational Punjab-wide, spanning 36 districts and more than 25 departments.
•  4,000 Android phone users
•  More than 5,021,373 anti-dengue surveillance activities submitted via Android mobiles

--------------------
In 2011, Lahore, Pakistan, was hit by the worst outbreak of dengue fever in its history (and anywhere in the world). The outbreak infected 16,000 people and took more than 350 lives. The Punjab IT Board (PITB) mobilized its response using big data, averting any death in 2012 (and limiting the infections to only 234). While the magnitude of the disease naturally varies from year to year, the big-data-driven approach adopted by PITB should be given some credit for averting a bigger tragedy. The key was to develop a tracking system that localized troubled areas and quarantined the disease, thereby preventing its spread. This was aided by mobile phone technology. Government workers were provided with 1,500 Android phones to track the location and timing of confirmed dengue cases. The workers photologged the performance of more than 67,000 prevention activities.
Based on the tagging, the troubled areas could be localized, and these areas invariably contained pools of water (a breeding ground for dengue mosquitoes). Countries like Pakistan typically do not have an elaborate setup for disease surveillance. PITB researchers adapted the Flubreaks project, which processed data from Google Flu Trends, for dengue fever outbreaks.
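The idea of finding areas where larvae breeding hotspots and reported patients intersect can be sketched with a simple grid. This is a deliberately simplified stand-in for the SaTScan analysis mentioned in the notes, not the method PITB used: the grid size, thresholds, function names, and coordinates are all assumptions for illustration. It bins geotagged reports into grid cells and returns the cells where both kinds of report are present in sufficient number.

```python
import math
from collections import Counter


def grid_cell(lat, lon, size=0.01):
    """Map a GPS coordinate to a grid cell of roughly `size` degrees."""
    return (math.floor(lat / size), math.floor(lon / size))


def overlapping_hotspots(larvae_points, patient_points,
                         min_larvae=2, min_patients=1, size=0.01):
    """Return grid cells with enough larvae reports AND enough patient reports.

    larvae_points, patient_points: lists of (lat, lon) tuples taken from
    geotagged field reports (hypothetical input format).
    """
    larvae = Counter(grid_cell(lat, lon, size) for lat, lon in larvae_points)
    patients = Counter(grid_cell(lat, lon, size) for lat, lon in patient_points)
    return sorted(
        cell for cell in larvae
        if larvae[cell] >= min_larvae and patients[cell] >= min_patients
    )


# Toy coordinates (made up): two larvae reports and one patient share a cell.
toy_larvae = [(31.525, 74.345), (31.526, 74.346), (31.605, 74.305)]
toy_patients = [(31.524, 74.344), (31.705, 74.105)]
toy_cells = overlapping_hotspots(toy_larvae, toy_patients)
```

A fixed grid is easy to compute on a dashboard, but unlike a scan statistic it does not test whether a cell's count is higher than chance would explain.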
Investigating officers are provided with Android phones, and each officer is responsible for taking a picture of the crime scene and providing input on an FIR (First Information Report). The geomap provides information on different types of crimes. Different patterns emerge from noting and analyzing the crimes. For example, it can be seen that mobile phone snatching typically occurs in unlit alleys and streets, and thus one way to contain such crimes would be to install more lighting. Similarly, various other crimes can be filtered, and this provides a way to analyze the various patterns of crimes and develop an appropriate strategy for crime prevention. This system is based on a similar system called CompStat.

CompStat, or COMPSTAT (short for COMPuter STATistics), is a combination of management philosophy and organizational management tools for police departments named after the New York City Police Department's accountability process, and has since been implemented in many other departments. Because it often relies on underlying software tools, CompStat has sometimes been confused for a software program in itself. This is a fundamental misconception. CompStat often does, however, incorporate crime mapping systems and a commercial or internally developed database collection system. In some cases, police departments have started offering information to the public through their own websites. In other cases, police departments can either create their own XML feed or use a third party to display data on a map. The largest of these is CrimeReports.com, used by thousands of agencies nationwide.

History of CompStat: In 1994, Police Commissioner William Bratton introduced a data-driven management model in the New York City Police Department called CompStat, which has been credited with decreasing crime and increasing quality of life in New York City over the last eighteen years (Bratton, 1998; Kelling & Bratton, 1998; Shane, 2007).
Due to its success in New York, CompStat has diffused quickly across the United States and has become a widely embraced management model focused on crime reduction. The CompStat process is guided by four principles, which are summarized as follows (see McDonald, 2002; Shane, 2007; & Godown, 2009): Accurate and timely intelligence (i.e., "Know what is happening." (Godown, 2009)): In this context, crime intelligence relies on data primarily from official sources, such as calls for service, crime, and arrest data. This data should be accurate and available as close to real-time as possible. This crime and disorder data is used to produce crime maps, trends, and other analysis products. Subsequently, command staff uses these information products to identify crime problems to be addressed. Effective tactics (i.e., "Have a plan." (Godown, 2009)): Relying on past successes and appropriate resources, command staff and officers plan tactics that will respond fully to the identified problem. These tactics may include law enforcement, government, and community partners at the local, state, and federal levels. A CompStat meeting provides a collective process for developing tactics as well as accountability for developing these tactics. Rapid deployment (i.e., "Do it quickly." (Godown, 2009)): Contrary to the reactive policing model, the CompStat model strives to deploy resources to where there is a crime problem now, as a means of heading off the problem before it continues or escalates. As such, the tactics should be deployed in a timely manner. Relentless follow-up and assessment (i.e., "If it works, do more. If not, do something else." (Godown, 2009)): The CompStat meeting provides the forum to "check-in" on the success of current and past strategies in addressing identified problems. Problem-focused strategies are normally judged a success by a reduction in or absence of the initial crime problem. 
This success or lack thereof, provides knowledge of how to improve current and future planning and deployment of resources.
http://www.amazon.com/The-Big-Data-Driven-Business-Competitors/dp/1118889800
Human behavior is complex to model, influence, and predict: While it is becoming increasingly clear that big data has a value proposition for tackling complex technical and business problems, it is not as obvious how well big data can tackle complex social problems. This is not a surprise: it has to do with people and institutions! So, how can (big) data be used for social good? On paper, it is simple :-) In practice it is not. We, as data scientists and data engineers, may employ tools similar to those we use for business or science, but our motivation needs to stem from the desire to help alleviate some of the world's most pressing problems: poverty, disease, ecological harm, war and famine. Think, for example, of what is going on with the refugees these days.
2) Health-related decisions especially relate to a person’s social network; smokers typically have smokers in their social networks. Similar network effects can be seen for other problems such as alcohol and depression. There is strong evidence that obesity spreads through social networks. (This is based on a study by Christakis and Fowler, the authors of the book Connected, who studied the spread of obesity in a large social network over 32 years.)
Credit: Demography, meet Big Data; Big Data, meet Demography: Reflections on the Data-Rich Future of Population Science By Emmanuel Letouzé Director & Co-Founder, Data-Pop Alliance
The movements and locations of individuals are, of course, traditionally regarded as one of the most sensitive areas of privacy. Companies such as Google and Apple are increasingly collecting such data. In the wake of revelations that the National Security Agency (NSA) has accessed information from major Internet companies (including Google, Microsoft, Facebook, Skype, Apple and Yahoo), a debate has begun to unfold. How important might small bits of data or “metadata” be, from phone numbers and GPS tracking data to even just the location “pings” recorded by cellular telecommunications towers? A 2013 study in Scientific Reports, a journal from the publisher of Nature, “Unique in the Crowd: The Privacy Bounds of Human Mobility,” is one of the latest research efforts to show how humans can be tracked and identified based on databases that, in principle, contain anonymous data. Researchers from MIT, Harvard and Université Catholique de Louvain in Belgium analyze what they call “mobility traces,” or data that can “approximate [the] whereabouts of individuals and can be used to reconstruct individuals’ movements across space and time.” They point out that “a simply anonymized dataset does not contain name, home address, phone number or other obvious identifier. Yet if an individual’s patterns are unique enough, outside information can be used to link the data back to an individual.” The study performs an analysis of 15 months of mobile phone data relating to about 1.5 million individuals in a small European country during 2006-2007. Ensuring Privacy and Preventing Abuse: The field of big data promises great opportunities but also entails some great risks of abuse and misuse. With big crisis data, there is always the danger of the wrong people getting hold of sensitive data, something that can easily lead to disastrous consequences. The development of appropriate policy can help manage this dilemma between the opportunities and risks of big data. 
Some of the big questions that big data policies should address are: (1) what data to open up? (2) who should be able to access which data? (3) which data should be publicly accessible? (4) how can the data be used, reused, repurposed, and linked? The devised policies must also include prescriptive steps that ensure that data is used ethically (and not misused by malevolent actors and crisis profiteers). In particular, we should take steps to ensure that crisis victims do not unwittingly expose themselves or others to further harm (e.g., in countries beset with civil war or sectarian violence, a request for help that contains personal information may be used by malevolent actors for violent purposes). 2) Ethical Big Crisis Data Analytics: It is also important that the big crisis data and the digital humanitarian communities emphasize value-based and ethical humanitarian service. In this regard, these communities can leverage the collective knowledge of the existing humanitarian organizations available in the form of the “humanitarian principles” that define a set of universal principles for humanitarian action based on international humanitarian law. These principles are widely accepted by humanitarian actors and are even binding for the UN agencies. The guiding humanitarian principles are: (1) humanity: the humanitarian imperative comes first; aid has to be given in accordance with need. 
The purpose of humanitarian action is to protect life and health and ensure respect for human beings; (2) neutrality: the humanitarian actors must not take sides in hostilities or engage in controversies of a political, racial, religious or ideological nature; (3) impartiality: aid should be delivered without discrimination as to nationality, race, religious beliefs, class or political opinions; and (4) independence: the humanitarian action must be autonomous from the political, economic, military or other objectives that any actor may hold with regard to areas where humanitarian action is being implemented. The big crisis data analytics community also needs to adopt these, or similar, principles to guide their own work.
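The re-identification risk described in the mobility-trace study above can be made concrete with a small sketch. It estimates, for a toy dataset, how often a handful of known (place, time) points match exactly one person. The function name `unicity`, the `(cell, hour)` point format, and the toy traces are assumptions made for illustration; this is not the study's code or data.

```python
import random
from collections import defaultdict


def unicity(traces, p, trials=200, seed=0):
    """Estimate the fraction of users singled out by p known points.

    traces: dict mapping a user id to a set of (cell, hour) points.
    p: number of points an outside observer is assumed to know.
    """
    rng = random.Random(seed)
    # Only users with at least p points can be probed with p known points.
    users = sorted(u for u, pts in traces.items() if len(pts) >= p)
    if not users:
        return 0.0

    # Inverted index: point -> set of users observed at that point.
    index = defaultdict(set)
    for user, points in traces.items():
        for point in points:
            index[point].add(user)

    unique = 0
    for _ in range(trials):
        target = rng.choice(users)
        known = rng.sample(sorted(traces[target]), p)
        # Users whose traces contain every one of the known points.
        candidates = set.intersection(*(index[pt] for pt in known))
        if candidates == {target}:
            unique += 1
    return unique / trials


if __name__ == "__main__":
    # Hypothetical "anonymized" traces: no names, only tower cell and hour.
    toy = {
        "u1": {(10, 8), (11, 9), (12, 18)},
        "u2": {(10, 8), (14, 9), (12, 18)},
        "u3": {(10, 8), (11, 9), (15, 18)},
    }
    for p in (1, 2, 3):
        print(p, unicity(toy, p))
```

The point of the sketch is that removing names does not remove identity: once the known points intersect down to a single trace, the "anonymous" record is linked back to a person.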
In 2008, researchers from Google explored this potential, claiming that they could “nowcast” the flu based on people’s searches. The essential idea, published in a paper in Nature, was that when people are sick with the flu, many search for flu-related information on Google, providing almost instant signals of overall flu prevalence. The paper demonstrated that search data, if properly tuned to the flu tracking information from the Centers for Disease Control and Prevention, could produce accurate estimates of flu prevalence two weeks earlier than the CDC’s data—turning the digital refuse of people’s searches into potentially life-saving insights. And then, GFT failed—and failed spectacularly—missing at the peak of the 2013 flu season by 140 percent. When Google quietly euthanized the program, called Google Flu Trends (GFT), it turned the poster child of big data into the poster child of the foibles of big data. But GFT’s failure doesn’t erase the value of big data. What it does do is highlight a number of problematic practices in its use—what we like to call “big data hubris.” The value of the data held by entities like Google is almost limitless, if used correctly. That means the corporate giants holding these data have a responsibility to use it in the public’s best interest. In a paper published in 2014 in Science, our research teams documented and deconstructed the failure of Google to predict flu prevalence. Our team from Northeastern University, the University of Houston, and Harvard University compared the performance of GFT with very simple models based on the CDC’s data, finding that GFT had begun to perform worse. Moreover, we highlighted a persistent pattern of GFT performing well for two to three years and then failing significantly and requiring substantial revision. --- [DAVID HAND] https://www.youtube.com/watch?v=C1zMUjHOLr4 All data sets have problems: distortion, missing values. 
Data quality is an important issue: this has been well investigated for small data sets, but this may be an even bigger problem for large data sets. --- Holmes famously solves the case by focusing on a critical piece of evidence, a guard dog that doesn’t bark during the commission of the crime. He concludes “the midnight visitor was someone the dog knew well”, ultimately leading to the determination that the horse’s trainer was the guilty party. The story is often used as an example of the importance of expanding the search for clues beyond the obvious and visible. Caveat Emptor: Beware of the Big Noise If big sized data was not challenging enough, crisis analysts have to deal with another formidable challenge: big false data. The presence of false data dilutes the signal to noise ratio, making the task of finding the right information at the right time even more challenging. This problem of big noise is particularly problematic for crowdsourced data in which the noise may be injected intentionally or unintentionally. Intentional sources of noise may come from pranksters or more sinisterly through cyber-attacks (this is particularly a risk during man-inflicted disasters, such as coordinated terrorist attacks). Unintentional sources of noise also creep into disaster data (e.g., through the spreading of false rumors on social networks; or through the circulation of stale information about some time-critical matter). The data may also be false due to bias. Much like the “dog that didn’t bark” that tipped off Sherlock Holmes in one of his investigations, the data that is not captured is sometimes more important than what was captured. This sampling bias is always present in social media and must be investigated using sound statistical analysis (the need of which is not obviated due to the large size of data). 
As an example of the inherent bias in big data, we note that the Google Flu Tracker overestimated the size of the 2013 influenza season by 50%, and predicted double the amount of flu-related doctor visits. ---- A prime example that demonstrates the limitations of big data analytics is Google Flu Trends, a machine-learning algorithm for predicting the number of flu cases based on Google search terms. To predict the spread of influenza across the United States, the Google team analyzed the top fifty million search terms for indications that the flu had broken out in particular locations. While, at first, the algorithms appeared to create accurate predictions of where the flu was more prevalent, they generated highly inaccurate estimates over time. This could be because the algorithm failed to take into account certain variables. For example, the algorithm may not have taken into account that people would be more likely to search for flu-related terms if the local news ran a story on a flu outbreak, even if the outbreak occurred halfway around the world. As one researcher has noted, Google Flu Trends demonstrates that a “theory-free analysis of mere correlations is inevitably fragile.” Summary of Research Considerations (the Federal Trade Commission (FTC) report): In light of this research, companies already using or considering engaging in big data analytics should: Consider whether your data sets are missing information from particular populations and, if they are, take appropriate steps to address this problem. Review your data sets and algorithms to ensure that hidden biases are not having an unintended impact on certain populations. Remember that just because big data found a correlation, it does not necessarily mean that the correlation is meaningful. As such, you should balance the risks of using those results, especially where your policies could negatively affect certain populations. 
It may be worthwhile to have human oversight of data and algorithms when big data tools are used to make important decisions, such as those implicating health, credit, and employment. Consider whether fairness and ethical considerations advise against using big data in certain circumstances. Consider further whether you can use big data in ways that advance opportunities for previously underrepresented populations. Correlations are a way of catching a scientist's attention, but we need models and mechanisms to explain and predict in a way that advances science and creates practical applications. Concluding something from a single source of data (even if voluminous) is problematic. Sometimes inconvenient data is shrugged away as an outlier. But there’s an objective way of knowing what is an outlier. There might be a lot of information in a statistical anomaly, which may be inadvertently filtered away.
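The caution that a correlation found by big data is not necessarily meaningful can be illustrated with a small simulation. The sketch below screens many purely random "search term" series against a purely random "flu" series, keeps the best match on the first half of the data, and then checks that match on the second half. Everything here is synthetic and hypothetical (the names `pearson` and `best_spurious_fit`, the Gaussian noise, the sizes); it is not Google's method, only a demonstration of how screening enough candidates produces an impressive-looking fit by chance.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_fit(n_weeks, n_candidates, seed=0):
    """Return (training r, held-out r) for the best-matching noise series.

    The target and every candidate are independent noise, so any
    correlation found on the training weeks is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(2 * n_weeks)]
    train, held_out = target[:n_weeks], target[n_weeks:]

    best = None
    for _ in range(n_candidates):
        series = [rng.gauss(0, 1) for _ in range(2 * n_weeks)]
        r_train = pearson(series[:n_weeks], train)
        if best is None or abs(r_train) > abs(best[0]):
            # Record how the same series does on weeks it was not picked on.
            best = (r_train, pearson(series[n_weeks:], held_out))
    return best


if __name__ == "__main__":
    for n_candidates in (10, 100, 2000):
        r_train, r_held_out = best_spurious_fit(20, n_candidates)
        print(n_candidates, round(r_train, 2), round(r_held_out, 2))
```

The strength of the best training correlation grows with the number of candidates screened, even though nothing real connects the series, which is why held-out checks and a model of the mechanism matter.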
Chris Anderson, “The End of Theory”, Wired: http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory http://k38.kn3.net/F1558A3F3.jpg Correlation, as any first-year statistics student knows, is not causation. For causal analysis, one needs models, theories, and experiments. For many business applications (such as collaborative filtering for recommendations and personalization), correlations are often enough to do interesting things. How useful correlations can be for BD4D still needs to be investigated. To really gain insight, BD4D science needs to aim at understanding, awareness, or forecasting.
Excerpt From: Siegel, Eric. “Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die.” iBooks. And of course, in such scenarios, not all false positives are equally bad. A false positive about who is going to commit a crime is much worse than a false positive about where a crime will be committed.
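Why false positives matter so much for rare events can be shown with a short piece of arithmetic. The sketch below computes what share of flagged cases are real, given how rare the event is and how often the model errs. The function name `precision` and all of the rates used are hypothetical illustrations, not figures from the book or from any deployed system.

```python
def precision(base_rate, sensitivity, false_positive_rate):
    """Share of flagged cases that are true positives (Bayes' rule).

    base_rate: fraction of the population that is actually positive.
    sensitivity: fraction of actual positives that the model flags.
    false_positive_rate: fraction of actual negatives that the model flags.
    """
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical rates: the same model applied to rarer and rarer events.
    for base_rate in (0.5, 0.1, 0.01):
        print(base_rate, round(precision(base_rate, 0.9, 0.1), 3))
```

Under these assumed rates, a model that catches 90% of true cases and wrongly flags 10% of the rest is right about only one flag in twelve when the event affects 1% of the population. That is tolerable when a flag sends a patrol to a location, and far more troubling when a flag points at a person.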
Data acquisition and sharing is difficult. In countries where there is no open data policy, doing BD4D is very difficult. I think the existing works have only scratched the surface of what is possible with big data for development. With BD4D, we can reinforce a positive loop in which a good government can use big data for development, and big data behavioral insights can be used to improve social behavior. Existing tools are quite basic. They make use of crowdsourcing and traditional simple analytics. The real promise of BD4D comes into the picture when we are able to analyze many modes of data (government data, personal data, open data, online data such as social media, text, video, audio) all in concert. 2) The use of AI becomes more important, especially for NLP. 3) Predictive BD4D analytics is the frontier. With BD4D, we can embed cognitive robotics into the fray, but that opens up a Pandora’s box and involves many ethical use questions.
http://www.amazon.com/The-Big-Data-Driven-Business-Competitors/dp/1118889800 “The Big Data-Driven Business: How to Use Big Data to Win Customers, Beat Competitors, and Boost Profits”, by Russell Glass and Sean Callahan, 2014 Figures from Book: Data Science for Dummies, Lillian Pierson. Big Data for Dummies, Judith Hurwitz Figure from data.gov Figure from data.gov.uk Figure from https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1 From Harvard Data Science course Picture Credit (Data Scientist):
There’s a lot to like about the altruism and idealism of BD4D, but implementing BD4D systems that are useful in practice is going to be challenging.

Editor's Notes