Getting comfortable with Data

Talk at a Data Journalism BootCamp organised by ICFJ, World Bank Group and African Media Initiative in New Delhi to a group of 60 journalists, coders and social sector folks. Other amazing sessions included those from Govind Ethiraj of IndiaSpend, Andrew from BBC, Parul from Google, Nasr from HacksHacker, Thej from DataMeet and David from Code for Africa. http://delhi.dbootcamp.org/

Published in: Data & Analytics

Getting comfortable with Data

  1. 1. Getting comfortable with data @ritvvijparrikh, Data Designer, pykih.com d|Bootcamp, http://delhi.dbootcamp.org, September 5, 2014
  2. 2. About me I help organisations make sense (visual or otherwise) of data. 2005 University: Neural Net based Market Prediction Stock Market Data 2006 Software Developer at Amdocs 2011 Design Lead Amdocs Sales Team to AT&T Product Manager at samhita.org Founded TracksGiving - analytics for charities 2013 Founded pykih - data visualisation Telecom data for AT&T Donation data FirstPost Journalism++ Cologne / Datawrapper Microsoft visual.ly NarendraModi.in
  3. 3. How do developers look at data? v/s How do journalists look at data?
  4. 4. How do developers look at data? v/s How do journalists look at data?
  5. 5. Article in English WHO does not marvel at the prospect of India going to the polls? Starting on April 7th, illiterate villagers and destitute slum-dwellers will have an equal say alongside Mumbai’s millionaires in picking their government. Almost 815m citizens are eligible to cast their ballots in nine phases of voting over five weeks—the largest collective democratic act in history. ! But who does not also deplore the fecklessness and venality of India’s politicians? The country is teeming with problems, but a decade under a coalition led by the Congress party has left it rudderless. Growth…
  6. 6. English article on Genomics Assumption: Unknown domain Genomics is a discipline in genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the function and structure of genomes (the complete set of DNA within a single cell of an organism).[1][2] Advances in genomics have triggered a revolution in discovery-based research to understand even the most complex biological systems such as brain.[3] The field includes efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping. The field also includes studies of intragenomic phenomena …
  7. 7. Conclusion We are comfortable with unknown material in English
  8. 8. Objective: Be comfortable with unknown data sets
  9. 9. Objectives Can we build Data Comprehension skills?
  10. 10. Objectives • What is data made up of? • Data File Formats • Where is the story-worthy data? • Data Types • Properties of Data • Insights / Recipes for stories • Data Aggregation • Basic Spreadsheet Functions
  11. 11. About me Data on Glass Manufacturing Factory Floor in German Language Unknown domain. Unknown language. We still modelled the data correctly.
  12. 12. Let’s dive into basics What is data made up of
  13. 13. Where does data come from? Human Actions / experiences Wind
  14. 14. Where does data come from? Documented Data Human Actions / experiences Wind Documenting
  15. 15. Where does data come from? Documented Data Insights Human Actions / experiences Wind Documenting Sea Travel
  16. 16. Where does data come from? Documented Data Insights Human Actions / experiences Wind Documenting Sea Travel What am I doing Twitter Sentiment about Budget
  17. 17. Where does data come from? Documented Data Insights Human Actions / experiences Wind Documenting What am I doing Twitter Sea Travel Sentiment about Budget Vote Election Commission Political Change
  18. 18. Where does data come from? Human Actions / experiences Documented Data Insights Wind Documenting What am I doing Twitter Sea Travel Sentiment about Budget Vote Election Commission Political Change State Dept. Wires Wikileaks Backdoor Foreign Policy
  19. 19. What has changed? Human Actions / experiences Documented Data Insights Wind Documenting What am I doing Twitter Sea Travel Sentiment about Budget Vote Election Commission Political Change State Dept. Wires Wikileaks Backdoor Foreign Policy
  20. 20. Technology Human Actions / experiences Documented Data Insights Wind Documenting Sea Travel What am I doing Twitter Sentiment about Budget Vote Election Commission Political Change State Dept. Wires Wikileaks Backdoor Foreign Policy
  21. 21. Struggling with Human Actions / experiences Documented Data Insights Grasp Wind Documenting Sea Travel What am I doing Twitter Sentiment about Budget Vote Election Commission Political Change State Dept. Wires Wikileaks Backdoor Foreign Policy
  22. 22. Struggling with Human Actions / experiences Documented Data Insights Story Grasp Wind Documenting Sea Travel What am I doing Twitter Sentiment about Budget Vote Election Commission Political Change State Dept. Wires Wikileaks Backdoor Foreign Policy
  23. 23. What is data made up of? Human Actions / experiences Documented Data Insights Wind Documenting Domain (Human context) Meta data (How is it stored) Sea Travel
  24. 24. Data Comprehension Human Actions / experiences Documented Data Insights Wind Documenting Domain (Human context) Meta data (How is it stored) Grammar of the Data Sea Travel
  25. 25. Documented Data Insights Let’s dive into basics Data Formats Human Actions / experiences
  26. 26. How is the data stored? A format is a pre-defined structure in which 1s and 0s are stored so that software can read them.
  27. 27. How is the data stored? Data: Designed for Machine. Data and Formatting: Designed for Humans.
  28. 28. Machine readable data is for us. Data: Designed for Machine. Data and Formatting: Designed for Humans. Our objective is to discover the story in the data. Formatting will unnecessarily get in the way.
  29. 29. Tabular v/s Document data Designed for Humans Designed for Machine Tabular Document
  30. 30. Scraping / API Integration Designed for Humans Designed for Machine Tabular Scrape / API Document New Terms: PDF Scraping. Web Scraping. API Integration. Developer
  31. 31. Machine readable Tabular data formats Designed for Humans Designed for Machine Tabular Scrape / API Document
  32. 32. * separated values files: | (pipe) acts as a delimiter, allowing us to identify columns; new lines help identify rows. Extend this concept, and you get: Comma Separated Value files, Pipe Separated Value files, Semicolon Separated Value files, Tab Separated Value files …
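The delimiter idea on this slide can be sketched in Python. This is a minimal illustration using the standard `csv` module; the helper name `read_delimited` and the sample rows are made up for the example and are not from the talk.

```python
import csv
import io


def read_delimited(text, delimiter):
    """Split text into rows on new lines and into columns on the delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Hypothetical sample data: the same table stored with three different delimiters.
pipe_text = "city|visitors\nCity A|120\nCity B|75\n"
comma_text = pipe_text.replace("|", ",")
tab_text = pipe_text.replace("|", "\t")

rows = read_delimited(pipe_text, "|")
print(rows[0])   # header row
print(rows[1:])  # data rows

# Only the delimiter differs; the table that comes out is identical.
print(rows == read_delimited(comma_text, ",") == read_delimited(tab_text, "\t"))
```

The only thing that changes between the pipe, comma and tab variants is the `delimiter` argument, which is the point of the slide.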
  33. 33. FYI - Data Formats Designed for Humans Designed for Machine Tabular Scrape / API Document
  34. 34. Let’s open a Government Data Set
  35. 35. Whom was this created for? Designed for Machine Document Designed for Humans Tabular
  36. 36. Whom was this created for? Designed for Humans Designed for Machine Tabular Document Horizontal
  37. 37. Recap: Machine readable data is for us. Data: Designed for Machine. Data and Formatting: Designed for Humans. Our objective is to discover the story in the data. Formatting will unnecessarily get in the way.
  38. 38. What we want is Vertical
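One way to read "horizontal" versus "vertical" is a table with one column per year versus a table with one row per observation. That reading is an assumption, since the slides only show screenshots. Under that assumption, a minimal Python sketch of the conversion could look like this; the function name `wide_to_long` and the numbers are hypothetical.

```python
def wide_to_long(header, rows):
    """Turn one-column-per-year rows into one row per (entity, year, value)."""
    long_rows = []
    for row in rows:
        entity = row[0]  # assumes the first column identifies the row
        for column_name, value in zip(header[1:], row[1:]):
            long_rows.append((entity, column_name, value))
    return long_rows


# Hypothetical "horizontal" table: years spread across columns.
header = ["state", "2011", "2012", "2013"]
wide = [
    ["State A", 10, 12, 15],
    ["State B", 7, 9, 8],
]

long_rows = wide_to_long(header, wide)
for entry in long_rows:
    print(entry)
```

The vertical form repeats the state name on every row, which looks wasteful to a human but is much easier for a tool to filter, sort and aggregate.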
  39. 39. Documented Data Insights Let’s dive into basics Human Actions / experiences Where to find story-worthy data
  40. 40. Where is all the story-worthy data sitting? • data.gov.in • RBI.org.in • mospi.nic.in • planningcommission.nic.in • unicef.org/statistics • indiabudget.nic.in • ncrb.nic.in • mha.nic.in • dise.in • World Bank • Oxfam • IMF • World Health Organisation • …
  41. 41. It could also be in… • data.gov.in • RBI.org.in • mospi.nic.in • planningcommission.nic.in • unicef.org/statistics • indiabudget.nic.in • ncrb.nic.in • mha.nic.in • dise.in • World Bank • Oxfam • IMF • World Health Organisation • … • Tweets • Stock Market • Politician’s speeches • Other news articles • Wiki Leaks • Police FIR reports • Survey • Blogs • Cell phone tower logs
  42. 42. Documented Data Insights Let’s dive into basics Grammar of the Data Human Actions / experiences
  43. 43. Datatype Journalist • String • Number
  44. 44. Datatype Journalist • String • Number Developer • String • Number • Decimal / Float / Scientific • Boolean • Date • Date Time • Time
  45. 45. Datatype String: Hello; Number: 3; Float: 3.03; Boolean: Yes / No, True / False; Date: 3 Feb 2014; Date Time: 3 Feb 2014 1 am; Time: 1am; Blank / Empty / Null
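The datatypes above can be made concrete with a small Python sketch that guesses the type of a raw cell. This is only an illustration: the function `guess_type`, the order of the checks and the single date pattern it accepts are assumptions, not how any particular spreadsheet detects types.

```python
from datetime import datetime


def guess_type(cell):
    """Guess a datatype label for one raw cell of text (illustrative rules only)."""
    text = cell.strip()
    if text == "":
        return "Blank"
    if text.lower() in {"yes", "no", "true", "false"}:
        return "Boolean"
    try:
        int(text)
        return "Number"
    except ValueError:
        pass
    try:
        float(text)
        return "Float"
    except ValueError:
        pass
    try:
        # Assumed date pattern: day, abbreviated English month, year.
        datetime.strptime(text, "%d %b %Y")
        return "Date"
    except ValueError:
        pass
    return "String"


for sample in ["Hello", "3", "3.03", "Yes", "3 Feb 2014", ""]:
    print(repr(sample), "->", guess_type(sample))
```

Date Time and Time are left out to keep the sketch short; they would need their own patterns.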
  46. 46. Datatypes in Google Spreadsheet
  47. 47. Formatting: Things you do to make the data more human readable.
  48. 48. Formatting: Things you do to make the data more human readable. Data → Formatted Data: 3 → 3%; 3.03 → $3.03; 34950683 → 3,49,50,683; 34950683 → 34.950683 Million; Rounding Up → 35 Million
  49. 49. Formatting in Google Spreadsheets
  50. 50. Formatting is for presentation purposes only. Stay away from tools that change the stored value instead of only its presentation, e.g. Round, Currency.
  51. 51. What if Formatting is not used for presentation? Data (data type) → Formatted Data (data type): 3 (Number) → 3% (String); 3.03 (Float) → $3.03 (String); 34950683 (Number) → 3,49,50,683 (String); 34950683 (Number) → 34.950683 Million (String); Rounding Up (Number) → 35 Million (String)
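Why a formatted value stops behaving like a number can be shown in a few lines of Python. The sketch below reuses the values from the slide; the helper `to_number` and the characters it strips are assumptions chosen for the illustration.

```python
raw = 34950683

# Presentation only: the stored value stays a number.
as_millions = f"{raw / 1_000_000:.6f} Million"
print(as_millions)

# Once the formatted text is what gets stored, arithmetic no longer works.
print(raw + raw)            # numbers add
print("$3.03" + "$3.03")    # strings are only glued together


def to_number(formatted):
    """Undo simple formatting: drop thousands separators and a dollar sign."""
    cleaned = formatted.replace(",", "").replace("$", "").strip()
    return float(cleaned)


print(to_number("3,49,50,683"))
print(to_number("$3.03"))
```

A cleaning step like `to_number` is what you end up writing whenever a data set was saved with its formatting baked in.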
  52. 52. Properties of Data Quantitative • things you ADD, e.g. number of sandwiches. Qualitative • things that tell you ATTRIBUTES, e.g. staleness of sandwich, veg or non-veg
  53. 53. Properties of Data … Quantitative • e.g. number of sandwiches • Always a number. Qualitative • e.g. staleness of sandwich, veg or non-veg • May or may not be a number, e.g. number of days ago it was manufactured
  54. 54. Properties of Data … Quantitative • e.g. number of sandwiches • Always a number • Objective: ADD. Qualitative • e.g. staleness of sandwich, veg or non-veg • May or may not be a number, e.g. number of days ago it was manufactured • Objective: Quality / Health
  55. 55. Properties of Data … Geospatial. Terms: • Countries • States / Regions • Districts / Counties • Taluka • Cities • Latitude Longitude. Need for Standardisation: • India = Bharat = Republic of India = Hindustan. Standards: • ISO2 Codes
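Standardising place names usually comes down to a lookup table. A minimal Python sketch, assuming the four spellings from the slide and the two-letter ISO code `IN` for India; the names `ALIASES` and `to_iso2` are made up for the example.

```python
# Every known spelling maps to one standard code.
ALIASES = {
    "india": "IN",
    "bharat": "IN",
    "republic of india": "IN",
    "hindustan": "IN",
}


def to_iso2(name):
    """Return the two-letter code for a country name, or None if unknown."""
    return ALIASES.get(name.strip().lower())


for spelling in ["India", "Bharat", "Republic of India", " hindustan ", "Atlantis"]:
    print(repr(spelling), "->", to_iso2(spelling))
```

Returning `None` for an unknown spelling keeps the gaps visible so they can be added to the table instead of being silently miscounted.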
  56. 56. Properties of Data … Timeseries. Terms: • Year • Month - Year • Date • Date / Time • Time • Day of the Week • Hour
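All of these timeseries terms can be derived from a single date-time value. A small Python sketch using the standard `datetime` module and the example moment from the Datatype slide (3 Feb 2014, 1 am); the dictionary layout and the output formats are choices made for the illustration.

```python
from datetime import datetime

# The date-time used on the Datatype slide: 3 Feb 2014, 1 am.
moment = datetime(2014, 2, 3, 1, 0)

terms = {
    "Year": moment.year,
    "Month - Year": moment.strftime("%m-%Y"),
    "Date": moment.date().isoformat(),
    "Date / Time": moment.isoformat(timespec="minutes"),
    "Time": moment.time().isoformat(timespec="minutes"),
    "Day of the Week": moment.weekday(),  # 0 = Monday ... 6 = Sunday
    "Hour": moment.hour,
}

for name, value in terms.items():
    print(name, "->", value)
```

Each derived term is a different level at which the same records can be grouped, for example tweets per hour versus tweets per day of the week.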
  57. 57. Properties of data … Exercise. Sentiment: Qualitative; Number of tweets: Quantitative; Day: Timeseries
  58. 58. Properties of Data … Source: http://www.bbc.co.uk/news/business-15748696
  59. 59. Properties of Data … Health of Economy: Qualitative; Size of Economy: Quantitative; Countries: Geospatial; Years: Timeseries. Source: http://www.bbc.co.uk/news/business-15748696
  60. 60. Properties of Data … Health of Economy: Qualitative; Size of Economy: Quantitative; Countries: Geospatial; Years: Timeseries; Debt: Relational
  61. 61. Properties of Data … Relational data friends friends Joe Ram Zoe exports India exports US Goa Pune B’lore Hubli
  62. 62. Properties of Data … Relational data
  63. 63. Properties of Data … Even Railway fares are a relationship.
  64. 64. Properties of Data … Hierarchical data Source http://www.pykih.com/data-journalism/election-counting-day-app-for-firstpost
  65. 65. Properties of Data … Hierarchical data is any data that has a tree structure. Journalist: • CEO - VP - Managers - …. • Prime Minister - Cabinet - … • Country - State - City - Zipcode
  66. 66. Properties of Data … Hierarchical data is any data that has a tree structure. Journalist: • CEO - VP - Managers - …. • Prime Minister - Cabinet - … Developer: • Product Hierarchy • Distribution of funds • Flow of Ganga into various tributaries
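A tree like Country - State - City can be stored as nested dictionaries. The Python sketch below is one possible representation, not the one used in the talk; the function name `leaf_paths` and the choice of places are made for the illustration.

```python
# Each key is a node; its value holds the children. Leaves have no children.
tree = {
    "India": {
        "Maharashtra": {"Pune": {}},
        "Karnataka": {"Hubli": {}},
    }
}


def leaf_paths(node, prefix=()):
    """Yield the full path from the root to every leaf of the tree."""
    for name, children in node.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


for path in leaf_paths(tree):
    print(" - ".join(path))
```

The same walk works for CEO - VP - Managers or a product hierarchy, because only the labels change and the tree shape stays the same.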
  67. 67. Properties of Data … Unique Example Source: http://dadaviz.com/i/794
  68. 68. I have $10 to spend in a day Is it more? Is it less?
  69. 69. Data when compared makes sense Everything is relative December 2012 iPhone division revenue for Quarter was $24.4 B Fact
  70. 70. Data when compared makes sense Everything is relative December 2012 iPhone division revenue for Quarter was $24.4 B Fact Story December 2012 Entire Microsoft’s revenue for same Quarter was $20.9 B
  71. 71. Comparisons must have a baseline What is the common denominator Source: http://www.statista.com/chart/2628/police-firearms-discharges/
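A common denominator is often just a rate. The Python sketch below uses entirely hypothetical regions and counts to show how raw totals and per-100,000 rates can rank the same places differently; the function name `rate_per_100k` is made up for the example.

```python
def rate_per_100k(incidents, population):
    """Express a raw count against a shared baseline of 100,000 people."""
    return incidents * 100_000 / population


# Hypothetical figures, not taken from the chart on the slide.
regions = {
    "Region A": {"incidents": 500, "population": 1_000_000},
    "Region B": {"incidents": 200, "population": 100_000},
}

rates = {
    name: rate_per_100k(data["incidents"], data["population"])
    for name, data in regions.items()
}

for name, rate in rates.items():
    print(name, regions[name]["incidents"], "incidents,", rate, "per 100,000 people")
```

Region A has more incidents in absolute terms, but Region B has the higher rate once both are put on the same baseline.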
  72. 72. Let’s dive into basics Recipes for stories Human Actions / experiences Documented Data Insights
  73. 73. India gives maximum citizenship to people from ____? I would assume it is Bangladesh or Nepal. But since Bangladesh’s base is higher… it should be Bangladesh.
  74. 74. India gives maximum citizenship to people from ____? Source: http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=1153
  75. 75. India gives maximum citizenship to people from ____? Pakistan
  76. 76. India gives maximum citizenship to people from ____? Source:http://factchecker.in/pakistanis-get-maximum-indian-citizenship/
  77. 77. What did we do? Hypothesis Testing Source: http://factchecker.in/category/fact-check/
  78. 78. Often you have data but no hypothesis… In such a case, you will explore the data set to find patterns and insights. Census Dashboard - http://www.pykih.com/data-journalism/india-census
  79. 79. Perspectives
  80. 80. Perspectives -> Stories
  81. 81. Two is better than one
  82. 82. Two is better than one If you plot crime in UP across last 10 years, all you get is a LINE chart.
  83. 83. Two is better than one If you plot crime in UP across last 10 years, all you get is a LINE chart. + Political parties ruling UP in same period = Story
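Combining two data sets on a shared column is a join. The Python sketch below joins made-up crime counts with made-up ruling parties by year; none of these numbers or party names are real data, and the variable names are chosen for the illustration.

```python
from collections import defaultdict

# Hypothetical data set 1: crime count per year.
crime_by_year = {2004: 100, 2005: 110, 2006: 90, 2007: 80}

# Hypothetical data set 2: ruling party per year.
party_by_year = {2004: "Party X", 2005: "Party X", 2006: "Party Y", 2007: "Party Y"}

# Join on the shared key (year), keeping only years present in both sets.
joined = [
    (year, crime_by_year[year], party_by_year[year])
    for year in sorted(crime_by_year)
    if year in party_by_year
]

# Group the joined rows by party and average the crime counts.
counts_by_party = defaultdict(list)
for year, crime, party in joined:
    counts_by_party[party].append(crime)

average_by_party = {
    party: sum(counts) / len(counts) for party, counts in counts_by_party.items()
}

for party, average in average_by_party.items():
    print(party, average)
```

On its own, the first data set is only a line over time; the join is what lets each point be read against who was in power.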
  84. 84. When you see Political Speeches as Speeches
  85. 85. When you see Political Speeches as Data
  86. 86. Data is simply documented human actions / experience. Focus on understanding the Grammar behind data.
  87. 87. Fun fact: The word pykih came to us in a CAPTCHA. That’s the day we decided that as long as we do good work, it does not matter what we are called. We are at @pykih
