Visualizing Data Journalism (HasGeek Fifth Elephant)

The presentation is in two parts. The first introduces the core fundamentals of data visualization and applies them in two case studies. The second covers the challenges of data journalism and what pykih is doing about them.

Transcript

  • 1. Visualizing Data Journalism. Ritvvij Parrikh, Founder, www.pykih.com. Fifth Elephant, Delhi Run-up Event, India Today Mediaplex, June 14, 2014
  • 2. Introduction: Pykih is a data visualization company. We build custom visual representations of large data sets to make data actionable for readers. We have satisfied customers in six countries.
  • 3. Agenda: Data Viz.; Theory; Case Study 1; Case Study 2; Summary; Challenges in Data Journalism; What we are doing about it for ourselves
  • 4. Data Visualization
  • 5. Let’s explore the humble pie chart… Break the whole into parts. Party / Percentage: E 38%; D 25%; C 20%; B 15%; A 2%.
  • 6. Let’s explore the humble pie chart… Break the whole into parts. Party / Percentage: E 38%; D 25%; C 20%; B 15%; A 2%. Data: One dimensional. Visual Encoding: Area.
  • 7. New Terms • Dimension: Columns by which you group data. • Facts: Numbers that you can count, sum, average, etc. • Examples: Seat count by party; seat count by party and state. • Visual Encoding: Area, Position, Colour, Length, Thickness, etc.
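
A minimal Python sketch of the dimension/fact distinction from the "New Terms" slide. The rows, the `state` values and the helper name `sum_fact` are hypothetical and only illustrate grouping one fact (seat count) by one or two dimensions.

```python
from collections import defaultdict

# Hypothetical rows: two dimensions (party, state) and one fact (seats).
rows = [
    {"party": "A", "state": "X", "seats": 3},
    {"party": "A", "state": "Y", "seats": 2},
    {"party": "B", "state": "X", "seats": 4},
    {"party": "B", "state": "Y", "seats": 1},
]


def sum_fact(rows, dimensions, fact):
    """Group rows by the given dimension columns and sum the fact column."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[fact]
    return dict(totals)


# "Seat count by party": one dimension.
seats_by_party = sum_fact(rows, ["party"], "seats")

# "Seat count by party and state": two dimensions, same fact.
seats_by_party_and_state = sum_fact(rows, ["party", "state"], "seats")
```

The same fact column feeds both results; only the list of dimensions changes, which is what moves a data set from a one-dimensional chart to a two-dimensional one.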
  • 8. One-dimensional Charts PIE is a one-dimensional chart
  • 9. One-dimensional Charts … A pie could have been a random shape broken by percentage
  • 10. One-dimensional Charts … Pie, Amoeba, Percentage Rectangle, Donut, Percentage Triangle, Bubble, Election Donut, Funnel, Percentage Bar, Percentage Column. #1 - The same data can be visualized in many (MANY!) different ways.
  • 11. One-dimensional Charts … Source: thehindu.com What is wrong here?
  • 12. One-dimensional Charts … What is wrong here? Problems: • Colour communicates no data • 3D communicates no data. Source: thehindu.com
  • 13. One-dimensional Charts … Source: thehindu.com. Problems: • Colour communicates no data • 3D communicates no data. #2 - Your goal is to communicate data. Wrong use of visual encoding confuses.
  • 14. One-dimensional Charts … Source: firstpost.com What is wrong here?
  • 15. One-dimensional Charts … What is wrong here? Problems: • Colour • Too many values. Too cluttered. Source: firstpost.com
  • 16. One-dimensional Charts … Problems: • Colour • Too many values. Too cluttered. Source: firstpost.com. #3 - AREA encoding is useful for only a few values, after which it is unreadable.
  • 17. One-dimensional Charts … Solution to problem of restricted space? Create a custom chart.
  • 18. New Data Set • One dimensional: Seat count by party • Grouped One dimensional: Seat count by party grouped by alliance
  • 19. Grouped One-dimensional Charts. Party / Alliance / Percentage: A, NDA, 38%; B, NDA, 25%; C, NDA, 20%; D, UPA, 15%; E, Others, 2%.
  • 20. Grouped One-dimensional Charts. Group various bubbles by colours. Party / Alliance / Percentage: A, NDA, 38%; B, NDA, 25%; C, NDA, 20%; D, UPA, 15%; E, Others, 2%.
  • 21. Grouped One-dimensional Charts. Group various bubbles by colours. Party / Alliance / Percentage: A, NDA, 38%; B, NDA, 25%; C, NDA, 20%; D, UPA, 15%; E, Others, 2%. #4 - You can always fit in an extra dimension (GROUP) in charts using colour.
  • 22. New Data Set • One dimensional: Seat count by party • Grouped One dimensional: Seat count by party grouped by alliance • Two dimensional: Which party won in which constituency
  • 23. Two-dimensional Charts. Plot two data points. Party / Constituency: A, Z; B, Y; C, X; D, V; E, W. Visual encoding: Position, Length
  • 24. Two-dimensional Charts… Connect the dots and you get a line chart.
  • 25. Two-dimensional Charts… Scatter, Line, Area, Bar, Column, Spider. All these charts require the same data. #5 - The number of dimensions in the data determines which chart to use.
  • 26. New Data Set • One dimensional: Seat count by party • Grouped One dimensional: Seat count by party grouped by alliance • Two dimensional: Which party won in which constituency • Weighted Two dimensional: Which party won in which constituency by what vote margin
  • 27. Weighted Two-dimensional Charts This is a 2d chart.
  • 28. Weighted Two-dimensional Charts … Let’s add weight to it, hence now we have three data points. X axis / Y axis / Weight: A, Z, 40; B, Y, 20; C, X, 1; D, V, 300; E, W, 60. Visual encoding: Position, Length, Area
  • 29. Weighted Two-dimensional Charts … Weighted Scatter, Circle Comparison. All these charts require the same data. #6 - You can always fit in an extra fact (WEIGHT) in charts using size.
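
A small Python sketch of how a weight can be mapped to bubble size when the visual encoding is area, using the weights from the weighted two-dimensional example. Scaling the radius by the square root of the weight, so that circle area stays proportional to the value, is a common convention and an assumption here, not something the slides state; `bubble_radius` and `max_radius` are hypothetical names.

```python
import math


def bubble_radius(weight, max_weight, max_radius=40.0):
    """Return a radius whose circle AREA is proportional to weight.

    Area grows with the square of the radius, so the radius is scaled
    by the square root of the weight ratio.
    """
    if max_weight <= 0:
        raise ValueError("max_weight must be positive")
    return max_radius * math.sqrt(weight / max_weight)


# Weights from the weighted two-dimensional example.
weights = {"A": 40, "B": 20, "C": 1, "D": 300, "E": 60}
max_w = max(weights.values())
radii = {name: bubble_radius(w, max_w) for name, w in weights.items()}
```

With this scaling, A (weight 40) gets twice the area of B (weight 20) rather than twice the radius, which would overstate the difference.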
  • 30. New Data Set • One dimensional: Seat count by party • Grouped One dimensional: Seat count by party grouped by alliance • Two dimensional: Which party won in which constituency • Weighted Two dimensional: Which party won in which constituency by what vote margin • Grouped Weighted Two dimensional: Which party won in which constituency by what vote margin grouped by alliance
  • 31. Grouped Weighted Two-dimensional Charts: Grouped Weighted Scatter, Grouped Circle Comparison. Visual encoding: Position, Length, Area, Colour
  • 32. Multi-series Two-dimensional Charts … Range, Gantt, Multi-series Line, Group Column, Stack Column, Group Stack Column, Stack Area, Stack Percentage Area. Add more dimensions in creative ways.
  • 33. Multi-series Two-dimensional Charts … What is right and wrong here? Source: livemint.com Is the equities rally percolating into the broader market?
  • 34. Multi-series Two-dimensional Charts … What is right and wrong here? Source: livemint.com. Is the equities rally percolating into the broader market? Bad parts: • The BSE Small-cap line is not visible, and that’s the story.
  • 35. Multi-series Two-dimensional Charts … What is right and wrong here? Good parts: • Y axis starts from 97 instead of 0. Source: livemint.com. Is the equities rally percolating into the broader market? Bad parts: • The BSE Small-cap line is not visible, and that’s the story. #7 - The purpose of a line chart is to show trend. Focus on it.
  • 36. Multi-series Two-dimensional Charts … What is wrong here? Source: livemint.com Does IMF wear rose-tinted glasses?
  • 37. Multi-series Two-dimensional Charts … What is wrong here? Source: livemint.com. Does IMF wear rose-tinted glasses? Problems: • Cannot find the IMF line.
  • 38. Multi-series Two-dimensional Charts … What is wrong here? Source: livemint.com. Does IMF wear rose-tinted glasses? Problems: • Cannot find the IMF line. #8 - Highlight the story for the user. Use colour to highlight, not confuse.
  • 39. New Data Set. All the data we encountered so far was tabular (RDBMS-style), i.e. it could fit in a spreadsheet (rows and columns). Sometimes data is more complex. It can have “relationships”. Types of relationships: • Hierarchy / Tree • Multi-level relationships
  • 40. Tree Charts { "name": "root", "children": [ { "name": "A", "children": [ {"name": "A1"}, {"name": "A2"}, {"name": "A3"}, {"name": "A4"} ] } ] } Visual encoding: Position
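
A short Python sketch of reading the tree structure from the "Tree Charts" slide. It assumes only the `name` / `children` keys shown there; the helpers `walk` and `leaf_count` are hypothetical names. Depth and leaf count are the kind of values a dendrogram layout typically needs for positioning.

```python
# The tree from the slide, with the brackets closed.
tree = {
    "name": "root",
    "children": [
        {
            "name": "A",
            "children": [
                {"name": "A1"},
                {"name": "A2"},
                {"name": "A3"},
                {"name": "A4"},
            ],
        },
    ],
}


def walk(node, depth=0):
    """Yield (depth, name) for every node, parents before children."""
    yield depth, node["name"]
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


def leaf_count(node):
    """Count the nodes that have no children."""
    children = node.get("children", [])
    if not children:
        return 1
    return sum(leaf_count(child) for child in children)


flat = list(walk(tree))
leaves = leaf_count(tree)
```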
  • 41. Tree Charts: Dendrogram, Circular Dendrogram
  • 42. Grouped Weighted Tree Charts: Packed Circle, Sunburst, Tree Rectangle, Tree Bar, Grouped Weighted Tree. Visual encoding: Position, Size, Colour
  • 43. Grouped Weighted Tree Charts: Sunburst. Visual encoding: Position, Size, Colour
  • 44. Grouped Multi-level Relationship Charts { "nodes": [ {"name": "A", "group": "G1"}, {"name": "B", "group": "G2"}, … ], "relations": [ {"from": "A", "to": "B"}, {"from": "A", "to": "C"}, … ] } Visual encoding: Position
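
A minimal Python sketch that turns the nodes/relations structure from the slide into an adjacency list, a shape that graph layouts commonly consume. The slide elides the remaining nodes, so node "C" and its group are hypothetical, as is the helper name `adjacency`.

```python
graph = {
    "nodes": [
        {"name": "A", "group": "G1"},
        {"name": "B", "group": "G2"},
        {"name": "C", "group": "G2"},  # hypothetical: elided on the slide
    ],
    "relations": [
        {"from": "A", "to": "B"},
        {"from": "A", "to": "C"},
    ],
}


def adjacency(graph):
    """Map every node name to the list of nodes it points to."""
    known = {node["name"] for node in graph["nodes"]}
    neighbours = {name: [] for name in sorted(known)}
    for relation in graph["relations"]:
        source, target = relation["from"], relation["to"]
        if source not in known or target not in known:
            raise KeyError(f"relation refers to unknown node: {relation}")
        neighbours[source].append(target)
    return neighbours


# The group dimension stays available for colouring the nodes.
group_of = {node["name"]: node["group"] for node in graph["nodes"]}
links = adjacency(graph)
```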
  • 45. Grouped Multi-level Relationship Charts: Graph, Collapsible Graph, Hive. #9 - Look for relationships across data sets.
  • 46. Weighted Grouped Multi-level Relationship Charts: Sankey. Visual encoding: Position, Colour, Size
  • 47. Case: Mumbai Local Fare Chart. A fare exists for travel between station "A" and "B". Hence, it is a relationship chart.
  • 48. Case: Mumbai Local Fare Chart: Matrix, Half Matrix. [ {"node1": "A", "node2": "B", "weight": 300}, {"node1": "A", "node2": "C", "weight": 900}, … ]
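
A Python sketch of the matrix and half-matrix idea for the fare data, using only the two entries shown on the slide. It assumes a fare is the same in both directions, which is what makes half of the matrix redundant; that assumption and the helper names `full_matrix` and `half_matrix` are not from the talk.

```python
fares = [
    {"node1": "A", "node2": "B", "weight": 300},
    {"node1": "A", "node2": "C", "weight": 900},
]


def full_matrix(edges):
    """Build a symmetric station-by-station matrix; None marks unknown fares."""
    stations = sorted({e["node1"] for e in edges} | {e["node2"] for e in edges})
    index = {station: i for i, station in enumerate(stations)}
    size = len(stations)
    matrix = [[None] * size for _ in range(size)]
    for edge in edges:
        i, j = index[edge["node1"]], index[edge["node2"]]
        matrix[i][j] = edge["weight"]
        matrix[j][i] = edge["weight"]  # assumption: same fare both ways
    return stations, matrix


def half_matrix(matrix):
    """Keep only the cells below the diagonal of a symmetric matrix."""
    return [row[:i] for i, row in enumerate(matrix)]


stations, matrix = full_matrix(fares)
half = half_matrix(matrix)
```

The half matrix carries the same information in roughly half the cells, which matters when the chart has to fit many stations.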
  • 49. Case: Mumbai Local Fare Chart. #10 - Look for limitations. They can help you improve design.
  • 50. Weighted Two-level Relationship Charts … Chord. Number of people travelling between various stations
  • 51. Taxonomy of Standard Data Visualizations • One dimensional charts • Grouped one dimensional charts • Two dimensional charts • Weighted Two dimensional charts • Grouped Two dimensional charts • Grouped Weighted Two dimensional charts • Multi-dimensional Charts • Tree Charts • Grouped Weighted Tree Charts • Multi-level Relationships Charts • Grouped Weighted Multi-level Relationships Charts • Two-level Relationships Charts • Grouped Weighted Two-level Relationships Charts
  • 52. Most Imp. Lesson: The same data can be visualized in many (MANY!) ways. Without exploring the data, you will end up visualizing all your data in pies, lines and bars.
  • 53. Implications
        Property                   | One Dimension | Two Dimension | Multi-Dimension | Relationship | Hierarchical | Geo Maps
        Dimension: Time            | N             | Y             | Y               | N            | N            | N
        Dimension: Group           | Y             | Y             | Y               | Y            | Y            | Y
        Fact: Weight               | N             | Y             | Y               | Y            | Y            | Y
        Group and Weight           | N             | Y             | Y               | Y            | Y            | N
        Fact: Many values          | Maybe         | Y             | Y               | Y            | Y            | Y
        Multiple levels / Zoomable | N             | N             | N               | Y            | Y            | Y
  • 54. List of Visual Encodings Source: http://complexdiagrams.com/properties
  • 55. Case Study #1: IPL Score Card. Let’s apply what we learnt.
  • 56. ESPNCricInfo Score Card
  • 57. Ball by ball Commentary • Per Batsman Statistics • Per Bowler Statistics • Fall of Wickets • Partnerships • Two innings • Pre-match: Toss, Playing 11, Location, Time • Post-match: Win, by how much, Man of the match • Second Innings: Current Run Rate, Required Run Rate, Target score
  • 58. Overs: Most important data-point 1. Overs = Time 2. One over: has_many balls, has_one bowler, has_many batsmen 3. Existence of batsmen across overs is partnerships 4. Partnerships and Fall of wickets are the same data set
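
A Python sketch of the data model described on the "Overs" slide: each ball belongs to an over, has one bowler and a pair of batsmen at the crease, and a partnership is a run of consecutive balls with the same pair. All names, scores and the `Ball` / `partnerships` identifiers are hypothetical; this is not the model used in the live score card.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Ball:
    over: int
    bowler: str
    batsmen: frozenset  # the two batsmen at the crease for this ball
    runs: int


# Hypothetical ball-by-ball data, in playing order.
balls = [
    Ball(1, "P", frozenset({"X", "Y"}), 1),
    Ball(1, "P", frozenset({"X", "Y"}), 4),
    Ball(2, "Q", frozenset({"X", "Y"}), 0),
    Ball(2, "Q", frozenset({"X", "Z"}), 2),  # Y is out, Z comes in
    Ball(3, "P", frozenset({"X", "Z"}), 6),
]


def partnerships(balls):
    """Derive partnerships from consecutive balls faced by the same pair."""
    result = []
    for pair, group in groupby(balls, key=lambda ball: ball.batsmen):
        group = list(group)
        result.append(
            {
                "batsmen": sorted(pair),
                "first_over": group[0].over,
                "last_over": group[-1].over,
                "runs": sum(ball.runs for ball in group),
            }
        )
    return result


stands = partnerships(balls)
```

Each partnership carries a start and end over, so it can be drawn as a bar on a Gantt-style chart with overs on the X-axis. The boundary between two partnerships is also where a wicket fell, which is why both can come from one data set.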
  • 59. Ball by ball Commentary
  • 60. Partnerships
  • 61. Combine the two • Weighted two-dimensional chart: Y-axis: Balls per over; X-axis: Overs + Bowlers • Gantt chart: Y-axis: Batsmen; X-axis: Overs + Bowlers • All other "zoomable" information is shown via interactions
  • 62. Putting it all together
  • 63. Let’s see it live http://www.firstpost.com/cricket-live-score/IPL/1-jun-2014-kolkata-knight-riders-versus-kings-xi-punjab/2173/175977
  • 64. Less reading. No scrolling. More awareness.
  • 65. Case Study #2: Election Counting Day. Let’s apply what we learnt.
  • 66. Election Counting Day. Data Set: • India has 50+ regional parties and two national parties. • During Election Counting Day (live), seats are either “Leading” or “Won”. Data Properties / Relationships: • Hierarchical relation between Alliance and Party • Won is confirmed. Leading is transient. What did readers want to know this election: • How badly would UPA lose • How big would the BJP victory be • How big the impact of AAP would be. Real world facts to inspire design: • BJP is a right wing party • AAP is left most, followed by UPA • The Sansad Hall is a semi-circle
  • 67. Election Counting Day. Data Set: • India has 50+ regional parties and two national parties. • During Election Counting Day (live), seats are either “Leading” or “Won”. Data Properties / Relationships: • Hierarchical relation between Alliance and Party • Won is confirmed. Leading is transient. What did readers want to know this election: • How badly would UPA lose • How big would the BJP victory be • How big the impact of AAP would be. Real world facts to inspire design: • BJP is a right wing party • AAP is left most, followed by UPA • The Sansad Hall is a semi-circle. Design annotations on the slide, in order: Group; Tree; Weight; Limitation (hence, all other parties can be clubbed into “Other”); Shape; Placement.
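
A Python sketch of the counting-day data shape: seats are tallied per alliance, "Won" and "Leading" are kept apart, and any party without a known alliance is clubbed into "Others". The party-to-alliance mapping reuses the placeholder parties A to E from the grouped chart example; the results feed, party "F" and the helper name `tally` are hypothetical.

```python
from collections import defaultdict

# Placeholder mapping taken from the grouped one-dimensional example.
ALLIANCE_OF = {"A": "NDA", "B": "NDA", "C": "NDA", "D": "UPA", "E": "Others"}

# Hypothetical live feed: one (party, status) pair per seat.
results = [
    ("A", "Won"),
    ("A", "Leading"),
    ("B", "Won"),
    ("D", "Leading"),
    ("E", "Won"),
    ("F", "Leading"),  # party with no known alliance
]


def tally(results, alliance_of):
    """Count Won and Leading seats per alliance."""
    counts = defaultdict(lambda: {"Won": 0, "Leading": 0})
    for party, status in results:
        # Parties outside the known alliances are clubbed into "Others".
        alliance = alliance_of.get(party, "Others")
        counts[alliance][status] += 1
    return {alliance: dict(c) for alliance, c in counts.items()}


seat_tally = tally(results, ALLIANCE_OF)
```

Keeping the two statuses as separate counts lets a chart draw confirmed seats solidly and transient leads more lightly, while the alliance level stays the top of the hierarchy.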
  • 68. Choosing the right Grouped Weighted Tree Chart: Packed Circle, Sunburst, Tree Rectangle, Tree Bar, Grouped Weighted Tree. Visual encoding: Position, Size, Colour
  • 69. Election Counting Day … Sunburst
  • 70. Sansad Chart
  • 71. Sansad Chart. Focus on what is most important. Alliance is more important than Party. We spent 200% more time reversing the hierarchy.
  • 72. Let’s see it live http://firstpost.com/election-results
  • 73. Summary 1. Study the properties and relationships of your data set. 2. Use your visual encodings wisely.
  • 74. Challenges in Data Journalism
  • 75. Steps in data journalism: 1. Data Collection (Govt. data, APIs, Scrape, Mine web, PDFs) 2. What’s the story (Clean the data, Model the data, Investigate) 3. Visualize (Design, Build) 4. Story (Write). Roles: Journalist, Developer, Designer. Technology is an integral part of data journalism.
  • 76. Formats in data journalism • Data Driven Stories: Day-to-day short stories derived from data. • Visualization App: Big apps to educate readers about large and important events, e.g. budget, election, etc.
  • 77. Format #1 - Data Driven Stories Source: http://factchecker.in/data-are-crimes-against-scheduled-castes-on-an-upswing-in-india/ Badaun Case —> Find legit Data —> Analyse —> Plot —> Story
  • 78. Format #2 - Visualization Apps
  • 79. Implication. Steps: 1. Data Collection (Govt. data, APIs, Scrape, Mine web, PDFs) 2. What’s the story (Clean the data, Model the data, Investigate) 3. Visualize (Design, Build) 4. Story (Write). Format: Visualization app (Journalist, Developer, Designer). Format: Data Driven Stories (Journalist).
  • 80. Challenges. High Level: 1. Quick access to appropriate data set 2. Quick analysis of this data 3. Consistently churn out neat charts, graphs and maps
  • 81. Challenges. High Level: 1. Quick access to appropriate data set 2. Quick analysis of this data 3. Consistently churn out neat charts, graphs and maps. Technical: 1. Live Data Modelling 2. SEO 3. How to handle high traffic
  • 82. Challenges. High Level: 1. Quick access to appropriate data set 2. Quick analysis of this data 3. Consistently churn out neat charts, graphs and maps. Technical: 1. Live Data Modelling 2. SEO 3. How to handle high traffic. From pykih perspective: 1. How do you consistently build beautiful, real-time visualizations?
  • 83. What we are doing about it In-house tool called "Backstage"
  • 84. Principles. #1 - Instead of waiting for data to be standardised, we want to make large-scale, high-velocity, multi-format data extraction durable. #2 - Instead of expecting data-users / journalists to have analytical skills, we are: • simplifying exploration of large data sets • automating extraction of metadata from data sets • simplifying assisted data standardisation • building tools for assisted analysis. #3 - Instead of expecting data-users / journalists to visualize data correctly, we are attempting to automate meta-data driven visualization. Example: If data is ordinal then colour automatically leverages saturation, and if data is nominal then colour is distinct. Other Experiments: • A data-driven blogging software • Configuration Editor. Demos and notes on the slide, in order: Demo the worker; Demo the census dashboard; ISO example; Demo NLP-based Date Standardiser; Story is in the outliers.
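
A Python sketch of the metadata-driven colour example from the Principles slide, assuming its two cases are ordinal data (one hue, increasing saturation) and nominal, i.e. categorical, data (distinct hues). It uses only the standard-library `colorsys` module; the function name `palette`, the default hue and the fixed brightness are hypothetical choices, not Backstage behaviour.

```python
import colorsys


def to_hex(red, green, blue):
    """Convert RGB floats in [0, 1] to a #rrggbb string."""
    return "#{:02x}{:02x}{:02x}".format(
        round(red * 255), round(green * 255), round(blue * 255)
    )


def palette(levels, kind, hue=0.6):
    """Assign one colour per level, driven by the kind of data.

    ordinal: a single hue whose saturation rises with the level's rank.
    nominal: evenly spaced hues, so no order is implied.
    """
    count = len(levels)
    colours = {}
    for i, level in enumerate(levels):
        if kind == "ordinal":
            rgb = colorsys.hsv_to_rgb(hue, (i + 1) / count, 0.9)
        elif kind == "nominal":
            rgb = colorsys.hsv_to_rgb(i / count, 0.65, 0.9)
        else:
            raise ValueError(f"unknown kind: {kind!r}")
        colours[level] = to_hex(*rgb)
    return colours


ordinal_colours = palette(["low", "medium", "high"], "ordinal")
nominal_colours = palette(["NDA", "UPA", "Others"], "nominal")
```

The caller only declares what kind of data a column holds; the choice between a saturation ramp and distinct hues then follows from that metadata rather than from a per-chart design decision.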
  • 85. Summary: Data Visualization company => Data and Visualization company. Effective Data Journalism leverages: NoSQL, memory-based databases, NLP, OLAP modelling, free-text search, statistics, etc.
  • 86. We are at @pykih. Fun fact: The word pykih came to us in a CAPTCHA. That’s the day we decided that as long as we do good work, it does not matter what we are called.