- 1. Visualizing Data Journalism. Ritvvij Parrikh, Founder, www.pykih.com. Fifth Elephant, Delhi Run-up Event, India Today Mediaplex, June 14, 2014
- 2. Introduction: Pykih is a data visualization company. We build custom visual representations of large data sets to make data actionable for readers. We have satisfied customers in six countries.
- 3. Agenda: • Data Viz. • Theory • Case Study 1 • Case Study 2 • Summary • Challenges in Data Journalism • What we are doing about it for ourselves
- 4. Data Visualization
- 6. Let’s explore the humble pie chart… Break the whole into parts. Percentage by party: E 38%, D 25%, C 20%, B 15%, A 2%. Data: One-dimensional. Visual Encoding: Area
- 7. New Terms • Dimension: Columns by which you group data. • Facts: Numbers that you can count, sum, average, etc. • Examples: seat count by party; seat count by party and state. • Visual Encoding: Area, Position, Colour, Length, Thickness, etc.
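
To make the two terms concrete, here is a minimal Python sketch. The rows and the `sum_fact_by` helper are hypothetical, invented only for illustration; the sketch sums one fact (seat count) first by one dimension and then by two.

```python
from collections import defaultdict

# Hypothetical rows: two dimensions (party, state) and one fact (seats).
rows = [
    {"party": "A", "state": "X", "seats": 3},
    {"party": "A", "state": "Y", "seats": 2},
    {"party": "B", "state": "X", "seats": 4},
    {"party": "B", "state": "Y", "seats": 1},
]

def sum_fact_by(rows, dimensions, fact):
    """Sum a numeric fact, grouped by one or more dimension columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[fact]
    return dict(totals)

# "Seat count by party" and "seat count by party and state".
seats_by_party = sum_fact_by(rows, ["party"], "seats")
seats_by_party_and_state = sum_fact_by(rows, ["party", "state"], "seats")
print(seats_by_party)  # {('A',): 5, ('B',): 5}
```

The dimensions decide how many groups you get; the fact is the number that ends up being drawn.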
- 8. One-dimensional Charts: A pie is a one-dimensional chart.
- 9. One-dimensional Charts … A pie could just as well have been an arbitrary shape divided by percentage.
- 10. One-dimensional Charts … Pie, Amoeba, Percentage Rectangle, Donut, Percentage Triangle, Bubble, Election Donut, Funnel, Percentage Bar, Percentage Column. #1 - The same data can be visualized in many (MANY!) different ways.
- 12. One-dimensional Charts … What is wrong here? Problems: • Colour communicates no data • 3D communicates no data. Source: thehindu.com
- 13. One-dimensional Charts … Problems: • Colour communicates no data • 3D communicates no data. Source: thehindu.com. #2 - Your goal is to communicate data. Wrong use of visual encoding confuses.
- 15. One-dimensional Charts … What is wrong here? Problems: • Colour • Too many values. Too cluttered. Source: firstpost.com
- 16. One-dimensional Charts … Problems: • Colour • Too many values. Too cluttered. Source: firstpost.com. #3 - AREA encoding is useful for only a few values, after which it becomes unreadable.
- 17. One-dimensional Charts … Solution to the problem of restricted space? Create a custom chart.
- 18. New Data Set • One-dimensional: Seat count by party • Grouped one-dimensional: Seat count by party grouped by alliance
- 21. Grouped One-dimensional Charts: Group various bubbles by colours. Party (alliance), percentage: A (NDA) 38%, B (NDA) 25%, C (NDA) 20%, D (UPA) 15%, E (Others) 2%. #4 - You can always fit in an extra dimension (GROUP) in charts using colour.
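
A rough Python sketch of lesson #4, using the party/alliance table from the slide. The palette values and the `colour_by_group` helper are assumptions made for illustration; the point is only that every member of a group shares one colour.

```python
# Table from the slide: (party, alliance, percentage).
parties = [
    ("A", "NDA", 38), ("B", "NDA", 25), ("C", "NDA", 20),
    ("D", "UPA", 15), ("E", "Others", 2),
]

# Hypothetical palette; any set of clearly distinct colours would do.
PALETTE = ["#ff9933", "#3366cc", "#999999", "#339966"]

def colour_by_group(rows):
    """Assign one colour per group, in order of first appearance."""
    group_colour = {}
    coloured = []
    for name, group, value in rows:
        if group not in group_colour:
            group_colour[group] = PALETTE[len(group_colour) % len(PALETTE)]
        coloured.append(
            {"party": name, "value": value, "colour": group_colour[group]}
        )
    return coloured

bubbles = colour_by_group(parties)
for bubble in bubbles:
    print(bubble)
```

Size still carries the percentage; colour carries the extra dimension without needing another axis.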
- 22. New Data Set • One-dimensional: Seat count by party • Grouped one-dimensional: Seat count by party grouped by alliance • Two-dimensional: Which party won in which constituency
- 23. Two-dimensional Charts: Plot two data points. Party, constituency: (A, Z), (B, Y), (C, X), (D, V), (E, W). Visual encoding: Position, Length
- 24. Two-dimensional Charts… Connect the dots and you get a line chart.
- 25. Two-dimensional Charts… Scatter, Line, Area, Bar, Column, Spider. All these charts require the same data. #5 - The number of dimensions in the data determines which chart to use.
- 26. New Data Set • One-dimensional: Seat count by party • Grouped one-dimensional: Seat count by party grouped by alliance • Two-dimensional: Which party won in which constituency • Weighted two-dimensional: Which party won in which constituency by what vote margin
- 27. Weighted Two-dimensional Charts This is a 2d chart.
- 28. Weighted Two-dimensional Charts … Let’s add a weight to it, so each point now has three values. X axis, Y axis, weight: (A, Z, 40), (B, Y, 20), (C, X, 1), (D, V, 300), (E, W, 60). Visual encoding: Position, Length, Area
- 29. Weighted Two-dimensional Charts … Weighted Scatter, Circle Comparison. All these charts require the same data. #6 - You can always fit in an extra fact (WEIGHT) in charts using size.
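
Since the weight is encoded as area, the radius of each circle should grow with the square root of the weight, not with the weight itself. A small Python sketch using the weights from the example table; `radius_for` and the 50-unit maximum radius are assumptions for illustration.

```python
import math

# (x, y, weight) rows from the slide's example table.
points = [("A", "Z", 40), ("B", "Y", 20), ("C", "X", 1),
          ("D", "V", 300), ("E", "W", 60)]

def radius_for(weight, max_weight, max_radius=50.0):
    """Radius chosen so that circle AREA is proportional to weight."""
    return max_radius * math.sqrt(weight / max_weight)

max_weight = max(weight for _, _, weight in points)
radii = {x: radius_for(weight, max_weight) for x, _, weight in points}

for x, radius in radii.items():
    print(x, round(radius, 1))
```

Scaling the radius linearly instead would make the weight-300 point look far more than 7.5 times as large as the weight-40 point.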
- 30. New Data Set • One-dimensional: Seat count by party • Grouped one-dimensional: Seat count by party grouped by alliance • Two-dimensional: Which party won in which constituency • Weighted two-dimensional: Which party won in which constituency by what vote margin • Grouped weighted two-dimensional: Which party won in which constituency by what vote margin grouped by alliance
- 31. Grouped Weighted Two-dimensional Charts: Grouped Weighted Scatter, Grouped Circle Comparison. Visual encoding: Position, Length, Area, Colour
- 32. Multi-series Two-dimensional Charts … Range, Gantt, Multi-series Line, Group Column, Stack Column, Group Stack Column, Stack Area, Stack Percentage Area. Add more dimensions in creative ways.
- 35. Multi-series Two-dimensional Charts … Is the equities rally percolating into the broader market? Source: livemint.com. What is right and wrong here? Good parts: • Y axis starts from 97 instead of 0. Bad parts: • The BSE Small-cap line is not visible, and that’s the story. #7 - The purpose of a line chart is to show trend. Focus on it.
- 38. Multi-series Two-dimensional Charts … Does IMF wear rose-tinted glasses? Source: livemint.com. What is wrong here? Problems: • Cannot find the IMF line. #8 - Highlight the story for the user. Use colour to highlight, not confuse.
- 39. New Data Set: All the data we encountered so far was relational (RDBMS-style), i.e. it could fit in a spreadsheet (rows and columns). Sometimes data is more complex. It can have “relationships”. Types of relationships: • Hierarchy / Tree • Multi-level relationships
- 40. Tree Charts { "name": "root", "children": [ { "name": "A", "children": [ {"name": "A1"}, {"name": "A2"}, {"name": "A3"}, {"name": "A4"} ] }, … ] } Visual encoding: Position
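
A minimal Python sketch of how such a nested `name`/`children` structure can be walked before it is drawn. Only the part of the tree visible on the slide is used, and the `walk` helper is a hypothetical name, not something from the talk.

```python
# Only the part of the tree shown on the slide; the rest is elided there.
tree = {"name": "root", "children": [
    {"name": "A", "children": [
        {"name": "A1"}, {"name": "A2"}, {"name": "A3"}, {"name": "A4"},
    ]},
]}

def walk(node, depth=0):
    """Yield (name, depth, is_leaf) for every node, parents first."""
    children = node.get("children", [])
    yield node["name"], depth, not children
    for child in children:
        yield from walk(child, depth + 1)

nodes = list(walk(tree))
leaves = [name for name, _, is_leaf in nodes if is_leaf]

# Indentation stands in for the position encoding of a dendrogram.
for name, depth, _ in nodes:
    print("  " * depth + name)
```

Depth and leaf order are essentially all a tree layout needs: depth gives one coordinate and the order of the leaves gives the other.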
- 41. Tree Charts Dendrogram Circular Dendrogram
- 42. Grouped Weighted Tree Charts: Packed Circle, Sunburst, Tree Rectangle, Tree Bar, Grouped Weighted Tree. Visual encoding: Position, Size, Colour
- 43. Grouped Weighted Tree Charts: Sunburst. Visual encoding: Position, Size, Colour
- 44. Grouped Multi-level Relationship Charts { "nodes": [ {"name": "A", "group": "G1"}, {"name": "B", "group": "G2"}, … ], "relations": [ {"from": "A", "to": "B"}, {"from": "A", "to": "C"}, … ] } Visual encoding: Position
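
A short Python sketch of working with a nodes-and-relations structure like this one. Node C's group is not shown on the slide, so the value used here is an assumption, and both helper functions are hypothetical names.

```python
graph = {
    "nodes": [
        {"name": "A", "group": "G1"},
        {"name": "B", "group": "G2"},
        {"name": "C", "group": "G2"},  # assumed: C's group is not on the slide
    ],
    "relations": [
        {"from": "A", "to": "B"},
        {"from": "A", "to": "C"},
    ],
}

def build_adjacency(graph):
    """Map each node name to the list of nodes it points to."""
    adjacency = {node["name"]: [] for node in graph["nodes"]}
    for relation in graph["relations"]:
        adjacency[relation["from"]].append(relation["to"])
    return adjacency

def count_cross_group_links(graph):
    """Count relations whose two ends belong to different groups."""
    group_of = {node["name"]: node["group"] for node in graph["nodes"]}
    return sum(
        group_of[r["from"]] != group_of[r["to"]] for r in graph["relations"]
    )

adjacency = build_adjacency(graph)
cross_links = count_cross_group_links(graph)
print(adjacency, cross_links)
```

The group field is what a layout would colour by; the relations are what it would draw as links.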
- 45. Grouped Multi-level Relationship Charts: Graph, Collapsible Graph, Hive. #9 - Look for relationships across data sets.
- 46. Weighted Grouped Multi-level Relationship Charts: Sankey. Visual encoding: Position, Colour, Size
- 47. Case: Mumbai Local Fare Chart. A fare exists for travel between stations "A" and "B". Hence, it is a relationship chart.
- 48. Case: Mumbai Local Fare Chart: Matrix, Half Matrix. [ {"node1": "A", "node2": "B", "weight": 300}, {"node1": "A", "node2": "C", "weight": 900}, … ]
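
A hedged Python sketch of turning such a list of weighted pairs into a matrix and a half matrix. The B to C row is a made-up addition, and the sketch assumes a fare is the same in both directions, which is why half of the matrix is enough.

```python
fares = [
    {"node1": "A", "node2": "B", "weight": 300},
    {"node1": "A", "node2": "C", "weight": 900},
    {"node1": "B", "node2": "C", "weight": 600},  # hypothetical extra row
]

def full_matrix(edges):
    """Build a symmetric station-by-station matrix from weighted pairs."""
    stations = sorted({e["node1"] for e in edges} | {e["node2"] for e in edges})
    index = {station: i for i, station in enumerate(stations)}
    matrix = [[0] * len(stations) for _ in stations]
    for e in edges:
        i, j = index[e["node1"]], index[e["node2"]]
        # Assumption: the fare A -> B equals the fare B -> A.
        matrix[i][j] = matrix[j][i] = e["weight"]
    return stations, matrix

def half_matrix(matrix):
    """Keep only the cells below the diagonal; the rest is redundant."""
    return [row[:i] for i, row in enumerate(matrix)]

stations, matrix = full_matrix(fares)
half = half_matrix(matrix)
print(stations)
print(half)
```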
- 49. Case: Mumbai Local Fare Chart. #10 - Look for limitations. They can help you improve design.
- 50. Weighted Two-level Relationship Charts … Chord: number of people travelling between various stations
- 51. Taxonomy of Standard Data Visualizations • One-dimensional charts • Grouped one-dimensional charts • Two-dimensional charts • Weighted two-dimensional charts • Grouped two-dimensional charts • Grouped weighted two-dimensional charts • Multi-dimensional charts • Tree charts • Grouped weighted tree charts • Multi-level relationship charts • Grouped weighted multi-level relationship charts • Two-level relationship charts • Grouped weighted two-level relationship charts
- 52. Most Important Lesson: The same data can be visualized in many (MANY!) ways. Without exploring the data, you will end up visualizing all your data in pies, lines and bars.
- 53. Implications

  | | One Dimension | Two Dimension | Multi-Dimension | Relationship | Hierarchical | Geo Maps |
  |---|---|---|---|---|---|---|
  | Dimension: Time | N | Y | Y | N | N | N |
  | Dimension: Group | Y | Y | Y | Y | Y | Y |
  | Fact: Weight | N | Y | Y | Y | Y | Y |
  | Group and Weight | N | Y | Y | Y | Y | N |
  | Fact: Many values | Maybe | Y | Y | Y | Y | Y |
  | Multiple levels / Zoomable | N | N | N | Y | Y | Y |
- 54. List of Visual Encodings Source: http://complexdiagrams.com/properties
- 55. Case Study #1: IPL Score Card. Let’s apply what we learnt.
- 56. ESPNCricInfo Score Card
- 57. Ball-by-ball Commentary • Per Batsman Statistics • Per Bowler Statistics • Fall of Wickets • Partnerships • Two innings • Pre-match: Toss, Playing 11, Location, Time • Post-match: Win, by how much, Man of the match • Second Innings: Current Run Rate, Required Run Rate, Target score
- 58. Overs: Most important data point. 1. Overs = Time. 2. One over: has_many balls, has_one bowler, has_many batsmen. 3. The presence of batsmen across overs gives partnerships. 4. Partnerships and fall of wickets are the same data set.
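
The over-centred model above can be sketched in Python roughly as follows. The `Over` class, the player names and the ball-by-ball runs are all hypothetical; the sketch only shows that the range of overs each batsman appears in falls out of the per-over data.

```python
from dataclasses import dataclass

@dataclass
class Over:
    number: int     # overs act as the time axis
    bowler: str     # has_one bowler
    balls: list     # has_many balls (runs scored off each ball here)
    batsmen: list   # has_many batsmen

# Hypothetical innings fragment; a wicket falls at the end of over 2.
overs = [
    Over(1, "Bowler P", [0, 1, 4, 0, 0, 1], ["Batsman 1", "Batsman 2"]),
    Over(2, "Bowler Q", [1, 0, 0, 6, 0, 0], ["Batsman 1", "Batsman 2"]),
    Over(3, "Bowler P", [0, 0, 2, 1, 0, 0], ["Batsman 2", "Batsman 3"]),
]

def batsman_spans(overs):
    """Return the first and last over in which each batsman appears."""
    spans = {}
    for over in overs:
        for name in over.batsmen:
            first, _ = spans.get(name, (over.number, over.number))
            spans[name] = (first, over.number)
    return spans

spans = batsman_spans(overs)
runs_per_over = {over.number: sum(over.balls) for over in overs}
print(spans)
print(runs_per_over)
```

Where two spans overlap there is a partnership, and where a span ends a wicket has fallen, which is why the two can be treated as one data set.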
- 59. Ball by ball Commentary
- 60. Partnerships
- 61. Combine the two • Weighted two-dimensional chart: Y-axis: Balls per over; X-axis: Overs + Bowlers • Gantt chart: Y-axis: Batsmen; X-axis: Overs + Bowlers • All other “zoomable” information is shown via interactions
- 62. Putting it all together
- 63. Let’s see it live http://www.firstpost.com/cricket-live-score/IPL/1-jun-2014-kolkata-knight-riders-versus-kings-xi-punjab/2173/175977
- 64. Less reading. No scrolling. More awareness.
- 65. Case Study #2: Election Counting Day. Let’s apply what we learnt.
- 67. Election Counting Day. Data Set: • India has 50+ regional parties and two national parties. • During Election Counting Day (live), seats are either “Leading” or “Won”. Data Properties / Relationships: • Hierarchical relation between Alliance and Party • Won is confirmed. Leading is transient. What readers wanted to know this election: • How badly would UPA lose • How big would the BJP victory be • How big would the impact of AAP be. Real-world facts to inspire design: • BJP is a right-wing party • AAP is leftmost, followed by UPA • The Sansad Hall is a semi-circle. Design implications: -> Group -> Tree -> Weight -> Limitation (hence, all other parties can be clubbed into “Others”) -> Shape -> Placement
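
A minimal Python sketch of the grouping described above: roll party tallies up into their alliance and club everything outside the groups readers care about into "Others". All seat numbers and the P1 to P5 party names are invented for illustration, and `roll_up` is a hypothetical helper, not Pykih's code.

```python
# Invented live tallies; only the alliance names come from the talk.
tallies = [
    {"party": "P1", "alliance": "NDA", "won": 10, "leading": 5},
    {"party": "P2", "alliance": "NDA", "won": 2, "leading": 1},
    {"party": "P3", "alliance": "UPA", "won": 3, "leading": 2},
    {"party": "AAP", "alliance": "AAP", "won": 1, "leading": 0},
    {"party": "P4", "alliance": "Regional X", "won": 1, "leading": 0},
    {"party": "P5", "alliance": "Regional Y", "won": 0, "leading": 1},
]

# Assumed focus set: the groups readers were asking about.
FOCUS = {"NDA", "UPA", "AAP"}

def roll_up(tallies, focus):
    """Build an alliance -> party tree, clubbing the rest into 'Others'."""
    tree = {}
    for row in tallies:
        alliance = row["alliance"] if row["alliance"] in focus else "Others"
        node = tree.setdefault(alliance, {"won": 0, "leading": 0, "parties": {}})
        node["won"] += row["won"]          # confirmed seats
        node["leading"] += row["leading"]  # transient seats
        node["parties"][row["party"]] = {
            "won": row["won"], "leading": row["leading"],
        }
    return tree

result = roll_up(tallies, FOCUS)
for alliance, node in result.items():
    print(alliance, node["won"], node["leading"])
```

Keeping "won" and "leading" as separate counts lets a chart draw confirmed and transient seats differently.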
- 68. Choosing the right Grouped Weighted Tree Chart: Packed Circle, Sunburst, Tree Rectangle, Tree Bar, Grouped Weighted Tree. Visual encoding: Position, Size, Colour
- 69. Election Counting Day … Sunburst
- 70. Sansad Chart
- 71. Sansad Chart: Focus on what is most important. Alliance is more important than Party. We spent 200% more time reversing the hierarchy.
- 72. Let’s see it live http://firstpost.com/election-results
- 73. Summary: 1. Study the properties and relationships of your data set. 2. Use your visual encodings wisely.
- 74. Challenges in Data Journalism
- 75. Steps in data journalism: 1. Data Collection (Govt. data, APIs, Scrape, Mine web, PDFs) 2. What’s the story (Clean the data, Model the data, Investigate) 3. Visualize (Design, Build) 4. Story (Write). Roles: Journalist, Developer, Designer. Technology is an integral part of data journalism.
- 76. Formats in data journalism • Data Driven Stories: Day-to-day short stories derived from data • Visualization App: Big apps to educate readers about a large and important event, e.g. budget, election, etc.
- 77. Format #1 - Data Driven Stories. Source: http://factchecker.in/data-are-crimes-against-scheduled-castes-on-an-upswing-in-india/ Badaun Case -> Find legit Data -> Analyse -> Plot -> Story
- 78. Format #2 - Visualization Apps
- 79. Implication: 1. Data Collection (Govt. data, APIs, Scrape, Mine web, PDFs) 2. What’s the story (Clean the data, Model the data, Investigate) 3. Visualize (Design, Build) 4. Story (Write). Format: Visualization app: Journalist, Developer, Designer. Format: Data Driven Stories: Journalist.
- 82. Challenges. High Level: 1. Quick access to appropriate data set 2. Quick analysis of this data 3. Consistently churn out neat charts, graphs and maps. Technical: 1. Live Data Modelling 2. SEO 3. How to handle high traffic. From pykih perspective: 1. How do you consistently build beautiful, real-time visualizations?
- 83. What we are doing about it In-house tool called "Backstage"
- 84. Principles: #1 - Instead of waiting for data to be standardised, we want to make large-scale, high-velocity, multi-format data extraction durable. #2 - Instead of expecting data-users / journalists to have analytical skills, we are: • simplifying exploration of large data sets • automating extraction of metadata from data sets • simplifying assisted data standardisation • building tools for assisted analysis. #3 - Instead of expecting data-users / journalists to visualize data correctly, we are attempting to automate metadata-driven visualization. Example: If data is ordinal, colour automatically leverages saturation; if data is nominal (categorical), colours are distinct. Other Experiments • A data-driven blogging software • Configuration Editor. -> Demo the worker -> Demo the census dashboard -> ISO example -> Demo NLP-based Date Standardiser -> Story is in the outliers
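
The colour example in principle #3 contrasts saturation for ordered data with distinct colours for the other case, which this sketch assumes to be unordered (nominal) data. It is a rough Python illustration of the idea and not Backstage's implementation; the `palette` function, the default hue and the level names are all assumptions.

```python
import colorsys

def to_hex(r, g, b):
    """Convert RGB floats in [0, 1] to a #rrggbb string."""
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255)
    )

def palette(levels, kind, hue=0.6):
    """Return one colour per level, chosen from the data's metadata.

    ordinal: a single hue whose saturation rises with rank.
    nominal: evenly spaced hues, so every level looks distinct.
    """
    count = len(levels)
    colours = {}
    for i, level in enumerate(levels):
        if kind == "ordinal":
            rgb = colorsys.hsv_to_rgb(hue, (i + 1) / count, 0.9)
        elif kind == "nominal":
            rgb = colorsys.hsv_to_rgb(i / count, 0.8, 0.9)
        else:
            raise ValueError("kind must be 'ordinal' or 'nominal'")
        colours[level] = to_hex(*rgb)
    return colours

ordinal_colours = palette(["low", "medium", "high"], "ordinal")
nominal_colours = palette(["NDA", "UPA", "Others"], "nominal")
print(ordinal_colours)
print(nominal_colours)
```

The caller only declares what kind of data a column holds; the choice of encoding follows from that metadata rather than from the journalist's judgement.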
- 85. Summary: Data Visualization company => Data and Visualization company. Effective Data Journalism leverages (you will end up using): NoSQL, memory-based databases, NLP, OLAP modelling, free-text search, statistics, etc.
- 86. We are at @pykih. Fun fact: The word pykih came to us in a CAPTCHA. That’s the day we decided that as long as we do good work, it does not matter what we are called.

