This talk was given to a Data Visualization course, which is part of the Master of Science in Analytics program at the Northwestern School of Engineering.
It walks through:
- Why to visualize data
- A common (linear) approach to data problems
- A look at a problem in an ambiguous world, and why the linear approach does not always get you to the ideal end point
- A better (iterative) approach
- How to get started on a project through the important practice of brainstorming
- An informal project example. In this example, an iterative approach to the visualization helped the creator gain new insights that changed her story's focus altogether.
- A case study of a project done for Procter & Gamble. In this example, an iterative approach redirected us from a more complicated network graph of the company (which we initially assumed would be the end result) to displaying data in a simpler way (e.g., bar charts), which suited the client better.
- Another case study. In this example, an iterative approach led us to create a less obvious / more creative visualization that stressed the things that were most important to the client. Nearly every iteration step (all of which were shown to the client) is shown in the slides.
It ends with a reminder that doing is better than planning. You really can't learn what your ideal end-product will be until you get started; while working, one must constantly ask questions and gain feedback, and refine the approach accordingly.
One of the biggest challenges in the data age is overcoming the problematic belief that data has all the answers. The truth is – data is a resource, not a solution. In order to extract valuable and actionable insights, it is necessary to ask and re-ask certain questions. This talk is about figuring out what these questions are and exposes some of the limitations of common, and seemingly intuitive, approaches to data problems. As an alternative, I introduce the concept of using human-centered design principles and an iterative process to approach what you do with Big (and small) Data. As exemplars, I will walk through a quick informal example and a real Datascope client project to highlight the flexibility and speed of these techniques.
Slides from a Brighttalk given 09-10-15. Problem-first thinking, what's actually exciting about Big Data, and how to get there.
Video is here: https://www.brighttalk.com/webcast/9059/169665
Natural Language Processing and Search Intent Understanding C3 Conductor 2019... (Dawn Anderson MSc DigM)
This talk looks at the ways in which search engines are evolving to understand further the nuance of linguistics in natural language processing and in understanding searcher intent.
What is BERT? It is Google's neural network-based technique for natural language processing (NLP) pre-training. BERT stands for Bidirectional Encoder Representations from Transformers. It was open-sourced last year and written about in more detail on the Google AI blog. In this presentation we look at what Google BERT means for SEOs and marketers and how Google BERT is and will continue to impact the search landscape. We also look at the back story to Google BERT, including transformers and natural language understanding and computational linguistics.
Whilst passage indexing may seem like a small tweak to search ranking, it is potentially much more symptomatic of the beginning of a fundamental shift in the way that search engines understand unstructured content, determine relevance in natural language, and rank efficiently and effectively.
It could also be a means of assessing overall quality of content and a means of dynamic index pruning. We will look at the landscape, and also provide some takeaways for brands and business owners looking to improve quality in unstructured content overall in this fast changing landscape.
Google BERT is many things, including the name of a Google Search algorithm update. There is lots of confusion as to what Google BERT is, where it has come from and what SEOs and marketers need to do about it (if anything). Here we look at the solutions the introduction of Google BERT by Google seeks to provide and explore the background to natural language processing and computational linguistics.
Solving for ambiguity: what the data literate can learn from the design process (Dean Malmgren)
This was presented during Innovation Days at NORC on 2014.02.25
Regardless of whether you call it "business intelligence", "big data", "analytics" or just plain old "math", we have many tried and true techniques for dealing with uncertainty. But ambiguity is a separate matter and, at least in my experience, is the hardest part of creating value from data. During this talk, I will illustrate how the design process can be used to solve ambiguous problems by drawing on projects we've done at Datascope.
Data Science and Design: Fickleness and How We Solve It (bo_p)
This is a talk I gave on Jan 31, 2014 as part of Harvard's Institute for Applied Computational Science (IACS) seminar series.
Abstract: When solving problems, data scientists often encounter added layers of complexity when the problems to be solved are not well defined, and their solutions unclear. In these cases, standard, more straightforward approaches fall short, as they are not amenable to vague problems, and are thus not guaranteed to reliably produce useful results. At Datascope Analytics, we adopt methodologies from the design community and use a "continuous feedback loop" to iteratively improve dashboards, algorithms, and data sources to ensure that the resulting tool will be useful and well received. During this talk, I will illustrate our approach by sharing a detailed example from one of our projects.
NLP & Machine Learning - An Introductory Talk (Vijay Ganti)
An Introductory talk with the goal of getting people started on the NLP/ML journey. A practitioner's perspective. Code that makes it real and accessible.
Tools and Resources for Transition from Libraries to Wider Community Use Cent... (CILIP)
Leon Cruickshank's (Professor of Design and Creative Exchange, Lancaster University) presentation to the CILIP 2017 Conference in Manchester #CILIPConf17
This is an interactive session to introduce a collection of freely available tools and resources enabling the transition from libraries into wider community use centres. These tools were co-designed by a group of 20 librarians in Lancashire; this co-design process brought together expertise from junior staff to Julie Bell, the head of libraries for Lancashire. They worked in close collaboration with design researchers from Lancaster University, funded by the Leapfrog project (www.Leapfrog.tools). Leapfrog is a £1.2 million project that seeks to transform public engagement by design.
quant skillz beyond wall st: deriving value from large, non-financial datasets (Dean Malmgren)
This presentation was prepared for a talk on 2014.08.06 at the NYC Algorithmic Trading meetup (http://www.meetup.com/NYC-Algorithmic-Trading/events/197749772/)
Regardless of whether you call it "data science", "business intelligence", "analytics", "statistics" or just plain old "math", we have many tried and true techniques for dealing with uncertainty (particularly in quantitative finance). But ambiguity—what problem do we need to solve in the first place?—is a separate matter and, at least in my experience, is the hardest part of creating value from data. During this talk, I'll discuss how we address ambiguity by giving a guided tour of some of our client projects, such as how to reduce legal e-discovery costs by 99% (hint: supervised binary classification of text documents) or how to assemble project teams on emerging R&D opportunities in a multinational organization (hint: unsupervised classification of employee expertise).
STEM in Libraries: A Marketing Communications Toolkit (laurieputnam)
These tips will help you promote your STEM programs today, and over time, build the perception of libraries as hubs for STEM learning in your community. For the complete STEM in Libraries Marketing Communications Toolkit, see http://j.mp/STEMinLibrariesMarComToolkit
Reaching Peak Performance for Knowledge Workers (Richard Thripp)
A presentation about attention- and time-management for "knowledge workers": people who solve problems and approach problems creatively, and who deal primarily in knowledge (mental labor) rather than physical (manual) labor.
Prepared and presented by Richard Thripp of Toastmasters of Port Orange, FL on 2015-05-20, in fulfillment of Competent Communication Project #6: "Vocal Variety" in the Toastmasters curriculum.
Metaphic or the art of looking another way. (Suresh Manian)
For all intents and purposes, we are our words. And verbs and adjectives capture actions and sentiments better than any other tool. Metaphic is premised on the belief that a grammar book and a calculator are all you really need to make sense of web search and social media chatter, apart from all text, in general.
There's an old joke that goes, “The two hardest things in programming are cache invalidation, naming things, and off-by-one errors.” In this talk, we'll discuss the subtle art of naming things – a practice we do every day but rarely talk about.
Design Sprints for Awesome Teams: Workshop at Museums & the Web 2017 (Dana Mitroff Silvers)
Slides from "Design Sprints for Awesome Teams: Running Design Sprints for Rapid Digital Product Development" at the 2017 Museums and the Web conference in Cleveland, Ohio.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Datascope: Designing your Data Viz - The (Iterative) Process
1. Mollie Pettit • @MollzMP
designing your data viz - the process
for Data Visualization Course • MSiA • Northwestern School of Engineering • May 18, 2017
4. agenda
- why visualize data?
- a common approach to data problems
- trying to plan in an ambiguous world
- a better (iterative) approach
- getting started (the importance of brainstorming)
- case studies (three examples)
17. identify the best locations to plant new trees
unclear problems
the data world
18. how many?
what kinds of trees?
move old trees?
replace old trees?
identify the best locations to plant new trees
unclear problems
the data world
19. aesthetically pleasing?
maximize growth?
increase foliage?
offset CO2 emissions?
how many?
what kinds of trees?
move old trees?
replace old trees?
identify the best locations to plant new trees
unclear problems
the data world
20. ecosystem consequences
new bug species
toxins in ground soil
drought starts
aesthetically pleasing?
maximize growth?
increase foliage?
offset CO2 emissions?
how many?
what kinds of trees?
move old trees?
replace old trees?
identify the best locations to plant new trees
new data emerge
unclear problems
the data world
21. aesthetically pleasing?
maximize growth?
increase foliage?
offset CO2 emissions?
how many?
what kinds of trees?
move old trees?
replace old trees?
identify the best locations to plant new trees
new data emerge
ecosystem consequences
new bug species
toxins in ground soil
drought starts
and new solutions!
build a bike path
add benches
plant bushes
introduce ladybugs
unclear problems
the data world
22. identify the best locations to plant new trees
why trees?
unclear problems
the data world
30. - defer judgement (no blocking of ideas)
brainstorm
https://challenges.openideo.com/blog/seven-tips-on-better-brainstorming
31. - defer judgement (no blocking of ideas)
- encourage WILD ideas
brainstorm
https://challenges.openideo.com/blog/seven-tips-on-better-brainstorming
32. - defer judgement (no blocking of ideas)
- encourage WILD ideas
- build on the ideas of others
brainstorm
https://challenges.openideo.com/blog/seven-tips-on-better-brainstorming
33. - defer judgement (no blocking of ideas)
- encourage WILD ideas
- build on the ideas of others
brainstorm
https://challenges.openideo.com/blog/seven-tips-on-better-brainstorming
34. - defer judgement (no blocking of ideas)
- encourage WILD ideas
- build on the ideas of others
- stay focused on the topic
brainstorm
https://challenges.openideo.com/blog/seven-tips-on-better-brainstorming
35. - defer judgement (no blocking of ideas)
- encourage WILD ideas
- build on the ideas of others
- stay focused on the topic
- one conversation at a time
brainstorm
https://challenges.openideo.com/blog/seven-tips-on-better-brainstorming
36. - defer judgement (no blocking of ideas)
- encourage WILD ideas
- build on the ideas of others
- stay focused on the topic
- one conversation at a time
- be visual
brainstorm
https://challenges.openideo.com/blog/seven-tips-on-better-brainstorming
37. - defer judgement (no blocking of ideas)
- encourage WILD ideas
- build on the ideas of others
- stay focused on the topic
- one conversation at a time
- be visual
- go for QUANTITY
brainstorm
https://challenges.openideo.com/blog/seven-tips-on-better-brainstorming
67. Lorem Ipsum: a narrative about blankets.
Author: Charlie Brown
Date: 31 Jan 2012
Lorem Ipsum is a dummy text used when typesetting or marking up documents. It has a
long history starting from the 1500s and is still used in digital millennium for typesetting
electronic documents, page designs, etc.
In itself, the original text of Lorem Ipsum might have been taken from an ancient Latin
book that was written about 50 BC. Nevertheless, Lorem Ipsum’s words have been
changed so they don’t read as a proper text.
Naturally, page designs that are made for text documents must contain some text rather
than placeholder dots or something else. However, should they contain proper English
words and sentences almost every reader will deliberately try to interpret it eventually,
missing the design itself.
However, a placeholder text must have a natural distribution of letters and punctuation or
otherwise the markup will look strange and unnatural. That’s what Lorem Ipsum helps to
achieve.
I would like to thank Peppermint Patty for her support on studying Lorem
Ipsum as well as the infinite wisdom of Linus van Pelt and his willingness to
use his blanket in my experiments.
data-driven expertise exploration
procter & gamble
74. what matters & why?
search engine
with relevance
metrics
demographics
human readable
expertise
summary
moving forward
procter & gamble
75. Lorem Ipsum: a narrative about blankets.
Author: Charlie Brown
Date: 31 Jan 2012
Lorem Ipsum is a dummy text used when typesetting or marking up documents. It has a
long history starting from the 1500s and is still used in digital millennium for typesetting
electronic documents, page designs, etc.
In itself, the original text of Lorem Ipsum might have been taken from an ancient Latin
book that was written about 50 BC. Nevertheless, Lorem Ipsum’s words have been
changed so they don’t read as a proper text.
Naturally, page designs that are made for text documents must contain some text rather
than placeholder dots or something else. However, should they contain proper English
words and sentences almost every reader will deliberately try to interpret it eventually,
missing the design itself.
However, a placeholder text must have a natural distribution of letters and punctuation or
otherwise the markup will look strange and unnatural. That’s what Lorem Ipsum helps to
achieve.
I would like to thank Peppermint Patty for her support on studying Lorem
Ipsum as well as the infinite wisdom of Linus van Pelt and his willingness to
use his blanket in my experiments.
data-driven expertise exploration
procter & gamble
79. procter & gamble
data-driven expertise exploration
– Kathie Felber, Technical CoP Leader
They delivered an innovative,
one-of-a-kind tool that I use
every day to increase
collaboration and better
understand our company
90. I think you learn
about computer
safety.
:)
When you are a
genis at
electronics
:)
Using code and
fixing and making
computers.
:)
91. science on a computer
programming / coding
how computers work
how to use computers
studying computers
a website, program, or game
typing / testing
using computers to solve problems
engineering
someone good at computers
internet safety
experiments / research / modeling
I like it!
class / learning / lessons
making apps, games, or websites
I don’t know
what’s inside a computer
-How we go about all data projects at Datascope
-not specific to data viz
-Case studies are data viz specific, but sometimes might speak more generally
Often in data science, you will get your hands on a new data set and will ask yourself…
-now that you have this data, what do you do with it?
-In the case of data visualization,
-What are the interesting questions to ask/answer?
-Why tell the story we’re trying to tell?
-Why display it one way over the other?
-Why visualize it in the first place?
-comes down to process and asking WHY questions.
-without thinking about the WHY behind what we’re doing
-There is a Limit to how much we can accomplish
-limit to how effective our end product can be
-share particular insights and conclusions; tell a story to your audience
-in a way that’s much easier to digest than table or written word.
Could be anything from telling a story in the NYT, or expressing information through an internal tool
obvious one
-most common
-often overlooked
-get to better questions and figure out next steps.
-same summary statistics
-by visualizing, can see these differences immediately and can figure out how you want to treat them.
-Great example of how visualizations can help you make better decisions faster.
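To make the "same summary statistics, very different data" point concrete, here is a small sketch of my own (not from the slides) using two of Anscombe's quartet datasets: their means, variances, and correlations nearly match, but a quick plot separates them immediately.

```python
# Minimal sketch: identical-looking summary statistics, obviously different plots.
# Values are two of Anscombe's quartet datasets (Anscombe, 1973).
import numpy as np
import matplotlib.pyplot as plt

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for name, y in [("dataset I", y1), ("dataset II", y2)]:
    # Nearly identical means, variances, and correlations with x...
    print(name, round(y.mean(), 2), round(y.var(ddof=1), 2),
          round(np.corrcoef(x, y)[0, 1], 3))

# ...but the structural difference is obvious the moment you plot them.
fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(8, 3))
axes[0].scatter(x, y1); axes[0].set_title("dataset I (roughly linear)")
axes[1].scatter(x, y2); axes[1].set_title("dataset II (curved)")
plt.show()
```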
An oversimplified depiction of a typical approach to data problems goes something like this….
You start with some dataset
based on data, identify the problem you want to solve
determine the solution you will take
communicate the results in some way, be it an interface or a report.
call this the “Linear” approach - makes a lot of sense on paper. (And wouldn’t it be nice if everything were so simple?)
Issue - requires a clearly defined problem from the onset.
Not taking into consideration other datasets that might have been useful to define the problem
-Additionally, when we start by focusing just on the data we currently have, it limits the kind of problems and solutions you end up exploring and pursuing.
-In real-life application, we rarely have such clearly defined problems.
-In reality, we need to allow room for the problem (and the solution) to morph and evolve.
For instance, take this challenge:
-identify the best place to plant new trees.
-This is a pretty ambiguous problem - so, let's make it more clearly defined
For instance, we can define “new trees” by figuring out details like
-how many,
-what kinds,
-whether we want to move or replace old ones, etc.
And we can define what we mean by “best locations” by asking
-best for whom and/or what, and in what timeframe.
-potential goal - bring more people to the river and so we want what is aesthetically pleasing and adds foliage / shade.
-Or focus on environmental impact with the goal of offsetting CO2 emissions.
-By considering whether the benefit is for the environment, people, or people 5 generations from now, we can set a clear actionable objective and if you want, you could spend a whole year (or more) refining your plan to maximize the exact impact you want.
But what happens when plans actually start going into action?
Typically, new data emerges or circumstances shift.
-All of a sudden we discover that there are unintended ecosystem consequences, scientists discover a new bug species, or a drought starts that we didn’t predict.
-new information is not inherently problematic, but it is an opportunity to find better solutions or augment current plans by allowing the solution to be redefined.
By locking in that our solution will be “new trees” too early,
-miss the opportunity for more impactful results, especially if we invest too heavily in this initial plan.
-Maybe to get people to use this space, we really need more benches and a bike path.
-And maybe for the environmental impact we want, we need to introduce ladybugs and plant bushes.
But why am I talking about trees in a talk about data viz process?
I’m using this example to illustrate how in every problem,
-there is a lot of room for interpretation and to make choices.
-From the onset, no one path is inherently better than another but context can make some solutions more appropriate for a given need.
-And not giving yourself the option to redefine this need restricts what comes next.
Better approach - iterative one
-Start with understanding of problem statement
-use this to generate lots of ideas about what problems to solve and how (sketches)
-From ideas, build low-investment prototypes (sketches)
-For evaluation/feedback - put crappy sketches in front of the person you're solving the problem for
-Based on feedback, find out what sucks about your ideas and generate new ideas to better define problem and possible solutions
-next set of prototypes (more sketches/wireframes)
-find out why these suck
-generate more ideas
-higher fidelity prototypes
-REPEAT and REPEAT
-Rapid iterations help you to hone in on what problems you should actually be solving and come up with more refined solutions that can actually address this need.
-By doing this process in quick spurts, you prevent locking in too early on one path and the crappiness falls away.
(1-4 week iterations)
-not new or unique to data science.
-You can see it expressed in lots of different ways (though spoken of in different terms).
-For instance, designers in the human-centered design community (like the folks at IDEO),
entrepreneurs advocating the lean startup,
and coders involved in agile programming all embrace practices tied to rapid iteration and frequent feedback.
Really you could even say it is rooted in the scientific method, where you hypothesize, experiment, and measure results, just done in overdrive, over and over again.
(Some line about this making us data scientists…)
REFERENCE IDEO <3
QUANTITY OVER QUALITY
To illustrate how the iterative process works I’ll now walk through a quick informal example.
-Last year, my coworker Jess worked on passion project that started with pretty ambiguous goals.
-New to Chicago - wanted to learn about the city, play with social media data, and write a blog post at the end.
The brainstorm started with thinking about
-different social media platforms,
-what data they contained,
-what questions could be asked of the data, and
-for what purpose were the questions being asked.
The project ended up starting as an attempt to map Instagram activity to find out
-which restaurants were popular among foodies (who take all those food porn ‘grams).
-As a quick exploration / sanity check - one day’s activity was plotted to a map.
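A sanity check like that can be only a few lines of code. This is an illustrative sketch, not the project's actual code: the file name instagram_one_day.csv and the lat/lon column names are assumptions.

```python
# Hypothetical quick look: scatter one day of geotagged posts to eyeball
# clusters and gaps before investing in anything fancier.
import pandas as pd
import matplotlib.pyplot as plt

posts = pd.read_csv("instagram_one_day.csv")  # assumed columns: 'lat', 'lon'
plt.figure(figsize=(6, 8))
plt.scatter(posts["lon"], posts["lat"], s=2, alpha=0.3)
plt.title("Geotagged posts, one day (sanity check)")
plt.xlabel("longitude"); plt.ylabel("latitude")
plt.show()
```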
Right away clear that there were:
-particular clusters of activity,
-but more interestingly, some distinct gaps.
Was something missing?
Or could there be some way to explain what she was observing?
She started to think about the patterns that might be found if one dug deeper.
-proxy for population density?
-Or are there areas that are over- and under-represented despite taking population density into account?
-If so, does it even matter?
-Maybe some folks just like Instagram and other don’t.
-Why should anyone care?
These questions seemed a lot more interesting than the initial question
-problem / solution evolved from wanting to figure out
-where to eat in a new city
-to…
-to who was (not) represented in Chicago’s Instagram activity.
-The reason this is interesting is that social media datasets are sometimes analyzed to find insights about a population; if the social media data itself is skewed, then any results will be skewed.
-more instagram data was collected
-necessary census data,
-iterations occurred on exactly how to combine, explore, and visualize the data.
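One simple way to frame that combination is a representation ratio per community area: the area's share of geotagged posts divided by its share of the population. This is a hedged sketch of the idea only, not the published analysis; the file names and columns are invented.

```python
# Illustrative sketch: compare each area's share of posts with its share of population.
import pandas as pd

posts = pd.read_csv("posts_by_area.csv")    # assumed columns: area, posts
census = pd.read_csv("census_by_area.csv")  # assumed columns: area, population

df = posts.merge(census, on="area")
df["post_share"] = df["posts"] / df["posts"].sum()
df["pop_share"] = df["population"] / df["population"].sum()
# > 1 means over-represented on Instagram relative to population; < 1 means under-represented.
df["representation"] = df["post_share"] / df["pop_share"]
print(df.sort_values("representation").head(10))
```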
In the end, it was found that Black and Hispanic populations were indeed underrepresented, while white and Asian populations were overrepresented.
-blog post - “Instagram’s Blind Spot: Chicago as a case study in the limitations of social media datasets”.
This blog was
-featured on Datascope’s blog,
-cross-posted on Partially Derivative’s blog,
-and spawned another post about “The Pros and Cons of Social Media Datasets” for Markets for Good.
If she had taken a linear approach to this project, it would have started with the question about restaurants
and resulted in recommendations.
By using an iterative approach, this evolved into a question that is far more interesting and important to explore.
I like this example because this is a case where using a non-linear iterative approach:
-gave the creator, Jess, the flexibility to discover what was to them the more interesting story,
-and tell that story accordingly through visualization.
-At this point, you might be thinking, “Well sure, of course this works for something like a blog post where you make all the decisions. What about the business world?”
-two case studies that I think highlight different points,
-both are real Datascope client projects.
A little backstory,
-P&G is a Fortune 50 multinational company with 110k+ employees.
-matrix company structure: employees are grouped in 10 product categories and under each product exist all sorts of functions (HR, legal, R&D, etc).
-In order to foster innovation, P&G also has communities of practice (CoP) where folks that share the same function but belong to different product verticals collaborate.
P&G had the goal of figuring out how to better train people in its R&D departments, internationally. We decided to identify individuals who would be the best nodes for transmitting information. To do this, we looked for domain expertise.
P&G is a big company with lots of people.
Our client was Kathie, a manager of an R&D CoP. (community of practice)
-Thing to know about Kathie is that she has a memory like a steel-trap Rolodex.
-If she meets you, she remembers your name, the last meal you had together, your home life, the last project you worked on, etc.
Somehow, she could actually use this skill to memorize
-the names,
-roles, and
-research history
of all 2,000 employees that she managed.
This extraordinary ability helped her perform the important task of fielding requests for subject matter experts from researchers and engineers working on projects.
But when Kathie was promoted from overseeing 2,000 employees…
…to 20,000.
-Not even Kathie with her phenomenal memory could retain all the information in her head to keep fostering collaboration. She needed a new system. That’s where we came in.
Our goal was to help Kathie continue to effectively foster collaboration with all these unknowns.
If we had taken a linear approach to this problem, we would have likely stuck to our initial assumption of the best solution…
…which was to visualize expertise in a company.
-This would have likely manifested with expertise being overlaid on top of a…
social network graph of the company.
-In fact, someone who was involved in the project recently said they were so certain this was going to be the final output that they would have taken bets on it at the onset of the project.
This is NOT where we ended up though. I’ll now walk you through how we actually got to our final solution.
-In a workshop session, we brainstormed on the data sources that could contain the data we needed about people’s connections and expertise.
-All the while, we generated a lot of ideas on how to visualize.
-From this ideation, we honed in on 2 internal unstructured datasets:
-personnel records, which had names, roles, and location information, and
-internal research papers, which contained information on who worked on what with whom.
-Knowing we had these building blocks, we began the quick and dirty prototyping process by drawing many, many sketches to explore…
-how to present the information we extracted and
-to learn what pieces of information would be truly valuable.
-By presenting Kathie with these low-fidelity and low-investment versions of interface components, we got instantaneous feedback and were able to glean insights that were not immediately apparent or articulated.
As mentioned earlier, Some of our earliest proposed solutions revolved around the expertise overlaid social network view
-To Kathie, these all looked like hairballs stuck at the bottom of the drain; they were hard to interpret, and they wouldn't help her much. She hated them!
-But she loved the idea of presenting expertise in a clean and simple bar chart that she could quickly read and easily understand.
-From here we continued to iterate on what to present and how.
-A few of the questions we explored and Kathie’s feedback included:
Q: Should expertise be presented as static or as dynamic over time?
Kathie: the kind of skills she was concerned with did not tend to change over time.
Q: What is the best way to present evidence of expertise? A facebook type timeline of relevant activity? A profile with capability to click through to original source material?
Kathie: both are too text heavy and unnecessary. What was most important was to know who a person was and their expertise, not how they got it.
Q: How should the tool go about finding the needed connections? Using a search engine that returned recommendations? A “missing links” capability that used machine learning to find missed opportunities for people to work on similar projects across products or locations?
Kathie: “missing links” could be very useful down the line, but it does not address her immediate need of fielding requests for subject matter experts for current projects.
Q: in a search engine option, how should the search function? Should it be Google style with freeform text? Or Yahoo style with click-through hierarchies?
Kathie: hierarchies only add to what has to be memorized. But being able to paste in the exact requests from employees and retrieve recommendations based on keywords would be very useful.
Q: are recommendations enough or should we include metrics of expertise and connectedness? After all, if we want collaboration to continue past current projects, people tend to build stronger connections when they have more in common.
Kathie: metrics to make informed recommendations would definitely help and continued collaboration would be ideal.
These explorations, along with several I didn’t show you, helped us to determine that for Kathie a valuable interface would include a
-search engine with relevance metrics,
-profiles with demographics, and
-human-readable expertise summary.
Understanding these needs, we had a blueprint for how to proceed.
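As a rough illustration of what "a search engine with relevance metrics" over unstructured text can look like, here is a minimal sketch using TF-IDF and cosine similarity. The names, toy profiles, and the recommend helper are hypothetical; the real tool's internals are not shown here.

```python
# Illustrative sketch (not the actual P&G tool): build one "expertise document"
# per researcher from the papers they authored, then answer a pasted request
# with a keyword-based relevance ranking.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# person -> concatenated text of their internal papers (toy data)
profiles = {
    "A. Researcher": "surfactant formulation stability emulsions rheology",
    "B. Engineer":   "polymer extrusion process scale-up pilot plant",
    "C. Chemist":    "enzyme catalysis detergent stain removal kinetics",
}

names = list(profiles)
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([profiles[n] for n in names])

def recommend(request, top_n=3):
    """Rank people by cosine similarity between the pasted request and their profile."""
    scores = cosine_similarity(vectorizer.transform([request]), matrix)[0]
    return sorted(zip(names, scores), key=lambda pair: -pair[1])[:top_n]

print(recommend("who knows about enzyme kinetics for detergents?"))
```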
As we continued to iterate and experiment with exactly how to extract and analyze the relevant data from our unstructured textual sources, we continued to iterate on what to show and how to show it in the final interface.
-We did this by continuing to present Kathie with prototypes, moving gradually from low to higher fidelity.
-We also ended up incorporating information that we were only able to learn was important during the exploration of the data itself,
(such as presenting the likelihood of collaboration as captured by activity over time.
-This mattered because when researcher activity suddenly dropped it usually indicated a shift toward a managerial role and, while expertise did not wane, their ability / willingness to collaborate in research would.)
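A hedged sketch of how an activity-over-time signal like that could be computed. The papers-per-year heuristic, the window, and the example numbers are illustrative assumptions, not the project's actual model.

```python
# Hypothetical heuristic: compare a researcher's recent paper count with their
# long-run rate; a sharp drop often signalled a shift into management.
from collections import Counter

def collaboration_signal(paper_years, current_year, window=3):
    """Ratio of recent activity rate to long-run activity rate (expects a non-empty list)."""
    counts = Counter(paper_years)
    years_active = max(current_year - min(paper_years) + 1, 1)
    long_run_rate = len(paper_years) / years_active
    recent = sum(counts[y] for y in range(current_year - window + 1, current_year + 1))
    recent_rate = recent / window
    return recent_rate / long_run_rate if long_run_rate else 0.0

# e.g. prolific until 2014, then quiet: a value well below 1 suggests a role change.
print(collaboration_signal([2008, 2009, 2010, 2011, 2012, 2013, 2014], current_year=2017))
```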
-Had we stuck to our original understanding of the problem and solution, we could have spent the whole budget on making…
…the social network interface pretty badass but we wouldn’t have ended up where we did.
-This is not to say that the original idea was wrong. In fact, I still think it’s a great one just not for Kathie’s particular needs.
Through this iterative process, we were able to develop a tool that Kathie ended up using every single day and met her particular needs.
So in this example, we started with a much more complex visualization in mind, but in the end, simple bar charts were the most ideal.
The US is pushing to improve STEM education. Code.org is a nonprofit helping to make this happen by providing resources to teachers and schools looking to incorporate computer science education into their curriculum.
-Important for them to figure out if they're doing a good job at what they're trying to do. They reached out to Outlier, who helped them develop a series of surveys to evaluate how they're doing.
-Outlier reached out to Datascope because they wanted to create a visualization for the general public that could hone in on a specific subset of the survey results.
-The subset we had to work with is a snapshot of students' opinions prior to exposure to the computer science curriculum.
Free form text response
It turns out the students did not agree on what computer science is. Here is a small sampling of three out of thousands of responses.
-When the understanding of what computer science is varies so drastically, it’s hard to have a conversation about the topic.
Outlier grouped them into categories and came to us with these categories and responses. Wanted them to be visualized in some way, but didn’t know the best way to go about it.
Sat down, had a brainstorm. Sketched out as many ideas as we could think of. Good and bad. Presented them to the client and did our best to convey what would be stressed or not stressed by different types of visualizations. Pen, marker, paper. Nothing fancy.
Really easy to group people by category in this way and then make comparisons among them.
very good for easy comparison
scales well
no granularity
no concept of the size of the study
no individuality to students - all clumped together
Also thought about how to show information on subgroups with bar graphs. Maybe they'd expand with a click
or be shown in stacked bar charts
or maybe more information could be displayed on hover. Could get more context, maybe some key words.
-See everything on more of a macro-scale
-show relative sizes in a compact amount of space
group responses into these bubbles. Circles are not the easiest for the brain to understand, but we thought, well, it might be fun. Because that’s what you do - you write down all the ideas.
Also, basically same thing in boxes, maybe simpler for comparison
Although I don’t love tree maps because they make comparisons difficult, it was worth considering because it would emphasize the fragmented nature of the data.
Called it cohort view because every single student would be represented by a dot, with different categories represented by different colors.
-Harder to show comparisons in the data here, and a bit noisier.
-What’s good about it is it would
-give individuality to the students and
-really emphasize the size of the study that was done.
-Also more visually compelling
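To give a feel for the cohort view idea, here is a rough sketch in Python/matplotlib; the real prototypes were built in JavaScript and D3, and the categories and counts below are made up. One jittered dot per student, colored by response category, so both the size of the study and the individual students stay visible.

```python
# Toy "cohort view": one dot per student, grouped and coloured by category.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
categories = {"coding": 900, "using computers": 700, "games/apps": 500, "don't know": 300}

fig, ax = plt.subplots(figsize=(8, 4))
x_offset = 0.0
for label, n in categories.items():
    # jitter each student's dot inside the category's own little "crowd"
    ax.scatter(x_offset + rng.normal(0, 0.25, n), rng.normal(0, 1, n), s=4, label=label)
    x_offset += 1.5
ax.legend(); ax.set_xticks([]); ax.set_yticks([])
ax.set_title("One dot per student, grouped by category (toy data)")
plt.show()
```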
We also decided to take a step away from categories and subcategories and explore what was going on within the responses themselves.
Word clouds with:
-all words in responses
could also do this for emotion words, like exciting, cool, fun
as well as proper nouns such as code.org, google, and angry birds
Although we don’t super love word clouds, by getting these kind of ideas in front of the client, you can get an understanding of whether or not this kind of information is important to them, and you can come up with other ways that these representations can be done that might have more analytic value
How often certain words co-occur together
How about the punctuation happening? Are there lots of questions or strong feelings?
Maybe the kids nowadays only talk in emojis. How many happy or sad faces are going on in these responses?
How important is it to see specific responses? Do we want to display the raw data itself?
Also, for each response, would they want to know if they were positive or negative, and perhaps get a collection of the top in each category?
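Explorations like these can start from very simple counting. Here is a minimal sketch of that kind of exploration; the responses are invented, loosely modeled on the slide examples, and the real dataset had thousands of them.

```python
# Quick exploration sketch for free-form responses: common words, question
# marks, and a couple of emoticons.
import re
from collections import Counter

responses = [
    "I think you learn about computer safety :)",
    "Using code and fixing and making computers :)",
    "Making games and apps? Maybe websites!",
]

words = Counter(w for r in responses for w in re.findall(r"[a-z']+", r.lower()))
print("top words:", words.most_common(5))
print("questions:", sum(r.count("?") for r in responses))
print("smileys:", sum(r.count(":)") for r in responses))
```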
Upon showing all our sketches to the client and having a conversation about all of them, we learned a few things.
-subgroups
We also found out what they really wanted to stress
-have the individuality of students shine through
-to stress the size of the study - how many people took the survey
-individual responses and uniqueness of them
Still a lot of things to figure out. How will we fit in individual responses, etc. That’s when we started prototyping higher-fidelity versions.
Here we’re switching from our pencil and paper drawings to starting to get into designing with javascript and D3.
Even at that point, there is constant iteration and we feel it is important to get designs in front of the client early and often so that we can have constant feedback on what works, what doesn’t, and what’s important.
One of the first things we found out was that they thought having the dots in a grid like this made them lose their personality, and made it less clear that these are actual people. They wanted it to be more like a crowd.
So, gave them a crowd-like feel.
Also, didn’t like that the dots weren’t filled in. It was hard to see the color.
So, filled that in.
-Additionally, you'll notice that almost every single slide has different colors because, unsurprisingly, it's difficult to find 26 colors for a group that don't look like each other. We eventually, later on, came up with something that worked well.
-Wanted to show different ways that responses could be displayed.
-We made examples using keynote slides because we didn’t want to spend time implementing this if we didn’t know which direction to take.
blurb list: shows top five responses
slide blurbs: show more sentences, gives user control to browse or just watch them cycle through, more compact
crowd blurbs: have quotes pop up and disappear in a rotation. cute, humanizes the dots.
First two really good at giving the user control over what they saw. The third was really good at furthering the individuality of the dots, but took away audience control.
Here’s the end result
As you’ll see, we did end up going with the popups, but we had to play with exactly how to implement that.
What we ended up with after all these iterative cycles was something that really highlighted what our client wanted to highlight. It also presented things across different cohorts so that you could explore them easily.
Would like to point out that, as far as iterations go, this was actually a pretty straightforward project. They are often a lot messier than this. If we had started with this idea and gone forward, perhaps it wouldn't have looked all that different from the end product. However, as you saw earlier, even discovering that this was the idea we should start with was part of the discovery in this process.
Additionally, there were a lot of details along the way that needed to be worked out, and feedback from our client helped us to know what tweaks needed to be made to deliver the best end product.
This is all to say that doing is better than planning.
And while doing, refine your approach, get feedback, and keep asking questions.
Because, with data visualizations, there’s no one correct path on how to visualize a thing.
-Sometimes data is best presented in a simple bar graph as in this P&G internal tool.
-Other times, it is best presented in a way that is less obvious or more complex, like with the Outlier visualization.
-And sometimes your story will change all together as you gather more information, as with the Instagram post.
Without being open to letting your problem and solution evolve, you might very well get stuck with a less optimal result and fail to find the best way to present your information.
All three more successful due to the iterative process and a constant feedback loop.
The entire story being told by the visualization changed based on discoveries made during data exploration
A corporate tool that took on a form much different than what was originally envisioned.
A data visualization created for public consumption that evolved to best stress specific aspects of the study, tell the story in a particular way.
TKTK - sometimes complex is better, sometimes simple. Best to go through iterative process to find the best result for that use-case.
Data Visualization for the Master of Science in Analytics (MSiA) program at Northwestern School of Engineering.