Guest lecture at Prof. David Gotz's UNC Chapel Hill INLS 690 Visual Analytics class (Given remotely) on Nov 10, 2015.
Many demos can also be accessed from interactive.twitter.com and kristw.yellowpigz.com
This document summarizes Krist Wongsuphasawat's work in data visualization at Twitter. It describes how he obtains tweet data, visualizes it using tools like R and D3 to show trends over time, locations, and text. Examples include visualizations of events like the World Cup and State of the Union. The process involves getting relevant data, visualizing it, evaluating the results, and iterating. The goal is to transform big Twitter data into smaller, insightful visualizations that tell stories.
Using Visualizations to Monitor Changes and Harvest Insights from a Global-sc..., by Krist Wongsuphasawat
Slides from my talk at the IEEE Conference on Visual Analytics Science and Technology (VAST) 2014 in Paris, France.
ABSTRACT
Logging user activities is essential to data analysis for internet products and services.
Twitter has built a unified logging infrastructure that captures user activities across all clients it owns, making it one of the largest datasets in the organization.
This paper describes challenges and opportunities in applying information visualization to log analysis at this massive scale, and shows how various visualization techniques can be adapted to help data scientists extract insights.
In particular, we focus on two scenarios: (1) monitoring and exploring a large collection of log events, and (2) performing visual funnel analysis on log data with tens of thousands of event types.
Two interactive visualizations were developed for these purposes. We discuss design choices and the implementation of these systems, along with case studies of how they are being used in day-to-day operations at Twitter.
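The funnel analysis mentioned in the abstract can be illustrated with a small sketch: given per-user event logs, count how many users reach each step of a funnel in order. This is an invented example in Python (the event names and data are hypothetical, not Twitter's actual log schema), just to make the concept concrete.

```python
def funnel_counts(logs, steps):
    """logs: dict of user -> ordered list of event names.
    Returns how many users reach each funnel step in sequence."""
    counts = [0] * len(steps)
    for events in logs.values():
        pos = 0
        for e in events:
            if pos < len(steps) and e == steps[pos]:
                counts[pos] += 1
                pos += 1
    return counts

logs = {
    "u1": ["open_app", "compose", "send_tweet"],
    "u2": ["open_app", "compose"],
    "u3": ["open_app", "scroll", "compose", "send_tweet"],
    "u4": ["compose"],  # never opened the app first, so drops out at step 1
}
print(funnel_counts(logs, ["open_app", "compose", "send_tweet"]))  # [3, 3, 2]
```

With tens of thousands of event types, the hard part the paper addresses is not this counting but letting analysts find and choose the right steps interactively.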
In this talk, I reflect on the tasks commonly involved in crafting visualizations and show examples of different applications of information/data visualization. Along the way I will share my workflow, point out common pitfalls, and provide recommendations.
These slides were from my guest lecture in the InfoVis class at the UC Berkeley iSchool on Apr 11, 2016. Thank you, Prof. Marti Hearst, for the invitation.
The document describes Krist Wongsuphasawat's background and work in data visualization. It notes that he has a PhD in Computer Science from the University of Maryland, where he studied information visualization. He currently works as a data visualization scientist at Twitter, where he builds internal tools to analyze log data and monitor changes over time. Some of his projects include Scribe Radar, which allows users to search through and visualize client event data in order to find patterns and monitor effects of product changes. The document provides details on his approaches for dealing with large log datasets and visualizing user activity sequences.
Making Sense of Millions of Thoughts: Finding Patterns in the Tweets, by Krist Wongsuphasawat
I gave this presentation at Workshop on Interactive Language Learning, Visualization, and Interfaces / ACL 2014 in Baltimore, MD on June 27, 2014.
http://nlp.stanford.edu/events/illvi2014/index.html
ABSTRACT
Every day on Twitter, millions of thoughts are captured and shared with the world in the form of 140-character messages, or Tweets. There are many things we could learn from these thoughts if we could figure out a way to digest this gigantic dataset. Visualization is one of the many ways to extract information from these Tweets. In this presentation, I will talk about several visualizations based on Tweets, as well as share experiences and challenges from working with Tweet data.
Slides from the VIS in practice panel "Increasing the Impact of Visualization Research" during IEEE VIS 2017 in Phoenix, AZ. http://www.visinpractice.rwth-aachen.de/panel.html
This document discusses expectations when visualizing data and creating visualizations. It covers 6 main points:
1. Expect to find the real need by understanding audience goals, questions, and intended use of the visualization. Compromise may be needed.
2. Expect to spend significant time (70-80%) cleaning data due to issues like multiple data sources and formats, missing values, and errors.
3. Expect trial and error in the prototyping process to solve problems and meet deadlines. Iteration is important.
4. For larger datasets, expect challenges in processing, analyzing, and reducing size to find relevant insights. Tools like Hadoop can help handle bigger data.
5.
The document analyzes Twitter data related to two contrasting events in July 2011: the Norway attacks and Amy Winehouse's death. It finds that the Norway attacks received more tweets initially but interest declined more gradually, while Winehouse's death sparked a large initial volume of tweets that fell off steeply. Both events saw mostly neutral tweets, with negative tweets outnumbering others for the Norway attacks later on. The analysis used Apache Hadoop and Hive to process Twitter data on Amazon EC2, identifying challenges around duplicates, parsing errors, and incomplete data recovery.
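The kind of aggregation the Hive jobs above perform at scale can be sketched in miniature: deduplicate tweets by id, then count daily volume for a keyword. The records below are invented for illustration; the real pipeline ran on Hadoop/Hive over full Twitter data.

```python
from collections import Counter

tweets = [
    {"id": 1, "date": "2011-07-23", "text": "Shocked by the news from Norway"},
    {"id": 1, "date": "2011-07-23", "text": "Shocked by the news from Norway"},  # duplicate record
    {"id": 2, "date": "2011-07-23", "text": "RIP Amy Winehouse"},
    {"id": 3, "date": "2011-07-24", "text": "Thoughts with Norway today"},
]

seen = set()
daily = Counter()
for t in tweets:
    if t["id"] in seen:
        continue  # drop duplicates, one of the data-quality issues noted above
    seen.add(t["id"])
    if "norway" in t["text"].lower():
        daily[t["date"]] += 1

print(dict(daily))  # {'2011-07-23': 1, '2011-07-24': 1}
```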
"Apache Spark™ is a fast and general engine for large-scale data processing."" Above statement is taken from Apache Spark welcome page. It's one of those definitions that, while describing the product in one sentence and being 100 % true, tell still little to the wondering noob.
Why take interest in Apache Spark? Apache Spark promise being up to 100x faster than Hadoop MapReduce in certain scenarios. It provide comprehensible programming model (familiar to everyone who is used to functional programming) and vast ecosystem of tools.
In my talk I will try to reveal secrets of Apache Spark for the very beginners.
We will do first quick introduction to the set of problems commonly known as BigData: what they try to solve, what are their obstacles and challenges and how those can be addressed. We will quickly take a pick on MapReduce: theory and implementation. We will then move to Apache Spark. We will see what was the main factor that drove its creators to introduce yet another large-scala processing engine. We will see how it works, what are its main advantages. Presentation will be mix of slides and code examples.
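The MapReduce model the talk introduces can be shown as a toy word count: map each line to (word, 1) pairs, shuffle by key, then reduce each group by summing. This is a pure-Python sketch of the idea, not Hadoop or Spark code.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map step: emit (word, 1) for every word in every line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # sorting by key stands in for the framework's shuffle step
    pairs = sorted(pairs, key=itemgetter(0))
    return {k: sum(v for _, v in group)
            for k, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data big problems", "big wins"]))
print(counts)  # {'big': 3, 'data': 1, 'problems': 1, 'wins': 1}
```

Spark's contribution is largely about keeping intermediate results like these in memory across chained operations instead of writing them to disk between jobs.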
Slides for the sixth meeting of the course 'Big Data and Automated Content Analysis' at the Department of Communication Science, University of Amsterdam
Slides for the first meeting of the course 'Big Data and Automated Content Analysis' at the Department of Communication Science, University of Amsterdam
Building a Graph-based Analytics Platform, by Kenny Bastani
Meetup is a valuable source of data for understanding trends around products or brands. Meetup does not provide an analytics package to track group statistics over time unless you are an administrator of a group. There are no third-party tools or websites that analyze Meetup trends to understand how communities grow.
In this talk I will present a graph-based analytics platform that uses the Meetup.com API to collect and analyze membership statistics over time.
This talk will cover:
How to poll and import periodic data from the Meetup.com API into Neo4j using Node.js.
How to track Meetup group growth over time with a Neo4j graph database using Node.js.
How to apply tags to meetup groups and report combined growth of all groups over time.
How to build an interactive documented analytics API to support applications using Node.js and Neo4j.
How to build a business dashboard to visualize time-based statistics and reports using a Node.js based REST API that queries Neo4j.
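The talk's pipeline uses Node.js; as a language-neutral sketch of the "poll, then import into Neo4j" step, here is how a polled membership snapshot could be turned into a Cypher MERGE statement. The node labels, property keys, and group name are all invented for illustration, not taken from the talk.

```python
def snapshot_to_cypher(group, day, members):
    """Build an idempotent Cypher statement recording a group's
    member count on a given day (hypothetical schema)."""
    return (
        f"MERGE (g:Group {{name: '{group}'}}) "
        f"MERGE (d:Day {{date: '{day}'}}) "
        f"MERGE (g)-[s:STATS_ON]->(d) "
        f"SET s.members = {members}"
    )

stmt = snapshot_to_cypher("Graph Database Meetup", "2014-10-22", 512)
print(stmt)
```

Because MERGE is idempotent, re-running the poller for the same day updates the count instead of creating duplicate nodes, which keeps the time series clean.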
Webinar: Mastering Python - An Excellent tool for Web Scraping and Data Anal..., by Edureka!
The free webinar on Python titled "Mastering Python - An Excellent tool for Web Scraping and Data Analysis" was conducted by Edureka on 14th November 2014
Google Desktop is desktop search software that indexes files on a computer and allows users to search emails, files, music, photos and more from a sidebar. It features file indexing, a sidebar with gadgets for email, notes, photos, news and weather, and quick searching across the computer from the sidebar or taskbar. Google Desktop runs on Mac OS X, Linux and Windows and continues to index files in the background as they change.
The document discusses web scraping and outlines a step-by-step process for scraping comments from a Dutch website called GeenStijl. It begins with using regular expressions to scrape the comments, but notes that existing parsers can make the process more elegant, especially for complex websites. It then demonstrates using the lxml module and XPath to scrape reviews from another site in a more structured way. The document provides remarks on regular expressions and XPath, and encourages exploring different scraping techniques.
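The first step described above, scraping comments with regular expressions, can be shown in miniature. The HTML snippet and class name here are invented; as the slides note, a real parser (such as lxml with XPath) is more robust for complex, messy pages.

```python
import re

html = """
<div class="comment">Eerste!</div>
<div class="comment">Mooi stuk.</div>
<div class="other">navigation</div>
"""

# non-greedy capture between the opening and closing tags
comments = re.findall(r'<div class="comment">(.*?)</div>', html)
print(comments)  # ['Eerste!', 'Mooi stuk.']
```

A regex like this breaks as soon as the markup changes (extra attributes, nested tags), which is exactly the argument the document makes for switching to a proper parser.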
Slides for the course Big Data and Automated Content Analysis, in which students of the social sciences (communication science) learn how to conduct analyses using Python.
This document summarizes a presentation on unsupervised and supervised machine learning techniques for automated content analysis. It recaps types of automated content analysis, describes unsupervised techniques like principal component analysis (PCA) and latent Dirichlet allocation (LDA), and supervised machine learning techniques like regression. It provides examples of applying these techniques to cluster Facebook messages and predict newspaper reading. The document concludes by noting the presenter will use a portion of labeled data to estimate models and check predictions against the remaining labeled data.
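The holdout idea at the end of that summary, estimating a model on part of the labeled data and checking predictions on the rest, can be sketched with a deliberately trivial majority-class "model". The labels and the 75/25 split are invented for illustration.

```python
from collections import Counter

labels = ["politics", "sports", "politics", "politics",
          "sports", "politics", "politics", "sports"]
split = int(len(labels) * 0.75)
train, test = labels[:split], labels[split:]

# the "model": always predict the most common training label
majority = Counter(train).most_common(1)[0][0]
accuracy = sum(1 for y in test if y == majority) / len(test)
print(majority, accuracy)
```

Any real supervised classifier would replace the majority rule, but the estimate-on-train, score-on-held-out-test loop is the same.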
This document provides an overview and objectives of a course on web scraping and analytics with Python. The course covers web scraping concepts and the BeautifulSoup package for scraping websites. It also demonstrates scraping an IMDB webpage to extract movie data and the PyDoop package for performing analytics on large datasets with Hadoop using Python. Examples of preprocessing text data with NLTK on Hadoop are also provided.
This document provides an overview of a presentation on automated content analysis using regular expressions and natural language processing. The presentation covers topics like bottom-up vs top-down analysis, what regular expressions are and how they can be used in Python, stemming, parsing sentences, and combining techniques like stemming and stopword removal. Examples are given on using regular expressions to count actors in articles and check the number of a document from LexisNexis. The takeaway message is about an upcoming take-home exam and future meetings.
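The "count actors in articles with regular expressions" example mentioned above might look like this; the articles and actor names are invented, and word-boundary patterns keep "Obama" from matching inside longer tokens.

```python
import re

articles = [
    "Obama met Merkel in Berlin. Merkel welcomed Obama warmly.",
    "Merkel spoke to the press.",
]

actors = {"Obama": r"\bObama\b", "Merkel": r"\bMerkel\b"}
counts = {name: sum(len(re.findall(pat, a)) for a in articles)
          for name, pat in actors.items()}
print(counts)  # {'Obama': 2, 'Merkel': 3}
```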
This document provides information about Google Search and search engines. It discusses how search engines work and lists some of the major search engines. It provides background on Google's founding and growth. The document outlines several search operators and tips for using them, such as using quotation marks, AND/OR, wildcards, and excluding terms. It also discusses searching specific file types, unit conversions directly in search, and searching offline without an internet connection.
The Sourcecon webinar slides delivered by Andy Headworth from http://sironaconsulting.com/ on 22nd October 2014. It is about using Twitter and Google Plus to source candidates.
It covers sourcing individuals on both Google+ and Twitter as well as sourcing candidates from Communities and Twitter Lists.
600+ SEARCHABLE Sourcing Tools compiled by Susanna Frazier (@ohsusannamarie)
This document provides a list of over 600 sourcing tools categorized by their functions. It describes each tool's name, current version, category and a brief description. The tools cover a wide range of functions including search, social media, email, documents, scheduling and more. They allow users to easily access information, automate tasks and integrate various online services.
Big Graph Analytics on Neo4j with Apache Spark, by Kenny Bastani
In this talk I will introduce you to a Docker container that provides an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You'll learn how to analyze big graphs that are exported from Neo4j and subsequently updated with the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
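To show what a GraphX PageRank job computes at scale, here is power-iteration PageRank in pure Python on a tiny invented directed graph with no dangling nodes (every node has at least one outgoing edge, so no special handling is needed).

```python
def pagerank(edges, nodes, damping=0.85, iters=50):
    """Iteratively distribute each node's rank along its out-edges."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_deg = {n: sum(1 for s, _ in edges if s == n) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for s, t in edges:
            new[t] += damping * rank[s] / out_deg[s]
        rank = new
    return rank

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(edges, nodes)
print(max(ranks, key=ranks.get))  # 'c' is linked to by both a and b
```

GraphX runs essentially this iteration, but with the rank updates partitioned and exchanged across the cluster instead of held in one dictionary.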
Database technologies have evolved to be able to store big data, but are largely inflexible. For complex graph data models stored in a relational database there may be tedious transformations and shuffling around of data to perform large scale analysis.
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.
Speakers
Effective and efficient Google searching PowerPoint tutorial, by Jaclyn Lee Parrott
This document provides guidance on effective Google searching. It discusses Google's mission to organize the world's information and make it accessible. It also notes that Google profiles users to target advertising and its products may change. The document then provides examples of basic Google searches and demonstrates more advanced search techniques. It stresses evaluating sources and avoiding plagiarism. Finally, it includes an exercise for readers to practice advanced Google searches.
Open Source Search Tools for the www2010 conference, by Ted Drake
Presentation by Ted Drake and Rosie Jones for the www2010 conference in North Carolina. It discusses open source search software, APIs, and trends.
The Midterm Presentation of the Con-Action project represents the combined effort of all 20 members of the Digital Media Master projects DaVisMo (data visualization and student mobility) and Confetti (embodied conversational agents) in 2009/2010 at the University of Bremen and the University of the Arts Bremen in Germany.
The accompanying videos can be found at: http://vimeo.com/channels/digitalmedia
Delineating Cancer Genomics through Data Visualization, by Rupam Das
In spite of advances in technologies for working with data, people spend an undue amount of time understanding the data and manipulating it into a holistic visualization. Data visualization software for complex datasets, such as those in cancer genomics (which we have taken as a case study), is not able to provide effective visualization for users. Identification and characterization of cancer are important areas of research that are based on the integrated analysis of multiple heterogeneous genomics datasets. In this report, we review the key issues and challenges associated with cancer genomics through an exploration of data visualization techniques, interactions, and methods, which will in turn advance the state of the art.
"Apache Spark™ is a fast and general engine for large-scale data processing."" Above statement is taken from Apache Spark welcome page. It's one of those definitions that, while describing the product in one sentence and being 100 % true, tell still little to the wondering noob.
Why take interest in Apache Spark? Apache Spark promise being up to 100x faster than Hadoop MapReduce in certain scenarios. It provide comprehensible programming model (familiar to everyone who is used to functional programming) and vast ecosystem of tools.
In my talk I will try to reveal secrets of Apache Spark for the very beginners.
We will do first quick introduction to the set of problems commonly known as BigData: what they try to solve, what are their obstacles and challenges and how those can be addressed. We will quickly take a pick on MapReduce: theory and implementation. We will then move to Apache Spark. We will see what was the main factor that drove its creators to introduce yet another large-scala processing engine. We will see how it works, what are its main advantages. Presentation will be mix of slides and code examples.
Slides for the sixth meeting of the course 'Big Data and Automated Content Analysis' at the Department of Communication Science, University of Amsterdam
Slides for the first meeting of the course 'Big Data and Automated Content Analysis' at the Department of Communication Science, University of Amsterdam
Building a Graph-based Analytics PlatformKenny Bastani
Meetup is a valuable source of data for understanding trends around products or brands. Meetup does not support an analytics package to track group statistics overtime unless you are an administrator of a group. There are no third-party tools or websites that analyze Meetup trends to understand how communities grow.
In this talk I will present a graph-based analytics platform that uses the Meetup.com API to collect and analyze membership statistics over time.
This talk will cover:
How to poll and import periodic data from the Meetup.com API into Neo4j using Node.js.
How to track meetup group growth over time using a Neo4j graph database using Node.js.
How to apply tags to meetup groups and report combined growth of all groups over time.
How to build an interactive documented analytics API to support applications using Node.js and Neo4j.
How to build a business dashboard to visualize time-based statistics and reports using a Node.js based REST API that queries Neo4j.
Webinar: Mastering Python - An Excellent tool for Web Scraping and Data Anal...Edureka!
The free webinar on Python titled "Mastering Python - An Excellent tool for Web Scraping and Data Analysis" was conducted by Edureka on 14th November 2014
Google Desktop is desktop search software that indexes files on a computer and allows users to search emails, files, music, photos and more from a sidebar. It features file indexing, a sidebar with gadgets for email, notes, photos, news and weather, and quick searching across the computer from the sidebar or taskbar. Google Desktop runs on Mac OS X, Linux and Windows and continues to index files in the background as they change.
The document discusses web scraping and outlines a step-by-step process for scraping comments from a Dutch website called GeenStijl. It begins with using regular expressions to scrape the comments, but notes that existing parsers can make the process more elegant, especially for complex websites. It then demonstrates using the lxml module and XPath to scrape reviews from another site in a more structured way. The document provides remarks on regular expressions and XPath, and encourages exploring different scraping techniques.
Slides for the course Big Data and Automated Content Analysis, in which students of the social sciences (communication science) learn how to conduct analyses using Python.
This document summarizes a presentation on unsupervised and supervised machine learning techniques for automated content analysis. It recaps types of automated content analysis, describes unsupervised techniques like principal component analysis (PCA) and latent Dirichlet allocation (LDA), and supervised machine learning techniques like regression. It provides examples of applying these techniques to cluster Facebook messages and predict newspaper reading. The document concludes by noting the presenter will use a portion of labeled data to estimate models and check predictions against the remaining labeled data.
This document provides an overview and objectives of a course on web scraping and analytics with Python. The course covers web scraping concepts and the BeautifulSoup package for scraping websites. It also demonstrates scraping an IMDB webpage to extract movie data and the PyDoop package for performing analytics on large datasets with Hadoop using Python. Examples of preprocessing text data with NLTK on Hadoop are also provided.
This document provides an overview of a presentation on automated content analysis using regular expressions and natural language processing. The presentation covers topics like bottom-up vs top-down analysis, what regular expressions are and how they can be used in Python, stemming, parsing sentences, and combining techniques like stemming and stopword removal. Examples are given on using regular expressions to count actors in articles and check the number of a document from LexisNexis. The takeaway message is about an upcoming take-home exam and future meetings.
This document provides information about Google Search and search engines. It discusses how search engines work and lists some of the major search engines. It provides background on Google's founding and growth. The document outlines several search operators and tips for using them, such as using quotation marks, AND/OR, wildcards, and excluding terms. It also discusses searching specific file types, unit conversions directly in search, and searching offline without an internet connection.
The Sourcecon webinar slides delivered by Andy Headworth from http://sironaconsulting.com/ on 22nd October 2014. It is about using Twitter and Google Plus to source candidates.
It covers sourcing individuals on both Google+ and Twitter as well as sourcing candidates from Communities and Twitter Lists.
600+ SEARCHABLE Sourcing Tools compiled by Susanna Frazier @ohsusannamarieSusanna Frazier
This document provides a list of over 600 sourcing tools categorized by their functions. It describes each tool's name, current version, category and a brief description. The tools cover a wide range of functions including search, social media, email, documents, scheduling and more. They allow users to easily access information, automate tasks and integrate various online services.
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
In this talk I will introduce you to a Docker container that provides you an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You'll learn how to analyze big data graphs that are exported from Neo4j and consequently updated from the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
Database technologies have evolved to be able to store big data, but are largely inflexible. For complex graph data models stored in a relational database there may be tedious transformations and shuffling around of data to perform large scale analysis.
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.
Speakers
Effective and efficient google searching power point tutorialJaclyn Lee Parrott
This document provides guidance on effective Google searching. It discusses Google's mission to organize the world's information and make it accessible. It also notes that Google profiles users to target advertising and its products may change. The document then provides examples of basic Google searches and demonstrates more advanced search techniques. It stresses evaluating sources and avoiding plagiarism. Finally, it includes an exercise for readers to practice advanced Google searches.
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...Ted Drake
Presentation by Ted DRAKE and Rosie JONES for the www2010 conference in North Carolina. This discusses the open source search software, APIs and trends.
The Midterm Presentation of the Con-Action project represents the combined effort of all 20 members of the Digital Media Master projects DaVisMo (data visualization and student mobility) and Confetti (embodied conversational agents) in 2009/2010 at the University of Bremen and the University of the Arts Bremen in Germany.
The accompanying videos can be found at: http://vimeo.com/channels/digitalmedia
Delineating Cancer Genomics through Data VisualizationRupam Das
In spite in advances in technologies for working with data, people spend undue amount of time in understanding the data and manipulating it into holistic visualization. Data visualization software for complex dataset such as in cancer genomics (which we have taken as case study) are not able to provide effective visualization for the users. Identification and characterization of cancer detection are important areas of research that are based on the integrated analysis of multiple heterogeneous genomics datasets. In this report, we review the key issues and challenges associated with cancer genomics through exploration of data visualization techniques, interactions and methods, which will in-turn advance the state of the art.
This document provides an overview and agenda for a data visualization capstone project for a social networking platform for business families. It discusses the client and problem to be solved through developing a new family tree visualization. The document then covers the project goals, context diagram, challenges in deciding the visual model, research conducted, experimentation, requirements management, and quality attributes to guide the architecture and development of the new family tree system.
Information Visualization for Knowledge Discovery: An Introduction, by Krist Wongsuphasawat
This document provides an introduction to information visualization and its role in knowledge discovery. It discusses the challenges of understanding large datasets and how information visualization techniques like scatter plots, maps, and interactive visualizations can help identify patterns, trends, outliers and support communication and discovery. Examples of information visualization tools and techniques are presented across different data types like temporal, hierarchical, and network data.
This document discusses research on applying text mining and information retrieval techniques for fact finding in regulatory investigations from electronic documents. The researchers are developing methods for semantic search in e-discovery to iteratively retrieve relevant evidence from emails, forums, and other sources by integrating structural context and extracting knowledge from unstructured text. Their current work includes using Twitter mining as a form of conversational search and entity linking to semantically enrich documents.
This document discusses data visualization in Python and Django. It provides motivation for representing business analytic data graphically using charts and diagrams. It describes sources of data, preprocessing data, and categorizing data as real-time or batch-based. Visualization can be done on the server or client. Tools are discussed for data analysis and visualization libraries like Matplotlib are mentioned. Appendices provide code examples for scatter plots, loading data from databases, and refreshing views.
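The "load data from a database, then plot" pattern that the summary's appendices cover can be sketched with the standard library alone: pull (x, y) pairs out of SQLite, ready to hand to Matplotlib's scatter() on the server. The table and column names here are invented for illustration.

```python
import sqlite3

# in-memory database standing in for the app's real backing store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(10, 110), (20, 190), (30, 320)])

rows = conn.execute("SELECT ad_spend, revenue FROM sales").fetchall()
xs, ys = zip(*rows)  # two tuples, directly usable as scatter-plot axes
print(xs, ys)
```

From here, `matplotlib.pyplot.scatter(xs, ys)` on the server (or shipping the pairs as JSON to a client-side library) covers both rendering options the document contrasts.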
The document discusses collaboration between artists and scientists and suggests that great collaboration takes both art and science. It recommends helping the artists and scientists on your team work together. It also mentions that you can try Mindjet for free for 30 days.
Collaboration: A hands-on demo using Confluence wikiSarah Maddox
A Scriptorium webinar about technical communication, collaboration and Confluence wiki. This slide deck includes screenshots of the parts of the demo that were live on the wiki during the presentation.
d3Kit is a set of tools that speeds up development of D3-related projects. It is a lightweight library that handles the basic groundwork tasks you need when building visualizations with D3.
Presentation on the Art of Visual Thinking and the application in the Visual Practice. Why and How it works. Presentation made at Innovation in Mind 2012 and for EMBA program at University of Geneva. For more information on research on this topic go to ForbesOste.com
Overview of Confluence and its features and how it is useful for enterprises. Updated with new social features in Confluence 3.0 and SharePoint Integration
This document summarizes Damian Trilling's workshop on analyzing big Twitter data. The workshop covers collecting Twitter data using yourTwapperkeeper, formatting it as CSV files, and writing a Python script to analyze the data. Trilling demonstrates a Python script that identifies tweets mentioning Poland by searching tweet texts and counting matches. Participants are then instructed to download example files and write their own analysis script.
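The workshop's counting script is not reproduced in this summary, so here is a minimal sketch of the same idea; the CSV layout and the `text` column name are assumptions for illustration:

```python
import csv

def count_mentions(csv_path, keyword):
    """Count tweets whose text mentions the keyword (case-insensitive)."""
    matches = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Assumes the exported CSV has a "text" column with the tweet body
            if keyword.lower() in row.get("text", "").lower():
                matches += 1
    return matches
```

Participants could then adapt the match condition (e.g., regular expressions or multiple keywords) for their own analysis scripts.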
This document provides a tour of data visualization and its relationship to data science. It discusses how visualization can turn data into valuable insights through exploratory data analysis and storytelling. Examples are given of how visualization has been used at Twitter to analyze Ballon d'Or voting data, New Year's tweets, user activity logs, and user sessions to glean insights. Visualization is described as taking data and turning it into visual displays and interactive tools to help audiences understand large amounts of information and for communicating known facts or exploring data.
Microsoft Graph is the rich, robust API for an increasing number of products across Microsoft. Microsoft Graph has a large footprint of tools, SDKs, and API capabilities you can incorporate in your projects. Come see what's new across products and available for developers -- you'll take away code and tools you'll undoubtedly use as you build apps and services.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
Unlock your Big Data with Analytics and BI on Office 365Brian Culver
Companies have huge amounts of data waiting to be explored. With Azure HDInsights you can realize the value of your data. With Microsoft Excel 2013 and Office 365, you have a complete platform for BI solutions and services. Power BI allows companies to manipulate and study a variety of data points, gain actionable insights and share their insights. PowerPivot, Power View, Power Query, Power Map and Power BI Sites let users analyze and make decisions using structured and unstructured data.
Attendee Takeaways:
1. Learn to setup and configure HDInsights on Microsoft Azure.
2. Understand how to use Excel for BI capabilities.
3. Build a BI Dashboard in Office365.
Want more apps to be built on your open data? Discover ways to make data more developer friendly. We will look at the history of the Internet and current trends to build an understanding of standards and interfaces to make your data future friendly. At the same time it will make it more useful to developers, citizens and your own organization.
Matthew Russell's "Unleashing Twitter Data for Fun and Insight" presentation from Strata 2011. See http://strataconf.com/strata2011/public/schedule/detail/17714 for an overview of the talk.
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemShirshanka Das
Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes that enable LinkedIn to roll out future product innovations with minimal downstream impact. Shirshanka and Yael explore the motivations and the building blocks for this reimagined data analytics ecosystem, the technical details of LinkedIn’s new client-side tracking infrastructure, its unified reporting platform, and its data virtualization layer on top of Hadoop and share lessons learned from data producers and consumers that are participating in this governance model. Along the way, they offer some anecdotal evidence during the rollout that validated some of their decisions and are also shaping the future roadmap of these efforts.
Architecting for change: LinkedIn's new data ecosystemYael Garten
2016 StrataHadoop NYC conference talk.
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52182
Abstract:
Last year, LinkedIn embarked on an ambitious mission to completely revamp the mobile experience for its members. This would mean a completely new mobile application, reimagined user experiences, and new interaction concepts. As the team evaluated the impact of this big rewrite on the data analytics ecosystem, they observed a few problems.
Over the past few years, LinkedIn has become extremely good at incrementally changing the site one mini-feature at a time, often in conjunction with hundreds of other incremental changes. LinkedIn’s experimentation platform ensures that it is always monitoring a wide gamut of impacted metrics with every change before rolling fully forward. However, when it comes to rolling out a big change like this, different challenges crop up. You have to roll out the entire application all at once; the new experience means that you have no baseline on new metrics; and existing metrics may see double-digit changes just because of the new experience or because the metric’s logic is no longer accurate—the challenge is in figuring out which is which.
This document summarizes a presentation about building a real-time analytics API at scale using Citus, an open-source PostgreSQL extension. The presentation discusses how Algolia moved from ElasticSearch to Citus to enable sub-second analytics queries on billions of events per day. Key points include how Algolia configured Citus to shard and distribute data across clusters, used roll-up tables to aggregate raw events into aggregated metrics on different time intervals, and could perform queries on these aggregated tables with sub-800ms latency at scale. The approach using Citus as the foundation has proven successful for Algolia's analytics needs.
B365 saturday practical guide to building a scalable search architecture in s...Thuan Ng
This document outlines Thuan Nguyen's presentation on building a scalable search architecture in SharePoint 2013. The presentation covers common misunderstandings about search architecture, the logical components of search, and a practical guide to assessing needs, designing, implementing, and verifying a scalable search solution. It provides examples of sample search architectures for different volumes of content and use cases. The document concludes with references and a call for questions.
Usually, DataOps means applying DevOps principles to existing data analytics projects. We accidentally reversed it, taking a DevOps initiative and catalyzing adoption of data-driven practices across our company.
What started as a practical initiative to bring better reliability and visibility to our software product had the unexpected effect of catalyzing a transformation that helped our organization become more data-driven across the company. What we learned in the process was how and why DevOps principles can naturally expand the role of a traditional operations team and bring wider culture change to the organization.
Georgi Kobilarov presented on the status and future of DBpedia. DBpedia extracts structured data from Wikipedia and makes it available as linked open data. Current challenges include improving data quality, handling live Wikipedia updates, adding other data sources, and developing a new approach for infobox extraction using a domain-specific ontology. The vision is for DBpedia to become the Wikipedia of structured data and enable users and applications to access and query this data without having to understand its technical implementation.
Similar to Adventure in Data: A tour of visualization projects at Twitter (20)
“Which visualization library should I use?” Typically, making this decision is not about whether one library is “better” than another, but whether the specific library is more suitable for what the developer is trying to achieve. To answer this question thoroughly, we need to better understand the design space of visualization libraries. The talk will give a tour of many kinds of visualization libraries on the web across the design space, while explaining the framework and design philosophy that the audience can learn along the way. The audience will expand their horizons and become more aware of the wide universe of libraries. The next time they come across a new package, they can use this framework as a lens to analyze its offerings and how it differs from or resembles the libraries they already know.
Encodable: Configurable Grammar for Visualization ComponentsKrist Wongsuphasawat
There are so many libraries of visualization components nowadays, with APIs that often differ from one another. Could these components be more similar, both in terms of the APIs and common functionalities? For someone developing a new visualization component, what should the API look like? This work drew inspiration from visualization grammar, decoupled the grammar from its rendering engine, and adapted it into a configurable grammar for individual components called Encodable. Encodable helps component authors define a grammar for their components and parse encoding specifications from users into utility functions for the implementation.
This document discusses expectations and challenges when visualizing data. The key points are:
1. Expect to find the real need by understanding the audience and goals better than the client. Expect to clean data, which can take a significant amount of time due to multiple sources and formats.
2. Prepare to iterate as the initial visualization may not meet needs or deadlines. Celebrate failures as learning opportunities.
3. Visualization projects include storytelling projects with strict deadlines and analytical tools to support data exploration by technical teams over the long term. The project lifecycle involves identifying needs, prototyping, refining, and maintaining the visualization.
This document summarizes the key expectations and challenges when visualizing data or building visual analytics tools. There are several main points:
1. Expect potential mismatches between what clients think they need versus what the data and visualization actually require, requiring clear communication and compromise.
2. Different projects will have different goals that require flexibility in the types of visualizations created, whether for presentation, exploration, or both.
3. A significant amount of time, often 70-80%, will be spent cleaning and preparing data prior to visualization due to issues like missing values, formatting inconsistencies, and data quality problems.
4. Iteration is essential to work out bugs and refine visualizations to best meet requirements and deadlines.
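As a hedged illustration of the cleaning work point 3 describes, the sketch below normalizes dates arriving in multiple formats and drops unusable rows; all field names and formats are invented for this example:

```python
from datetime import datetime

def clean_record(raw):
    """Normalize one raw record; return None if it can't be salvaged."""
    value = (raw.get("value") or "").strip()
    if not value:
        return None  # missing measurement: drop the row
    # Different sources deliver dates in different formats (assumed list)
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
        try:
            date = datetime.strptime(raw["date"].strip(), fmt).date()
            break
        except ValueError:
            continue
    else:
        return None  # unparseable date: drop the row
    return {"date": date.isoformat(), "value": float(value)}
```

Even a toy cleaner like this shows why the preparation phase dominates a project's schedule: every new source adds formats and failure modes to handle.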
This document discusses storytelling with data and data visualization. It begins with an introduction to the speaker and their background. It then covers topics like data sources, challenges in working with big data, applications of data analysis, examples of data stories, the data analysis process, and a case study analyzing tweets about the TV show Game of Thrones. Throughout there are references to iterating on prototypes and using feedback to improve. The overall message is that telling stories from data takes collecting relevant data, exploring it through multiple iterations, and presenting insights in an engaging way.
Reveal the talking points of every episode of Game of Thrones from fans' conv...Krist Wongsuphasawat
You may not be sure how Lord Varys collects information from his little birds, but in this talk you will hear how we can collect information from our little birds.
@kristw shares a behind-the-scenes view of his latest data visualization project, which shows how each #GameOfThrones episode was discussed on Twitter. Using data visualization, we can extract and reveal the stories of every episode from fans’ Tweets.
https://interactive.twitter.com/game-of-thrones
These slides are from a talk given at Bay Area d3 User Group meetup on June 9, 2016.
http://www.meetup.com/Bay-Area-d3-User-Group/events/231281298
A talk at Data Visualization Summit 2014 in Santa Clara, CA
ABSTRACT: What is the thought process that transforms data into visualizations? In this presentation, I will talk about guidelines that will help you when starting with raw data, walk through standard techniques, and also discuss things to keep in mind when making design decisions.
This document proposes a narrative display to summarize sports tournaments using a tree layout to show tournament structure, small multiples to detail individual matches, and tweets to capture fan reactions, using data from the 2012-2013 UEFA Champions League as an example. The display would be hosted at uclfinal.twitter.com and divided into three main sections for tournament overview, match details, and fans' reactions.
The document summarizes Krist Wongsuphasawat's presentation on visualizing event sequences at the 2013 Data Visualization Summit in San Francisco. Wongsuphasawat discussed techniques for visualizing event sequences, including using glyphs on a timeline to represent events, using interval width to represent duration, color and shape to distinguish event types, faceting for high density sequences, and aggregation techniques like binning and kernel density estimation. He demonstrated the LifeFlow tool for providing overviews and summaries of event sequence data. Wongsuphasawat also discussed alignment of sequences, outcome-based aggregation with the Outflow tool, and applications to analyzing big event sequence data like customer checkout processes at eBay.
Krist Wongsuphasawat's Dissertation Proposal Slides: Interactive Exploration ...Krist Wongsuphasawat
This dissertation by Krist Wongsuphasawat from the University of Maryland describes research on interactive exploration of event sequences in temporal categorical data. The document discusses how this type of data arises in domains like electronic health records and student records. It proposes designing effective visualization and interaction techniques to support users in exploring event sequences when they are uncertain about what they are looking for. The research aims to provide an overview of event sequences in temporal categorical data as well as a flexible temporal search approach.
Outflow: Exploring Flow, Factors and Outcome of Temporal Event SequencesKrist Wongsuphasawat
My presentation at IEEE VisWeek 2012 in Seattle, WA
//// Abstract:
Event sequence data is common in many domains, ranging from electronic medical records (EMRs) to sports events. Moreover, such sequences often result in measurable outcomes (e.g., life or death, win or loss). Collections of event sequences can be aggregated together to form event progression pathways. These pathways can then be connected with outcomes to model how alternative chains of events may lead to different results. This paper describes the Outflow visualization technique, designed to (1) aggregate multiple event sequences, (2) display the aggregate pathways through different event states with timing and cardinality, (3) summarize the pathways’ corresponding outcomes, and (4) allow users to explore external factors that correlate with specific pathway state transitions. Results from a user study with twelve participants show that users were able to learn how to use Outflow easily with limited training and perform a range of tasks both accurately and rapidly.
This document discusses information visualization and its uses for knowledge discovery through visual representations of data and user interactions. It provides examples of visualizations of different data types, such as maps, networks, temporal data. Visualizations can benefit data analysis by helping detect patterns and trends, and aid presentation by helping communicate information. However, they also carry drawbacks if used to mislead. The document promotes visualization tools like ManyEyes for collaborative data analysis.
This document discusses using information visualization techniques in healthcare, specifically for electronic medical records (EMRs). It provides examples of systems like LifeLines and LifeFlow that visualize patient data longitudinally over time to help clinicians understand large amounts of patient data and identify patterns. Visualizations of EMR data can help improve healthcare quality by enabling faster decision making and better recall of patient information.
LifeFlow: Understanding Millions of Event Sequences in a Million PixelsKrist Wongsuphasawat
The document describes LifeFlow, a novel visualization tool that provides an overview and summary of millions of event sequence records. LifeFlow addresses challenges in displaying and exploring large event data at scale while preserving important information about all possible sequences and time gaps. It is demonstrated on medical and transportation event data use cases. LifeFlow supports exploration, identification of anomalies and errors, and asking richer questions of event sequence data.
Finding Comparable Temporal Categorical Records: A Similarity Measure with an...Krist Wongsuphasawat
1. The document proposes a new similarity measure called M&M (Match and Mismatch) for comparing temporal categorical records.
2. It also introduces an interactive visualization tool called Similan that uses the M&M measure and scatterplot visualization to help users find the most similar records to a target record.
3. An initial usability study found that Similan was easy to use but had some interface issues, and that the scatterplot effectively explained how records were dissimilar while giving an overview of the data. Ongoing work focuses on improving the similarity measure and interface.
Paper presentation at the Workshop on Visual Analytics in Healthcare in conjunction with the IEEE VisWeek 2011, Providence, RI, 2011.
Abstract:
Electronic Medical Record (EMR) databases contain a large number of temporal events, such as diagnosis dates for various symptoms. Analyzing disease progression pathways in terms of these observed events can provide important insights into how diseases evolve over time. Moreover, connecting these pathways to the eventual outcomes of the corresponding patients can help clinicians understand how certain progression paths may lead to better or worse outcomes. In this paper, we describe the Outflow visualization technique, designed to summarize temporal event data that has been extracted from the EMRs of a cohort of patients. We include sample analyses to show examples of the insights that can be learned from this visualization.
The document summarizes research on visualizing temporal categorical data from electronic health records. It describes tools like LifeLines for visualizing a single patient record, LifeLines2 for searching and comparing multiple records, and Similan for similarity-based search. A new tool called LifeFlow is introduced for aggregating and visualizing patterns across large numbers of records. The research was conducted over 10+ years at the University of Maryland by researchers including Krist Wongsuphasawat, Taowei David Wang, Catherine Plaisant, and Ben Shneiderman.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with "Financial Odyssey," our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
3. Krist Wongsuphasawat / @kristw
Computer Engineer (Bangkok, Thailand)
PhD in Computer Science, Univ. of Maryland (Information Visualization)
IBM, Microsoft
Data Visualization Scientist, Twitter
4. Krist Wongsuphasawat / @kristw
Adventure in data
A whirlwind tour of visualization projects at Twitter
9. Challenges
• Too much data; want only relevant Tweets
  • hashtag: #BRA
  • keywords: “goal”
• Need to aggregate & reduce size
• Long processing time (hours)
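The filter-then-aggregate step described above can be sketched roughly as follows. The tweet record format, the match rules, and the per-minute bucketing are illustrative assumptions, not Twitter's actual pipeline:

```python
from collections import Counter
from datetime import datetime

# Minimal sketch: keep only relevant Tweets (hashtag or keyword match),
# then reduce them to per-minute counts. Record format is hypothetical.
def is_relevant(tweet, hashtags=("#bra",), keywords=("goal",)):
    text = tweet["text"].lower()
    return any(h in text for h in hashtags) or any(k in text for k in keywords)

def aggregate_by_minute(tweets):
    """Reduce matching tweets to per-minute counts."""
    counts = Counter()
    for tweet in tweets:
        if is_relevant(tweet):
            ts = datetime.fromisoformat(tweet["created_at"])
            counts[ts.strftime("%Y-%m-%d %H:%M")] += 1
    return counts

tweets = [
    {"text": "GOAL! #BRA", "created_at": "2014-07-08T21:01:30"},
    {"text": "what a save", "created_at": "2014-07-08T21:01:45"},
    {"text": "another goal?!", "created_at": "2014-07-08T21:02:10"},
]
print(aggregate_by_minute(tweets))
# Counter({'2014-07-08 21:01': 1, '2014-07-08 21:02': 1})
```

At real scale this filtering and bucketing runs as a batch job over the firehose, which is where the hours-long processing time comes from.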
17. Projects
• Storytelling: to understand the world and share the stories
• Analytics Tools: to understand Twitter users and improve the service
• Creative: to showcase the data and inspire
18. Storytelling (1)
Events: World Cup, Election, Oscars, TV Shows, New Year, Earthquake, Super Bowl, Protest, …
Behaviors: Sleeping, Daylight saving, Language, Fasting, Information spread, Commute, …
58. Time + Text + Geo State of the Union
twitter.github.io/interactive/sotu2014
59. 1) Timeline + topics from Tweets
2) Context (speech)
3) Volume of Tweets by topic during the selected part of the SOTU
4) Density map of Tweets about the selected topic
93. Client event collection
Engineers & Data Scientists → Log data in Hadoop → Aggregate → 10,000+ event types

date      client  page  section  comp.  elem.  action      count
20141011  web     home  home     -      -      impression  100
20141011  web     home  wtf      -      -      click       20

(wtf = Who-to-Follow)
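The rollup behind the table above might look something like this in miniature: raw client events are grouped on the six-part event name plus date, producing one count per combination. The event record format and the `rollup` helper are hypothetical:

```python
from collections import Counter

# Sketch of the aggregation step: group raw client events by
# (date, client, page, section, component, element, action) and count.
# The records below are illustrative, not Twitter's actual log format.
def rollup(events):
    counts = Counter()
    for e in events:
        key = (e["date"], e["client"], e["page"], e["section"],
               e["component"], e["element"], e["action"])
        counts[key] += 1
    return counts

events = [
    {"date": "20141011", "client": "web", "page": "home", "section": "home",
     "component": "-", "element": "-", "action": "impression"},
    {"date": "20141011", "client": "web", "page": "home", "section": "home",
     "component": "-", "element": "-", "action": "impression"},
    {"date": "20141011", "client": "web", "page": "home", "section": "wtf",
     "component": "-", "element": "-", "action": "click"},
]
for key, count in rollup(events).items():
    print(*key, count)
```

With 10,000+ distinct event types, the output of this grouping is exactly the kind of table shown on the slide.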
99. client : page : section : component : element : action
Search → Find: web home * * impression*
Log data in Hadoop → Aggregate
Client event collection
Engineers & Data Scientists
105. client : page : section : component : element : action
Search → Find: web home * * impression*
Query → Return
Results:
web : home : home : - : - : impression
web : home : wtf : - : - : impression
Log data in Hadoop → Aggregate
search can be better (10,000+ event types)
not everybody knows the names (What are all sections under web:home?)
one graph / event, x 10,000
Client event collection
Engineers & Data Scientists
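The search step over event names could be sketched with simple wildcard matching on the colon-joined, six-part names, in the spirit of the query "web home * * impression*" above. The small `EVENTS` catalog and the exact pattern syntax are assumptions for illustration; the real system searches 10,000+ event types:

```python
from fnmatch import fnmatch

# Sketch of wildcard search over hierarchical event names.
# The catalog here is illustrative; the real one comes from the logs.
EVENTS = [
    "web:home:home:-:-:impression",
    "web:home:wtf:-:-:impression",
    "web:home:wtf:-:-:click",
    "iphone:home:home:-:-:impression",
]

def search(pattern, events=EVENTS):
    """Return every event name matching a glob-style pattern."""
    return [e for e in events if fnmatch(e, pattern)]

print(search("web:home:*:*:*:impression*"))
# ['web:home:home:-:-:impression', 'web:home:wtf:-:-:impression']
```

A query like `*:*:wtf:*:*:*` answers questions such as "what events exist under the wtf section?", which is the kind of lookup the slide says not everybody knows how to do.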
111. How to visualize?
See: narrow down
Interactions: search box => filter
client : page : section : component : element : action
Client event collection
Engineers & Data Scientists
126. Funnel analysis
banana : home : - : - : - : impression (home page)
banana : profile : - : - : - : impression (profile page)
banana : search : - : - : - : impression (search page)
Specify all funnels manually! n funnels → n jobs, n hours
127. Goal
banana : home : - : - : - : impression (home page)
…
1 job => all funnels, visualized
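The "1 job => all funnels" idea can be sketched as a single pass that counts, per user, every contiguous event pattern up to some length; any funnel's step counts can then be read off the resulting table. The log format and the `count_patterns` helper are illustrative:

```python
from collections import Counter

# Sketch: one pass over per-user event sequences counts how many users
# contain each contiguous pattern (up to max_len), so every funnel is
# answerable afterwards without a dedicated job per funnel.
def count_patterns(user_events, max_len=3):
    """user_events: {user: [event, event, ...]} in time order."""
    counts = Counter()
    for events in user_events.values():
        seen = set()
        for i in range(len(events)):
            for n in range(1, max_len + 1):
                pattern = tuple(events[i:i + n])
                if len(pattern) == n:
                    seen.add(pattern)
        for pattern in seen:
            counts[pattern] += 1  # count each pattern once per user
    return counts

logs = {
    "u1": ["home", "profile", "search"],
    "u2": ["home", "search"],
    "u3": ["home", "profile"],
}
counts = count_patterns(logs)
print(counts[("home",)], counts[("home", "profile")],
      counts[("home", "profile", "search")])
# 3 2 1
```

Reading the three numbers as a funnel: 3 users reached home, 2 of them went on to profile, and 1 continued to search; the same table answers any other funnel over these events.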
128. Related work
• Visualize an overview of event sequences [Wongsuphasawat et al. 2011, Monroe et al. 2013, …]
• Big data? eBay checkout sequences [Shen et al. 2013]
161. Final process
1. Define set of events
2. Pick alignment, direction and window size
3. Run Hadoop job (with more aggregation): gazillion patterns (TBs) → ~100,000 patterns (10MB)
4. Wait for it… (2+ hrs)
5. Visualize
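Steps 1-3 of the process above could look roughly like this in miniature: keep only the chosen event set, align each sequence on an anchor event, cut a fixed window after it, and count identical patterns. The event names, the `extract_patterns` helper, and the forward-only direction are assumptions for illustration:

```python
from collections import Counter

# Sketch of steps 1-3: filter to a chosen event set, align on an anchor
# event, take a fixed-size forward window, and count matching patterns.
def extract_patterns(sequences, event_set, align_on, window=2):
    counts = Counter()
    for seq in sequences:
        filtered = [e for e in seq if e in event_set]  # step 1: event set
        if align_on in filtered:
            i = filtered.index(align_on)               # step 2: alignment
            pattern = tuple(filtered[i:i + window + 1])  # forward window
            counts[pattern] += 1                       # step 3: aggregate
    return counts

sequences = [
    ["login", "home", "search", "tweet"],
    ["login", "home", "tweet", "logout"],
    ["home", "search", "tweet"],
]
counts = extract_patterns(
    sequences, {"login", "home", "search", "tweet"}, "login")
print(counts)
```

The same shrinking happens at scale: the Hadoop job collapses terabytes of raw sequences into a pattern table small enough (megabytes) to visualize interactively.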
163. Summary
• Large-scale user activity logs + visual analytics
• Used in day-to-day operations at Twitter
• Generalizes to smaller systems
Challenge: big data → aggregate & sacrifice → small data → visualize & interact
168. Projects
• Storytelling: to understand the world and share the stories
• Analytics Tools: to understand Twitter users and improve the service
• Creative: to showcase the data and inspire
• Reusable Toolkits: to implement once and for all
171. Conclusions
• Data are everywhere.
• Many applications: Journalism, Product development, Art, etc.
• Combine visualization with other skills: HCI, Design, Stats, ML, etc.
• Don’t repeat yourself.
Krist Wongsuphasawat / @kristw
interactive.twitter.com / kristw.yellowpigz.com