Matthew Russell's "Unleashing Twitter Data for Fun and Insight" presentation from Strata 2011. See http://strataconf.com/strata2011/public/schedule/detail/17714 for an overview of the talk.
Architectural Tradeoff in Learning-Based Software - Pooyan Jamshidi
In classical software development, developers write explicit instructions in a programming language to hardcode the behavior of a software system. By writing each line of code, the programmer steers the software toward the desired behavior, in effect selecting a specific point in program space.
Recently, however, software systems have begun to add learning components that, instead of hardcoding an explicit behavior, learn a behavior from data. These learning-intensive systems are written in terms of models whose parameters must be adjusted based on data. In learning-enabled systems, we specify constraints on the behavior of a desirable program (e.g., a data set of input-output example pairs) and use computational resources to search the program space for a program that satisfies those constraints. In neural networks, the search is restricted to a continuous subset of the program space.
This talk provides experimental evidence of making tradeoffs for deep neural network models, using the Deep Neural Network Architecture system as a case study. Concrete experimental results are presented; also featured are additional case studies in big data (Storm, Cassandra), data analytics (configurable boosting algorithms), and robotics applications.
Making Sense of Millions of Thoughts: Finding Patterns in the Tweets - Krist Wongsuphasawat
I gave this presentation at Workshop on Interactive Language Learning, Visualization, and Interfaces / ACL 2014 in Baltimore, MD on June 27, 2014.
http://nlp.stanford.edu/events/illvi2014/index.html
ABSTRACT
Every day on Twitter, millions of thoughts are captured and shared with the world in the form of 140-character messages, or Tweets. There is much we could learn from these thoughts if we could figure out a way to digest this gigantic dataset. Visualization is one of the many ways to extract information from these Tweets. In this presentation, I will talk about several visualizations based on Tweets, as well as share experiences and challenges from working with Tweet data.
Using Visualizations to Monitor Changes and Harvest Insights from a Global-sc... - Krist Wongsuphasawat
Slides from my talk at the IEEE Conference on Visual Analytics Science and Technology (VAST) 2014 in Paris, France.
ABSTRACT
Logging user activities is essential to data analysis for internet products and services.
Twitter has built a unified logging infrastructure that captures user activities across all clients it owns, making it one of the largest datasets in the organization.
This paper describes challenges and opportunities in applying information visualization to log analysis at this massive scale, and shows how various visualization techniques can be adapted to help data scientists extract insights.
In particular, we focus on two scenarios: (1) monitoring and exploring a large collection of log events, and (2) performing visual funnel analysis on log data with tens of thousands of event types.
Two interactive visualizations were developed for these purposes; we discuss design choices and the implementation of these systems, along with case studies of how they are being used in day-to-day operations at Twitter.
Business Models in the Data Economy: A Case Study from the Business Partner D... - Boris Otto
Data management seems to be experiencing a renaissance today. One particular trend in the so-called data economy has been the emergence of business models based on the provision of high-quality data. In this context, the paper examines the business models of business partner data providers and explores how and why these business models differ. Based on a study of six cases, the paper identifies three distinct business model patterns. A resource-based view is taken to explore the details of these patterns. Furthermore, the paper develops a set of propositions that help explain why the different business models evolved and how they may develop in the future. Finally, the paper discusses the ongoing market transformation process, which indicates a shift from traditional value chains toward value networks: a change which, if sustainable, would seriously threaten the business models of well-established data providers such as Dun & Bradstreet.
Tip from IBM Connect 2014: Socialytics = Social Business, Big Social Data and... - SocialBiz UserGroup
In this tip, speaker Scott Padget explains how socialytics provides customer and competitive insights as well as real-time operational insights. He introduces the SIFT (Social Intelligence Fusion Toolkit) Solution that funnels big social data into actionable business intelligence. Scott also describes the lifecycle of socialytics and gives a live demo. Obviously, the slides don’t capture the exact live demo, but they do show some screenshot examples of the SIFT Solution in action.
Data Driven PR: 8 Steps to Building Media Attention with Research - WalkerSands
Do you want to learn how your internal data can be used to gain media coverage in The New York Times, USA Today, and Mashable? Or how a simple consumer survey can lead to hundreds of new leads for your business?
Learn how in this presentation from Mike Santoro, President of tech PR firm Walker Sands, and Andrea Kempfer, Director of Marketing at market research firm Lab42.
The recorded presentation can be viewed at: http://www.walkersands.com/Data-Driven-PR-Webinar
Analyzing social conversation: a guide to data mining and data visualization - Tempero UK
These slides were presented by Mick Conroy of Tempero and Jonathan Stray of Associated Press/Overview Project as part of Social Media Week New York #smwnyc
In this talk we outline some of the key challenges in text analytics, describe some of Endeca's current research work in this area, examine the current state of the text analytics market and explore some of the prospects for the future.
A deck presented at the MRS 'Maximising the Value of Big Data' conference in London, January 2013.
Presents my view of big data and the potential it gives us for mapping the systems that we deal with on a day-to-day basis. Big data holds the promise of providing us with a meta-view of the systems that we all think we are so familiar with. I think we will find that the woods look nothing like the trees.
Learn How a New Kind of Marketing Mix Modeling is Better for Media Planning - ThinkVine
This presentation discusses the use of agent-based modeling and its proven advantages to media planners, including the abilities to create effective media plans based on consumer differences, accurately attribute results to media tactics, quantify long-term effects, and forecast sales and ROI results.
This presentation explains how brands can mine social media data, both text and images, in order to find insights about your customers and markets that can provide real business value.
Staying on the Right Side of the Fence when Analyzing Human Data - DataSift
Data is all around us and comes from many different sources. This data is generated by human behavior and it’s growing at an astonishing rate. Companies are collecting this data and using it in ways they could have never imagined.
This has created a sense of unease among people that their intimate information is no longer their own. Yet this data is central to companies' ability to better serve customers, so companies must find the balance and honor customers' privacy. How can we strike that balance?
Join this webinar and you will learn:
About the current and future challenges in this data-rich world
How to be a good guy, and still achieve your business objectives while analyzing Human Data
About PYLON for Facebook Topic Data and how you can build insights from Facebook while protecting user privacy
Paolo Bajardi's talk at the second session of the training course for union leaders "Le parole dell'innovazione e il lavoro", developed jointly by ISMEL and the Turin offices of CGIL, CISL, and UIL, and held between March and May 2019.
Mining Social Web APIs with IPython Notebook (PyCon 2014) - Matthew Russell
From the tutorial description at https://us.pycon.org/2014/schedule/presentation/134/ -
Description
Social websites such as Twitter, Facebook, LinkedIn, Google+, and GitHub have vast amounts of valuable insights lurking just beneath the surface, and this workshop minimizes the barriers to exploring and mining this valuable data by presenting turn-key examples from the thoroughly revised 2nd Edition of Mining the Social Web.
Abstract
This workshop teaches you fundamental data mining techniques as applied to popular social websites by adapting example code from Mining the Social Web (2nd Edition, O'Reilly 2013) in a tutorial-style step-by-step manner that is designed specifically to accommodate attendees with very little programming or domain experience. This workshop's extensive use of IPython Notebook facilitates interactive learning with turn-key examples against a Vagrant-based virtual machine that takes care of installing all 3rd party dependencies that are needed. The barriers to entry are truly minimal, which allows maximal use of the time to be spent on interactive learning.
The workshop is somewhat broadly designed and acclimates you to mining social data from Twitter, Facebook, LinkedIn, Google+, and GitHub APIs in five corresponding modules with the following memorable approach for each of them:
* Aspire - Set out to answer a question or test a hypothesis as part of a data science experiment
* Acquire - Collect and store the data that you need to answer the question or test the hypothesis
* Analyze - Use fundamental data mining techniques to explore and exploit the data
* Summarize - Present analytical findings in a compact and meaningful way
Each module begins with a brief period in which attendees customize the corresponding notebook with their own account credentials, with the remainder of the module devoted to learning what data is available from the API and to exercises demonstrating analysis of the data, all from a pre-populated IPython Notebook. Time will be set aside at the end of each module for attendees to hack on the code, discuss examples, and ask any lingering questions.
Python and Oracle: allies for best of data management - Laurent Leturgez
In this presentation, I describe Python and how it can interact with Oracle Database and Oracle Cloud Infrastructure in various projects: from data visualisation to data science.
Privacy, Ethics, and Future Uses of the Social Web - Matthew Russell
A presentation to the Owen Graduate School of Management (Vanderbilt University) about social media and some of the technology behind the future uses of social media that are likely to shape the future of the Web as we know it.
Why Twitter Is All The Rage: A Data Miner's Perspective (PyTN 2014) - Matthew Russell
Sunday 9:55 a.m.–10:45 a.m.
Why Twitter Is All the Rage: A Data Miner's Perspective
Presenter: Matthew Russell
Audience level: Novice
Description:
In order to be successful, technology must amplify a meaningful aspect of our human experience, and Twitter's success has largely depended on its ability to do this quite well. Although you could describe Twitter as just a "free, high-speed, global text-messaging service," that would miss the much larger point that Twitter scratches some of the most fundamental itches of our humanity.
Abstract:
This talk explains why Twitter is "all the rage" by examining Twitter in light of fundamental questions about our humanity:
* We want to be heard
* We want to satisfy our curiosity
* We want it easy
* We want it now
This session examines how Twitter addresses these desires and presents Twitter's underlying conceptual architecture as an interest graph.
Even if you have minimal programming skills, you'll come away empowered with the ability to think about data mining on Twitter in more effective ways and apply a powerful collection of easily adaptable recipes to fully exploit the 5 kilobytes of metadata that decorates those 140 characters that you commonly think of as a tweet. Learn how to access Twitter's API, search for tweets, discover trending topics, process tweets in real-time from the firehose, and much more.
Mining the Social Web for Fun and Profit: A Getting Started Guide - Matthew Russell
A presentation to the Nashville Data Science Meetup that introduces Mining the Social Web as an Open Source Software project/book, its virtual machine experience, the codebase, and a brief primer on data mining with Twitter
Why Twitter Is All the Rage: A Data Miner's Perspective - Matthew Russell
A presentation on data mining with Twitter that was originally presented as an O'Reilly webinar. See http://oreillynet.com/pub/e/2928 for the archived webinar video.
Mining Social Web APIs with IPython Notebook - Data Day Texas 2014 - Matthew Russell
Slides from a 2-hour workshop at Data Day Texas 2014 on how to mine social web APIs. This workshop specifically focused on extracting insight from Twitter data and was partitioned into two hour long segments. The first segment focused on familiarity with Twitter's API, while the latter segment focused on using pandas to extract insight from tweets from the firehose via the Streaming API.
Mining Social Web APIs with IPython Notebook (Strata 2013) - Matthew Russell
Slides from my Strata / Hadoop World 2013 (NYC) hands-on workshop.
Workshop Description from http://strataconf.com/stratany2013/public/schedule/detail/30863
Social web properties such as Twitter, Facebook, LinkedIn, and Google+ have vast amounts of valuable insights lurking just beneath the surface, and this workshop minimizes the barriers to exploring and mining this valuable data by presenting turn-key examples from Mining the Social Web (2nd Edition) with IPython Notebook.
Each module begins with a brief period in which attendees customize the corresponding notebook with their own account credentials, with the remainder of the module devoted to learning what data is available from the API and to exercises demonstrating analysis of the data, all from a pre-populated IPython Notebook. Even attendees with minimal programming experience should be able to walk away from this workshop with a working knowledge of the material and be equipped with sample code that can be easily repurposed given the design of this tutorial.
Time will be set aside at the end of each module’s follow-along presentation for attendees to hack on the code, discuss examples, and ask any lingering questions.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Accelerate your Kubernetes clusters with Varnish Caching - Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides from my and Rik Marselis's talk at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also held a lovely workshop with the participants, exploring different ways to think about quality and testing in different parts of the DevOps infinity loop.
JMeter webinar - integration with InfluxDB and Grafana - RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring of JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
GraphRAG is All You need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf - 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Unleashing Twitter Data for Fun and Insight
1. Unleashing Twitter Data for fun and insight
Matthew A. Russell
http://linkedin.com/in/ptwobrussell
@ptwobrussell
Agile Data Solutions
Mining the Social Web
3. Mining the Social Web
Chapters 1-5
Introduction: Trends, Tweets, and Twitterers
Microformats: Semantic Markup and Common Sense Collide
Mailboxes: Oldies but Goodies
Friends, Followers, and Setwise Operations
Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
4. Mining the Social Web
Chapters 6-10
LinkedIn: Clustering Your Professional Network For Fun (and Profit?)
Google Buzz: TF-IDF, Cosine Similarity, and Collocations
Blogs et al: Natural Language Processing (and Beyond)
Facebook: The All-In-One Wonder
The Semantic Web: A Cocktail Discussion
5. Overview
• Trends, Tweets, and Retweet Visualizations
• Friends, Followers, and Setwise Operations
• The Tweet, the Whole Tweet, and Nothing but the Tweet
6. Insight Matters
• What is @user's potential influence?
• What are @user's passions right now?
• Who are @user's most trusted friends?
7. Part 1: Tweets, Trends, and Retweet Visualizations
8. A point to ponder:
Twitter : Data :: JavaScript : Programming Languages (???)
19. Search Results (continued)
"profile_image_url": "http://a1.twimg.com/profile_images/80...",
"source": "<a href="http://twitter.com/&quo...",
"text": "im nt gonna go to sleep happy unless i see ...",
"to_user_id": null
}
... output truncated - 99 more tweets ...
],
"results_per_page": 100,
"since_id": 0
},
... output truncated - 4 more pages ...
]
20. Lexical Diversity
• Ratio of unique terms to total terms
• A measure of "stickiness"?
• A measure of "group think"?
• A crude indicator of retweets to originally authored tweets?
21. Distilling Tweet Text
>>> # search_results is already defined
>>> tweets = [ r['text']
... for result in search_results
... for r in result['results'] ]
>>> words = []
>>> for t in tweets:
... words += [ w for w in t.split() ]
...
22. Analyzing Data
23. Lexical Diversity
>>> len(words)
7238
>>> # unique words
>>> len(set(words))
1636
>>> # lexical diversity
>>> 1.0*len(set(words))/len(words)
0.22602928985907708
>>> # average number of words per tweet
>>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets)
14.476000000000001
24. Size Frequency Matters
• Counting: always the first step
• Simple but effective
• NLTK saves us a little trouble
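The slides lean on NLTK's `FreqDist` for this first counting step; as a self-contained stand-in, the stdlib's `collections.Counter` does the same job (the sample words are made up):

```python
from collections import Counter

# Count term frequencies -- always the first step in analyzing tweet text.
words = "the quick brown fox jumps over the lazy dog the end".split()
freq = Counter(words)
print(freq.most_common(3))  # most frequent terms first, e.g. ('the', 3)
```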
27. Tweet and RT were sitting on a fence.
Tweet fell off. Who was left?
28. RTs: past, present, & future
• Retweet: Tweeting a tweet that's already been tweeted
• RT or via followed by @mention
• Example: RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?
• Relatively new APIs were rolled out last year for retweeting sans
conventions
29. Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two
problems. -- Jamie Zawinski
30. Parsing Retweets
>>> example_tweets = ["Visualize Twitter search results w/ this simple script
http://bit.ly/cBu0l4 - Gist instructions http://bit.ly/9SZ2kb (via
@SocialWebMining @ptwobrussell)"]
>>> import re
>>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)",
... re.IGNORECASE)
>>> rt_origins = []
>>> for t in example_tweets:
... try:
... rt_origins += [mention.strip()
... for mention in rt_patterns.findall(t)[0][1].split()]
... except IndexError, e:
... pass
>>> [rto.strip("@") for rto in rt_origins]
41. Insight Matters
• What is my potential influence?
• Who are the most popular people in my network?
• Who are my mutual friends?
• What common friends/followers do I have with @user?
• Who is not following me back?
• What can I learn from analyzing my friendship cliques?
42. Getting Data
43. OAuth (1.0a)
import twitter
from twitter.oauth_dance import oauth_dance
# Get these from http://dev.twitter.com/apps/new
consumer_key, consumer_secret = 'key', 'secret'
(oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
consumer_key, consumer_secret)
auth=twitter.oauth.OAuth(oauth_token, oauth_token_secret,
consumer_key, consumer_secret)
t = twitter.Twitter(domain='api.twitter.com', auth=auth)
44. Getting Friendship Data
friend_ids = t.friends.ids(screen_name='timoreilly', cursor=-1)
follower_ids = t.followers.ids(screen_name='timoreilly', cursor=-1)
# store the data somewhere...
47. Rate Limits
• 350 requests/hr for authenticated requests
• 150 requests/hr for anonymous requests
• Coping mechanisms:
• Caching & Archiving Data
• Streaming API
• HTTP 400 codes
• See http://dev.twitter.com/pages/rate-limiting
48. The Beloved Fail Whale
• Twitter is sometimes "overcapacity"
• HTTP 503 Error
• Handle it just as any other HTTP error
• RESTfulness has its advantages
49. Abstraction Helps
friend_ids = []
wait_period = 2 # secs
cursor = -1
while cursor != 0:
response = makeTwitterRequest(t, # twitter.Twitter instance
t.friends.ids,
screen_name=screen_name,
cursor=cursor)
friend_ids += response['ids']
cursor = response['next_cursor']
# break out of loop early if you don't need all ids
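The `makeTwitterRequest` helper used above isn't defined in the slides; a hypothetical sketch of such a wrapper — retrying with exponential backoff on rate limits and "fail whale" 503s, per the preceding slides — might look like this (the `t` parameter is kept only to match the call signature on the slide):

```python
import time

def makeTwitterRequest(t, api_call, max_errors=5, wait_period=2, **kw):
    # Retry the API call, doubling the wait after each failure
    # (rate limiting, overcapacity, etc.), up to max_errors attempts.
    for _ in range(max_errors):
        try:
            return api_call(**kw)
        except Exception:  # e.g. twitter.api.TwitterHTTPError in practice
            time.sleep(wait_period)
            wait_period *= 2
    raise IOError('Too many consecutive Twitter API errors')
```

In practice you would catch the twitter package's specific HTTP error class and inspect the status code rather than catching `Exception`.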
50. Abstracting Abstractions
screen_name = 'timoreilly'
# This is what you ultimately want...
friend_ids = getFriends(screen_name)
follower_ids = getFollowers(screen_name)
51. Storing Data
54. A relational database?
import sqlite3 as sqlite
conn = sqlite.connect('data.db')
c = conn.cursor()
c.execute('''create table
friends...''')
c.execute('''insert into friends...
''')
# Lots of fun...sigh...
55. Redis (A Data Structures Server)
import redis
r = redis.Redis()
[ r.sadd("timoreilly$friend_ids", i) for i in friend_ids ]
r.smembers("timoreilly$friend_ids") # returns a set
Project page: http://redis.io
Windows binary: http://code.google.com/p/servicestack/wiki/RedisWindowsDownload
56. Redis Set Operations
• Key/value store...on typed values!
• Common set operations
• smembers, scard
• sinter, sdiff, sunion
• sadd, srem, etc.
• See http://code.google.com/p/redis/wiki/CommandReference
• Don't forget to $ easy_install redis
57. Analyzing Data
60. Count Your Blessings
# A utility function
def getRedisIdByScreenName(screen_name, key_name):
return 'screen_name$' + screen_name + '$' + key_name
# Number of friends
n_friends = r.scard(getRedisIdByScreenName(screen_name,
'friend_ids'))
# Number of followers
n_followers = r.scard(getRedisIdByScreenName(screen_name,
'follower_ids'))
61. Asymmetric Relationships
# Friends who aren't following back
friends_diff_followers = r.sdiffstore('temp', [
getRedisIdByScreenName(screen_name, 'friend_ids'),
getRedisIdByScreenName(screen_name, 'follower_ids')
])
# ... compute interesting things ...
r.delete('temp')
64. Sample Output
timoreilly is following 663
timoreilly is being followed by 1,423,704
131 of 663 are not following timoreilly back
1,423,172 of 1,423,704 are not being followed back by
timoreilly
timoreilly has 532 mutual friends
65. Who Isn't Following Back?
user_ids = [ ... ] # Resolve these to user info objects
while len(user_ids) > 0:
user_ids_str = ','.join([ str(i) for i in user_ids[:100] ])
user_ids = user_ids[100:]
response = t.users.lookup(user_id=user_ids_str)
if type(response) is dict: response = [response]
r.mset(dict([(getRedisIdByUserId(resp['id'], 'info.json'), json.dumps(resp))
for resp in response]))
r.mset(dict([(getRedisIdByScreenName(resp['screen_name'],'info.json'),
json.dumps(resp)) for resp in response]))
66. Friends in Common
# Assume we've harvested friends/followers and it's in Redis...
screen_names = ['timoreilly', 'mikeloukides']
r.sinterstore('temp$friends_in_common',
[getRedisIdByScreenName(screen_name, 'friend_ids')
for screen_name in screen_names])
r.sinterstore('temp$followers_in_common',
[getRedisIdByScreenName(screen_name,'follower_ids')
for screen_name in screen_names])
# Manipulate the sets
67. Potential Influence
• My followers?
• My followers' followers?
• My followers' followers' followers?
• for n in range(1, 7): # 6 degrees?
print "My " + "followers' "*n + "followers?"
71. Breadth-First Traversal
Create an empty graph
Create an empty queue to keep track of unprocessed nodes
Add the starting point to the graph as the "root node"
Add the root node to a queue for processing
Repeat until some maximum depth is reached or the queue is empty:
Remove a node from queue
For each of the node's neighbors:
If the neighbor hasn't already been processed:
Add it to the graph
Add it to the queue
Add an edge to the graph connecting the node & its neighbor
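The pseudocode above maps almost line-for-line onto Python. A self-contained sketch using a plain adjacency dict (NetworkX's `add_edge` would slot in the same way); the `followers` table is hypothetical data standing in for `getFollowers()`:

```python
from collections import deque

def bfs_build_graph(root, get_neighbors, max_depth=2):
    # graph maps node -> set of neighbors; its keys double as the
    # "already processed" bookkeeping from the pseudocode.
    graph = {root: set()}
    queue = deque([(root, 0)])  # (node, depth) pairs awaiting processing
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:   # stop expanding at the maximum depth
            continue
        for neighbor in get_neighbors(node):
            if neighbor not in graph:       # not processed yet
                graph[neighbor] = set()
                queue.append((neighbor, depth + 1))
            graph[node].add(neighbor)       # edge: node -> neighbor
    return graph

# Hypothetical follower lists standing in for real API calls
followers = {'timoreilly': ['a', 'b'], 'a': ['b', 'c'], 'b': [], 'c': []}
g = bfs_build_graph('timoreilly', lambda n: followers.get(n, []), max_depth=2)
```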
72. Breadth-First Harvest
next_queue = [ 'timoreilly' ] # seed node
d = 1
while d < depth:
d += 1
queue, next_queue = next_queue, []
for screen_name in queue:
follower_ids = getFollowers(screen_name=screen_name)
next_queue += follower_ids
getUserInfo(user_ids=next_queue)
73. The Most Popular Followers
freqs = {}
for follower in followers:
cnt = follower['followers_count']
if not freqs.has_key(cnt):
freqs[cnt] = []
freqs[cnt].append({'screen_name': follower['screen_name'],
'user_id': follower['id']})
popular_followers = sorted(freqs, reverse=True)[:100]
74. Average # of Followers
all_freqs = [k for k in freqs for user in freqs[k]]
avg = sum(all_freqs) / len(all_freqs)
75. @timoreilly's Popular Followers
The top 10 followers from the sample:
aplusk 4,993,072
BarackObama 4,114,901
mashable 2,014,615
MarthaStewart 1,932,321
Schwarzenegger 1,705,177
zappos 1,689,289
Veronica 1,612,827
jack 1,592,004
stephenfry 1,531,813
davos 1,522,621
76. Futzing the Numbers
• The average number of timoreilly's followers' followers: 445
• Discarding the top 10 lowers the average to around 300
• Discarding any follower with less than 10 followers of their
own increases the average to over 1,000!
• Doing both brings the average to around 800
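A minimal sketch of that futzing on made-up follower counts, showing how trimming each end of the distribution moves the average in opposite directions:

```python
def average(xs):
    return float(sum(xs)) / len(xs)

# Hypothetical counts: a few huge accounts, a long tail of tiny ones
counts = sorted([5000000, 2000000, 300, 250, 200, 150, 5, 3, 2, 1],
                reverse=True)

raw = average(counts)                               # dominated by the top
no_top = average(counts[2:])                        # discard biggest accounts
no_tiny = average([c for c in counts if c >= 10])   # discard tiny accounts
```

Dropping the celebrity accounts collapses the average, while dropping near-empty accounts inflates it — the same effect described in the bullets.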
78. Friendship Graphs
for i in ids: #ids is timoreilly's id along with friend ids
info = json.loads(r.get(getRedisIdByUserId(i, 'info.json')))
screen_name = info['screen_name']
friend_ids = list(r.smembers(getRedisIdByScreenName(screen_name,
'friend_ids')))
for friend_id in [fid for fid in friend_ids if fid in ids]:
friend_info = json.loads(r.get(getRedisIdByUserId(friend_id, 'info.json')))
g.add_edge(screen_name, friend_info['screen_name'])
nx.write_gpickle(g, 'timoreilly.gpickle') # see also nx.read_gpickle
80. Calculating Cliques
cliques = [c for c in nx.find_cliques(g)]
num_cliques = len(cliques)
clique_sizes = [len(c) for c in cliques]
max_clique_size = max(clique_sizes)
avg_clique_size = sum(clique_sizes) / num_cliques
max_cliques = [c for c in cliques if len(c) == max_clique_size]
num_max_cliques = len(max_cliques)
people_in_every_max_clique = list(reduce(
lambda x, y: x.intersection(y),[set(c) for c in max_cliques]
))
81. Cliques for @timoreilly
Num cliques: 762573
Avg clique size: 14
Max clique size: 26
Num max cliques: 6
Num people in every max clique: 20
83. Graphs, etc
• Your first instinct is naturally
G = (V, E) ?
84. Dorling Cartogram
• A location-aware bubble chart (ish)
• At least 3-dimensional
• Position, color, size
• Look at friends/followers by state
85. Sunburst of Friends
• A very compact visualization
• Slice and dice friends/followers by
gender, country, locale, etc.
86. Part 3:
The Tweet, the Whole Tweet, and
Nothing but the Tweet
87. Insight Matters
• Which entities frequently appear in @user's tweets?
• How often does @user talk about specific friends?
• Who does @user retweet most frequently?
• How frequently is @user retweeted (by anyone)?
• How many #hashtags are usually in @user's tweets?
93. Entities & Annotations
• Entities
• Opt-in now but will "soon" be standard
• $ easy_install twitter_text
• Annotations
• User-defined metadata
• See http://dev.twitter.com/pages/annotations_overview
94. Manual Entity Extraction
import twitter_text
extractor = twitter_text.Extractor(tweet['text'])
mentions = extractor.extract_mentioned_screen_names_with_indices()
hashtags = extractor.extract_hashtags_with_indices()
urls = extractor.extract_urls_with_indices()
# Splice info into a tweet object
98. As easy as sitting on the couch
• Get it - http://www.couchone.com/get
• Install it
• Relax - http://localhost:5984/_utils/
• Also - $ easy_install couchdb
99. Storing Timeline Data
import couchdb
import twitter
TIMELINE_NAME = "user" # or "home" or "public"
t = twitter.Twitter(domain='api.twitter.com', api_version='1')
server = couchdb.Server('http://localhost:5984')
DB = 'tweets-' + TIMELINE_NAME + '-timeline' # any database name will do
db = server.create(DB)
MAX_PAGES = 15
page_num = 1
while page_num <= MAX_PAGES:
api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline')
tweets = makeTwitterRequest(t, api_call, page=page_num)
db.update(tweets, all_or_nothing=True)
print 'Fetched %i tweets' % len(tweets)
page_num += 1
106. Filtering Tweet Entities
• Let's find out how often someone talks about
specific friends
• We have friend info on hand
• We've extracted @mentions from the tweets
• Let's count friend vs. non-friend mentions
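With @mentions and friend ids both on hand, the tally is two set operations. A sketch on hypothetical data (in the slides the ids come out of Redis via `smembers`):

```python
# Hypothetical data: screen names mentioned in tweets, and known friends
mentions = ['ahier', 'gnat', 'n2vip', 'nytimes', 'gnat']
friends = {'ahier', 'gnat', 'nytimes', 'make'}

mentioned = set(mentions)
friend_mentions = mentioned & friends        # mentions who are friends
non_friend_mentions = mentioned - friends    # mentions who are not

print(len(mentioned), len(friend_mentions), len(non_friend_mentions))
```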
107. @timoreilly's friend mentions
Number of @user entities in tweets: 20
Number of @user entities in tweets who are friends: 18
Number of @user entities in tweets who are not friends: 2
n2vip, timoreilly, ahier, andrewsavikas, pkedrosky, gnat, CodeforAmerica,
slashdot, nytimes, OReillyMedia, brady, dalepd, carlmalamud, mikeloukides,
pahlkadot, monkchips, make, fredwilson, jamesoreilly, digiphile
112. Retweet Counts
• An API resource /statuses/retweet_count exists (and is now functional)
• Example: http://twitter.com/statuses/show/29016139807.json
• retweet_count
• retweeted
123. Hashtag Analysis
• TeaParty: ~ 5 hashtags per tweet.
• Example: “Rarely is the question asked: Is our children
learning?” - G.W. Bush #p2 #topprog #tcot #tlot #teaparty
#GOP #FF
• JustinBieber: ~ 2 hashtags per tweet
• Example: #justinbieber is so coool
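The per-tweet hashtag averages above come from simple counting. A sketch using the stdlib — a simple regex stands in for the `twitter_text` extractor used earlier, and the sample tweets are made up:

```python
import re

HASHTAG = re.compile(r'#\w+')  # rough approximation of a hashtag

tweets = [
    'Is our children learning? #p2 #tcot #teaparty #GOP #FF',
    '#justinbieber is so coool',
]
# Average number of hashtags per tweet
avg = float(sum(len(HASHTAG.findall(t)) for t in tweets)) / len(tweets)
print(avg)  # (5 + 1) / 2 = 3.0
```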
128. Juxtaposing Friendships
• Harvest search results for #JustinBieber and #TeaParty
• Get friend ids for each @mention with /friends/ids
• Resolve screen names with /users/lookup
• Populate a NetworkX graph
• Analyze it
• Visualize with Graphviz