The document discusses analyzing social networks and Twitter data using Python. It provides an introduction to analyzing the Twitter network of the user @clouderati and its 2,072 followers. The presentation covers topics like mentions, hashtags, retweets, and constructing a social graph to analyze connections and groups (cliques). The goal is to illustrate how to work with Twitter API objects and data to explore social network analysis.
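The clique analysis mentioned above can be sketched in plain Python. This is a minimal Bron-Kerbosch implementation over a made-up follower graph; the account names are placeholders, not real @clouderati followers:

```python
# Hypothetical "mutual follow" graph among a handful of invented accounts.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
         ("cat", "dan"), ("dan", "eve")]

# Build an undirected adjacency map.
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def maximal_cliques(adj):
    """Bron-Kerbosch without pivoting: yields each maximal clique once."""
    def bk(R, P, X):
        if not P and not X:
            yield sorted(R)
        for v in list(P):
            yield from bk(R | {v}, P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)
    yield from bk(set(), set(adj), set())

cliques = sorted(maximal_cliques(adj))
print(cliques)  # the tightest group here is ['ann', 'bob', 'cat']
```

With real Twitter data the edges would come from follower lists fetched via the API, but the clique-finding step is the same.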
R, Data Wrangling & Kaggle Data Science Competitions - Krishna Sankar
Presentation for my tutorial at Big Data Tech Con http://goo.gl/ZRoFHi
This is the R version of my PyCon tutorial, plus a few updates.
It is a work in progress; I will update it with daily snapshots until done.
Companies are finding that data can be a powerful differentiator and are investing heavily in infrastructure, tools, and personnel to ingest and curate raw data into an "analyzable" form. This process of data curation is called "Data Wrangling".
The task can be very cumbersome and requires trained personnel. However, with advances in open source and commercial tooling, the process has become much easier and the technical expertise required to do it effectively has dropped several notches.
In this tutorial, we will get a feel for what data wranglers do, using R, RStudio, Trifacta Wrangler, and OpenRefine, with hands-on exercises available at http://akuntamukkala.blogspot.com/2016/05/data-wrangling-examples.html
Natural Language Search with Knowledge Graphs (Haystack 2019) - Trey Grainger
To optimally interpret most natural language queries, it is necessary to understand the phrases, entities, commands, and relationships represented or implied within the search. Knowledge graphs serve as useful instantiations of ontologies which can help represent this kind of knowledge within a domain.
In this talk, we'll walk through techniques to build knowledge graphs automatically from your own domain-specific content, how you can update and edit the nodes and relationships, and how you can seamlessly integrate them into your search solution for enhanced query interpretation and semantic search. We'll have some fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "bbq near haystack" into:
{ "filter": ["doc_type:restaurant"], "query": { "boost": { "b": "recip(geodist(38.034780,-78.486790),1,1000,1000)", "query": "bbq OR barbeque OR barbecue" } } }
We'll also specifically cover use of the Semantic Knowledge Graph, a particularly interesting knowledge graph implementation available within Apache Solr that can be auto-generated from your own domain-specific content and which provides highly-nuanced, contextual interpretation of all of the terms, phrases and entities within your domain. We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding within your search engine.
Closing keynote by Trey Grainger from Activate 2018 in Montreal, Canada. Covers trends in the intersection of Search (Information Retrieval) and Artificial Intelligence, and the underlying capabilities needed to deliver those trends at scale.
The Relevance of the Apache Solr Semantic Knowledge Graph - Trey Grainger
The Semantic Knowledge Graph is an Apache Solr plugin that can be used to discover and rank the relationships between any arbitrary queries or terms within the search index. It is a relevancy Swiss Army knife, able to discover related terms and concepts, disambiguate different meanings of terms given their context, clean up noise in datasets, discover previously unknown relationships between entities across documents and fields, rank lists of keywords based upon conceptual cohesion to reduce noise, summarize documents by extracting their most significant terms, generate recommendations and personalized search, and power numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. This talk will walk you through how to set up and use this plugin in concert with other open source tools (a probabilistic query parser, SolrTextTagger for entity extraction) to parse, interpret, and model the true intent of user searches much more accurately than traditional keyword-based search approaches.
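The foreground-versus-background comparison at the heart of the Semantic Knowledge Graph can be illustrated with a toy score in pure Python. This is only a sketch of the idea, not Solr's actual relatedness() formula, and the corpus below is invented:

```python
# Toy corpus: each document is a set of terms. In Solr, the "background"
# set would be the whole index and the "foreground" the documents that
# match a query.
docs = [
    {"doctor", "hospital", "nurse"},
    {"doctor", "medicine", "patient"},
    {"driver", "car", "engine"},
    {"driver", "truck", "engine"},
    {"doctor", "patient", "hospital"},
]

def relatedness(term, query_term, docs):
    """Toy foreground-vs-background score: how much more often `term`
    appears alongside `query_term` than in the corpus overall.
    Illustrates the idea only; Solr's scoring formula differs."""
    fore = [d for d in docs if query_term in d]
    fg = sum(term in d for d in fore) / max(len(fore), 1)  # P(term | query_term)
    bg = sum(term in d for d in docs) / len(docs)          # P(term)
    return round(fg - bg, 3)

print(relatedness("hospital", "doctor", docs))  # 0.267 (related)
print(relatedness("engine", "doctor", docs))    # -0.4 (unrelated)
```

Terms that over-occur in the foreground relative to the background surface as conceptually related; terms that under-occur are penalized, which is what lets the plugin disambiguate and rank terms in context.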
Building a semantic search system - one that can correctly parse and interpret end-user intent and return the ideal results for users’ queries - is not an easy task. It requires semantically parsing the terms, phrases, and structure within queries, disambiguating polysemous terms, correcting misspellings, expanding to conceptually synonymous or related concepts, and rewriting queries in a way that maps the correct interpretation of each end user’s query into the ideal representation of features and weights that will return the best results for that user. Not only that, but the above must often be done within the confines of a very specific domain - ripe with its own jargon and linguistic and conceptual nuances.
This talk will walk through the anatomy of a semantic search system and how each of the pieces described above fit together to deliver a final solution. We'll leverage several recently-released capabilities in Apache Solr (the Semantic Knowledge Graph, Solr Text Tagger, Statistical Phrase Identifier) and Lucidworks Fusion (query log mining, misspelling job, word2vec job, query pipelines, relevancy experiment backtesting) to show you an end-to-end working Semantic Search system that can automatically learn the nuances of any domain and deliver a substantially more relevant search experience.
Thought Vectors and Knowledge Graphs in AI-powered Search - Trey Grainger
While traditional keyword search is still useful, pure text-based keyword matching is quickly becoming obsolete; today, it is a necessary but not sufficient tool for delivering relevant results and intelligent search experiences.
In this talk, we'll cover some of the emerging trends in AI-powered search, including the use of thought vectors (multi-level vector embeddings) and semantic knowledge graphs to contextually interpret and conceptualize queries. We'll walk through some live query interpretation demos to demonstrate the power that can be delivered through these semantic search techniques leveraging auto-generated knowledge graphs learned from your content and user interactions.
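The embedding-similarity intuition behind thought vectors can be sketched with a toy cosine-similarity calculation. The three-dimensional vectors below are hand-made for illustration, not learned embeddings:

```python
# Hand-crafted toy "embeddings": conceptually similar terms get nearby
# vectors, so their cosine similarity is high.
embeddings = {
    "bbq":      [0.9, 0.1, 0.0],
    "barbecue": [0.85, 0.15, 0.05],
    "laptop":   [0.0, 0.2, 0.95],
}

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: sum(a * a for a in w) ** 0.5
    return dot / (norm(u) * norm(v))

sim_same = cosine(embeddings["bbq"], embeddings["barbecue"])
sim_diff = cosine(embeddings["bbq"], embeddings["laptop"])
print(round(sim_same, 3), round(sim_diff, 3))  # near 1.0 vs near 0.0
```

In a real system the vectors would be learned from content and user interactions, and query terms would be expanded toward their nearest neighbors in the embedding space.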
Natural Language Search with Knowledge Graphs (Activate 2019) - Trey Grainger
To optimally interpret most natural language queries, it's important to have a highly-nuanced, contextual interpretation of the domain-specific phrases, entities, commands, and relationships represented or implied within the search and within your domain.
In this talk, we'll walk through such a search system powered by Solr's Text Tagger and Semantic Knowledge Graph. We'll have fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "best bbq near activate" into:
{!func}mul(min(popularity,1),100) bbq^0.91032 ribs^0.65674 brisket^0.63386 doc_type:"restaurant" {!geofilt d=50 sfield="coordinates_pt" pt="38.916120,-77.045220"}
We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding like this within your search engine.
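The kind of rewrite shown above can be sketched with a toy, hand-built knowledge graph. The entries and weights below are invented, standing in for what the Text Tagger and Semantic Knowledge Graph would learn from real content:

```python
# A hypothetical mini knowledge graph: entity types, related terms with
# weights, and a location for the recognized event.
knowledge = {
    "activate": {"type": "event", "location": "38.916120,-77.045220"},
    "bbq":      {"type": "food", "related": {"ribs": 0.657, "brisket": 0.634}},
}

def interpret(query):
    """Rewrite recognized entities into Solr-style clauses (illustrative only)."""
    clauses = []
    for token in query.split():
        entry = knowledge.get(token)
        if entry is None:
            continue  # glue words like "best" / "near" are skipped in this sketch
        if entry["type"] == "food":
            clauses.append(token)
            clauses += [f"{t}^{w}" for t, w in entry["related"].items()]
        elif entry["type"] == "event":
            clauses.append(f'{{!geofilt d=50 sfield="coordinates_pt" pt="{entry["location"]}"}}')
    return " ".join(clauses)

print(interpret("best bbq near activate"))
```

A production system would of course resolve entities statistically rather than by exact token lookup, but the shape of the transformation, from surface words to typed clauses, is the same.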
The Next Generation of AI-powered Search - Trey Grainger
What does it really mean to deliver an "AI-powered Search" solution? In this talk, we’ll bring clarity to this topic, showing you how to marry the art of the possible with the real-world challenges involved in understanding your content, your users, and your domain. We'll dive into emerging trends in AI-powered Search, as well as many of the stumbling blocks found in even the most advanced AI and Search applications, showing how to proactively plan for and avoid them. We'll walk through the various uses of reflected intelligence and feedback loops for continuous learning from user behavioral signals and content updates, also covering the increasing importance of virtual assistants and personalized search use cases found within the intersection of traditional search and recommendation engines. Our goal will be to provide a baseline of mainstream AI-powered Search capabilities available today, and to paint a picture of what we can all expect just on the horizon.
Balancing the Dimensions of User Intent - Trey Grainger
The first step in returning relevant search results is successfully interpreting the user’s intent. This requires combining a holistic understanding of your content, your users, and your domain. Traditional keyword search focuses on the content understanding dimension. Knowledge graphs are then typically built and leveraged to represent an understanding of your domain. Finally, collaborative recommendations and user profile learning are typically the tools of choice for generating and modeling an understanding of the preferences of each user.
While these systems (search, recommendations, and knowledge graphs) are often built and used in isolation, combining them together is the key to truly understanding a user’s query intent. For example, combining traditional keyword search with your knowledge graph leads to semantic search capabilities, and combining traditional keyword search with recommendations leads to personalized search experiences. Combining all of these dimensions together in an appropriately balanced way will ultimately lead to the most accurate interpretation of a user’s query, resulting in a better query to the core search engine and ultimately a better, more relevant search experience.
In this talk, we’ll demonstrate strategies for delivering and combining each of these dimensions of user intent, and we’ll walk through concrete examples of how to balance the nuances of each so that you don’t over-personalize, over-contextualize, or under-appreciate the nuances of your user’s intent.
Convolutional Neural Networks and Natural Language Processing - Thomas Delteil
Presentation on Convolutional Neural Networks and their application to Natural Language Processing. In-depth walk-through of the Crepe architecture from Xiang Zhang, Junbo Zhao, and Yann LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems 28 (NIPS 2015).
Loosely based on ODSC London 2016 talk: https://www.slideshare.net/MiguelFierro1/deep-learning-for-nlp-67182819
Code: https://github.com/ThomasDelteil/TextClassificationCNNs_MXNet
Demo: https://thomasdelteil.github.io/TextClassificationCNNs_MXNet/
(flattened PDF, no animations; email the author for the .pptx)
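The character-level quantization step at the input of the Crepe model can be sketched in a few lines. The alphabet and sequence length are shortened here; the paper uses a 70-character alphabet and input length 1014:

```python
# One-hot "quantization" of text into a fixed-length sequence of
# character vectors, as done at the input of character-level CNNs.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # shortened for illustration
MAXLEN = 8                                 # paper uses 1014

def quantize(text):
    index = {c: i for i, c in enumerate(ALPHABET)}
    rows = []
    for ch in text.lower()[:MAXLEN]:
        row = [0] * len(ALPHABET)
        if ch in index:                    # unknown chars stay all-zero
            row[index[ch]] = 1
        rows.append(row)
    # Pad with all-zero vectors up to the fixed length.
    rows += [[0] * len(ALPHABET) for _ in range(MAXLEN - len(rows))]
    return rows

m = quantize("cnn!")
print(len(m), len(m[0]))  # 8 27
```

The resulting MAXLEN x alphabet-size matrix is what the first convolutional layer slides over, treating text much like a one-dimensional image.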
Natural Language Search with Knowledge Graphs (Chicago Meetup) - Trey Grainger
To optimally interpret most natural language queries, it's important to have a highly-nuanced, contextual interpretation of the domain-specific phrases, entities, commands, and relationships represented or implied within the search and within your domain.
In this talk, we'll walk through such a search system powered by Solr's Text Tagger and Semantic Knowledge Graph. We'll have fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "best bbq near activate" into:
{!func}mul(min(popularity,1),100) bbq^0.91032 ribs^0.65674 brisket^0.63386 doc_type:"restaurant" {!geofilt d=50 sfield="coordinates_pt" pt="38.916120,-77.045220"}
We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding like this within your search engine.
Building Interpretable & Secure AI Systems using PyTorch - geetachauhan
Slides from my talk at Deep Learning World 2020. The talk covered use cases, special challenges, and solutions for building interpretable and secure AI systems using PyTorch.
- Tools for building interpretable models
- How to build secure, privacy-preserving AI models with PyTorch
- Use cases and insights from the field
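One small piece of such a pipeline, misspelling correction, can be sketched with the standard library alone. In the talk this is learned from query logs; here we just fuzzy-match against a tiny hand-made vocabulary:

```python
import difflib

# Hypothetical domain vocabulary; in practice this would be mined from
# the index and from query logs.
vocabulary = ["barbecue", "brisket", "restaurant", "semantic"]

def correct(word, cutoff=0.7):
    """Return the closest vocabulary term above the similarity cutoff,
    or the word unchanged if nothing is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("barbeque"))   # barbecue
print(correct("restraunt"))  # restaurant
print(correct("zzzz"))       # zzzz (no close match, left as-is)
```

Real systems combine this kind of edit-distance matching with click and reformulation signals from logs, but the interface, a query term in, a corrected term out, is the same step in the pipeline.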
apidays LIVE Australia 2021 - Tracing across your distributed process boundaries using OpenTelemetry - apidays
apidays LIVE Australia 2021 - Accelerating Digital
September 15 & 16, 2021
Tracing across your distributed process boundaries using OpenTelemetry
Dasith Wijes, Senior Consultant at Microsoft (Azure Cloud & AI Team)
Social Media Data Collection & Analysis - Scott Sanders
A non-technical primer on how to collect and analyze social media data. This was an invited lecture for the Biostatistics and Bioinformatics Department in the School of Public Health at the University of Louisville.
Taking Jupyter Notebooks and Apache Spark to the Next Level: PixieDust with Da... - Databricks
PixieDust is a new open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like: auto-visualization of Spark DataFrames, real-time Spark job progress monitoring, automated local install of Python and Scala kernels running with Spark, and much more.
Come along and learn how you can use this tool in your own projects to visualize and explore data effortlessly with no coding. Oh, and if you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being able to visualize your favorite Python chart engines from a Scala Notebook!
We’ll finish the session with a demo combining Twitter, Watson Tone Analyzer, Spark Streaming, and some fun real-time visualizations–all running within a Notebook.
MuCon 2019: Exploring Your Microservices Architecture Through Network Science... - OpenCredo
Your microservice system has been up and running for a while. You know you’ve diligently employed every ounce of your experience and knowledge over time to design a sensible application architecture, with hopefully sensible boundaries. But time is now throwing new questions your way: Are my boundaries still sensible?
Have any anti-patterns crept in? Do I have the dreaded distributed monolith?
This talk explores how network science techniques can be applied to help gain insight into, and explore questions about, your microservices architecture.
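One such technique, detecting dependency cycles that hint at a distributed monolith, can be sketched in plain Python over a made-up service call graph:

```python
# Hypothetical call graph between services; a cycle between services is
# one smell that boundaries have eroded toward a distributed monolith.
calls = {
    "orders":        ["billing", "inventory"],
    "billing":       ["notifications"],
    "inventory":     ["orders"],   # cycle: orders -> inventory -> orders
    "notifications": [],
}

def find_cycle(graph):
    """Depth-first search with three-color marking; returns one
    dependency cycle as a list of services, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}
    def dfs(v, path):
        color[v] = GRAY
        for w in graph[v]:
            if color[w] == GRAY:               # back edge: cycle found
                return path[path.index(w):] + [w]
            if color[w] == WHITE:
                found = dfs(w, path + [w])
                if found:
                    return found
        color[v] = BLACK
        return None
    for v in graph:
        if color[v] == WHITE:
            found = dfs(v, [v])
            if found:
                return found
    return None

print(find_cycle(calls))  # ['orders', 'inventory', 'orders']
```

In practice the call graph would be extracted from tracing or service-mesh data rather than hand-written, and network-science measures like centrality and community structure add further insight beyond cycle detection.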
Discovery and Open Data: slides from the #discopen session at the JISC cross-programme meeting in April 2012. Author: Amber Thomas, JISC. Discusses the data space around discovery issues in education and research, with a focus on open data. CC BY; please see slide 2 for permissions.
Outlines the vision and philosophy for Wakari.io, with a basic overview of popular Python data analysis packages. Most of the talk was conducted in Wakari and is not visible on these slides. 90 minutes, for PyData NYC, November 8th, 2013.
Python is the language of choice for data analysis.
The aim of this deck is to provide a comprehensive learning path for people new to Python for data analysis, covering the steps you need to take to use Python effectively for that purpose.
Big Data with IoT: approach and trends with case study - Sharjeel Imtiaz
Covers the Big Data with IoT approach and current trends. It gives you complete exposure to the data science process and shows, step by step, how that process explores big data through a TripAdvisor case study.
Agile Data Rationalization for Operational Intelligence - Inside Analysis
The Briefing Room with Eric Kavanagh and Phasic Systems
Live Webcast Mar. 26, 2013
The complexity of today's information architectures creates a wide range of challenges for executives trying to get a strategic view of their current operations. The data and context locked in operational systems often get diluted during the normalization processes of data warehousing and other types of analytic solutions. And the ultimate goal of seeing the big picture gets derailed by a basic inability to reconcile disparate organizational views of key information assets and rules.
Register for this episode of The Briefing Room to learn from Bloor Group CEO Eric Kavanagh, who will explain how a tightly controlled methodology can be combined with modern NoSQL technology to resolve both process and system complexities, thus enabling a much richer, more interconnected information landscape. Kavanagh will be briefed by Geoffrey Malafsky of Phasic Systems who will share his company's tested methodology for capturing and managing the business and process logic that run today's data-driven organizations. He'll demonstrate how a “don't say no” approach to entity definitions can dissolve previously intractable disagreements, opening the door to clear, verifiable operational intelligence.
Visit: http://www.insideanalysis.com
OSCON 2013 - Planning an OpenStack Cloud - Tom Fifield - OSCON Byrum
The flexibility of OpenStack is a double-edged sword, giving you unprecedented control over your infrastructure, but potentially becoming a nightmare for the indecisive manager, architect or sysadmin!
In this presentation, Tom Fifield – co-author of the OpenStack Operations Guide, and Community Manager at the OpenStack Foundation – takes you through some of the decisions you will face when planning your OpenStack cloud. In addition to a brief introduction on OpenStack and advice on how to interact with the community, he will cover topics such as:
How to approach your deployment, ranging from DIY to a turn-key solution from the ecosystem
Storage and networking decisions, including plugin options
Automating deployment and configuration with popular tools like Puppet and Chef
Through discussion of the ecosystem, customization and scaling, you’ll walk away with an understanding of ‘what it takes’ to build your OpenStack cloud.
Protecting Open Innovation with the Defensive Patent License - OSCON Byrum
The Defensive Patent License (DPL) is a new legal mechanism to protect innovators by creating a patent network that is committed to defense and "de-weaponizing" patents. It draws from the theories and values of F/OSS licensing to create obligations that "travel with the patent", preventing trolls from taking over open technologies and pulling them out of the public domain.
Using Cascalog to build an app with City of Palo Alto Open Data - OSCON Byrum
"Using Cascalog to build an app with City of Palo Alto Open Data" by Paco Nathan, presented at OSCON 2013 in Portland. Based on a case study from "Enterprise Data Workflows with Cascading" http://shop.oreilly.com/product/0636920028536.do
Finite State Machines are overlooked at best, ignored at worst, and virtually always dismissed. This is tragic, since FSMs are not just about door locks (the most commonly used example). On the contrary, FSMs are invaluable in clearly defining communication protocols – ranging from low-level web services through complex telephony applications to reliable interactions between loosely-coupled systems. Properly using them can significantly enhance the stability and reliability of your systems.
Join me as I take you through a crash course in FSMs, using Erlang's gen_fsm behavior as the background, and hopefully leaving you with a better appreciation of both FSMs and Erlang in the process.
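The gen_fsm examples in the talk are in Erlang, but the core idea, a transition table keyed by (state, event), can be sketched in a few lines of Python using the classic turnstile/door-lock example:

```python
# A minimal FSM loosely mirroring the shape of an Erlang gen_fsm:
# a table of (state, event) -> next_state, with invalid events rejected.
TRANSITIONS = {
    ("locked", "coin"): "unlocked",
    ("unlocked", "push"): "locked",
}

class Turnstile:
    def __init__(self):
        self.state = "locked"

    def send(self, event):
        """Apply an event; return True if it caused a valid transition."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            return False               # reject events invalid in this state
        self.state = TRANSITIONS[key]
        return True

t = Turnstile()
assert not t.send("push")              # can't push while locked
t.send("coin")
print(t.state)                         # unlocked
```

The same table-driven shape scales from door locks to protocol handlers: every legal (state, event) pair is explicit, and anything else is rejected rather than silently mishandled.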
OpenCar covers OS development for a new market: automotive apps. In-car apps are poised to explode for open source developers. The market is transforming from an inefficient, proprietary model to an HTML5-based “app store” model. To enter and participate in this new target category, developers need access to automakers, automotive systems, and knowledge of industry standards and platforms. http://sdk.opencar.com
How we built our community using GitHub - Uri Cohen - OSCON Byrum
The journey of GigaSpaces as a company in building the Cloudify open source product: what worked and what didn't, and how it used GitHub as the platform for much more than just hosting the code.
The Vanishing Pattern: from iterators to generators in PythonOSCON Byrum
The core of the talk is refactoring a simple iterable class from the classic Iterator design pattern (as implemented in the GoF book) to compatible but less verbose implementations using generators. This provides a meaningful context to understand the value of generators. Along the way the behavior of the iter function, the Sequence protocol and the Iterable interface are presented. The motivating examples of this talk are database applications.
This talk covers why Apache Zookeeper is a good fit for coordinating processes in a distributed environment, prior Python attempts at a client and the current state of the art Python client library, how unifying development efforts to merge several Python client libraries has paid off, features available to Python processes, and how to gracefully handle failures in a set of distributed processes.
OSCON 2012 US Patriot Act Implications for Cloud Computing - Diane Mueller, A...OSCON Byrum
Presented by Diane Mueller, ActiveState @pythondj
Are you unsure what the security and privacy implications are for sensitive corporate data? US Patriot Act is causing many of us to hesitate on leveraging the cloud.
Organizations are thinking long and hard about the legal and regulatory implications of cloud computing. When it comes to actual corporate data, no matter what the efficiency gains are, legal departments are often directing IT departments to steer clear of any service that eliminates their ability to keep potential sensitive information out of the hands of Federal prosecutors.
Despite all the hype about every application moving into the cloud, some practical patterns are starting to emerge in the types of data corporations are willing to move to the cloud.
Covered in this session:
(a) Introduction to the US Patriot Act and data privacy issues; implications for cloud computing; jurisdictional issues
(b) Best practices & practical patterns; classes of applications that best leverage the cloud
(c) What types of applications should stay on-premise; private cloud model(s); building a compliant cloud strategy
For more information:
email me at dianem {at} activestate {period} com
or ping me on twitter at @pythondj
visit http://activestate.com/stackato
BodyTrack: Open Source Tools for Health Empowerment through Self-Tracking OSCON Byrum
The BodyTrack project develops open source self-tracking tools to aggregate and visualize data from diverse sources such as wearable sensors, observations from mobile apps, photos, and environmental data. Our goal is to empower individuals to explore potential environment/health interactions (food sensitivities, asthma or migraine triggers, sleep problems, etc.) and better assess strategies they think might help.
A Look at the Network: Searching for Truth in Distributed ApplicationsOSCON Byrum
A talk by C. Scott Andreas (@cscotta) of Boundary on "the network" and designing / deploying distributed applications.
This session offers a deep-dive into how application-level problems manifest at the network level. Some of these cases range from basic network partitions and node outages to sophisticated application-level changes such as garbage collections on managed runtimes, classes of bugs which evade conventional monitoring but constitute partial failures, changes in network activity based on database partitioning, load balancing, and sharding, and other warning signs that crop up at layer three long before wreaking havoc at layer seven as customer-visible failures begin to occur. Combining application-level metrics with network analytics is a powerful cocktail for identifying hot spots quickly, and connecting the dots out to the client closes the loop.
Faster! Faster! Accelerate your business with blazing prototypesOSCON Byrum
Bring your ideas to life! Convince your boss that open source development is faster and cheaper than the "safe" COTS solution they probably hate anyway. Let's investigate ways to get real-life, functional prototypes up with blazing speed. We'll look at and compare tools for truly rapid development including Python, Django, Flask, PHP, Amazon EC2 and Heroku.
Comparing open source private cloud platformsOSCON Byrum
Private cloud computing has become an integral part of global business. While each platform provides a way for virtual machines to be deployed, implementations vary widely. It can be difficult to determine which features are right for your needs. This session will discuss the top open source private cloud platforms and provide analysis on which one is the best fit for you.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Mission to Decommission: Importance of Decommissioning Products to Increase E...
The Art of Social Media Analysis with Twitter & Python-OSCON 2012
1. The Art of Social Media Analysis with Twitter & Python
Krishna Sankar
@ksankar
http://www.oscon.com/oscon2012/public/schedule/detail/23130
2. Intro – House Rules (1 of 2)
• Doesn't assume any knowledge of the Twitter API
• Goal: everybody on the same page & a working knowledge of the Twitter API
• To bootstrap your exploration into Social Network Analysis & Twitter
• Simple programs, to illustrate usage & data manipulation
We will analyze @clouderati: 2,072 followers, exploding to ~980,000 distinct users one level down.
[Pipeline diagram: Intro (API, Objects, …) -> Network Analysis Pipeline -> NLP, NLTK, Sentiment Analysis -> @mention cliques, social network graph -> Retweet analytics, growth, #tag network, information contagion, weak ties]
3. Intro – House Rules (2 of 2)
• Am using the requests library
• There are good Twitter frameworks for Python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help
• Many areas to explore – not enough time. So decided to focus on the social graph, cliques & networkx
We will analyze @clouderati: 2,072 followers, exploding to ~980,000 distinct users one level down.
4. About Me
• Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com
  o Co-chair – 2012 IEEE Precision Time Synchronization: http://www.ispcs.org/2012/index.html
  o Blog: http://doubleclix.wordpress.com/
  o Quora: http://www.quora.com/Krishna-Sankar
• Prior Gigs
  o Lead Architect (Egnyte)
  o Distinguished Engineer (CSCO)
  o Employee #64439 (CSCO) to #39 (Egnyte) & now #9!
• Current Focus:
  o Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
  o Big Data (more of variety, variability, context & graphs than volume or velocity – so far!)
  o Overlay-based semantic search & ranking
• Other related Presentations
  o http://goo.gl/P1rhc – Big Data Engineering: Top 10 Pragmatics (Summary)
  o http://goo.gl/0SQDV – The Art of Big Data (Detailed)
  o http://goo.gl/EaUKH – The Hitchhiker's Guide to Kaggle (OSCON 2011 Tutorial)
5. Twitter Tips – A Baker's Dozen
1. Twitter APIs are (more or less) congruent & symmetric
2. Twitter is usually right & simple – recheck when you get unexpected results before blaming Twitter
  o I was getting numbers when I was expecting screen_names in user objects.
  o Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong – screen_name instead of user_id
  o Always test with one or two records before a long run! – learned the hard way
3. Twitter APIs are very powerful – consistent use can bear huge data
  o In a week, you can pull in 4-5 million users & some tweets!
  o Night runs are far faster & error-free
4. Use a NOSQL data store as a command buffer & data buffer
  o Would make it easy to work with Twitter at scale
  o I use MongoDB
  o Keep the schema simple & no fancy transformation – as far as possible, same as the (json) response
  o Use the NOSQL CLI for trimming records et al
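The command/data buffer in tip 4 can be sketched in a few lines. This is a minimal illustration, not the talk's actual code: an in-memory dict stands in for a MongoDB collection (with pymongo you would insert into a real collection instead), and the key point is that the stored document mirrors the raw JSON response, with no fancy transformation.

```python
# Minimal sketch of the NOSQL data-buffer idea: store API responses as-is.
# data_buffer stands in for a MongoDB collection, keyed by user_id.
import json

data_buffer = {}

def store_response(user_id, response_json):
    """Store the API response as-is, keyed by user_id."""
    data_buffer[user_id] = json.loads(response_json)

store_response(42, '{"id": 42, "screen_name": "clouderati", "followers_count": 2072}')
print(data_buffer[42]["screen_name"])   # -> clouderati
```

Keeping the stored schema identical to the JSON response makes the buffer trivially re-readable by later pipeline stages and easy to trim from the database CLI.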
6. Twitter Tips – A Baker's Dozen
5. Always use a big data pipeline
  o Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
  o That way you can orthogonally extend, with functional components like command buffers, validation et al
6. Use a functional approach for a scalable pipeline
  o Compose your big data pipeline with well defined granular functions, each doing only one thing
  o Don't overload the functional components (i.e. no collect, unroll & store as a single component)
  o Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques
  o This did create some trouble for me, as we will see later
7. Crawl-Store-Validate-Recrawl-Refresh cycle
  o The equivalent of the traditional ETL
  o Validation stage & validation routines are important
    • Cannot expect perfect runs
    • Cannot manually look at data either, when data is at scale
8. Have control numbers to validate runs & monitor them
  o I still remember control numbers which start with the number of punch cards in the input deck and then follow that number through the various runs!
  o There will be a separate printout of the control numbers that will be kept in the operations files
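The functional composition in tip 6 can be sketched as below. The stage names mirror the Collect-Store-Transform pipeline of tip 5; the stage bodies and data are made-up stand-ins, not the tutorial's real code — the point is only that each stage does one thing and the pipeline is a composition of granular functions.

```python
# Sketch of a pipeline composed from granular single-purpose functions.
import json

def collect():
    return ['{"id": 1}', '{"id": 2}']      # pretend API responses

def store(raw):
    return list(raw)                        # pretend persistence; one concern only

def transform(stored):
    return [json.loads(doc) for doc in stored]

def run_pipeline(stages, seed=None):
    data = seed
    for stage in stages:
        data = stage() if data is None else stage(data)
    return data

result = run_pipeline([collect, store, transform])
print(result)   # -> [{'id': 1}, {'id': 2}]
```

Because each stage is a plain function, inserting a validation or checkpoint stage later is just adding one more element to the list.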
7. Twitter Tips – A Baker's Dozen
9. Program defensively
  o more so for REST-based Big Data analytics systems
  o Expect failures at the transport layer & accommodate for them
10. Have Erlang-style supervisors in your pipeline
  o Fail fast & move on
  o Don't linger and try to fix errors that cannot be controlled at that layer
  o A higher layer process will circle back and do incremental runs to correct missing spiders and crawls
  o Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
  o I have an example in part 2
11. Data will never be perfect
  o Know your data & accommodate for its idiosyncrasies
    • for example: 0 followers, protected users, 0 friends, …
8. Twitter Tips – A Baker's Dozen
12. Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
  o See a MongoDB example in Part 2
13. Don't bombard the URL
  o Wait a few seconds between successful calls. This will end up with a scalable system, eventually
  o I found 10 seconds to be the sweet spot. 5 seconds gave a retry error. Was able to work with 5 seconds with wait & retry. Then, the rate limit started kicking in!
14. Always measure the elapsed time of your API runs & processing
  o Kind of an early warning when something is wrong
15. Develop incrementally; don't fail to check "cut & paste" errors
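Tips 12-14 combine naturally into one polite, re-startable crawl loop. The sketch below is illustrative: fetch_user() and the in-memory checkpoint are hypothetical stand-ins (against the real API you would call Twitter and persist the checkpoint, e.g. in MongoDB, and pause closer to the 10 seconds the talk suggests).

```python
# Sketch: checkpoint after every call, pause between calls, measure elapsed time.
import time

checkpoint = {"done": []}           # re-startable state; persist this in practice

def fetch_user(user_id):
    return {"id": user_id}          # stand-in for a real API call

def crawl(user_ids, pause=0.01):    # use a much longer pause against a real API
    start = time.time()
    for uid in user_ids:
        if uid in checkpoint["done"]:
            continue                         # a restart skips completed work
        fetch_user(uid)
        checkpoint["done"].append(uid)       # checkpoint after every call
        time.sleep(pause)                    # don't bombard the URL
    return time.time() - start               # elapsed time as an early warning

elapsed = crawl([1, 2, 3])
```

If the run dies mid-way, re-running crawl() with the same list resumes from the checkpoint instead of re-fetching everything.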
9. Twitter Tips – A Baker's Dozen
16. The Twitter big data pipeline has lots of opportunities for parallelism
  o Leverage data parallelism frameworks like MapReduce
  o But first:
    § Prototype as a linear system,
    § Optimize and tweak the functional modules & cache strategies,
    § Note down stages and tasks that can be parallelized, and
    § Then parallelize them
  o For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
17. Pay attention to handoffs between stages
  o They might require transformation – for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
  o But resist the urge to overload collect with transform
  o i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the arrays into separate documents
  o Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18. Have a good log management system to capture and wade through logs
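The unroll/flatten handoff in tip 17 can be sketched as follows. The input data is made up for illustration: collect has stored follower ids as arrays (one document per request), and a separate granular stage flattens them into one document per user, ready for aggregation.

```python
# Sketch of an unroll/flatten stage between collect/store and the model.
collected = [
    {"source": "clouderati", "ids": [101, 102]},   # one array-of-ids doc per request
    {"source": "clouderati", "ids": [103]},
]

def unroll(batches):
    """Array-of-ids documents in, one document per user out."""
    return [{"user_id": uid, "follows": b["source"]}
            for b in batches for uid in b["ids"]]

docs = unroll(collected)
print(len(docs))   # -> 3
```

Keeping unroll as its own stage leaves collect simple and lets the transformation be checkpointed and re-run independently.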
10. Twitter Tips – A Baker's Dozen
19. Understand the underlying network characteristics for the inference you want to make
  o Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
  o The Twitter Network is more of an Interest Network
  o So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
  o But others, like cliques and bipartite graphs, do
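Since cliques are one of the measures that do make sense on an interest network, here is a tiny pure-Python sketch of finding triangles (3-cliques) in a toy follower graph. The edges are made up; the tutorial itself uses networkx (which offers find_cliques for the general case) on the real @clouderati graph.

```python
# Sketch: find all triangles (3-cliques) in a small undirected graph.
from itertools import combinations

edges = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}
nodes = {n for e in edges for n in e}

def connected(u, v):
    return (u, v) in edges or (v, u) in edges

triangles = [trio for trio in combinations(sorted(nodes), 3)
             if all(connected(u, v) for u, v in combinations(trio, 2))]
print(triangles)   # -> [('a', 'b', 'c')]
```

On a real follower graph, a triangle of mutual followers is a much stronger signal of a community than raw follower counts.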
11. Twitter Gripes
1. Need richer APIs for #tags
  o Somewhat similar to users, viz. followers, friends et al
  o Might make sense to make #tags a top-level object with its own semantics
2. HTTP error returns are not uniform
  o Returns 400 Bad Request instead of 420
  o Granted, there is enough information to figure this out
3. Need an easier way to get screen_name from user_id
4. "following" vs. "friends_count", i.e. "following" is a dummy variable
  o There are a few like this, most probably for backward compatibility
5. Parameter validation is not uniform
  o Gives "404 Not Found" instead of "406 Not Acceptable" or "413 Too Long" or "416 Range Unacceptable"
6. Overall, more validation would help
  o Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
12. A Fork
• NLP: a deep dive into Tweets with NLTK & Sentiment Analysis
• Not enough time for both
• I chose the Social Graph route
13. A minute about Twitter as a platform & its evolution
https://dev.twitter.com/blog/delivering-consistent-twitter-experience
"The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community." – Chenda, CBS News
".. we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you're on Twitter.com or elsewhere on the web" – Michael
My Wish & Hope
• I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
• I did like the fact that tweets are part of LinkedIn. I still used Twitter more than LinkedIn
  o I don't think showing Tweets in LinkedIn took anything away from the Twitter experience
  o The LinkedIn experience & the Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that
• I sincerely hope that the platform grows with a rich developer ecosystem
• An orthogonally extensible platform is essential
• Of course, along with a congruent user experience – "… core Twitter consumption experience through consistent tools"
14. Setup
• For hands-on today
  o Python 2.7.3
  o easy_install -v requests
    • http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
  o easy_install -v requests-oauth
  o Hands-on programs at https://github.com/xsankar/oscon2012-handson
• For advanced data science with social graphs
  o easy_install -v networkx
  o easy_install -v numpy
  o easy_install -v nltk
    • Not for this tutorial, but good for sentiment analysis et al
  o MongoDB
    • I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS
  o graphviz – http://www.graphviz.org/; easy_install pygraphviz
  o easy_install pydot
16. Problem Domain For This Tutorial
• Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
  o Not for Twitter-based apps for real-time tweets
  o Not for web sites with real-time tweets
• By looking at the domain in aggregate to derive inferences & actionable recommendations
• Which also means you need to be deliberate & systemic (i.e. not look at a fluctuation as a trend, but dig deeper before pronouncing a trend)
17. Agenda
I. Mechanics: Twitter API (1:30 PM - 3:00 PM)
  o Essential Fundamentals (Rate Limit, HTTP Codes et al)
  o Objects
  o API
  o Hands-on (2:45 PM - 3:00 PM)
II. Break (3:00 PM - 3:30 PM)
III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
  o Underlying Concepts
  o Social Graph Analysis of @clouderati
    § Stages, Strategies & Tasks
    § Code Walk-thru
19. Twitter API: Read These First
• Using the Twitter Brand
  o New logo & associated guidelines: https://twitter.com/about/logos
  o Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
  o Developer Rules of the Road: https://dev.twitter.com/terms/api-terms
• Read These Links First
  1. https://dev.twitter.com/docs/things-every-developer-should-know
  2. https://dev.twitter.com/docs/faq
  3. Field Guide to Objects: https://dev.twitter.com/docs/platform-objects
  4. Security: https://dev.twitter.com/docs/security-best-practices
  5. Media Best Practices: https://dev.twitter.com/media
  6. Consolidated Page: https://dev.twitter.com/docs
  7. Streaming APIs: https://dev.twitter.com/docs/streaming-apis
  8. How to Appeal (not that you all would need it!): https://support.twitter.com/articles/72585
• Only one version of the Twitter APIs
20. API Status Page
• https://dev.twitter.com/status
• https://dev.twitter.com/issues
• https://dev.twitter.com/discussions
22. Open This First
• Install pre-reqs as per the setup slide
• Run oscon2012_open_this_first.py
  o To test connectivity – a "canary query"
• Run oscon2012_rate_limit_status.py
  o Use http://www.epochconverter.com to check reset_time
• Formats: xml, json, atom & rss
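Instead of pasting reset_time into epochconverter.com, the epoch value from the rate-limit response can be converted directly in Python. A small sketch, using the x-ratelimit-reset value that appears later in the slides:

```python
# Convert a Twitter rate-limit reset_time (Unix epoch seconds) to UTC.
import datetime

reset_epoch = 1341546334   # x-ratelimit-reset from a real response shown later
reset_at = datetime.datetime.fromtimestamp(reset_epoch, datetime.timezone.utc)
print(reset_at.isoformat())   # -> 2012-07-06T03:45:34+00:00
```

The web converter remains handy for eyeballing values during debugging, but doing the conversion in code lets the crawler decide for itself how long to sleep.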
23. Twitter API
[Diagram: three API families around the core Twitter objects]
• REST – core data, core Twitter objects: build profile, create/post tweets, reply, favorite, re-tweet. Rate limit: 150/350 per hour
• Streaming – near-realtime, high volume: follow users, topics, data mining. Public Streams, User Streams, Site Streams, Firehose
• Search & Trend – keywords, specific user, trends. Rate limit: complexity & frequency
25. Rate Limits
• By API type & Authentication Mode

  API        No authC                 authC     Error
  REST       150/hr                   350/hr    400
  Search     Complexity & Frequency   -N/A-     420
  Streaming  Up to 1% of Firehose     none      none
33. Unexplained Errors
While trying to get details of 1,000,000 users, I get this error – usually 10-6 AM PST:

Traceback (most recent call last):
  File "oscon2012_get_user_info_01.py", line 39, in <module>
    r = client.get(url, params=payload)
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
  File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C221203257%2C15254297%2C… [long user_id list elided]

• Got around it by "trap & wait 5 seconds"
• Night runs are relatively error free
34. A Day in the Life of the Twitter Rate Limit
Missed by 4 min!
{
  …
  "date": "Fri, 06 Jul 2012 03:41:09 GMT",
  "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
  "server": "tfe",
  "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
  "status": "400 Bad Request",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341546334",
  "x-runtime": "0.01918"
}
Error, sleeping…
OK after 5 min sleep:
{
  …
  "date": "Fri, 06 Jul 2012 03:46:12 GMT",
  …
  "status": "200 OK",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  …
}
35. Strategies
I have no exotic strategies, so far!
1. Obvious: track elapsed time & sleep when the rate limit kicks in
2. Combine authenticated & non-authenticated calls
3. Use multiple API types
4. Cache
5. Store & get only what is needed
6. Checkpoint & buffer request commands
7. Distributed data parallelism – for example, AWS instances
http://www.epochconverter.com/ <- useful to debug the timer
Please share your tips and tricks for conserving the rate limit
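Strategy 1 can be sketched from the x-ratelimit-* headers shown on the previous slide. This is an illustrative helper, not Twitter library code: the headers dict mimics the 400 response from slide 34, and against the live API you would read r.headers instead.

```python
# Sketch: decide how long to sleep from the x-ratelimit-* response headers.
import time

def seconds_until_reset(headers, now=None):
    now = time.time() if now is None else now
    if int(headers["x-ratelimit-remaining"]) > 0:
        return 0                   # quota left; no need to sleep
    return max(0, int(headers["x-ratelimit-reset"]) - now)

headers = {"x-ratelimit-remaining": "0", "x-ratelimit-reset": "1341546334"}
# Pretend "now" is Fri, 06 Jul 2012 03:41:09 GMT (epoch 1341546069):
print(seconds_until_reset(headers, now=1341546069))   # -> 265
```

265 seconds is the "missed by 4 min" from the previous slide: sleeping exactly that long (plus a small margin) resumes the crawl the moment the window resets, instead of guessing.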
37. Authentication
• Three modes
  o Anonymous
  o HTTP Basic Auth
  o OAuth
• As of Aug 31, 2010, only Anonymous or OAuth are supported
• OAuth enables the user to authorize an application without sharing credentials
  o Also has the ability to revoke
• Twitter supports OAuth 1.0a
• OAuth 2.0 is the new standard, much simpler
  o No timeframe for Twitter support, yet
38. OAuth Pragmatics
• Helpful Links
  o https://dev.twitter.com/docs/auth/oauth
  o https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
  o https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
  o http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
• A discussion of OAuth's internal mechanisms is better left for another day
• For headless applications to get an OAuth token, go to https://dev.twitter.com/apps
• Create an application & get four credential pieces
  o Consumer Key, Consumer Secret, Access Token & Access Token Secret
• All the frameworks have support for OAuth. So plug in these values & use the framework's calls
• I used the requests-oauth library like so:
39. requests-oauth

# Get a client using the token, key & secret from dev.twitter.com/apps
def get_oauth_client():
    consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
    consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
    access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
    access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
    header_auth = True
    oauth_hook = OAuthHook(access_token, access_token_secret,
                           consumer_key, consumer_secret, header_auth)
    client = requests.session(hooks={'pre_request': oauth_hook})
    return client

def get_followers(user_id):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}
    # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = requests.get(url, params=payload)

# Use the client instead of requests
def get_followers_with_oauth(user_id, client):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}
    # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = client.get(url, params=payload)

Ref: http://pypi.python.org/pypi/requests-oauth
40. OAuth Authorize Screen
• The user authenticates with Twitter & grants access to Forbes Social
• Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account
42. HTTP Status Codes
• 0 – Never made it to the Twitter servers: library error
• 200 OK
• 304 Not Modified
• 400 Bad Request
  o Check the error message for an explanation
  o Rate limited (the REST rate limit!)
• 401 Unauthorized
  o Beware – you could get this for other reasons as well
• 403 Forbidden
  o Hit the update limit (> max Tweets/day, following too many people)
• 404 Not Found
• 406 Not Acceptable
• 413 Too Long
• 416 Range Unacceptable
• 420 Enhance Your Calm
  o Rate limited
• 500 Internal Server Error
• 502 Bad Gateway
  o Down for maintenance
• 503 Service Unavailable
  o Overloaded – "Fail whale"
• 504 Gateway Timeout
  o Overloaded
https://dev.twitter.com/docs/error-codes-responses
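Acting on these codes in a crawler reduces to a small decision function. The sketch below is illustrative (handle() is not part of any Twitter library): rate limiting and transient server errors get a sleep-and-retry, everything else fails fast in the Erlang-supervisor style from the tips.

```python
# Sketch: map HTTP status codes to crawler actions.
RETRYABLE = {420, 500, 502, 503, 504}   # rate limited or server overloaded

def handle(status):
    if status == 200:
        return "ok"
    if status in RETRYABLE:
        return "sleep-and-retry"   # back off, try again later
    return "fail-fast"             # 0/400/401/403/404…: fix the request instead

print(handle(200), handle(420), handle(404))   # -> ok sleep-and-retry fail-fast
```

Note that per the gripes slide, Twitter sometimes returns 400 instead of 420 for rate limiting, so a production crawler should also inspect the x-ratelimit-remaining header before deciding 400 is unretryable.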
44. HTTP Status Code – Confusing Example
• GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
• Spelling mistake
  o Should be screen_name
• But a confusing error!
  o Should be 406 Not Acceptable or 413 Too Long, showing a parameter error
{
  …
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "404 Not Found",
  …
}
{
  "errors": [
    {
      "code": 34,
      "message": "Sorry, that page does not exist"
    }
  ]
}
45. HTTP Status Code - Example
• Sometimes the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12
• Clearly a parameter error (i.e. more parameters than allowed), yet reported as 401
• Response headers:
  {
    "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
    "content-encoding": "gzip",
    "content-length": "112",
    "content-type": "application/json;charset=utf-8",
    "date": "Sat, 23 Jun 2012 01:23:47 GMT",
    "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
    …
    "status": "401 Unauthorized",
    "www-authenticate": "OAuth realm="https://api.twitter.com"",
    "x-frame-options": "SAMEORIGIN",
    "x-ratelimit-class": "api",
    "x-ratelimit-limit": "150",
    "x-ratelimit-remaining": "147",
    "x-ratelimit-reset": "1340417742",
    "x-transaction": "d545a806f9c72b98"
  }
• Response body:
  {
    "error": "Not authorized",
    "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
  }
47. Twitter Platform Objects
• [Diagram] Users follow, and are followed by, other Users (Friends & Followers)
• Users post Status Updates (Tweets); Tweets embed Entities (@user_mentions, urls, media, #hashtags) and Places
• Tweets are temporally ordered into a Timeline
https://dev.twitter.com/docs/platform-objects
48. Tweets
• A.k.a Status Updates
• Interesting fields
  o coordinates <- geo location
  o created_at
  o entities (will see later)
  o id, id_str
  o possibly_sensitive
  o user (will see later)
    • "perspectival attributes embedded within a child object of an unlike parent" - hard to maintain at scale
    • https://dev.twitter.com/docs/faq#6981
  o withheld_in_countries
    • https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
https://dev.twitter.com/docs/platform-objects/tweets
49. A word about id, id_str
• June 1, 2010
  o Snowflake, the id generator service
  o "The full ID is composed of a timestamp, a worker number, and a sequence number"
  o JavaScript had problems handling numbers > 53 bits
  o "id": 819797
  o "id_str": "819797"
http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
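The 53-bit problem is easy to demonstrate: JavaScript stores all numbers as IEEE-754 doubles, whose mantissa holds 53 bits, so a Snowflake-sized id silently loses precision when parsed as a number. Python floats are the same doubles, so a short demonstration of why id_str exists:

```python
# Why the API carries both "id" and "id_str": round-tripping a large id
# through a 53-bit double (what JavaScript does to every number) corrupts it,
# while the string form survives intact.
big_id = 2**53 + 1                     # a post-Snowflake-sized id

corrupted = int(float(big_id))         # simulate a JSON parser using doubles
safe = int(str(big_id))                # parse the id_str instead

assert corrupted != big_id             # precision lost
assert safe == big_id                  # string round-trip is exact
```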
50. Tweets - Example
• Let us run oscon2012-tweets.py
• Example fields of a tweet
  o coordinates
  o id
  o id_str
52. Users - Let us run some examples
• Run
  o oscon_2012_users.py
    • Lookup users by screen_name
  o oscon12_first_20_ids.py
    • Lookup users by user_id
• Inspect the results
  o id, name, status, status_count, protected, followers (for top 10 followers), withheld users
• Can use this information for customizing the user's screen in your web app
53. Entities
• Metadata & contextual information
• You could parse these out of the tweet text yourself, but Entities deliver them as structured data
• REST API/Search API: pass include_entities=1
• Streaming API: included by default
• hashtags, media, urls, user_mentions
https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
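Because entities arrive pre-parsed, pulling out hashtags, mentions and urls is plain dictionary access. A sketch over a hand-made tweet dict shaped like the API payload (the tweet itself is invented for illustration):

```python
# A tweet's "entities" member is structured data - no regexes needed.
tweet = {
    "text": "Loving #oscon - thanks @OReillyMedia http://t.co/abc123",
    "entities": {
        "hashtags": [{"text": "oscon", "indices": [7, 13]}],
        "user_mentions": [{"screen_name": "OReillyMedia", "indices": [23, 36]}],
        "urls": [{"url": "http://t.co/abc123", "expanded_url": "http://oscon.com"}],
    },
}

hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]
links = [u["expanded_url"] for u in tweet["entities"]["urls"]]
```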
54. Entities
• Run
  o oscon2012_entities.py
• Inspect hashtags, urls et al
55. Places
• attributes
• bounding_box
• id (as a string!)
• country
• name
https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes
56. Places
• Can search for tweets near a place like so:
  o Get the lat/long of the convention center [45.52929, -122.66289]
  o Search for tweets near that place
• Tweets near San Jose [37.395715, -122.102308]
• We will not explore this further here, but it is very useful
57. Timelines
• Collections of tweets, ordered by time
• Use max_id & since_id for navigation
https://dev.twitter.com/docs/working-with-timelines
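The max_id/since_id navigation can be sketched as a loop that walks backwards through time, stopping when a page comes back empty. This is not one of the tutorial's scripts; `get_page` below is an offline stand-in for a statuses/user_timeline call (newest first), and only the paging arithmetic mirrors the documented pattern.

```python
# Walk a timeline backwards with max_id; since_id bounds how far back to go.
def fetch_timeline(get_page, since_id=None):
    tweets, max_id = [], None
    while True:
        page = get_page(max_id, since_id)
        if not page:
            break
        tweets.extend(page)
        max_id = page[-1]["id"] - 1    # next page: strictly older tweets only
    return tweets

# Illustrative stub standing in for the API (ids descend = newest first)
_all = [{"id": i} for i in (50, 40, 30, 20, 10)]
def get_page(max_id, since_id, page_size=2):
    hits = [t for t in _all
            if (max_id is None or t["id"] <= max_id)
            and (since_id is None or t["id"] > since_id)]
    return hits[:page_size]
```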
58. Other Objects & APIs
• Lists
• Notifications
• Friendships/exists to see if one user follows another
59. Twitter Platform Objects (recap)
• [Diagram, repeated from slide 47] Users follow, and are followed by, other Users (Friends & Followers); Tweets embed Entities (@user_mentions, urls, media, #hashtags) and Places, and are temporally ordered into a Timeline
https://dev.twitter.com/docs/platform-objects
60. Hands-on Exercise (15 min)
• Setup environment - slide #14
• Sanity check environment & libraries
  o oscon2012_open_this_first.py
  o oscon2012_rate_limit_status.py
• Get objects (show calls)
  o Lookup users by screen_name - oscon12_users.py
  o Lookup users by id - oscon12_first_20_ids.py
  o Lookup tweets - oscon12_tweets.py
  o Get entities - oscon12_entities.py
• Inspect the results
• Explore a little bit
• Discussion
62. Twitter API
• [Diagram] Three API families around the core Twitter objects:
  o REST API: core data; build profile, create/post tweets, reply, favorite, re-tweet. Rate limit: 150/350
  o Search API: keywords, specific user, trends. Rate limit: complexity & frequency
  o Streaming API: near-realtime, high volume; follow users, topics, data mining. Public Streams, User Streams, Site Streams, Firehose
63. Twitter REST API
• https://dev.twitter.com/docs/api
• What we have been doing so far uses the REST API
• Request-response
• Anonymous or OAuth
• Rate limited:
  o 150 (anonymous) / 350 (OAuth) calls per hour
64. Twitter Trends
• oscon2012-trends.py
• Trends/weekly, Trends/monthly
• Let us run some examples
  o oscon2012_trends_daily.py
  o oscon2012_trends_weekly.py
• Trends & hashtags
  o #hashtag euro2012
  o http://hashtags.org/euro2012
  o http://sproutsocial.com/insights/2011/08/twitter-hashtags/
  o http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
  o Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/
65. Brand Rank w/ Twitter
• Walk through & results of the following
  o oscon2012_brand_01.py
• Followed 10 user-brands for a few days to find growth
• Brand Rank
  o Growth of a brand w.r.t. the industry
  o A surge in popularity could be due to -ve or +ve buzz; need to understand & correlate using Twitter APIs & metrics
• API: url='https://api.twitter.com/1/users/lookup.json'
• payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
67. Brand Rank w/ Twitter - Tech Brands
• Google I/O showed a spike on 6/27-6/28
• OReillyMedia shares some of that spike
• Looking at a few days' worth of data, our best inference is that "oscon doesn't track with googleio"
• "Clouderati doesn't track at all"
68. Brand Rank w/ Twitter - World of Soccer
• FOXSoccer and UEFAcom track each other
• The numbers seldom decrease, so calculating -ve velocity will not work
• OTOH, if you see a -ve velocity, investigate
69. Brand Rank w/ Twitter - World of Basketball
• NBA, MiamiHeat & okcthunder track each other
• Used % change rather than absolute numbers to compare
• The hike from 7/6 to 7/10 is interesting
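The "% rather than absolute numbers" comparison amounts to computing day-over-day growth in percent, which lets a 10K-follower brand sit on the same chart as a 5M-follower one. A sketch with invented follower counts (the tutorial's real data lives in oscon2012_brand_01.py):

```python
# Day-over-day follower growth in percent, per brand. Counts are made up.
def pct_growth(series):
    return [round(100.0 * (b - a) / a, 2) for a, b in zip(series, series[1:])]

followers = {"nba": [5000000, 5050000], "oscon": [10000, 10500]}
growth = {brand: pct_growth(counts) for brand, counts in followers.items()}
# nba gained 10x more followers in absolute terms, but oscon grew 5x faster
```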
70. Brand Rank w/ Twitter - Rising Tide …
• For some reason, all numbers are going up 7/6 thru 7/10 - except for clouderati!
• Is a rising (Twitter) tide lifting all (well, almost all) boats?
71. Trivia: Search API
• Search (search.twitter.com)
  o Built by Summize, which was acquired by Twitter in 2008
  o Summize described itself as "sentiment mining"
72. Search API
• Very simple
  o GET http://search.twitter.com/search.json?q=<blah>
• Based on a search criterion
• "The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
• Recent = last 6-9 days' worth of tweets
• Anonymous call
• Rate limit
  o Not number of calls/hour, but complexity & frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search
73. Search API
• Filters
  o Search terms are URL encoded
  o @ = %40, # = %23
  o Emoticons :) and :(
  o http://search.twitter.com/search.atom?q=sometimes+%3A)
  o http://search.twitter.com/search.atom?q=sometimes+%3A(
• Location filters, date filters
• Content searches
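The encoding rules above don't need to be hand-rolled; the standard library produces them. A sketch (the `search_url` helper is my own, not a tutorial script); note `quote_plus` also percent-encodes the closing parenthesis, which the slide's example URLs leave bare - both forms are valid:

```python
from urllib.parse import quote_plus

# @ -> %40, # -> %23, ':' in emoticons -> %3A; spaces become '+'.
def search_url(query):
    return "http://search.twitter.com/search.json?q=" + quote_plus(query)
```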
74. Streaming API
• Not request-response, but a stream
• Twitter frameworks have the support
• Rate limit: up to 1% (a sample of the full firehose)
• Stall warning if the client is falling behind
• Good documentation links
  o https://dev.twitter.com/docs/streaming-apis/connecting
  o https://dev.twitter.com/docs/streaming-apis/parameters
  o https://dev.twitter.com/docs/streaming-apis/processing
75. Firehose
• ~400 million public tweets/day
• If you are working with the Twitter firehose, I envy you!
• If you hit real limits, then explore the firehose route
• AFAIK it is not cheap, but worth it
76. API Best Practices
1. Use JSON
2. Use user_id rather than screen_name
   o user_id is constant, while screen_name can change
3. Use max_id and since_id
   o For example, with direct messages: if you have the last message, use since_id for the search
   o max_id sets how far back to go
4. Cache as much as you can
5. Set the User-Agent header for debugging
I have listed a few good blogs that have API best practices in the reference section at the end of this presentation.
These are gathered from various books, blogs & other media I used for this tutorial. See References (at the end) for the sources.
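Practices 4 and 5 can be sketched in a few lines: a dict cache keyed by user_id in front of the lookup call, and a descriptive User-Agent on every request. All names here are illustrative, not from the tutorial's scripts; `fetch` stands in for the actual API call.

```python
# Practice 5: a descriptive User-Agent so Twitter ops (and you) can trace requests.
HEADERS = {"User-Agent": "oscon2012-tutorial/0.1 (contact: me@example.com)"}

# Practice 4: cache user lookups - user records change far slower than you poll.
_user_cache = {}

def lookup_user(user_id, fetch):
    """fetch(user_id) stands in for the real users/lookup call."""
    if user_id not in _user_cache:
        _user_cache[user_id] = fetch(user_id)
    return _user_cache[user_id]
```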
77. Twitter API (recap)
• [Diagram, repeated from slide 62] REST, Search & Streaming APIs with their rate limits
• Questions?
79. Data Pipeline
• 1. Collect → 2. Store → 3. Transform & Analyze → 4. Model & Reason → 5. Predict, Recommend & Visualize
• Tip 1: Implement as a staged pipeline, never a monolith
• Tip 3: Validate the dataset & keep the schema simple; don't be afraid to re-crawl/refresh
• Most important & the ugliest slide in this deck!
80. Trivia
• Social Network Analysis originated as Sociometry & the social network was called a sociogram
• Back then, "Facebook" was called SocioBinder!
• Jacob Levy Moreno is considered the originator
  o NYTimes, April 3, 1933, p. 17
82. Twitter Networks - Definitions
• In-degree
  o Followers
• Out-degree
  o Friends/Follow
• Centrality measures
• Hubs & Authorities
  o Hubs/directories tell us where authorities are
  o "Of Mortals & Celebrities" is more "Twitter-style"
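In-degree and out-degree fall directly out of the follow edges. A sketch over a toy edge list (the names are invented), where an edge (a, b) means "a follows b":

```python
from collections import Counter

# In-degree = followers (who follows you); out-degree = friends (whom you follow).
def degrees(edges):
    out_deg = Counter(a for a, b in edges)
    in_deg = Counter(b for a, b in edges)
    return in_deg, out_deg

edges = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]
in_deg, out_deg = degrees(edges)
```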
83. Twitter Networks - Properties
• Concepts from citation networks (illustrated on a sample graph with nodes A-N)
  o Cocitation
    • Common papers that cite a paper
    • Twitter analogue: common followers
    • C & G (followed by F & H)
  o Bibliographic coupling
    • Cite the same papers
    • Twitter analogue: common friends (i.e. follow the same person)
    • D, E, F & H
84. Twitter Networks - Properties
• Concepts from citation networks (same sample graph with nodes A-N)
  o Cocitation
    • Common papers that cite a paper
    • Twitter analogue: common followers
    • C & G (followed by F & H)
  o Bibliographic coupling
    • Cite the same papers
    • Twitter analogue: common friends (i.e. follow the same person)
    • D, E, F & H follow C
    • H & F follow C & G
    • So H & F have high coupling
    • Hence, if H follows A, we can recommend that F follow A
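The coupling-based recommendation can be sketched as set arithmetic over friend lists: users who follow the same people are structurally similar, so whom one follows can be recommended to the other. The toy `friends` mapping below mirrors the H/F example in spirit but is my own miniature, not the slide's A-N figure.

```python
# Bibliographic coupling: size of the shared friend set.
def coupling(friends, u, v):
    return len(friends[u] & friends[v])

# If u and v are coupled, recommend to `target` the accounts `candidate`
# follows that `target` doesn't yet.
def recommend(friends, target, candidate):
    if coupling(friends, target, candidate) == 0:
        return set()
    return friends[candidate] - friends[target]

friends = {
    "H": {"C", "G", "A"},   # H follows C, G and A
    "F": {"C", "G"},        # F follows C and G
}
```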
85. Twitter Networks - Properties
• Bipartite/affiliation networks
  o Two disjoint subsets
  o The bipartite concept is very relevant to the Twitter social graph
  o Membership in lists
    • lists vs. users bipartite graph
  o Common #tags in tweets
    • #tags vs. members bipartite graph
  o @mentioned together
    • ? Can this be a bipartite graph
    • ? How would we fold this
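"Folding" a bipartite graph means projecting it onto one of its two node sets: for the #tags-vs-users graph, two users become linked when they share at least one hashtag, with the shared tags as the edge weight. A sketch over invented data:

```python
from itertools import combinations

# One-mode projection of a #tag <-> user bipartite graph onto the users.
def fold(tag_to_users):
    edges = {}
    for tag, users in tag_to_users.items():
        for u, v in combinations(sorted(users), 2):
            edges.setdefault((u, v), set()).add(tag)
    return edges   # (user, user) -> shared tags; weight = len(shared tags)

tags = {"#oscon": {"ann", "bob", "cid"}, "#euro2012": {"bob", "cid"}}
folded = fold(tags)
```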
86. Other Metrics & Mechanisms
• Kronecker graph models
  o The Kronecker product is a way of generating self-similar matrices
  o Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
  o Application: generating models for analysis, prediction, anomaly detection et al
• Erdos-Renyi random graphs
  o Easy to build a G(n,p) graph
  o Assumes equal likelihood of edges between two nodes
  o In a Twitter social network, we can create a more realistic expected distribution (adding the "social reality" dimension) by inspecting the #tags & @mentions
• Network diameter
• Weak ties
• Follower velocity (+ve & -ve), association strength
  o Unfollow is not a reliable measure
  o But an interesting property to investigate when it happens
Not covered here, but potential for an encore!
Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
87. Twitter Networks - Properties
• Twitter != LinkedIn, Twitter != Facebook
• Twitter network == interest network
• Be cognizant of the above when you apply traditional network properties to Twitter
• For example,
  o Six degrees of separation doesn't make sense (most of the time) in Twitter, except maybe for cliques
  o Is diameter a reliable measure for a Twitter network?
    • Probably not
  o Do cut sets make sense?
    • Probably not
  o But citation network principles do apply; we can learn from cliques
  o Bipartite graphs do make sense
88. Cliques (1 of 2)
• "Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other"
• Cohesive subgroup, closely connected
• In practice, look for near-cliques rather than perfect cliques (k-plex: each member connected to at least n-k others)
• Use k-plex cliques to discover subgroups in a sparse network; a 1-plex is the perfect clique
Ref: Networks, An Introduction - Newman
89. Cliques (2 of 2)
• k-core: each member connected to at least k others in the subset; an (n-k)-plex
• k-clique: members no more than distance k apart
  o The path may run inside or outside the subset
  o k-clan or k-club: the path must stay inside the subset
• We will apply k-plex cliques in one of our hands-on exercises
Ref: Networks, An Introduction - Newman
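Newman's k-plex definition translates directly into a membership test: a set S is a k-plex if every member is adjacent to at least len(S) - k of the others, so a 1-plex is a perfect clique. A brute-force sketch over a toy undirected graph (not the tutorial's clique-finding code):

```python
# k-plex test: every vertex in `subset` must touch at least len(subset) - k
# of the other vertices in the subset.
def is_k_plex(subset, adjacency, k):
    need = len(subset) - k
    return all(len(adjacency[v] & (subset - {v})) >= need for v in subset)

adjacency = {                 # toy undirected graph
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```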
90. Sentiment Analysis
• Sentiment analysis is an important & interesting line of work on the Twitter platform
  o Collect tweets
  o Opinion estimation: pass through a classifier, sentiment lexicons
    • Naïve Bayes / Max Entropy classifier / SVM
  o Aggregated text sentiment / moving average
• I chose not to dive deeper because of time constraints
  o Couldn't do justice to API, social network and sentiment analysis, all in 3 hrs
• The next 3 slides have a couple of interesting examples
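The lexicon step of that pipeline can be sketched without any classifier: score each tweet by summing word polarities from a tiny hand-made lexicon, then smooth the stream with a moving average. This is a toy stand-in for the Naïve Bayes / MaxEnt / SVM approaches the slide names, with an invented six-word lexicon:

```python
# Toy sentiment lexicon: word -> polarity
LEXICON = {"love": 1, "great": 1, "win": 1, "hate": -1, "fail": -1, "down": -1}

def score(text):
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

# Smooth a stream of per-tweet scores with a trailing moving average.
def moving_average(scores, window=3):
    out = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```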
92. Need I say more?
"A bit of clever math can uncover interesting patterns that are not visible to the human eye"
http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf
95. Interesting Vectors of Exploration
1. Find trending #tags & then related #tags, using cliques over co-#tag-citation, which infers topics related to trending topics
2. Related #tag topics over a set of tweets by a user or group of users
3. Analysis of in/out flow, tweet flow
   - Frequent @mentions
4. Find affiliation networks by list memberships, #tags or frequent @mentions
96. Interesting Vectors of Exploration
5. Use centrality measures to determine mortals vs. celebrities
6. Classify tweet networks/cliques based on message passing characteristics
   - Tweets vs. retweets, number of retweets, …
7. Retweet network
   - Measure influence by retweet count & frequency
   - Information contagion by looking at different retweet network subcomponents: who, when, how much, …
98. Analysis Story Board
• @clouderati is a popular cloud-related Twitter account
• Goals:
  o Analyze the social graph characteristics of the users who are following the account
• In this tutorial:
  o Dig one level deep, to the followers & friends of the followers of @clouderati
  o How many cliques? How strong are they?
  o Does the @mention data support the clique inferences?
• For you to explore!
  o What are the retweet characteristics?
  o What does the #tag network graph look like?
99. Twitter Analysis Pipeline Story Board
Stages, Strategies, APIs & Tasks
• Stage 3
  o Get the distinct user list, applying the set(union(list)) operation
• Stage 4
  o Get & store user details (distinct user list)
• Stage 5
  o For each follower of @clouderati:
    • Unroll
    • Find the friend=follower intersection set
  o Note: Needed a command buffer to manage missteps
  o Note: The unroll stage took time & scale (~980,000 users)
• Stage 6
  o Create the social graph
  o Apply network theory
  o Infer cliques & other properties
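The set(union(list)) step mentioned above is one line of Python: collapse the per-follower friend and follower id lists into a single set of distinct users. A sketch with invented ids standing in for the real crawl output:

```python
# Stage 3 in miniature: distinct users from many overlapping id lists.
follower_lists = [[1, 2, 3], [2, 3, 4], [3, 5]]   # followers of each follower
friend_lists = [[2, 9], [1]]                      # friends of each follower

distinct_users = set().union(*follower_lists, *friend_lists)
```

On the real dataset this is what turned ~2,000 followers' lists into the ~980,000 distinct users quoted on the next slide.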
100. @clouderati Twitter Social Graph
• Stats (retrospect after the runs):
  o Stage 1
    • @clouderati has 2072 followers
  o Stage 2
    • Limiting followers to 5,000 per user
  o Stage 3
    • Digging to the 1st level (set union of followers & friends of the followers of @clouderati) explodes into ~980,000 distinct users
  o MongoDB of the cache and intermediate datasets: ~10 GB
  o The database was hosted at AWS (Hi-Mem XLarge - m2.xlarge), 8 x 15 GB, RAID 10, opened to the Internet with DB authentication
101. Code & Run Walk Through
Stage 1: Get @clouderati followers; store in MongoDB
o Code:
  § oscon_2012_user_list_spider_01.py
o Challenges:
  § Nothing fancy
  § Get the record and store it
o Interesting points:
  § Would have had to recurse through a REST cursor if there were more than 5,000 followers
  § @clouderati has 2072 followers
102. Code & Run Walk Through
Stage 2: Crawl 1 level deep; get friends & followers; validate, re-crawl & refresh
o Code:
  § oscon_2012_user_list_spider_02.py
  § oscon_2012_twitter_utils.py
  § oscon_2012_mongo.py
  § oscon_2012_validate_dataset.py
o Challenges:
  § Multiple runs, errors et al!
  § Protected users; some had 0 followers or 0 friends
o Interesting points:
  § Set operation between two Mongo collections for the restart buffer
  § Interesting operations for validate, re-crawl and refresh
  § Added "status_code" to differentiate protected users
  § {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}}
  § Getting friends & followers of 2000 users is the hardest (or so I thought, until I got through the next stage!)