
Finding Emerging Topics Using Chaos and Community Detection in Social Media Graphs

In this talk, we describe our recent work in the analysis of Twitter-based network graphs, including the Ebola crisis in 2014 and the stock market in 2015.

Published in: Data & Analytics

Finding Emerging Topics Using Chaos and Community Detection in Social Media Graphs

  1. 1. Finding Emerging Topics Using Chaos and Community Detection in Social Media Graphs Steve Kramer, Ph.D. President & Chief Scientist Paragon Science, Inc. September 2015 Copyright © 2006-2015 Paragon Science, Inc. All rights reserved.
  2. 2. Overview
     ▪ Background Information about Paragon Science
     ▪ Example 1: Ebola Twitter Analysis 2014
     ▪ Example 2: Stock Market Analysis via Twitter
     ▪ Q & A
  3. 3. About Paragon Science
     ▪ Advisory Board Company
        • Analysis of Healthcare Data
     ▪ Digital Motorworks/CDK Global
        • Vehicle Pricing Analytics
     ▪ Houston Law Firm
        • Email Analysis for Patent Lawsuit
     ▪ Place IQ
        • Mobile Phone Data Analysis
     ▪ RetailMeNot
        • Web Analytics for Online Coupons
     ▪ Vast.com
        • Web User Click Patterns
     ▪ Founder: Dr. Steve Kramer
        • PhD in computational physics (nonlinear dynamics)
        • Self-funded data science entrepreneur
        • 22 years of research and high-tech experience
        • Manager and consultant at software companies
        • Reviewer for scientific journals and conferences
        • Member of StartOut Austin steering committee
     ▪ http://affinityincmagazine.com/paragon-science-puts-patented-technology /
  4. 4. What Are We Doing?
     ▪ Using our patented anomaly detection software to find the “unknown unknowns”: unusual changes that represent revenue opportunities to exploit or risks to mitigate
     ▪ Many possible application areas:
        • Social media alerting and sentiment change detection
        • Pricing and market trend analysis and alerting
        • Fraud prevention (banking, insurance, online auctions, …)
     ▪ Key advantages
        • No machine learning or training required
        • Robust to missing or erroneous data
        • Highly scalable and parallelizable
  5. 5. How Is It Done Today?
     ▪ Existing approaches
        • Standard SNA metrics
        • Rule-based systems (transaction profiling, etc.)
        • Bayesian and other statistical/probabilistic models
        • Machine learning tools (neural nets, HMMs, etc.)
     ▪ Some limitations of existing methods
        • Training requirements can be large for neural nets.
        • For rule-based systems, it is difficult to effectively predict or define new “bad” anomalies or patterns in advance.
        • Many current methods are not scalable to real-world operational requirements.
  6. 6. What Is New in Our Patented Approach?
     ▪ A powerful anomaly detection approach that incorporates nonlinear time series analysis methods
        • US Patent #8738652 (1.usa.gov/1kkyVD9) “Systems and Methods for Dynamic Anomaly Detection”
     ▪ Key questions answered:
        • Which entities behave or evolve differently than others in the data set?
        • Which entities have shifted their behavior unexpectedly?
  7. 7. What Is New in Our Approach? (Cont’d.)
     ▪ Our framework inherently captures the dynamics of the entities under study, without having to specify in advance normal vs. abnormal behavior.
     ▪ We can simultaneously analyze the time evolution of
        • Network structures
        • Any associated attributes (text terms, geospatial position, etc.)
     ▪ Our technique is robust with respect to missing or erroneous data.
     ▪ As a result, we can
        • Find key players in rapidly changing networks
        • Provide early warning of viral videos and online documents
        • Focus attention on the most anomalous events or transactions
  8. 8. Dynamic Anomaly Detection Overview
     ▪ A general approach that incorporates nonlinear time series analysis methods
        • Complexity measures
        • Finite-time Lyapunov exponents (FTLEs)
     ▪ Input data
        • Communications or transactional data streams
        • General time-dependent data sets
     ▪ Key questions
        • Which entities behave or evolve differently than others in the data set?
        • Which entities have shifted their behavior unexpectedly?
  9. 9. Finite-Time Lyapunov Exponents (FTLEs)
     ▪ General dynamical system
     ▪ Flow map
        • Advects points in the state space
        • Describes the time evolution of the system
  10. 10. Finite-Time Lyapunov Exponents (FTLEs)
     ▪ FTLEs characterize the amount of stretching or contraction about a point x0 during a time interval T
        • Stability
        • Predictability
     ▪ Definition
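The equations from slides 9 and 10 did not survive in this transcript. For orientation only, the standard textbook form of these definitions is sketched below. The symbols v, φ, Δ, and σ are assumptions chosen here and may differ from the notation on the original slides.

```latex
% General dynamical system with initial condition x_0 at time t_0
\dot{x}(t) = v\bigl(x(t), t\bigr), \qquad x(t_0) = x_0

% Flow map: advects the point x_0 forward over the interval T
\phi_{t_0}^{t_0+T}(x_0) = x(t_0 + T;\, t_0, x_0)

% Right Cauchy-Green deformation tensor built from the flow map's Jacobian
\Delta = \left( \frac{\partial \phi_{t_0}^{t_0+T}}{\partial x_0} \right)^{\!\top}
         \left( \frac{\partial \phi_{t_0}^{t_0+T}}{\partial x_0} \right)

% Finite-time Lyapunov exponent at x_0 over the interval T
\sigma_{t_0}^{T}(x_0) = \frac{1}{|T|} \ln \sqrt{\lambda_{\max}(\Delta)}
```

A positive σ indicates that nearby trajectories separate over the interval (low predictability); a negative σ indicates contraction.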
  11. 11. Derived Jacobian Vectors
     ▪ Similarly, characteristic vectors derived from the flow map’s Jacobian can describe the generalized directions of the local stretching or contraction.
     ▪ Possible derivation approaches:
        • Weight-based column sampling
        • Singular value decomposition (SVD)
        • Principal component analysis (PCA)
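To make the link between the Jacobian, the FTLE, and the stretching direction concrete, here is a minimal sketch for a two-dimensional system. It is not Paragon Science's patented method. It assumes the flow map's Jacobian over the interval T is already known as a 2x2 matrix, and the function name `ftle_2d` is hypothetical. The largest eigenvalue of JᵀJ equals the squared largest singular value of J, so this is the closed-form 2x2 case of the SVD approach listed above.

```python
import math

def ftle_2d(jacobian, T):
    """Return (FTLE, unit stretching direction) for a 2x2 flow-map Jacobian.

    jacobian: ((a, b), (c, d)), the derivative of the flow map at x0.
    T: length of the time interval (nonzero).
    """
    (a, b), (c, d) = jacobian
    # Cauchy-Green tensor C = J^T J, which is symmetric 2x2.
    c11 = a * a + c * c
    c12 = a * b + c * d
    c22 = b * b + d * d
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    mean = 0.5 * (c11 + c22)
    half_diff = 0.5 * (c11 - c22)
    lam_max = mean + math.hypot(half_diff, c12)
    # Eigenvector belonging to lam_max (direction of strongest stretching).
    if abs(c12) > 1e-12:
        vx, vy = c12, lam_max - c11
    else:
        vx, vy = (1.0, 0.0) if c11 >= c22 else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    sigma = math.log(math.sqrt(lam_max)) / abs(T)
    return sigma, (vx / norm, vy / norm)

# Example: a map that stretches x by e^2 and contracts y by e^-1 over T = 2.
sigma, direction = ftle_2d(((math.exp(2), 0.0), (0.0, math.exp(-1))), T=2.0)
print(sigma, direction)
```

For higher-dimensional feature vectors, a library SVD routine would replace the closed-form eigenvalue step.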
  12. 12. Paragon Dynamic Anomaly Detection
     [Flowchart. Box labels in extracted order: Representation of Data at t=ti; Cluster Resolution; Feature Vector Encoding; Outlier Detection at t=ti; 3+ Time Intervals? (Yes / No); Clustering / Segmentation; Dynamic Anomaly Detection; Nonlinear Time Series Analysis; FTLEs, Dynamic Thresholds, etc.; Pattern Classification; Outlier Detection; Domain-Specific Filtering; Threat Signatures, Risk Profiles, etc.]
  13. 13. Example 1: Ebola Twitter Analysis 2014
     ▪ Sample data set from Twitter API collected using twittertap
        • Date range: 11/8/2014 – 11/16/2014
        • 2,541,812 tweets
        • 4,708,678 generated links with hashtags, URLs, and user replies
     ▪ Research plan
        • Perform k-core decomposition
        • Run anomaly detection software on sub-networks of nodes in the central core to find the most influential users and most viral URLs
        • Carry out community detection and topic detection
  14. 14. Twitter-Induced Social Networks
     [Diagram. Nodes: User A, User B, User C, URL 1, URL 2, Hash Tag 1, Hash Tag 2. Edge labels: replies to, mentions, references, uses.]
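As a rough illustration of how a single tweet could be turned into the typed links in this diagram, the sketch below extracts reply, mention, URL, and hashtag edges with regular expressions. The function name `tweet_edges`, the edge-label strings, and the input fields are assumptions made for this example; the talk does not describe its actual parsing code.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def tweet_edges(user, text, reply_to=None):
    """Return (source, relation, target) triples for one tweet."""
    edges = []
    if reply_to:
        edges.append((user, "replies_to", reply_to))
    for url in URL_RE.findall(text):
        edges.append((user, "references", url))
    # Strip URLs first so that '#' or '@' inside a link is not miscounted.
    body = URL_RE.sub(" ", text)
    for name in re.findall(r"@(\w+)", body):
        edges.append((user, "mentions", name))
    for tag in re.findall(r"#(\w+)", body):
        edges.append((user, "uses", "#" + tag.lower()))
    return edges

for edge in tweet_edges("userA", "@userB see http://goo.gl/pFg3Z2 #Ebola",
                        reply_to="userB"):
    print(edge)
```

Accumulating these triples over all tweets yields the multi-type edge list on which the later graph analyses operate.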
  15. 15. K-core Decomposition
     ▪ The k-core of a graph is a maximal subgraph in which each vertex has degree at least k.
        • The coreness of a vertex is k if it belongs to the k-core but not to the (k+1)-core.
        • The k-core decomposition is performed by recursively removing all the vertices (along with their respective edges) that have degrees less than k.
     ▪ The k-core decomposition of a network can be very effective in identifying the individuals within a network who are best positioned to spread or share information.
        • M. Kitsak, et al., “Identifying influential spreaders in complex networks,” arXiv:1001.5285v1 [physics.soc-ph] (2010).
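The recursive-removal procedure described above can be sketched in a few lines of plain Python. This is a simple quadratic-time illustration, not the tooling used in the talk; the function name `core_numbers` and the adjacency-dictionary input format are assumptions for the example. NetworkX, which the talk acknowledges, offers an equivalent `core_number` function for real workloads.

```python
def core_numbers(adj):
    """Coreness of every vertex in an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours
         (no self-loops; every neighbour must also be a key).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel off the vertex with the smallest current degree.
        v = min(remaining, key=degree.__getitem__)
        # Coreness never decreases along the peeling order.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

# A triangle (a, b, c) with one pendant vertex d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(core_numbers(graph))  # d has coreness 1; a, b, c have coreness 2
```

The vertices with the highest coreness form the central core that the following slides examine.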
  16. 16. K-Core Decomposition of the Ebola Network
     http://sourceforge.net/projects/lanet-vi/
  17. 17. Central Core of the Ebola Network
  18. 18. Top URLs in the Central Core
     URL                        K Shell   Degree
     http://goo.gl/pFg3Z2       49        279
     http://goo.gl/BFEUgy       49        233
     http://goo.gl/S37kHT       49        212
     http://goo.gl/silISF       47        364
     http://invst.rs/7MKWHB     22        779
     http://cnn.it/1wlIlUe      22        741
     http://trib.al/YKSMCSN     22        734
     http://nyp.st/136BPG3      22        698
     http://nypost.com/2014/10/29/cdc-admits-droplets-from-a-sneeze-could-spread-ebola/     22        415
     http://fxn.ws/1oVgLwc      22        406
  19. 19. Top-Ranked Website (URLs 1, 2, and 4)
     UMA MENTIRA CHAMADA ,,EBOLA,, VEJAM !!! | NOTICIÁRIO DA WEB
     A statement made by a man in Ghana called Nana Kwame rocked the internet in recent days. The following information has to reach people. We need to see the Ebola for what it really is. It's time to wake up the world agenda behind this whole story. Follow what this man has to say about what is happening in their country of origin: People in the world need to know what is happening here in West Africa. They are lying! The ''Ebola'' as a virus does not exist and is not contagious. The Red Cross brought a disease to four specific countries, for four specific reasons and is only contracted by those who receive treatments and injections of the Red Cross. That's why Liberians and Nigerians began to expel the Red Cross in their countries!
  20. 20. 5th Ranked Website
  21. 21. 6th Ranked Website
  22. 22. Topic Detection in the Ebola Twitter Network
     [Diagram. Nodes: User A, User B, User C, URL 1, URL 2, Term 1, Term 2, Term 3, …, Term N, Topic 1, Topic 2, …, Topic M. Edge labels: replies to, mentions, references.]
  23. 23. Applicable “Soft” Clustering Methods
     ▪ K-Groups/Group Discovery Algorithm (GDA)
        • J. Kubica, A. Moore, and J. Schneider, “Tractable group detection on large link data sets,” The Third IEEE International Conference on Data Mining (2003).
     ▪ Clique Percolation (http://www.cfinder.org/)
        • G. Palla, et al., “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, 435, p. 814 (2005).
     ▪ Louvain Modularity Optimization
        • V. Blondel, et al., “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, 10, P10008 (2008).
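The Louvain method greedily searches for a partition that maximizes modularity. The sketch below does not implement Louvain itself; it only evaluates the modularity score of a given non-overlapping partition of an unweighted, undirected graph, which is the quantity the method optimizes. The function name `modularity` and the input format are assumptions for this example.

```python
def modularity(edges, community_of):
    """Newman modularity Q of a partition.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every vertex to a community label.
    """
    m = len(edges)
    internal = {}    # number of edges inside each community
    degree_sum = {}  # total degree of the vertices in each community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (internal edge share - expected share).
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {v: 0 for v in split}
print(modularity(edges, split), modularity(edges, merged))
```

Splitting the two triangles scores higher than lumping all six vertices together, which is the signal a modularity optimizer follows.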
  24. 24. Summary of Top 200 Topic Anomalies
     Topic      Peak Start Time     Peak End Time       Max Change Metric   # Anomalies
     Topic 99   2014-11-06 06:18    2014-11-12 10:18    2.97                40
     Topic 8    2014-11-05 20:18    2014-11-07 07:18    2.891               34
     Topic 59   2014-11-06 20:18    2014-11-11 19:18    2.43                28
     Topic 1    2014-11-05 17:18    2014-11-05 19:18    2.32                3
     Topic 52   2014-11-05 17:18    2014-11-05 18:18    2.30                2
     Topic 50   2014-11-05 19:18    2014-11-06 15:18    2.22                11
     Topic 32   2014-11-05 18:18    2014-11-05 19:18    2.18                2
     Topic 20   2014-11-05 20:18    2014-11-06 02:18    2.11                7
     Topic 2    2014-11-07 07:18    2014-11-12 16:18    2.10                33
     Topic 28   2014-11-05 20:18    2014-11-05 22:18    2.00                3
     Topic 29   2014-11-08 02:18    2014-11-12 18:18    1.96                21
     Topic 97   2014-11-06 09:18    2014-11-07 03:18    1.91                4
     Topic 30   2014-11-05 20:18    2014-11-05 20:18    1.84                1
     Topic 22   2014-11-05 23:18    2014-11-06 02:18    1.79                4
     Topic 18   2014-11-05 17:18    2014-11-05 17:18    1.65                1
     Topic 15   2014-11-05 19:18    2014-11-05 19:18    1.63                1
     Topic 4    2014-11-08 14:18    2014-11-12 15:18    1.61                5
  25. 25. Key Sites Related to Top 5 Ebola Topic Anomalies
     ▪ Topic 99
        • Max Change Metric: 2.973
        • Peak Datetime: 2014-11-06 17:18:27
        • Top Related URL Title: FACT SHEET: Emergency Funding Request to Enhance the U.S. Government’s Response to Ebola at Home and Abroad | The White House
     ▪ Topic 8
        • Max Change Metric: 2.888
        • Peak Datetime: 2014-11-05 20:18:27
        • Top Related URL Title: BBC News - Ebola outbreak: Barack Obama 'to ask Congress for $6bn'
     ▪ Topic 59
        • Max Change Metric: 2.426
        • Peak Datetime: 2014-11-07 02:18:27
        • Top Related URL Title: » Obama Caught Ordering Press to Cover Up Ebola Alex Jones' Infowars: There's a war on for your mind!
     ▪ Topic 1
        • Max Change Metric: 2.321
        • Peak Datetime: 2014-11-05 17:18:27
        • Top Related URL Title: UMA MENTIRA CHAMADA ,,EBOLA,, VEJAM !!! | NOTICIÁRIO DA WEB
     ▪ Topic 52
        • Max Change Metric: 2.296
        • Peak Datetime: 2014-11-05 17:18:27
        • Top Related URL Title: Nigeria Property: Ebola Virus Originated From US Bio-warfare Labs In West Africa – American Prof
  26. 26. Example: Topic 99 URL-to-User Links
  27. 27. Topic 99a: Economic Consequences
  28. 28. Topic 99b: Mobile Data to Prevent Ebola
  29. 29. Topic 99c: ISIS and Ebola
  30. 30. Topic 99d: @ebolafiles (Twitter user)
  31. 31. Topic 99e: Emergency Funding Request
  32. 32. Topic 99f: Follow Ebola
     Follow Ebola | Updated every second & see what the #CDC & #WHO is not telling you about #Ebola
  33. 33. Overview
     ▪ Background Information about Paragon Science
     ▪ Example 1: Ebola Twitter Analysis 2014
     ▪ Example 2: Stock Market Analysis via Twitter
     ▪ Q & A
  34. 34. Twitter Stock Market Data Set
     ▪ Date range: August 5-29, 2015
     ▪ 175,246 tweets sent by 28,754 users
     ▪ Network graph generated includes these links:
        • symbol links to URL: 430,842 (74,034 distinct URLs)
        • user links to URL: 149,117
        • user mentions user: 74,247
        • user references hash tag: 176,670
        • user references symbol: 501,165
        • user replies to user: 10,698
     ▪ Goal:
        • Identify key influencers and emerging topics that could influence prices
        • Provide high-quality input for Moodzee predictive models
  35. 35. Twitter Stock Market Graph for August 2015
  36. 36. Twitter Stock Market Graph (Zoom 1)
  37. 37. Twitter Stock Market Graph (Zoom 2)
  38. 38. Identifying Key Influencers
     ▪ Perform k-core decomposition
     ▪ Results:
        • 50 k-shells
        • 102 users at the center of the network
        • Examine stock symbol -> URL links for the central users using uncertainty scores for the content of the web pages
     Twitter User       # Links
     DayTradersGroup    855
     diggingplatinum    652
     Benzinga           261
     WrigleyTom         203
     SeekingAlpha       182
     OpenOutcrier       126
     theflynews         125
     WallStJesus        119
     Istock8            96
     valuewalk          93
  39. 39. Network of 102 Central Users and 2910 Neighbors
  40. 40. Network of 102 Central Users and Neighbors (Zoom 1)
  41. 41. Network of 102 Central Users and Neighbors (Zoom 2)
  42. 42. Using Financial Sentiment Scores: Uncertainty
     ▪ Predicting Is Hard Business | Seeking Alpha
        • URL(s): http://seekingalpha.com/article/3422496-predicting-is-hard-business?source=feed_f
        • Uncertainty: 69
     ▪ In Today's Overheated Market, Control Risk In Your Retirement Portfolios With Sound Valuation | Seeking Alpha
        • URL(s): http://seekingalpha.com/article/3455116-in-todays-overheated-market-control-risk-in-your-retirement-portfolios-with-sound-valuat
        • Uncertainty: 63
     ▪ Comments On The Market Correction; Focus On Biotechs: Large Caps - Regeneron Pharmaceuticals, Inc. (NASDAQ:REGN) | Seeking Alpha
        • URL(s): http://seekingalpha.com/article/3468626-comments-on-the-market-correction-focus-on-biotechs-large-caps?source=feed_f
        • Uncertainty: 55
     ▪ TradingView: Free Stock Charts and Forex Charts Online.
        • URL(s): http://www.tradingview.com
        • Uncertainty: 51
     ▪ A MASSIVE New Platinum Pick Is Being Released At 9:30 am Today! Get On The List For Early Access To This New Play. | Blog
        • URL(s): http://tinyurl.com/oea3bjx, http://tr.im/oCRrP, http://bit.ly/1JhlgVb
        • Uncertainty: 49
     ▪ Our Pick On VGTL Has Gained 242.86% For Our Subscribers, In 2 Months! | Blog
        • URL(s): http://bit.ly/1OOMiY9, http://tr.im/6hNJf
        • Uncertainty: 47
     ▪ After 550% Gains On Our Picks In 5 Weeks, We Have A Major New Pick Coming Tomorrow! It is ONLY being released to Platinum Members Tomorrow, So Go Platinum To Get It Early! | Blog
        • URL(s): http://ow.ly/QrGNn
        • Uncertainty: 47
     ▪ Our Picks Gained Over 550% In The Past Month! And We Have A MASSIVE New Pick Coming To Our Platinum Members! Subscribe To Get It Early. | Blog
        • URL(s): http://bit.ly/1UjdodT, http://goo.gl/r34fP7, http://tr.im/mZn9y
        • Uncertainty: 47
     ▪ Our Pick On VGTL Has Gained 242.86% For Our Subscribers, In 2 Months! | Blog
        • URL(s): http://tinyurl.com/qjwxxwk
        • Uncertainty: 47
     ▪ What To Find Before Seeking Alpha: Position Size | Seeking Alpha
        • URL(s): http://seekingalpha.com/article/3444516-what-to-find-before-seeking-alpha-position-size?source=twitter_sa_factset
        • Uncertainty: 37
     Loughran and McDonald Financial Sentiment Dictionaries: Tim Loughran and Bill McDonald, 2011, “When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks,” Journal of Finance, 66:1, 35-65
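A dictionary-based sentiment score of this kind can be sketched as a simple word count. The slides do not say how the uncertainty numbers were computed, so the raw-count scoring below is an assumption, and `UNCERTAINTY_WORDS` is a small hypothetical stand-in rather than the actual Loughran and McDonald word list.

```python
import re

# Hypothetical stand-in lexicon; the real dictionary is far larger.
UNCERTAINTY_WORDS = {
    "uncertain", "uncertainty", "risk", "risks", "may", "might",
    "could", "possibly", "approximately", "predict", "predicting",
}

def uncertainty_score(text, lexicon=UNCERTAINTY_WORDS):
    """Count how many tokens of the text appear in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for token in tokens if token in lexicon)

page = "Predicting is hard: markets may fall, and risk could rise."
print(uncertainty_score(page))
```

Scoring each shared web page this way and tracking the totals per stock symbol over time would give a series on which anomaly detection can run.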
  43. 43. Anomaly Scores for Symbols -> URL Links
     Largest jump in the anomaly scores: $BIDU on 8/13/2015
  44. 44. $BIDU Network at First Uncertainty Surge
  45. 45. Topic Detection in the Twitter URL Network
     [Diagram. Nodes: User A, User B, User C, URL 1, URL 2, Term 1, Term 2, Term 3, …, Term N, Topic 1, Topic 2, …, Topic M. Edge labels: replies to, mentions, references.]
  46. 46. Topic Detection: Network of 698 Web Pages Shared by 102 Central Users
     215 topics detected
  47. 47. Network of 698 Web Pages Shared by 102 Central Users
  48. 48. Network of 698 Web Pages Shared by 102 Central Users
     Nodes colored by topic #
  49. 49. Web Site Titles in Largest Topic
     ▪ SPY ETF Turns Negative For Year Before Clawing Back - Investors.com
     ▪ $TSLA $GE $JCP $JWN $LOCO $KING $DD $JPM $AMAT $BAC $CBK: Stocks to Watch: Tesla, GE, JC Penney, Nordstrom | Stock News Hour
     ▪ $AAPL Apple has completed a 6-month complex H&S top
     ▪ $MU $SYMC $AAPL $ATML $SYNA $QLGC $CRUS $FCS $YHOO $BABA $AKAM $FSLR: It’s Not Just Apple: Yahoo!, Micron, Synaptics Fall on China Fears | Stock News Hour
     ▪ $GS $NVDA $BRCM $MU $SWKS $QCOM $INTC $WYNN $AAPL $YHOO $CAT $GM $T $VZ: China Damage Spreading | Stock News Hour
     ▪ $GOOGL $CAT $AAPL $SHAK $KHC $TW $JASO $RRGB $CSC $SYMC $CREE: Investors eye positive catalysts in oil, Google | Stock News Hour
     ▪ $GOOGL $PCLN $CTRP $BIDU $FB $AMZN $BABA $EXPE $LONG $QUNR $AWAY: The only US Web company that’s figured out China | Stock News Hour
  50. 50. New Partner Company: Moodzee
     ▪ Text analytics for financial markets
        • Predictive models
        • Advance warning of price-moving events
     ▪ Initial target users: Hedge funds
     ▪ Price correlations are done; next come back-testing, then paper trading, then real trading
     [Diagram labels: Alerts, Correlation Analysis, Downloader, Sentiment, Price-Movers, Anomalies]
  51. 51. What Are the Payoffs?
     ▪ Find the “unknown unknowns” in dynamic data sets
     ▪ Quickly identify key influencers and trends in online networks
     ▪ Provide early warning of viral videos, anomalous web events, or unusual network traffic
     ▪ Enable enhanced business intelligence without having to specify normal vs. abnormal behavior in advance
  52. 52. Third-Party Software Acknowledgements
     ▪ Paragon Science gratefully acknowledges the following researchers and software providers:
        • Cytoscape (http://www.cytoscape.org/)
        • dynnetwork Cytoscape plugin (https://code.google.com/p/dynnetwork/)
        • Lanet-vi (http://sourceforge.net/projects/lanet-vi/)
           ◦ J. Alvarez-Hamelin, et al., “Understanding Edge Connectivity in the Internet through Core Decomposition,” Internet Mathematics 7 (1): 45–66, 2011.
        • Louvain community detection software (http://perso.crans.org/aynaud/communities/)
           ◦ V. Blondel, et al., “Fast Unfolding of Communities in Large Networks,” Journal of Statistical Mechanics: Theory and Experiment, 10, P10008, 2008.
        • Networkx (https://networkx.github.io/)
           ◦ A. Hagberg, D. Conway, “Hacking social networks using the Python programming language (Module II - Why do SNA in NetworkX),” Sunbelt 2010: International Network for Social Network Analysis.
  53. 53. Overview
     ▪ Background Information about Paragon Science
     ▪ Example 1: Ebola Twitter Analysis 2014
     ▪ Example 2: Stock Market Analysis via Twitter
     ▪ Q & A
