The Art of Social Media Analysis with Twitter & Python-OSCON 2012


Final Slides for my 2012 Tutorial

Published in: Technology, Business

  1. The Art of Social Media Analysis with Twitter & Python
     Krishna Sankar (@ksankar)
  2. House Rules (1 of 2)
     o Doesn't assume any knowledge of Twitter API
     o Goal: Everybody on the same page & get a working knowledge of Twitter API
     o To bootstrap your exploration into Social Network Analysis & Twitter
     o Simple programs, to illustrate usage & data manipulation
     Roadmap (diagram): Intro (API, Objects, …); Twitter Network Analysis Pipeline; NLP, NLTK, Sentiment Analysis; @mention network, Cliques, social graph; Retweet analytics, Information contagion; Growth, weak ties; #tag Network
     We will analyze @clouderati: 2072 followers, exploding to ~980,000 distinct users down one level
  3. House Rules (2 of 2)
     o Am using the requests library
     o There are good Twitter frameworks for Python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help
     o Many areas to explore, not enough time. So decided to focus on social graph, cliques & networkx
     Roadmap (diagram): Intro (API, Objects, …); Twitter Network Analysis Pipeline; NLP, NLTK, Sentiment Analysis; @mention network, Cliques, social graph; Retweet analytics, Information contagion; Growth, weak ties; #tag Network
     We will analyze @clouderati: 2072 followers, exploding to ~980,000 distinct users down one level
  4. About Me
     • Lead Engineer/Data Scientist/AWS Ops Guy at
        o Co-chair - 2012 IEEE Precision Time Synchronization
        o Blog :
        o Quora : -Sankar
     • Prior Gigs
        o Lead Architect (Egnyte)
        o Distinguished Engineer (CSCO)
        o Employee #64439 (CSCO) to #39 (Egnyte) & now #9 !
     • Current Focus:
        o Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
        o Big Data (more of variety, variability, context & graphs, than volume or velocity, so far !)
        o Overlay based semantic search & ranking
     • Other related Presentations
        o Big Data Engineering Top 10 Pragmatics (Summary)
        o The Art of Big Data (Detailed)
        o The Hitchhiker's Guide to Kaggle, OSCON 2011 Tutorial
  5. Twitter Tips – A Baker's Dozen
     1. Twitter APIs are (more or less) congruent & symmetric
     2. Twitter is usually right & simple - recheck when you get unexpected results before blaming Twitter
        o I was getting numbers when I was expecting screen_names in user objects.
        o Was ready to send blasting e-mails to Twitter team. Decided to check one more time and found that my parameter key was wrong: screen_name instead of user_id
        o Always test with one or two records before a long run ! Learned the hard way
     3. Twitter APIs are very powerful – consistent use can yield huge data
        o In a week, you can pull in 4-5 million users & some tweets !
        o Night runs are far faster & error-free
     4. Use a NoSQL data store as a command buffer & data buffer
        o Would make it easy to work with Twitter at scale
        o I use MongoDB
        o Keep the schema simple & no fancy transformation
           • And as far as possible same as the (JSON) response
        o Use NoSQL CLI for trimming records et al
  6. Twitter Tips – A Baker's Dozen
     5. Always use a big data pipeline
        o Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
        o That way you can orthogonally extend, with functional components like command buffers, validation et al
     6. Use functional approach for a scalable pipeline
        o Compose your big data pipeline with well defined granular functions, each doing only one thing
        o Don't overload the functional components (i.e. no collect, unroll & store as a single component)
        o Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques
           • This did create some trouble for me, as we will see later
     7. Crawl-Store-Validate-Recrawl-Refresh cycle
        o The equivalent of the traditional ETL
        o Validation stage & validation routines are important
           • Cannot expect perfect runs
           • Cannot manually look at data either, when data is at scale
     8. Have control numbers to validate runs & monitor them
        o I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs !
        o There will be a separate printout of the control numbers that will be kept in the operations files
  7. Twitter Tips – A Baker's Dozen
     9. Program defensively
        o More so for REST-based Big Data analytics systems
        o Expect failures at the transport layer & accommodate for them
     10. Have Erlang-style supervisors in your pipeline
        o Fail fast & move on
        o Don't linger and try to fix errors that cannot be controlled at that layer
        o A higher layer process will circle back and do incremental runs to correct missing spiders and crawls
        o Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
        o I have an example in part 2
     11. Data will never be perfect
        o Know your data & accommodate for its idiosyncrasies
           • for example: 0 followers, protected users, 0 friends, …
  8. Twitter Tips – A Baker's Dozen
     12. Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
        o See a MongoDB example in Part 2
     13. Don't bombard the URL
        o Wait a few seconds between successive calls. This will end up with a scalable system, eventually
        o I found 10 seconds to be the sweet spot. 5 seconds gave retry errors. Was able to work with 5 seconds with wait & retry. Then, the rate limit started kicking in !
     14. Always measure the elapsed time of your API runs & processing
        o Kind of early warning when something is wrong
     15. Develop incrementally; don't fail to check "cut & paste" errors
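A minimal sketch of tips 13 & 14 together (wait & retry, and measure elapsed time). The helper name `call_with_retry` and its parameters are hypothetical, not from the tutorial programs; it assumes transport failures surface as `IOError`, which is what the exceptions of the requests library derive from. Written for Python 3, while the tutorial itself uses Python 2.7.

```python
import time


def call_with_retry(fn, wait_seconds=10, max_retries=3, sleep=time.sleep):
    """Call fn(); on a transport error wait, retry, and finally give up.

    Returns (result, elapsed_seconds) for the attempt that succeeded.
    `sleep` is injectable so the wait can be faked in tests.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        start = time.time()
        try:
            result = fn()
            # Elapsed time is an early warning when something is wrong.
            return result, time.time() - start
        except IOError as err:  # transport-layer failure
            last_error = err
            if attempt < max_retries:
                sleep(wait_seconds)  # don't bombard the URL
    # Fail fast & move on: let a higher layer decide what to do.
    raise last_error
```

With the requests library this could be used as `call_with_retry(lambda: requests.get(url, params=payload))`, where `url` and `payload` are whatever the calling stage already has.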
  9. Twitter Tips – A Baker's Dozen
     16. The Twitter big data pipeline has lots of opportunities for parallelism
        o Leverage data parallelism frameworks like MapReduce
        o But first :
           § Prototype as a linear system,
           § Optimize and tweak the functional modules & cache strategies,
           § Note down stages and tasks that can be parallelized and
           § Then parallelize them
        o For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
     17. Pay attention to handoffs between stages
        o They might require transformation - for example collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
        o But resist the urge to overload collect with transform
        o i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents
        o Add transformation as a granular function - of course, with appropriate buffering, caching, checkpoints & restart techniques
     18. Have a good log management system to capture and wade through logs
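The unroll/flatten stage in tip 17 can be sketched as one small function. The document shape (`user_id` plus an `ids` array) is an assumption for illustration; the actual schema used in the tutorial programs is not shown on the slide.

```python
def unroll_followers(doc):
    """Turn one collected document (user + array of follower ids)
    into one small document per follower, ready for aggregation."""
    return [
        {"user_id": doc["user_id"], "follower_id": follower_id}
        for follower_id in doc.get("ids", [])
    ]


# Hypothetical record, as the collect stage might have stored it.
collected = {"user_id": 12, "ids": [101, 102, 103]}
for row in unroll_followers(collected):
    print(row)
```

Keeping this as its own stage means collect stays a plain "fetch & store", and the unroll can be re-run from the stored arrays without another API call.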
  10. Twitter Tips – A Baker's Dozen
      19. Understand the underlying network characteristics for the inference you want to make
         o Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
         o Twitter Network is more of an Interest Network
         o So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
         o But, others like Cliques and Bipartite Graphs do
  11. Twitter Gripes
      1. Need richer APIs for #tags
         o Somewhat similar to users viz. followers, friends et al
         o Might make sense to make #tags a top level object with its own semantics
      2. HTTP Error Return is not uniform
         o Returns 400 Bad Request instead of 420
         o Granted, there is enough information to figure this out
      3. Need an easier way to get screen_name from user_id
      4. "following" vs. "friends_count" i.e. "following" is a dummy variable.
         o There are a few like this, most probably for backward compatibility
      5. Parameter Validation is not uniform
         o Gives "404 Not found" instead of "406 Not Acceptable" or "413 Too Long" or "416 Range Unacceptable"
      6. Overall more validation would help
         o Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
  12. A Fork
      • NLP, NLTK & deep into Tweets
         o Sentiment Analysis
      • Not enough time for both
      • I chose the Social Graph route
  13. A minute about Twitter as platform & its evolution
      https://dev.twitter.com/blog/delivering-consistent-twitter-experience
      "The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers community." - Chenda, CBS news
      ".. we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you're on [...] or elsewhere on the web" - Michael
      My Wish & Hope
      • I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
      • I did like the fact that tweets are part of LinkedIn. I still used Twitter more than LinkedIn
         o I don't think showing Tweets in LinkedIn took anything away from the Twitter experience
         o LinkedIn experience & Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that
      • I sincerely hope that the platform grows with a rich developer eco system
      • Orthogonally extensible platform is essential
      • Of course, along with a congruent user experience - " … core Twitter consumption experience through consistent tools"
  14. Setup
      • For Hands on Today
         o Python 2.7.3
         o easy_install -v requests
            • http://docs.python--a-request
         o easy_install -v requests-oauth
         o Hands on programs at -handson
      • For advanced data science with social graphs
         o easy_install -v networkx
         o easy_install -v numpy
         o easy_install -v nltk
            • Not for this tutorial, but good for sentiment analysis et al
         o MongoDB
            • I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS
         o graphviz; easy_install pygraphviz
         o easy_install pydot
  15. Thanks To these Giants …
  16. Problem Domain For this tutorial
      • Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
         o Not for Twitter based apps for real time tweets
         o Not web sites with real time tweets
      • By looking at the domain in aggregate to derive inferences & actionable recommendations
      • Which also means, you need to be deliberate & systematic (i.e. not look at a fluctuation as a trend but dig deeper before pronouncing a trend)
  17. Agenda
      I. Mechanics : Twitter API (1:30 PM - 3:00 PM)
         o Essential Fundamentals (Rate Limit, HTTP Codes et al)
         o Objects
         o API
         o Hands-on (2:45 PM - 3:00 PM)
      II. Break (3:00 PM - 3:30 PM)
      III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
         o Underlying Concepts
         o Social Graph Analysis of @clouderati
            § Stages, Strategies & Tasks
            § Code Walk thru
  18. Open This First
  19. Twitter API : Read These First
      • Using Twitter Brand
         o New logo & associated guidelines :
         o Twitter Rules : -report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
         o Developer Rules of the road -terms
      • Read These Links First
         1. -every-developer-should-know
         2.
         3. Field Guide to Objects -objects
         4. Security -best-practices
         5. Media Best Practices :
         6. Consolidated Page :
         7. Streaming APIs -apis
         8. How to Appeal (Not that you all would need it !) articles/72585
      • Only one version of Twitter APIs
  20. API Status Page
  21. https://-users-being-total-dicks-about-the-twitter
  22. Open This First
      • Install pre-req as per the setup slide
      • Run
         o To test connectivity - "canary query"
      • Run
         o Use to check reset_time
      • Formats: xml, json, atom & rss
  23. Twitter API
      • Twitter REST (Core Data, Core Twitter Objects)
         o Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet
         o Rate Limit : 150/350
      • Twitter Search (Search & Trend)
         o Keywords, Specific User, Trends
         o Rate Limit : Complexity & Frequency
      • Streaming (Near-realtime, High Volume)
         o Follow users, topics, data mining
         o Public Streams, User Streams, Site Streams, Firehose
  24. Rate Limit
  25. Rate Limits
      • By API type & Authentication Mode
        API         No authC                  authC     Error
        REST        150/hr                    350/hr    400
        Search      Complexity & Frequency    -N/A-     420
        Streaming   Up to 1% of Firehose      none      none
  26. Rate Limit Header
      {
        "status": "200 OK",
        "vary": "Accept-Encoding",
        "x-frame-options": "SAMEORIGIN",
        "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6",
        "x-ratelimit-class": "api",
        "x-ratelimit-limit": "150",
        "x-ratelimit-remaining": "149",
        "x-ratelimit-reset": "1340467358",
        "x-runtime": "0.04144",
        "x-transaction": "2b49ac31cf8709af",
        "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba"
      }
  27. Rate Limit-ed Header
      {
        "cache-control": "no-cache, max-age=300",
        "content-encoding": "gzip",
        "content-length": "150",
        "content-type": "application/json; charset=utf-8",
        "date": "Wed, 04 Jul 2012 00:48:25 GMT",
        "expires": "Wed, 04 Jul 2012 00:53:25 GMT",
        "server": "tfe",
        …
        "status": "400 Bad Request",
        "vary": "Accept-Encoding",
        "x-ratelimit-class": "api",
        "x-ratelimit-limit": "150",
        "x-ratelimit-remaining": "0",
        "x-ratelimit-reset": "1341363230",
        "x-runtime": "0.01126"
      }
  28. Rate Limit Example
      • Run
      • It iterates through a list to get followers
      • List is 2072 long
  29. 150 calls, not authenticated. Last time, it gave me 5 min. Now the reset timer is 1 hour
      {
        …
        "date": "Wed, 04 Jul 2012 00:54:16 GMT",
        "status": "200 OK",
        "vary": "Accept-Encoding",
        "x-frame-options": "SAMEORIGIN",
        "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8",
        "x-ratelimit-class": "api",
        "x-ratelimit-limit": "150",
        "x-ratelimit-remaining": "147",
        "x-ratelimit-reset": "1341366831",
        "x-runtime": "0.02768",
        "x-transaction": "f1bafd60112dddeb",
        "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
      }
  30. And Rate Limit kicked in
      {
        "cache-control": "no-cache, max-age=300",
        "content-encoding": "gzip",
        "content-type": "application/json; charset=utf-8",
        "date": "Wed, 04 Jul 2012 00:55:04 GMT",
        …
        "status": "400 Bad Request",
        "transfer-encoding": "chunked",
        "vary": "Accept-Encoding",
        "x-ratelimit-class": "api",
        "x-ratelimit-limit": "150",
        "x-ratelimit-remaining": "0",
        "x-ratelimit-reset": "1341366831",
        "x-runtime": "0.01342"
      }
  31. API with OAuth
      OAuth: "api_identified", 1 hr reset, 350 calls
      {
        …
        "date": "Wed, 04 Jul 2012 01:32:01 GMT",
        "etag": "\"dd419c02ed00fc6b2a825cc27wbe040\"",
        "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
        "last-modified": "Wed, 04 Jul 2012 01:32:01 GMT",
        "pragma": "no-cache",
        "server": "tfe",
        …
        "status": "200 OK",
        "vary": "Accept-Encoding",
        "x-access-level": "read",
        "x-frame-options": "SAMEORIGIN",
        "x-mid": "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",
        "x-ratelimit-class": "api_identified",
        "x-ratelimit-limit": "350",
        "x-ratelimit-remaining": "349",
        "x-ratelimit-reset": "1341369121",
        "x-runtime": "0.05539",
        "x-transaction": "9f8508fe4c73a407",
        "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
      }
  32. Rate Limit resets during consecutive calls (+1 hour)
      {
        …
        "date": "Thu, 05 Jul 2012 14:56:05 GMT",
        …
        "x-ratelimit-class": "api_identified",
        "x-ratelimit-limit": "350",
        "x-ratelimit-remaining": "133",
        "x-ratelimit-reset": "1341500165",
        …
      }
      ******** 2416
      {
        …
        "date": "Thu, 05 Jul 2012 14:56:18 GMT",
        …
        "status": "200 OK",
        …
        "x-ratelimit-class": "api_identified",
        "x-ratelimit-limit": "350",
        "x-ratelimit-remaining": "349",
        "x-ratelimit-reset": "1341503776",
      ******** 2417
  33. Unexplained Errors
      • While trying to get details of 1,000,000 users, I get this error - usually 10-6 AM PST
      • Got around by "Trap & wait 5 seconds"
      • Night Runs are relatively error free
      Traceback (most recent call last):
        File "", line 39, in <module>
          r = client.get(url, params=payload)
        File "build/bdist.macosx-10.6-intel/egg/requests/", line 244, in get
        File "build/bdist.macosx-10.6-intel/egg/requests/", line 230, in request
        File "build/bdist.macosx-10.6-intel/egg/requests/", line 609, in send
      requests.exceptions.ConnectionError: HTTPSConnectionPool(, port=443): Max retries exceeded with url: /1/users/lookup.json?
        user_id=237552390%2C101237516%2C208192270%2C340183853%2C221203257%2C15254297%2C44
        614426%2C617136931%2C415810340%2C76071717%2C17351462%2C574253%2C35048243%2C38854
        7381%2C254329657%2C65585979%2C253580293%2C392741693%2C126403390%2C300467007%2C8
        962882%2C21545799%2C15254346%2C141083469%2C340312913%2C44614485%2C600359770%2C
        17351519%2C38323042%2C21545828%2C86557546%2C90751854%2C128500592%2C115917681%2C
        42517364%2C34128760%2C15254397%2C453559166%2C92849025%2C600359811%2C17351556%2C
        8962952%2C296038349%2C325503810%2C122209166%2C123827693%2C59294611%2C19448725%
        2C21545881%2C17351581%2C130468677%2C80266144%2C15254434%2C84680859%2C65586084%
        2C19448741%2C15254438%2C214483879%2C48808878%2C88654768%2C15474846%2C48808887%
        2C334021563%2C60214090%2C134792126%2C15254464%2C558416833%2C138986435%2C2648155
        56%2C63488965%2C17222476%2C537445328%2C97854214%2C255598755%2C65586132%2C362260
        09%2C187220954%2C257346383%2C15254493%2C554222558%2C302564320%2C59165520%2C446
        14626%2C76071907%2C80266213%2C325503825%2C403227628%2C20368210%2C17351666%2C886
        54836%2C340313077%2C151569400%2C302564345%2C118014971%2C11060222%2C233229141%2C
        13727232%2C199803906%2C220435108%2C268531201
  34. A Day in the life of Twitter Rate Limit
      Missed by 4 min!
      {
        …
        "date": "Fri, 06 Jul 2012 03:41:09 GMT",
        "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
        "server": "tfe",
        "set-cookie": "dnt=;; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
        "status": "400 Bad Request",
        "vary": "Accept-Encoding",
        "x-ratelimit-class": "api_identified",
        "x-ratelimit-limit": "350",
        "x-ratelimit-remaining": "0",
        "x-ratelimit-reset": "1341546334",
        "x-runtime": "0.01918"
      }
      Error, sleeping
      OK after 5 min sleep
      {
        …
        "date": "Fri, 06 Jul 2012 03:46:12 GMT",
        …
        "status": "200 OK",
        …
        "x-ratelimit-class": "api_identified",
        "x-ratelimit-limit": "350",
        "x-ratelimit-remaining": "349",
        …
  35. Strategies
      I have no exotic strategies, so far !
      1. Obvious : Track elapsed time & sleep when rate limit kicks in
      2. Combine authenticated & non-authenticated calls
      3. Use multiple API types
      4. Cache
      5. Store & get only what is needed
      6. Checkpoint & buffer request commands
      7. Distributed data parallelism - for example AWS instances
      <- useful to debug the timer
      Please share your tips and tricks for conserving the Rate Limit
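Strategy 1 can be sketched from the rate limit headers shown on the earlier slides. This is only an illustration: the function name `seconds_to_sleep` and the safety `margin` are assumptions, and it reads just the `x-ratelimit-remaining` and `x-ratelimit-reset` headers (the reset value being epoch seconds).

```python
import time


def seconds_to_sleep(headers, now=None, margin=5):
    """Return how long to sleep before the next call, based on the
    x-ratelimit-* response headers. 0 means: go ahead."""
    now = time.time() if now is None else now
    remaining = int(headers.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0
    reset_at = int(headers["x-ratelimit-reset"])  # epoch seconds
    # Sleep until the reset time, plus a small margin (assumed value).
    return max(0, reset_at - int(now)) + margin


# Header values as seen in the rate-limited response slide.
limited = {"x-ratelimit-remaining": "0", "x-ratelimit-reset": "1341363230"}
print(seconds_to_sleep(limited, now=1341363000))
```

Passing `now` explicitly keeps the function easy to check against captured headers; in a real run it would be followed by `time.sleep(...)` and a retry of the buffered command.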
  36. Authentication
  37. Authentication
      • Three modes
         o Anonymous
         o HTTP Basic Auth
         o OAuth
      • As of Aug 31, 2010, only Anonymous or OAuth are supported
      • OAuth enables the user to authorize an application without sharing credentials
      • Also has the ability to revoke
      • Twitter supports OAuth 1.0a
      • OAuth 2.0 is the new standard, much simpler
         o No timeframe for Twitter support, yet
  38. OAuth Pragmatics
      • Helpful Links
         o -from-basic-auth-to-oauth
         o -user-with-examples
         o -to-build-oauth-consumer.html
      • Discussion on OAuth internal mechanisms is better left for another day
      • For headless applications to get OAuth token, go to https://
      • Create an application & get four credential pieces
         o Consumer Key, Consumer Secret, Access Token & Access Token Secret
      • All the frameworks have support for OAuth. So plug in these values & use the framework's calls
      • I used the requests-oauth library like so:
  39. requests-oauth
      import requests
      from oauth_hook import OAuthHook

      # Get client using the token, key & secret from the application page
      def get_oauth_client():
          consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
          consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
          access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
          access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
          header_auth = True
          oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key, consumer_secret, header_auth)
          # The hook name is a string key
          client = requests.session(hooks={'pre_request': oauth_hook})
          return client

      # Anonymous call, using requests directly
      def get_followers(user_id):
          url = ''  # URL not preserved in this transcript
          payload = {"user_id": user_id}  # if cursor is needed {"cursor": -1, "user_id": scr_name}
          r = requests.get(url, params=payload)
          return r

      # Use the client instead of requests
      def get_followers_with_oauth(user_id, client):
          url = ''  # URL not preserved in this transcript
          payload = {"user_id": user_id}  # if cursor is needed {"cursor": -1, "user_id": scr_name}
          r = client.get(url, params=payload)
          return r

      Ref: http://-oauth
  40. OAuth Authorize screen
      • The user authenticates with Twitter & grants access to Forbes Social
      • Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account
  41. HTTP Status Codes
  42. HTTP Status Codes
      • 0 Never made it to Twitter Servers - Library error
      • 200 OK
      • 304 Not Modified
      • 400 Bad Request
         o Check error message for explanation
         o REST Rate Limit !
      • 401 UnAuthorized
         o Beware - you could get this for other reasons as well.
      • 403 Forbidden
         o Hit Update Limit (> max Tweets/day, following too many people)
      • 404 Not Found
      • 406 Not Acceptable
      • 413 Too Long
      • 416 Range Unacceptable
      • 420 Enhance Your Calm
         o Rate Limited
      • 500 Internal Server Error
      • 502 Bad Gateway
         o Down for maintenance
      • 503 Service Unavailable
         o Overloaded "Fail whale"
      • 504 Gateway Timeout
         o Overloaded
      https://-codes-responses
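One way to use the status code list in a crawl loop is a small dispatch function. The action names below are made up for illustration, and the grouping (which codes to retry, which to sleep on) is my reading of the list above, not an official Twitter recommendation.

```python
def next_action(status_code):
    """Map an HTTP status code to what the crawl loop should do next."""
    if status_code == 200:
        return "process"
    if status_code == 304:
        return "skip"  # not modified, nothing new to store
    if status_code in (400, 420):
        # 400 is what the REST API returned when rate limited; 420 is Search.
        # Check the error message and the x-ratelimit-* headers first.
        return "check_rate_limit_and_sleep"
    if status_code in (0, 500, 502, 503, 504):
        return "wait_and_retry"  # transport or server side trouble
    # 401, 403, 404, 406, 413, 416: retrying the same call will not help.
    return "log_and_move_on"


for code in (200, 400, 401, 503):
    print(code, next_action(code))
```

The "log_and_move_on" branch follows the fail-fast idea: record the command, let a later validation run decide whether to recrawl.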
  43. HTTP Status Code - Example
      Detailed error message in JSON ! I like this
      {
        "cache-control": "no-cache, max-age=300",
        "content-encoding": "gzip",
        "content-length": "91",
        "content-type": "application/json; charset=utf-8",
        "date": "Sat, 23 Jun 2012 00:06:56 GMT",
        "expires": "Sat, 23 Jun 2012 00:11:56 GMT",
        "server": "tfe",
        …
        "status": "401 Unauthorized",
        "vary": "Accept-Encoding",
        "www-authenticate": "OAuth realm=\"\"",
        "x-ratelimit-class": "api",
        "x-ratelimit-limit": "0",
        "x-ratelimit-remaining": "0",
        "x-ratelimit-reset": "1340413616",
        "x-runtime": "0.01997"
      }
      {
        "errors": [
          {
            "code": 53,
            "message": "Basic authentication is not supported"
          }
        ]
      }
  44. HTTP Status Code - Confusing Example
      • GET screen_nme=twitterapi,twitter&include_entities=true
      • Spelling Mistake
         o Should be screen_name
      • But confusing error !
      • Should be 406 Not Acceptable or 413 Too Long, showing parameter error
      {
        …
        "pragma": "no-cache",
        "server": "tfe",
        …
        "status": "404 Not Found",
        …
      }
      {
        "errors": [
          {
            "code": 34,
            "message": "Sorry, that page does not exist"
          }
        ]
      }
  45. HTTP Status Code - Example
      Sometimes, the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12. Clearly a parameter error (i.e. more parameters)
      {
        "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
        "content-encoding": "gzip",
        "content-length": "112",
        "content-type": "application/json;charset=utf-8",
        "date": "Sat, 23 Jun 2012 01:23:47 GMT",
        "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
        …
        "status": "401 Unauthorized",
        "www-authenticate": "OAuth realm=\"\"",
        "x-frame-options": "SAMEORIGIN",
        "x-ratelimit-class": "api",
        "x-ratelimit-limit": "150",
        "x-ratelimit-remaining": "147",
        "x-ratelimit-reset": "1340417742",
        "x-transaction": "d545a806f9c72b98"
      }
      {
        "error": "Not authorized",
        "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
      }
  46. Objects
  47. Twitter Platform Objects
      • Users
         o Follow -> Friends
         o Are Followed By -> Followers
         o Status Update -> Tweets
      • Tweets
         o embed Entities: @ user_mentions, urls, media, # hashtags
         o embed Places
         o Temporally Ordered -> TimeLine
      https://-objects
  48. Tweets
      • A.k.a Status Updates
      • Interesting fields
         o coordinates <- geo location
         o created_at
         o entities (will see later)
         o id, id_str
         o possibly_sensitive
         o user (will see later)
            • perspectival attributes embedded within a child object of an unlike parent - hard to maintain at scale
         o withheld_in_countries
            • -withheld-content-fields-api-responses
      https://-objects/tweets
  49. A word about id, id_str
      • June 1, 2010
         o Snowflake, the id generator service
         o "The full ID is composed of a timestamp, a worker number, and a sequence number"
         o Had problems with JavaScript handling numbers > 53 bits
         o "id": 819797
         o "id_str": "819797"
      http://-snowflake.html
      https://!topic/twitter-development-talk/ahbvo3VTIYI
      https://-ids-json-and-snowflake
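To make the id vs. id_str point concrete, here is a small sketch. The bit layout (timestamp in the high bits, then a 10-bit worker number, then a 12-bit sequence) and the epoch constant are taken from the open-source Snowflake project as I understand it; treat both as assumptions rather than something stated on the slide. The sample id is constructed, not a real tweet.

```python
TWEPOCH_MS = 1288834974657  # assumed Snowflake epoch, in milliseconds


def split_snowflake(tweet_id):
    """Split an id into (timestamp_ms, worker, sequence), assuming the
    41-bit timestamp / 10-bit worker / 12-bit sequence layout."""
    sequence = tweet_id & 0xFFF
    worker = (tweet_id >> 12) & 0x3FF
    timestamp_ms = (tweet_id >> 22) + TWEPOCH_MS
    return timestamp_ms, worker, sequence


def tweet_id_of(tweet):
    """Prefer id_str: a JSON parser that stores numbers as doubles
    cannot hold every integer above 2**53 exactly."""
    return int(tweet["id_str"])


# Doubles lose integer precision above 53 bits:
print(float(2**53 + 1) == float(2**53))  # True

# Constructed example id: 1 hour after the epoch, worker 7, sequence 42.
sample = (3600000 << 22) | (7 << 12) | 42
print(split_snowflake(sample))
```

Python integers are arbitrary precision, so the issue does not bite in Python itself; it matters when the same JSON is handled by JavaScript or stored as a double somewhere in the pipeline.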
  50. Tweets - example
      • Let us run oscon2012-
      • Example of tweet
         o coordinates
         o id
         o id_str
  51. Users
      • followers_count
      • geo_enabled
      • id, id_str
      • name, screen_name
      • protected
      • status, statuses_count
      • withheld_in_countries
      https://-objects/users
  52. Users - Let us run some examples
      • Run
         o Lookup users by screen_name
         o Lookup users by user_id
      • Inspect the results
         o id, name, status, statuses_count, protected, followers (for top 10 followers), withheld users
      • Can use information for customizing the user's screen in your web app
  53. Entities
      • Metadata & Contextual Information
      • You can parse them, but Entities parse them out as structured data
      • REST API/Search API - include_entities=1
      • Streaming API - included by default
      • hashtags, media, urls, user_mentions
      https://-objects/entities
      https://-entities
      https://-url-wrapper
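A sketch of what "parsed out as structured data" buys you. The sample tweet below is hand-made for illustration (not a real API response), and it assumes the usual shape of the entities object: lists of small dicts under `hashtags`, `urls` and `user_mentions`.

```python
def summarize_entities(tweet):
    """Pull hashtags, urls and mentions out of a tweet's entities,
    instead of re-parsing the tweet text."""
    entities = tweet.get("entities", {})
    return {
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u.get("expanded_url") or u["url"] for u in entities.get("urls", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
    }


# Hypothetical tweet, trimmed to the fields used here.
tweet = {
    "text": "Tutorial at #oscon with @ksankar http://t.co/abc",
    "entities": {
        "hashtags": [{"text": "oscon", "indices": [12, 18]}],
        "urls": [{"url": "http://t.co/abc", "expanded_url": None}],
        "user_mentions": [{"screen_name": "ksankar"}],
    },
}
print(summarize_entities(tweet))
```

Tweets fetched without include_entities simply produce empty lists here, which is easy to spot in a validation run.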
  54. Entities
      • Run
      • Inspect hashtags, urls et al
  55. Places
      • attributes
      • bounding_box
      • id (as a string!)
      • country
      • name
      https://-objects/places
      https://-geo-place-attributes
  56. Places
      • Can search for tweets near a place like so:
      • Get lat/long of the convention center [45.52929,-122.66289]
         o Tweets near that place
      • Tweets near San Jose [37.395715,-122.102308]
      • We will not go further here. But very useful
  57. Timelines
      • Collections of tweets ordered by time
      • Use max_id & since_id for navigation
      https://-with-timelines
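Navigation with max_id & since_id can be sketched as a loop that walks backwards in time. `fetch_page` is a hypothetical stand-in for whatever function makes the actual timeline call; the sketch assumes it returns tweets newest first, with ids no greater than `max_id` and greater than `since_id`.

```python
def walk_timeline(fetch_page, since_id=None, max_pages=10):
    """Collect tweets page by page, newest first.

    After each page, max_id is set to (lowest id seen - 1) so the next
    page starts just below it and no tweet is fetched twice."""
    tweets = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(max_id=max_id, since_id=since_id)
        if not page:
            break  # reached since_id or the end of the timeline
        tweets.extend(page)
        max_id = min(t["id"] for t in page) - 1
    return tweets


# Fake timeline with ids 10..1, three tweets per page, for illustration.
TIMELINE = [{"id": i} for i in range(10, 0, -1)]


def fake_fetch(max_id=None, since_id=None, count=3):
    rows = [t for t in TIMELINE
            if (max_id is None or t["id"] <= max_id)
            and (since_id is None or t["id"] > since_id)]
    return rows[:count]


print([t["id"] for t in walk_timeline(fake_fetch, since_id=2)])
```

The `max_pages` cap is there so that a long timeline cannot quietly eat the whole rate limit in one run.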
  58. Other Objects & APIs
      • Lists
      • Notifications
      • Friendships/exists to see if one follows the other
  59. Twitter Platform Objects
      • Users
         o Follow -> Friends
         o Are Followed By -> Followers
         o Status Update -> Tweets
      • Tweets
         o embed Entities: @ user_mentions, urls, media, # hashtags
         o embed Places
         o Temporally Ordered -> TimeLine
      https://-objects
  60. Hands-on Exercise (15 min)
      • Setup environment - slide #14
      • Sanity Check Environment & Libraries
      • Get objects (show calls)
         o Lookup users by screen_name
         o Lookup users by id
         o Lookup tweets
         o Get entities
      • Inspect the results
      • Explore a little bit
      • Discussion
  61. Twitter APIs
  62. Twitter API
      • Twitter REST (Core Data, Core Twitter Objects)
         o Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet
         o Rate Limit : 150/350
      • Twitter Search (Search & Trend)
         o Keywords, Specific User, Trends
         o Rate Limit : Complexity & Frequency
      • Streaming (Near-realtime, High Volume)
         o Follow users, topics, data mining
         o Public Streams, User Streams, Site Streams, Firehose
  63. Twitter REST API
      • What we have been using so far is the REST API
      • Request-Response
      • Anonymous or OAuth
      • Rate Limited :
         o 150/350
  64. Twitter Trends
      • oscon2012-
      • Trends/weekly, Trends/monthly
      • Let us run some examples
      • Trends & hashtags
         o #hashtag euro2012
         o -hashtags/
         o -2012-follow-all-action-on-pitch.html
         o Top 10 :
  65. Brand Rank w/ Twitter
      • Walk Through & results of following
      • Followed 10 user-brands for a few days to find growth
      • Brand Rank
         o Growth of a brand w.r.t the industry
         o Surge in popularity - could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
      • API : url= lookup.json
      • payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
  66. Brand Rank w/ Twitter • Clouderati is very stable
  67. Brand Rank w/ Twitter: Tech Brands • Google I/O showed a spike on 6/27-6/28 • OReillyMedia shares some of the spike • Looking at a few days' worth of data, our best inference is that “oscon doesn’t track with googleio” • “Clouderati doesn’t track at all”
  68. Brand Rank w/ Twitter: World of Soccer • FOXSoccer and UEFAcom track each other • The numbers seldom decrease, so calculating -ve velocity will not work • OTOH, if you do see a -ve velocity, investigate
  69. Brand Rank w/ Twitter: World of Basketball • NBA, MiamiHeat, okcthunder track each other • Used % rather than absolute numbers to compare • The hike from 7/6 to 7/10 is interesting
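The comparison described on the Brand Rank slides (percent growth rather than absolute numbers, plus a check for the rare negative velocity) could be sketched roughly as follows. The follower counts are made up for illustration and are not the tutorial's data; the function names `pct_growth` and `negative_velocity_days` are assumptions.

```python
def pct_growth(counts):
    """Percent change in follower count between consecutive snapshots."""
    return [100.0 * (b - a) / a for a, b in zip(counts, counts[1:])]


def negative_velocity_days(counts):
    """Indices of snapshots where the count dropped versus the previous one."""
    return [i for i, (a, b) in enumerate(zip(counts, counts[1:]), start=1)
            if b < a]


# Hypothetical daily followers_count snapshots (NOT real data)
snapshots = {
    "big_brand": [1000000, 1010000, 1020100],
    "small_brand": [2000, 2020, 2010],
}

# Percent growth puts a 1M-follower brand and a 2K-follower brand on one scale
growth = {name: pct_growth(counts) for name, counts in snapshots.items()}

for name, counts in snapshots.items():
    drops = negative_velocity_days(counts)
    if drops:
        print(name, "lost followers at snapshot(s)", drops, "- investigate")
```

Working in percentages is what lets accounts of very different sizes share a chart; a drop is flagged rather than averaged away because the slides note that counts seldom decrease.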
  70. Brand Rank w/ Twitter: Rising Tide … • For some reason, all numbers are going up 7/6 thru 7/10, except for clouderati! • Is a rising (Twitter) tide lifting all (well, almost all)?
  71. Trivia: Search API • Search was built by Summize, which was acquired by Twitter in 2008 • Summize described itself as “sentiment mining”
  72. Search API • Very simple o GET<blah> • Based on a search criterion • “The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets” • Recent = the last 6-9 days' worth of tweets • Anonymous call • Rate limit o Not number of calls/hour, but complexity & frequency • https://-search
  73. Search API • Filters o The search query is URL-encoded o @ = %40, # = %23 o Emoticons :) and :( • Location filters, date filters • Content searches
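The URL-encoding rule on this slide (@ becomes %40, # becomes %23) can be checked with Python's standard library. This is only a sketch of the encoding step: it builds a query string and does not call any Twitter endpoint, and the helper name `encode_query` is an assumption.

```python
from urllib.parse import quote, urlencode


def encode_query(q):
    """Percent-encode a search term, leaving no reserved character unescaped."""
    return quote(q, safe="")


print(encode_query("@clouderati"))  # %40clouderati
print(encode_query("#oscon"))       # %23oscon

# urlencode() builds a whole query string; spaces become '+'
query_string = urlencode({"q": "#euro2012 :)"})
print(query_string)                 # q=%23euro2012+%3A%29
```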
  74. Streaming API • Not request-response, but a stream • Twitter frameworks support it • Rate limit: up to 1% • Stall warning if the client is falling behind • Good documentation links o -apis/connecting o -apis/parameters o -apis/processing
  75. Firehose •  ~  400  million  public  tweets/day  •  If  you  are  working  with  Twitter  firehose,  I  envy  you  !  •  If  you  hit  real  limits,  then  explore  the  firehose  route  •  AFAIK,  it  is  not  cheap,  but  worth  it  
  76. API Best Practices 1. Use JSON 2. Use user_id rather than screen_name o user_id is constant while screen_name can change 3. Use max_id and since_id o For example, with direct messages, if you have the last message, pass its id as since_id to fetch only newer ones o max_id bounds how far back to go 4. Cache as much as you can 5. Set the User-Agent header for debugging • These are gathered from various books, blogs & other media I used for this tutorial. I have listed a few good blogs that have API best practices in the reference section at the end of this presentation
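Best practice 3 is easier to see in code. The sketch below pages backwards with max_id and stops at since_id. To stay runnable offline it uses a stand-in `fetch_page` function over a fake in-memory timeline instead of a real API call; both function names and the fake data are assumptions. The only API behaviour relied on is that max_id is inclusive (ids less than or equal to it) and since_id is exclusive (ids greater than it).

```python
def fetch_page(tweets, max_id=None, since_id=None, count=3):
    """Stand-in for an API call: newest first, since_id < id <= max_id."""
    rows = [t for t in sorted(tweets, key=lambda t: t["id"], reverse=True)
            if (max_id is None or t["id"] <= max_id)
            and (since_id is None or t["id"] > since_id)]
    return rows[:count]


def collect(tweets, since_id=None, count=3):
    """Page backwards until the stand-in returns an empty page."""
    results, max_id = [], None
    while True:
        page = fetch_page(tweets, max_id=max_id, since_id=since_id, count=count)
        if not page:
            break
        results.extend(page)
        # max_id is inclusive, so step just below the oldest id seen so far
        max_id = page[-1]["id"] - 1
    return results


# Fake timeline with ids 1..10 (illustration only)
FAKE_TIMELINE = [{"id": i, "text": "tweet %d" % i} for i in range(1, 11)]

all_ids = [t["id"] for t in collect(FAKE_TIMELINE)]
# Second run: only fetch what is newer than the last id already cached
new_ids = [t["id"] for t in collect(FAKE_TIMELINE, since_id=7)]
print(all_ids, new_ids)
```

Keeping the highest id seen from the previous run and passing it as since_id is also what makes best practice 4 (caching) cheap: only the delta is fetched.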
  77. Twitter API (same diagram as slide 62: REST, Search & Streaming APIs with their rate limits) • Questions?
  78. Part II: Twitter Network Analysis (SNA)
  79. [Diagram: analysis pipeline] 1. Collect • 2. Store • 3. Transform & Analyze • 4. Model & Reason • 5. Predict, Recommend & Visualize • Validate the dataset & re-crawl/refresh • Tip 1: Implement as a staged pipeline, never a monolith • Tip 3: Keep the schema simple; don't be afraid to transform • Most important & the ugliest slide in this deck!
  80. Trivia • Social Network Analysis originated as Sociometry & the social network was called a sociogram • Back then, Facebook was called SocioBinder! • Jacob Levy Moreno is considered the originator o NYTimes, April 3, 1933, p. 17
  81. Twitter Networks-Definitions • Nodes o Users o #tags • Edges o Follows o Friends o @mentions o #tags • Directed
  82. Twitter Networks-Definitions • In-degree o Followers • Out-degree o Friends/Follow • Centrality measures • Hubs & Authorities o Hubs/Directories tell us where Authorities are o “Of Mortals & Celebrities” is more “Twitter-style”
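As a small illustration of these definitions, in-degree (followers) and out-degree (friends) fall out of a directed edge list of (follower, followed) pairs. The edges and the helper name `degrees` below are made up for the example; networkx's DiGraph exposes the same counts through in_degree and out_degree.

```python
from collections import defaultdict

# Hypothetical directed "follows" edges: (follower, followed)
follows = [
    ("alice", "clouderati"),
    ("bob", "clouderati"),
    ("clouderati", "alice"),
    ("carol", "alice"),
]


def degrees(edges):
    """Return (in_degree, out_degree) dicts for a directed edge list."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for follower, followed in edges:
        out_deg[follower] += 1   # one more friend for the follower
        in_deg[followed] += 1    # one more follower for the followed account
    return dict(in_deg), dict(out_deg)


in_degree, out_degree = degrees(follows)
print("followers:", in_degree)
print("friends:", out_degree)
```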
  83. Twitter Networks-Properties • Concepts from Citation Networks o Cocitation • Common papers that cite a paper • Common Followers o C & G (followed by F & H) o Bibliographic Coupling • Cite the same papers • Common Friends (i.e. follow the same person) o D, E, F & H [Diagram: example follow graph with nodes A through N]
  84. Twitter Networks-Properties • Concepts from Citation Networks o Cocitation • Common papers that cite a paper • Common Followers o C & G (followed by F & H) o Bibliographic Coupling • Cite the same papers • Common Friends (i.e. follow the same person) o D, E, F & H follow C o H & F follow C & G • So H & F have high coupling • Hence, if H follows A, we can recommend F to follow A [Diagram: example follow graph with nodes A through N]
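The coupling argument on this slide can be traced with plain sets. The sketch encodes only the edges the slide states (D, E, F & H follow C; H & F follow G) plus the slide's hypothetical "H follows A"; the rest of the diagram is not reproduced, and the function names are assumptions.

```python
# user -> set of accounts that user follows (edges taken from the slide text)
friends = {
    "D": {"C"},
    "E": {"C"},
    "F": {"C", "G"},
    "H": {"C", "G", "A"},   # "if H follows A ..."
}


def coupling(u, v):
    """Bibliographic coupling: number of common friends of u and v."""
    return len(friends[u] & friends[v])


def cocitation(a, b):
    """Cocitation: number of users who follow both a and b."""
    return sum(1 for followed in friends.values() if a in followed and b in followed)


def recommend(user):
    """Suggest accounts followed by the most strongly coupled peer."""
    peers = [p for p in friends if p != user]
    best = max(peers, key=lambda p: coupling(user, p))
    return sorted(friends[best] - friends[user])


print(coupling("F", "H"))    # F and H share C and G
print(cocitation("C", "G"))  # C and G are both followed by F and H
print(recommend("F"))        # what H follows that F does not
```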
  85. Twitter Networks-Properties • Bipartite/Affiliation Networks o Two disjoint subsets o The bipartite concept is very relevant to the Twitter social graph o Membership in Lists • lists vs. users bipartite graph o Common #tags in Tweets • #tags vs. members bipartite graph o @mention together • Can this be a bipartite graph? • How would we fold this?
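One way to read "folding" is as projecting the bipartite graph onto one of its two node sets: two users become connected when they share a #tag, weighted by how many tags they share. A minimal sketch under that reading, with made-up tags and users and an assumed helper name `fold`:

```python
from collections import Counter
from itertools import combinations

# Hypothetical affiliation data: #tag -> users who tweeted with it
tag_users = {
    "#cloud": {"u1", "u2", "u3"},
    "#python": {"u2", "u3"},
    "#oscon": {"u3", "u4"},
}


def fold(membership):
    """Project a bipartite membership map onto its members.

    Returns a Counter mapping (user_a, user_b) to the number of shared groups.
    """
    weights = Counter()
    for members in membership.values():
        for a, b in combinations(sorted(members), 2):
            weights[(a, b)] += 1
    return weights


folded = fold(tag_users)
for pair, weight in sorted(folded.items()):
    print(pair, weight)
```

The same function works unchanged for the lists vs. users graph; folding in the other direction (tags connected through shared users) only needs the map inverted first.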
  86. Other Metrics & Mechanisms • Kronecker Graph Models o The Kronecker product is a way of generating self-similar matrices o Prof. Leskovec et al. define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices o Application: generating models for analysis, prediction, anomaly detection, etc. • Erdős-Rényi Random Graphs o Easy to build a G(n,p) graph o Assumes equal likelihood of an edge between any two nodes o In a Twitter social network, we can create a more realistic expected distribution (adding the “social reality” dimension) by inspecting the #tags & @mentions • Network Diameter • Weak Ties • Follower velocity (+ve & -ve), association strength o Unfollow is not a reliable measure o But an interesting property to investigate when it happens • Not covered here, but potential for an encore! • Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
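Both generators named on this slide are short enough to sketch in plain Python. The block below computes the Kronecker product of two square adjacency matrices and samples an undirected G(n,p) graph; the 2x2 seed matrix is an arbitrary example, not one from the referenced paper, and the function names are assumptions.

```python
import random


def kronecker(a, b):
    """Kronecker product of two square adjacency matrices (lists of lists)."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]


def gnp(n, p, seed=None):
    """Erdos-Renyi G(n,p): keep each of the n*(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]


# Arbitrary 2-node seed graph (with self-loops), multiplied with itself
seed_graph = [[1, 1],
              [0, 1]]
k2 = kronecker(seed_graph, seed_graph)
for row in k2:
    print(row)

print(gnp(5, 0.5, seed=42))  # one random sample; edges vary with the seed
```

Repeating the product (seed, seed x seed, ...) is what yields the self-similar structure the slide mentions; G(n,p), by contrast, treats every pair of nodes identically.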
  87. Twitter Networks-Properties • Twitter != LinkedIn, Twitter != Facebook • Twitter Network == Interest Network • Be cognizant of the above when you apply traditional network properties to Twitter • For example, o Six degrees of separation doesn't make sense (most of the time) in Twitter, except maybe for cliques o Is diameter a reliable measure for a Twitter network? • Probably not o Do cut sets make sense? • Probably not o But citation network principles do apply; we can learn from cliques o Bipartite graphs do make sense
  88. Cliques (1 of 2) • “Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other” • Cohesive subgroup, closely connected • Look for near-cliques rather than a perfect clique (k-plex, i.e. each member connected to at least n-k others) • Use k-plex cliques to discover subgroups in a sparse network; 1-plex being the perfect clique • Ref: Networks, An Introduction - Newman
  89. Cliques (2 of 2) • k-core: connected to at least k others in the subset; (n-k)-plex • k-clique: no more than distance k away o Path inside or outside the subset o k-clan or k-club (path inside the subset) • We will apply k-plex cliques in one of our hands-on exercises • Ref: Networks, An Introduction - Newman
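The two definitions from these slides (maximal clique, and k-plex as "each member connected to at least n-k others") can be sketched in plain Python. The maximal-clique search below is the textbook Bron-Kerbosch recursion without pivoting, not necessarily what the hands-on scripts use, and the small graph is made up. networkx ships find_cliques for the same purpose.

```python
def maximal_cliques(adj):
    """Bron-Kerbosch (no pivoting) over an undirected adjacency dict of sets."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


def is_k_plex(adj, members, k):
    """True if every member is adjacent to at least n-k others in the subset."""
    members = set(members)
    n = len(members)
    return all(len(adj[v] & members) >= n - k for v in members)


# Hypothetical undirected graph: a-b, a-c, b-c, b-d, c-d (a and d not linked)
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}

print(maximal_cliques(adj))          # two triangles sharing the edge b-c
print(is_k_plex(adj, "abcd", k=1))   # not a perfect clique
print(is_k_plex(adj, "abcd", k=2))   # but a near-clique (2-plex)
```

The example shows why near-cliques matter in a sparse graph: the four nodes form one cohesive group that a strict clique search splits in two.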
  90. Sentiment Analysis • Sentiment analysis is an important & interesting area of work on the Twitter platform o Collect tweets o Opinion estimation: pass through a classifier or sentiment lexicons • Naïve Bayes/Max Entropy classifier/SVM o Aggregated text sentiment/moving average • I chose not to dive deeper because of time constraints o Couldn't do justice to API, Social Network and Sentiment Analysis, all in 3 hrs • The next 3 slides have a couple of interesting examples
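The lexicon route and the moving-average aggregation mentioned above can be sketched in a few lines. This is a toy: the six-word lexicon and the tweets are invented for the example, and it stands in for neither a real opinion lexicon nor the Naïve Bayes/MaxEnt/SVM classifiers named on the slide.

```python
# Tiny made-up lexicon (a real opinion lexicon has thousands of entries)
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "late", "hate"}


def score(text):
    """Count of positive words minus count of negative words."""
    words = [w.strip(".,!?#@").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def moving_average(values, window):
    """Simple moving average over a sliding window of the given size."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


# Hypothetical tweets, in time order
tweets = [
    "Love the new lounge, great staff!",
    "Flight late again. Bad day",
    "Boarding now",
    "I hate delays",
]

scores = [score(t) for t in tweets]
trend = moving_average(scores, 2)
print(scores)
print(trend)
```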
  91. Sentiment Analysis • Twitter mining for airline sentiment • Opinion lexicon: +ve 2000, -ve 4800 • http://www.inside--twitter-airline-consumer-sentiment
  92. Need I say more? “A bit of clever math can uncover interesting patterns that are not visible to the human eye” • http://-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
  93. Project  Ideas  
  94. Interesting Vectors of Exploration 1. Find trending #tags & then related #tags, using cliques over co-#tag-citation, which infers topics related to trending topics 2. Related #tag topics over a set of tweets by a user or group of users 3. Analysis: in/out flow, tweet flow o Frequent @mention 4. Find affiliation networks by List memberships, #tags or frequent @mentions
  95. Interesting Vectors of Exploration 5. Use centrality measures to determine mortals vs. celebrities 6. Classify Tweet networks/cliques based on message-passing characteristics o Tweets vs. retweets, number of retweets,… 7. Retweet network o Measure influence by retweet count & frequency o Information contagion by looking at different retweet network subcomponents: who, when, how much,…
  96. Twitter Network Graph Analysis: An Example
  97. Analysis Story Board • @clouderati is a popular cloud-related Twitter account • Goals (in this tutorial): o Analyze the social graph characteristics of the users who are following the account • Dig one level deep, to the followers & friends of the followers of @clouderati o How many cliques? How strong are they? • Goals (for you to explore!!): o Does the @mention support the clique inferences? o What are the retweet characteristics? o What does the #tag network graph look like?