Slides for my tutorial at OSCON 2012 http://goo.gl/fpxVE

Published in: Technology, Business

  1. The Art of Social Media Analysis with Twitter & Python
     krishna sankar, @ksankar
     http://www.oscon.com/oscon2012/public/schedule/detail/23130
  2. Intro: House Rules (1 of 2)
     o  Doesn't assume any knowledge of the Twitter API
     o  Goal: everybody on the same page, with a working knowledge of the Twitter API
     o  To bootstrap your exploration into Social Network Analysis & Twitter
     o  Simple programs, to illustrate usage & data manipulation
     [Roadmap diagram: Intro | API, Objects, … | Twitter Network Analysis Pipeline | NLP, NLTK, Sentiment Analysis | @mention network | Cliques, social graph | Retweet analytics, Information contagion | Growth, weak ties | #tag Network]
     We will analyze @clouderati: 2072 followers, exploding to ~980,000 distinct users down one level
  3. Intro: House Rules (2 of 2)
     o  Am using the requests library
     o  There are good Twitter frameworks for python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help
     o  Many areas to explore, not enough time. So decided to focus on social graph, cliques & networkx
     [Same roadmap diagram as slide 2]
  4. About Me
     •  Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com
        o  Co-chair, 2012 IEEE Precision Time Synchronization
           •  http://www.ispcs.org/2012/index.html
        o  Blog: http://doubleclix.wordpress.com/
        o  Quora: http://www.quora.com/Krishna-Sankar
     •  Prior Gigs
        o  Lead Architect (Egnyte)
        o  Distinguished Engineer (CSCO)
        o  Employee #64439 (CSCO) to #39 (Egnyte) & now #9 !
     •  Current Focus:
        o  Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
        o  Big Data (more of variety, variability, context & graphs, than volume or velocity, so far !)
        o  Overlay based semantic search & ranking
     •  Other related Presentations
        o  http://goo.gl/P1rhc Big Data Engineering Top 10 Pragmatics (Summary)
        o  http://goo.gl/0SQDV The Art of Big Data (Detailed)
        o  http://goo.gl/EaUKH The Hitchhiker's Guide to Kaggle, OSCON 2011 Tutorial
  5. Twitter Tips - A Baker's Dozen
     1.  Twitter APIs are (more or less) congruent & symmetric
     2.  Twitter is usually right & simple: recheck when you get unexpected results before blaming Twitter
         o  I was getting numbers when I was expecting screen_names in user objects
         o  Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong: screen_name instead of user_id
         o  Always test with one or two records before a long run ! Learned the hard way
     3.  Twitter APIs are very powerful; consistent use can yield huge amounts of data
         o  In a week, you can pull in 4-5 million users & some tweets !
         o  Night runs are far faster & more error-free
     4.  Use a NOSQL data store as a command buffer & data buffer
         o  Makes it easy to work with Twitter at scale
         o  I use MongoDB
         o  Keep the schema simple & no fancy transformation
            •  And, as far as possible, the same as the (json) response
         o  Use the NOSQL CLI for trimming records et al
     [Graphic: The End As The Beginning]
  6. Twitter Tips - A Baker's Dozen
     5.  Always use a big data pipeline
         o  Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
         o  That way you can orthogonally extend, with functional components like command buffers, validation et al
     6.  Use a functional approach for a scalable pipeline
         o  Compose your big data pipeline with well defined granular functions, each doing only one thing
         o  Don't overload the functional components (i.e. no collect, unroll & store as a single component)
         o  Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques
            •  This did create some trouble for me, as we will see later
     7.  Crawl-Store-Validate-Recrawl-Refresh cycle
         o  The equivalent of the traditional ETL
         o  Validation stage & validation routines are important
            •  Cannot expect perfect runs
            •  Cannot manually look at data either, when data is at scale
     8.  Have control numbers to validate runs & monitor them
         o  I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs !
         o  There will be a separate printout of the control numbers that will be kept in the operations files
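Tips 6 and 8 can be sketched together: small single-purpose stages, with a control number recorded after each one. This is only an illustration, not code from the tutorial repository; the names `collect`, `validate` and `run_pipeline` and the sample records are made up for the example.

```python
def collect(source):
    """Collect stage: take the raw records as they are, no transformation."""
    return list(source)


def validate(records):
    """Validation stage: keep only the records that carry an id."""
    return [record for record in records if "id" in record]


def run_pipeline(source):
    """Run the stages in order and keep a control number per stage."""
    control = {}
    records = collect(source)
    control["collected"] = len(records)
    records = validate(records)
    control["validated"] = len(records)
    return records, control


# Hypothetical sample data: the third record is missing its id.
sample = [{"id": 1}, {"id": 2}, {"name": "no id"}]
records, control = run_pipeline(sample)
print(control)
```

Comparing the control numbers from one stage to the next shows where records were dropped without looking at the data by hand.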
  7. Twitter Tips - A Baker's Dozen
     9.  Program defensively
         o  More so for REST-based big data analytics systems
         o  Expect failures at the transport layer & accommodate them
     10. Have Erlang-style supervisors in your pipeline
         o  Fail fast & move on
         o  Don't linger and try to fix errors that cannot be controlled at that layer
         o  A higher layer process will circle back and do incremental runs to correct missing spiders and crawls
         o  Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
         o  I have an example in Part 2
     11. Data will never be perfect
         o  Know your data & accommodate its idiosyncrasies
            •  for example: 0 followers, protected users, 0 friends, …
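A minimal sketch of the "fail fast & move on" idea in tip 10, assuming a transport failure surfaces as an IOError. The `supervise` and `flaky_fetch` functions are hypothetical stand-ins, not part of the tutorial code or of any Twitter library.

```python
def supervise(user_ids, fetch):
    """Call fetch(user_id) for each id; on failure, note the id and move on."""
    results, missed = {}, []
    for user_id in user_ids:
        try:
            results[user_id] = fetch(user_id)
        except IOError:
            # Do not try to repair the error at this layer.
            missed.append(user_id)
    return results, missed


def flaky_fetch(user_id):
    """Hypothetical stand-in for an API call; id 2 always fails."""
    if user_id == 2:
        raise IOError("transport error")
    return {"id": user_id}


results, missed = supervise([1, 2, 3], flaky_fetch)
print(missed)
```

A higher-layer process would later pick up the ids in `missed` and re-crawl them in an incremental run.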
  8. Twitter Tips - A Baker's Dozen
     12. Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
         o  See a MongoDB example in Part 2
     13. Don't bombard the URL
         o  Wait a few seconds between successive calls. This will end up with a scalable system, eventually
         o  I found 10 seconds to be the sweet spot. 5 seconds gave retry errors. I was able to work with 5 seconds with wait & retry. Then the rate limit started kicking in !
     14. Always measure the elapsed time of your API runs & processing
         o  A kind of early warning when something is wrong
     15. Develop incrementally; don't fail to check "cut & paste" errors
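The checkpoint idea in tip 12 can be sketched with a plain JSON file standing in for the MongoDB command buffer used in the tutorial. The function name `run_with_checkpoint` and the checkpoint file layout are assumptions made for this example only.

```python
import json
import os


def run_with_checkpoint(commands, handler, checkpoint_path):
    """Process commands in order, saving progress after every command.

    Returns the number of commands processed in this run.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        # Restart: skip everything a previous run already finished.
        with open(checkpoint_path) as fh:
            done = json.load(fh)["done"]
    for index in range(done, len(commands)):
        handler(commands[index])
        # Checkpoint after each call so a crash loses at most one command.
        with open(checkpoint_path, "w") as fh:
            json.dump({"done": index + 1}, fh)
    return len(commands) - done
```

If a run dies part-way, calling the function again with the same command list and checkpoint path resumes after the last command that completed.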
  9. Twitter Tips - A Baker's Dozen
     16. The Twitter big data pipeline has lots of opportunities for parallelism
         o  Leverage data parallelism frameworks like MapReduce
         o  But first:
            §  Prototype as a linear system,
            §  Optimize and tweak the functional modules & cache strategies,
            §  Note down stages and tasks that can be parallelized and
            §  Then parallelize them
         o  For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
     17. Pay attention to handoffs between stages
         o  They might require transformation: for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
         o  But resist the urge to overload collect with transform
         o  i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents
         o  Add transformation as a granular function, of course with appropriate buffering, caching, checkpoints & restart techniques
     18. Have a good log management system to capture and wade through logs
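The unroll/flatten stage from tip 17 could look roughly like this. The field names `user_id`, `follower_ids` and `parent_id` are assumptions for illustration; the actual document layout used in the tutorial is not shown on the slide.

```python
def unroll(stored_docs):
    """Turn each stored array of follower ids into one document per follower."""
    for doc in stored_docs:
        for follower_id in doc["follower_ids"]:
            yield {"parent_id": doc["user_id"], "user_id": follower_id}


# What a collect stage might have stored: one array per crawled user.
stored = [
    {"user_id": 10, "follower_ids": [1, 2]},
    {"user_id": 20, "follower_ids": [3]},
]
flattened = list(unroll(stored))
print(flattened)
```

Keeping this as its own stage leaves collect untouched and lets the flattening be re-run on its own.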
 10. Twitter Tips - A Baker's Dozen
     19. Understand the underlying network characteristics for the inference you want to make
         o  Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
         o  Twitter Network is more of an Interest Network
         o  So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
         o  But others, like Cliques and Bipartite Graphs, do
 11. Twitter Gripes
     1.  Need richer APIs for #tags
         o  Somewhat similar to users, viz. followers, friends et al
         o  Might make sense to make #tags a top level object with its own semantics
     2.  HTTP error return is not uniform
         o  Returns 400 Bad Request instead of 420
         o  Granted, there is enough information to figure this out
     3.  Need an easier way to get screen_name from user_id
     4.  "following" vs. "friends_count", i.e. "following" is a dummy variable
         o  There are a few like this, most probably for backward compatibility
     5.  Parameter validation is not uniform
         o  Gives "404 Not Found" instead of "406 Not Acceptable" or "413 Too Long" or "416 Range Unacceptable"
     6.  Overall, more validation would help
         o  Granted, these are more growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
 12. A Fork
     •  NLP, NLTK & deep into Tweets
        o  Sentiment Analysis
     •  Not enough time for both
     •  I chose the Social Graph route
 13. A minute about Twitter as a platform & its evolution
     •  https://dev.twitter.com/blog/delivering-consistent-twitter-experience
     •  "The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers community." - Chenda, CBS news
     •  ".. we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you're on Twitter.com or elsewhere on the web" - Michael
     My Wish & Hope
     •  I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
     •  I did like the fact that tweets are part of LinkedIn. I still used Twitter more than LinkedIn
        o  I don't think showing Tweets in LinkedIn took anything away from the Twitter experience
        o  LinkedIn experience & Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that
     •  I sincerely hope that the platform grows with a rich developer eco system
     •  Orthogonally extensible platform is essential
     •  Of course, along with a congruent user experience: " … core Twitter consumption experience through consistent tools"
 14. Setup
     •  For Hands on Today
        o  Python 2.7.3
        o  easy_install -v requests
           •  http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
        o  easy_install -v requests-oauth
        o  Hands on programs at https://github.com/xsankar/oscon2012-handson
     •  For advanced data science with social graphs
        o  easy_install -v networkx
        o  easy_install -v numpy
        o  easy_install -v nltk
           •  Not for this tutorial, but good for sentiment analysis et al
        o  MongoDB
           •  I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS
        o  graphviz - http://www.graphviz.org/; easy_install pygraphviz
        o  easy_install pydot
 15. Thanks To these Giants …
 16. Problem Domain For this tutorial
     •  Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
        o  Not for Twitter based apps for real time tweets
        o  Not web sites with real time tweets
     •  By looking at the domain in aggregate to derive inferences & actionable recommendations
     •  Which also means you need to be deliberate & systematic (i.e. not look at a fluctuation as a trend, but dig deeper before pronouncing a trend)
 17. Agenda
     I.   Mechanics: Twitter API (1:30 PM - 3:00 PM)
          o  Essential Fundamentals (Rate Limit, HTTP Codes et al)
          o  Objects
          o  API
          o  Hands-on (2:45 PM - 3:00 PM)
     II.  Break (3:00 PM - 3:30 PM)
     III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
          o  Underlying Concepts
          o  Social Graph Analysis of @clouderati
             §  Stages, Strategies & Tasks
             §  Code Walk thru
 18. Open This First
 19. Twitter API: Read These First
     •  Using Twitter Brand
        o  New logo & associated guidelines: https://twitter.com/about/logos
        o  Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
        o  Developer Rules of the road: https://dev.twitter.com/terms/api-terms
     •  Read These Links First
        1.  https://dev.twitter.com/docs/things-every-developer-should-know
        2.  https://dev.twitter.com/docs/faq
        3.  Field Guide to Objects: https://dev.twitter.com/docs/platform-objects
        4.  Security: https://dev.twitter.com/docs/security-best-practices
        5.  Media Best Practices: https://dev.twitter.com/media
        6.  Consolidated Page: https://dev.twitter.com/docs
        7.  Streaming APIs: https://dev.twitter.com/docs/streaming-apis
        8.  How to Appeal (Not that you all would need it !): https://support.twitter.com/articles/72585
     •  Only One version of Twitter APIs
 20. API Status Page
     •  https://dev.twitter.com/status
     •  https://dev.twitter.com/issues
     •  https://dev.twitter.com/discussions
 21. https://dev.twitter.com/status
     http://www.buzzfeed.com/tommywilhelm/google-users-being-total-dicks-about-the-twitter
 22. Open This First
     •  Install pre-reqs as per the setup slide
     •  Run
        o  oscon2012_open_this_first.py
        o  To test connectivity: "canary query"
     •  Run
        o  oscon2012_rate_limit_status.py
        o  Use http://www.epochconverter.com to check reset_time
     •  Formats: xml, json, atom & rss
 23. Twitter API
     •  Twitter REST: core data, core Twitter objects
        o  Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet
        o  Rate Limit: 150/350
     •  Twitter Search: search & trend
        o  Keywords, Specific User, Trends
        o  Rate Limit: Complexity & Frequency
     •  Streaming: near-realtime, high volume
        o  Follow users, topics, data mining
        o  Public Streams, User Streams, Site Streams, Firehose
 24. Rate Limit
 25. Rate Limits
     •  By API type & Authentication Mode
        API        | No authC               | authC  | Error
        REST       | 150/hr                 | 350/hr | 400
        Search     | Complexity & Frequency | -N/A-  | 420
        Streaming  | Upto 1%                |        |
        Fire hose  | none                   | none   |
 26. Rate Limit Header
     {
       "status": "200 OK",
       "vary": "Accept-Encoding",
       "x-frame-options": "SAMEORIGIN",
       "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6",
       "x-ratelimit-class": "api",
       "x-ratelimit-limit": "150",
       "x-ratelimit-remaining": "149",
       "x-ratelimit-reset": "1340467358",
       "x-runtime": "0.04144",
       "x-transaction": "2b49ac31cf8709af",
       "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba"
     }
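The x-ratelimit-reset value is in epoch seconds, so it can also be converted locally instead of pasting it into a web converter. A small sketch using only the standard library; the helper name `reset_time_utc` is made up for this example, and the header values are the ones shown on the slide.

```python
import time


def reset_time_utc(headers):
    """Format the x-ratelimit-reset epoch seconds as a UTC timestamp string."""
    reset_epoch = int(headers["x-ratelimit-reset"])
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(reset_epoch))


# Values copied from the response header above.
headers = {"x-ratelimit-remaining": "149", "x-ratelimit-reset": "1340467358"}
print(reset_time_utc(headers))
```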
 27. Rate Limit-ed Header
     {
       "cache-control": "no-cache, max-age=300",
       "content-encoding": "gzip",
       "content-length": "150",
       "content-type": "application/json; charset=utf-8",
       "date": "Wed, 04 Jul 2012 00:48:25 GMT",
       "expires": "Wed, 04 Jul 2012 00:53:25 GMT",
       "server": "tfe",
       …
       "status": "400 Bad Request",
       "vary": "Accept-Encoding",
       "x-ratelimit-class": "api",
       "x-ratelimit-limit": "150",
       "x-ratelimit-remaining": "0",
       "x-ratelimit-reset": "1341363230",
       "x-runtime": "0.01126"
     }
 28. Rate Limit Example
     •  Run
        o  oscon2012_rate_limit_02.py
     •  It iterates through a list to get followers
     •  List is 2072 long
 29. Last time, it gave me 5 min. Now the reset timer is 1 hour. 150 calls, not authenticated
     {
       …
       "date": "Wed, 04 Jul 2012 00:54:16 GMT",
       "status": "200 OK",
       "vary": "Accept-Encoding",
       "x-frame-options": "SAMEORIGIN",
       "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8",
       "x-ratelimit-class": "api",
       "x-ratelimit-limit": "150",
       "x-ratelimit-remaining": "147",
       "x-ratelimit-reset": "1341366831",
       "x-runtime": "0.02768",
       "x-transaction": "f1bafd60112dddeb",
       "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
     }
 30. And Rate Limit kicked in
     {
       "cache-control": "no-cache, max-age=300",
       "content-encoding": "gzip",
       "content-type": "application/json; charset=utf-8",
       "date": "Wed, 04 Jul 2012 00:55:04 GMT",
       …
       "status": "400 Bad Request",
       "transfer-encoding": "chunked",
       "vary": "Accept-Encoding",
       "x-ratelimit-class": "api",
       "x-ratelimit-limit": "150",
       "x-ratelimit-remaining": "0",
       "x-ratelimit-reset": "1341366831",
       "x-runtime": "0.01342"
     }
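One way to act on a rate-limited response is to work out how long to sleep from the headers themselves. The sketch below assumes the "date" header is the server's current time and uses only the standard library; `seconds_until_reset` is a hypothetical helper, not part of the tutorial code.

```python
import calendar
import email.utils


def seconds_until_reset(headers):
    """Return how many seconds to wait before calling again (0 if calls remain)."""
    if int(headers["x-ratelimit-remaining"]) > 0:
        return 0
    # Parse the HTTP date header into epoch seconds (UTC).
    server_now = calendar.timegm(email.utils.parsedate(headers["date"]))
    return max(0, int(headers["x-ratelimit-reset"]) - server_now)


# Values copied from the rate-limited response above.
limited = {
    "date": "Wed, 04 Jul 2012 00:55:04 GMT",
    "x-ratelimit-remaining": "0",
    "x-ratelimit-reset": "1341366831",
}
print(seconds_until_reset(limited))
```

Using the server's own date header avoids depending on the local clock being in sync.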
  31. 31. API  with  OAuth •  {  •     …  •     "date":  "Wed,  04  Jul  2012  01:32:01  GMT",    •     "etag":  ""dd419c02ed00fc6b2a825cc27wbe040"",    •     "expires":  "Tue,  31  Mar  1981  05:00:00  GMT",    •     "last-­‐modified":  "Wed,  04  Jul  2012  01:32:01  GMT",    •     "pragma":  "no-­‐cache",    •     "server":  "tfe",    •  …  •  "status":  "200  OK",    •     "vary":  "Accept-­‐Encoding",    •     "x-­‐access-­‐level":  "read",    •     "x-­‐frame-­‐options":  "SAMEORIGIN",    •     "x-­‐mid":  "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",    •     "x-­‐ratelimit-­‐class":  "api_identified",    •     "x-­‐ratelimit-­‐limit":  "350",    •     "x-­‐ratelimit-­‐remaining":  "349",    •     "x-­‐ratelimit-­‐reset":  "1341369121",    •     "x-­‐runtime":  "0.05539",     OAuth  • •     "x-­‐transaction":  "9f8508fe4c73a407",        "x-­‐transaction-­‐mask":  "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"   “api-­‐identified”  •  }   1  hr  reset   350  calls  
  32. 32. Rate limit resets during consecutive calls (reset time moves +1 hour)
  {
     …
     "date": "Thu, 05 Jul 2012 14:56:05 GMT",
     …
     "x-ratelimit-class": "api_identified",
     "x-ratelimit-limit": "350",
     "x-ratelimit-remaining": "133",
     "x-ratelimit-reset": "1341500165",
     …
  }
  ******** 2416
  {
     …
     "date": "Thu, 05 Jul 2012 14:56:18 GMT",
     …
     "status": "200 OK",
     …
     "x-ratelimit-class": "api_identified",
     "x-ratelimit-limit": "350",
     "x-ratelimit-remaining": "349",
     "x-ratelimit-reset": "1341503776",
  ******** 2417
  33. 33. Unexplained Errors
  Traceback (most recent call last):
    File "oscon2012_get_user_info_01.py", line 39, in <module>
      r = client.get(url, params=payload)
    File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
    File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
    File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
  requests.exceptions.ConnectionError: HTTPSConnectionPool(host=api.twitter.com, port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C221203257%2C15254297%2C44614426%2C617136931%2C415810340%2C76071717%2C17351462%2C574253%2C35048243%2C388547381%2C254329657%2C65585979%2C253580293%2C392741693%2C126403390%2C300467007%2C8962882%2C21545799%2C15254346%2C141083469%2C340312913%2C44614485%2C600359770%2C17351519%2C38323042%2C21545828%2C86557546%2C90751854%2C128500592%2C115917681%2C42517364%2C34128760%2C15254397%2C453559166%2C92849025%2C600359811%2C17351556%2C8962952%2C296038349%2C325503810%2C122209166%2C123827693%2C59294611%2C19448725%2C21545881%2C17351581%2C130468677%2C80266144%2C15254434%2C84680859%2C65586084%2C19448741%2C15254438%2C214483879%2C48808878%2C88654768%2C15474846%2C48808887%2C334021563%2C60214090%2C134792126%2C15254464%2C558416833%2C138986435%2C264815556%2C63488965%2C17222476%2C537445328%2C97854214%2C255598755%2C65586132%2C36226009%2C187220954%2C257346383%2C15254493%2C554222558%2C302564320%2C59165520%2C44614626%2C76071907%2C80266213%2C325503825%2C403227628%2C20368210%2C17351666%2C88654836%2C340313077%2C151569400%2C302564345%2C118014971%2C11060222%2C233229141%2C13727232%2C199803906%2C220435108%2C268531201
  •  While trying to get details of 1,000,000 users, I get this error, usually 10-6 AM PST
  •  Got around it by "Trap & wait 5 seconds"
  •  Night runs are relatively error free
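The "Trap & wait 5 seconds" workaround can be sketched as a small retry wrapper. This is a minimal sketch, not the script used in the tutorial: `call_with_retry`, its parameters and the default of three retries are illustrative assumptions; only the 5-second wait comes from the slide.

```python
import time

def call_with_retry(call, retries=3, wait_seconds=5, sleep=time.sleep, trap=(IOError,)):
    """Run call(); on a trapped error, wait and try again (hypothetical helper).

    `trap` is the tuple of exception types to catch. A caller using the
    requests library might pass (requests.exceptions.ConnectionError,).
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except trap:
            if attempt == retries:
                raise  # give up after the last allowed attempt
            sleep(wait_seconds)  # "trap & wait 5 seconds"
```

A call such as `call_with_retry(lambda: client.get(url, params=payload))` would then survive a few transient connection errors before giving up.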
  34. 34. A Day in the Life of Twitter Rate Limit
  {
     …
     "date": "Fri, 06 Jul 2012 03:41:09 GMT",
     "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
     "server": "tfe",
     "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
     "status": "400 Bad Request",
     "vary": "Accept-Encoding",
     "x-ratelimit-class": "api_identified",
     "x-ratelimit-limit": "350",
     "x-ratelimit-remaining": "0",
     "x-ratelimit-reset": "1341546334",
     "x-runtime": "0.01918"
  }
  •  Missed by 4 min!
  Error, sleeping
  {
     …
     "date": "Fri, 06 Jul 2012 03:46:12 GMT",
     …
     "status": "200 OK",
     …
     "x-ratelimit-class": "api_identified",
     "x-ratelimit-limit": "350",
     "x-ratelimit-remaining": "349",
     …
  •  OK after 5 min sleep
  35. 35. Strategies
  I have no exotic strategies, so far!
  1.  Obvious: track elapsed time & sleep when the rate limit kicks in
  2.  Combine authenticated & non-authenticated calls
  3.  Use multiple API types
  4.  Cache
  5.  Store & get only what is needed
  6.  Checkpoint & buffer request commands
  7.  Distributed data parallelism, for example AWS instances
  http://www.epochconverter.com/ <- useful to debug the timer
  Please share your tips and tricks for conserving the rate limit
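Strategy 1 can be driven directly by the `x-ratelimit-remaining` and `x-ratelimit-reset` headers shown in the earlier dumps. The sketch below is an illustration under assumptions: the function names and the 5-second safety margin are mine, and it expects the headers as a plain dict of strings.

```python
import time
from datetime import datetime, timezone

def reset_time_utc(reset_header):
    """Convert the x-ratelimit-reset epoch value to a readable UTC datetime."""
    return datetime.fromtimestamp(int(reset_header), tz=timezone.utc)

def seconds_to_sleep(headers, now=None, margin=5):
    """Return 0 if calls remain, else seconds until the reset plus a margin.

    `margin` is an assumed safety buffer, not something the API specifies.
    """
    remaining = int(headers.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0
    now = time.time() if now is None else now
    return max(0, int(headers["x-ratelimit-reset"]) - now) + margin
```

Sleeping for the computed interval, instead of a fixed 5 minutes, avoids both waking too early and waiting longer than needed; `reset_time_utc` does the same job as the epoch converter link when debugging the timer.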
  36. 36. Authentication
  37. 37. Authentication
  •  Three modes
     o  Anonymous
     o  HTTP Basic Auth
     o  OAuth
  •  As of Aug 31, 2010, only Anonymous or OAuth are supported
  •  OAuth enables the user to authorize an application without sharing credentials
  •  The user can also revoke that authorization
  •  Twitter supports OAuth 1.0a
  •  OAuth 2.0 is the new standard, much simpler
     o  No timeframe for Twitter support, yet
  38. 38. OAuth Pragmatics
  •  Helpful links
     o  https://dev.twitter.com/docs/auth/oauth
     o  https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
     o  https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
     o  http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
  •  Discussion of OAuth internal mechanisms is better left for another day
  •  For headless applications, get the OAuth token at https://dev.twitter.com/apps
  •  Create an application & get four credential pieces
     o  Consumer Key, Consumer Secret, Access Token & Access Token Secret
  •  All the frameworks have support for OAuth, so plug in these values & use the framework's calls
  •  I used the requests-oauth library like so:
  39. 39. requests-oauth
  Get the client using the token, key & secret from dev.twitter.com/apps, then use the client instead of requests.

      import requests
      from oauth_hook import OAuthHook  # module installed by the requests-oauth package

      def get_oauth_client():
          # The four credential pieces from dev.twitter.com/apps
          consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
          consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
          access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
          access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
          header_auth = True
          oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key, consumer_secret, header_auth)
          client = requests.session(hooks={'pre_request': oauth_hook})
          return client

      def get_followers(user_id):
          # Anonymous call: plain requests
          url = 'https://api.twitter.com/1/followers/ids.json'
          payload = {"user_id": user_id}  # if cursor is needed {"cursor": -1, "user_id": scr_name}
          r = requests.get(url, params=payload)
          return r

      def get_followers_with_oauth(user_id, client):
          # Authenticated call: same request, made through the OAuth client
          url = 'https://api.twitter.com/1/followers/ids.json'
          payload = {"user_id": user_id}  # if cursor is needed {"cursor": -1, "user_id": scr_name}
          r = client.get(url, params=payload)
          return r

  Ref: http://pypi.python.org/pypi/requests-oauth
  40. 40. OAuth Authorize Screen
  •  The user authenticates with Twitter & grants access to Forbes Social
  •  Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account
  41. 41. HTTP  Status   Codes
  42. 42. HTTP Status Codes
  •  0 Never made it to Twitter servers - library error
  •  200 OK
  •  304 Not Modified
  •  400 Bad Request
     o  Check error message for explanation
     o  REST Rate Limit!
  •  401 Unauthorized
     o  Beware: you could get this for other reasons as well
  •  403 Forbidden
     o  Hit Update Limit (> max Tweets/day, following too many people)
  •  404 Not Found
  •  406 Not Acceptable
  •  413 Too Long
  •  416 Range Unacceptable
  •  420 Enhance Your Calm
     o  Rate Limited
  •  500 Internal Server Error
  •  502 Bad Gateway
     o  Down for maintenance
  •  503 Service Unavailable
     o  Overloaded "Fail whale"
  •  504 Gateway Timeout
     o  Overloaded
  https://dev.twitter.com/docs/error-codes-responses
  43. 43. HTTP Status Code - Example
  {
     "cache-control": "no-cache, max-age=300",
     "content-encoding": "gzip",
     "content-length": "91",
     "content-type": "application/json; charset=utf-8",
     "date": "Sat, 23 Jun 2012 00:06:56 GMT",
     "expires": "Sat, 23 Jun 2012 00:11:56 GMT",
     "server": "tfe",
     …
     "status": "401 Unauthorized",
     "vary": "Accept-Encoding",
     "www-authenticate": "OAuth realm="https://api.twitter.com"",
     "x-ratelimit-class": "api",
     "x-ratelimit-limit": "0",
     "x-ratelimit-remaining": "0",
     "x-ratelimit-reset": "1340413616",
     "x-runtime": "0.01997"
  }
  {
     "errors": [
        {
           "code": 53,
           "message": "Basic authentication is not supported"
        }
     ]
  }
  •  Detailed error message in JSON! I like this
  44. 44. HTTP Status Code - Confusing Example
  •  GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
  {
     …
     "pragma": "no-cache",
     "server": "tfe",
     …
     "status": "404 Not Found",
     …
  }
  {
     "errors": [
        {
           "code": 34,
           "message": "Sorry, that page does not exist"
        }
     ]
  }
  •  Spelling mistake
     o  Should be screen_name
  •  But a confusing error!
  •  Should be 406 Not Acceptable or 413 Too Long, showing a parameter error
  45. 45. HTTP Status Code - Example
  {
     "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
     "content-encoding": "gzip",
     "content-length": "112",
     "content-type": "application/json;charset=utf-8",
     "date": "Sat, 23 Jun 2012 01:23:47 GMT",
     "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
     …
     "status": "401 Unauthorized",
     "www-authenticate": "OAuth realm="https://api.twitter.com"",
     "x-frame-options": "SAMEORIGIN",
     "x-ratelimit-class": "api",
     "x-ratelimit-limit": "150",
     "x-ratelimit-remaining": "147",
     "x-ratelimit-reset": "1340417742",
     "x-transaction": "d545a806f9c72b98"
  }
  {
     "error": "Not authorized",
     "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
  }
  •  Sometimes, the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12
  •  Clearly a parameter error (i.e. too many values for one parameter), not an authorization error
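The error bodies in these examples come in two shapes: an `errors` list with `code` and `message`, and a single `error` string with the `request` path. A defensive client can normalize both. This is a minimal sketch based only on the two bodies shown on these slides; `error_messages` is a hypothetical helper, and other response shapes may exist.

```python
import json

def error_messages(body_text):
    """Return a list of (code, message) pairs from a Twitter error body.

    Handles only the two shapes seen on the slides; code is None when
    the body carries just an "error" string.
    """
    data = json.loads(body_text)
    if "errors" in data:
        return [(e.get("code"), e.get("message")) for e in data["errors"]]
    if "error" in data:
        return [(None, data["error"])]
    return []
```

Logging the parsed message next to the HTTP status helps with cases like the 404 and 401 above, where the status code alone points at the wrong cause.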
  46. 46. Objects
  47. 47. Twitter Platform Objects
  [Diagram: Users follow and are followed by other users (Friends, Followers). Users post Status Updates (Tweets). Tweets embed Entities (@user_mentions, urls, media, #hashtags) and Places, and are temporally ordered into a TimeLine.]
  https://dev.twitter.com/docs/platform-objects
  48. 48. Tweets
  •  A.k.a. Status Updates
  •  Interesting fields
     o  coordinates <- geo location
     o  created_at
     o  entities (will see later)
     o  id, id_str
     o  possibly_sensitive
     o  user (will see later)
        •  perspectival attributes embedded within a child object of an unlike parent: hard to maintain at scale
        •  https://dev.twitter.com/docs/faq#6981
     o  withheld_in_countries
        •  https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
  https://dev.twitter.com/docs/platform-objects/tweets
  49. 49. A word about id, id_str
  •  June 1, 2010
     o  Snowflake, the id generator service
     o  "The full ID is composed of a timestamp, a worker number, and a sequence number"
     o  JavaScript had problems handling numbers > 53 bits
     o  "id": 819797
     o  "id_str": "819797"
  http://engineering.twitter.com/2010/06/announcing-snowflake.html
  https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
  https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
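The 53-bit problem is easy to reproduce: JavaScript numbers are 64-bit floats, and a Python `float` has the same precision, so it can stand in for the demonstration. A small sketch, where `survives_double` is an illustrative name:

```python
import json

def survives_double(n):
    """True if integer n round-trips through a 64-bit float unchanged."""
    return int(float(n)) == n

# Small ids are fine either way; ids above 2**53 can silently change.
small_ok = survives_double(819797)
big_ok = survives_double(2**53 + 1)

# Reading id_str and converting explicitly sidesteps the issue.
tweet = json.loads('{"id": 819797, "id_str": "819797"}')
safe_id = int(tweet["id_str"])
```

Python's own `int` is arbitrary precision, so the risk is mainly in JavaScript clients or anything that passes ids through floats; preferring `id_str` is the safe habit.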
  50. 50. Tweets - Example
  •  Let us run oscon2012-tweets.py
  •  Example of a tweet
     o  coordinates
     o  id
     o  id_str
  51. 51. Users
  •  followers_count
  •  geo_enabled
  •  id, id_str
  •  name, screen_name
  •  protected
  •  status, statuses_count
  •  withheld_in_countries
  https://dev.twitter.com/docs/platform-objects/users
  52. 52. Users - Let us run some examples
  •  Run
     o  oscon_2012_users.py
        •  Lookup users by screen_name
     o  oscon12_first_20_ids.py
        •  Lookup users by user_id
  •  Inspect the results
     o  id, name, status, statuses_count, protected, followers (for top 10 followers), withheld users
  •  Can use the information for customizing the user's screen in your web app
  53. 53. Entities
  •  Metadata & contextual information
  •  You could parse these out of the tweet text yourself, but Entities give them to you already parsed, as structured data
  •  REST API/Search API: include_entities=1
  •  Streaming API: included by default
  •  hashtags, media, urls, user_mentions
  https://dev.twitter.com/docs/platform-objects/entities
  https://dev.twitter.com/docs/tweet-entities
  https://dev.twitter.com/docs/tco-url-wrapper
  54. 54. Entities
  •  Run
     o  oscon2012_entities.py
  •  Inspect hashtags, urls et al
  55. 55. Places
  •  attributes
  •  bounding_box
  •  id (as a string!)
  •  country
  •  name
  https://dev.twitter.com/docs/platform-objects/places
  https://dev.twitter.com/docs/about-geo-place-attributes
  56. 56. Places
  •  Can search for tweets near a place like so:
  •  Get the lat/long of the convention center [45.52929,-122.66289]
     o  Tweets near that place
  •  Tweets near San Jose [37.395715,-122.102308]
  •  We will not go further here, but this is very useful
  57. 57. Timelines
  •  Collections of tweets ordered by time
  •  Use max_id & since_id for navigation
  https://dev.twitter.com/docs/working-with-timelines
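Navigation with `max_id` and `since_id` can be sketched as a loop that walks backwards through a timeline. This is a sketch under assumptions: `fetch_page` is a hypothetical callable wrapping whichever timeline call you use, expected to return tweets newest first, with ids no greater than `max_id` and greater than `since_id`.

```python
def walk_timeline(fetch_page, since_id=None, max_pages=10):
    """Collect tweets page by page, moving max_id below the oldest id seen."""
    tweets = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(max_id=max_id, since_id=since_id)
        if not page:
            break  # nothing older than max_id and newer than since_id
        tweets.extend(page)
        # Subtract 1 so the oldest tweet of this page is not fetched again.
        max_id = min(t["id"] for t in page) - 1
    return tweets
```

Storing the highest id collected and passing it as `since_id` on the next run means only new tweets are requested, which also helps conserve the rate limit.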
  58. 58. Other Objects & APIs
  •  Lists
  •  Notifications
  •  Friendships/exists, to see if one user follows the other
  59. 59. Twitter Platform Objects (recap)
  [Diagram repeated from slide 47: Users, Friends/Followers, Status Updates (Tweets), Entities, Places, TimeLine.]
  https://dev.twitter.com/docs/platform-objects
  60. 60. Hands-on Exercise (15 min)
  •  Setup environment: slide #14
  •  Sanity check environment & libraries
     o  oscon2012_open_this_first.py
     o  oscon2012_rate_limit_status.py
  •  Get objects (show calls)
     o  Lookup users by screen_name - oscon12_users.py
     o  Lookup users by id - oscon12_first_20_ids.py
     o  Lookup tweets - oscon12_tweets.py
     o  Get entities - oscon12_entities.py
  •  Inspect the results
  •  Explore a little bit
  •  Discussion
  61. 61. Twitter APIs
  62. 62. Twitter API
  [Diagram: three API families.
   Twitter REST: core data, core Twitter objects; build profile, create/post tweets, reply, favorite, re-tweet; rate limit: 150/350.
   Twitter Search: search & trend; keywords, specific user, trends; rate limit: complexity & frequency.
   Streaming: near-realtime, high volume; follow users, topics, data mining; Public Streams, User Streams, Site Streams, Firehose.]
  63. 63. Twitter REST API
  •  https://dev.twitter.com/docs/api
  •  What we have been using so far is the REST API
  •  Request-Response
  •  Anonymous or OAuth
  •  Rate limited:
     o  150/350
  64. 64. Twitter Trends
  •  oscon2012-trends.py
  •  Trends/weekly, Trends/monthly
  •  Let us run some examples
     o  oscon2012_trends_daily.py
     o  oscon2012_trends_weekly.py
  •  Trends & hashtags
     o  #hashtag euro2012
     o  http://hashtags.org/euro2012
     o  http://sproutsocial.com/insights/2011/08/twitter-hashtags/
     o  http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
     o  Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/
  65. 65. Brand Rank w/ Twitter
  •  Walk through & results of the following
     o  oscon2012_brand_01.py
  •  Followed 10 user-brands for a few days to find growth
  •  Brand Rank
     o  Growth of a brand w.r.t. the industry
     o  Surge in popularity: could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
  •  API: url=https://api.twitter.com/1/users/lookup.json
  •  payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
  66. 66. Brand Rank w/ Twitter
  [Chart] Clouderati is very stable
  67. 67. Brand Rank w/ Twitter: Tech Brands
  •  Google I/O showed a spike on 6/27-6/28
  •  OReillyMedia shares some of the spike
  •  Looking at a few days' worth of data, our best inference is that "oscon doesn't track with googleio"
  •  "Clouderati doesn't track at all"
  68. 68. Brand Rank w/ Twitter: World of Soccer
  •  FOXSoccer, UEFAcom track each other
  •  The numbers seldom decrease, so calculating -ve velocity will not work
  •  OTOH, if you see a -ve velocity, investigate
  69. 69. Brand Rank w/ Twitter: World of Basketball
  •  NBA, MiamiHeat, okcthunder track each other
  •  Used % rather than absolute numbers to compare
  •  The hike from 7/6 to 7/10 is interesting
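Comparing brands by percentage rather than absolute follower counts, and watching for the rare negative velocity, could look like the sketch below. The function names are illustrative and the follower counts in the usage note are made up, not data from the tutorial.

```python
def pct_growth(counts):
    """Percent change of each daily follower count relative to the first day."""
    base = counts[0]
    return [100.0 * (c - base) / base for c in counts]

def velocity(counts):
    """Day-over-day change in follower count."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]

def negative_days(counts):
    """Indexes of days on which the follower count dropped (worth investigating)."""
    return [i + 1 for i, v in enumerate(velocity(counts)) if v < 0]
```

For a hypothetical series `[1000, 1010, 1050]`, `pct_growth` puts a small account on the same scale as one with millions of followers, which is what makes the brands comparable on one chart.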
  70. 70. Brand Rank w/ Twitter: Rising Tide …
  •  For some reason, all numbers are going up 7/6 thru 7/10, except for clouderati!
  •  Is a rising (Twitter) tide lifting all (well, almost all)?
  71. 71. Trivia: Search API
  •  Search (search.twitter.com)
     o  Built by Summize, which was acquired by Twitter in 2008
     o  Summize described itself as "sentiment mining"
  72. 72. Search API
  •  Very simple
     o  GET http://search.twitter.com/search.json?q=<blah>
  •  Based on a search criterion
  •  "The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
  •  Recent = last 6-9 days' worth of tweets
  •  Anonymous call
  •  Rate limit
     o  Not no. of calls/hour, but complexity & frequency
  https://dev.twitter.com/docs/using-search
  https://dev.twitter.com/docs/api/1/get/search
  73. 73. Search API
  •  Filters
     o  Search terms must be URL encoded
     o  @ = %40, # = %23
     o  Emoticons :) and :(
     o  http://search.twitter.com/search.atom?q=sometimes+%3A)
     o  http://search.twitter.com/search.atom?q=sometimes+%3A(
  •  Location filters, date filters
  •  Content searches
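The URL encoding can be left to the standard library rather than done by hand. A minimal sketch: `search_url` is a hypothetical helper, and the endpoint is the one quoted on these slides. Note that `urlencode` also escapes the parentheses of emoticons (`)` becomes `%29`), which is slightly stricter than the slide's examples.

```python
from urllib.parse import urlencode

def search_url(query, fmt="json"):
    """Build a Search API URL with the query string safely encoded."""
    return "http://search.twitter.com/search.%s?%s" % (fmt, urlencode({"q": query}))

mention_and_tag = search_url("@twitterapi #oscon")
happy = search_url("sometimes :)", fmt="atom")
```

Libraries such as requests do the same encoding when the query is passed via `params=`, so manual encoding is only needed when assembling URLs as strings.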
  74. 74. Streaming API
  •  Not request-response, but a stream
  •  Twitter frameworks have the support
  •  Rate limit: up to 1%
  •  Stall warning if the client is falling behind
  •  Good documentation links
     o  https://dev.twitter.com/docs/streaming-apis/connecting
     o  https://dev.twitter.com/docs/streaming-apis/parameters
     o  https://dev.twitter.com/docs/streaming-apis/processing
  75. 75. Firehose
  •  ~400 million public tweets/day
  •  If you are working with the Twitter firehose, I envy you!
  •  If you hit real limits, then explore the firehose route
  •  AFAIK, it is not cheap, but worth it
  76. 76. API Best Practices
  1.  Use JSON
  2.  Prefer user_id over screen_name
     o  user_id is constant while screen_name can change
  3.  max_id and since_id
     o  For example, with direct messages: if you have the last message, use since_id to fetch only newer ones
     o  max_id: how far to go back
  4.  Cache as much as you can
  5.  Set the User-Agent header for debugging
  These are gathered from various books, blogs & other media I used for this tutorial. A few good blogs on API best practices are listed in the Reference section at the end of this presentation.
  77. 77. Twitter API: Questions?
  [Diagram repeated from slide 62: Twitter REST, Twitter Search and Streaming APIs with their rate limits.]
  78. 78. Part II: Twitter Network Analysis (SNA)
  79. 79. Most important & the ugliest slide in this deck!
  [Diagram: the pipeline stages]
  1.  Collect
  2.  Store
  3.  Transform & Analyze
  4.  Model & Reason
  5.  Predict, Recommend & Visualize
  •  Validate the dataset & re-crawl/refresh
  •  Tip 1: Implement as a staged pipeline, never a monolith
  •  Tip 3: Keep the schema simple; don't be afraid to transform
  80. 80. Trivia
  •  Social Network Analysis originated as Sociometry & the social network was called a sociogram
  •  Back then, Facebook was called SocioBinder!
  •  Jacob Levy Moreno is considered the originator
     o  NYTimes, April 3, 1933, P. 17
  81. 81. Twitter Networks - Definitions
  •  Nodes
     o  Users
     o  #tags
  •  Edges
     o  Follows
     o  Friends
     o  @mentions
     o  #tags
  •  Directed
  82. 82. Twitter Networks - Definitions
  •  In-degree
     o  Followers
  •  Out-degree
     o  Friends/Follow
  •  Centrality measures
  •  Hubs & Authorities
     o  Hubs/Directories tell us where Authorities are
     o  "Of Mortals & Celebrities" is more "Twitter-style"
  83. 83. Twitter Networks - Properties
  [Graph figure: follow network with nodes A through N]
  •  Concepts from citation networks
     o  Cocitation
        •  Common papers that cite a paper
        •  Common followers
     o  C & G (followed by F & H)
     o  Bibliographic coupling
        •  Cite the same papers
        •  Common friends (i.e. follow the same person)
     o  D, E, F & H
  84. 84. Twitter Networks - Properties
  [Graph figure: follow network with nodes A through N]
  •  Concepts from citation networks
     o  Cocitation
        •  Common papers that cite a paper
        •  Common followers
     o  C & G (followed by F & H)
     o  Bibliographic coupling
        •  Cite the same papers
        •  Common friends (i.e. follow the same person)
     o  D, E, F & H follow C
     o  H & F follow C & G
        •  So H & F have high coupling
        •  Hence, if H follows A, we can recommend F to follow A
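The cocitation and coupling ideas reduce to set intersections over "who follows whom". The sketch below encodes only the edges the slide spells out (D, E, F & H follow C; F & H follow G) plus the hypothetical "H follows A"; the rest of the graph in the figure is not reproduced, and the function names are illustrative.

```python
# follower -> set of accounts they follow (their "friends")
follows = {
    "D": {"C"},
    "E": {"C"},
    "F": {"C", "G"},
    "H": {"C", "G", "A"},  # "if H follows A ..."
}

def coupling(follows, a, b):
    """Bibliographic coupling: number of accounts both a and b follow."""
    return len(follows.get(a, set()) & follows.get(b, set()))

def cocitation(follows, x, y):
    """Cocitation: number of users who follow both x and y."""
    return sum(1 for friends in follows.values() if x in friends and y in friends)

def recommend(follows, user):
    """Suggest accounts followed by the most strongly coupled peer."""
    peers = [p for p in follows if p != user]
    best = max(peers, key=lambda p: coupling(follows, user, p))
    return follows[best] - follows[user]
```

Here F and H share two friends while the other pairs share one, so H is F's closest peer and A is what gets recommended to F, matching the reasoning on the slide.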
  85. 85. Twitter Networks - Properties
  •  Bipartite/Affiliation networks
     o  Two disjoint subsets
     o  The bipartite concept is very relevant to the Twitter social graph
     o  Membership in Lists
        •  lists vs. users bipartite graph
     o  Common #tags in Tweets
        •  #tags vs. members bipartite graph
     o  @mention together
        •  ? Can this be a bipartite graph
        •  ? How would we fold this?
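Folding a bipartite graph means projecting it onto one of its two node sets: two users become connected when they share a #tag (or a list), weighted by how many they share. A minimal sketch with made-up memberships; `fold` is an illustrative name.

```python
from collections import Counter
from itertools import combinations

def fold(memberships):
    """Project a tag -> users bipartite graph onto users.

    Returns a Counter mapping each (user_a, user_b) pair to the number
    of tags (or lists) the two users have in common.
    """
    weights = Counter()
    for users in memberships.values():
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical data: which users tweeted with which #tag
tag_users = {"#oscon": {"u1", "u2", "u3"}, "#python": {"u1", "u2"}}
user_graph = fold(tag_users)
```

The same function folds the other way if you pass a user -> tags mapping, giving a #tag co-occurrence graph instead.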
  86. 86. Other Metrics & Mechanisms
  •  Kronecker graph models
     o  The Kronecker product is a way of generating self-similar matrices
     o  Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
     o  Application: generating models for analysis, prediction, anomaly detection et al
  •  Erdős-Rényi random graphs
     o  Easy to build a G(n,p) graph
     o  Assumes equal likelihood of an edge between any two nodes
     o  In a Twitter social network, we can create a more realistic expected distribution (adding the "social reality" dimension) by inspecting the #tags & @mentions
  •  Network diameter
  •  Weak ties
  •  Follower velocity (+ve & -ve), association strength
     o  Unfollow is not a reliable measure
     o  But an interesting property to investigate when it happens
  Not covered here, but potential for an encore!
  Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
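Both models are short to write down. The sketch below uses plain nested lists for the adjacency matrices and is only meant to show the definitions; `kronecker` and `gnp_edges` are illustrative names, and a real analysis would more likely use a graph or array library.

```python
import random

def kronecker(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def gnp_edges(n, p, seed=None):
    """Erdős-Rényi G(n,p): each possible undirected edge appears with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]

# Seed adjacency matrix; repeated products grow a self-similar graph.
seed_graph = [[1, 1],
              [0, 1]]
grown = kronecker(seed_graph, seed_graph)
```

Each Kronecker step multiplies the node count by the size of the seed matrix, which is how a tiny seed can generate a large graph with repeating structure.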
  87. 87. Twitter Networks - Properties
  •  Twitter != LinkedIn, Twitter != Facebook
  •  Twitter Network == Interest Network
  •  Be cognizant of the above when you apply traditional network properties to Twitter
  •  For example,
     o  Six degrees of separation doesn't make sense (most of the time) in Twitter, except maybe for cliques
     o  Is diameter a reliable measure for a Twitter network?
        •  Probably not
     o  Do cut sets make sense?
        •  Probably not
     o  But citation network principles do apply; we can learn from cliques
     o  Bipartite graphs do make sense
  88. 88. Cliques (1 of 2)
  •  "Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other"
  •  Cohesive subgroup, closely connected
  •  Near-cliques rather than perfect cliques (k-plex, i.e. each member connected to at least n-k others)
  •  k-plex cliques help discover subgroups in a sparse network; the 1-plex is the perfect clique
  Ref: Networks, An Introduction - Newman
  89. 89. Cliques (2 of 2)
  •  k-core: each member connected to at least k others in the subset; same as an (n-k)-plex
  •  k-clique: members no more than distance k away from each other
     o  Path may run inside or outside the subset
     o  k-clan or k-club (path inside the subset)
  •  We will apply k-plex cliques in one of our hands-on exercises
  Ref: Networks, An Introduction - Newman
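The k-plex condition is straightforward to check for a candidate group: with n members, each must be connected to at least n-k of the others. The sketch below checks only that degree condition, not maximality, and assumes an undirected adjacency map (for Twitter, for example, mutual follows); `is_k_plex` and the toy graph are illustrative.

```python
def is_k_plex(adj, members, k):
    """True if every member is adjacent to at least n-k other members."""
    members = set(members)
    n = len(members)
    return all(len(adj.get(v, set()) & (members - {v})) >= n - k
               for v in members)

# Toy undirected graph: a triangle A-B-C with D attached to C
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
```

With k=1 only the triangle passes, which is the perfect clique; raising k relaxes the requirement so that looser groups in a sparse follower graph can still qualify.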
  90. 90. Sentiment Analysis
  •  Sentiment analysis is an important & interesting area of work on the Twitter platform
     o  Collect Tweets
     o  Opinion estimation: pass thru a classifier, sentiment lexicons
        •  Naïve Bayes/Max Entropy Class/SVM
     o  Aggregated text sentiment/moving average
  •  I chose not to dive deeper because of time constraints
     o  Couldn't do justice to API, Social Network and Sentiment Analysis, all in 3 hrs
  •  The next slides have a couple of interesting examples
  91. 91. Sentiment Analysis
  •  Twitter mining for airline sentiment
  •  Opinion lexicon: +ve 2000, -ve 4800
  http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
  http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
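A lexicon-based estimate followed by a moving average can be sketched in a few lines. This is a toy illustration, not the method of the linked study: the two-word lexicon in the usage note is made up (the real opinion lexicon has thousands of entries), and splitting on whitespace ignores punctuation, negation and emoticons.

```python
def lexicon_score(text, positive, negative):
    """Count of positive words minus count of negative words in the text."""
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

def moving_average(values, window):
    """Simple moving average over a list of per-tweet or per-day scores."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]
```

Scoring `"great flight but awful food"` against `{"great"}` and `{"awful"}` gives 0, one positive hit cancelling one negative; smoothing the per-day totals with `moving_average` gives the aggregated trend mentioned on the previous slide.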
  92. 92. Need I say more?
  "A bit of clever math can uncover interesting patterns that are not visible to the human eye"
  http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
  http://www.relevantdata.com/pdfs/IUStudy.pdf
  93. 93. Project  Ideas  
  94. 94. Interesting Vectors of Exploration
  1.  Find trending #tags & then related #tags, using cliques over co-#tag-citation, which infers topics related to trending topics
  2.  Related #tag topics over a set of tweets by a user or group of users
  3.  Analysis: in/out flow, tweet flow
     -  Frequent @mention
  4.  Find affiliation networks by List memberships, #tags or frequent @mentions
  95. 95. Interesting Vectors of Exploration
  5.  Use centrality measures to determine mortals vs. celebrities
  6.  Classify Tweet networks/cliques based on message passing characteristics
     -  Tweets vs. retweets, no. of retweets, …
  7.  Retweet network
     -  Measure influence by retweet count & frequency
     -  Information contagion by looking at different retweet network subcomponents: who, when, how much, …
  96. 96. Twitter Network Graph Analysis: An Example
  97. 97. Analysis Story Board
  •  @clouderati is a popular cloud-related Twitter account
  •  Goals:
     In this tutorial:
     o  Analyze the social graph characteristics of the users who are following the account
        •  Dig one level deep, to the followers & friends of the followers of @clouderati
     o  How many cliques? How strong are they?
     For you to explore:
     o  Does the @mention support the clique inferences?
     o  What are the retweet characteristics?
     o  What does the #tag network graph look like?