The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Final Slides for my 2012 Tutorial http://goo.gl/fpxVE

Published in: Technology, Business


  1. The Art of Social Media Analysis with Twitter & Python krishna sankar @ksankar http://www.oscon.com/oscon2012/public/schedule/detail/23130
  2. Intro: House Rules (1 of 2)
     o Doesn’t assume any knowledge of the Twitter API
     o Goal: everybody on the same page & a working knowledge of the Twitter API
     o To bootstrap your exploration into Social Network Analysis & Twitter
     o Simple programs, to illustrate usage & data manipulation
     [Roadmap diagram: Intro; API, Objects, …; Twitter Network Analysis Pipeline; NLP, NLTK, Sentiment Analysis; @mention network; Cliques, social graph; Retweet analytics, Information contagion; Growth, weak ties; #tag Network]
     [Diagram note: We will analyze @clouderati, 2072 followers, exploding to ~980,000 distinct users down one level]
  3. Intro: House Rules (2 of 2)
     o Am using the requests library
     o There are good Twitter frameworks for Python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help
     o Many areas to explore, not enough time. So decided to focus on social graph, cliques & networkx
     [Roadmap diagram repeated: Intro; API, Objects, …; Twitter Network Analysis Pipeline; NLP, NLTK, Sentiment Analysis; @mention network; Cliques, social graph; Retweet analytics, Information contagion; Growth, weak ties; #tag Network]
     [Diagram note: We will analyze @clouderati, 2072 followers, exploding to ~980,000 distinct users down one level]
  4. About Me
     • Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com
        o Co-chair, 2012 IEEE Precision Time Synchronization
           • http://www.ispcs.org/2012/index.html
        o Blog: http://doubleclix.wordpress.com/
        o Quora: http://www.quora.com/Krishna-Sankar
     • Prior Gigs
        o Lead Architect (Egnyte)
        o Distinguished Engineer (CSCO)
        o Employee #64439 (CSCO) to #39 (Egnyte) & now #9 !
     • Current Focus:
        o Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
        o Big Data (more of variety, variability, context & graphs, than volume or velocity, so far !)
        o Overlay based semantic search & ranking
     • Other related Presentations
        o http://goo.gl/P1rhc Big Data Engineering Top 10 Pragmatics (Summary)
        o http://goo.gl/0SQDV The Art of Big Data (Detailed)
        o http://goo.gl/EaUKH The Hitchhiker’s Guide to Kaggle OSCON 2011 Tutorial
  5. Twitter Tips – A Baker’s Dozen
     1. Twitter APIs are (more or less) congruent & symmetric
     2. Twitter is usually right & simple: recheck when you get unexpected results before blaming Twitter
        o I was getting numbers when I was expecting screen_names in user objects.
        o Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong: screen_name instead of user_id
        o Always test with one or two records before a long run ! (learned the hard way)
     3. Twitter APIs are very powerful; consistent use can yield huge amounts of data
        o In a week, you can pull in 4-5 million users & some tweets !
        o Night runs are far faster & error-free
     4. Use a NoSQL data store as a command buffer & data buffer
        o Would make it easy to work with Twitter at scale
        o I use MongoDB
        o Keep the schema simple & no fancy transformation
           • And as far as possible the same as the (JSON) response
        o Use the NoSQL CLI for trimming records et al
     [Graphic: The End As The Beginning]
  6. Twitter Tips – A Baker’s Dozen
     5. Always use a big data pipeline
        o Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
        o That way you can orthogonally extend, with functional components like command buffers, validation et al
     6. Use a functional approach for a scalable pipeline
        o Compose your big data pipeline with well defined granular functions, each doing only one thing
        o Don’t overload the functional components (i.e. no collect, unroll & store as a single component)
        o Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques
           • This did create some trouble for me, as we will see later
     7. Crawl-Store-Validate-Recrawl-Refresh cycle
        o The equivalent of the traditional ETL
        o Validation stage & validation routines are important
           • Cannot expect perfect runs
           • Cannot manually look at data either, when data is at scale
     8. Have control numbers to validate runs & monitor them
        o I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs !
        o There would be a separate printout of the control numbers, kept in the operations files
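A minimal sketch of tips 6 and 8 together, assuming nothing about the tutorial's actual code: each stage is a small function doing one thing, and the runner records a control number (here simply the item count) after every stage. The names `collect`, `validate` and `run_pipeline`, and the record layout, are hypothetical stand-ins.

```python
def collect(user_ids):
    """Stand-in for the collect stage: one raw record per user id (no real API call)."""
    return [{"id": uid, "followers": [uid + 1, uid + 2]} for uid in user_ids]


def validate(records):
    """Keep only records that carry the fields later stages rely on."""
    return [r for r in records if "id" in r and "followers" in r]


def run_pipeline(data, stages):
    """Run the stages in order and note a control number (item count) after each one."""
    control = [("input", len(data))]
    for stage in stages:
        data = stage(data)
        control.append((stage.__name__, len(data)))
    return data, control


result, control_numbers = run_pipeline([1, 2, 3], [collect, validate])
print(control_numbers)
```

If a count drops unexpectedly between two stages, that is the cue to look at that stage before trusting the rest of the run.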
  7. Twitter Tips – A Baker’s Dozen
     9. Program defensively
        o More so for REST-based big data analytics systems
        o Expect failures at the transport layer & accommodate them
     10. Have Erlang-style supervisors in your pipeline
        o Fail fast & move on
        o Don’t linger and try to fix errors that cannot be controlled at that layer
        o A higher layer process will circle back and do incremental runs to correct missing spiders and crawls
        o Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
        o I have an example in part 2
     11. Data will never be perfect
        o Know your data & accommodate its idiosyncrasies
           • For example: 0 followers, protected users, 0 friends, …
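The "fail fast & move on" idea in tip 10 might look roughly like the sketch below. It is only an illustration under assumptions: `crawl_all` and `supervise` are hypothetical names, and `fetch` stands for whatever function makes the real API call.

```python
def crawl_all(user_ids, fetch):
    """Lower layer: try each id once; record failures instead of fixing them here."""
    done, failed = {}, []
    for uid in user_ids:
        try:
            done[uid] = fetch(uid)
        except Exception as exc:  # fail fast: note the failure and move on
            failed.append((uid, repr(exc)))
    return done, failed


def supervise(user_ids, fetch, max_passes=3):
    """Higher layer: circle back and re-run only the ids that are still missing."""
    results, pending = {}, list(user_ids)
    for _ in range(max_passes):
        if not pending:
            break
        done, failed = crawl_all(pending, fetch)
        results.update(done)
        pending = [uid for uid, _ in failed]
    return results, pending
```

Whatever is still in `pending` after the last pass is left for a later incremental run or for inspection; the lower layer never tries to repair it.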
  8. Twitter Tips – A Baker’s Dozen
     12. Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
        o See a MongoDB example in Part 2
     13. Don’t bombard the URL
        o Wait a few seconds between successive calls. This will end up with a scalable system, eventually
        o I found 10 seconds to be the sweet spot. 5 seconds gave retry errors. Was able to work with 5 seconds with wait & retry. Then, the rate limit started kicking in !
     14. Always measure the elapsed time of your API runs & processing
        o Kind of an early warning when something is wrong
     15. Develop incrementally; don’t fail to check “cut & paste” errors
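Tips 12 to 14 can be sketched in one small loop. This is an assumption-laden illustration, not the tutorial's code: the tutorial uses MongoDB as the command buffer, whereas this sketch uses a plain JSON file as a stand-in, and `run_with_checkpoint`, `call_api` and `checkpoint_path` are hypothetical names.

```python
import json
import os
import time


def run_with_checkpoint(pending, call_api, checkpoint_path, wait_seconds=10.0):
    """Process ids one by one, saving the remaining work after every call."""
    # Resume from an earlier run if a checkpoint file already exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            pending = json.load(f)
    results = []
    while pending:
        started = time.time()
        results.append(call_api(pending[0]))
        elapsed = time.time() - started
        pending = pending[1:]
        with open(checkpoint_path, "w") as f:
            json.dump(pending, f)  # checkpoint after every call
        print("call took %.3f s, %d left" % (elapsed, len(pending)))
        if pending:
            time.sleep(wait_seconds)  # don't bombard the URL
    return results
```

If the process dies midway, starting it again with the same `checkpoint_path` picks up the remaining ids instead of starting over.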
  9. Twitter Tips – A Baker’s Dozen
     16. The Twitter big data pipeline has lots of opportunities for parallelism
        o Leverage data parallelism frameworks like MapReduce
        o But first:
           § Prototype as a linear system,
           § Optimize and tweak the functional modules & cache strategies,
           § Note down stages and tasks that can be parallelized and
           § Then parallelize them
        o For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
     17. Pay attention to handoffs between stages
        o They might require transformation: for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
        o But resist the urge to overload collect with transform
        o i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents
        o Add transformation as a granular function, of course with appropriate buffering, caching, checkpoints & restart techniques
     18. Have a good log management system to capture and wade through logs
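The unroll/flatten handoff in tip 17 could be sketched as a separate granular function like the one below. The document layout (`user_id`, `follower_ids`) is an assumption made for illustration, not the schema used in the tutorial.

```python
def unroll(follower_docs):
    """Turn one document per crawled user (holding an array of follower ids)
    into one document per (user, follower) pair."""
    for doc in follower_docs:
        for follower_id in doc.get("follower_ids", []):
            yield {"user_id": doc["user_id"], "follower_id": follower_id}


# What the collect stage might have stored: arrays, exactly as they came back.
stored = [
    {"user_id": 1, "follower_ids": [10, 11]},
    {"user_id": 2, "follower_ids": []},  # 0 followers is normal data
]
flat = list(unroll(stored))
print(flat)
```

Keeping this as its own stage means collect can stay a dumb "store what came back" step, and the flattening can be re-run on its own if the model's needs change.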
  10. Twitter Tips – A Baker’s Dozen
     19. Understand the underlying network characteristics for the inference you want to make
        o Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
        o Twitter Network is more of an Interest Network
        o So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
        o But others, like Cliques and Bipartite Graphs, do
  11. Twitter Gripes
     1. Need richer APIs for #tags
        o Somewhat similar to users, viz. followers, friends et al
        o Might make sense to make #tags a top level object with its own semantics
     2. HTTP Error Return is not uniform
        o Returns 400 Bad Request instead of 420
        o Granted, there is enough information to figure this out
     3. Need an easier way to get screen_name from user_id
     4. “following” vs. “friends_count”, i.e. “following” is a dummy variable
        o There are a few like this, most probably for backward compatibility
     5. Parameter Validation is not uniform
        o Gives “404 Not found” instead of “406 Not Acceptable” or “413 Too Long” or “416 Range Unacceptable”
     6. Overall, more validation would help
        o Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
  12. A Fork
     • NLP, NLTK & deep into Tweets, Sentiment Analysis
     • Not enough time for both
     • I chose the Social Graph route
  13. A minute about Twitter as a platform & its evolution
     https://dev.twitter.com/blog/delivering-consistent-twitter-experience
     “The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers community.” (Chenda, CBS news)
     “.. we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you’re on Twitter.com or elsewhere on the web” (Michael)
     My Wish & Hope
     • I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
     • I did like the fact that tweets are part of LinkedIn. I still used Twitter more than LinkedIn
        o I don’t think showing Tweets in LinkedIn took anything away from the Twitter experience
        o LinkedIn experience & Twitter experience are different & distinct. Showing tweets in LinkedIn didn’t change that
     • I sincerely hope that the platform grows with a rich developer eco system
     • Orthogonally extensible platform is essential
     • Of course, along with a congruent user experience: “ … core Twitter consumption experience through consistent tools”
  14. Setup
     • For Hands on Today
        o Python 2.7.3
        o easy_install -v requests
           • http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
        o easy_install -v requests-oauth
        o Hands on programs at https://github.com/xsankar/oscon2012-handson
     • For advanced data science with social graphs
        o easy_install -v networkx
        o easy_install -v numpy
        o easy_install -v nltk
           • Not for this tutorial, but good for sentiment analysis et al
        o MongoDB
           • I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS
        o graphviz - http://www.graphviz.org/; easy_install pygraphviz
        o easy_install pydot
  15. Thanks To these Giants …
  16. Problem Domain For this tutorial
     • Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
        o Not for Twitter based apps for real time tweets
        o Not web sites with real time tweets
     • By looking at the domain in aggregate to derive inferences & actionable recommendations
     • Which also means, you need to be deliberate & systemic (i.e. not look at a fluctuation as a trend but dig deeper before pronouncing a trend)
  17. Agenda
     I. Mechanics: Twitter API (1:30 PM - 3:00 PM)
        o Essential Fundamentals (Rate Limit, HTTP Codes et al)
        o Objects
        o API
        o Hands-on (2:45 PM - 3:00 PM)
     II. Break (3:00 PM - 3:30 PM)
     III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
        o Underlying Concepts
        o Social Graph Analysis of @clouderati
           § Stages, Strategies & Tasks
           § Code Walk thru
  18. Open This First
  19. Twitter API: Read These First
     • Using Twitter Brand
        o New logo & associated guidelines: https://twitter.com/about/logos
        o Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
        o Developer Rules of the road: https://dev.twitter.com/terms/api-terms
     • Read These Links First
        1. https://dev.twitter.com/docs/things-every-developer-should-know
        2. https://dev.twitter.com/docs/faq
        3. Field Guide to Objects: https://dev.twitter.com/docs/platform-objects
        4. Security: https://dev.twitter.com/docs/security-best-practices
        5. Media Best Practices: https://dev.twitter.com/media
        6. Consolidated Page: https://dev.twitter.com/docs
        7. Streaming APIs: https://dev.twitter.com/docs/streaming-apis
        8. How to Appeal (Not that you all would need it !): https://support.twitter.com/articles/72585
     • Only One version of Twitter APIs
  20. API Status Page
     • https://dev.twitter.com/status
     • https://dev.twitter.com/issues
     • https://dev.twitter.com/discussions
  21. https://dev.twitter.com/status
     http://www.buzzfeed.com/tommywilhelm/google-users-being-total-dicks-about-the-twitter
  22. Open This First
     • Install pre-req as per the setup slide
     • Run
        o oscon2012_open_this_first.py
        o To test connectivity: “canary query”
     • Run
        o oscon2012_rate_limit_status.py
        o Use http://www.epochconverter.com to check reset_time
     • Formats: xml, json, atom & rss
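As an alternative to pasting the reset time into a web page, the same conversion can be done in a couple of lines of Python. This is a small sketch, not part of the hands-on programs; `reset_time_utc` is a hypothetical helper name, and the sample value is an x-ratelimit-reset value of the kind shown in the rate limit header slides.

```python
from datetime import datetime, timezone


def reset_time_utc(epoch_seconds):
    """Convert a rate limit reset value (seconds since the epoch) to a UTC datetime."""
    return datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)


# Header values arrive as strings, so the helper accepts either str or int.
print(reset_time_utc("1340467358").isoformat())  # 2012-06-23T16:02:38+00:00
```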
  23. Twitter API
     • Twitter REST: Core Data, Core Twitter Objects
        o Build Profile
        o Create/Post Tweets
        o Reply
        o Favorite, Re-tweet
        o Rate Limit: 150/350
     • Twitter Search: Search & Trend
        o Keywords
        o Specific User
        o Trends
        o Rate Limit: Complexity & Frequency
     • Streaming: Near-realtime, High Volume; follow users, topics, data mining
        o Public Streams
        o User Streams
        o Site Streams
        o Firehose
  24. Rate Limit
  25. Rate Limits
     • By API type & Authentication Mode

     API         No authC                 authC    Error
     REST        150/hr                   350/hr   400
     Search      Complexity & Frequency   -N/A-    420
     Streaming   Up to 1%
     Firehose    none                     none
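A quick bit of arithmetic on the REST limits above shows how far apart calls have to be, on average, to stay under the hourly quota. This is just a sketch; `min_seconds_between_calls` is a hypothetical helper name.

```python
def min_seconds_between_calls(calls_per_hour):
    """Average spacing (in seconds) that keeps a steady crawl within an hourly quota."""
    return 3600.0 / calls_per_hour


for label, limit in [("REST, no authC", 150), ("REST, authC", 350)]:
    print("%s: one call every %.1f s" % (label, min_seconds_between_calls(limit)))
```

At 350 calls per hour the spacing comes to a little over 10 seconds, which fits the observation elsewhere in this deck that a 10 second wait between calls worked while 5 seconds eventually ran into the rate limit.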
  26. Rate Limit Header
     {
       "status": "200 OK",
       "vary": "Accept-Encoding",
       "x-frame-options": "SAMEORIGIN",
       "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6",
       "x-ratelimit-class": "api",
       "x-ratelimit-limit": "150",
       "x-ratelimit-remaining": "149",
       "x-ratelimit-reset": "1340467358",
       "x-runtime": "0.04144",
       "x-transaction": "2b49ac31cf8709af",
       "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba"
     }
  27. Rate Limit-ed Header
     {
       "cache-control": "no-cache, max-age=300",
       "content-encoding": "gzip",
       "content-length": "150",
       "content-type": "application/json; charset=utf-8",
       "date": "Wed, 04 Jul 2012 00:48:25 GMT",
       "expires": "Wed, 04 Jul 2012 00:53:25 GMT",
       "server": "tfe",
       …
       "status": "400 Bad Request",
       "vary": "Accept-Encoding",
       "x-ratelimit-class": "api",
       "x-ratelimit-limit": "150",
       "x-ratelimit-remaining": "0",
       "x-ratelimit-reset": "1341363230",
       "x-runtime": "0.01126"
     }
  28. Rate Limit Example
     • Run
        o oscon2012_rate_limit_02.py
     • It iterates through a list to get followers
     • List is 2072 long
  29. Rate limit header: 150 calls, not authenticated
     {
       …
       "date": "Wed, 04 Jul 2012 00:54:16 GMT",
       "status": "200 OK",
       "vary": "Accept-Encoding",
       "x-frame-options": "SAMEORIGIN",
       "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8",
       "x-ratelimit-class": "api",
       "x-ratelimit-limit": "150",
       "x-ratelimit-remaining": "147",
       "x-ratelimit-reset": "1341366831",
       "x-runtime": "0.02768",
       "x-transaction": "f1bafd60112dddeb",
       "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
     }
     • Last time, it gave me 5 min. Now the reset timer is 1 hour
     • 150 calls, not authenticated
  30. And Rate Limit kicked in
     {
       "cache-control": "no-cache, max-age=300",
       "content-encoding": "gzip",
       "content-type": "application/json; charset=utf-8",
       "date": "Wed, 04 Jul 2012 00:55:04 GMT",
       …
       "status": "400 Bad Request",
       "transfer-encoding": "chunked",
       "vary": "Accept-Encoding",
       "x-ratelimit-class": "api",
       "x-ratelimit-limit": "150",
       "x-ratelimit-remaining": "0",
       "x-ratelimit-reset": "1341366831",
       "x-runtime": "0.01342"
     }
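Once x-ratelimit-remaining hits 0, the headers themselves say how long to back off. The sketch below shows one way to read them, assuming the response headers are available as a dict of strings as in the dumps in this deck; `seconds_until_reset` is a hypothetical helper, not one of the hands-on programs.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to sleep, or 0.0 if calls remain in the current window."""
    now = time.time() if now is None else now
    if int(headers.get("x-ratelimit-remaining", 1)) > 0:
        return 0.0
    reset_at = int(headers["x-ratelimit-reset"])
    return max(0.0, reset_at - now)


# Values from a rate-limited response; `now` is that response's date header
# (Wed, 04 Jul 2012 00:55:04 GMT) expressed as epoch seconds.
limited = {"x-ratelimit-remaining": "0", "x-ratelimit-reset": "1341366831"}
print(seconds_until_reset(limited, now=1341363304))
```

A crawler could call `time.sleep()` with the returned value before its next request, rather than retrying into a string of 400 responses.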
  31. API with OAuth
     {
       …
       "date": "Wed, 04 Jul 2012 01:32:01 GMT",
       "etag": ""dd419c02ed00fc6b2a825cc27wbe040"",
       "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
       "last-modified": "Wed, 04 Jul 2012 01:32:01 GMT",
       "pragma": "no-cache",
       "server": "tfe",
       …
       "status": "200 OK",
       "vary": "Accept-Encoding",
       "x-access-level": "read",
       "x-frame-options": "SAMEORIGIN",
       "x-mid": "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",
       "x-ratelimit-class": "api_identified",
       "x-ratelimit-limit": "350",
       "x-ratelimit-remaining": "349",
       "x-ratelimit-reset": "1341369121",
       "x-runtime": "0.05539",
       "x-transaction": "9f8508fe4c73a407",
       "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
     }
     • OAuth: “api-identified”
     • 1 hr reset
     • 350 calls
  32. {
        …
        "date": "Thu, 05 Jul 2012 14:56:05 GMT",
        …
        "x-ratelimit-class": "api_identified",
        "x-ratelimit-limit": "350",
        "x-ratelimit-remaining": "133",
        "x-ratelimit-reset": "1341500165",
        …
      }
      ******** 2416
      {
        …
        "date": "Thu, 05 Jul 2012 14:56:18 GMT",
        …
        "status": "200 OK",
        …
        "x-ratelimit-class": "api_identified",
        "x-ratelimit-limit": "350",
        "x-ratelimit-remaining": "349",
        "x-ratelimit-reset": "1341503776",
      ******** 2417
      • The rate limit resets between consecutive calls
      • The reset time moves forward by 1 hour
  33. Unexplained Errors
      Traceback (most recent call last):
        File "oscon2012_get_user_info_01.py", line 39, in <module>
          r = client.get(url, params=payload)
        File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
        File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
        File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
      requests.exceptions.ConnectionError: HTTPSConnectionPool(host=api.twitter.com, port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C221203257%2C15254297%2C44614426%2C617136931%2C415810340%2C76071717%2C17351462%2C574253%2C35048243%2C388547381%2C254329657%2C65585979%2C253580293%2C392741693%2C126403390%2C300467007%2C8962882%2C21545799%2C15254346%2C141083469%2C340312913%2C44614485%2C600359770%2C17351519%2C38323042%2C21545828%2C86557546%2C90751854%2C128500592%2C115917681%2C42517364%2C34128760%2C15254397%2C453559166%2C92849025%2C600359811%2C17351556%2C8962952%2C296038349%2C325503810%2C122209166%2C123827693%2C59294611%2C19448725%2C21545881%2C17351581%2C130468677%2C80266144%2C15254434%2C84680859%2C65586084%2C19448741%2C15254438%2C214483879%2C48808878%2C88654768%2C15474846%2C48808887%2C334021563%2C60214090%2C134792126%2C15254464%2C558416833%2C138986435%2C264815556%2C63488965%2C17222476%2C537445328%2C97854214%2C255598755%2C65586132%2C36226009%2C187220954%2C257346383%2C15254493%2C554222558%2C302564320%2C59165520%2C44614626%2C76071907%2C80266213%2C325503825%2C403227628%2C20368210%2C17351666%2C88654836%2C340313077%2C151569400%2C302564345%2C118014971%2C11060222%2C233229141%2C13727232%2C199803906%2C220435108%2C268531201
      • While trying to get details of 1,000,000 users, I get this error - usually 10-6 AM PST
      • Got around by "Trap & wait 5 seconds"
      • Night runs are relatively error free
  34. A Day in the Life of Twitter Rate Limit
      {
        …
        "date": "Fri, 06 Jul 2012 03:41:09 GMT",
        "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
        "server": "tfe",
        "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
        "status": "400 Bad Request",
        "vary": "Accept-Encoding",
        "x-ratelimit-class": "api_identified",
        "x-ratelimit-limit": "350",
        "x-ratelimit-remaining": "0",
        "x-ratelimit-reset": "1341546334",
        "x-runtime": "0.01918"
      }
      • Missed by 4 min!
      Error, sleeping
      {
        …
        "date": "Fri, 06 Jul 2012 03:46:12 GMT",
        …
        "status": "200 OK",
        …
        "x-ratelimit-class": "api_identified",
        "x-ratelimit-limit": "350",
        "x-ratelimit-remaining": "349",
        …
      • OK after 5 min sleep
  35. Strategies
      I have no exotic strategies, so far !
      1. Obvious: track elapsed time & sleep when the rate limit kicks in
      2. Combine authenticated & non-authenticated calls
      3. Use multiple API types
      4. Cache
      5. Store & get only what is needed
      6. Checkpoint & buffer request commands
      7. Distributed data parallelism - for example AWS instances
      http://www.epochconverter.com/ <- useful to debug the timer
      Please share your tips and tricks for conserving the rate limit
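Strategy 1 above can be sketched in a few lines. This is a minimal sketch, assuming the response headers are available as a dict carrying the `x-ratelimit-remaining` and `x-ratelimit-reset` keys seen in the header dumps on the earlier slides. The helper name `seconds_until_reset` and the 5-second padding are illustrative assumptions, not part of any Twitter library.

```python
import time

def seconds_until_reset(headers, now=None, padding=5):
    """Return how many seconds to sleep before the next call (0 if calls remain).

    headers: dict of response headers (the x-ratelimit-* values are strings).
    now:     current time in epoch seconds; defaults to time.time().
    padding: extra seconds added after the reset time (an assumption, so the
             client does not wake up a moment too early).
    """
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0
    reset_at = int(headers["x-ratelimit-reset"])
    if now is None:
        now = time.time()
    return max(0, reset_at - int(now)) + padding

# Header values copied from the "400 Bad Request" dump on the rate-limit slide.
limited = {"x-ratelimit-remaining": "0", "x-ratelimit-reset": "1341366831"}
wait = seconds_until_reset(limited, now=1341366531)  # pretend it is 300 s before reset
print(wait)  # a real client would call time.sleep(wait) here
```

Passing `now` explicitly keeps the function easy to test; in a crawler the default clock would be used and the result handed to `time.sleep`.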
  36. Authentication
  37. Authentication
      • Three modes
        o Anonymous
        o HTTP Basic Auth
        o OAuth
      • As of Aug 31, 2010, only Anonymous or OAuth are supported
      • OAuth enables the user to authorize an application without sharing credentials
      • Also has the ability to revoke
      • Twitter supports OAuth 1.0a
      • OAuth 2.0 is the new standard, much simpler
        o No timeframe for Twitter support, yet
  38. OAuth Pragmatics
      • Helpful links
        o https://dev.twitter.com/docs/auth/oauth
        o https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
        o https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
        o http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
      • Discussion of OAuth internal mechanisms is better left for another day
      • For headless applications to get an OAuth token, go to https://dev.twitter.com/apps
      • Create an application & get four credential pieces
        o Consumer Key, Consumer Secret, Access Token & Access Token Secret
      • All the frameworks have support for OAuth. So plug in these values & use the framework's calls
      • I used the requests-oauth library like so:
  39. requests-oauth
      import requests
      from oauth_hook import OAuthHook  # provided by the requests-oauth package

      def get_oauth_client():
          # Get the client using the token, key & secret from dev.twitter.com/apps
          consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
          consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
          access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
          access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
          header_auth = True
          oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key, consumer_secret, header_auth)
          client = requests.session(hooks={'pre_request': oauth_hook})
          return client

      def get_followers(user_id):
          # Anonymous call: uses requests directly
          url = 'https://api.twitter.com/1/followers/ids.json'
          payload = {"user_id": user_id}  # if cursor is needed {"cursor":-1,"user_id":scr_name}
          r = requests.get(url, params=payload)
          return r

      def get_followers_with_oauth(user_id, client):
          # Authenticated call: use the client instead of requests
          url = 'https://api.twitter.com/1/followers/ids.json'
          payload = {"user_id": user_id}  # if cursor is needed {"cursor":-1,"user_id":scr_name}
          r = client.get(url, params=payload)
          return r

      Ref: http://pypi.python.org/pypi/requests-oauth
  40. OAuth Authorize Screen
      • The user authenticates with Twitter & grants access to Forbes Social
      • Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account
  41. HTTP  Status   Codes
  42. HTTP Status Codes
      • 0 Never made it to Twitter servers - library error
      • 200 OK
      • 304 Not Modified
      • 400 Bad Request
        o Check error message for explanation
        o REST Rate Limit !
      • 401 Unauthorized
        o Beware - you could get this for other reasons as well
      • 403 Forbidden
        o Hit Update Limit (> max Tweets/day, following too many people)
      • 404 Not Found
      • 406 Not Acceptable
      • 413 Too Long
      • 416 Range Unacceptable
      • 420 Enhance Your Calm
        o Rate Limited
      • 500 Internal Server Error
      • 502 Bad Gateway
        o Down for maintenance
      • 503 Service Unavailable
        o Overloaded "Fail whale"
      • 504 Gateway Timeout
        o Overloaded
      https://dev.twitter.com/docs/error-codes-responses
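One way to act on the status code list is to map each code to a coarse next step. The sketch below is only an illustration: the action names and the `next_action` helper are assumptions, and treating 400 as a rate-limit signal only when `x-ratelimit-remaining` is 0 is one possible reading of the "REST Rate Limit" note.

```python
# Status codes grouped as on the slide; the action names are illustrative.
SERVER_SIDE = {500, 502, 503, 504}            # Twitter is down or overloaded
CLIENT_SIDE = {401, 403, 404, 406, 413, 416}  # something is wrong with the request

def next_action(status, remaining=None):
    """Suggest what a crawler could do after a response.

    status:    HTTP status code (0 means the call never reached Twitter).
    remaining: value of x-ratelimit-remaining as an int, if known.
    """
    if status in (200, 304):
        return "proceed"
    if status == 0:
        return "retry"                 # library or network error
    if status == 420:
        return "wait_for_reset"        # explicitly rate limited
    if status == 400:
        # 400 doubles as the REST rate-limit signal; check the header to tell.
        return "wait_for_reset" if remaining == 0 else "fix_request"
    if status in SERVER_SIDE:
        return "retry_later"
    if status in CLIENT_SIDE:
        return "fix_request"
    return "inspect"                   # anything not on the slide

for code in (200, 400, 420, 503):
    print(code, next_action(code, remaining=0))
```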
  43. HTTP Status Code - Example
      {
        "cache-control": "no-cache, max-age=300",
        "content-encoding": "gzip",
        "content-length": "91",
        "content-type": "application/json; charset=utf-8",
        "date": "Sat, 23 Jun 2012 00:06:56 GMT",
        "expires": "Sat, 23 Jun 2012 00:11:56 GMT",
        "server": "tfe",
        …
        "status": "401 Unauthorized",
        "vary": "Accept-Encoding",
        "www-authenticate": "OAuth realm=\"https://api.twitter.com\"",
        "x-ratelimit-class": "api",
        "x-ratelimit-limit": "0",
        "x-ratelimit-remaining": "0",
        "x-ratelimit-reset": "1340413616",
        "x-runtime": "0.01997"
      }
      {
        "errors": [
          {
            "code": 53,
            "message": "Basic authentication is not supported"
          }
        ]
      }
      • Detailed error message in JSON ! I like this
  44. HTTP Status Code - Confusing Example
      • GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
      {
        …
        "pragma": "no-cache",
        "server": "tfe",
        …
        "status": "404 Not Found",
        …
      }
      {
        "errors": [
          {
            "code": 34,
            "message": "Sorry, that page does not exist"
          }
        ]
      }
      • Spelling mistake
        o Should be screen_name
      • But confusing error !
      • Should be 406 Not Acceptable or 413 Too Long, showing parameter error
  45. HTTP Status Code - Example
      {
        "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
        "content-encoding": "gzip",
        "content-length": "112",
        "content-type": "application/json;charset=utf-8",
        "date": "Sat, 23 Jun 2012 01:23:47 GMT",
        "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
        …
        "status": "401 Unauthorized",
        "www-authenticate": "OAuth realm=\"https://api.twitter.com\"",
        "x-frame-options": "SAMEORIGIN",
        "x-ratelimit-class": "api",
        "x-ratelimit-limit": "150",
        "x-ratelimit-remaining": "147",
        "x-ratelimit-reset": "1340417742",
        "x-transaction": "d545a806f9c72b98"
      }
      {
        "error": "Not authorized",
        "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
      }
      • Sometimes, the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12
      • Clearly a parameter error (i.e. more parameters)
  46. Objects
  47. Twitter Platform Objects
      [Diagram: Users follow and are followed by other Users (Friends, Followers); Users post Status Updates (Tweets); Tweets embed Entities (@ user_mentions, urls, media, # hashtags) and Places; Tweets are temporally ordered into a TimeLine]
      https://dev.twitter.com/docs/platform-objects
  48. Tweets
      • A.k.a. Status Updates
      • Interesting fields
        o coordinates <- geo location
        o created_at
        o entities (will see later)
        o id, id_str
        o possibly_sensitive
        o user (will see later)
          • perspectival attributes embedded within a child object of an unlike parent - hard to maintain at scale
          • https://dev.twitter.com/docs/faq#6981
        o withheld_in_countries
          • https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
      https://dev.twitter.com/docs/platform-objects/tweets
  49. A word about id, id_str
      • June 1, 2010
        o Snowflake, the id generator service
        o "The full ID is composed of a timestamp, a worker number, and a sequence number"
        o JavaScript had problems handling numbers > 53 bits, so each id is also sent as a string
        o "id":819797
        o "id_str":"819797"
      http://engineering.twitter.com/2010/06/announcing-snowflake.html
      https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
      https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
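The 53-bit problem can be reproduced in Python by forcing an id through a 64-bit float, which is how a JavaScript Number stores it. This is a small illustration; the id value below is hypothetical, chosen as the first integer that a double cannot hold exactly.

```python
import json

# Hypothetical tweet payload; 9007199254740993 is 2**53 + 1.
payload = '{"id": 9007199254740993, "id_str": "9007199254740993"}'
tweet = json.loads(payload)

# Python integers are arbitrary precision, so "id" survives parsing here.
exact_id = tweet["id"]

# A JavaScript Number is a 64-bit float; converting shows what it would keep.
as_double = int(float(tweet["id"]))

# id_str is plain text, so it round-trips safely in any language.
from_string = int(tweet["id_str"])

print(exact_id, as_double, from_string)
print("precision lost:", exact_id != as_double)
```

Keying stored tweets and users on `id_str` avoids the issue whenever the data may pass through a JavaScript client.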
  50. Tweets - example
      • Let us run oscon2012-tweets.py
      • Example of tweet
        o coordinates
        o id
        o id_str
  51. Users
      • followers_count
      • geo_enabled
      • id, id_str
      • name, screen_name
      • protected
      • status, statuses_count
      • withheld_in_countries
      https://dev.twitter.com/docs/platform-objects/users
  52. Users - Let us run some examples
      • Run
        o oscon_2012_users.py
          • Lookup users by screen_name
        o oscon12_first_20_ids.py
          • Lookup users by user_id
      • Inspect the results
        o id, name, status, statuses_count, protected, followers (for top 10 followers), withheld users
      • Can use the information for customizing the user's screen in your web app
  53. Entities
      • Metadata & contextual information
      • You could parse these out of the tweet text yourself, but Entities deliver them already parsed, as structured data
      • REST API/Search API - include_entities=1
      • Streaming API - included by default
      • hashtags, media, urls, user_mentions
      https://dev.twitter.com/docs/platform-objects/entities
      https://dev.twitter.com/docs/tweet-entities
      https://dev.twitter.com/docs/tco-url-wrapper
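Reading entities from a parsed tweet is plain dictionary access. The sketch below assumes the tweet was requested with entities included and has already been decoded from JSON; the sample tweet and the `summarize_entities` helper are hypothetical, while the field names (`hashtags`, `urls`, `user_mentions`, `text`, `screen_name`, `expanded_url`, `indices`) follow the entity structure of the REST API.

```python
def summarize_entities(tweet):
    """Pull hashtags, URLs and mentions out of a tweet's entities block."""
    entities = tweet.get("entities", {})
    return {
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        # Prefer the expanded URL; fall back to the t.co wrapper if it is absent.
        "urls": [u.get("expanded_url") or u["url"] for u in entities.get("urls", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
    }

# Hypothetical tweet, shaped like a REST API response with entities included.
sample_tweet = {
    "text": "Hello @twitterapi #oscon http://t.co/abc",
    "entities": {
        "hashtags": [{"text": "oscon", "indices": [18, 24]}],
        "urls": [{"url": "http://t.co/abc",
                  "expanded_url": "http://example.com/",
                  "indices": [25, 40]}],
        "user_mentions": [{"screen_name": "twitterapi", "indices": [6, 17]}],
    },
}

summary = summarize_entities(sample_tweet)
print(summary)
```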
  54. Entities •  Run     o  oscon2012_entities.py  •  Inspect  hashtags,  urls  et  al    
  55. Places
      • attributes
      • bounding_box
      • id (as a string!)
      • country
      • name
      https://dev.twitter.com/docs/platform-objects/places
      https://dev.twitter.com/docs/about-geo-place-attributes
  56. Places
      • Can search for tweets near a place like so:
      • Get the lat/long of the convention center [45.52929,-122.66289]
        o Tweets near that place
      • Tweets near San Jose [37.395715,-122.102308]
      • We will not go further here, but this is very useful
  57. Timelines
      • Collections of tweets ordered by time
      • Use max_id & since_id for navigation
      https://dev.twitter.com/docs/working-with-timelines
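Navigation with max_id and since_id can be tried out without touching the network. The sketch below uses a stand-in `fetch_page` function over a list of hypothetical tweet ids instead of a real timeline call. It assumes the semantics described in Twitter's "working with timelines" guide: max_id is inclusive, since_id is exclusive, and results come newest first.

```python
def fetch_page(all_ids, count, max_id=None, since_id=None):
    """Stand-in for a timeline request (hypothetical, no network access)."""
    ids = sorted(all_ids, reverse=True)           # newest first
    if max_id is not None:
        ids = [i for i in ids if i <= max_id]     # max_id is inclusive
    if since_id is not None:
        ids = [i for i in ids if i > since_id]    # since_id is exclusive
    return ids[:count]

def walk_timeline(all_ids, count, since_id=None):
    """Page backwards through a timeline until no more tweets come back."""
    collected = []
    max_id = None
    while True:
        page = fetch_page(all_ids, count, max_id=max_id, since_id=since_id)
        if not page:
            break
        collected.extend(page)
        # Step just below the oldest id seen so it is not fetched twice.
        max_id = min(page) - 1
    return collected

timeline = [101, 105, 110, 118, 120, 131, 140]    # hypothetical tweet ids
everything = walk_timeline(timeline, count=3)
only_new = walk_timeline(timeline, count=3, since_id=110)
print(everything)
print(only_new)
```

On a later run, passing the highest id already stored as `since_id` fetches only the tweets that arrived since, which also conserves the rate limit.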
  58. Other  Objects  &  APIs •  Lists  •  Notifications  •  Friendships/exists  to  see  if  one  follows   the  other  
  59. Twitter Platform Objects
      [Diagram: same platform objects diagram as slide 47]
      https://dev.twitter.com/docs/platform-objects
  60. Hands-on Exercise (15 min)
      • Setup environment - slide #14
      • Sanity check environment & libraries
        o oscon2012_open_this_first.py
        o oscon2012_rate_limit_status.py
      • Get objects (show calls)
        o Lookup users by screen_name - oscon12_users.py
        o Lookup users by id - oscon12_first_20_ids.py
        o Lookup tweets - oscon12_tweets.py
        o Get entities - oscon12_entities.py
      • Inspect the results
      • Explore a little bit
      • Discussion
  61. Twitter APIs
  62. Twitter API
      • Twitter REST
        o Core data, core Twitter objects
        o Build profile, create/post tweets, reply, favorite, re-tweet
        o Rate limit: 150/350
      • Twitter Search
        o Search & trend
        o Keywords, specific user, trends
        o Rate limit: complexity & frequency
      • Streaming
        o Near-realtime, high volume
        o Follow users, topics, data mining
        o Public streams, user streams, site streams, firehose
  63. Twitter REST API
      • https://dev.twitter.com/docs/api
      • What we have been doing so far uses the REST API
      • Request-response
      • Anonymous or OAuth
      • Rate limited:
        o 150/350
  64. Twitter Trends
      • oscon2012-trends.py
      • Trends/weekly, Trends/monthly
      • Let us run some examples
        o oscon2012_trends_daily.py
        o oscon2012_trends_weekly.py
      • Trends & hashtags
        o #hashtag euro2012
        o http://hashtags.org/euro2012
        o http://sproutsocial.com/insights/2011/08/twitter-hashtags/
        o http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
        o Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/
  65. Brand Rank w/ Twitter
      • Walk through & results of the following
        o oscon2012_brand_01.py
      • Followed 10 user-brands for a few days to find growth
      • Brand Rank
        o Growth of a brand w.r.t. the industry
        o Surge in popularity - could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
      • API: url=https://api.twitter.com/1/users/lookup.json
      • payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
  66. Brand Rank w/ Twitter
      • Clouderati is very stable
  67. Brand Rank w/ Twitter - Tech Brands
      • Google I/O showed a spike on 6/27-6/28
      • OReillyMedia shares some spike
      • Looking at a few days' worth of data, our best inference is that "oscon doesn't track with googleio"
      • "Clouderati doesn't track at all"
  68. Brand Rank w/ Twitter - World of Soccer
      • FOXSoccer, UEFAcom track each other
      • The numbers seldom decrease. So calculating -ve velocity will not work
      • OTOH, if you see a -ve velocity, investigate
  69. Brand Rank w/ Twitter - World of Basketball
      • NBA, MiamiHeat, okcthunder track each other
      • Used % rather than absolute numbers to compare
      • The hike on 7/6 to 7/10 is interesting
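Comparing brands by percentage instead of absolute follower counts can be sketched as below. The follower numbers and brand names are hypothetical placeholders, not the data collected for these slides, and `percent_growth` is an illustrative helper.

```python
def percent_growth(counts):
    """Growth of each daily follower count relative to the first day, in percent."""
    base = counts[0]
    return [round(100.0 * (c - base) / base, 2) for c in counts]

# Hypothetical daily followers_count samples for two accounts of very different size.
followers = {
    "big_brand":   [5000000, 5010000, 5025000],
    "small_brand": [20000, 20100, 20300],
}

growth = {name: percent_growth(series) for name, series in followers.items()}
for name, series in growth.items():
    print(name, series)
```

In absolute terms the larger account gains far more followers, but on a percentage basis the smaller one grows faster, which is why the percentage view makes brands of different sizes comparable.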
  70. Brand Rank w/ Twitter - Rising Tide …
      • For some reason, all numbers are going up 7/6 thru 7/10 - except for clouderati!
      • Is a rising (Twitter) tide lifting all (well, almost all) ?
  71. Trivia  :  Search  API •  Search(search.twitter.com)   o  Built  by  Summize  which  was  acquired  by  Twitter  in   2008   o  Summize  described  itself  as  “sentiment  mining”  
  72. Search API
      • Very simple
        o GET http://search.twitter.com/search.json?q=<blah>
      • Based on search criteria
      • "The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
      • Recent = last 6-9 days' worth of tweets
      • Anonymous call
      • Rate limit
        o Not no. of calls/hour, but complexity & frequency
      https://dev.twitter.com/docs/using-search
      https://dev.twitter.com/docs/api/1/get/search
  73. Search API
      • Filters
        o Search URL encoded
        o @ = %40, # = %23
        o emoticons :) and :(
        o http://search.twitter.com/search.atom?q=sometimes+%3A)
        o http://search.twitter.com/search.atom?q=sometimes+%3A(
      • Location filters, date filters
      • Content searches
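The encodings listed above can be produced with Python's standard library rather than by hand. A small sketch: `quote` and `quote_plus` are real `urllib.parse` functions, while keeping the parentheses unescaped via `safe="()"` is a choice made here only to match the URLs shown on the slide.

```python
from urllib.parse import quote, quote_plus

# Individual characters, as listed on the slide.
at_sign = quote("@", safe="")      # "%40"
hash_sign = quote("#", safe="")    # "%23"

# A full query: spaces become "+", ":" becomes "%3A"; parentheses are left
# as-is here so the result matches the example URL.
query = quote_plus("sometimes :)", safe="()")
url = "http://search.twitter.com/search.atom?q=" + query

print(at_sign, hash_sign)
print(url)
```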
  74. Streaming API
      • Not request-response, but a stream
      • Twitter frameworks have the support
      • Rate limit: up to 1%
      • Stall warning if the client is falling behind
      • Good documentation links
        o https://dev.twitter.com/docs/streaming-apis/connecting
        o https://dev.twitter.com/docs/streaming-apis/parameters
        o https://dev.twitter.com/docs/streaming-apis/processing
  75. Firehose •  ~  400  million  public  tweets/day  •  If  you  are  working  with  Twitter  firehose,  I  envy  you  !  •  If  you  hit  real  limits,  then  explore  the  firehose  route  •  AFAIK,  it  is  not  cheap,  but  worth  it  
  76. API Best Practices
      1. Use JSON
      2. Use user_id rather than screen_name
        o user_id is constant while screen_name can change
      3. max_id and since_id
        o For example, with direct messages, if you already have the last message, use since_id to fetch only newer ones
        o max_id sets how far back to go
      4. Cache as much as you can
      5. Set the User-Agent header for debugging
      I have listed a few good blogs that have API best practices in the reference section at the end of this presentation. These are gathered from various books, blogs & other media I used for this tutorial. See References (at the end) for the sources
  77. Twitter API
      [Diagram: same REST / Search / Streaming overview as slide 62]
      Questions ?
  78. Part II SNA: Twitter Network Analysis
  79. 1. Collect
      2. Store
      3. Transform & Analyze
      4. Model & Reason
      5. Predict, Recommend & Visualize
      • Validate the dataset & re-crawl/refresh
      • Tip 1: Implement as a staged pipeline, never a monolith
      • Tip 3: Keep the schema simple; don't be afraid to transform
      • Most important & the ugliest slide in this deck !
  80. Trivia
      • Social Network Analysis originated as Sociometry & the social network was called a sociogram
      • Back then, Facebook was called SocioBinder!
      • Jacob Levy Moreno is considered the originator
        o NYTimes, April 3, 1933, P. 17
  81. Twitter Networks - Definitions
      • Nodes
        o Users
        o #tags
      • Edges
        o Follows
        o Friends
        o @mentions
        o #tags
      • Directed
  82. Twitter Networks - Definitions
      • In-degree
        o Followers
      • Out-degree
        o Friends/Follow
      • Centrality measures
      • Hubs & Authorities
        o Hubs/Directories tell us where Authorities are
        o "Of Mortals & Celebrities" is more "Twitter-style"
  83. Twitter Networks - Properties
      • Concepts from citation networks
        o Cocitation
          • Common papers that cite a paper
          • Common followers
            o C & G (followed by F & H)
        o Bibliographic coupling
          • Cite the same papers
          • Common friends (i.e. follow the same person)
            o D, E, F & H
      [Figure: follower graph with nodes A through N]
  84. Twitter Networks - Properties
      • Concepts from citation networks
        o Cocitation
          • Common papers that cite a paper
          • Common followers
            o C & G (followed by F & H)
        o Bibliographic coupling
          • Cite the same papers
          • Common friends (i.e. follow the same person)
            o D, E, F & H follow C
            o H & F follow C & G
              • So H & F have high coupling
              • Hence, if H follows A, we can recommend F to follow A
      [Figure: follower graph with nodes A through N]
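The cocitation and coupling ideas translate directly into set operations. The sketch below encodes only the edges named on the slide (D, E, F & H follow C; F & H follow G) plus the slide's "if H follows A" case; the helper names and the `min_shared` threshold of 2 are illustrative assumptions.

```python
# follower -> set of accounts they follow (their "friends")
follows = {
    "D": {"C"},
    "E": {"C"},
    "F": {"C", "G"},
    "H": {"C", "G", "A"},   # the slide's "if H follows A" case
}

def common_friends(a, b):
    """Bibliographic coupling: accounts that both a and b follow."""
    return follows.get(a, set()) & follows.get(b, set())

def common_followers(x, y):
    """Cocitation: users who follow both x and y."""
    return {u for u, friends in follows.items() if x in friends and y in friends}

def recommend(user, min_shared=2):
    """Suggest accounts followed by users who are strongly coupled with `user`."""
    mine = follows.get(user, set())
    suggestions = set()
    for other, theirs in follows.items():
        if other != user and len(mine & theirs) >= min_shared:
            suggestions |= theirs - mine
    return suggestions - {user}

print(common_followers("C", "G"))
print(common_friends("F", "H"))
print(recommend("F"))
```

D and E share only one friend with H, so under this threshold they receive no recommendation; lowering `min_shared` trades precision for coverage.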
  85. Twitter Networks - Properties
      • Bipartite/Affiliation networks
        o Two disjoint subsets
        o The bipartite concept is very relevant to the Twitter social graph
        o Membership in Lists
          • lists vs. users bipartite graph
        o Common #tags in Tweets
          • #tags vs. members bipartite graph
        o @mention together
          • ? Can this be a bipartite graph
          • ? How would we fold this ?
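Folding a bipartite graph means projecting it onto one of its two node sets. As a rough sketch, a #tags vs. users graph can be folded into a weighted user-user graph where the weight counts shared #tags. The tag names and users below are hypothetical, and `fold` is an illustrative helper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical affiliation data: #tag -> users who tweeted with it.
tag_users = {
    "#oscon":  {"alice", "bob", "carol"},
    "#python": {"alice", "bob"},
    "#cloud":  {"carol", "dave"},
}

def fold(memberships):
    """Project a group -> members mapping onto members.

    Returns a Counter keyed by (user_a, user_b) pairs in sorted order; the
    value is the number of groups the two users share.
    """
    weights = Counter()
    for members in memberships.values():
        for u, v in combinations(sorted(members), 2):
            weights[(u, v)] += 1
    return weights

user_graph = fold(tag_users)
for pair, weight in sorted(user_graph.items()):
    print(pair, weight)
```

The same function folds a lists vs. users graph; swapping the roles of keys and values first would project onto the #tags instead.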
  86. Other Metrics & Mechanisms
      • Kronecker graph models
        o The Kronecker product is a way of generating self-similar matrices
        o Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
        o Application: generating models for analysis, prediction, anomaly detection et al
      • Erdős-Rényi random graphs
        o Easy to build a G(n,p) graph
        o Assumes equal likelihood of an edge between any two nodes
        o In a Twitter social network, we can create a more realistic expected distribution (adding the "social reality" dimension) by inspecting the #tags & @mentions
      • Network diameter
      • Weak ties
      • Follower velocity (+ve & -ve), association strength
        o Unfollow is not a reliable measure
        o But an interesting property to investigate when it happens
      Not covered here, but potential for an encore !
      Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
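A G(n,p) graph is indeed easy to build: consider every pair of nodes once and keep the edge with probability p. A minimal standard-library sketch follows; the function name `gnp_graph` and the seed parameter are illustrative choices.

```python
import random
from itertools import combinations

def gnp_graph(n, p, seed=None):
    """Return the edge list of an undirected Erdős-Rényi G(n,p) random graph.

    Every one of the n*(n-1)/2 possible edges is included independently
    with the same probability p.
    """
    rng = random.Random(seed)   # a private generator keeps runs reproducible
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

edges = gnp_graph(10, 0.3, seed=42)
print(len(edges), "edges out of", 10 * 9 // 2, "possible")
```

Comparing an observed follower graph against such a baseline shows how far the real network departs from the "equal likelihood" assumption.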
  87. Twitter Networks - Properties
      • Twitter != LinkedIn, Twitter != Facebook
      • Twitter Network == Interest Network
      • Be cognizant of the above when you apply traditional network properties to Twitter
      • For example,
        o Six degrees of separation doesn't make sense (most of the time) in Twitter - except maybe for cliques
        o Is diameter a reliable measure for a Twitter network ?
          • Probably not
        o Do cut sets make sense ?
          • Probably not
        o But citation network principles do apply; we can learn from cliques
        o Bipartite graphs do make sense
  88. Cliques (1 of 2)
      • "Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other"
      • Cohesive subgroup, closely connected
      • Near-cliques rather than a perfect clique (k-plex, i.e. each member is connected to at least n-k others)
      • k-plex cliques help discover subgroups in a sparse network; the 1-plex is the perfect clique
      Ref: Networks, An Introduction - Newman
  89. Cliques (2 of 2)
      • k-core - at least k others in the subset; (n-k)-plex
      • k-clique - no more than k distance away
        o Path inside or outside the subset
        o k-clan or k-club (path inside the subset)
      • We will apply k-plex cliques for one of our hands-on exercises
      Ref: Networks, An Introduction - Newman
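The k-plex definition is straightforward to check for a candidate subset: with n members, each must be connected to at least n-k of the others. The sketch below assumes an undirected graph stored as an adjacency dict; for Twitter that could mean keeping only mutual follows, which is an assumption made here. The graph and the `is_k_plex` helper are hypothetical.

```python
def is_k_plex(adj, subset, k):
    """True if every member of `subset` is adjacent to at least n-k other members."""
    members = set(subset)
    n = len(members)
    return all(
        len(adj.get(v, set()) & (members - {v})) >= n - k
        for v in members
    )

# Hypothetical undirected graph: every pair is connected except b-d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

print(is_k_plex(adj, {"a", "b", "c"}, 1))       # triangle: a perfect clique
print(is_k_plex(adj, {"a", "b", "c", "d"}, 1))  # b-d edge is missing
print(is_k_plex(adj, {"a", "b", "c", "d"}, 2))  # tolerated as a 2-plex
```

This only verifies a given subset; enumerating all maximal k-plexes is a much harder search problem and is not attempted here.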
  90. Sentiment Analysis
      • Sentiment analysis is an important & interesting area of work on the Twitter platform
        o Collect tweets
        o Opinion estimation - pass thru classifier, sentiment lexicons
          • Naïve Bayes/Max Entropy Class/SVM
        o Aggregated text sentiment/moving average
      • I chose not to dive deeper because of time constraints
        o Couldn't do justice to API, Social Network and Sentiment Analysis, all in 3 hrs
      • Next 3 slides have a couple of interesting examples
  91. Sentiment Analysis
      • Twitter mining for airline sentiment
      • Opinion lexicon - +ve 2000, -ve 4800
      http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
      http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
  92. Need I say more ?
      "A bit of clever math can uncover interesting patterns that are not visible to the human eye"
      http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
      http://www.relevantdata.com/pdfs/IUStudy.pdf
  93. Project  Ideas  
  94. Interesting Vectors of Exploration
      1. Find trending #tags & then related #tags - using cliques over co-#tag-citation, which infers topics related to trending topics
      2. Related #tag topics over a set of tweets by a user or group of users
      3. Analysis - In/Out flow, Tweet flow
        - Frequent @mention
      4. Find affiliation networks by List memberships, #tags or frequent @mentions
  95. Interesting Vectors of Exploration
      5. Use centrality measures to determine mortals vs. celebrities
      6. Classify Tweet networks/cliques based on message passing characteristics
        - Tweets vs. Retweets, no. of retweets, …
      7. Retweet network
        - Measure influence by retweet count & frequency
        - Information contagion by looking at different retweet network subcomponents - who, when, how much, …
  96. Twitter Network Graph Analysis - An Example
  97. Analysis Story Board
      • @clouderati is a popular cloud related Twitter account
      • Goals:
        o Analyze the social graph characteristics of the users who are following the account
          • Dig one level deep, to the followers & friends, of the followers of @clouderati
        o How many cliques ? How strong are they ?
        o Does the @mention support the clique inferences ?
        o What are the retweet characteristics ?
        o What does the #tag network graph look like ?
      [Slide margin labels: "In this tutorial" beside the first goals; "For you to explore !!" beside the last goals]
