The Art of Social Media Analysis with Twitter & Python
Slides for my tutorial at OSCON 2012 http://goo.gl/fpxVE

Transcript

  • 1. The Art of Social Media Analysis with Twitter & Python krishna sankar @ksankar http://www.oscon.com/oscon2012/public/schedule/detail/23130
  • 2. House Rules (1 of 2)
o Doesn't assume any knowledge of the Twitter API
o Goal: everybody on the same page & a working knowledge of the Twitter API
o To bootstrap your exploration into Social Network Analysis & Twitter analytics
o Simple programs, to illustrate usage & data manipulation
(Pipeline diagram: Intro – API, Objects, …; Twitter Network Analysis Pipeline – we will analyze @clouderati, 2072 followers, exploding to ~980,000 distinct users one level down; NLP, NLTK, Sentiment Analysis; @mention cliques, social network graph; Retweet analytics, growth, #tag network, information contagion, weak ties)
  • 3. House Rules (2 of 2)
o Am using the requests library
o There are good Twitter frameworks for Python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help
o Many areas to explore – not enough time. So decided to focus on the social graph, cliques & networkx
(Same pipeline diagram as the previous slide)
  • 4. About Me
• Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com
o Co-chair – 2012 IEEE Precision Time Synchronization • http://www.ispcs.org/2012/index.html
o Blog: http://doubleclix.wordpress.com/
o Quora: http://www.quora.com/Krishna-Sankar
• Prior Gigs
o Lead Architect (Egnyte)
o Distinguished Engineer (CSCO)
o Employee #64439 (CSCO) to #39 (Egnyte) & now #9!
• Current Focus:
o Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
o Big Data (more of variety, variability, context & graphs, than volume or velocity – so far!)
o Overlay-based semantic search & ranking
• Other related Presentations
o http://goo.gl/P1rhc Big Data Engineering Top 10 Pragmatics (Summary)
o http://goo.gl/0SQDV The Art of Big Data (Detailed)
o http://goo.gl/EaUKH The Hitchhiker's Guide to Kaggle, OSCON 2011 Tutorial
  • 5. Twitter Tips – A Baker's Dozen
1. Twitter APIs are (more or less) congruent & symmetric
2. Twitter is usually right & simple – recheck when you get unexpected results before blaming Twitter
o I was getting numbers when I was expecting screen_names in user objects
o Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong – screen_name instead of user_id
o Always test with one or two records before a long run! – learned the hard way
3. Twitter APIs are very powerful – consistent use can bear huge data
o In a week, you can pull in 4–5 million users & some tweets!
o Night runs are far faster & error-free
4. Use a NOSQL data store as a command buffer & data buffer
o Would make it easy to work with Twitter at scale
o I use MongoDB
o Keep the schema simple & no fancy transformation
• And as far as possible the same as the (json) response
o Use the NOSQL CLI for trimming records et al
  • 6. Twitter Tips – A Baker's Dozen
5. Always use a big data pipeline
o Collect – Store – Transform & Analyze – Model & Reason – Predict, Recommend & Visualize
o That way you can orthogonally extend, with functional components like command buffers, validation et al
6. Use a functional approach for a scalable pipeline
o Compose your big data pipeline with well-defined granular functions, each doing only one thing
o Don't overload the functional components (i.e. no collect, unroll & store as a single component)
o Have well-defined functional components with appropriate caching, buffering, checkpoints & restart techniques
• This did create some trouble for me, as we will see later
7. Crawl–Store–Validate–Recrawl–Refresh cycle
o The equivalent of the traditional ETL
o The validation stage & validation routines are important
• Cannot expect perfect runs
• Cannot manually look at data either, when data is at scale
8. Have control numbers to validate runs & monitor them
o I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs!
o There will be a separate printout of the control numbers that will be kept in the operations files
  • 7. Twitter Tips – A Baker's Dozen
9. Program defensively
o More so for REST-based big data analytics systems
o Expect failures at the transport layer & accommodate for them
10. Have Erlang-style supervisors in your pipeline
o Fail fast & move on
o Don't linger and try to fix errors that cannot be controlled at that layer
o A higher-layer process will circle back and do incremental runs to correct missing spiders and crawls
o Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
o I have an example in part 2
11. Data will never be perfect
o Know your data & accommodate for its idiosyncrasies
• For example: 0 followers, protected users, 0 friends, …
  • 8. Twitter Tips – A Baker's Dozen
12. Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
o See a MongoDB example in Part 2
13. Don't bombard the URL
o Wait a few seconds between successive calls. You will end up with a scalable system, eventually
o I found 10 seconds to be the sweet spot. 5 seconds gave a retry error. Was able to work with 5 seconds with wait & retry. Then, the rate limit started kicking in!
14. Always measure the elapsed time of your API runs & processing
o Kind of an early warning when something is wrong
15. Develop incrementally; don't fail to check "cut & paste" errors
  • 9. Twitter Tips – A Baker's Dozen
16. The Twitter big data pipeline has lots of opportunities for parallelism
o Leverage data parallelism frameworks like MapReduce
o But first:
§ Prototype as a linear system,
§ Optimize and tweak the functional modules & cache strategies,
§ Note down stages and tasks that can be parallelized and
§ Then parallelize them
o For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
17. Pay attention to handoffs between stages
o They might require transformation – for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
o But resist the urge to overload collect with transform
o i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents
o Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18. Have a good log management system to capture and wade through logs
  • 10. Twitter Tips – A Baker's Dozen
19. Understand the underlying network characteristics for the inference you want to make
o Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
o The Twitter Network is more of an Interest Network
o So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
o But others, like Cliques and Bipartite Graphs, do
  • 11. Twitter Gripes
1. Need more rich APIs for #tags
o Somewhat similar to users, viz. followers, friends et al
o Might make sense to make #tags a top-level object with its own semantics
2. HTTP error returns are not uniform
o Returns 400 Bad Request instead of 420
o Granted, there is enough information to figure this out
3. Need an easier way to get screen_name from user_id
4. "following" vs. "friends_count" – i.e. "following" is a dummy variable
o There are a few like this, most probably for backward compatibility
5. Parameter validation is not uniform
o Gives "404 Not Found" instead of "406 Not Acceptable" or "413 Too Long" or "416 Range Unacceptable"
6. Overall, more validation would help
o Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
  • 12. A Fork
o NLP, NLTK & a deep dive into Tweets
o Sentiment Analysis
• Not enough time for both
• I chose the Social Graph route
  • 13. A minute about Twitter as a platform & its evolution
https://dev.twitter.com/blog/delivering-consistent-twitter-experience
"The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers community." – Chenda, CBS News
".. we want to make sure that the Twitter experience is straightforward and easy to understand – whether you're on Twitter.com or elsewhere on the web" – Michael
My Wish & Hope
• I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
• I did like the fact that tweets are part of LinkedIn. I still used Twitter more than LinkedIn
o I don't think showing Tweets in LinkedIn took anything away from the Twitter experience
o The LinkedIn experience & Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that
• I sincerely hope that the platform grows with a rich developer ecosystem
• An orthogonally extensible platform is essential
• Of course, along with a congruent user experience – "… core Twitter consumption experience through consistent tools"
  • 14. Setup
• For hands-on today
o Python 2.7.3
o easy_install -v requests
• http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
o easy_install -v requests-oauth
o Hands-on programs at https://github.com/xsankar/oscon2012-handson
• For advanced data science with social graphs
o easy_install -v networkx
o easy_install -v numpy
o easy_install -v nltk
• Not for this tutorial, but good for sentiment analysis et al
o MongoDB
• I used MongoDB on AWS m2.xlarge, RAID 10, 8 X 15 GB EBS
o graphviz – http://www.graphviz.org/; easy_install pygraphviz
o easy_install pydot
  • 15. Thanks To these Giants …
  • 16. Problem Domain for this tutorial
• Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
o Not for Twitter-based apps for real-time tweets
o Not web sites with real-time tweets
• By looking at the domain in aggregate to derive inferences & actionable recommendations
• Which also means you need to be deliberate & systematic (i.e. not look at a fluctuation as a trend, but dig deeper before pronouncing a trend)
  • 17. Agenda
I. Mechanics: Twitter API (1:30 PM – 3:00 PM)
o Essential Fundamentals (Rate Limit, HTTP Codes et al)
o Objects
o API
o Hands-on (2:45 PM – 3:00 PM)
II. Break (3:00 PM – 3:30 PM)
III. Twitter Social Graph Analysis (3:30 PM – 5:00 PM)
o Underlying Concepts
o Social Graph Analysis of @clouderati
§ Stages, Strategies & Tasks
§ Code Walk-thru
  • 18. Open This First
  • 19. Twitter API: Read These First
• Using the Twitter Brand
o New logo & associated guidelines: https://twitter.com/about/logos
o Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
o Developer Rules of the Road: https://dev.twitter.com/terms/api-terms
• Read These Links First
1. https://dev.twitter.com/docs/things-every-developer-should-know
2. https://dev.twitter.com/docs/faq
3. Field Guide to Objects: https://dev.twitter.com/docs/platform-objects
4. Security: https://dev.twitter.com/docs/security-best-practices
5. Media Best Practices: https://dev.twitter.com/media
6. Consolidated Page: https://dev.twitter.com/docs
7. Streaming APIs: https://dev.twitter.com/docs/streaming-apis
8. How to Appeal (not that you all would need it!): https://support.twitter.com/articles/72585
• Only one version of the Twitter APIs
  • 20. API Status Page
• https://dev.twitter.com/status
• https://dev.twitter.com/issues
• https://dev.twitter.com/discussions
  • 21. https://dev.twitter.com/status
http://www.buzzfeed.com/tommywilhelm/google-users-being-total-dicks-about-the-twitter
  • 22. Open This First
• Install pre-reqs as per the setup slide
• Run
o oscon2012_open_this_first.py
o To test connectivity – a "canary query" (see the sketch below)
• Run
o oscon2012_rate_limit_status.py
o Use http://www.epochconverter.com to check reset_time
• Formats: xml, json, atom & rss
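A minimal version of that canary query, assuming the API v1 rate_limit_status endpoint and its remaining_hits / reset_time_in_seconds fields (a sketch, not the actual oscon2012_rate_limit_status.py):

import json
import requests

# Canary query: rate_limit_status (API v1) verifies connectivity and,
# handily, does not count against the rate limit itself.
r = requests.get("https://api.twitter.com/1/account/rate_limit_status.json")
print("HTTP status:", r.status_code)
status = json.loads(r.text)
print("Remaining hits:", status.get("remaining_hits"))
print("Reset (epoch):", status.get("reset_time_in_seconds"))  # decode via epochconverter.com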
  • 23. Twitter API
o REST – Core Data, Core Twitter Objects: Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet. Rate Limit: 150/350
o Search – Search & Trend: Keywords, Specific User, Trends. Rate Limit: Complexity & Frequency
o Streaming – Near-realtime, High Volume: Public Streams, User Streams, Site Streams
o Firehose – Follow users, topics, data mining
  • 24. Rate Limit
  • 25. Rate Limits
• By API type & Authentication Mode

API       | No auth                | Auth   | Error
REST      | 150/hr                 | 350/hr | 400
Search    | Complexity & Frequency | -N/A-  | 420
Streaming | Up to 1% of Firehose   | none   | none
  • 26. Rate Limit Header
{
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "149",
  "x-ratelimit-reset": "1340467358",
  "x-runtime": "0.04144",
  "x-transaction": "2b49ac31cf8709af",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba"
}
  • 27. Rate Limit-ed Header
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "150",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:48:25 GMT",
  "expires": "Wed, 04 Jul 2012 00:53:25 GMT",
  "server": "tfe",
  …
  "status": "400 Bad Request",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341363230",
  "x-runtime": "0.01126"
}
  • 28. Rate Limit Example
• Run
o oscon2012_rate_limit_02.py
• It iterates through a list to get followers
• The list is 2072 long
  • 29.
{
  …
  "date": "Wed, 04 Jul 2012 00:54:16 GMT",
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.02768",
  "x-transaction": "f1bafd60112dddeb",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}
Note: last time, it gave me 5 min. Now the reset timer is 1 hour. 150 calls, not authenticated.
  • 30.
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:55:04 GMT",
  …
  "status": "400 Bad Request",
  "transfer-encoding": "chunked",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.01342"
}
And the Rate Limit kicked in.
  • 31. API with OAuth
{
  …
  "date": "Wed, 04 Jul 2012 01:32:01 GMT",
  "etag": ""dd419c02ed00fc6b2a825cc27wbe040"",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  "last-modified": "Wed, 04 Jul 2012 01:32:01 GMT",
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-access-level": "read",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  "x-ratelimit-reset": "1341369121",
  "x-runtime": "0.05539",
  "x-transaction": "9f8508fe4c73a407",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}
Note: OAuth – "api_identified", 1 hr reset, 350 calls.
  • 32.
{
  …
  "date": "Thu, 05 Jul 2012 14:56:05 GMT",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "133",
  "x-ratelimit-reset": "1341500165",
  …
}
******** 2416
{
  …
  "date": "Thu, 05 Jul 2012 14:56:18 GMT",
  …
  "status": "200 OK",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  "x-ratelimit-reset": "1341503776",
  …
}
******** 2417
Note: the Rate Limit resets (+1 hour) during consecutive calls.
  • 33. Unexplained Errors
Traceback (most recent call last):
  File "oscon2012_get_user_info_01.py", line 39, in <module>
    r = client.get(url, params=payload)
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
  File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host=api.twitter.com, port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C221203257%2C… (one batch of 100 user_ids)
Notes:
o While trying to get details of 1,000,000 users, I get this error – usually 10–6 AM PST
o Got around it by "trap & wait 5 seconds"
o Night runs are relatively error-free
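That "trap & wait 5 seconds" workaround can be as simple as the sketch below (illustrative only; retry_get is my name for it, and the 5-second delay is the one mentioned in the tips):

import time
import requests

def retry_get(client, url, payload, max_retries=5):
    # Trap transport-level errors & wait 5 seconds before retrying,
    # per the "trap & wait 5 seconds" workaround above.
    for attempt in range(max_retries):
        try:
            return client.get(url, params=payload)
        except requests.exceptions.ConnectionError:
            print("ConnectionError, sleeping 5 seconds (attempt %d)" % (attempt + 1))
            time.sleep(5)
    raise RuntimeError("Giving up after %d retries: %s" % (max_retries, url))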
  • 34. A Day in the Life of the Twitter Rate Limit
{
  …
  "date": "Fri, 06 Jul 2012 03:41:09 GMT",
  "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
  "server": "tfe",
  "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
  "status": "400 Bad Request",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341546334",
  "x-runtime": "0.01918"
}
Missed by 4 min!
Error, sleeping
{
  …
  "date": "Fri, 06 Jul 2012 03:46:12 GMT",
  …
  "status": "200 OK",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  …
}
OK after a 5 min sleep.
  • 35. Strategies
I have no exotic strategies, so far!
1. Obvious: track elapsed time & sleep when the rate limit kicks in (see the sketch below)
2. Combine authenticated & non-authenticated calls
3. Use multiple API types
4. Cache
5. Store & get only what is needed
6. Checkpoint & buffer request commands
7. Distributed data parallelism – for example, AWS instances
http://www.epochconverter.com/ <- useful to debug the timer
Please share your tips and tricks for conserving the Rate Limit
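Strategy 1 falls out of the x-ratelimit-* headers shown on the previous slides (a minimal sketch; wait_if_rate_limited is my name, and it assumes the API v1 header names above):

import time
import requests

def wait_if_rate_limited(response):
    # Strategy 1: when x-ratelimit-remaining hits 0, sleep until the
    # epoch time in x-ratelimit-reset (plus a small safety margin).
    remaining = int(response.headers.get("x-ratelimit-remaining", "1"))
    reset_at = int(response.headers.get("x-ratelimit-reset", "0"))
    if remaining == 0:
        sleep_for = max(reset_at - time.time(), 0) + 5
        print("Rate limited; sleeping %.0f seconds" % sleep_for)
        time.sleep(sleep_for)

r = requests.get("https://api.twitter.com/1/followers/ids.json",
                 params={"screen_name": "clouderati"})
wait_if_rate_limited(r)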
  • 36. Authentication
  • 37. Authentication
• Three modes
o Anonymous
o HTTP Basic Auth
o OAuth
• As of Aug 31, 2010, only Anonymous or OAuth are supported
• OAuth enables the user to authorize an application without sharing credentials
• Also has the ability to revoke
• Twitter supports OAuth 1.0a
• OAuth 2.0 is the new standard, much simpler
o No timeframe for Twitter support, yet
  • 38. OAuth Pragmatics
• Helpful Links
o https://dev.twitter.com/docs/auth/oauth
o https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
o https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
o http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
• A discussion of OAuth internal mechanisms is better left for another day
• For headless applications to get an OAuth token, go to https://dev.twitter.com/apps
• Create an application & get the four credential pieces
o Consumer Key, Consumer Secret, Access Token & Access Token Secret
• All the frameworks have support for OAuth. So plug in these values & use the framework's calls
• I used the requests-oauth library like so:
  • 39. requests-oauth

# Get a client using the token, key & secret from dev.twitter.com/apps
def get_oauth_client():
    consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
    consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
    access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
    access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
    header_auth = True
    oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key, consumer_secret, header_auth)
    client = requests.session(hooks={'pre_request': oauth_hook})
    return client

def get_followers(user_id):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}  # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = requests.get(url, params=payload)

# Use the client instead of requests
def get_followers_with_oauth(user_id, client):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}
    r = client.get(url, params=payload)

Ref: http://pypi.python.org/pypi/requests-oauth
  • 40. OAuth Authorize Screen
• The user authenticates with Twitter & grants access to Forbes Social
• Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account
  • 41. HTTP Status Codes
  • 42. HTTP Status Codes
• 0 Never made it to Twitter servers – library error
• 200 OK
• 304 Not Modified
• 400 Bad Request
o Check the error message for an explanation
o REST Rate Limit!
• 401 Unauthorized
o Beware – you could get this for other reasons as well
• 403 Forbidden
o Hit update limit (> max Tweets/day, following too many people)
• 404 Not Found
• 406 Not Acceptable
• 413 Too Long
• 416 Range Unacceptable
• 420 Enhance Your Calm
o Rate Limited
• 500 Internal Server Error
• 502 Bad Gateway
o Down for maintenance
• 503 Service Unavailable
o Overloaded – "Fail whale"
• 504 Gateway Timeout
o Overloaded
https://dev.twitter.com/docs/error-codes-responses
  • 43. HTTP Status Code – Example
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "91",
  "content-type": "application/json; charset=utf-8",
  "date": "Sat, 23 Jun 2012 00:06:56 GMT",
  "expires": "Sat, 23 Jun 2012 00:11:56 GMT",
  "server": "tfe",
  …
  "status": "401 Unauthorized",
  "vary": "Accept-Encoding",
  "www-authenticate": "OAuth realm="https://api.twitter.com"",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "0",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1340413616",
  "x-runtime": "0.01997"
}
{
  "errors": [
    {
      "code": 53,
      "message": "Basic authentication is not supported"
    }
  ]
}
Note: a detailed error message in JSON! I like this.
  • 44. HTTP Status Code – Confusing Example
{
  …
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "404 Not Found",
  …
}
{
  "errors": [
    {
      "code": 34,
      "message": "Sorry, that page does not exist"
    }
  ]
}
• GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
• Spelling mistake
o Should be screen_name
• But a confusing error!
• Should be 406 Not Acceptable or 413 Too Long, showing a parameter error
  • 45. HTTP Status Code – Example
{
  "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
  "content-encoding": "gzip",
  "content-length": "112",
  "content-type": "application/json;charset=utf-8",
  "date": "Sat, 23 Jun 2012 01:23:47 GMT",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  …
  "status": "401 Unauthorized",
  "www-authenticate": "OAuth realm="https://api.twitter.com"",
  "x-frame-options": "SAMEORIGIN",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1340417742",
  "x-transaction": "d545a806f9c72b98"
}
{
  "error": "Not authorized",
  "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
}
Note: sometimes the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12 – clearly a parameter error (i.e. more parameters than allowed).
  • 46. Objects
  • 47. Twitter Platform Objects
o Users – Followers (are followed by), Friends (follow)
o Tweets (Status Updates) – embed Entities: @ user_mentions, urls, media, # hashtags
o TimeLine – temporally ordered Tweets
o Places
https://dev.twitter.com/docs/platform-objects
  • 48. Tweets
• A.k.a Status Updates
• Interesting fields
o coordinates <- geo location
o created_at
o entities (will see later)
o id, id_str
o possibly_sensitive
o user (will see later)
• "perspectival attributes embedded within a child object of an unlike parent" – hard to maintain at scale
• https://dev.twitter.com/docs/faq#6981
o withheld_in_countries
• https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
https://dev.twitter.com/docs/platform-objects/tweets
  • 49. A word about id, id_str
• June 1, 2010
o Snowflake, the id generator service
o "The full ID is composed of a timestamp, a worker number, and a sequence number"
o Had problems with JavaScript handling numbers > 53 bits
o "id": 819797
o "id_str": "819797"
http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
  • 50. Tweets – example
• Let us run oscon2012-tweets.py
• Example of a tweet
o coordinates
o id
o id_str
  • 51. Users
• followers_count
• geo_enabled
• id, id_str
• name, screen_name
• protected
• status, statuses_count
• withheld_in_countries
https://dev.twitter.com/docs/platform-objects/users
  • 52. Users – let us run some examples
• Run
o oscon_2012_users.py – lookup users by screen_name
o oscon12_first_20_ids.py – lookup users by user_id
• Inspect the results
o id, name, status, statuses_count, protected, followers (for top 10 followers), withheld users
• Can use the information for customizing the user's screen in your web app
  • 53. Entities
• Metadata & contextual information
• You could parse these out of the tweet text yourself, but Entities hand them to you as structured data
• REST API/Search API – include_entities=1
• Streaming API – included by default
• hashtags, media, urls, user_mentions
https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
  • 54. Entities
• Run
o oscon2012_entities.py
• Inspect hashtags, urls et al (a sketch of the idea follows)
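A sketch of what such an entities call looks like (illustrative, not the actual oscon2012_entities.py; the tweet id is hypothetical, and include_entities=1 follows the previous slide):

import json
import requests

# Fetch a single tweet (API v1) with entities included, then read the
# structured hashtags & urls instead of parsing the text.
url = "https://api.twitter.com/1/statuses/show.json"
payload = {"id": "21947795900469248", "include_entities": 1}  # hypothetical tweet id
tweet = json.loads(requests.get(url, params=payload).text)
for tag in tweet["entities"]["hashtags"]:
    print("#" + tag["text"])
for u in tweet["entities"]["urls"]:
    print(u["expanded_url"])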
  • 55. Places
• attributes
• bounding_box
• id (as a string!)
• country
• name
https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes
  • 56. Places
• Can search for tweets near a place like so (sketch below):
• Get the lat/long of the convention center [45.52929, -122.66289]
o Tweets near that place
• Tweets near San Jose [37.395715, -122.102308]
• We will not go further here. But very useful
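A sketch of such a geo search against the Search API (illustrative; the geocode parameter format lat,long,radius is per the Search API of that era, and the query term is my choice):

import json
import requests

# Search for recent tweets within 1 mile of the convention center.
url = "http://search.twitter.com/search.json"
payload = {"q": "oscon", "geocode": "45.52929,-122.66289,1mi"}
results = json.loads(requests.get(url, params=payload).text)
for t in results.get("results", []):
    print(t["from_user"], ":", t["text"])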
  • 57. Timelines
• Collections of tweets, ordered by time
• Use max_id & since_id for navigation (sketch below)
https://dev.twitter.com/docs/working-with-timelines
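A minimal sketch of walking a timeline backwards with max_id (illustrative; assumes the API v1 user_timeline endpoint and a public, unprotected user):

import json
import requests

# Page backwards through a timeline: each page's max_id is
# (lowest id seen so far) - 1, per the working-with-timelines doc.
url = "https://api.twitter.com/1/statuses/user_timeline.json"
payload = {"screen_name": "clouderati", "count": 200}
for page in range(3):
    tweets = json.loads(requests.get(url, params=payload).text)
    if not tweets:
        break
    print("page %d: %d tweets" % (page, len(tweets)))
    payload["max_id"] = min(t["id"] for t in tweets) - 1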
  • 58. Other Objects & APIs
• Lists
• Notifications
• friendships/exists – to see if one user follows the other
  • 59. Twitter Platform Objects (recap of the earlier diagram)
o Users – Followers (are followed by), Friends (follow)
o Tweets (Status Updates) – embed Entities: @ user_mentions, urls, media, # hashtags
o TimeLine – temporally ordered Tweets
o Places
https://dev.twitter.com/docs/platform-objects
  • 60. Hands-on Exercise (15 min)
• Setup environment – slide #14
• Sanity check environment & libraries
o oscon2012_open_this_first.py
o oscon2012_rate_limit_status.py
• Get objects (show calls)
o Lookup users by screen_name – oscon12_users.py
o Lookup users by id – oscon12_first_20_ids.py
o Lookup tweets – oscon12_tweets.py
o Get entities – oscon12_entities.py
• Inspect the results
• Explore a little bit
• Discussion
  • 61. Twitter APIs
  • 62. Twitter API (recap)
o REST – Core Data, Core Twitter Objects: Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet. Rate Limit: 150/350
o Search – Search & Trend: Keywords, Specific User, Trends. Rate Limit: Complexity & Frequency
o Streaming – Near-realtime, High Volume: Public Streams, User Streams, Site Streams
o Firehose – Follow users, topics, data mining
  • 63. Twitter REST API
• https://dev.twitter.com/docs/api
• What we were doing was the REST API
• Request-Response
• Anonymous or OAuth
• Rate Limited: 150/350
  • 64. Twitter Trends
• oscon2012-trends.py
• Trends/weekly, Trends/monthly
• Let us run some examples
o oscon2012_trends_daily.py
o oscon2012_trends_weekly.py
• Trends & hashtags
o #hashtag euro2012
o http://hashtags.org/euro2012
o http://sproutsocial.com/insights/2011/08/twitter-hashtags/
o http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
o Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/
  • 65. Brand Rank w/ Twitter
• Walk-through & results of the following
o oscon2012_brand_01.py
• Followed 10 user-brands for a few days to find growth (sketch of the calculation below)
• Brand Rank
o Growth of a brand w.r.t. the industry
o A surge in popularity could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
• API: url = https://api.twitter.com/1/users/lookup.json
• payload = {"screen_name": "miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
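The growth calculation itself is simple (a sketch; the snapshot numbers are made up, but the percent-change-per-day idea matches the "% rather than absolute numbers" point a few slides on):

# Hypothetical daily followers_count snapshots from users/lookup.json runs.
snapshots = {
    "oscon":      [11200, 11350, 11600],
    "clouderati": [2070, 2071, 2072],
}

# Compare brands by percent growth per day, not absolute follower counts.
for brand, counts in snapshots.items():
    pct = [100.0 * (b - a) / a for a, b in zip(counts, counts[1:])]
    print(brand, ["%.2f%%" % p for p in pct])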
  • 66. Brand Rank w/ Twitter
(Chart) Clouderati is very stable
  • 67. Brand Rank w/ Twitter – Tech Brands
• Google I/O showed a spike on 6/27–6/28
• OReillyMedia shares some spike
• Looking at a few days' worth of data, our best inference is that "oscon doesn't track with googleio"
• "Clouderati doesn't track at all"
  • 68. Brand Rank w/ Twitter – World of Soccer
• FOXSoccer & UEFAcom track each other
• The numbers seldom decrease. So calculating -ve velocity will not work
• OTOH, if you see a -ve velocity, investigate
  • 69. Brand Rank w/ Twitter – World of Basketball
• NBA, MiamiHeat & okcthunder track each other
• Used % rather than absolute numbers to compare
• The hike from 7/6 to 7/10 is interesting
  • 70. Brand Rank w/ Twitter – Rising Tide …
• For some reason, all numbers are going up 7/6 thru 7/10 – except for clouderati!
• Is a rising (Twitter) tide lifting all (well, almost all)?
  • 71. Trivia: Search API
• Search (search.twitter.com)
o Built by Summize, which was acquired by Twitter in 2008
o Summize described itself as "sentiment mining"
  • 72. Search API
• Very simple
o GET http://search.twitter.com/search.json?q=<blah>
• Based on a search criterion
• "The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
• Recent = the last 6–9 days' worth of tweets
• Anonymous call
• Rate Limit
o Not no. of calls/hour, but complexity & frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search
  • 73. Search API
• Filters
o Search is URL encoded
o @ = %40, # = %23
o Emoticons :) and :(
o http://search.twitter.com/search.atom?q=sometimes+%3A)
o http://search.twitter.com/search.atom?q=sometimes+%3A(
• Location filters, date filters
• Content searches
(A minimal search call is sketched below)
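A minimal search sketch (illustrative; requests does the URL encoding of @, # and emoticons for you when you pass params; rpp was the Search API's results-per-page parameter):

import json
import requests

# Search for a hashtag; requests URL-encodes "#oscon" to "%23oscon".
url = "http://search.twitter.com/search.json"
r = requests.get(url, params={"q": "#oscon", "rpp": 10})
for t in json.loads(r.text).get("results", []):
    print(t["from_user"], ":", t["text"])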
  • 74. Streaming API
• Not request-response, but a stream
• Twitter frameworks have the support
• Rate Limit: up to 1%
• Stall warning if the client is falling behind
• Good documentation links
o https://dev.twitter.com/docs/streaming-apis/connecting
o https://dev.twitter.com/docs/streaming-apis/parameters
o https://dev.twitter.com/docs/streaming-apis/processing
  • 75. Firehose
• ~400 million public tweets/day
• If you are working with the Twitter firehose, I envy you!
• If you hit real limits, then explore the firehose route
• AFAIK, it is not cheap, but worth it
  • 76. API Best Practices
1. Use JSON
2. Use user_id rather than screen_name
o user_id is constant, while screen_name can change
3. max_id and since_id
o For example, for direct messages: if you have the last message, use since_id for search
o max_id: how far to go back
4. Cache as much as you can
5. Set the User-Agent header for debugging
I have listed a few good blogs that have API best practices in the reference section at the end of this presentation. These are gathered from various books, blogs & other media I used for this tutorial. See References (at the end) for the sources.
  • 77. Twitter API (recap) – Questions?
o REST – Core Data, Core Twitter Objects: Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet. Rate Limit: 150/350
o Search – Search & Trend: Keywords, Specific User, Trends. Rate Limit: Complexity & Frequency
o Streaming – Near-realtime, High Volume: Public Streams, User Streams, Site Streams
o Firehose – Follow users, topics, data mining
  • 78. Part II: Twitter Network Analysis (SNA)
  • 79. The Pipeline – the most important & ugliest slide in this deck!
1. Collect -> 2. Store -> 3. Transform & Analyze -> 4. Model & Reason -> 5. Predict, Recommend & Visualize
o Tip 1: Implement as a staged pipeline, never a monolith
o Tip 3: Keep a simple schema; don't be afraid to transform; validate & re-crawl/refresh the dataset
  • 80. Trivia
• Social Network Analysis originated as Sociometry & the social network was called a sociogram
• Back then, Facebook was called SocioBinder!
• Jacob Levy Moreno is considered the originator
o NYTimes, April 3, 1933, p. 17
  • 81. Twitter Networks – Definitions
• Nodes
o Users
o #tags
• Edges
o Follows
o Friends
o @mentions
o #tags
• Directed
  • 82. Twitter Networks – Definitions
• In-degree
o Followers
• Out-degree
o Friends/Follow
• Centrality measures
• Hubs & Authorities
o Hubs/directories tell us where authorities are
o "Of Mortals & Celebrities" is more "Twitter-style"
  • 83. Twitter Networks – Properties
• Concepts from Citation Networks
o Cocitation
• Common papers that cite a paper
• Common Followers
o C & G (followed by F & H)
o Bibliographic Coupling
• Cite the same papers
• Common Friends (i.e. follow the same person)
o D, E, F & H
(Diagram: an example follow graph over nodes A–N)
  • 84. Twitter Networks – Properties
• Concepts from Citation Networks
o Cocitation
• Common papers that cite a paper
• Common Followers
o C & G (followed by F & H)
o Bibliographic Coupling
• Cite the same papers
• Common Friends (i.e. follow the same person)
o D, E, F & H follow C
o H & F follow C & G
• So H & F have high coupling
• Hence, if H follows A, we can recommend that F follow A
(Same example graph as the previous slide)
  • 85. Twitter Networks – Properties
• Bipartite/Affiliation Networks
o Two disjoint subsets
o The bipartite concept is very relevant to the Twitter social graph
o Membership in Lists
• lists vs. users bipartite graph
o Common #tags in Tweets
• #tags vs. members bipartite graph
o @mentioned together
• ? Can this be a bipartite graph
• ? How would we fold this? (see the sketch below)
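On "folding": networkx can project a bipartite graph onto one of its node sets, which is one way to answer the question above (a minimal sketch with made-up users and #tags):

import networkx as nx
from networkx.algorithms import bipartite

# Bipartite graph: users on one side, #tags on the other;
# an edge means the user tweeted that #tag.
B = nx.Graph()
B.add_nodes_from(["alice", "bob", "carol"], bipartite=0)
B.add_nodes_from(["#oscon", "#cloud"], bipartite=1)
B.add_edges_from([("alice", "#oscon"), ("bob", "#oscon"),
                  ("bob", "#cloud"), ("carol", "#cloud")])

# "Fold" onto the user side: users become connected if they share a #tag.
users = bipartite.projected_graph(B, ["alice", "bob", "carol"])
print(users.edges())  # e.g. [('alice', 'bob'), ('bob', 'carol')]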
  • 86. Other Metrics & Mechanisms
• Kronecker Graph Models
o The Kronecker product is a way of generating self-similar matrices
o Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
o Application: generating models for analysis, prediction, anomaly detection et al
• Erdős–Rényi Random Graphs
o Easy to build a G(n,p) graph
o Assumes equal likelihood of edges between two nodes
o In a Twitter social network, we can create a more realistic expected distribution (adding the "social reality" dimension) by inspecting the #tags & @mentions
• Network Diameter
• Weak Ties
• Follower velocity (+ve & -ve), association strength
o Unfollow is not a reliable measure
o But an interesting property to investigate when it happens
Not covered here, but potential for an encore!
Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
  • 87. Twitter Networks – Properties
• Twitter != LinkedIn, Twitter != Facebook
• Twitter Network == Interest Network
• Be cognizant of the above when you apply traditional network properties to Twitter
• For example,
o Six degrees of separation doesn't make sense (most of the time) in Twitter – except maybe for Cliques
o Is diameter a reliable measure for a Twitter Network?
• Probably not
o Do cut sets make sense?
• Probably not
o But citation network principles do apply; we can learn from cliques
o Bipartite graphs do make sense
  • 88. Cliques (1 of 2)
• "Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other"
• Cohesive subgroup, closely connected
• Near-cliques rather than a perfect clique (k-plex, i.e. connected to at least n-k others)
• A k-plex clique can discover subgroups in a sparse network; a 1-plex being the perfect clique
Ref: Networks, An Introduction – Newman
  • 89. Cliques (2 of 2)
• k-core – at least k others in the subset; an (n-k)-plex
• k-clique – no more than k distance away
o Path inside or outside the subset
o k-clan or k-club (path inside the subset)
• We will apply k-plex cliques for one of our hands-on exercises
Ref: Networks, An Introduction – Newman
  • 90. Sentiment Analysis
• Sentiment Analysis is important & interesting work on the Twitter platform
o Collect tweets
o Opinion estimation – pass thru a classifier & sentiment lexicons
• Naïve Bayes/Max Entropy Classifier/SVM
o Aggregated text sentiment/moving average
• I chose not to dive deeper because of time constraints
o Couldn't do justice to API, Social Network and Sentiment Analysis, all in 3 hrs
• The next 3 slides have a couple of interesting examples
  • 91. Sentiment Analysis
• Twitter Mining for Airline Sentiment
• Opinion Lexicon: +ve 2000, -ve 4800
http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
  • 92. Need I say more?
"A bit of clever math can uncover interesting patterns that are not visible to the human eye"
http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf
  • 93. Project Ideas
  • 94. Interesting Vectors of Exploration
1. Find trending #tags & then related #tags – using cliques over co-#tag-citation, which infers topics related to trending topics
2. Related #tag topics over a set of tweets by a user or group of users
3. Analysis of in/out flow, tweet flow
– Frequent @mentions
4. Find affiliation networks by list memberships, #tags or frequent @mentions
  • 95. Interesting Vectors of Exploration
5. Use centrality measures to determine mortals vs. celebrities
6. Classify tweet networks/cliques based on message-passing characteristics
– Tweets vs. retweets, no. of retweets, …
7. Retweet Network
– Measure influence by retweet count & frequency
– Information contagion by looking at different retweet network subcomponents – who, when, how much, …
  • 96. Twitter Network Graph Analysis – An Example
  • 97. Analysis Story Board
• @clouderati is a popular cloud-related Twitter account
• Goals (in this tutorial):
o Analyze the social graph characteristics of the users who are following the account
• Dig one level deep, to the followers & friends of the followers of @clouderati
• For you to explore!
o How many cliques? How strong are they?
o Does the @mention support the clique inferences?
o What are the retweet characteristics?
o How does the #tag network graph look?
  • 98. Twitter Analysis Pipeline Story Board – Stages, Strategies, APIs & Tasks
o Stage 3: Get the distinct user list applying the set(union(list)) operation
o Stage 4: Get & store user details (distinct user list); unroll
• Note: needed a command buffer to manage scale (~980,000 users)
• Note: the unroll stage took time & missteps
o Stage 5: For each @clouderati follower, find the friend=follower intersection – a set intersection
o Stage 6: Create the social graph; apply network theory; infer cliques & other properties
  • 99. @clouderati Twitter Social Graph
• Stats (in retrospect, after the runs):
o Stage 1
• @clouderati has 2072 followers
o Stage 2
• Limiting followers to 5,000 per user
o Stage 3
• Digging to the 1st level (set union of followers & friends of the followers of @clouderati) explodes into ~980,000 distinct users
o MongoDB of the cache and intermediate datasets: ~10 GB
o The database was hosted at AWS (Hi-Mem XLarge – m2.xlarge), 8 X 15 GB, RAID 10, open to the Internet with DB authentication
  • 100. Code & Run Walk-Through – Stage 1
o Get @clouderati followers; store in MongoDB
o Code:
§ oscon_2012_user_list_spider_01.py
o Challenges:
§ Nothing fancy
§ Get the record and store
o Interesting Points:
§ Would have had to recurse through a REST cursor if there were more than 5000 followers
§ @clouderati has 2072 followers
  • 101. Code & Run Walk-Through – Stage 2
o Crawl 1 level deep; get friends & followers; validate, re-crawl & refresh
o Code:
§ oscon_2012_user_list_spider_02.py
§ oscon_2012_twitter_utils.py
§ oscon_2012_mongo.py
§ oscon_2012_validate_dataset.py
o Challenges:
§ Multiple runs, errors et al!
o Interesting Points:
§ Set operation between two mongo collections for the restart buffer
§ Protected users; some had 0 followers or 0 friends
§ Interesting operations for validate, re-crawl and refresh
§ Added "status_code" to differentiate protected users
§ {$set: {status_code: "401 Unauthorized"}}
§ Getting friends & followers of 2000 users is the hardest (or so I thought, until I got through the next stage!)
  • 102. Validate–Recrawl–Refresh Logs
pymongo version = 2.2
Connected to DB!
…
2075
Error Friends: <type 'exceptions.KeyError'>
4ff3cd40e5557c00c7000000 - none has 2072 followers & 0 friends
Error Friends: <type 'exceptions.KeyError'>
4ff3a958e5557cfc58000000 - none has 2072 followers & 0 friends
Error Friends: <type 'exceptions.KeyError'>
4ff3ccdee5557c00b6000000 - none has 2072 followers & 0 friends
4ff3d3b9e5557c01b900001e - 371187804 has 0 followers & 0 friends
4ff3d3d8e5557c01b9000048 - 63488295 has 155 followers & 0 friends
4ff3d3d9e5557c01b9000049 - 342712617 has 0 followers & 0 friends
4ff3d3d9e5557c01b900004a - 21266738 has 0 followers & 0 friends
4ff3d3dae5557c01b900004b - 204652853 has 0 followers & 0 friends
…
4ff475cfe5557c1657000074 - 258944989 has 0 followers & 0 friends
4ff475d3e5557c165700007d - 327286780 has 0 followers & 0 friends
Looks like we have 132 not-so-good records
Elapsed Time = 0.546846
Notes:
o 1st run – 132 bad records
o This is the classic Erlang-style supervisor
o The crawl continues on transport errors without worrying about retry
o Validate will recrawl & refresh as needed
  • 103. Code & Run Walk-Through – Stage 3
o Get the distinct user list applying the set(union(list)) operation (sketch below)
o Code:
§ oscon2012_analytics_01.py
o Challenges:
§ Figure out the right set operations
o Interesting Points:
§ 973,323 unique users!
§ Recursively apply set union over 400,000 lists
§ The set operations took slightly more than a minute
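Stage 3's set(union(list)) step reduces to something like this (illustrative; follower_lists stands in for the per-user follower/friend id lists pulled from MongoDB):

# Each element is one user's follower (or friend) id list, as stored in Mongo.
follower_lists = [
    [101, 102, 103],
    [102, 104],
    [103, 105, 106],
]

# Recursively union the lists into one distinct-user set.
distinct_users = set()
for ids in follower_lists:
    distinct_users |= set(ids)  # i.e. set.union

print(len(distinct_users), "unique users")  # 6 here; ~973,323 in the real run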
  • 104. Code & Run Walk-Through – Stage 4
o Get & store user details (distinct user list); unroll
o Code:
§ oscon2012_analytics_01.py (focus on cmd string creation)
§ oscon2012_get_user_info_01.py
§ oscon2012_unroll_user_list_01.py
§ oscon2012_unroll_user_list_02.py
o Challenges:
§ Where do I start? (in the next few slides)
§ Took me a few days to get it right (along with my daily job!)
§ Unfortunately I did not employ parallelism & didn't use my MacPro with 32 GB memory. So the runs were long
§ But learned hard lessons on checkpoint & restart
o Interesting Points:
§ Tracking control numbers
§ Time … a marathon unroll run of 19:33:33!
  • 105. Twitter @ Scale Pattern
• Challenge:
o You want to get screen names, follower counts and other details for a million users
• Problem:
o No easy REST API
o https://api.twitter.com/1/users/lookup.json will take 100 user_ids and give details
• Solution:
o This is a scalability challenge. Approach it like so (sketch below):
o Create a command buffer collection in MongoDB, splitting the million user_ids into batches of 100
o Have a "done" flag initialized to 0 for checkpoint & restart
o After each cmd str is executed, set "done": 1
o For subsequent runs, ignore "done": 1
o Also helps in control number tracking
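A sketch of that command buffer pattern with pymongo (illustrative, using pymongo 3.x-style calls rather than the 2.2 used in the deck; the collection & field names follow the control-number slides below, and the execute step is elided):

from pymongo import MongoClient

db = MongoClient().oscon2012

def build_command_buffer(user_ids):
    # Split the id list into batches of 100 - one users/lookup.json call each.
    for seq_no, i in enumerate(range(0, len(user_ids), 100)):
        batch = user_ids[i:i + 100]
        api_str = ",".join(str(u) for u in batch)
        db.api_str.insert_one({"seq_no": seq_no, "api_str": api_str, "done": 0})

def run_command_buffer(execute):
    # Checkpoint & restart: only pick up commands not yet done,
    # and flip the flag after each successful call.
    for cmd in db.api_str.find({"done": 0}):
        execute(cmd["api_str"])  # e.g. call users/lookup.json & store the result
        db.api_str.update_one({"_id": cmd["_id"]}, {"$set": {"done": 1}})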
  • 106. Control Numbers
  • 107. Control Numbers
> db.t_users_info.count()
8122
> db.api_str.count({"done":0, "seq_no":{"$lt":8185}})
63
> db.api_str.find({"done":0, "seq_no":{"$lt":8185}}, {"seq_no":1})
{ "_id" : ObjectId("4ff4daeae5557c28bf001d53"), "seq_no" : 5433 }
{ "_id" : ObjectId("4ff4daeae5557c28bf001d59"), "seq_no" : 5439 }
{ "_id" : ObjectId("4ff4daeae5557c28bf001d5f"), "seq_no" : 5445 }
{ "_id" : ObjectId("4ff4daebe5557c28bf001d74"), "seq_no" : 5466 }
{ "_id" : ObjectId("4ff4daece5557c28bf001d7a"), "seq_no" : 5472 }
{ "_id" : ObjectId("4ff4daece5557c28bf001d80"), "seq_no" : 5478 }
{ "_id" : ObjectId("4ff4daede5557c28bf001d90"), "seq_no" : 5494 }
{ "_id" : ObjectId("4ff4daefe5557c28bf001daf"), "seq_no" : 5525 }
Notes:
o The collection should have 8185 documents, but it has only 8122. Where did the rest go?
o 63 of them still have done=0
o 8122 + 63 = 8185! Aha, mystery solved. They fell through the cracks
o Need a catch-all final run
  • 108. A Day in the Life of a Control Number Detective – Run #1
• Remember: 973,323 users. So, 9734 cmd strings (100 users per string)
> db.api_str.count()
9831
> db.api_str.count({"done":0})
239
> db.t_users_info.count()
9592
> db.api_str.count({"api_str":""})
97
• So we should have 9831 – 97 = 9734 records
• The second run should generate 9734 – 9592 = 142 calls (i.e. 350 – 142 = 208 rate-limit should remain). Let us see.
{
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "209",
  …
}
• Yep, 209 left
  • 109. A Day in the Life of a Control Number Detective – Run #2
• Remember: 973,323 users. So, 9734 cmd strings (100 users per string)
> db.t_users_info.count()
9728
> db.api_str.count({"api_str":""})
97
> db.api_str.count({"done":0})
103
• 9734 – 9728 = 6, same as 103 – 97!
• Run once more!
> db.api_str.find({"done":0},{"seq_no":1})
…
{ "_id" : ObjectId("4ff4dbd4e5557c28bf002e22"), "seq_no" : 9736 }
{ "_id" : ObjectId("4ff4db05e5557c28bf001f47"), "seq_no" : 5933 }
{ "_id" : ObjectId("4ff4db8be5557c28bf0028f6"), "seq_no" : 8412 }
{ "_id" : ObjectId("4ff4dba2e5557c28bf002a8c"), "seq_no" : 8818 }
{ "_id" : ObjectId("4ff4dbaee5557c28bf002b69"), "seq_no" : 9039 }
{ "_id" : ObjectId("4ff4dbb8e5557c28bf002c1c"), "seq_no" : 9218 }
…
{
  …
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "344",
  …
}
• Yep, 6 more records
> db.t_users_info.count()
9734
• Good, got 9734!
Professor Layton would be proud! In fact, I have all four & plan to spend some time with them & Laphroaig!
  • 110. Monitor runs & track control numbers
Unroll run: 8:48 PM to ~4:08 PM the next day!
  • 111. Track errors & the document numbers
  • 112. Code & Run Walk-Through – Stage 5
o For each @clouderati follower, find friend=follower – a set intersection (sketch below)
o Code:
§ oscon2012_find_strong_ties_01.py
§ oscon2012_social_graph_stats_01.py
o Challenges:
§ None. Python set operations made this easy
o Interesting Points:
§ Even at this scale, a single machine is not enough
§ Should have tried data parallelism
• This task is well suited to leverage data parallelism as it is commutative & associative
• Was getting an invalid cursor error from MongoDB
• So had to do the updates in two steps
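The friend=follower intersection reduces to a one-liner per user (a sketch; the id lists are made up):

# Strong ties: accounts that appear in BOTH the followers and friends
# lists of a user (i.e. a mutual follow).
followers = [101, 102, 103, 104]
friends = [102, 104, 105]

strong_ties = set(followers) & set(friends)
print(strong_ties)  # {102, 104}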
  • 113. Code  &  Run  Walk  Through
Stage  6
o  Create  social  graph
o  Apply  network  theory
o  Infer  cliques  &  other  properties  (see  the  sketch  below)
o  Code:
  §  oscon2012_find_cliques_01.py
o  Challenges:
  §  Memory !
o  Interesting  Points:
  §  Lots  of  good  information  hidden  in  the  data !
  §  Graph,  list  &  set  operations
  §  networkx  has  lots  of  interesting  graph  algorithms
  §  collections.Counter  to  the  rescue
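A minimal networkx sketch of the clique inference over a toy edge list — hypothetical data, NOT the actual oscon2012_find_cliques_01.py:

# find_cliques.py - a sketch over made-up strong-tie edges
import networkx as nx
from collections import Counter

G = nx.Graph()
G.add_edges_from([('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')])   # toy edges

sizes      = Counter()
membership = Counter()
for clique in nx.find_cliques(G):            # enumerate maximal cliques
    sizes[len(clique)] += 1
    for user in clique:
        membership[user] += 1                # how many cliques each user is in

print("Clique Distribution = %s" % dict(sizes))
print("Top clique members  = %s" % membership.most_common(3))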
  • 114. Twitter  Social  Graph  Analysis  of  @clouderati
o  2072  followers;  973,323  unique  users  one  level  down  w/  followers/friends  trimmed  at  5,000
o  Strong  ties  (follower=friend)
  o  235,697  users,  462,419  edges
o  501,367  cliques
  o  253  unique  users  in  the  8,906  cliques  w/  10+  users
  o  GeorgeReese  in  7,973  of  them !  See  list  for  1st  125
  o  krishnan  3,446,  randy  2,197,  joe  1,977,  sam  1,937,  jp  485,  stu  403,  urquhart  263,  beaker  226,  acroll  149,  adrian  63,  gevaperry  24
o  Of  course,  clique  analysis  does  not  tell  us  the  whole  story  …
Clique  Distribution  =  {2:  296521,  3:  58368,  4:  36421,  5:  28788,  6:  24197,  7:  20240,  8:  15997,  9:  11929,  10:  6576,  11:  1909,  12:  364,  13:  55,  14:  2}
  • 115. Twitter  Social  Graph  Analysis  of  @clouderati
o  Sort  by  followers  vs.  sort  by  strong  ties  is  interesting
o  (Slide  annotations:  celebrity  –  very  low  strong  ties;  higher  celebrity,  low  strong  ties;  medium  celebrity,  medium  strong  ties)
  • 116. Twitter  Social  Graph  Analysis  of  @clouderati
o  A  higher  “Strong  Ties”  number  is  interesting
  §  It  means  a  very  high  follower-friend  intersection
  §  Reeves  62%,  bgolden  85%
o  But  a  high  clique  count  with  a  smaller  “Strong  Ties”  number  shows  a  more  cohesive  &  stronger  social  graph
  §  e.g.  Krishnan  –  15%  friends-followers
  §  Samj  –  33%
  • 117. Twitter  Social  Graph  Analysis  of  @clouderati
o  Ideas  for  more  exploration
  §  Include  all  followers  (instead  of  stopping  at  the  5000  cap)
  §  Get  tweets  &  track  @mentions
  §  Frequent  @mentions  show  stronger  ties
  §  #tag  analysis  could  reveal  some  interesting  networks
  • 118. Twitter Tips – A Baker’s Dozen
1.  Twitter  APIs  are  (more  or  less)  congruent  &  symmetric
2.  Twitter  is  usually  right  &  simple  –  recheck  when  you  get  unexpected  results  before  blaming  Twitter
  o  I  was  getting  numbers  when  I  was  expecting  screen_names  in  user  objects
  o  Was  ready  to  send  blasting  e-mails  to  the  Twitter  team.  Decided  to  check  one  more  time  and  found  that  my  parameter  key  was  wrong  –  screen_name  instead  of  user_id
  o  Always  test  with  one  or  two  records  before  a  long  run !  –  learned  the  hard  way
3.  Twitter  APIs  are  very  powerful  –  consistent  use  can  yield  huge  amounts  of  data
  o  In  a  week,  you  can  pull  in  4-5  million  users  &  some  tweets !
  o  Night  runs  are  far  faster  &  more  error-free
4.  Use  a  NOSQL  data  store  as  a  command  buffer  &  data  buffer
  o  Would  make  it  easy  to  work  with  Twitter  at  scale
  o  I  use  MongoDB
  o  Keep  the  schema  simple  &  no  fancy  transformations
    •  And,  as  far  as  possible,  the  same  as  the  (json)  response
  o  Use  the  NOSQL  CLI  for  trimming  records  et  al
The  Beginning  As  The  End
  • 119. Twitter Tips – A Baker’s Dozen
5.  Always  use  a  big  data  pipeline
  o  Collect  –  Store  –  Transform  &  Analyze  –  Model  &  Reason  –  Predict,  Recommend  &  Visualize
  o  That  way  you  can  orthogonally  extend,  with  functional  components  like  command  buffers,  validation  et  al
6.  Use  a  functional  approach  for  a  scalable  pipeline  (see  the  sketch  after  this  list)
  o  Compose  your  big  data  pipeline  with  well  defined  granular  functions,  each  doing  only  one  thing
  o  Don’t  overload  the  functional  components  (i.e.  no  collect,  unroll  &  store  as  a  single  component)
  o  Have  well  defined  functional  components  with  appropriate  caching,  buffering,  checkpoints  &  restart  techniques
    •  This  did  create  some  trouble  for  me,  as  we  will  see  later
7.  Crawl-Store-Validate-Recrawl-Refresh  cycle
  o  The  equivalent  of  traditional  ETL
  o  The  validation  stage  &  validation  routines  are  important
    •  Cannot  expect  perfect  runs
    •  Cannot  manually  look  at  data  either,  when  data  is  at  scale
8.  Have  control  numbers  to  validate  runs  &  monitor  them
  o  I  still  remember  control  numbers  which  start  with  the  number  of  punch  cards  in  the  input  deck  and  then  follow  that  number  through  the  various  runs !
  o  There  would  be  a  separate  printout  of  the  control  numbers,  kept  in  the  operations  files
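A minimal sketch of the functional-pipeline idea in tip 6 — the stage names and bodies are placeholders, purely to show the composition:

# pipeline.py - a sketch; every stage body here is a placeholder
def collect(seed):    return ['{"users": []}']   # hit the API, return raw json
def store(raw):       return raw                 # persist as-is, same shape as response
def unroll(docs):     return docs                # flatten arrays -> one doc per user
def validate(docs):   return docs                # check control numbers, flag stragglers

def run_pipeline(seed, stages=(collect, store, unroll, validate)):
    data = seed
    for stage in stages:          # each stage does one thing & can be swapped out
        data = stage(data)
    return data

run_pipeline('@clouderati')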
  • 120. Twitter Tips – A Baker’s Dozen
9.  Program  defensively  (see  the  sketch  after  this  list)
  o  More  so  for  REST-based  big  data  analytics  systems
  o  Expect  failures  at  the  transport  layer  &  accommodate  for  them
10.  Have  Erlang-style  supervisors  in  your  pipeline
  o  Fail  fast  &  move  on
  o  Don’t  linger  and  try  to  fix  errors  that  cannot  be  controlled  at  that  layer
  o  A  higher  layer  process  will  circle  back  and  do  incremental  runs  to  correct  missing  spiders  and  crawls
  o  Be  aware  of  visibility  &  lack  of  context.  Validate  at  the  lowest  layer  that  has  enough  context  to  take  corrective  actions
  o  I  have  an  example  in  Part  2
11.  Data  will  never  be  perfect
  o  Know  your  data  &  accommodate  for  its  idiosyncrasies
    •  For  example:  0  followers,  protected  users,  0  friends,  …
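A minimal sketch of the defensive, fail-fast call pattern with the requests library — the retry counts, timeouts and return values are assumptions, not the tutorial's actual code:

# defensive_get.py - a sketch; retry policy & sentinels are assumptions
import time
import requests

def fetch(url, retries=2, wait=5):
    for attempt in range(retries + 1):
        try:
            r = requests.get(url, timeout=30)
            if r.status_code == 200:
                return r.json()    # parsed json (r.json was a property in 2012-era requests)
            # fail fast on any other status; the higher-layer catch-all run
            # (driven by the done=0 flags) will circle back for this command
            return None
        except requests.RequestException:
            time.sleep(wait)       # transport-layer hiccup - pause & retry
    return None                    # give up; leave the command marked done=0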
  • 121. Twitter Tips – A Baker’s Dozen
12.  Checkpoint  frequently  (preferably  after  every  API  call)  &  have  a  re-startable  command  buffer  cache  (see  the  sketch  after  this  list)
  o  See  a  MongoDB  example  in  Part  2
13.  Don’t  bombard  the  URL
  o  Wait  a  few  seconds  between  calls.  You  will  end  up  with  a  scalable  system,  eventually
  o  I  found  10  seconds  to  be  the  sweet  spot.  5  seconds  gave  retry  errors;  was  able  to  work  with  5  seconds  plus  wait  &  retry,  but  then  the  rate  limit  started  kicking  in !
14.  Always  measure  the  elapsed  time  of  your  API  runs  &  processing
  o  A  kind  of  early  warning  when  something  is  wrong
15.  Develop  incrementally;  don’t  fail  to  check  “cut  &  paste”  errors
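A minimal sketch tying tips 12–14 together — a re-startable loop over the command buffer, checkpointing done=1 after every call, pacing with the 10-second wait, and timing each call; the collections follow the control-number runs above, and fetch() is the hypothetical helper sketched under tip 9:

# command_buffer_run.py - a sketch; collection & field names are assumptions
import time
from pymongo import Connection
from defensive_get import fetch    # the hypothetical helper from the tip 9 sketch

db = Connection()['oscon']         # hypothetical database name

for cmd in db.api_str.find({"done": 0}):       # restartable: only pending commands
    if not cmd['api_str']:
        continue                               # empty command strings stay done=0
    start = time.time()
    result = fetch(cmd['api_str'])
    elapsed = time.time() - start              # tip 14: early warning if this drifts
    if result is not None:
        db.t_users_info.insert(result)         # store, same shape as the response
        db.api_str.update({"_id": cmd["_id"]}, {"$set": {"done": 1}})   # checkpoint
    print("seq_no %d took %.1fs" % (cmd["seq_no"], elapsed))
    time.sleep(10)                             # tip 13: the 10-second sweet spot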
  • 122. Twitter Tips – A Baker’s Dozen
16.  The  Twitter  big  data  pipeline  has  lots  of  opportunities  for  parallelism
  o  Leverage  data  parallelism  frameworks  like  MapReduce
  o  But  first:
    §  Prototype  as  a  linear  system,
    §  Optimize  and  tweak  the  functional  modules  &  cache  strategies,
    §  Note  down  stages  and  tasks  that  can  be  parallelized  and
    §  Then  parallelize  them
  o  For  the  example  project,  as  we  will  see  later,  I  did  not  leverage  any  parallel  frameworks,  but  the  opportunities  were  clearly  evident.  I  will  point  them  out  as  we  progress  through  the  tutorial
17.  Pay  attention  to  handoffs  between  stages  (see  the  unroll  sketch  after  this  list)
  o  They  might  require  transformation  –  for  example,  collect  &  store  might  store  a  user  list  as  multiple  arrays,  while  the  model  requires  each  user  to  be  a  document  for  aggregation
  o  But  resist  the  urge  to  overload  collect  with  transform
  o  i.e.  let  the  collect  stage  store  in  arrays,  but  then  have  an  unroll/flatten  stage  to  transform  the  arrays  into  separate  documents
  o  Add  transformation  as  a  granular  function  –  of  course,  with  appropriate  buffering,  caching,  checkpoints  &  restart  techniques
18.  Have  a  good  log  management  system  to  capture  and  wade  through  logs
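A minimal sketch of the unroll/flatten handoff in tip 17 — the collection names (t_followers_raw, t_followers) and fields are hypothetical, not the tutorial's actual schema:

# unroll.py - a sketch; all names here are assumptions
from pymongo import Connection

db = Connection()['oscon']                     # hypothetical database name

for doc in db.t_followers_raw.find():          # collect stored the id arrays as-is
    for user_id in doc.get('ids', []):
        # flatten: one document per user, ready for aggregation
        db.t_followers.save({'_id': user_id, 'follows': doc['screen_name']})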
  • 123. Twitter Tips – A Baker’s Dozen
19.  Understand  the  underlying  network  characteristics  for  the  inference  you  want  to  make
  o  Twitter  Network  !=  Facebook  Network,  Twitter  Graph  !=  LinkedIn  Graph
  o  The  Twitter  network  is  more  of  an  interest  network
  o  So  many  of  the  traditional  network  mechanisms  &  mechanics,  like  network  diameter  &  degrees  of  separation,  might  not  make  sense
  o  But  others,  like  cliques  and  bipartite  graphs,  do  (see  the  sketch  below)
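A minimal networkx sketch of the bipartite idea — a toy user/#tag graph (made-up data) projected onto users, so users who share hashtags get connected:

# bipartite_tags.py - a sketch over made-up user/#tag data
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_nodes_from(['alice', 'bob', 'carol'], bipartite=0)    # users
B.add_nodes_from(['#cloud', '#bigdata'], bipartite=1)       # hashtags
B.add_edges_from([('alice', '#cloud'), ('bob', '#cloud'),
                  ('bob', '#bigdata'), ('carol', '#bigdata')])

# project onto users: an edge for every shared #tag, weighted by the count
users = bipartite.weighted_projected_graph(B, ['alice', 'bob', 'carol'])
print(users.edges(data=True))     # alice-bob share #cloud, bob-carol share #bigdata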
  • 124. Twitter Gripes
1.  Need  richer  APIs  for  #tags
  o  Somewhat  similar  to  users  viz.  followers,  friends  et  al
  o  Might  make  sense  to  make  #tags  a  top  level  object  with  its  own  semantics
2.  HTTP  error  returns  are  not  uniform
  o  Returns  400  Bad  Request  instead  of  420
  o  Granted,  there  is  enough  information  to  figure  this  out
3.  Need  an  easier  way  to  get  screen_name  from  user_id  (see  the  sketch  below)
4.  “following”  vs.  “friends_count”  –  i.e.  “following”  is  a  dummy  variable
  o  There  are  a  few  like  this,  most  probably  for  backward  compatibility
5.  Parameter  validation  is  not  uniform
  o  Gives  “404  Not  Found”  instead  of  “406  Not  Acceptable”  or  “416  Range  Unacceptable”
6.  Overall,  more  validation  would  help
  o  Granted,  it  is  more  growing  pains.  Once  one  comes  across  a  few  inconsistencies,  the  rest  is  easy  to  figure  out
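For gripe 3, a sketch of the workaround of the time — batching ids through the 2012-era v1 users/lookup endpoint and picking the screen_name off each user object; authentication is omitted for brevity and the ids below are made up:

# id_to_screen_name.py - a sketch; assumptions as noted above
import requests

def screen_names(user_ids):
    url = 'https://api.twitter.com/1/users/lookup.json'
    ids = ','.join(str(i) for i in user_ids)     # up to 100 ids per call
    r = requests.get(url, params={'user_id': ids})
    return dict((u['id'], u['screen_name']) for u in r.json())

print(screen_names([12345, 67890]))              # made-up ids, for illustration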
  • 125. Thanks To these Giants …
  • 130. I had a good time researching & preparing for this tutorial. I hope you learned a few new things & have a few vectors to follow.