The Art of Social Media Analysis
with Twitter & Python

Krishna Sankar
@ksankar
http://www.oscon.com/oscon2012/public/schedule/detail/23130
Intro
o House Rules (1 of 2)
   o Doesn't assume any knowledge of the Twitter API
   o Goal: Everybody on the same page & get a working knowledge of the Twitter API
   o To bootstrap your exploration into Social Network Analysis & Twitter
   o Simple programs, to illustrate usage & data manipulation

[Diagram: Twitter Network Analysis Pipeline — API & objects; @mention network; NLP, NLTK & sentiment analysis; cliques & social graph; retweet analytics; #tag network; information contagion; growth & weak ties. We will analyze @clouderati: 2,072 followers, exploding to ~980,000 distinct users one level down.]
Intro
o House Rules (2 of 2)
   o Am using the requests library
   o There are good Twitter frameworks for Python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help
   o Many areas to explore – not enough time. So decided to focus on the social graph, cliques & networkx

[Same Twitter Network Analysis Pipeline diagram as the previous slide.]
About Me
•  Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com
   o  Co-chair – 2012 IEEE Precision Time Synchronization
      •  http://www.ispcs.org/2012/index.html
   o  Blog: http://doubleclix.wordpress.com/
   o  Quora: http://www.quora.com/Krishna-Sankar
•  Prior Gigs
   o  Lead Architect (Egnyte)
   o  Distinguished Engineer (CSCO)
   o  Employee #64439 (CSCO) to #39 (Egnyte) & now #9 !
•  Current Focus:
   o  Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
   o  Big Data (more of variety, variability, context & graphs, than volume or velocity – so far !)
   o  Overlay-based semantic search & ranking
•  Other related Presentations
   o  http://goo.gl/P1rhc Big Data Engineering Top 10 Pragmatics (Summary)
   o  http://goo.gl/0SQDV The Art of Big Data (Detailed)
   o  http://goo.gl/EaUKH The Hitchhiker's Guide to Kaggle, OSCON 2011 Tutorial
Twitter Tips – A Baker's Dozen
1.  Twitter APIs are (more or less) congruent & symmetric
2.  Twitter is usually right & simple – recheck when you get unexpected results before blaming Twitter
    o  I was getting numbers when I was expecting screen_names in user objects
    o  Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong – screen_name instead of user_id
    o  Always test with one or two records before a long run ! – learned the hard way
3.  Twitter APIs are very powerful – consistent use can yield huge data
    o  In a week, you can pull in 4-5 million users & some tweets !
    o  Night runs are far faster & more error-free
4.  Use a NoSQL data store as a command buffer & data buffer
    o  Makes it easy to work with Twitter at scale
    o  I use MongoDB
    o  Keep the schema simple & do no fancy transformation
       •  And as far as possible, the same as the (JSON) response
    o  Use the NoSQL CLI for trimming records et al
Twitter Tips – A Baker's Dozen
5.  Always use a big data pipeline
    o  Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
    o  That way you can orthogonally extend, with functional components like command buffers, validation et al
6.  Use a functional approach for a scalable pipeline
    o  Compose your big data pipeline with well-defined granular functions, each doing only one thing
    o  Don't overload the functional components (i.e. no collect, unroll & store as a single component)
    o  Have well-defined functional components with appropriate caching, buffering, checkpoints & restart techniques
       •  This did create some trouble for me, as we will see later
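Tips 5 & 6 can be sketched as a composition of single-purpose stage functions. A minimal sketch — the stage functions and records here are hypothetical stand-ins, not the tutorial's code:

```python
# Compose a pipeline from granular, single-purpose stage functions.
from functools import reduce

def collect(seed):
    # Hypothetical stand-in for the API-calling stage: returns raw "user" records
    return [{"id": i, "followers": i * 10} for i in seed]

def store(records):
    # Stand-in for a NoSQL write; here we just tag each record
    return [dict(r, stored=True) for r in records]

def transform(records):
    # One thing only: keep users with at least 20 followers
    return [r for r in records if r["followers"] >= 20]

def compose(*stages):
    # Chain stages left to right; each consumes the previous stage's output
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

pipeline = compose(collect, store, transform)
result = pipeline([1, 2, 3])
```

New stages (validation, a command buffer) slot in by adding one more function to `compose`, without touching the others — the "orthogonally extend" point above.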
7.  Crawl-Store-Validate-Recrawl-Refresh cycle
    o  The equivalent of the traditional ETL
    o  Validation stage & validation routines are important
       •  Cannot expect perfect runs
       •  Cannot manually look at data either, when data is at scale
8.  Have control numbers to validate runs & monitor them
    o  I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs !
    o  There will be a separate printout of the control numbers that will be kept in the operations files
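The control-number idea in tip 8 can be sketched as a record count carried through each stage and checked against an expected drop. The stage and thresholds here are hypothetical:

```python
# Carry a control number (record count) through pipeline stages and
# flag any stage whose output diverges beyond the expected drop.
def with_control(stage_name, stage_fn, records, expected_drop=0):
    out = stage_fn(records)
    control_in, control_out = len(records), len(out)
    # A stage may legitimately drop records (e.g. protected users, 0 followers);
    # anything beyond the expected drop is flagged for investigation.
    ok = (control_in - control_out) <= expected_drop
    return out, {"stage": stage_name, "in": control_in, "out": control_out, "ok": ok}

records = [{"id": i} for i in range(100)]
# Hypothetical dedupe stage that drops 3 of 100 records:
deduped, ctl = with_control("dedupe", lambda rs: rs[:97], records, expected_drop=5)
```

The `ctl` dicts accumulate into a run log — the modern equivalent of the separate control-number printout.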
Twitter Tips – A Baker's Dozen
9.  Program defensively
    o  More so for REST-based big data analytics systems
    o  Expect failures at the transport layer & accommodate for them
10. Have Erlang-style supervisors in your pipeline
    o  Fail fast & move on
    o  Don't linger and try to fix errors that cannot be controlled at that layer
    o  A higher-layer process will circle back and do incremental runs to correct missing spiders and crawls
    o  Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
    o  I have an example in part 2
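Tip 10's fail-fast-and-move-on idea can be sketched as a supervisor loop that records failures instead of retrying in place, leaving the rework to a later incremental run. The `flaky_fetch` worker is a simulated stand-in, not a real API call:

```python
# Erlang-style supervision in miniature: fail fast on each work item,
# record the failure, and let a later incremental run redo the misses.
def supervise(work_items, worker):
    done, failed = {}, []
    for item in work_items:
        try:
            done[item] = worker(item)
        except Exception:
            failed.append(item)   # fail fast: no in-place retry at this layer
    return done, failed

def flaky_fetch(user_id):
    # Hypothetical worker: pretend even ids fail at the transport layer
    if user_id % 2 == 0:
        raise ConnectionError("transport failure")
    return {"id": user_id}

done, failed = supervise([1, 2, 3, 4, 5], flaky_fetch)
# The higher layer circles back with only the failed items:
redone, still_failed = supervise(failed, lambda u: {"id": u})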
11.  Data	
  will	
  never	
  be	
  perfect	
  
       o  Know	
  your	
  data	
  &	
  accommodate	
  for	
  it’s	
  idiosyncrasies	
  	
  
              •  for	
  example:	
  0	
  followers,	
  protected	
  users,	
  0	
  friends,…	
  
Twitter Tips – A Baker’s Dozen	
12.  Check	
  Point	
  frequently	
  (preferably	
  after	
  ever	
  API	
  call)	
  &	
  have	
  a	
  
     re-­‐startable	
  command	
  buffer	
  cache	
  	
  
     o      See a MongoDB example in Part 2
13.  Don’t	
  bombard	
  the	
  URL	
  
     o      Wait	
  a	
  few	
  seconds	
  before	
  successful	
  calls.	
  This	
  will	
  end	
  up	
  with	
  a	
  
            scalable	
  system,	
  eventually	
  
     o      I found 10 seconds to be the sweet spot. 5 seconds gave retry error. Was able to
            work with 5 seconds with wait & retry. Then, the rate limit started kicking in ! 
14.  Always	
  measure	
  the	
  elapsed	
  time	
  of	
  your	
  API	
  runs	
  &	
  processing	
  
     o      Kind	
  of	
  early	
  warning	
  when	
  something	
  is	
  wrong	
  
15.  Develop	
  incrementally;	
  don’t	
  fail	
  to	
  check	
  “cut	
  &	
  paste”	
  errors	
  
Twitter Tips – A Baker’s Dozen	
16.  The	
  Twitter	
  big	
  data	
  pipeline	
  has	
  lots	
  of	
  opportunities	
  for	
  parallelism	
  
      o       Leverage	
  data	
  parallelism	
  frameworks	
  like	
  MapReduce	
  
      o       But	
  first	
  :	
  
             §       Prototype	
  as	
  a	
  linear	
  system,	
  	
  
             §       Optimize	
  and	
  tweak	
  the	
  functional	
  modules	
  &	
  cache	
  strategies,	
  	
  
             §       Note	
  down	
  stages	
  and	
  tasks	
  that	
  can	
  be	
  parallelized	
  and	
  	
  
             §       Then	
  parallelize	
  them	
  
      o       For the example project, we will see later, I did not leverage any parallel frameworks, but the
              opportunities were clearly evident. I will point them out, as we progress through the tutorial
17.  	
  Pay	
  attention	
  to	
  handoffs	
  between	
  stages	
  
      o      They	
  might	
  require	
  transformation	
  –	
  for	
  example	
  collect	
  &	
  store	
  might	
  store	
  a	
  user	
  list	
  
             as	
  multiple	
  arrays,	
  while	
  the	
  model	
  requires	
  each	
  user	
  to	
  be	
  a	
  document	
  for	
  
             aggregation	
  	
  
      o      But resist the urge to overload collect with transform
             o       i.e let the collect stage store in arrays, but then have an unroll/flatten stage to transform
                     the array to separate documents 
      o      Add transformation as a granular function – of course, with appropriate buffering, caching,
             checkpoints & restart techniques 
18.  Have	
  a	
  good	
  log	
  management	
  system	
  to	
  capture	
  and	
  wade	
  through	
  
     logs	
  	
  
Twitter Tips – A Baker’s Dozen	
19.  Understand	
  the	
  underlying	
  network	
  characteristics	
  for	
  the	
  
     inference	
  you	
  want	
  to	
  make	
  
     o    Twitter	
  Network	
  	
  !=	
  Facebook	
  Network	
  ,	
  	
  Twitter	
  Graph	
  !=	
  LinkedIn	
  Graph	
  
     o    Twitter	
  Network	
  is	
  more	
  of	
  an	
  Interest	
  Network	
  
     o    So, many of the traditional network mechanisms & mechanics, like network
          diameter & degrees of separation, might not make sense
     o    But, others like Cliques and Bipartite Graphs do
Twitter Gripes	
1.     Need	
  more	
  rich	
  APIs	
  for	
  #tags	
  
      o      Somewhat	
  similar	
  to	
  users	
  viz.	
  followers,	
  friends	
  et	
  al	
  
      o      Might	
  make	
  sense	
  to	
  make	
  #tags	
  a	
  top	
  level	
  object	
  with	
  it’s	
  own	
  semantics	
  
2.  HTTP	
  Error	
  Return	
  is	
  not	
  uniform	
  	
  
      o      Returns	
  400	
  bad	
  Request	
  instead	
  of	
  420	
  
      o      Granted, there is enough information to figure this out
3.  Need	
  an	
  easier	
  way	
  to	
  get	
  screen_name	
  from	
  user_id	
  
4.  “following”	
  vs.	
  “friends_count”	
  i.e.	
  “following”	
  is	
  a	
  dummy	
  variable.	
  
      o      There are a few like this, most probably for backward compatibility
5.     Parameter	
  Validation	
  is	
  not	
  uniform	
  
      o      Gives	
  “404	
  Not	
  found”	
  instead	
  of	
  “406	
  Not	
  Acceptable”	
  or	
  “413	
  Too	
  Long”	
  or	
  “416	
  
             Range	
  Unacceptable”	
  
6.  Overall	
  more	
  validation	
  would	
  help	
  
      o      Granted, it is more of growing pains. Once one comes across a few inconsistencies, the
             rest is easy to figure out
A Fork	

                           	
  
                & 	
  deep
       ,NLTK	
   	
  
•   NLP weets
    into	
  T ment	
  
             4
       o  Sen ysis	
  
           Anal


             • Not enough time for both
                • I chose the Social Graph route
A minute about Twitter as platform & it’s evolution	


                                                                                                   blog/
                                                                                           er. com/ tter-­‐
                                                                                     twitt         wi
                                                                           ps:/ /dev. nsistent-­‐t
                                                                        htt ring-­‐co
                                                                              e
                                                                         deliv ence	
                                                    “The micro-blogging service must find the
                                                                               ri
                                                                          expe
                                                                                                                                         right balance of running a profitable
                                                                                                                                         business and maintaining a robust
         “.. we want to make sure that the Twitter experience is                                                                         developers' community.” – Chenda, CBS
     straightforward and easy to understand -- whether you’re on
                                                                                                                                         news!
              Twitter.com or elsewhere on the web”-Michael!
My	
  Wish	
  &	
  Hope	
  
•  I	
  spend	
  a	
  lot	
  of	
  time	
  with	
  Twitter	
  &	
  derive	
  value;	
  the	
  platform	
  is	
  rich	
  &	
  the	
  APIs	
  intuitive	
  
•  I	
  did	
  like	
  the	
  fact	
  that	
  tweets	
  are	
  part	
  of	
  LinkedIn.	
  I	
  still	
  used	
  Twitter	
  more	
  than	
  LinkedIn	
  
          o      I	
  don’t	
  think	
  showing	
  Tweets	
  in	
  LinkedIn	
  took	
  anything	
  away	
  from	
  the	
  Twitter	
  experience	
  
          o      LinkedIn	
  experience	
  &	
  Twitter	
  experience	
  are	
  different	
  &	
  distinct.	
  Showing	
  tweets	
  in	
  LinkedIn	
  didn’t	
  change	
  that	
  
•       I	
  sincerely	
  hope	
  that	
  the	
  platform	
  grows	
  with	
  a	
  rich	
  developer	
  eco	
  system	
  
•       Orthogonally	
  extensible	
  platform	
  is	
  essential	
  
•       Of	
  course,	
  along	
  with	
  a	
  congruent	
  user	
  experience	
  –	
  “	
  …	
  core	
  Twitter	
  consumption	
  experience	
  through	
  consistent	
  tools”	
  
•    For	
  Hands	
  on	
  Today	
  
                                                                                                                    Setup	
      o  Python	
  2.7.3	
  
      o  easy_install	
  –v	
  requests	
  
           •  http://docs.python-­‐requests.org/en/latest/user/quickstart/#make-­‐a-­‐
              request	
  
      o  easy_install	
  –v	
  requests-­‐oauth	
  
      o  Hands	
  on	
  programs	
  at	
  https://github.com/xsankar/oscon2012-­‐handson	
  
•    For	
  advanced	
  data	
  science	
  with	
  social	
  graphs	
  
      o  easy_install	
  –v	
  networkx	
  
      o  easy_install	
  –v	
  numpy	
  
      o  easy_install	
  –v	
  nltk	
  	
  
           •  Not	
  for	
  this	
  tutorial,	
  but	
  good	
  for	
  sentiment	
  analysis	
  et	
  al	
  
      o  Mongodb	
  	
  
           •  I	
  used	
  MongoDB	
  in	
  AWS	
  m2.xlarge,	
  RAID	
  10	
  X	
  8	
  X	
  15	
  GB	
  EBS	
  
      o  graphviz	
  -­‐	
  http://www.graphviz.org/;	
  easy_install	
  pygraphviz	
  
      o  easy_install	
  pydot	
  
Thanks To these Giants …
Problem Domain For this tutorial	

•  Data	
  Science	
  (trends,	
  analytics	
  et	
  al)	
  on	
  Social	
  Networks	
  as	
  
   observed	
  by	
  Twitter	
  primitives	
  
     o  Not	
  for	
  Twitter	
  based	
  apps	
  for	
  real	
  time	
  tweets	
  
     o  Not	
  web	
  sites	
  with	
  real	
  time	
  tweets	
  
•  By	
  looking	
  at	
  the	
  domain	
  in	
  aggregate	
  to	
  derive	
  inferences	
  &	
  
   actionable	
  recommendations	
  
•  Which	
  also	
  means,	
  you	
  need	
  to	
  be	
  deliberate	
  &	
  systemic	
  (	
  i.e.	
  
   not	
  look	
  at	
  a	
  fluctuation	
  as	
  a	
  trend	
  but	
  dig	
  deeper	
  before	
  
   pronouncing	
  a	
  trend)	
  
Agenda	

I.     Mechanics	
  :	
  Twitter	
  API	
  (1:30	
  PM	
  -­‐	
  3:00	
  PM)	
  	
  
      o    Essential	
  Fundamentals	
  (Rate	
  Limit,	
  HTTP	
  Codes	
  et	
  al)	
  
      o    Objects	
  
      o    API	
  
      o    Hands-­‐on	
  (2:45	
  PM	
  -­‐	
  3:00	
  PM)	
  
II.  Break	
  (3:00	
  PM	
  -­‐	
  3:30	
  PM)	
  
III.  Twitter	
  Social	
  Graph	
  Analysis	
  (3:30	
  PM	
  -­‐	
  5:00	
  PM)	
  
      o      Underlying	
  Concepts	
  
      o      Social	
  Graph	
  Analysis	
  of	
  @clouderati	
  
           §  Stages,	
  Strategies	
  &	
  Tasks	
  
           §  Code	
  Walk	
  thru	
  	
  
Open  This  First
Twi5er  API  :  Read  These  First	
•    Using	
  Twitter	
  Brand	
  
      o  New	
  logo	
  &	
  associated	
  guidelines	
  :	
  https://twitter.com/about/logos	
  
      o  Twitter	
  Rules	
  :	
  
         https://support.twitter.com/groups/33-­‐report-­‐a-­‐violation/topics/121-­‐guidelines-­‐
         best-­‐practices/articles/18311-­‐the-­‐twitter-­‐rules	
  
      o  Developer	
  Rules	
  of	
  the	
  road	
  https://dev.twitter.com/terms/api-­‐terms	
  
•    Read	
  These	
  Links	
  First	
  
      1.       https://dev.twitter.com/docs/things-­‐every-­‐developer-­‐should-­‐know	
  
      2.       https://dev.twitter.com/docs/faq	
  
      3.       Field	
  Guide	
  to	
  Objects	
  https://dev.twitter.com/docs/platform-­‐objects	
  
      4.       Security	
  https://dev.twitter.com/docs/security-­‐best-­‐practices	
  
      5.       Media	
  Best	
  Practices	
  :	
  https://dev.twitter.com/media	
  
      6.       Consolidates	
  Page	
  :	
  https://dev.twitter.com/docs	
  
      7.       Streaming	
  APIs	
  https://dev.twitter.com/docs/streaming-­‐apis	
  
      8.       How	
  to	
  Appeal	
  (Not	
  that	
  you	
  all	
  would	
  need	
  it	
  !)	
  https://support.twitter.com/
               articles/72585	
  
•    Only	
  One	
  version	
  of	
  Twitter	
  APIs	
  
API  Status  Page	




•    https://dev.twitter.com/status	
  
•    https://dev.twitter.com/issues	
  
•    https://dev.twitter.com/discussions	
  
h5ps://dev.twi5er.com/status	




http://www.buzzfeed.com/tommywilhelm/google-­‐
users-­‐being-­‐total-­‐dicks-­‐about-­‐the-­‐twitter	
  
Open  This  First	
•  Install	
  pre-­‐req	
  as	
  per	
  the	
  setup	
  slide	
  
•  Run	
  	
  
    o  oscon2012_open_this_first.py	
  
    o  To	
  test	
  connectivity	
  –	
  “canary	
  query”	
  

•  Run	
  
    o  oscon2012_rate_limit_status.py	
  
    o  Use	
  http://www.epochconverter.com	
  to	
  check	
  reset_time	
  

•  Formats	
  xml,	
  json,	
  atom	
  &	
  rss	
  
Twitter	
  API	
  
                                                                                                                 Near-realtime,
                                                                                                                 High Volume	


                                                                                                                          Follow users,
Core Data,	

                 REST	
                                                           Streaming	
                topics, data
Core Twitter                                                                                                              mining	

Objects	

                                                                                                             Public	
  Streams	
  
                                     Seach &                                                                    User	
  Streams	
  
                                      Trend	

     Twitter	
                                                  Twitter	
                                        Site	
  Streams	
  
      REST	
                                                    Search	
                           Firehose	
  

                   Build	
  Profile	
                                          Keywords	
  
                     Create/Post	
  Tweets	
                                   Specific	
  User	
  
                       Reply	
                                                  Trends	
  
                       Favorite,	
  Re-­‐tweet	
                                  Rate	
  Limit	
  :	
  	
  
                            Rate	
  Limit	
  :	
  150/350	
                       	
  	
  	
  Complexity	
  &	
  Frequency	
  
Rate  Limit
Rate  Limits	
 •  By	
  API	
  type	
  &	
  Authentication	
  Mode	
  
         API	

          No authC	

           authC	

             Error	


REST	
             150/hr	
              350/hr	
         400	
  

Search	
           Complexity	
  &	
     -­‐N/A-­‐	
      420	
  
                   Frequency	
  

Streaming	
                              Upto	
  1%	
  

Fire	
  hose	
     none	
                none	
  
Rate  Limit  Header	
•  {	
  
•  "status":	
  "200	
  OK",	
  	
  
•  	
  	
  "vary":	
  "Accept-­‐Encoding",	
  	
  
•  	
  	
  "x-­‐frame-­‐options":	
  "SAMEORIGIN",	
  	
  
•  	
  	
  "x-­‐mid":	
  "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6",	
  	
  
•  	
  	
  "x-­‐ratelimit-­‐class":	
  "api",	
  	
  
•  	
  	
  "x-­‐ratelimit-­‐limit":	
  "150",	
  	
  
•  	
  	
  "x-­‐ratelimit-­‐remaining":	
  "149",	
  	
  
•  	
  	
  "x-­‐ratelimit-­‐reset":	
  "1340467358",	
  	
  
•  	
  	
  "x-­‐runtime":	
  "0.04144",	
  	
  
•  	
  	
  "x-­‐transaction":	
  "2b49ac31cf8709af",	
  	
  
•  	
  	
  "x-­‐transaction-­‐mask":	
  
   "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba"	
  
•  }	
  
Rate  Limit-­‐‑ed  Header	
•    {	
  
•    	
  	
  "cache-­‐control":	
  "no-­‐cache,	
  max-­‐age=300",	
  	
  
•    	
  	
  "content-­‐encoding":	
  "gzip",	
  	
  
•    	
  	
  "content-­‐length":	
  "150",	
  	
  
•    	
  	
  "content-­‐type":	
  "application/json;	
  charset=utf-­‐8",	
  	
  
•    	
  	
  "date":	
  "Wed,	
  04	
  Jul	
  2012	
  00:48:25	
  GMT",	
  	
  
•    	
  	
  "expires":	
  "Wed,	
  04	
  Jul	
  2012	
  00:53:25	
  GMT",	
  	
  
•    	
  	
  "server":	
  "tfe",	
  	
  
•    	
  	
  ”…	
  
•    	
  	
  "status":	
  "400	
  Bad	
  Request",	
  	
  
•    	
  	
  "vary":	
  "Accept-­‐Encoding",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐class":	
  "api",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐limit":	
  "150",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐remaining":	
  "0",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐reset":	
  "1341363230",	
  	
  
•    	
  	
  "x-­‐runtime":	
  "0.01126"	
  
•    }	
  
Rate  Limit  Example	
•  Run	
  
    o  oscon2012_rate_limit_02.py	
  

•  It	
  iterates	
  through	
  a	
  list	
  to	
  get	
  followers	
  	
  
•  List	
  is	
  2072	
  long	
  
•    {	
  
•    	
  	
  …	
  
•    	
  	
  "date":	
  "Wed,	
  04	
  Jul	
  2012	
  00:54:16	
  GMT",	
  	
  
•    "status":	
  "200	
  OK",	
  	
  
•    	
  	
  "vary":	
  "Accept-­‐Encoding",	
  	
  
•    	
  	
  "x-­‐frame-­‐options":	
  "SAMEORIGIN",	
  	
  
•    	
  	
  "x-­‐mid":	
  "f31c7278ef8b6e28571166d359132f152289c3b8",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐class":	
  "api",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐limit":	
  "150",	
  	
                           Last	
  time,	
  it	
  gave	
  me	
  5	
  min.	
  
                                                                                Now	
  the	
  reset	
  timer	
  is	
  1	
  
•    	
  	
  "x-­‐ratelimit-­‐remaining":	
  "147",	
  	
  
                                                                                hour	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐reset":	
  "1341366831",	
  	
  
                                                                                150	
  calls,	
  not	
  authenticated	
  
•    	
  	
  "x-­‐runtime":	
  "0.02768",	
  	
  
•    	
  	
  "x-­‐transaction":	
  "f1bafd60112dddeb",	
  	
  
•    	
  	
  "x-­‐transaction-­‐mask":	
  "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"	
  
•    }	
  
•    {	
  
•    	
  	
  "cache-­‐control":	
  "no-­‐cache,	
  max-­‐age=300",	
  	
  
•    	
  	
  "content-­‐encoding":	
  "gzip",	
  	
  
•    	
  	
  "content-­‐type":	
  "application/json;	
  charset=utf-­‐8",	
  	
  
•    	
  	
  "date":	
  "Wed,	
  04	
  Jul	
  2012	
  00:55:04	
  GMT",	
  	
  
                                                                                And  Rate  Limit  kicked-­‐‑in	
•    …	
  
•    "status":	
  "400	
  Bad	
  Request",	
  	
  
•    	
  	
  "transfer-­‐encoding":	
  "chunked",	
  	
  
•    	
  	
  "vary":	
  "Accept-­‐Encoding",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐class":	
  "api",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐limit":	
  "150",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐remaining":	
  "0",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐reset":	
  "1341366831",	
  	
  
•    	
  	
  "x-­‐runtime":	
  "0.01342"	
  
•    }	
  
API  with  OAuth	
•    {	
  
•    	
  	
  …	
  
•    	
  	
  "date":	
  "Wed,	
  04	
  Jul	
  2012	
  01:32:01	
  GMT",	
  	
  
•    	
  	
  "etag":	
  ""dd419c02ed00fc6b2a825cc27wbe040"",	
  	
  
•    	
  	
  "expires":	
  "Tue,	
  31	
  Mar	
  1981	
  05:00:00	
  GMT",	
  	
  
•    	
  	
  "last-­‐modified":	
  "Wed,	
  04	
  Jul	
  2012	
  01:32:01	
  GMT",	
  	
  
•    	
  	
  "pragma":	
  "no-­‐cache",	
  	
  
•    	
  	
  "server":	
  "tfe",	
  	
  
•    …	
  
•    "status":	
  "200	
  OK",	
  	
  
•    	
  	
  "vary":	
  "Accept-­‐Encoding",	
  	
  
•    	
  	
  "x-­‐access-­‐level":	
  "read",	
  	
  
•    	
  	
  "x-­‐frame-­‐options":	
  "SAMEORIGIN",	
  	
  
•    	
  	
  "x-­‐mid":	
  "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐class":	
  "api_identified",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐limit":	
  "350",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐remaining":	
  "349",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐reset":	
  "1341369121",	
  	
  
•    	
  	
  "x-­‐runtime":	
  "0.05539",	
  	
                                                  OAuth	
  
• 
• 
     	
  	
  "x-­‐transaction":	
  "9f8508fe4c73a407",	
  	
  
     	
  	
  "x-­‐transaction-­‐mask":	
  "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"	
  
                                                                                            “api-­‐identified”	
  
•    }	
                                                                                       1	
  hr	
  reset	
  
                                                                                               350	
  calls	
  
•    {	
  
•    	
  	
  …	
  
•    	
  	
  "date":	
  "Thu,	
  05	
  Jul	
  2012	
  14:56:05	
  GMT",	
  	
  
•    …	
  
•    	
  	
  "x-­‐ratelimit-­‐class":	
  "api_identified",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐limit":	
  "350",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐remaining":	
  "133",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐reset":	
  "1341500165",	
  	
  
•    	
  …	
                                                                               Rate  Limit  resets  during  
•    }	
                                                                                      consecutive  calls	
•    ********	
  2416	
  
•    {	
  
                                                                                   +1  
•    …	
                                                                          hour	
•    	
  	
  "date":	
  "Thu,	
  05	
  Jul	
  2012	
  14:56:18	
  GMT",	
  	
  
•    …	
  
•    	
  	
  "status":	
  "200	
  OK",	
  	
  
•    	
  	
  ….	
  
•    	
  	
  "x-­‐ratelimit-­‐class":	
  "api_identified",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐limit":	
  "350",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐remaining":	
  "349",	
  	
  
•    	
  	
  "x-­‐ratelimit-­‐reset":	
  "1341503776",	
  	
  
•    ********	
  2417	
  
Unexplained Errors

• Traceback (most recent call last):
•   File "oscon2012_get_user_info_01.py", line 39, in <module>
•     r = client.get(url, params=payload)
•   File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
•   File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
•   File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
• requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C…%2C268531201

While trying to get details of 1,000,000 users, I get this error – usually 10-6 AM PST
Got around by "Trap & wait 5 seconds"
Night runs are relatively error free
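The "trap & wait 5 seconds" workaround can be sketched as a small retry wrapper. This is a minimal sketch, not the tutorial's code: the `fetch` callable, retry count, and wait time are assumptions you would tune for your own runs.

```python
import time

def get_with_retry(fetch, max_retries=3, wait_secs=5):
    """Call fetch(); on a transient connection error, wait and retry
    ("trap & wait 5 seconds"). Re-raises after the last attempt."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise                     # give up: error was not transient
            time.sleep(wait_secs)         # trap & wait before retrying
```

Wrapping the `client.get(...)` call in a lambda and passing it as `fetch` keeps the retry logic independent of any particular endpoint.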
  
A Day in the life of Twitter Rate Limit

• {
• …
•   "date": "Fri, 06 Jul 2012 03:41:09 GMT",
•   "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
•   "server": "tfe",
•   "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
•   "status": "400 Bad Request",
•   "vary": "Accept-Encoding",
•   "x-ratelimit-class": "api_identified",
•   "x-ratelimit-limit": "350",
•   "x-ratelimit-remaining": "0",                <- Missed by 4 min!
•   "x-ratelimit-reset": "1341546334",
•   "x-runtime": "0.01918"
• }
• Error, sleeping
• {
• …
•   "date": "Fri, 06 Jul 2012 03:46:12 GMT",
• …
•   "status": "200 OK",
• …
•   "x-ratelimit-class": "api_identified",
•   "x-ratelimit-limit": "350",
•   "x-ratelimit-remaining": "349",              <- OK after 5 min sleep
• …
Strategies

I have no exotic strategies, so far!

1. Obvious: track elapsed time & sleep when the rate limit kicks in
2. Combine authenticated & non-authenticated calls
3. Use multiple API types
4. Cache
5. Store & get only what is needed
6. Checkpoint & buffer request commands
7. Distributed data parallelism – for example, AWS instances

http://www.epochconverter.com/ <- useful to debug the timer

Pl share your tips and tricks for conserving the Rate Limit
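Strategy 1 can be sketched directly from the x-ratelimit-* headers shown in the dumps above. `seconds_until_reset` is a hypothetical helper name, not part of any Twitter library; it just turns the headers into a sleep duration:

```python
def seconds_until_reset(headers, now_epoch):
    """Given Twitter's x-ratelimit-* response headers, return how many
    seconds to sleep before the next call (0 if calls remain)."""
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0                                  # budget left, keep going
    reset = int(headers["x-ratelimit-reset"])     # epoch seconds, as seen above
    return max(reset - now_epoch, 0)              # never sleep a negative time
```

Pair it with `time.time()` and `time.sleep()` in the crawl loop; epochconverter.com helps sanity-check the reset timestamps by hand.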
Authentication

Authentication

• Three modes
   o Anonymous
   o HTTP Basic Auth
   o OAuth
• As of Aug 31, 2010, only Anonymous or OAuth are supported
• OAuth enables the user to authorize an application without sharing credentials
• It also gives the user the ability to revoke access
• Twitter supports OAuth 1.0a
• OAuth 2.0 is the new standard, much simpler
   o No timeframe for Twitter support, yet
OAuth Pragmatics

• Helpful links
   o https://dev.twitter.com/docs/auth/oauth
   o https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
   o https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
   o http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
• A discussion of OAuth's internal mechanisms is better left for another day
• For headless applications, get an OAuth token at https://dev.twitter.com/apps
• Create an application & get four credential pieces
   o Consumer Key, Consumer Secret, Access Token & Access Token Secret
• All the frameworks have support for OAuth; so plug in these values & use the framework's calls
• I used the requests-oauth library like so:
request-oauth

def get_oauth_client():
    # Get a client using the token, key & secret from dev.twitter.com/apps
    consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
    consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
    access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
    access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
    header_auth = True
    oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key, consumer_secret, header_auth)
    client = requests.session(hooks={'pre_request': oauth_hook})
    return client

def get_followers(user_id):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}  # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = requests.get(url, params=payload)
    return r

def get_followers_with_oauth(user_id, client):
    # Use the client instead of requests
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}  # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = client.get(url, params=payload)
    return r

Ref: http://pypi.python.org/pypi/requests-oauth
OAuth Authorize screen

• The user authenticates with Twitter & grants access to Forbes Social
• Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account
HTTP Status Codes

HTTP Status Codes

• 0 Never made it to Twitter servers - library error
• 200 OK
• 304 Not Modified
• 400 Bad Request
   o Check error message for explanation
   o REST Rate Limit!
• 401 Unauthorized
   o Beware – you could get this for other reasons as well
• 403 Forbidden
   o Hit update limit (> max Tweets/day, following too many people)
• 404 Not Found
• 406 Not Acceptable
• 413 Too Long
• 416 Range Unacceptable
• 420 Enhance Your Calm
   o Rate limited
• 500 Internal Server Error
• 502 Bad Gateway
   o Down for maintenance
• 503 Service Unavailable
   o Overloaded "Fail whale"
• 504 Gateway Timeout
   o Overloaded

https://dev.twitter.com/docs/error-codes-responses
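A crawl loop usually only needs to know which of these codes are worth a backoff-and-retry. The table below is a sketch of one reasonable split; the exact membership of the retryable set is a judgment call, not something the Twitter docs prescribe:

```python
# Transient codes worth a backoff-and-retry; anything else usually means
# the request itself must change (bad params, auth, missing resource).
RETRYABLE = {420, 500, 502, 503, 504}

def should_retry(status_code):
    """True if the same request may succeed later without modification."""
    return status_code in RETRYABLE
```

A 420 additionally signals the rate limit, so the sleep before retrying should honor x-ratelimit-reset rather than a fixed delay.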
HTTP Status Code - Example

• {
•   "cache-control": "no-cache, max-age=300",
•   "content-encoding": "gzip",
•   "content-length": "91",
•   "content-type": "application/json; charset=utf-8",
•   "date": "Sat, 23 Jun 2012 00:06:56 GMT",
•   "expires": "Sat, 23 Jun 2012 00:11:56 GMT",
•   "server": "tfe",
• …
•   "status": "401 Unauthorized",
•   "vary": "Accept-Encoding",
•   "www-authenticate": "OAuth realm="https://api.twitter.com"",
•   "x-ratelimit-class": "api",
•   "x-ratelimit-limit": "0",
•   "x-ratelimit-remaining": "0",
•   "x-ratelimit-reset": "1340413616",
•   "x-runtime": "0.01997"
• }
• {
•   "errors": [
•     {
•       "code": 53,
•       "message": "Basic authentication is not supported"
•     }
•   ]
• }

Detailed error message in JSON! I like this.
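Since the detailed message arrives as JSON, pulling it out takes only a few lines. `error_messages` is a hypothetical helper shown for illustration, not part of any library:

```python
import json

def error_messages(body):
    """Extract (code, message) pairs from a Twitter JSON error body.
    Returns an empty list when the body carries no "errors" array."""
    doc = json.loads(body)
    return [(e["code"], e["message"]) for e in doc.get("errors", [])]
```

Logging these pairs next to the HTTP status code makes the "confusing" cases on the following slides much easier to diagnose after a long run.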
  
HTTP Status Code – Confusing Example

• GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
• Spelling mistake
   o Should be screen_name
• But a confusing error!
• Should be 406 Not Acceptable or 413 Too Long, showing a parameter error

• { … "pragma": "no-cache", "server": "tfe", … "status": "404 Not Found", … }
• {
•   "errors": [
•     {
•       "code": 34,
•       "message": "Sorry, that page does not exist"
•     }
•   ]
• }
  
HTTP Status Code - Example

Sometimes the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12 – clearly a parameter error (i.e. too many parameters), yet it is reported as 401 Unauthorized.

• {
•   "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
•   "content-encoding": "gzip",
•   "content-length": "112",
•   "content-type": "application/json;charset=utf-8",
•   "date": "Sat, 23 Jun 2012 01:23:47 GMT",
•   "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
• …
•   "status": "401 Unauthorized",
•   "www-authenticate": "OAuth realm="https://api.twitter.com"",
•   "x-frame-options": "SAMEORIGIN",
•   "x-ratelimit-class": "api",
•   "x-ratelimit-limit": "150",
•   "x-ratelimit-remaining": "147",
•   "x-ratelimit-reset": "1340417742",
•   "x-transaction": "d545a806f9c72b98"
• }
• {
•   "error": "Not authorized",
•   "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
• }
  
Objects

[Diagram: Twitter Platform Objects – Users follow / are followed by other Users (Friends / Followers); Users post Status Updates (Tweets); Tweets embed Entities (@ user_mentions, urls, media, # hashtags) and Places; temporally ordered Tweets form a Timeline]

https://dev.twitter.com/docs/platform-objects
Tweets

• A.k.a Status Updates
• Interesting fields
   o coordinates <- geo location
   o created_at
   o entities (will see later)
   o id, id_str
   o possibly_sensitive
   o user (will see later)
      • perspectival attributes embedded within a child object of an unlike parent – hard to maintain at scale
      • https://dev.twitter.com/docs/faq#6981
   o withheld_in_countries
      • https://dev.twitter.com/blog/new-withheld-content-fields-api-responses

https://dev.twitter.com/docs/platform-objects/tweets
A word about id, id_str

• June 1, 2010
   o Snowflake, the id generator service
   o "The full ID is composed of a timestamp, a worker number, and a sequence number"
   o Had problems with JavaScript handling numbers > 53 bits
   o "id": 819797
   o "id_str": "819797"

http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
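The 53-bit problem can be demonstrated without JavaScript by forcing an id through a double-precision float, which is what a JavaScript Number is. A small sketch (the large id here is an arbitrary value above 2**53, not a real tweet id):

```python
import json

# The deck's small example id survives anywhere; Python ints are
# arbitrary precision, so "id" and "id_str" always agree in Python.
tweet = json.loads('{"id": 819797, "id_str": "819797"}')
assert tweet["id"] == int(tweet["id_str"])

# A JavaScript Number is an IEEE-754 double with a 53-bit mantissa;
# mimicking that with float() shows why clients should read id_str:
big_id = 2**53 + 1                     # any id above 2**53 is at risk
assert int(float(big_id)) != big_id    # precision silently lost
```

Snowflake ids routinely exceed 53 bits, which is exactly why every object carries both fields.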
Tweets - example

• Let us run oscon2012-tweets.py
• Example of a tweet
   o coordinates
   o id
   o id_str
Users

• followers_count
• geo_enabled
• id, id_str
• name, screen_name
• protected
• status, statuses_count
• withheld_in_countries

https://dev.twitter.com/docs/platform-objects/users
Users – Let us run some examples

• Run
   o oscon_2012_users.py
      • Lookup users by screen_name
   o oscon12_first_20_ids.py
      • Lookup users by user_id
• Inspect the results
   o id, name, status, status_count, protected, followers (for top 10 followers), withheld users
• Can use the information for customizing the user's screen in your web app
Entities

• Metadata & contextual information
• You can parse them yourself, but Entities parse them out as structured data
• REST API/Search API – include_entities=1
• Streaming API – included by default
• hashtags, media, urls, user_mentions

https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
Entities

• Run
   o oscon2012_entities.py
• Inspect hashtags, urls et al
Places

• attributes
• bounding_box
• id (as a string!)
• country
• name

https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes
Places

• Can search for tweets near a place, like so:
• Get the latlong of the convention center [45.52929, -122.66289]
   o Tweets near that place
• Tweets near San Jose [37.395715, -122.102308]
• We will not go further here, but it is very useful
Timelines

• Collections of tweets ordered by time
• Use max_id & since_id for navigation

https://dev.twitter.com/docs/working-with-timelines
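The max_id navigation can be sketched as a loop. `fetch_page` here is a stand-in for a real timeline call (e.g. statuses/user_timeline.json with a max_id parameter), injected so the logic runs offline; the fake data below is made up:

```python
def paginate_timeline(fetch_page):
    """Walk a timeline backwards with max_id.
    fetch_page(max_id) must return tweets with id <= max_id
    (or the newest page when max_id is None)."""
    tweets, max_id = [], None
    while True:
        page = fetch_page(max_id)
        if not page:
            return tweets
        tweets.extend(page)
        max_id = min(t["id"] for t in page) - 1   # step below the oldest seen

# Simulated pages of at most 2 tweets, newest first:
def fake_fetch(max_id, _data=[{"id": i} for i in (5, 4, 3, 2, 1)]):
    page = [t for t in _data if max_id is None or t["id"] <= max_id]
    return page[:2]

print([t["id"] for t in paginate_timeline(fake_fetch)])   # → [5, 4, 3, 2, 1]
```

Decrementing past the minimum id seen guarantees no tweet is fetched twice, which matters when tweets are being added while you page.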
Other Objects & APIs

• Lists
• Notifications
• friendships/exists to see if one user follows the other
[Diagram recap: Twitter Platform Objects – Users, Tweets, Entities, Places, Timeline]

https://dev.twitter.com/docs/platform-objects
Hands-on Exercise (15 min)

• Setup environment – slide #14
• Sanity check environment & libraries
   o oscon2012_open_this_first.py
   o oscon2012_rate_limit_status.py
• Get objects (show calls)
   o Lookup users by screen_name - oscon12_users.py
   o Lookup users by id - oscon12_first_20_ids.py
   o Lookup tweets - oscon12_tweets.py
   o Get entities - oscon12_entities.py
• Inspect the results
• Explore a little bit
• Discussion
Twitter APIs

[Diagram: Twitter API –
  REST: core data, core Twitter objects; build profile, create/post tweets, reply, favorite, re-tweet; Rate Limit: 150/350
  Search & Trend: keywords, specific user, trends; Rate Limit: complexity & frequency
  Streaming: near-realtime, high volume; follow users, topics, data mining; Public Streams, User Streams, Site Streams, Firehose]
Twitter REST API

• https://dev.twitter.com/docs/api
• What we have been doing so far is the REST API
• Request-response
• Anonymous or OAuth
• Rate limited:
   o 150/350 (anonymous/authenticated calls per hour)
Twitter Trends

• oscon2012-trends.py
• Trends/weekly, Trends/monthly
• Let us run some examples
   o oscon2012_trends_daily.py
   o oscon2012_trends_weekly.py
• Trends & hashtags
   o #hashtag euro2012
   o http://hashtags.org/euro2012
   o http://sproutsocial.com/insights/2011/08/twitter-hashtags/
   o http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
   o Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/
Brand Rank w/ Twitter

• Walk through & results of the following
   o oscon2012_brand_01.py
• Followed 10 user-brands for a few days to find growth
• Brand Rank
   o Growth of a brand w.r.t. the industry
   o A surge in popularity could be due to -ve or +ve buzz; need to understand & correlate using Twitter APIs & metrics
• API: url='https://api.twitter.com/1/users/lookup.json'
• payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
Brand Rank w/ Twitter

[Chart: follower counts over time – Clouderati is very stable]
Brand Rank w/ Twitter - Tech Brands

• Google I/O showed a spike on 6/27-6/28
• OReillyMedia shares some of that spike
• Looking at a few days' worth of data, our best inference is that "oscon doesn't track with googleio"
• "Clouderati doesn't track at all"
Brand Rank w/ Twitter - World of Soccer

• FOXSoccer & UEFAcom track each other

The numbers seldom decrease, so calculating -ve velocity will not work. OTOH, if you do see a -ve velocity, investigate.
Brand Rank w/ Twitter - World of Basketball

• NBA, MiamiHeat, okcthunder track each other
• Used % rather than absolute numbers to compare
• The hike from 7/6 to 7/10 is interesting
Brand Rank w/ Twitter - Rising Tide …

• For some reason, all the numbers are going up 7/6 thru 7/10 – except for clouderati!
• Is a rising (Twitter) tide lifting all (well, almost all)?
Trivia : Search API

• Search (search.twitter.com)
   o Built by Summize, which was acquired by Twitter in 2008
   o Summize described itself as "sentiment mining"
Search API

• Very simple
   o GET http://search.twitter.com/search.json?q=<blah>
• Based on a search criteria
• "The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
• Recent = the last 6-9 days' worth of tweets
• Anonymous call
• Rate Limit
   o Not the no. of calls/hour, but complexity & frequency

https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search
Search API

• Filters
   o Search URL encoded
   o @ = %40, # = %23
   o Emoticons :) and :(
   o http://search.twitter.com/search.atom?q=sometimes+%3A)
   o http://search.twitter.com/search.atom?q=sometimes+%3A(
• Location filters, date filters
• Content searches
Streaming API

• Not request-response, but a stream
• The Twitter frameworks have support for it
• Rate Limit: up to 1%
• Stall warning if the client is falling behind
• Good documentation links
   o https://dev.twitter.com/docs/streaming-apis/connecting
   o https://dev.twitter.com/docs/streaming-apis/parameters
   o https://dev.twitter.com/docs/streaming-apis/processing
Firehose

• ~400 million public tweets/day
• If you are working with the Twitter firehose, I envy you!
• If you hit real limits, then explore the firehose route
• AFAIK, it is not cheap, but worth it
API Best Practices

1. Use JSON
2. Use user_id rather than screen_name
   o user_id is constant while screen_name can change
3. max_id and since_id
   o For example, for direct messages: if you have the last message, use since_id for the search
   o max_id controls how far to go back
4. Cache as much as you can
5. Set the User-Agent header for debugging

I have listed a few good blogs that have API best practices in the reference section, at the end of this presentation.
These are gathered from various books, blogs & other media I used for this tutorial. See References (at the end) for the sources.
[Diagram recap: Twitter API – REST, Search & Trend, Streaming, Firehose]

Questions?
Part II
SNA – Twitter Network Analysis
Most important & the ugliest slide in this deck!

[Diagram: the analysis pipeline – 1. Collect -> 2. Store -> 3. Transform & Analyze -> 4. Model -> 5. Predict, Reason, Recommend & Visualize; validate the dataset & re-crawl/refresh]

Tip 1: Implement as a staged pipeline, never a monolith
Tip 3: Keep the schema simple; don't be afraid to transform
Trivia

•  Social Network Analysis originated as Sociometry & the social network was called a sociogram
•  Back then, Facebook was called SocioBinder !
•  Jacob Levy Moreno is considered the originator
    o  NYTimes, April 3, 1933, P. 17
Twitter Networks – Definitions

•  Nodes
    o  Users
    o  #tags
•  Edges
    o  Follows
    o  Friends
    o  @mentions
    o  #tags
•  Directed
Twitter Networks – Definitions

•  In-degree
    o  Followers
•  Out-degree
    o  Friends/Follow
•  Centrality measures
•  Hubs & Authorities
    o  Hubs/directories tell us where the authorities are
    o  "Of Mortals & Celebrities" is more "Twitter-style"
Twitter Networks – Properties

•  Concepts from citation networks
    o  Cocitation
        •  Common papers that cite a paper
        •  Common followers
            o  C & G (followed by F & H)
    o  Bibliographic coupling
        •  Cite the same papers
        •  Common friends (i.e. follow the same person)
            o  D, E, F & H
(Diagram: example follow graph over nodes A–N)
Twitter Networks – Properties

•  Concepts from citation networks
    o  Cocitation
        •  Common papers that cite a paper
        •  Common followers
            o  C & G (followed by F & H)
    o  Bibliographic coupling
        •  Cite the same papers
        •  Common friends (i.e. follow the same person)
            o  D, E, F & H follow C
            o  H & F follow C & G
                •  So H & F have high coupling
                •  Hence, if H follows A, we can recommend F to follow A
(Diagram: the same follow graph over nodes A–N)
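The coupling-based recommendation above fits in a few lines of plain Python. `recommend_by_coupling`, the `friends` dict-of-sets and the `min_overlap` threshold are illustrative choices of mine, not the tutorial's code:

```python
def recommend_by_coupling(friends, user, min_overlap=2):
    """Bibliographic coupling over a follow graph: `friends` maps each
    user to the set of accounts they follow. Users whose friend sets
    overlap heavily with `user` are 'coupled'; accounts they follow
    that `user` doesn't yet follow become recommendations."""
    mine = friends.get(user, set())
    recs = set()
    for other, theirs in friends.items():
        if other == user:
            continue
        if len(mine & theirs) >= min_overlap:   # strongly coupled peer
            recs |= theirs - mine               # what they follow that we don't
    return recs
```

Replaying the slide's example: with H following {C, G, A} and F following {C, G}, H and F are coupled on {C, G}, so A is recommended to F.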
Twitter Networks – Properties

•  Bipartite/Affiliation Networks
    o  Two disjoint subsets
    o  The bipartite concept is very relevant to the Twitter social graph
    o  Membership in lists
        •  lists vs. users bipartite graph
    o  Common #tags in tweets
        •  #tags vs. members bipartite graph
    o  @mentioned together
        •  ? Can this be a bipartite graph
        •  ? How would we fold this ?
Other Metrics & Mechanisms

•  Kronecker Graph Models
    o  The Kronecker product is a way of generating self-similar matrices
    o  Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
    o  Application: generating models for analysis, prediction, anomaly detection et al
•  Erdős–Rényi Random Graphs
    o  Easy to build a G(n,p) graph
    o  Assumes equal likelihood of edges between two nodes
    o  In a Twitter social network, we can create a more realistic expected distribution (adding the "social reality" dimension) by inspecting the #tags & @mentions
•  Network Diameter
•  Weak Ties
•  Follower velocity (+ve & -ve), association strength
    o  Unfollow is not a reliable measure
    o  But an interesting property to investigate when it happens

Not covered here, but potential for an encore !
Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
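A G(n,p) null model really is only a few lines; for real work NetworkX's `gnp_random_graph` does the same job. A hedged stdlib sketch, useful as the "equal likelihood" baseline to compare a real Twitter graph against:

```python
import random
from itertools import combinations

def gnp_graph(n, p, seed=None):
    """Erdős–Rényi G(n,p): each of the n*(n-1)/2 possible undirected
    edges exists independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2)
            if rng.random() < p]
```

Comparing clustering or degree distribution of the crawled graph against this baseline is what exposes the "social reality" dimension the slide mentions.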
Twitter Networks – Properties

•  Twitter != LinkedIn, Twitter != Facebook
•  Twitter Network == Interest Network
•  Be cognizant of the above when you apply traditional network properties to Twitter
•  For example,
    o  Six degrees of separation doesn't make sense (most of the time) in Twitter – except maybe for cliques
    o  Is diameter a reliable measure for a Twitter network ?
        •  Probably not
    o  Do cut sets make sense ?
        •  Probably not
    o  But citation network principles do apply; we can learn from cliques
    o  Bipartite graphs do make sense
Cliques (1 of 2)

•  "Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other"
•  Cohesive subgroup, closely connected
•  Near-cliques rather than a perfect clique (k-plex, i.e. connected to at least n-k others)
•  k-plex cliques to discover subgroups in a sparse network; a 1-plex being the perfect clique

Ref: Networks, An Introduction – Newman
Cliques (2 of 2)

•  k-core – at least k others in the subset; an (n-k)-plex
•  k-clique – no more than k distance away
    o  Path inside or outside the subset
    o  k-clan or k-club (path inside the subset)
•  We will apply k-plex cliques for one of our hands-on exercises

Ref: Networks, An Introduction – Newman
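Maximal-clique enumeration (the starting point for the k-plex hands-on) is usually delegated to NetworkX's `find_cliques`; the underlying Bron–Kerbosch recursion fits in a dozen lines. Sketched here on a dict-of-sets graph as an assumption of mine, without the pivoting a production version would add:

```python
def maximal_cliques(adj):
    """Bron–Kerbosch enumeration of maximal cliques in an undirected
    graph given as {node: set(neighbors)}."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)           # r is maximal: nothing left to add
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}                 # v handled; move it to the excluded set
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques
```

On a triangle {a, b, c} with a pendant d attached to c, this yields exactly the two maximal cliques {a, b, c} and {c, d}.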
Sentiment Analysis

•  Sentiment Analysis is an important & interesting line of work on the Twitter platform
    o  Collect tweets
    o  Opinion estimation – pass through a classifier, sentiment lexicons
        •  Naïve Bayes/Max Entropy classifier/SVM
    o  Aggregated text sentiment/moving average
•  I chose not to dive deeper because of time constraints
    o  Couldn't do justice to API, Social Network and Sentiment Analysis, all in 3 hrs
•  The next 3 slides have a couple of interesting examples
Sentiment Analysis

•  Twitter Mining for Airline Sentiment
•  Opinion Lexicon – +ve 2000, -ve 4800 words

http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
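The lexicon approach in the airline example boils down to counting hits against the positive/negative word lists. A toy sketch — the word lists in the test are stand-ins for the real ~2000/~4800-entry opinion lexicon, and the tokenization is deliberately naive:

```python
def lexicon_sentiment(tweet, pos_words, neg_words):
    """Score a tweet against an opinion lexicon: +1 per positive word,
    -1 per negative word. Positive totals lean positive, negative lean
    negative; a moving average over many tweets gives the trend."""
    words = [w.strip("#@.,!?").lower() for w in tweet.split()]
    return (sum(w in pos_words for w in words)
            - sum(w in neg_words for w in words))
```

Aggregating these per-tweet scores over time is the "moving average" step from the previous slide.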
Need I say more ?

"A bit of clever math can uncover interesting patterns that are not visible to the human eye"

http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf
Project Ideas
Interesting Vectors of Exploration

1.  Find trending #tags & then related #tags – using cliques over co-#tag-citation, which infers topics related to trending topics
2.  Related #tag topics over a set of tweets by a user or group of users
3.  Analysis of in/out flow, tweet flow
    –  Frequent @mentions
4.  Find affiliation networks by list memberships, #tags or frequent @mentions
Interesting Vectors of Exploration

5.  Use centrality measures to determine mortals vs. celebrities
6.  Classify tweet networks/cliques based on message-passing characteristics
    –  Tweets vs. retweets, no. of retweets,…
7.  Retweet network
    –  Measure influence by retweet count & frequency
    –  Information contagion by looking at different retweet network subcomponents – who, when, how much,…
Twitter Network Graph Analysis
An Example
Analysis Story Board

•  @clouderati is a popular cloud-related Twitter account
•  Goals:
    o  Analyze the social graph characteristics of the users who are following the account
        •  Dig one level deep, to the followers & friends of the followers of @clouderati  [in this tutorial]
    o  How many cliques ? How strong are they ?
    o  Does the @mention network support the clique inferences ?
    o  What are the retweet characteristics ?  [for you to explore !!]
    o  What does the #tag network graph look like ?  [for you to explore !!]
Twitter Analysis Pipeline Story Board
Stages, Strategies, APIs & Tasks

•  Stage 3
    o  Get the distinct user list by applying the set(union(list)) operation
•  Stage 4
    o  Get & store user details (distinct user list)
    o  Unroll
    o  Note: Needed a command buffer to manage scale
•  Stage 5
    o  For each @clouderati follower, find the friend = follower intersection set
    o  Note: The unroll stage took time & missteps (~980,000 users)
•  Stage 6
    o  Create the social graph
    o  Apply network theory
    o  Infer cliques & other properties
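The set operations driving stages 3 & 5 are plain Python; these helper names are mine, not the pipeline's, and the real id lists come out of Mongo rather than literals:

```python
def distinct_users(follower_lists):
    """Stage-3 style set(union(...)) over per-user follower/friend id
    lists: collapses the crawled lists into one distinct user list."""
    distinct = set()
    for ids in follower_lists:
        distinct |= set(ids)        # union accumulates without duplicates
    return distinct

def mutual_follows(friends, followers):
    """Stage-5 style friend = follower intersection set: accounts the
    user both follows and is followed by (mutual ties)."""
    return set(friends) & set(followers)
```

It is exactly this union step that blows 2072 followers up into ~980,000 distinct users one level down.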
@clouderati Twitter Social Graph

•  Stats (retrospect after the runs):
    o  Stage 1
        •  @clouderati has 2072 followers
    o  Stage 2
        •  Limiting followers to 5,000 per user
    o  Stage 3
        •  Digging the 1st level (set union of followers & friends of the followers of @clouderati) explodes into ~980,000 distinct users
    o  MongoDB of the cache and intermediate datasets ~10 GB
    o  The database was hosted at AWS (Hi-Mem XLarge – m2.xlarge), 8 x 15 GB, RAID 10, opened to the Internet with DB authentication
Code & Run Walk Through

Stage 1
o  Get @clouderati followers
o  Store in MongoDB

o  Code:
    §  oscon_2012_user_list_spider_01.py
o  Challenges:
    §  Nothing fancy
    §  Get the record and store it
    §  Would have had to recurse through a REST cursor if there were more than 5000 followers
    §  @clouderati has 2072 followers
o  Interesting Points:
Code & Run Walk Through

Stage 2
o  Crawl 1 level deep
o  Get friends & followers
o  Validate, re-crawl & refresh

o  Code:
    §  oscon_2012_user_list_spider_02.py
    §  oscon_2012_twitter_utils.py
    §  oscon_2012_mongo.py
    §  oscon_2012_validate_dataset.py
o  Challenges:
    §  Multiple runs, errors et al !
o  Interesting Points:
    §  Set operation between two mongo collections for the restart buffer
    §  Protected users; some had 0 followers, or 0 friends
    §  Interesting operations for validate, re-crawl and refresh
    §  Added "status_code" to differentiate protected users
        •  {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}}
    §  Getting friends & followers of 2000 users is the hardest (or so I thought, until I got through the next stage!)
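The restart buffer from the "Interesting Points" above reduces to a set difference. In the real pipeline both id sets would be pulled from the two Mongo collections; here they are plain sequences, and the function name is my own:

```python
def restart_buffer(all_ids, done_ids, protected_ids=()):
    """Ids still to crawl after an interrupted run: the full user list
    minus those already fetched and those marked protected (the users
    tagged with a 401 status_code)."""
    return set(all_ids) - set(done_ids) - set(protected_ids)
```

Recomputing this buffer at startup is what lets the multi-run, error-prone stage-2 crawl resume where it left off instead of starting over.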
The Art of Social Media Analysis with Twitter & Python-OSCON 2012

More Related Content

What's hot

Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
Trey Grainger
 
Just the basics_strata_2013
Just the basics_strata_2013Just the basics_strata_2013
Just the basics_strata_2013
Ken Mwai
 
Balancing the Dimensions of User Intent
Balancing the Dimensions of User IntentBalancing the Dimensions of User Intent
Balancing the Dimensions of User Intent
Trey Grainger
 

What's hot (20)

R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science Competitions
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
[DSC x TAAI 2016] 林守德 / 人工智慧與機器學習在推薦系統上的應用
[DSC x TAAI 2016] 林守德 / 人工智慧與機器學習在推薦系統上的應用[DSC x TAAI 2016] 林守德 / 人工智慧與機器學習在推薦系統上的應用
[DSC x TAAI 2016] 林守德 / 人工智慧與機器學習在推薦系統上的應用
 
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
 
Data Scientist 101 BI Dutch
Data Scientist 101 BI DutchData Scientist 101 BI Dutch
Data Scientist 101 BI Dutch
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AI
 
The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge Graph
 
Search, Signals & Sense: An Analytics Fueled Vision
Search, Signals & Sense: An Analytics Fueled VisionSearch, Signals & Sense: An Analytics Fueled Vision
Search, Signals & Sense: An Analytics Fueled Vision
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search System
 
[系列活動] 文字探勘者的入門心法
[系列活動] 文字探勘者的入門心法[系列活動] 文字探勘者的入門心法
[系列活動] 文字探勘者的入門心法
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchThought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered Search
 
Measuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceMeasuring Relevance in the Negative Space
Measuring Relevance in the Negative Space
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered Search
 
Just the basics_strata_2013
Just the basics_strata_2013Just the basics_strata_2013
Just the basics_strata_2013
 
Balancing the Dimensions of User Intent
Balancing the Dimensions of User IntentBalancing the Dimensions of User Intent
Balancing the Dimensions of User Intent
 
Convolutional Neural Networks and Natural Language Processing
Convolutional Neural Networks and Natural Language ProcessingConvolutional Neural Networks and Natural Language Processing
Convolutional Neural Networks and Natural Language Processing
 
Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)
 

Similar to The Art of Social Media Analysis with Twitter & Python-OSCON 2012

PinTrace Advanced AWS meetup
PinTrace Advanced AWS meetup PinTrace Advanced AWS meetup
PinTrace Advanced AWS meetup
Suman Karumuri
 

Similar to The Art of Social Media Analysis with Twitter & Python-OSCON 2012 (20)

Mike davies sentiment_analysis_presentation_backup
Mike davies sentiment_analysis_presentation_backupMike davies sentiment_analysis_presentation_backup
Mike davies sentiment_analysis_presentation_backup
 
Building Interpretable & Secure AI Systems using PyTorch
Building Interpretable & Secure AI Systems using PyTorchBuilding Interpretable & Secure AI Systems using PyTorch
Building Interpretable & Secure AI Systems using PyTorch
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
apidays LIVE Australia 2021 - Tracing across your distributed process boundar...
apidays LIVE Australia 2021 - Tracing across your distributed process boundar...apidays LIVE Australia 2021 - Tracing across your distributed process boundar...
apidays LIVE Australia 2021 - Tracing across your distributed process boundar...
 
Social Media Data Collection & Analysis
Social Media Data Collection & AnalysisSocial Media Data Collection & Analysis
Social Media Data Collection & Analysis
 
Abhishek Training PPT.pptx
Abhishek Training PPT.pptxAbhishek Training PPT.pptx
Abhishek Training PPT.pptx
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
 
MuCon 2019: Exploring Your Microservices Architecture Through Network Science...
MuCon 2019: Exploring Your Microservices Architecture Through Network Science...MuCon 2019: Exploring Your Microservices Architecture Through Network Science...
MuCon 2019: Exploring Your Microservices Architecture Through Network Science...
 
PinTrace Advanced AWS meetup
PinTrace Advanced AWS meetup PinTrace Advanced AWS meetup
PinTrace Advanced AWS meetup
 
discopen
discopendiscopen
discopen
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in Wakari
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the Switch
 
New york data brewery meetup #1 – introduction
New york data brewery meetup #1 – introductionNew york data brewery meetup #1 – introduction
New york data brewery meetup #1 – introduction
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
 
Big Data with IOT approach and trends with case study
Big Data with IOT approach and trends with case studyBig Data with IOT approach and trends with case study
Big Data with IOT approach and trends with case study
 
Agile Data Rationalization for Operational Intelligence
Agile Data Rationalization for Operational IntelligenceAgile Data Rationalization for Operational Intelligence
Agile Data Rationalization for Operational Intelligence
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Connecting the Dots—How a Graph Database Enables Discovery
Connecting the Dots—How a Graph Database Enables DiscoveryConnecting the Dots—How a Graph Database Enables Discovery
Connecting the Dots—How a Graph Database Enables Discovery
 

More from OSCON Byrum

Big Data for each one of us
Big Data for each one of usBig Data for each one of us
Big Data for each one of us
OSCON Byrum
 
Declarative web data visualization using ClojureScript
Declarative web data visualization using ClojureScriptDeclarative web data visualization using ClojureScript
Declarative web data visualization using ClojureScript
OSCON Byrum
 

More from OSCON Byrum (20)

OSCON 2013 - Planning an OpenStack Cloud - Tom Fifield
OSCON 2013 - Planning an OpenStack Cloud - Tom FifieldOSCON 2013 - Planning an OpenStack Cloud - Tom Fifield
OSCON 2013 - Planning an OpenStack Cloud - Tom Fifield
 
Protecting Open Innovation with the Defensive Patent License
Protecting Open Innovation with the Defensive Patent LicenseProtecting Open Innovation with the Defensive Patent License
Protecting Open Innovation with the Defensive Patent License
 
Using Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open DataUsing Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open Data
 
Finite State Machines - Why the fear?
Finite State Machines - Why the fear?Finite State Machines - Why the fear?
Finite State Machines - Why the fear?
 
Open Source Automotive Development
Open Source Automotive DevelopmentOpen Source Automotive Development
Open Source Automotive Development
 


The Art of Social Media Analysis with Twitter & Python - OSCON 2012

  • 1. The Art of Social Media Analysis with Twitter & Python krishna sankar @ksankar http://www.oscon.com/oscon2012/public/schedule/detail/23130
  • 2. Intro o House Rules (1 of 2) o Doesn't assume any knowledge of the Twitter API o Goal: get everybody on the same page & a working knowledge of the Twitter API o To bootstrap your exploration into Social Network Analysis & Twitter o Simple programs, to illustrate usage & data manipulation [Diagram: Twitter Network Analysis Pipeline - API, Objects; #tag network; @mention network; Retweet analytics, information contagion; NLP, NLTK, sentiment analysis; cliques, social graph; growth, weak ties. We will analyze @clouderati: 2,072 followers, exploding to ~980,000 distinct users down one level]
  • 3. Intro o House Rules (2 of 2) o Am using the requests library o There are good Twitter frameworks for Python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help o Many areas to explore, not enough time. So decided to focus on the social graph, cliques & networkx [Same pipeline diagram as slide 2]
  • 4. About Me • Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com o Co-chair - 2012 IEEE Precision Time Synchronization • http://www.ispcs.org/2012/index.html o Blog: http://doubleclix.wordpress.com/ o Quora: http://www.quora.com/Krishna-Sankar • Prior Gigs o Lead Architect (Egnyte) o Distinguished Engineer (CSCO) o Employee #64439 (CSCO) to #39 (Egnyte) & now #9! • Current Focus: o Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, ... o Big Data (more of variety, variability, context & graphs, than volume or velocity - so far!) o Overlay based semantic search & ranking • Other related Presentations o http://goo.gl/P1rhc Big Data Engineering Top 10 Pragmatics (Summary) o http://goo.gl/0SQDV The Art of Big Data (Detailed) o http://goo.gl/EaUKH The Hitchhiker's Guide to Kaggle OSCON 2011 Tutorial
  • 5. Twitter Tips - A Baker's Dozen 1. Twitter APIs are (more or less) congruent & symmetric 2. Twitter is usually right & simple - recheck when you get unexpected results before blaming Twitter o I was getting numbers when I was expecting screen_names in user objects o Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong: screen_name instead of user_id o Always test with one or two records before a long run! - learned the hard way 3. Twitter APIs are very powerful - consistent use can harvest huge amounts of data o In a week, you can pull in 4-5 million users & some tweets! o Night runs are far faster & more error-free 4. Use a NoSQL data store as a command buffer & data buffer o Would make it easy to work with Twitter at scale o I use MongoDB o Keep the schema simple & no fancy transformation • And as far as possible the same as the (json) response o Use the NoSQL CLI for trimming records et al [Slide decoration: "The End As The Beginning"]
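Tip 4's command/data buffer can be sketched without MongoDB. A minimal stand-in, assuming an in-memory dict in place of a Mongo collection; the user ids are from the slides, but the class, method names, and the fake response are illustrative, not from the talk:

```python
import json

# A minimal stand-in for a NoSQL command/data buffer (the talk uses MongoDB).
# The stored document mirrors the (json) API response verbatim, plus a small
# envelope recording crawl state -- no fancy transformation.
class CommandBuffer:
    def __init__(self):
        self.docs = {}          # _id -> document

    def enqueue(self, user_id):
        # A pending "crawl this user" command; idempotent across re-runs.
        self.docs.setdefault(user_id, {"_id": user_id, "state": "pending"})

    def store_response(self, user_id, api_response_json):
        doc = self.docs[user_id]
        doc["response"] = json.loads(api_response_json)  # keep the raw shape
        doc["state"] = "done"

    def pending(self):
        return [d["_id"] for d in self.docs.values() if d["state"] == "pending"]

buf = CommandBuffer()
buf.enqueue(15254297)
buf.enqueue(44614426)
buf.store_response(15254297, '{"ids": [1, 2, 3]}')
```

Because completed commands are marked rather than deleted, a restarted run can re-enqueue everything and only the still-pending ids get crawled.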
  • 6. Twitter Tips - A Baker's Dozen 5. Always use a big data pipeline o Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize o That way you can orthogonally extend, with functional components like command buffers, validation et al 6. Use a functional approach for a scalable pipeline o Compose your big data pipeline with well defined granular functions, each doing only one thing o Don't overload the functional components (i.e. no collect, unroll & store as a single component) o Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques • This did create some trouble for me, as we will see later 7. Crawl-Store-Validate-Recrawl-Refresh cycle o The equivalent of the traditional ETL o Validation stage & validation routines are important • Cannot expect perfect runs • Cannot manually look at data either, when data is at scale 8. Have control numbers to validate runs & monitor them o I still remember control numbers which start with the number of punch cards in the input deck and then follow that number through the various runs! o There will be a separate printout of the control numbers that will be kept in the operations files
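Tips 6 and 8 can be sketched together: granular stages, each doing one thing, with a control number (record count) validated at each handoff. The stage names and fake follower data are illustrative assumptions, not the talk's actual code:

```python
# Granular pipeline stages (collect -> unroll), each doing one thing, with a
# control number carried along to validate the run (tips 6 & 8).
def collect(seed_users):
    # pretend each user yields a follower-id array, as the API would
    return [{"user": u, "follower_ids": [u * 10 + i for i in range(3)]}
            for u in seed_users]

def unroll(arrays):
    # flatten: one document per follower, instead of arrays
    return [{"user": a["user"], "follower": f}
            for a in arrays for f in a["follower_ids"]]

def control_number(records):
    return len(records)

seeds = [1, 2]
stage1 = collect(seeds)
stage2 = unroll(stage1)
# control check: total follower ids in must equal documents out
expected = sum(len(a["follower_ids"]) for a in stage1)
```

The control check is the punch-card idea in miniature: the count established at the first stage is re-verified after every transformation.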
  • 7. Twitter Tips - A Baker's Dozen 9. Program defensively o More so for a REST-based big-data analytics system o Expect failures at the transport layer & accommodate them 10. Have Erlang-style supervisors in your pipeline o Fail fast & move on o Don't linger and try to fix errors that cannot be controlled at that layer o A higher layer process will circle back and do incremental runs to correct missing spiders and crawls o Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions o I have an example in part 2 11. Data will never be perfect o Know your data & accommodate its idiosyncrasies • For example: 0 followers, protected users, 0 friends, ...
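Tip 10's fail-fast-and-move-on supervision can be sketched as below. The crawl function is a stand-in that fails on negative ids; the real example is in part 2 of the talk, so everything here is an illustrative assumption:

```python
# Erlang-style supervision sketch: a stage fails fast on a bad record and
# moves on; a higher layer circles back over just the misses (tip 10).
def crawl_user(user_id):
    if user_id < 0:                      # stand-in for a transport failure
        raise ConnectionError("transient")
    return {"user": user_id, "ok": True}

def supervised_run(user_ids):
    done, failed = [], []
    for uid in user_ids:
        try:
            done.append(crawl_user(uid))
        except ConnectionError:
            failed.append(uid)           # record & move on; no retry here
    return done, failed

done, failed = supervised_run([1, -2, 3])
# higher-layer incremental pass over only the failures
retry_done, retry_failed = supervised_run([abs(u) for u in failed])
```

The low layer never blocks the run on an error it cannot fix; the incremental re-run is cheap because it touches only the recorded misses.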
  • 8. Twitter Tips - A Baker's Dozen 12. Checkpoint frequently (preferably after every API call) & have a restartable command buffer cache o See a MongoDB example in Part 2 13. Don't bombard the URL o Wait a few seconds between successful calls. This will end up as a scalable system, eventually o I found 10 seconds to be the sweet spot. 5 seconds gave a retry error. Was able to work with 5 seconds with wait & retry; then the rate limit started kicking in! 14. Always measure the elapsed time of your API runs & processing o Kind of an early warning when something is wrong 15. Develop incrementally; don't fail to check "cut & paste" errors
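Tip 12's checkpoint-after-every-call idea, sketched with a JSON file in place of the talk's MongoDB command buffer. The crash flag and file layout are illustrative assumptions so the restart behavior is visible:

```python
import json
import os
import tempfile

# Checkpoint sketch (tip 12): record progress after every call, so a crashed
# run restarts where it left off instead of from the top.
def run(user_ids, ckpt_path, crash_after=None):
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["next_index"]
    processed = []
    for i in range(start, len(user_ids)):
        if crash_after is not None and i == crash_after:
            return processed                     # simulate a crash mid-run
        processed.append(user_ids[i])            # stand-in for the API call
        with open(ckpt_path, "w") as f:
            json.dump({"next_index": i + 1}, f)  # checkpoint every call
    return processed

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = run([10, 20, 30, 40], path, crash_after=2)   # dies before item 3
second = run([10, 20, 30, 40], path)                 # restart resumes at 3
```

With Twitter's hourly rate windows, losing an interrupted run's calls is expensive, which is why the slide suggests checkpointing per call rather than per batch.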
  • 9. Twitter Tips - A Baker's Dozen 16. The Twitter big data pipeline has lots of opportunities for parallelism o Leverage data parallelism frameworks like MapReduce o But first: § Prototype as a linear system, § Optimize and tweak the functional modules & cache strategies, § Note down stages and tasks that can be parallelized and § Then parallelize them o For the example project, as we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial 17. Pay attention to handoffs between stages o They might require transformation - for example collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation o But resist the urge to overload collect with transform o i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the arrays into separate documents o Add transformation as a granular function - of course, with appropriate buffering, caching, checkpoints & restart techniques 18. Have a good log management system to capture and wade through logs
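Tip 16's prototype-linear-then-parallelize step can be sketched with the standard library; a thread pool stands in here for the MapReduce-style frameworks the slide mentions, and the fetch function is a deterministic fake:

```python
from concurrent.futures import ThreadPoolExecutor

# Tip 16 sketch: the same granular stage, run linearly first, then handed to
# a data-parallel map once it is known to be correct. Per-user API calls are
# independent, which is what makes the map safe to parallelize.
def fetch_followers(user_id):
    return {"user": user_id, "count": user_id % 7}   # illustrative stand-in

def linear_run(user_ids):
    return [fetch_followers(u) for u in user_ids]

def parallel_run(user_ids, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_followers, user_ids))

users = list(range(20))
```

Because `pool.map` preserves input order, the parallel run is a drop-in replacement for the linear one, which makes the two easy to diff as a correctness check.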
  • 10. Twitter Tips - A Baker's Dozen 19. Understand the underlying network characteristics for the inference you want to make o Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph o The Twitter network is more of an interest network o So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense o But others, like cliques and bipartite graphs, do
  • 11. Twitter Gripes 1. Need richer APIs for #tags o Somewhat similar to users, viz. followers, friends et al o Might make sense to make #tags a top level object with its own semantics 2. HTTP error returns are not uniform o Returns 400 Bad Request instead of 420 o Granted, there is enough information to figure this out 3. Need an easier way to get screen_name from user_id 4. "following" vs. "friends_count" - i.e. "following" is a dummy variable o There are a few like this, most probably for backward compatibility 5. Parameter validation is not uniform o Gives "404 Not Found" instead of "406 Not Acceptable" or "413 Too Long" or "416 Range Unacceptable" 6. Overall, more validation would help o Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
  • 12. A Fork o NLP, NLTK & deep sentiment analysis into Tweets o Social graph analysis • Not enough time for both • I chose the Social Graph route
  • 13. A minute about Twitter as a platform & its evolution https://dev.twitter.com/blog/delivering-consistent-twitter-experience "The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community." - Chenda, CBS News "... we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you're on Twitter.com or elsewhere on the web" - Michael My Wish & Hope • I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive • I did like the fact that tweets are part of LinkedIn. I still used Twitter more than LinkedIn o I don't think showing Tweets in LinkedIn took anything away from the Twitter experience o The LinkedIn experience & the Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that • I sincerely hope that the platform grows with a rich developer ecosystem • An orthogonally extensible platform is essential • Of course, along with a congruent user experience - "... core Twitter consumption experience through consistent tools"
  • 14. Setup • For hands-on today o Python 2.7.3 o easy_install -v requests • http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request o easy_install -v requests-oauth o Hands-on programs at https://github.com/xsankar/oscon2012-handson • For advanced data science with social graphs o easy_install -v networkx o easy_install -v numpy o easy_install -v nltk • Not for this tutorial, but good for sentiment analysis et al o MongoDB • I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS o graphviz - http://www.graphviz.org/; easy_install pygraphviz o easy_install pydot
  • 15. Thanks To these Giants …
  • 16. Problem Domain For this tutorial • Data science (trends, analytics et al) on social networks as observed by Twitter primitives o Not for Twitter-based apps for real time tweets o Not web sites with real time tweets • By looking at the domain in aggregate to derive inferences & actionable recommendations • Which also means you need to be deliberate & systemic (i.e. not look at a fluctuation as a trend, but dig deeper before pronouncing a trend)
  • 17. Agenda I. Mechanics: Twitter API (1:30 PM - 3:00 PM) o Essential fundamentals (rate limit, HTTP codes et al) o Objects o API o Hands-on (2:45 PM - 3:00 PM) II. Break (3:00 PM - 3:30 PM) III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM) o Underlying concepts o Social graph analysis of @clouderati § Stages, strategies & tasks § Code walk-thru
  • 19. Twitter API: Read These First • Using the Twitter brand o New logo & associated guidelines: https://twitter.com/about/logos o Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules o Developer Rules of the Road: https://dev.twitter.com/terms/api-terms • Read these links first 1. https://dev.twitter.com/docs/things-every-developer-should-know 2. https://dev.twitter.com/docs/faq 3. Field Guide to Objects https://dev.twitter.com/docs/platform-objects 4. Security https://dev.twitter.com/docs/security-best-practices 5. Media Best Practices: https://dev.twitter.com/media 6. Consolidated page: https://dev.twitter.com/docs 7. Streaming APIs https://dev.twitter.com/docs/streaming-apis 8. How to appeal (not that you all would need it!) https://support.twitter.com/articles/72585 • Only one version of the Twitter APIs
  • 20. API  Status  Page •  https://dev.twitter.com/status   •  https://dev.twitter.com/issues   •  https://dev.twitter.com/discussions  
  • 22. Open This First • Install pre-reqs as per the setup slide • Run o oscon2012_open_this_first.py o To test connectivity - a "canary query" • Run o oscon2012_rate_limit_status.py o Use http://www.epochconverter.com to check reset_time • Formats: xml, json, atom & rss
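The reset_time check the slide sends to epochconverter.com can be done in the script itself; a sketch of the canary idea, assuming the v1 rate_limit_status endpoint from the talk (long gone today) and keeping the live call behind a flag:

```python
import datetime

# "Canary query" sketch: hit rate_limit_status before a long run, then turn
# the reset_time epoch into something readable instead of pasting it into
# epochconverter.com. The v1 endpoint below is the one the talk uses.
RATE_LIMIT_URL = "https://api.twitter.com/1/account/rate_limit_status.json"

def reset_time_readable(epoch_seconds):
    dt = datetime.datetime.fromtimestamp(epoch_seconds,
                                         tz=datetime.timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S UTC")

LIVE = False  # flip to True to make the real (historical) call
if LIVE:
    import requests                      # third-party; needed only when live
    r = requests.get(RATE_LIMIT_URL)
    print(r.status_code, r.json())
```

For example, the reset value 1341366831 seen on slide 29 decodes to 2012-07-04 01:53:51 UTC, about one hour after that response's date header, as expected.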
  • 23. Twitter API • Twitter REST: core data, core Twitter objects; build profile, create/post tweets, reply, favorite, re-tweet; rate limit: 150/350 • Twitter Search: search & trends; keywords, specific user, trends; rate limit: complexity & frequency • Streaming: near-realtime, high volume; follow users, topics, data mining; Public Streams, User Streams, Site Streams • Firehose
  • 25. Rate Limits • By API type & authentication mode: o REST - 150/hr (no auth), 350/hr (auth); error 400 o Search - limited by complexity & frequency; error 420 o Streaming - up to 1% o Firehose - none
  • 26. Rate Limit Header • { • "status": "200 OK", • "vary": "Accept-Encoding", • "x-frame-options": "SAMEORIGIN", • "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6", • "x-ratelimit-class": "api", • "x-ratelimit-limit": "150", • "x-ratelimit-remaining": "149", • "x-ratelimit-reset": "1340467358", • "x-runtime": "0.04144", • "x-transaction": "2b49ac31cf8709af", • "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba" • }
  • 27. Rate Limit-ed Header • { • "cache-control": "no-cache, max-age=300", • "content-encoding": "gzip", • "content-length": "150", • "content-type": "application/json; charset=utf-8", • "date": "Wed, 04 Jul 2012 00:48:25 GMT", • "expires": "Wed, 04 Jul 2012 00:53:25 GMT", • "server": "tfe", • ... • "status": "400 Bad Request", • "vary": "Accept-Encoding", • "x-ratelimit-class": "api", • "x-ratelimit-limit": "150", • "x-ratelimit-remaining": "0", • "x-ratelimit-reset": "1341363230", • "x-runtime": "0.01126" • }
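Reading the x-ratelimit-* headers above is mechanical; a small sketch that decides how long to sleep before the next call. The header names are from the responses shown; the injected "now" parameter is an assumption added so the logic is testable without a clock:

```python
# Decide how long to wait based on the x-ratelimit-* headers shown above.
# remaining > 0 means quota is left; remaining == 0 means wait until the
# epoch second in x-ratelimit-reset.
def seconds_until_allowed(headers, now_epoch):
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    reset = int(headers.get("x-ratelimit-reset", "0"))
    if remaining > 0:
        return 0                          # quota left: go ahead
    return max(0, reset - now_epoch)      # exhausted: wait for the reset

limited = {"x-ratelimit-limit": "150",
           "x-ratelimit-remaining": "0",
           "x-ratelimit-reset": "1341363230"}   # values from the slide
```

`max(0, ...)` covers the case where the reset time has already passed by the time the response is processed.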
  • 28. Rate Limit Example • Run o oscon2012_rate_limit_02.py • It iterates through a list to get followers • The list is 2072 long
  • 29. • { • ... • "date": "Wed, 04 Jul 2012 00:54:16 GMT", • "status": "200 OK", • "vary": "Accept-Encoding", • "x-frame-options": "SAMEORIGIN", • "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8", • "x-ratelimit-class": "api", • "x-ratelimit-limit": "150", • "x-ratelimit-remaining": "147", • "x-ratelimit-reset": "1341366831", • "x-runtime": "0.02768", • "x-transaction": "f1bafd60112dddeb", • "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc" • } [Annotation: Last time, it gave me 5 min. Now the reset timer is 1 hour; 150 calls, not authenticated]
  • 30. And Rate Limit kicked-in • { • "cache-control": "no-cache, max-age=300", • "content-encoding": "gzip", • "content-type": "application/json; charset=utf-8", • "date": "Wed, 04 Jul 2012 00:55:04 GMT", • ... • "status": "400 Bad Request", • "transfer-encoding": "chunked", • "vary": "Accept-Encoding", • "x-ratelimit-class": "api", • "x-ratelimit-limit": "150", • "x-ratelimit-remaining": "0", • "x-ratelimit-reset": "1341366831", • "x-runtime": "0.01342" • }
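The loop implied by slides 29-30 (call until a 400 with remaining 0 comes back, wait out the reset, try again) can be sketched as below. The fake response sequence and the injected sleep callback are illustrative assumptions so the control flow runs without a network or a real clock:

```python
# Sketch of the loop behind the two responses above: call; when a 400 comes
# back with x-ratelimit-remaining == 0, sleep until x-ratelimit-reset and
# retry. api() returns (status, headers, body) tuples.
def call_with_rate_limit(api, sleep, max_tries=5):
    for _ in range(max_tries):
        status, headers, body = api()
        if status == 200:
            return body
        if status == 400 and headers.get("x-ratelimit-remaining") == "0":
            sleep(int(headers["x-ratelimit-reset"]))   # wait out the window
            continue
        raise RuntimeError("unexpected status %s" % status)
    raise RuntimeError("gave up after %d tries" % max_tries)

responses = iter([
    (400, {"x-ratelimit-remaining": "0",
           "x-ratelimit-reset": "1341366831"}, None),   # slide 30
    (200, {"x-ratelimit-remaining": "349"}, {"ids": [1, 2]}),
])
slept = []
body = call_with_rate_limit(lambda: next(responses), slept.append)
```

In a real run the sleep callback would compute the delta to the reset epoch and call time.sleep; injecting it keeps the retry policy separate from the clock.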
  • 31. API with OAuth • { • ... • "date": "Wed, 04 Jul 2012 01:32:01 GMT", • "etag": ""dd419c02ed00fc6b2a825cc27wbe040"", • "expires": "Tue, 31 Mar 1981 05:00:00 GMT", • "last-modified": "Wed, 04 Jul 2012 01:32:01 GMT", • "pragma": "no-cache", • "server": "tfe", • ... • "status": "200 OK", • "vary": "Accept-Encoding", • "x-access-level": "read", • "x-frame-options": "SAMEORIGIN", • "x-mid": "5bbb87c04fa43c43bc9d7482bc62633a1ece381c", • "x-ratelimit-class": "api_identified", • "x-ratelimit-limit": "350", • "x-ratelimit-remaining": "349", • "x-ratelimit-reset": "1341369121", • "x-runtime": "0.05539", • "x-transaction": "9f8508fe4c73a407", • "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc" • } [Annotation: OAuth - "api_identified", 1 hr reset, 350 calls]
  • 32. Rate Limit resets during consecutive calls • { • ... • "date": "Thu, 05 Jul 2012 14:56:05 GMT", • ... • "x-ratelimit-class": "api_identified", • "x-ratelimit-limit": "350", • "x-ratelimit-remaining": "133", • "x-ratelimit-reset": "1341500165", • ... • } • ******** 2416 • { • ... • "date": "Thu, 05 Jul 2012 14:56:18 GMT", • ... • "status": "200 OK", • ... • "x-ratelimit-class": "api_identified", • "x-ratelimit-limit": "350", • "x-ratelimit-remaining": "349", • "x-ratelimit-reset": "1341503776", • ... [Annotation: +1 hour] • ******** 2417
  • 33. Unexplained Errors • Traceback (most recent call last): • File "oscon2012_get_user_info_01.py", line 39, in <module> • r = client.get(url, params=payload) • File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get • File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request • File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send • requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C... (a long URL-encoded list of ~100 user_ids) • While trying to get details of 1,000,000 users, I get this error - usually 10-6 AM PST • Got around by "Trap & wait 5 seconds" • Night runs are relatively error free
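The "Trap & wait 5 seconds" workaround from the slide, sketched as a retry wrapper. The flaky lookup function is a stand-in for the users/lookup call, and the delay is shortened so the sketch runs instantly:

```python
import time

# "Trap & wait": retry a call that dies with a transport-layer
# ConnectionError, sleeping between attempts (slide 33's workaround).
def trap_and_wait(fn, retries=3, delay=0.01):
    for attempt in range(retries):
        try:
            return fn()
        except ConnectionError:
            if attempt == retries - 1:
                raise                    # out of retries: let it propagate
            time.sleep(delay)            # the talk waited ~5-10 s here

calls = {"n": 0}
def flaky_lookup():
    calls["n"] += 1
    if calls["n"] < 3:                   # fail twice, then succeed
        raise ConnectionError("Max retries exceeded")
    return {"users": 100}

result = trap_and_wait(flaky_lookup)
```

Note this catches only ConnectionError; per tip 10, errors the wrapper cannot fix (bad parameters, auth failures) should propagate so a layer with more context can act.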
  • 34. A Day in the Life of the Twitter Rate Limit • { • ... • "date": "Fri, 06 Jul 2012 03:41:09 GMT", • "expires": "Fri, 06 Jul 2012 03:46:09 GMT", • "server": "tfe", • "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT", • "status": "400 Bad Request", • "vary": "Accept-Encoding", • "x-ratelimit-class": "api_identified", • "x-ratelimit-limit": "350", • "x-ratelimit-remaining": "0", • "x-ratelimit-reset": "1341546334", • "x-runtime": "0.01918" • } [Annotation: Missed by 4 min!] • Error, sleeping • { • ... • "date": "Fri, 06 Jul 2012 03:46:12 GMT", • ... • "status": "200 OK", • ... • "x-ratelimit-class": "api_identified", • "x-ratelimit-limit": "350", • "x-ratelimit-remaining": "349", • ... [Annotation: OK after 5 min sleep]
  • 35. Strategies I have no exotic strategies, so far! 1. Obvious: track elapsed time & sleep when the rate limit kicks in 2. Combine authenticated & non-authenticated calls 3. Use multiple API types 4. Cache 5. Store & get only what is needed 6. Checkpoint & buffer request commands 7. Distributed data parallelism - for example AWS instances • http://www.epochconverter.com/ <- useful to debug the timer • Please share your tips and tricks for conserving the rate limit
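Strategy 1 (track elapsed time & sleep) can be sketched as a small client-side budget that mirrors the 350-calls-per-hour window. The class and its tiny test numbers are illustrative assumptions; the server's x-ratelimit headers remain the authority:

```python
# Client-side view of the rolling rate window: record each call's timestamp
# and sleep when the window is full (strategy 1). Timestamps are passed in
# explicitly so the logic is clock-free and testable.
class RateBudget:
    def __init__(self, calls_per_window=350, window_seconds=3600):
        self.calls_per_window = calls_per_window
        self.window_seconds = window_seconds
        self.timestamps = []

    def record(self, now):
        self.timestamps.append(now)

    def calls_in_window(self, now):
        cutoff = now - self.window_seconds
        return sum(1 for t in self.timestamps if t > cutoff)

    def must_sleep(self, now):
        return self.calls_in_window(now) >= self.calls_per_window

budget = RateBudget(calls_per_window=2, window_seconds=10)
budget.record(100)
budget.record(104)
```

Keeping this count locally also doubles as tip 14's early-warning metric: a run whose call rate drifts from the expected pace is worth investigating.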
  • 37. Authentication • Three modes o Anonymous o HTTP Basic Auth o OAuth • As of Aug 31, 2010, only Anonymous or OAuth are supported • OAuth enables the user to authorize an application without sharing credentials • Also has the ability to revoke • Twitter supports OAuth 1.0a • OAuth 2.0 is the new standard, much simpler o No timeframe for Twitter support, yet
  • 38. OAuth Pragmatics • Helpful links o https://dev.twitter.com/docs/auth/oauth o https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth o https://dev.twitter.com/docs/auth/oauth/single-user-with-examples o http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html • Discussion of OAuth's internal mechanisms is better left for another day • For headless applications to get an OAuth token, go to https://dev.twitter.com/apps • Create an application & get four credential pieces o Consumer Key, Consumer Secret, Access Token & Access Token Secret • All the frameworks have support for OAuth. So plug in these values & use the framework's calls • I used the requests-oauth library like so:
  • 39. request-oauth - get the client using the token, key & secret from dev.twitter.com/apps, then use the client instead of requests (ref: http://pypi.python.org/pypi/requests-oauth):
      def get_oauth_client():
          consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
          consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
          access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
          access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
          header_auth = True
          oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key, consumer_secret, header_auth)
          client = requests.session(hooks={'pre_request': oauth_hook})
          return client
      def get_followers(user_id):
          url = 'https://api.twitter.com/1/followers/ids.json'
          payload = {"user_id": user_id}  # if cursor is needed: {"cursor": -1, "user_id": scr_name}
          r = requests.get(url, params=payload)
      def get_followers_with_oauth(user_id, client):
          url = 'https://api.twitter.com/1/followers/ids.json'
          payload = {"user_id": user_id}  # if cursor is needed: {"cursor": -1, "user_id": scr_name}
          r = client.get(url, params=payload)
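The cursor comment in the slide's get_followers can be spelled out as a full pagination loop. A sketch with the client injected, so the loop works with any client exposing .get; FakeClient here returns parsed JSON directly (with requests you would call r.json()), and its two-page data is an illustrative assumption:

```python
# Cursor-paging sketch for followers/ids.json: start at cursor -1, follow
# next_cursor until it is 0. The OAuth'd client is injected, not constructed
# here, which also makes the loop testable with a stub.
FOLLOWERS_URL = "https://api.twitter.com/1/followers/ids.json"

def get_all_followers(client, user_id):
    ids, cursor = [], -1
    while cursor != 0:
        page = client.get(FOLLOWERS_URL,
                          params={"user_id": user_id, "cursor": cursor})
        ids.extend(page["ids"])
        cursor = page["next_cursor"]      # 0 means no more pages
    return ids

class FakeClient:
    # two fake pages keyed by cursor, mimicking the response shape
    pages = {-1: {"ids": [11, 12], "next_cursor": 77},
             77: {"ids": [13], "next_cursor": 0}}
    def get(self, url, params):
        return self.pages[params["cursor"]]

followers = get_all_followers(FakeClient(), 2072)
```

In a real run each page costs one call against the rate limit, so this loop is exactly where the checkpoint-per-call and sleep-on-limit sketches earlier would plug in.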
  • 40. OAuth Authorize Screen • The user authenticates with Twitter & grants access to Forbes Social • Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account
• 42. HTTP Status Codes
•  0 Never made it to the Twitter servers (library error)
•  200 OK
•  304 Not Modified
•  400 Bad Request
  o  Check the error message for an explanation
  o  REST rate limit!
•  401 Unauthorized
  o  Beware: you could get this for other reasons as well
•  403 Forbidden
  o  Hit an update limit (> max tweets/day, following too many people)
•  404 Not Found
•  406 Not Acceptable
•  413 Too Long
•  416 Range Unacceptable
•  420 Enhance Your Calm
  o  Rate limited
•  500 Internal Server Error
•  502 Bad Gateway
  o  Down for maintenance
•  503 Service Unavailable
  o  Overloaded ("fail whale")
•  504 Gateway Timeout
  o  Overloaded
https://dev.twitter.com/docs/error-codes-responses
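The code list above maps naturally onto a retry policy. A minimal sketch (not from the deck; the grouping into "retry" vs. "fail" and the helper name are assumptions) of how a client might branch on these codes:

```python
# Hypothetical helper for reacting to Twitter v1 HTTP status codes.
# The code-to-action mapping follows the slide; the policy itself is assumed.

RETRYABLE = {420, 500, 502, 503, 504}   # rate limited / server-side trouble

def classify_status(code):
    """Return 'ok', 'retry', or 'fail' for an HTTP status code."""
    if code in (200, 304):
        return "ok"
    if code in RETRYABLE:
        return "retry"          # back off and try again later
    return "fail"               # other 4xx: fix the request instead of retrying

print(classify_status(200))   # ok
print(classify_status(420))   # retry (Enhance Your Calm = rate limited)
print(classify_status(401))   # fail
```

A real client would add an exponential backoff before each "retry".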
• 43. HTTP Status Code - Example
Response headers (a detailed error message in JSON! I like this):
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "91",
  "content-type": "application/json; charset=utf-8",
  "date": "Sat, 23 Jun 2012 00:06:56 GMT",
  "expires": "Sat, 23 Jun 2012 00:11:56 GMT",
  "server": "tfe",
  ...
  "status": "401 Unauthorized",
  "vary": "Accept-Encoding",
  "www-authenticate": "OAuth realm=\"https://api.twitter.com\"",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "0",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1340413616",
  "x-runtime": "0.01997"
}
Response body:
{
  "errors": [
    {
      "code": 53,
      "message": "Basic authentication is not supported"
    }
  ]
}
• 44. HTTP Status Code - Confusing Example
Request:
GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
Response headers:
{
  ...
  "pragma": "no-cache",
  "server": "tfe",
  ...
  "status": "404 Not Found",
  ...
}
Response body:
{
  "errors": [
    {
      "code": 34,
      "message": "Sorry, that page does not exist"
    }
  ]
}
•  Spelling mistake: should be screen_name
•  But a confusing error!
•  Should be 406 Not Acceptable or 413 Too Long, showing a parameter error
• 45. HTTP Status Code - Example
{
  "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
  "content-encoding": "gzip",
  "content-length": "112",
  "content-type": "application/json;charset=utf-8",
  "date": "Sat, 23 Jun 2012 01:23:47 GMT",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  ...
  "status": "401 Unauthorized",
  "www-authenticate": "OAuth realm=\"https://api.twitter.com\"",
  "x-frame-options": "SAMEORIGIN",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1340417742",
  "x-transaction": "d545a806f9c72b98"
}
{
  "error": "Not authorized",
  "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
}
Sometimes the errors are not correct. I got this error for user_timeline.json with user_id=20,15,12. Clearly a parameter error (i.e., more parameters than allowed).
• 47. Twitter Platform Objects
•  Users: follow / are followed by other users (Friends & Followers)
•  Tweets (status updates): embed Entities; temporally ordered into a Timeline
•  Entities: @ user_mentions, urls, media, # hashtags
•  Places
https://dev.twitter.com/docs/platform-objects
• 48. Tweets
•  a.k.a. Status Updates
•  Interesting fields:
  o  coordinates <- geo location
  o  created_at
  o  entities (will see later)
  o  id, id_str
  o  possibly_sensitive
  o  user (will see later)
  o  withheld_in_countries
•  Perspectival attributes are embedded within a child object of an unlike parent - hard to maintain at scale
  o  https://dev.twitter.com/docs/faq#6981
•  https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
https://dev.twitter.com/docs/platform-objects/tweets
• 49. A word about id, id_str
•  June 1, 2010
  o  Snowflake, the ID generator service
  o  "The full ID is composed of a timestamp, a worker number, and a sequence number"
  o  JavaScript had problems handling numbers > 53 bits
  o  "id": 819797
  o  "id_str": "819797"
http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
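The 53-bit problem is easy to demonstrate. JavaScript numbers are IEEE-754 doubles, and a Python float behaves the same way, so we can show the precision loss (and why id_str round-trips safely) in a few lines; the sample id below is made up:

```python
# Why id_str exists: integers above 2**53 silently lose precision in a double.
big_id = 2**53 + 1                      # a Snowflake-sized tweet id
assert float(big_id) == 2**53           # the +1 is lost in a double

# Keeping the id as a string round-trips exactly through JSON:
import json
tweet = json.loads('{"id": 9007199254740993, "id_str": "9007199254740993"}')
print(tweet["id_str"])                  # exact: '9007199254740993'
```

Python's own json module keeps "id" as an exact int; it is JavaScript clients that need "id_str".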
• 50. Tweets - Example
•  Let us run oscon2012-tweets.py
•  Example fields of a tweet:
  o  coordinates
  o  id
  o  id_str
• 51. Users
•  followers_count
•  geo_enabled
•  id, id_str
•  name, screen_name
•  protected
•  status, statuses_count
•  withheld_in_countries
https://dev.twitter.com/docs/platform-objects/users
• 52. Users - Let us run some examples
•  Run:
  o  oscon_2012_users.py (lookup users by screen_name)
  o  oscon12_first_20_ids.py (lookup users by user_id)
•  Inspect the results:
  o  id, name, status, statuses_count, protected, followers (for the top 10 followers), withheld users
•  You can use this information to customize the user's screen in your web app
• 53. Entities
•  Metadata & contextual information
•  You could parse tweets yourself, but Entities gives you the pieces as structured data
•  REST API/Search API: pass include_entities=1
•  Streaming API: included by default
•  hashtags, media, urls, user_mentions
https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
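A minimal sketch of pulling the structured data out of a tweet's "entities" field. The sample payload is invented but mimics the v1 shape (hashtags/urls/user_mentions); the text and screen names are illustrative only:

```python
# Extracting hashtags, expanded urls and mentions from a tweet dict,
# as returned with include_entities=1. Sample payload is hand-made.
tweet = {
    "text": "Learning the Twitter API at #oscon http://t.co/abc123 cc @ksankar",
    "entities": {
        "hashtags": [{"text": "oscon", "indices": [28, 34]}],
        "urls": [{"url": "http://t.co/abc123",
                  "expanded_url": "http://oscon.com/", "indices": [35, 53]}],
        "user_mentions": [{"screen_name": "ksankar", "indices": [57, 65]}],
    },
}

hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
urls     = [u["expanded_url"] for u in tweet["entities"]["urls"]]
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]

print(hashtags, urls, mentions)
```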
• 54. Entities
•  Run:
  o  oscon2012_entities.py
•  Inspect hashtags, urls, et al.
• 55. Places
•  attributes
•  bounding_box
•  id (as a string!)
•  country
•  name
https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes
• 56. Places
•  Can search for tweets near a place, like so:
  o  Get the lat/long of the convention center [45.52929, -122.66289]
  o  Tweets near that place
•  Tweets near San Jose [37.395715, -122.102308]
•  We won't go further here, but it is very useful
• 57. Timelines
•  Collections of tweets, ordered by time
•  Use max_id & since_id for navigation
https://dev.twitter.com/docs/working-with-timelines
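The max_id navigation above can be sketched without touching the network. Here fetch_page is a stand-in (an assumption, not a real API call) for a user_timeline request that serves tweets with id <= max_id, newest first; the paging loop itself is the pattern the timelines doc describes:

```python
# max_id paging pattern, with a stubbed-out fetch. FAKE_TIMELINE and
# fetch_page are stand-ins for the real user_timeline endpoint.
FAKE_TIMELINE = [{"id": i} for i in range(110, 100, -1)]   # newest first

def fetch_page(max_id=None, count=3):
    page = [t for t in FAKE_TIMELINE if max_id is None or t["id"] <= max_id]
    return page[:count]

def fetch_all():
    tweets, max_id = [], None
    while True:
        page = fetch_page(max_id=max_id)
        if not page:
            break
        tweets.extend(page)
        max_id = page[-1]["id"] - 1     # step just below the oldest id seen
    return tweets

print([t["id"] for t in fetch_all()])   # 110 .. 101, no duplicates
```

Subtracting 1 from the oldest id avoids re-fetching the boundary tweet on the next page.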
• 58. Other Objects & APIs
•  Lists
•  Notifications
•  Friendships/exists, to see if one user follows the other
• 59. Twitter Platform Objects
•  Users: follow / are followed by other users (Friends & Followers)
•  Tweets (status updates): embed Entities; temporally ordered into a Timeline
•  Entities: @ user_mentions, urls, media, # hashtags
•  Places
https://dev.twitter.com/docs/platform-objects
• 60. Hands-on Exercise (15 min)
•  Set up the environment (slide #14)
•  Sanity-check the environment & libraries:
  o  oscon2012_open_this_first.py
  o  oscon2012_rate_limit_status.py
•  Get objects (show calls):
  o  Lookup users by screen_name: oscon12_users.py
  o  Lookup users by id: oscon12_first_20_ids.py
  o  Lookup tweets: oscon12_tweets.py
  o  Get entities: oscon12_entities.py
•  Inspect the results
•  Explore a little bit
•  Discussion
• 62. Twitter API
•  Twitter REST: core data, core Twitter objects
  o  Build profile, create/post tweets, reply, favorite, re-tweet
  o  Rate limit: 150/350
•  Twitter Search: search & trends
  o  Keywords, specific user, trends
  o  Rate limit: complexity & frequency
•  Streaming: near-realtime, high volume
  o  Follow users, topics, data mining
  o  Public Streams, User Streams, Site Streams, Firehose
• 63. Twitter REST API
•  https://dev.twitter.com/docs/api
•  What we have been doing so far is the REST API
•  Request-response
•  Anonymous or OAuth
•  Rate limited: 150/350 (anonymous/OAuth) calls per hour
• 64. Twitter Trends
•  oscon2012-trends.py
•  Trends/weekly, Trends/monthly
•  Let us run some examples:
  o  oscon2012_trends_daily.py
  o  oscon2012_trends_weekly.py
•  Trends & hashtags:
  o  #hashtag euro2012
  o  http://hashtags.org/euro2012
  o  http://sproutsocial.com/insights/2011/08/twitter-hashtags/
  o  http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
  o  Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/
• 65. Brand Rank w/ Twitter
•  Walk through & results of the following:
  o  oscon2012_brand_01.py
•  Followed 10 user-brands for a few days to track growth
•  Brand Rank:
  o  Growth of a brand w.r.t. the industry
  o  A surge in popularity could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
•  API: url = 'https://api.twitter.com/1/users/lookup.json'
•  payload = {"screen_name": "miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
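The comparison step (growth relative to peers, in percent rather than absolute counts, as the later slides note) can be sketched like this. The snapshot numbers are made up for illustration; only the @clouderati follower count comes from the deck:

```python
# Sketch of the brand-rank idea: compare % follower growth across brands.
# Snapshot values below are invented; a real run would use users/lookup.
snapshots = {                       # brand -> [day1, day2, day3] followers
    "oscon":      [20000, 20100, 20150],
    "googleio":   [400000, 410000, 430000],
    "clouderati": [2072, 2072, 2073],
}

def growth_pct(series):
    first, last = series[0], series[-1]
    return 100.0 * (last - first) / first

ranked = sorted(snapshots, key=lambda b: growth_pct(snapshots[b]), reverse=True)
for brand in ranked:
    print(brand, round(growth_pct(snapshots[brand]), 2))
```

Normalizing to percent is what makes a 2,000-follower account comparable with a 400,000-follower one.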
• 66. Brand Rank w/ Twitter
•  Clouderati is very stable
• 67. Brand Rank w/ Twitter - Tech Brands
•  Google I/O showed a spike on 6/27-6/28
•  OReillyMedia shares some of that spike
•  Looking at a few days' worth of data, our best inference is that "oscon doesn't track with googleio"
•  "Clouderati doesn't track at all"
• 68. Brand Rank w/ Twitter - World of Soccer
•  FOXSoccer & UEFAcom track each other
•  The numbers seldom decrease, so calculating -ve velocity will not work
•  OTOH, if you do see a -ve velocity, investigate
• 69. Brand Rank w/ Twitter - World of Basketball
•  NBA, MiamiHeat & okcthunder track each other
•  Used % rather than absolute numbers to compare
•  The hike from 7/6 to 7/10 is interesting
• 70. Brand Rank w/ Twitter - Rising Tide ...
•  For some reason, all numbers go up 7/6 thru 7/10 - except for clouderati!
•  Is a rising (Twitter) tide lifting all (well, almost all)?
• 71. Trivia: Search API
•  Search (search.twitter.com)
  o  Built by Summize, which was acquired by Twitter in 2008
  o  Summize described itself as "sentiment mining"
• 72. Search API
•  Very simple:
  o  GET http://search.twitter.com/search.json?q=<blah>
•  Based on a search criterion
•  "The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
•  Recent = the last 6-9 days' worth of tweets
•  Anonymous call
•  Rate limit:
  o  Not number of calls/hour, but complexity & frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search
• 73. Search API
•  Filters:
  o  Search terms are URL encoded
  o  @ = %40, # = %23
  o  Emoticons :) and :(
  o  http://search.twitter.com/search.atom?q=sometimes+%3A)
  o  http://search.twitter.com/search.atom?q=sometimes+%3A(
•  Location filters, date filters
•  Content searches
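You rarely need to hand-encode these escapes; Python's standard library does it. A quick check of the encodings the slide lists:

```python
# URL-encoding search queries: '@' -> %40, '#' -> %23, and the
# emoticons ':)' / ':(' must be escaped too.
from urllib.parse import quote_plus

print(quote_plus("#oscon"))          # %23oscon
print(quote_plus("@clouderati"))     # %40clouderati
print(quote_plus("sometimes :)"))    # sometimes+%3A%29
```

With the requests library (used elsewhere in this deck), passing the query via params= performs this encoding automatically.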
• 74. Streaming API
•  Not request-response, but a stream
•  The Twitter frameworks have support for it
•  Rate limit: up to 1% of the firehose
•  Stall warning if the client is falling behind
•  Good documentation links:
  o  https://dev.twitter.com/docs/streaming-apis/connecting
  o  https://dev.twitter.com/docs/streaming-apis/parameters
  o  https://dev.twitter.com/docs/streaming-apis/processing
• 75. Firehose
•  ~400 million public tweets/day
•  If you are working with the Twitter firehose, I envy you!
•  If you hit real limits, then explore the firehose route
•  AFAIK, it is not cheap, but worth it
• 76. API Best Practices
1. Use JSON
2. Use user_id rather than screen_name
  o  user_id is constant, while screen_name can change
3. max_id and since_id
  o  For example, for direct messages: if you have the last message, use since_id for the search
  o  max_id controls how far to go back
4. Cache as much as you can
5. Set the User-Agent header for debugging
I have listed a few good blogs with API best practices in the reference section at the end of this presentation. These were gathered from various books, blogs & other media I used for this tutorial.
• 77. Twitter API - Questions?
•  Twitter REST: core data, core Twitter objects; build profile, create/post tweets, reply, favorite, re-tweet (rate limit: 150/350)
•  Twitter Search: search & trends; keywords, specific user, trends (rate limit: complexity & frequency)
•  Streaming: near-realtime, high volume; follow users, topics, data mining (Public Streams, User Streams, Site Streams, Firehose)
• 78. Part II: Twitter Network Analysis (SNA)
• 79. The Pipeline
•  1. Collect
•  2. Store
•  3. Transform & Analyze
  o  Validate the dataset; don't be afraid to re-crawl/refresh
  o  Tip: keep the schema simple
•  4. Model & Reason
•  5. Predict, Recommend & Visualize
•  Tip: implement as a staged pipeline, never a monolith
•  Most important & the ugliest slide in this deck!
• 80. Trivia
•  Social Network Analysis originated as Sociometry & the social network was called a sociogram
•  Back then, Facebook was called SocioBinder!
•  Jacob Levy Moreno is considered the originator
  o  NYTimes, April 3, 1933, p. 17
• 81. Twitter Networks - Definitions
•  Nodes:
  o  Users
  o  #tags
•  Edges:
  o  Follows
  o  Friends
  o  @mentions
  o  #tags
•  Directed
• 82. Twitter Networks - Definitions
•  In-degree:
  o  Followers
•  Out-degree:
  o  Friends/Follow
•  Centrality measures
•  Hubs & Authorities
  o  Hubs/directories tell us where authorities are
  o  "Of Mortals & Celebrities" is more "Twitter-style"
• 83. Twitter Networks - Properties
•  Concepts from citation networks:
  o  Cocitation
    §  Common papers that cite a paper
    §  Common followers: C & G (followed by F & H)
  o  Bibliographic coupling
    §  Cite the same papers
    §  Common friends (i.e. follow the same person): D, E, F & H
[Figure: a small directed follower graph over nodes A through N]
• 84. Twitter Networks - Properties
•  Concepts from citation networks:
  o  Cocitation
    §  Common papers that cite a paper
    §  Common followers: C & G (followed by F & H)
  o  Bibliographic coupling
    §  Cite the same papers
    §  Common friends (i.e. follow the same person)
•  D, E, F & H follow C
•  H & F follow C & G
  o  So H & F have high coupling
  o  Hence, if H follows A, we can recommend that F follow A
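The coupling argument on this slide is easy to make concrete in plain Python. A small sketch (the edge set mirrors the slide's example; the helper names are my own):

```python
# Bibliographic coupling on a follow graph: users who follow many of the
# same accounts are "coupled", so one's follows can be recommended to the
# other. Edges taken from the slide's example.
follows = {                 # user -> set of accounts they follow
    "D": {"C"}, "E": {"C"},
    "F": {"C", "G"},
    "H": {"C", "G", "A"},
}

def coupling(u, v):
    return len(follows[u] & follows[v])   # number of shared friends

def recommend(u, v):
    """Accounts v follows that u doesn't: candidates for u."""
    return follows[v] - follows[u]

print(coupling("H", "F"))      # 2 -> high coupling
print(recommend("F", "H"))     # {'A'} -> recommend F follow A
```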
• 85. Twitter Networks - Properties
•  Bipartite/Affiliation networks
  o  Two disjoint subsets
  o  The bipartite concept is very relevant to the Twitter social graph
  o  Membership in Lists
    §  lists vs. users bipartite graph
  o  Common #tags in tweets
    §  #tags vs. members bipartite graph
  o  @mentioned together
    §  ? Can this be a bipartite graph?
    §  ? How would we fold this?
• 86. Other Metrics & Mechanisms
•  Kronecker graph models
  o  The Kronecker product is a way of generating self-similar matrices
  o  Prof. Leskovec et al. define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
  o  Application: generating models for analysis, prediction, anomaly detection, et al.
•  Erdos-Renyi random graphs
  o  Easy to build a G(n,p) graph
  o  Assumes equal likelihood of edges between two nodes
  o  In a Twitter social network, we can create a more realistic expected distribution (adding the "social reality" dimension) by inspecting the #tags & @mentions
•  Network diameter
•  Weak ties
•  Follower velocity (+ve & -ve), association strength
  o  Unfollow is not a reliable measure
  o  But an interesting property to investigate when it happens
Not covered here, but potential for an encore!
Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
• 87. Twitter Networks - Properties
•  Twitter != LinkedIn, Twitter != Facebook
•  Twitter network == interest network
•  Be cognizant of the above when you apply traditional network properties to Twitter
•  For example:
  o  Six degrees of separation doesn't make sense (most of the time) in Twitter - except maybe for cliques
  o  Is diameter a reliable measure for a Twitter network? Probably not
  o  Do cut sets make sense? Probably not
  o  But citation network principles do apply; we can learn from cliques
  o  Bipartite graphs do make sense
• 88. Cliques (1 of 2)
•  "Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other"
•  A cohesive subgroup, closely connected
•  Near-cliques rather than a perfect clique (k-plex, i.e. connected to at least n-k others)
•  Use k-plex cliques to discover subgroups in a sparse network; a 1-plex being the perfect clique
Ref: Networks, An Introduction - Newman
• 89. Cliques (2 of 2)
•  k-core: at least k others in the subset; an (n-k)-plex
•  k-clique: no more than k distance away
  o  Path inside or outside the subset
  o  k-clan or k-club (path inside the subset)
•  We will apply k-plex cliques for one of our hands-on exercises
Ref: Networks, An Introduction - Newman
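To make the clique definitions concrete, here is a tiny maximal-clique finder in plain Python (a Bron-Kerbosch sketch of my own, not code from the deck; a k-plex search would relax the "connected to every other member" test):

```python
# Bron-Kerbosch maximal-clique enumeration on an undirected graph
# stored as adjacency sets. Toy graph: a 3-clique {a,b,c} plus pendant d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
}

def cliques(r, p, x):
    """Yield maximal cliques extending r, with candidates p, excluded x."""
    if not p and not x:
        yield r
    for v in list(p):
        yield from cliques(r | {v}, p & adj[v], x & adj[v])
        p = p - {v}
        x = x | {v}

maximal = [sorted(c) for c in cliques(set(), set(adj), set())]
print(sorted(maximal))        # [['a', 'b', 'c'], ['c', 'd']]
```

At the ~980,000-user scale of the later slides you would want a library implementation (e.g. networkx's find_cliques) rather than this toy.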
• 90. Sentiment Analysis
•  Sentiment analysis is important & interesting work on the Twitter platform:
  o  Collect tweets
  o  Opinion estimation: pass through a classifier, sentiment lexicons
    §  Naive Bayes / Max Entropy classifier / SVM
  o  Aggregated text sentiment / moving average
•  I chose not to dive deeper because of time constraints
  o  Couldn't do justice to the API, social network analysis and sentiment analysis, all in 3 hrs
•  The next 3 slides have a couple of interesting examples
• 91. Sentiment Analysis
•  Twitter mining for airline sentiment
•  Opinion lexicon: +ve 2000, -ve 4800
http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
• 92. Need I say more?
"A bit of clever math can uncover interesting patterns that are not visible to the human eye"
http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf
• 95. Interesting Vectors of Exploration
1. Find trending #tags & then related #tags, using cliques over co-#tag-citation, which infers topics related to trending topics
2. Related #tag topics over a set of tweets by a user or group of users
3. Analysis of in/out flow, tweet flow
  o  Frequent @mentions
4. Find affiliation networks by list memberships, #tags or frequent @mentions
• 96. Interesting Vectors of Exploration
5. Use centrality measures to determine mortals vs. celebrities
6. Classify tweet networks/cliques based on message-passing characteristics
  o  Tweets vs. retweets, number of retweets, ...
7. Retweet network
  o  Measure influence by retweet count & frequency
  o  Information contagion, by looking at different retweet network subcomponents: who, when, how much, ...
• 97. Twitter Network Graph Analysis - An Example
• 98. Analysis Story Board
•  @clouderati is a popular cloud-related Twitter account
•  Goals:
  o  In this tutorial: analyze the social graph characteristics of the users who are following the account
    §  Dig one level deep, to the followers & friends of the followers of @clouderati
  o  For you to explore:
    §  How many cliques? How strong are they?
    §  Does the @mention network support the clique inferences?
    §  What are the retweet characteristics?
    §  What does the #tag network graph look like?
• 99. Twitter Analysis Pipeline Story Board - Stages, Strategies, APIs & Tasks
•  Stage 3:
  o  Get the distinct user list by applying the set(union(list)) operation
•  Stage 4:
  o  Get & store user details (distinct user list)
  o  Unroll
  o  Note: the unroll stage took time to manage scale (~980,000 users)
•  Stage 5:
  o  For each @clouderati follower, find the friend = follower intersection set
  o  Note: needed a command buffer for restarts & missteps
•  Stage 6:
  o  Create the social graph
  o  Apply network theory
  o  Infer cliques & other properties
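The "distinct user list" step of Stage 3 is just a set-union fold over every follower/friend id list collected earlier. A minimal sketch with invented ids:

```python
# set(union(list)) over the per-user id lists from the crawl stage.
# The id lists below are illustrative only.
id_lists = [
    [101, 102, 103],        # followers of user A
    [102, 104],             # friends of user A
    [103, 104, 105],        # followers of user B
]

distinct_users = set().union(*id_lists)
print(sorted(distinct_users))   # [101, 102, 103, 104, 105]
```

The same one-liner scales to the deck's ~980,000-user case; only memory, not logic, changes.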
• 100. @clouderati Twitter Social Graph
•  Stats (in retrospect, after the runs):
  o  Stage 1: @clouderati has 2072 followers
  o  Stage 2: limiting followers to 5,000 per user
  o  Stage 3: digging to the 1st level (set union of the followers & friends of the followers of @clouderati) explodes into ~980,000 distinct users
  o  MongoDB of the cache and intermediate datasets: ~10 GB
  o  The database was hosted at AWS (Hi-Mem XLarge, m2.xlarge), 8 x 15 GB, RAID 10, opened to the Internet with DB authentication
• 101. Code & Run Walk-Through - Stage 1
•  Get @clouderati followers; store in MongoDB
•  Code:
  o  oscon_2012_user_list_spider_01.py
•  Challenges:
  o  Nothing fancy: get the record and store it
•  Interesting points:
  o  Would have had to recurse through a REST cursor if there were more than 5000 followers
  o  @clouderati has 2072 followers
• 102. Code & Run Walk-Through - Stage 2
•  Crawl 1 level deep: get friends & followers; validate, re-crawl & refresh
•  Code:
  o  oscon_2012_user_list_spider_02.py
  o  oscon_2012_twitter_utils.py
  o  oscon_2012_mongo.py
  o  oscon_2012_validate_dataset.py
•  Challenges:
  o  Multiple runs, errors, et al.!
•  Interesting points:
  o  Set operation between two Mongo collections for the restart buffer
  o  Protected users; some had 0 followers or 0 friends
  o  Interesting operations for validate, re-crawl and refresh
  o  Added "status_code" to differentiate protected users
    §  {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}}
  o  Getting the friends & followers of 2000 users is the hardest part (or so I thought, until I got through the next stage!)