The Art of Social Media Analysis with Twitter & Python

krishna sankar
@ksankar
http://www.oscon.com/oscon2012/public/schedule/detail/23130
Intro
•  House Rules (1 of 2)
   o  Doesn't assume any knowledge of the Twitter API
   o  Goal: Get everybody on the same page & a working knowledge of the Twitter API
   o  To bootstrap your exploration into Social Network Analysis & Twitter
   o  Simple programs, to illustrate usage & data manipulation

[Diagram: the Twitter Network Analysis Pipeline – API & Objects; @mention network; NLP, NLTK & sentiment analysis; cliques & the social graph; retweet analytics; #tag network; information contagion; growth & weak ties. We will analyze @clouderati: 2072 followers, exploding to ~980,000 distinct users one level down.]
Intro
•  House Rules (2 of 2)
   o  Am using the requests library
   o  There are good Twitter frameworks for Python, but I wanted to build from the basics. Once one understands the fundamentals, frameworks can help
   o  Many areas to explore – not enough time. So decided to focus on the social graph, cliques & networkx

[Diagram repeated: the Twitter Network Analysis Pipeline, as on the previous slide.]
About Me
•  Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com
   o  Co-chair – 2012 IEEE Precision Time Synchronization
      •  http://www.ispcs.org/2012/index.html
   o  Blog: http://doubleclix.wordpress.com/
   o  Quora: http://www.quora.com/Krishna-Sankar
•  Prior Gigs
   o  Lead Architect (Egnyte)
   o  Distinguished Engineer (CSCO)
   o  Employee #64439 (CSCO) to #39 (Egnyte) & now #9!
•  Current Focus:
   o  Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
   o  Big Data (more of variety, variability, context & graphs than volume or velocity – so far!)
   o  Overlay-based semantic search & ranking
•  Other related Presentations
   o  http://goo.gl/P1rhc Big Data Engineering Top 10 Pragmatics (Summary)
   o  http://goo.gl/0SQDV The Art of Big Data (Detailed)
   o  http://goo.gl/EaUKH The Hitchhiker's Guide to Kaggle, OSCON 2011 Tutorial
Twitter Tips – A Baker's Dozen
1.  Twitter APIs are (more or less) congruent & symmetric
2.  Twitter is usually right & simple - recheck when you get unexpected results before blaming Twitter
    o  I was getting numbers when I was expecting screen_names in user objects.
    o  Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong – screen_name instead of user_id
    o  Always test with one or two records before a long run! - learned the hard way
3.  Twitter APIs are very powerful – consistent use can bear huge data
    o  In a week, you can pull in 4-5 million users & some tweets!
    o  Night runs are far faster & more error-free
4.  Use a NOSQL data store as a command buffer & data buffer (see the sketch below)
    o  Would make it easy to work with Twitter at scale
    o  I use MongoDB
    o  Keep the schema simple & no fancy transformation
       •  And as far as possible, the same as the (json) response
    o  Use the NOSQL CLI for trimming records et al
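A minimal sketch of tip #4, assuming 2012-era pymongo & a local MongoDB; the collection names (cmd_buffer, followers_raw) are illustrative, not from the tutorial code:

    # Sketch: MongoDB as command buffer & data buffer
    import json
    import pymongo
    import requests

    db = pymongo.MongoClient()['oscon2012']

    def enqueue(user_ids):
        # command buffer: one pending document per API call we plan to make
        for uid in user_ids:
            db.cmd_buffer.insert({'user_id': uid, 'state': 'pending'})

    def drain():
        # data buffer: store the (json) response as-is - no fancy transformation
        for cmd in db.cmd_buffer.find({'state': 'pending'}):
            r = requests.get('https://api.twitter.com/1/followers/ids.json',
                             params={'user_id': cmd['user_id']})
            db.followers_raw.insert({'user_id': cmd['user_id'],
                                     'response': json.loads(r.text)})
            db.cmd_buffer.update({'_id': cmd['_id']}, {'$set': {'state': 'done'}})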
Twitter Tips – A Baker's Dozen
5.  Always use a big data pipeline (see the sketch after this list)
    o  Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
    o  That way you can orthogonally extend, with functional components like command buffers, validation et al
6.  Use a functional approach for a scalable pipeline
    o  Compose your big data pipeline with well-defined granular functions, each doing only one thing
    o  Don't overload the functional components (i.e. no collect, unroll & store as a single component)
    o  Have well-defined functional components with appropriate caching, buffering, checkpoints & restart techniques
       •  This did create some trouble for me, as we will see later
7.  Crawl-Store-Validate-Recrawl-Refresh cycle
    o  The equivalent of traditional ETL
    o  The validation stage & validation routines are important
       •  Cannot expect perfect runs
       •  Cannot manually look at data either, when data is at scale
8.  Have control numbers to validate runs & monitor them
    o  I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs!
    o  There will be a separate printout of the control numbers that will be kept in the operations files
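A minimal sketch of tips 5 & 6 – the pipeline as a composition of granular, single-purpose functions; the stage names & bodies are illustrative placeholders:

    # Sketch: compose the pipeline from granular stages, each doing one thing,
    # so stages can be swapped, checkpointed or parallelized independently.
    def collect(user_id):
        # one API call, nothing else (stubbed data here)
        return {'user_id': user_id, 'follower_ids': [1, 2, 3]}

    def unroll(doc):
        # flatten the array into one document per follower - a separate stage,
        # deliberately not folded into collect()
        return [{'user_id': doc['user_id'], 'follower_id': f}
                for f in doc['follower_ids']]

    def store(docs):
        # persist; a real run would write to MongoDB here
        print(docs)
        return docs

    def run_pipeline(seed_ids, stages):
        for uid in seed_ids:
            data = uid
            for stage in stages:
                data = stage(data)

    run_pipeline([101, 102], [collect, unroll, store])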
Twitter Tips – A Baker's Dozen
9.  Program defensively (see the sketch after this list)
    o  More so for REST-based big data analytics systems
    o  Expect failures at the transport layer & accommodate for them
10. Have Erlang-style supervisors in your pipeline
    o  Fail fast & move on
    o  Don't linger and try to fix errors that cannot be controlled at that layer
    o  A higher-layer process will circle back and do incremental runs to correct missing spiders and crawls
    o  Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
    o  I have an example in part 2
11. Data will never be perfect
    o  Know your data & accommodate for its idiosyncrasies
       •  for example: 0 followers, protected users, 0 friends, …
Twitter Tips – A Baker's Dozen
12. Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
    o  See a MongoDB example in Part 2
13. Don't bombard the URL (see the sketch after this list)
    o  Wait a few seconds between successive calls. This will end up as a scalable system, eventually
    o  I found 10 seconds to be the sweet spot. 5 seconds gave retry errors. Was able to work with 5 seconds with wait & retry. Then, the rate limit started kicking in!
14. Always measure the elapsed time of your API runs & processing
    o  Kind of an early warning when something is wrong
15. Develop incrementally; don't fail to check "cut & paste" errors
Twitter Tips – A Baker's Dozen
16. The Twitter big data pipeline has lots of opportunities for parallelism
    o  Leverage data parallelism frameworks like MapReduce
    o  But first:
       §  Prototype as a linear system,
       §  Optimize and tweak the functional modules & cache strategies,
       §  Note down stages and tasks that can be parallelized and
       §  Then parallelize them
    o  For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
17. Pay attention to handoffs between stages (see the sketch after this list)
    o  They might require transformation – for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
    o  But resist the urge to overload collect with transform
       o  i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents
    o  Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18. Have a good log management system to capture and wade through logs
Twitter Tips – A Baker's Dozen
19. Understand the underlying network characteristics for the inference you want to make
    o  Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
    o  The Twitter Network is more of an Interest Network
    o  So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
    o  But others, like Cliques and Bipartite Graphs, do (see the sketch below)
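A minimal clique sketch with networkx on a toy graph; the edges are made up (the tutorial's real graph comes from @clouderati's followers):

    # Sketch: maximal cliques on a toy follower graph
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')])

    for clique in nx.find_cliques(G):    # maximal cliques
        print(clique)                    # e.g. ['a', 'b', 'c'] and ['c', 'd']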
Twitter Gripes
1.  Need richer APIs for #tags
    o  Somewhat similar to users, viz. followers, friends et al
    o  Might make sense to make #tags a top-level object with its own semantics
2.  HTTP Error Return is not uniform
    o  Returns 400 Bad Request instead of 420
    o  Granted, there is enough information to figure this out
3.  Need an easier way to get screen_name from user_id
4.  "following" vs. "friends_count", i.e. "following" is a dummy variable.
    o  There are a few like this, most probably for backward compatibility
5.  Parameter Validation is not uniform
    o  Gives "404 Not Found" instead of "406 Not Acceptable" or "413 Too Long" or "416 Range Unacceptable"
6.  Overall, more validation would help
    o  Granted, it is more growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
A Fork
•  NLP & deep into Tweets: NLTK, Sentiment Analysis
   •  Not enough time for both
   •  I chose the Social Graph route
A minute about Twitter as a platform & its evolution
•  "... we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you're on Twitter.com or elsewhere on the web" – Michael
   o  https://dev.twitter.com/blog/delivering-consistent-twitter-experience
•  "The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community." – Chenda, CBS News
My Wish & Hope
•  I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
•  I did like the fact that tweets were part of LinkedIn. I still used Twitter more than LinkedIn
   o  I don't think showing Tweets in LinkedIn took anything away from the Twitter experience
   o  The LinkedIn experience & the Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that
•  I sincerely hope that the platform grows with a rich developer ecosystem
•  An orthogonally extensible platform is essential
•  Of course, along with a congruent user experience – "… core Twitter consumption experience through consistent tools"
Setup
•  For Hands-on Today
   o  Python 2.7.3
   o  easy_install -v requests
      •  http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
   o  easy_install -v requests-oauth
   o  Hands-on programs at https://github.com/xsankar/oscon2012-handson
•  For advanced data science with social graphs
   o  easy_install -v networkx
   o  easy_install -v numpy
   o  easy_install -v nltk
      •  Not for this tutorial, but good for sentiment analysis et al
   o  MongoDB
      •  I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS
   o  graphviz - http://www.graphviz.org/; easy_install pygraphviz
   o  easy_install pydot
Thanks To these Giants …
Problem Domain For this tutorial
•  Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
   o  Not for Twitter-based apps for real-time tweets
   o  Not web sites with real-time tweets
•  By looking at the domain in aggregate to derive inferences & actionable recommendations
•  Which also means you need to be deliberate & systemic (i.e. not look at a fluctuation as a trend, but dig deeper before pronouncing a trend)
Agenda
I.   Mechanics: Twitter API (1:30 PM - 3:00 PM)
     o  Essential Fundamentals (Rate Limit, HTTP Codes et al)
     o  Objects
     o  API
     o  Hands-on (2:45 PM - 3:00 PM)
II.  Break (3:00 PM - 3:30 PM)
III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
     o  Underlying Concepts
     o  Social Graph Analysis of @clouderati
        §  Stages, Strategies & Tasks
        §  Code Walk-thru
Open  This  First
Twitter API: Read These First
•  Using the Twitter Brand
   o  New logo & associated guidelines: https://twitter.com/about/logos
   o  Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
   o  Developer Rules of the Road: https://dev.twitter.com/terms/api-terms
•  Read These Links First
   1.  https://dev.twitter.com/docs/things-every-developer-should-know
   2.  https://dev.twitter.com/docs/faq
   3.  Field Guide to Objects: https://dev.twitter.com/docs/platform-objects
   4.  Security: https://dev.twitter.com/docs/security-best-practices
   5.  Media Best Practices: https://dev.twitter.com/media
   6.  Consolidated Page: https://dev.twitter.com/docs
   7.  Streaming APIs: https://dev.twitter.com/docs/streaming-apis
   8.  How to Appeal (Not that you all would need it!): https://support.twitter.com/articles/72585
•  Only one version of the Twitter APIs
API Status Page
•  https://dev.twitter.com/status
•  https://dev.twitter.com/issues
•  https://dev.twitter.com/discussions

https://dev.twitter.com/status
http://www.buzzfeed.com/tommywilhelm/google-users-being-total-dicks-about-the-twitter
Open This First
•  Install pre-reqs as per the setup slide
•  Run
   o  oscon2012_open_this_first.py
   o  To test connectivity – a "canary query"
•  Run
   o  oscon2012_rate_limit_status.py
   o  Use http://www.epochconverter.com to check reset_time
•  Formats: xml, json, atom & rss
Twitter API

[Diagram: three API families. Twitter REST – core data & core Twitter objects: build profile, create/post tweets, reply, favorite, re-tweet; rate limit: 150/350. Twitter Search – Search & Trend: keywords, specific user, trends; rate limit: complexity & frequency. Streaming – near-realtime, high volume: follow users, topics, data mining; Public Streams, User Streams, Site Streams, Firehose.]
Rate  Limit
Rate Limits
•  By API type & Authentication Mode

   API        | No authC                | authC  | Error
   -----------|-------------------------|--------|------
   REST       | 150/hr                  | 350/hr | 400
   Search     | Complexity & Frequency  | -N/A-  | 420
   Streaming  | Up to 1%                |        |
   Firehose   | none                    | none   |
Rate Limit Header
{
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "149",
  "x-ratelimit-reset": "1340467358",
  "x-runtime": "0.04144",
  "x-transaction": "2b49ac31cf8709af",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba"
}
Rate Limit-ed Header
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "150",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:48:25 GMT",
  "expires": "Wed, 04 Jul 2012 00:53:25 GMT",
  "server": "tfe",
  …
  "status": "400 Bad Request",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341363230",
  "x-runtime": "0.01126"
}
Rate Limit Example
•  Run
   o  oscon2012_rate_limit_02.py
•  It iterates through a list to get followers
•  List is 2072 long

{
  …
  "date": "Wed, 04 Jul 2012 00:54:16 GMT",
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.02768",
  "x-transaction": "f1bafd60112dddeb",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}
Last time, it gave me 5 min. Now the reset timer is 1 hour. 150 calls, not authenticated.

{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:55:04 GMT",
  …
  "status": "400 Bad Request",
  "transfer-encoding": "chunked",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.01342"
}
And the Rate Limit kicked in.
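A minimal sketch of reacting to these headers mid-run, assuming a fetch function that returns the requests response:

    # Sketch: watch x-ratelimit-remaining while iterating a follower list
    import time

    def crawl_with_headers(user_ids, fetch):
        for uid in user_ids:
            r = fetch(uid)
            remaining = int(r.headers['x-ratelimit-remaining'])
            reset_at = int(r.headers['x-ratelimit-reset'])    # epoch seconds
            if remaining == 0:
                time.sleep(max(0, reset_at - time.time()))    # wait out the window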
  
API with OAuth
{
  …
  "date": "Wed, 04 Jul 2012 01:32:01 GMT",
  "etag": ""dd419c02ed00fc6b2a825cc27wbe040"",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  "last-modified": "Wed, 04 Jul 2012 01:32:01 GMT",
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-access-level": "read",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  "x-ratelimit-reset": "1341369121",
  "x-runtime": "0.05539",
  "x-transaction": "9f8508fe4c73a407",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}
With OAuth, the rate-limit class is "api_identified": 350 calls, 1 hr reset.

{
  …
  "date": "Thu, 05 Jul 2012 14:56:05 GMT",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "133",
  "x-ratelimit-reset": "1341500165",
  …
}
******** 2416
{
  …
  "date": "Thu, 05 Jul 2012 14:56:18 GMT",
  …
  "status": "200 OK",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  "x-ratelimit-reset": "1341503776",     <- +1 hour
}
******** 2417
The Rate Limit resets during consecutive calls.
Unexplained Errors
Traceback (most recent call last):
  File "oscon2012_get_user_info_01.py", line 39, in <module>
    r = client.get(url, params=payload)
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
  File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C… (long user_id list elided)

•  While trying to get details of 1,000,000 users, I get this error – usually 10-6 AM PST
•  Got around it by "trap & wait 5 seconds"
•  Night runs are relatively error-free
  
A Day in the life of the Twitter Rate Limit
{
  …
  "date": "Fri, 06 Jul 2012 03:41:09 GMT",
  "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
  "server": "tfe",
  "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
  "status": "400 Bad Request",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341546334",     <- Missed by 4 min!
  "x-runtime": "0.01918"
}
Error, sleeping
{
  …
  "date": "Fri, 06 Jul 2012 03:46:12 GMT",
  …
  "status": "200 OK",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",        <- OK after 5 min sleep
  …
}
Strategies
I have no exotic strategies, so far!
1.  Obvious: Track elapsed time & sleep when the rate limit kicks in
2.  Combine authenticated & non-authenticated calls
3.  Use multiple API types
4.  Cache
5.  Store & get only what is needed
6.  Checkpoint & buffer request commands
7.  Distributed data parallelism – for example, AWS instances

http://www.epochconverter.com/ <- useful to debug the timer

Please share your tips and tricks for conserving the Rate Limit
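A minimal sketch of strategies 4 & 5 – cache & keep only the fields you need; the field choice is illustrative:

    # Sketch: cache lookups so repeated runs don't burn the rate limit
    cache = {}

    def lookup(user_id, fetch):
        if user_id not in cache:
            user = fetch(user_id)                        # one precious API call
            cache[user_id] = {'id_str': user['id_str'],  # store only what's needed
                              'screen_name': user['screen_name'],
                              'followers_count': user['followers_count']}
        return cache[user_id]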
Authentication
Authentication
•  Three modes
   o  Anonymous
   o  HTTP Basic Auth
   o  OAuth
•  As of Aug 31, 2010, only Anonymous or OAuth are supported
•  OAuth enables the user to authorize an application without sharing credentials
•  Also has the ability to revoke
•  Twitter supports OAuth 1.0a
•  OAuth 2.0 is the new standard, much simpler
   o  No timeframe for Twitter support, yet
OAuth Pragmatics
•  Helpful Links
   o  https://dev.twitter.com/docs/auth/oauth
   o  https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
   o  https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
   o  http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
•  Discussion of OAuth internal mechanisms is better left for another day
•  For headless applications to get an OAuth token, go to https://dev.twitter.com/apps
•  Create an application & get four credential pieces
   o  Consumer Key, Consumer Secret, Access Token & Access Token Secret
•  All the frameworks have support for OAuth. So plug in these values & use the framework's calls
•  I used the requests-oauth library like so:
request-oauth

    import requests
    # requests-oauth provides OAuthHook; import path per its PyPI docs
    from oauth_hook import OAuthHook

    def get_oauth_client():
        # Get the token, key & secret from dev.twitter.com/apps
        consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
        consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
        access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
        access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
        header_auth = True
        oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key,
                               consumer_secret, header_auth)
        client = requests.session(hooks={'pre_request': oauth_hook})
        return client

    def get_followers(user_id):
        # anonymous call - plain requests
        url = 'https://api.twitter.com/1/followers/ids.json'
        payload = {"user_id": user_id}  # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
        r = requests.get(url, params=payload)
        return r

    def get_followers_with_oauth(user_id, client):
        # authenticated call - use the client instead of requests
        url = 'https://api.twitter.com/1/followers/ids.json'
        payload = {"user_id": user_id}  # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
        r = client.get(url, params=payload)
        return r

Ref: http://pypi.python.org/pypi/requests-oauth
OAuth Authorize Screen
•  The user authenticates with Twitter & grants access to Forbes Social
•  Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account
HTTP Status Codes
HTTP Status Codes
•  0 Never made it to the Twitter servers - library error
•  200 OK
•  304 Not Modified
•  400 Bad Request
   o  Check error message for explanation
   o  REST Rate Limit!
•  401 Unauthorized
   o  Beware – you could get this for other reasons as well
•  403 Forbidden
   o  Hit Update Limit (> max Tweets/day, following too many people)
•  404 Not Found
•  406 Not Acceptable
•  413 Too Long
•  416 Range Unacceptable
•  420 Enhance Your Calm
   o  Rate Limited
•  500 Internal Server Error
•  502 Bad Gateway
   o  Down for maintenance
•  503 Service Unavailable
   o  Overloaded "Fail whale"
•  504 Gateway Timeout
   o  Overloaded

https://dev.twitter.com/docs/error-codes-responses
HTTP Status Code - Example
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "91",
  "content-type": "application/json; charset=utf-8",
  "date": "Sat, 23 Jun 2012 00:06:56 GMT",
  "expires": "Sat, 23 Jun 2012 00:11:56 GMT",
  "server": "tfe",
  …
  "status": "401 Unauthorized",
  "vary": "Accept-Encoding",
  "www-authenticate": "OAuth realm="https://api.twitter.com"",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "0",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1340413616",
  "x-runtime": "0.01997"
}
Detailed error message in JSON! I like this:
{
  "errors": [
    {
      "code": 53,
      "message": "Basic authentication is not supported"
    }
  ]
}
HTTP Status Code – Confusing Example
•  GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
•  Spelling mistake
   o  Should be screen_name
•  But a confusing error!
•  Should be 406 Not Acceptable or 413 Too Long, showing a parameter error

{
  …
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "404 Not Found",
  …
}
{
  "errors": [
    {
      "code": 34,
      "message": "Sorry, that page does not exist"
    }
  ]
}
  
HTTP Status Code - Example
{
  "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
  "content-encoding": "gzip",
  "content-length": "112",
  "content-type": "application/json;charset=utf-8",
  "date": "Sat, 23 Jun 2012 01:23:47 GMT",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  …
  "status": "401 Unauthorized",
  "www-authenticate": "OAuth realm="https://api.twitter.com"",
  "x-frame-options": "SAMEORIGIN",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1340417742",
  "x-transaction": "d545a806f9c72b98"
}
{
  "error": "Not authorized",
  "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
}
Sometimes, the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12 – clearly a parameter error (i.e. more parameters than the call takes).
  
Objects
Twitter Platform Objects

[Diagram: Users follow / are followed by other Users (Friends & Followers); a Status Update creates a Tweet; Tweets embed Entities (@user_mentions, urls, media, # hashtags) and Places; temporally ordered Tweets form a TimeLine.]

https://dev.twitter.com/docs/platform-objects
Tweets
•  A.k.a Status Updates
•  Interesting fields
   o  coordinates <- geo location
   o  created_at
   o  entities (will see later)
   o  id, id_str
   o  possibly_sensitive
   o  user (will see later)
      •  perspectival attributes embedded within a child object of an unlike parent – hard to maintain at scale
      •  https://dev.twitter.com/docs/faq#6981
   o  withheld_in_countries
      •  https://dev.twitter.com/blog/new-withheld-content-fields-api-responses

https://dev.twitter.com/docs/platform-objects/tweets
A word about id, id_str
•  June 1, 2010
   o  Snowflake, the id generator service
   o  "The full ID is composed of a timestamp, a worker number, and a sequence number"
   o  JavaScript had problems handling numbers > 53 bits
   o  "id": 819797
   o  "id_str": "819797"

http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
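A quick Python illustration of the 53-bit problem; the id below is made up:

    # Doubles (what JavaScript uses for all numbers) hold only 53 bits of
    # integer precision - hence id_str.
    tweet_id = 2**60 + 1                       # a >53-bit id, like Snowflake makes
    print(int(float(tweet_id)) == tweet_id)    # False - precision lost in a double
    print(str(tweet_id))                       # the "id_str" form survives verbatim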
Tweets - Example
•  Let us run oscon2012-tweets.py
•  Example of a tweet
   o  coordinates
   o  id
   o  id_str
Users
•  followers_count
•  geo_enabled
•  id, id_str
•  name, screen_name
•  protected
•  status, statuses_count
•  withheld_in_countries

https://dev.twitter.com/docs/platform-objects/users
Users – Let us run some examples
•  Run
   o  oscon_2012_users.py
      •  Lookup users by screen_name
   o  oscon12_first_20_ids.py
      •  Lookup users by user_id
•  Inspect the results
   o  id, name, status, status_count, protected, followers (for top 10 followers), withheld users
•  Can use the information for customizing the user's screen in your web app
Entities
•  Metadata & contextual information
•  You can parse them out of the text yourself, but Entities deliver them as structured data
•  REST API/Search API – include_entities=1
•  Streaming API – included by default
•  hashtags, media, urls, user_mentions

https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
Entities
•  Run
   o  oscon2012_entities.py
•  Inspect hashtags, urls et al
Places
•  attributes
•  bounding_box
•  id (as a string!)
•  country
•  name

https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes
Places
•  Can search for tweets near a place like so (see the sketch below):
•  Get the latlong of the convention center [45.52929, -122.66289]
   o  Tweets near that place
•  Tweets near San Jose [37.395715, -122.102308]
•  We will not go further here, but this is very useful
Timelines
•  Collections of tweets ordered by time
•  Use max_id & since_id for navigation

https://dev.twitter.com/docs/working-with-timelines
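A minimal max_id paging sketch against the v1 user_timeline endpoint; error handling omitted:

    # Sketch: walk a user timeline backwards with max_id
    import json
    import requests

    def timeline(screen_name):
        url = 'https://api.twitter.com/1/statuses/user_timeline.json'
        params = {'screen_name': screen_name, 'count': 200}
        while True:
            tweets = json.loads(requests.get(url, params=params).text)
            if not tweets:
                break
            for t in tweets:
                yield t
            # next page: everything strictly older than the oldest id seen
            params['max_id'] = tweets[-1]['id'] - 1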
Other Objects & APIs
•  Lists
•  Notifications
•  Friendships/exists to see if one follows the other
Twitter Platform Objects

[Diagram repeated: Twitter Platform Objects, as on the earlier Objects slide.]

https://dev.twitter.com/docs/platform-objects
Hands-on Exercise (15 min)
•  Setup environment – slide #14
•  Sanity-check environment & libraries
   o  oscon2012_open_this_first.py
   o  oscon2012_rate_limit_status.py
•  Get objects (show calls)
   o  Lookup users by screen_name - oscon12_users.py
   o  Lookup users by id - oscon12_first_20_ids.py
   o  Lookup tweets - oscon12_tweets.py
   o  Get entities - oscon12_entities.py
•  Inspect the results
•  Explore a little bit
•  Discussion
Twitter APIs
Twitter API

[Diagram repeated: Twitter REST (core data & core Twitter objects – build profile, create/post tweets, reply, favorite, re-tweet; rate limit: 150/350), Twitter Search (Search & Trend – keywords, specific user, trends; rate limit: complexity & frequency) and Streaming (near-realtime, high volume – follow users, topics, data mining; Public Streams, User Streams, Site Streams, Firehose).]
Twitter REST API
•  https://dev.twitter.com/docs/api
•  What we have been doing so far is the REST API
•  Request-Response
•  Anonymous or OAuth
•  Rate Limited:
   o  150/350
Twitter Trends
•  oscon2012-trends.py
•  Trends/weekly, Trends/monthly
•  Let us run some examples (a sketch of their shape follows this list)
   o  oscon2012_trends_daily.py
   o  oscon2012_trends_weekly.py
•  Trends & hashtags
   o  #hashtag euro2012
   o  http://hashtags.org/euro2012
   o  http://sproutsocial.com/insights/2011/08/twitter-hashtags/
   o  http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
   o  Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/
Brand Rank w/ Twitter
•  Walk through & results of the following
    o  oscon2012_brand_01.py
•  Followed 10 user-brands for a few days to find growth (a collection sketch follows this list)
•  Brand Rank
    o  Growth of a brand w.r.t. the industry
    o  Surge in popularity – could be due to –ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
•  API : url='https://api.twitter.com/1/users/lookup.json'
•  payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
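A minimal sketch of the daily collection behind a brand-rank chart (my reconstruction, not oscon2012_brand_01.py itself); one run per day yields a (date, brand, followers) sample, and growth is the day-over-day % change:

    import datetime
    import json
    import requests

    url = 'https://api.twitter.com/1/users/lookup.json'
    payload = {'screen_name': 'miamiheat,okcthunder,nba,uefacom,lovelaliga,'
                              'FOXSoccer,oscon,clouderati,googleio,OReillyMedia'}
    r = requests.get(url, params=payload)
    today = datetime.date.today().isoformat()
    for user in json.loads(r.text):
        # one (date, brand, followers) sample per day, as CSV
        print('%s,%s,%d' % (today, user['screen_name'], user['followers_count']))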
  
Brand Rank w/ Twitter
[Chart: follower counts over time – Clouderati is very stable]
Brand Rank w/ Twitter – Tech Brands
•  Google I/O showed a spike on 6/27-6/28
•  OReillyMedia shares some of that spike
•  Looking at a few days' worth of data, our best inference is that "oscon doesn't track with googleio"
•  "Clouderati doesn't track at all"
Brand Rank w/ Twitter – World of Soccer
•  FOXSoccer & UEFAcom track each other
•  The numbers seldom decrease, so calculating –ve velocity will not work
•  OTOH, if you do see a –ve velocity, investigate
Brand Rank w/ Twitter – World of Basketball
•  NBA, MiamiHeat & okcthunder track each other
•  Used % rather than absolute numbers to compare
•  The hike from 7/6 to 7/10 is interesting
Brand Rank w/ Twitter – Rising Tide …
•  For some reason, all numbers are going up 7/6 thru 7/10 – except for clouderati!
•  Is a rising (Twitter) tide lifting all (well, almost all) boats ?
Trivia : Search API
•  Search (search.twitter.com)
    o  Built by Summize, which was acquired by Twitter in 2008
    o  Summize described itself as "sentiment mining"
Search API
•  Very simple
    o  GET http://search.twitter.com/search.json?q=<blah>
•  Driven by search criteria
•  "The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
•  Recent = the last 6-9 days' worth of tweets
•  Anonymous call
•  Rate Limit
    o  Not no. of calls/hour, but complexity & frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search
Search API
•  Filters (a sketch follows this list)
    o  Search terms are URL encoded
    o  @ = %40, # = %23
    o  emoticons :) and :(
    o  http://search.twitter.com/search.atom?q=sometimes+%3A)
    o  http://search.twitter.com/search.atom?q=sometimes+%3A(
•  Location filters, date filters
•  Content searches
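A minimal sketch of the same search in its JSON flavor; requests does the %3A)-style URL encoding for us, and the query & rpp values are just examples:

    import json
    import requests

    r = requests.get('http://search.twitter.com/search.json',
                     params={'q': 'sometimes :)', 'rpp': 10})
    for tweet in json.loads(r.text).get('results', []):
        print(tweet['from_user'], ':', tweet['text'])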
  
Streaming API
•  Not request-response, but a stream
•  The Twitter frameworks have support for it
•  Rate Limit : up to 1%
•  Stall warning if the client is falling behind
•  Good documentation links (a consumption sketch follows this list)
    o  https://dev.twitter.com/docs/streaming-apis/connecting
    o  https://dev.twitter.com/docs/streaming-apis/parameters
    o  https://dev.twitter.com/docs/streaming-apis/processing
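Not one of the tutorial scripts, but a minimal sketch of consuming statuses/filter with requests; it assumes a requests version that supports stream=True, and the credentials & track keyword are placeholders:

    import json
    import requests

    # the connection stays open; one JSON object arrives per line
    r = requests.post('https://stream.twitter.com/1/statuses/filter.json',
                      data={'track': 'cloud'},
                      auth=('user', 'password'),  # placeholder credentials
                      stream=True)
    for line in r.iter_lines():
        if line:  # skip the keep-alive newlines sent during quiet periods
            print(json.loads(line)['text'])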
  
Firehose
•  ~400 million public tweets/day
•  If you are working with the Twitter firehose, I envy you !
•  If you hit real limits, then explore the firehose route
•  AFAIK, it is not cheap, but worth it
API Best Practices
1.  Use JSON
2.  Use user_id rather than screen_name
    o  user_id is constant while screen_name can change
3.  max_id and since_id (a sketch follows this list)
    o  For example, with direct messages : if you have the last message, use since_id for the search
    o  max_id sets how far back to go
4.  Cache as much as you can
5.  Set the User-Agent header for debugging
I have listed a few good blogs that have API best practices in the Reference section, at the end of this presentation.
These tips are gathered from various books, blogs & other media I used for this tutorial. See the References (at the end) for the sources.
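A minimal sketch of tips #3 & #5 together, shown here with the Search API rather than direct messages (which need OAuth); the query & User-Agent string are just examples:

    import json
    import requests

    def newer_than(since_id=None):
        # only fetch tweets newer than the last one we processed (tip 3)
        params = {'q': '@clouderati', 'rpp': 100}
        if since_id:
            params['since_id'] = since_id
        r = requests.get('http://search.twitter.com/search.json', params=params,
                         headers={'User-Agent': 'oscon2012-tutorial'})  # tip 5
        return json.loads(r.text).get('results', [])

    tweets = newer_than()
    if tweets:
        last_seen = tweets[0]['id']  # remember for the next run's since_id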
Twitter API
[Diagram repeated: Twitter REST, Twitter Search, Streaming & the Firehose]
Questions ?
Part II – SNA
Twitter Network Analysis
[Diagram: the five-stage pipeline]
1. Collect - 2. Store - 3. Transform & Analyze - 4. Model & Reason - 5. Predict, Recommend & Visualize
•  Validate dataset & re-crawl/refresh (feedback loop)
•  Tip 1 : Implement as a staged pipeline, never a monolith
•  Tip 3 : Keep the schema simple; don't be afraid to transform
Most important & the ugliest slide in this deck !
Trivia
•  Social Network Analysis originated as Sociometry & the social network was called a sociogram
•  Back then, Facebook was called SocioBinder !
•  Jacob Levy Moreno is considered the originator
    o  NYTimes, April 3, 1933, p. 17
Twitter Networks – Definitions
•  Nodes
    o  Users
    o  #tags
•  Edges
    o  Follows
    o  Friends
    o  @mentions
    o  #tags
•  Directed
Twitter Networks – Definitions
•  In-degree
    o  Followers
•  Out-degree
    o  Friends/Follow
•  Centrality measures
•  Hubs & Authorities
    o  Hubs/directories tell us where the authorities are
    o  "Of Mortals & Celebrities" is more "Twitter-style"
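In networkx terms (a toy sketch; the account names are made up), followers & friends map directly to the in- and out-degree of a directed graph:

    import networkx as nx

    # an edge u -> v means "u follows v"
    g = nx.DiGraph()
    g.add_edges_from([('alice', 'clouderati'), ('bob', 'clouderati'),
                      ('clouderati', 'alice')])
    print(g.in_degree('clouderati'))   # followers = in-degree  -> 2
    print(g.out_degree('clouderati'))  # friends   = out-degree -> 1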
  
Twitter Networks – Properties
•  Concepts from Citation Networks
    o  Cocitation
        •  Common papers that cite a paper
        •  Common Followers
            o  C & G (followed by F & H)
    o  Bibliographic Coupling
        •  Cite the same papers
        •  Common Friends (i.e. follow the same person)
            o  D, E, F & H
[Diagram: example follower graph with nodes A–N]
Twitter Networks – Properties
•  Concepts from Citation Networks
    o  Cocitation
        •  Common papers that cite a paper
        •  Common Followers
            o  C & G (followed by F & H)
    o  Bibliographic Coupling
        •  Cite the same papers
        •  Common Friends (i.e. follow the same person)
            o  D, E, F & H follow C
            o  H & F follow C & G
                •  So H & F have high coupling
                •  Hence, if H follows A, we can recommend that F follow A
[Diagram: the same example graph, nodes A–N]
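The slide's inference, as a small set-operation sketch over the example graph (an edge u -> v means "u follows v"):

    import networkx as nx

    g = nx.DiGraph()
    g.add_edges_from([('D', 'C'), ('E', 'C'), ('F', 'C'), ('H', 'C'),
                      ('F', 'G'), ('H', 'G'), ('H', 'A')])

    # bibliographic coupling = friends (out-neighbors) in common
    print(set(g.successors('F')) & set(g.successors('H')))  # C & G
    # H & F are highly coupled, and H follows A -> recommend A to F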
Twitter Networks – Properties
•  Bipartite/Affiliation Networks
    o  Two disjoint subsets
    o  The bipartite concept is very relevant to the Twitter social graph
    o  Membership in Lists
        •  lists vs. users bipartite graph
    o  Common #tags in tweets
        •  #tags vs. members bipartite graph
    o  @mentioned together
        •  ? Can this be a bipartite graph
        •  ? How would we fold it ? (a sketch follows)
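One way to fold such a graph, sketched with networkx's bipartite helpers (the users & #tags are toy data):

    import networkx as nx
    from networkx.algorithms import bipartite

    # users vs. #tags affiliation network
    b = nx.Graph()
    b.add_edges_from([('alice', '#cloud'), ('bob', '#cloud'),
                      ('bob', '#bigdata'), ('carol', '#bigdata')])

    # fold onto the user side: users are linked if they share a #tag
    users = bipartite.projected_graph(b, ['alice', 'bob', 'carol'])
    print(users.edges())  # [('alice', 'bob'), ('bob', 'carol')]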
  
Other Metrics & Mechanisms
•  Kronecker Graph Models
    o  The Kronecker product is a way of generating self-similar matrices
    o  Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
    o  Application : generating models for analysis, prediction, anomaly detection et al
•  Erdős–Rényi Random Graphs
    o  Easy to build a G(n,p) graph
    o  Assumes equal likelihood of edges between any two nodes
    o  In a Twitter social network, we can create a more realistic expected distribution (adding the "social reality" dimension) by inspecting the #tags & @mentions
•  Network Diameter
•  Weak Ties
•  Follower velocity (+ve & –ve), association strength
    o  Unfollow is not a reliable measure
    o  But an interesting property to investigate when it happens

Not covered here, but potential for an encore !
Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
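For reference, generating an Erdős–Rényi G(n, p) graph takes a single networkx call (the n & p here are arbitrary):

    import networkx as nx

    # G(n, p): n nodes, each possible edge present with probability p
    g = nx.gnp_random_graph(2072, 0.005, directed=True)
    print(g.number_of_nodes(), g.number_of_edges())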
Twitter Networks – Properties
•  Twitter != LinkedIn, Twitter != Facebook
•  Twitter Network == Interest Network
•  Be cognizant of the above when you apply traditional network properties to Twitter
•  For example,
    o  Six degrees of separation doesn't make sense (most of the time) in Twitter – except maybe for cliques
    o  Is diameter a reliable measure for a Twitter network ?
        •  Probably not
    o  Do cut sets make sense ?
        •  Probably not
    o  But citation network principles do apply; we can learn from cliques
    o  Bipartite graphs do make sense
Cliques (1 of 2)
•  "Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other"
•  A cohesive subgroup, closely connected
•  Near-cliques rather than a perfect clique (k-plex, i.e. connected to at least n-k others)
•  k-plex cliques to discover subgroups in a sparse network; a 1-plex being the perfect clique
Ref: Networks, An Introduction – Newman
Cliques (2 of 2)
•  k-core – connected to at least k others in the subset; an (n-k)-plex
•  k-clique – no more than distance k away
    o  Path inside or outside the subset
    o  k-clan or k-club (path inside the subset)
•  We will apply k-plex cliques in one of our hands-on exercises (a networkx sketch follows)
Ref: Networks, An Introduction – Newman
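In networkx, maximal cliques come from find_cliques (Bron–Kerbosch); a toy sketch:

    import networkx as nx

    g = nx.Graph()
    g.add_edges_from([('a', 'b'), ('a', 'c'), ('b', 'c'),  # a perfect 3-clique
                      ('c', 'd')])
    for clique in nx.find_cliques(g):  # maximal cliques
        print(clique)                  # ['a', 'b', 'c'] and ['c', 'd']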
Sentiment Analysis
•  Sentiment Analysis is an important & interesting body of work on the Twitter platform
    o  Collect tweets
    o  Opinion estimation – pass thru a classifier, sentiment lexicons
        •  Naïve Bayes/Max Entropy Classifier/SVM
    o  Aggregated text sentiment/moving average
•  I chose not to dive deeper because of time constraints
    o  Couldn't do justice to API, Social Network and Sentiment Analysis, all in 3 hrs
•  The next 3 slides have a couple of interesting examples
Sentiment Analysis
•  Twitter Mining for Airline Sentiment
•  Opinion Lexicon – +ve 2000, –ve 4800
http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
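The lexicon approach in miniature (my sketch with stand-in word lists, not the linked R example or lexicons): score = +ve word count minus –ve word count:

    pos_words = set(['good', 'great', 'love'])     # stand-ins for the lexicon
    neg_words = set(['bad', 'delayed', 'awful'])

    def score(tweet):
        words = tweet.lower().split()
        return (sum(w in pos_words for w in words)
                - sum(w in neg_words for w in words))

    print(score('love the crew, but the flight was delayed'))  # 1 - 1 = 0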
Need I say more ?
"A bit of clever math can uncover interesting patterns that are not visible to the human eye"
http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf
Project Ideas
Interesting Vectors of Exploration
1.  Find trending #tags & then related #tags – using cliques over co-#tag-citation, which infers topics related to trending topics
2.  Related #tag topics over a set of tweets by a user or group of users
3.  Analysis of in/out flow, tweet flow
    –  Frequent @mentions
4.  Find affiliation networks by list memberships, #tags or frequent @mentions
Interesting Vectors of Exploration
5.  Use centrality measures to determine mortals vs. celebrities
6.  Classify tweet networks/cliques based on message passing characteristics
    –  Tweets vs. retweets, no. of retweets,…
7.  Retweet network
    –  Measure influence by retweet count & frequency
    –  Information contagion by looking at different retweet network subcomponents – who, when, how much,…
Twitter Network Graph Analysis – An Example
Analysis Story Board
•  @clouderati is a popular cloud-related Twitter account
•  Goals :
    o  Analyze the social graph characteristics of the users who are following the account
    o  Dig one level deep, to the followers & friends, of the followers of @clouderati [In this tutorial]
    o  How many cliques ? How strong are they ?
    o  Does the @mention network support the clique inferences ?
    o  What are the retweet characteristics ? [For you to explore !!]
    o  What does the #tag network graph look like ?
Twitter Analysis Pipeline Story Board – Stages, Strategies, APIs & Tasks
•  Stage 3
    o  Get distinct user list applying the set(union(list)) operation
•  Stage 4
    o  Get & store user details (distinct user list)
    o  Unroll
    o  Note : Needed a command buffer to manage scale (~980,000 users)
    o  Note : The unroll stage took time & missteps
•  Stage 5
    o  For each @clouderati follower, find friend=follower – set intersection
•  Stage 6
    o  Create social graph
    o  Apply network theory
    o  Infer cliques & other properties
@clouderati Twitter Social Graph
•  Stats (in retrospect, after the runs) :
    o  Stage 1
        •  @clouderati has 2072 followers
    o  Stage 2
        •  Limiting followers to 5,000 per user
    o  Stage 3
        •  Digging 1 level deep (set union of followers & friends of the followers of @clouderati) explodes into ~980,000 distinct users
    o  MongoDB cache & intermediate datasets : ~10 GB
    o  The database was hosted at AWS (Hi-Mem XLarge – m2.xlarge), 8 x 15 GB, RAID 10, opened to the Internet with DB authentication
Code & Run Walk Through – Stage 1
o  Get @clouderati followers; store in MongoDB (a sketch follows this list)
o  Code :
    §  oscon_2012_user_list_spider_01.py
o  Challenges :
    §  Nothing fancy
    §  Get the record and store it
    §  Would have had to recurse through a REST cursor if there were more than 5,000 followers
    §  @clouderati has 2072 followers
o  Interesting Points :
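A minimal sketch of stage 1 (my reconstruction, not the actual script; the db & collection names are stand-ins, and the pymongo 2.x style matches the version in the logs later):

    import json
    import requests
    from pymongo import Connection  # pymongo 2.x

    db = Connection()['oscon']      # stand-in db name
    r = requests.get('https://api.twitter.com/1/followers/ids.json',
                     params={'screen_name': 'clouderati', 'cursor': '-1'})
    ids = json.loads(r.text)['ids']  # one call suffices under the 5,000 cap
    db.followers.insert({'screen_name': 'clouderati', 'follower_ids': ids})
    print(len(ids))                  # 2072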
  
Code & Run Walk Through – Stage 2
o  Crawl 1 level deep; get friends & followers; validate, re-crawl & refresh
o  Code :
    §  oscon_2012_user_list_spider_02.py
    §  oscon_2012_twitter_utils.py
    §  oscon_2012_mongo.py
    §  oscon_2012_validate_dataset.py
o  Challenges :
    §  Multiple runs, errors et al !
o  Interesting Points :
    §  Set operation between two mongo collections for the restart buffer
    §  Protected users; some had 0 followers or 0 friends
    §  Interesting operations for validate, re-crawl and refresh
    §  Added "status_code" to differentiate protected users
        §  {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}}
    §  Getting friends & followers of 2000 users is the hardest (or so I thought, until I got through the next stage!)
Validate-Recrawl-Refresh Logs
•  pymongo version = 2.2
•  Connected to DB!
•  …
•  2075
•  Error Friends : <type 'exceptions.KeyError'>
•  4ff3cd40e5557c00c7000000 - none has 2072 followers & 0 friends
•  Error Friends : <type 'exceptions.KeyError'>
•  4ff3a958e5557cfc58000000 - none has 2072 followers & 0 friends
•  Error Friends : <type 'exceptions.KeyError'>
•  4ff3ccdee5557c00b6000000 - none has 2072 followers & 0 friends
•  4ff3d3b9e5557c01b900001e - 371187804 has 0 followers & 0 friends
•  4ff3d3d8e5557c01b9000048 - 63488295 has 155 followers & 0 friends
•  4ff3d3d9e5557c01b9000049 - 342712617 has 0 followers & 0 friends
•  4ff3d3d9e5557c01b900004a - 21266738 has 0 followers & 0 friends
•  4ff3d3dae5557c01b900004b - 204652853 has 0 followers & 0 friends
•  …
•  4ff475cfe5557c1657000074 - 258944989 has 0 followers & 0 friends
•  4ff475d3e5557c165700007d - 327286780 has 0 followers & 0 friends
•  Looks like we have 132 not-so-good records
•  Elapsed Time = 0.546846

o  1st run – 132 bad records
o  This is the classic Erlang-style supervisor
o  The crawl continues on transport errors, without worrying about retry
o  Validate will recrawl & refresh as needed
Code & Run Walk Through – Stage 3
o  Get distinct user list applying the set(union(list)) operation (a sketch follows this list)
o  Code :
    §  oscon2012_analytics_01.py
o  Challenges :
    §  Figure out the right set operations
o  Interesting Points :
    §  973,323 unique users !
    §  Recursively apply set union over 400,00 lists
    §  Set operations took slightly more than a minute
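A minimal sketch of the stage-3 set union (the collection & field names are my stand-ins, not the actual script's):

    from pymongo import Connection  # pymongo 2.x

    db = Connection()['oscon']
    distinct = set()
    for doc in db.follower_friend_lists.find():
        distinct |= set(doc.get('follower_ids', []))
        distinct |= set(doc.get('friend_ids', []))
    print(len(distinct))  # 973,323 unique users in the actual run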
  	
  
Code & Run Walk Through – Stage 4
o  Get & store user details (distinct user list); unroll
o  Code :
    §  oscon2012_analytics_01.py (focus on the cmd string creation)
    §  oscon2012_get_user_info_01.py
    §  oscon2012_unroll_user_list_01.py
    §  oscon2012_unroll_user_list_02.py
o  Challenges :
    §  Where do I start ?
        •  In the next few slides
    §  Took me a few days to get it right (along with my day job!)
    §  Unfortunately I did not employ parallelism & didn't use my MacPro with 32 GB memory. So the runs were long
    §  But I learned hard lessons on checkpoint & restart
o  Interesting Points :
    §  Tracking control numbers
    §  Time … a marathon unroll run of 19:33:33 !
Twitter @ scale Pattern
•  Challenge :
    o  You want to get screen names, follower counts and other details for a million users
•  Problem :
    o  No easy REST API
    o  https://api.twitter.com/1/users/lookup.json will take 100 user_ids and give the details
•  Solution :
    o  This is a scalability challenge. Approach it like so (a sketch follows this list) :
    o  Create a command buffer collection in MongoDB, splitting the million user_ids into batches of 100
    o  Have a "done" flag initialized to 0 for checkpoint & restart
    o  After each cmd str is executed, set "done":1
    o  For subsequent runs, ignore "done":1
    o  This also helps in control number tracking
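A sketch of that pattern; the api_str collection with its seq_no & done fields matches the logs on the next slides, while the rest (db name, the stand-in ids) is my assumption:

    from pymongo import Connection  # pymongo 2.x

    db = Connection()['oscon']
    user_ids = range(1000)  # stand-in for the ~973K distinct ids

    # build the command buffer : one doc per batch of 100, done=0
    for seq, i in enumerate(range(0, len(user_ids), 100)):
        batch = ','.join(str(u) for u in user_ids[i:i + 100])
        db.api_str.insert({'seq_no': seq, 'api_str': batch, 'done': 0})

    # a re-startable run only picks up work that is not done yet
    for cmd in db.api_str.find({'done': 0}):
        # ... call users/lookup.json with user_id=cmd['api_str'] here ...
        db.api_str.update({'_id': cmd['_id']}, {'$set': {'done': 1}})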
  
Control Numbers
•  > db.t_users_info.count()
•  8122
•  > db.api_str.count({"done":0,"seq_no":{"$lt":8185}})
•  63
•  > db.api_str.find({"done":0,"seq_no":{"$lt":8185}},{"seq_no":1})
•  { "_id" : ObjectId("4ff4daeae5557c28bf001d53"), "seq_no" : 5433 }
•  { "_id" : ObjectId("4ff4daeae5557c28bf001d59"), "seq_no" : 5439 }
•  { "_id" : ObjectId("4ff4daeae5557c28bf001d5f"), "seq_no" : 5445 }
•  { "_id" : ObjectId("4ff4daebe5557c28bf001d74"), "seq_no" : 5466 }
•  { "_id" : ObjectId("4ff4daece5557c28bf001d7a"), "seq_no" : 5472 }
•  { "_id" : ObjectId("4ff4daece5557c28bf001d80"), "seq_no" : 5478 }
•  { "_id" : ObjectId("4ff4daede5557c28bf001d90"), "seq_no" : 5494 }
•  { "_id" : ObjectId("4ff4daefe5557c28bf001daf"), "seq_no" : 5525 }

o  The collection should have 8185 documents
o  But it has only 8122. Where did the rest go ?
o  63 of them still have done=0
o  8122 + 63 = 8185 !
o  Aha, mystery solved. They fell through the cracks
o  Need a catch-all final run
Day in the life of a Control Number Detective – Run #1
•  Remember : 973,323 users. So, 9734 cmd strings (100 users per string)
•  > db.api_str.count()
•  9831
•  > db.api_str.count({"done":0})
•  239
•  > db.t_users_info.count()
•  9592
•  > db.api_str.count({"api_str":""})
•  97
•  So we should have 9831 – 97 = 9734 records
•  The second run should generate 9734 - 9592 = 142 calls (i.e. 350 - 142 = 208 rate-limit should remain). Let us see.
•  {
•    …
•    "x-ratelimit-class": "api_identified",
•    "x-ratelimit-limit": "350",
•    "x-ratelimit-remaining": "209",
•    …
•  }
•  Yep, 209 left
•  >
Day in the life of a Control Number Detective – Run #2
•  Remember : 973,323 users. So, 9734 cmd strings (100 users per string)
•  > db.t_users_info.count()
•  9728
•  > db.api_str.count({"api_str":""})
•  97
•  > db.api_str.count({"done":0})
•  103
•  9734 - 9728 = 6, same as 103 - 97 !
•  Run once more !
•  > db.api_str.find({"done":0},{"seq_no":1})
•  …
•  { "_id" : ObjectId("4ff4dbd4e5557c28bf002e22"), "seq_no" : 9736 }
•  { "_id" : ObjectId("4ff4db05e5557c28bf001f47"), "seq_no" : 5933 }
•  { "_id" : ObjectId("4ff4db8be5557c28bf0028f6"), "seq_no" : 8412 }
•  { "_id" : ObjectId("4ff4dba2e5557c28bf002a8c"), "seq_no" : 8818 }
•  { "_id" : ObjectId("4ff4dbaee5557c28bf002b69"), "seq_no" : 9039 }
•  { "_id" : ObjectId("4ff4dbb8e5557c28bf002c1c"), "seq_no" : 9218 }
•  …
•  {
•    …
•    "x-ratelimit-limit": "350",
•    "x-ratelimit-remaining": "344",
•    …
•  }
•  Yep, 6 more records
•  > db.t_users_info.count()
•  9734
•  Good, got 9734 !

Professor Layton would be proud !
In fact, I have all four & plan to spend some time with them & Laphroaig !
Monitor runs & track control numbers
[Screenshots: run logs]
•  Unroll run : 8:48 PM to ~4:08 PM the next day !
•  Track errors & the document numbers
Code & Run Walk Through – Stage 5
o  For each @clouderati follower, find friend=follower – set intersection
o  Code :
    §  oscon2012_find_strong_ties_01.py
    §  oscon2012_social_graph_stats_01.py
o  Challenges :
    §  None. Python set operations made this easy
o  Interesting Points :
    §  Even at this scale, a single machine is not enough
    §  Should have tried data parallelism
        •  This task is well suited to leverage data parallelism as it is commutative & associative
    §  Was getting invalid cursor errors from MongoDB
    §  So had to do the updates in two steps (sketch below)
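A minimal sketch of the stage-5 intersection, done in the two steps mentioned above (the collection & field names are my stand-ins):

    from pymongo import Connection  # pymongo 2.x

    db = Connection()['oscon']

    # step 1 : compute the intersections first; updating while iterating
    # the same cursor is what triggered the invalid-cursor errors
    ties = [(u['_id'],
             list(set(u.get('follower_ids', [])) &
                  set(u.get('friend_ids', []))))     # friend == follower
            for u in db.follower_friend_lists.find()]

    # step 2 : write the strong ties back
    for _id, strong in ties:
        db.follower_friend_lists.update({'_id': _id},
                                        {'$set': {'strong_ties': strong}})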
  
Code & Run Walk Through – Stage 6
o  Create social graph; apply network theory; infer cliques & other properties
o  Code :
    §  oscon2012_find_cliques_01.py
o  Challenges :
    o  Lots of good information hidden in the data !
    o  Memory !
o  Interesting Points :
    o  Graph, list & set operations
    o  networkx has lots of interesting graph algorithms
    o  collections.Counter to the rescue (a sketch follows)
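A minimal sketch of stage 6 (field & collection names are my stand-ins): build the strong-tie graph, enumerate maximal cliques & let collections.Counter tally clique membership:

    from collections import Counter
    import networkx as nx
    from pymongo import Connection  # pymongo 2.x

    db = Connection()['oscon']
    g = nx.Graph()
    for u in db.follower_friend_lists.find():
        for v in u.get('strong_ties', []):
            g.add_edge(u['user_id'], v)

    membership = Counter()
    for clique in nx.find_cliques(g):   # maximal cliques
        if len(clique) > 10:
            membership.update(clique)
    print(membership.most_common(5))    # GeorgeReese topped the actual run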
  
Twitter Social Graph Analysis of @clouderati
o  2072 followers; 973,323 unique users one level down, w/ followers/friends trimmed at 5,000
o  Strong ties
    o  follower=friend
o  235,697 users, 462,419 edges
o  501,367 cliques
o  253 unique users in 8,906 cliques w/ > 10 users
o  GeorgeReese in 7,973 of them ! See the list for the 1st 125
o  krishnan 3,446, randy 2,197, joe 1,977, sam 1,937, jp 485, stu 403, urquhart 263, beaker 226, acroll 149, adrian 63, gevaperry 24
o  Of course, clique analysis does not tell us the whole story …
Clique Distribution = {2: 296521, 3: 58368, 4: 36421, 5: 28788, 6: 24197, 7: 20240, 8: 15997, 9: 11929, 10: 6576, 11: 1909, 12: 364, 13: 55, 14: 2}
  
Twitter Social Graph Analysis of @clouderati
o  Sorting by followers vs. sorting by strong ties is interesting
[Chart annotations : Celebrity – very low strong ties; Higher celebrity, low strong ties; Medium celebrity, medium strong ties]
Twitter Social Graph Analysis of @clouderati
o  A higher "Strong Ties" number is interesting
    §  It means a very high follower-friend intersection
    §  Reeves 62%, bgolden 85%
o  But a high clique count with a smaller "Strong Ties" number shows a more cohesive & stronger social graph
    §  e.g. Krishnan - 15% friends-followers
    §  Samj – 33%
Twitter Social Graph Analysis of @clouderati
o  Ideas for more exploration
    §  Include all followers (instead of stopping at the 5,000 cap)
    §  Get tweets & track @mentions
    §  Frequent @mentions show stronger ties
    §  #tag analysis could show some interesting networks
Twitter Tips – A Baker's Dozen
1.  Twitter APIs are (more or less) congruent & symmetric
2.  Twitter is usually right & simple - recheck when you get unexpected results, before blaming Twitter
    o  I was getting numbers when I was expecting screen_names in user objects
    o  Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong – screen_name instead of user_id
    o  Always test with one or two records before a long run ! - learned the hard way
3.  Twitter APIs are very powerful – consistent use can bear huge data
    o  In a week, you can pull in 4-5 million users & some tweets !
    o  Night runs are far faster & error-free
4.  Use a NOSQL data store as a command buffer & data buffer
    o  Would make it easy to work with Twitter at scale
    o  I use MongoDB
    o  Keep the schema simple & no fancy transformations
        •  And, as far as possible, the same as the (json) response
    o  Use the NOSQL CLI for trimming records et al
Twitter Tips – A Baker's Dozen
5.  Always use a big data pipeline
    o  Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
    o  That way you can orthogonally extend, with functional components like command buffers, validation et al
6.  Use a functional approach for a scalable pipeline
    o  Compose your big data pipeline with well defined granular functions, each doing only one thing
    o  Don't overload the functional components (i.e. no collect, unroll & store as a single component)
    o  Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques
        •  This did create some trouble for me, as we have seen
7.  Crawl-Store-Validate-Recrawl-Refresh cycle
    o  The equivalent of the traditional ETL
    o  The validation stage & validation routines are important
        •  Cannot expect perfect runs
        •  Cannot manually look at data either, when data is at scale
8.  Have control numbers to validate runs & monitor them
    o  I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs !
    o  There would be a separate printout of the control numbers, kept in the operations files
Twitter Tips – A Baker's Dozen
9.  Program defensively
    o  more so for REST-based big data analytics systems
    o  Expect failures at the transport layer & accommodate for them
10.  Have Erlang-style supervisors in your pipeline
    o  Fail fast & move on
    o  Don't linger and try to fix errors that cannot be controlled at that layer
    o  A higher layer process will circle back and do incremental runs to correct missing spiders and crawls
    o  Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
    o  I have an example in part 2
11.  Data will never be perfect
    o  Know your data & accommodate for its idiosyncrasies
        •  for example : 0 followers, protected users, 0 friends,…
Twitter Tips – A Baker's Dozen
12.  Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
    o  See the MongoDB example in Part 2
13.  Don't bombard the URL (a wait & retry sketch follows this list)
    o  Wait a few seconds between successful calls. This will end up as a scalable system, eventually
    o  I found 10 seconds to be the sweet spot. 5 seconds gave retry errors. Was able to work with 5 seconds with wait & retry. Then the rate limit started kicking in !
14.  Always measure the elapsed time of your API runs & processing
    o  A kind of early warning when something is wrong
15.  Develop incrementally; don't fail to check for "cut & paste" errors
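A sketch of tips 12-14 rolled together: wait between calls, retry with backoff & time every call. The 10-second wait matches tip 13; everything else (function name, retry counts) is my stand-in:

    import json
    import time
    import requests

    def polite_get(url, params, wait=10, retries=3):
        for attempt in range(retries):
            start = time.time()
            r = requests.get(url, params=params)
            print('elapsed %.2fs' % (time.time() - start))  # tip 14
            if r.status_code == 200:
                time.sleep(wait)          # don't bombard the URL (tip 13)
                return json.loads(r.text)
            time.sleep(wait * (attempt + 1))  # back off & retry
        return None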
  
Twitter Tips – A Baker's Dozen
16.  The Twitter big data pipeline has lots of opportunities for parallelism
    o  Leverage data parallelism frameworks like MapReduce
    o  But first :
        §  Prototype as a linear system,
        §  Optimize and tweak the functional modules & cache strategies,
        §  Note down the stages and tasks that can be parallelized, and
        §  Then parallelize them
    o  For the example project, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I point them out as we progress through the tutorial
17.  Pay attention to handoffs between stages
    o  They might require transformation – for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
    o  But resist the urge to overload collect with transform
        o  i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the arrays into separate documents
    o  Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18.  Have a good log management system to capture and wade through logs
Twitter Tips – A Baker's Dozen
19.  Understand the underlying network characteristics for the inference you want to make
    o  Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
    o  The Twitter Network is more of an Interest Network
    o  So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
    o  But others, like Cliques and Bipartite Graphs, do
Twitter Gripes
1.  Need richer APIs for #tags
    o  Somewhat similar to users, viz. followers, friends et al
    o  Might make sense to make #tags a top level object with its own semantics
2.  HTTP error returns are not uniform
    o  Returns 400 Bad Request instead of 420
    o  Granted, there is enough information to figure this out
3.  Need an easier way to get screen_name from user_id
4.  "following" vs. "friends_count", i.e. "following" is a dummy variable
    o  There are a few like this, most probably for backward compatibility
5.  Parameter validation is not uniform
    o  Gives "404 Not Found" instead of "406 Not Acceptable" or "416 Range Unacceptable"
6.  Overall, more validation would help
    o  Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
Thanks To these Giants …
I had a good time researching & preparing for this tutorial.
I hope you learned a few new things & have a few vectors to follow.

The Art of Social Media Analysis with Twitter & Python

  • 1.
    The Art ofSocial Media Analysis with Twitter & Python krishna sankar @ksankar http://www.oscon.com/oscon2012/public/schedule/detail/23130
  • 2.
    Intro API, Objects,… o  House  Rules  (1  of  2)   Twitter Network We will analyze @clouderati, o  Doesn’t  assume  any  knowledge   Analysis 2072 followers, exploding to of  Twitter  API   Pipeline ~980,000 distinct users down one level o  Goal:  Everybody  in  the  same   page  &  get  a  working   knowledge  of  Twitter  API   NLP, NLTK, o  To  bootstrap  your  exploration   @mention Cliques, social Sentiment network graph into  Social  Network  Analysis  &   Analysis Twitter     Rewteeet analytics, Growth, #tag Network Information o  Simple  programs,  to  illustrate   contagion weakties usage  &  data  manipulation  
  • 3.
    Intro API, Objects,… Twitter o  House  Rules  (2  of  2)   Network We will analyze @clouderati, Analysis 2072 followers, exploding to o  Am  using  the  requests  library   Pipeline ~980,000 distinct users down o  There  are  good  Twitter  frameworks   one level for  python,  but  wanted  to  build   from  the  basics.  Once  one   understands  the  fundamentals,   frameworks  can  help   NLP, NLTK, @mention Cliques, social Sentiment o  Many  areas  to  explore  –  not  enough   Analysis network graph time.  So  decided  to  focus  on  social   graph,  cliques  &  networkx   Rewteeet analytics, Growth, #tag Network Information contagion weakties
  • 4.
    About  Me •  Lead  Engineer/Data  Scientist/AWS  Ops  Guy  at   Genophen.com   o  Co-­‐chair  –  2012  IEEE  Precision  Time  Synchronization     •  http://www.ispcs.org/2012/index.html   o  Blog  :  http://doubleclix.wordpress.com/   o  Quora  :  http://www.quora.com/Krishna-­‐Sankar   •  Prior  Gigs   o  Lead  Architect  (Egnyte)   o  Distinguished  Engineer  (CSCO)   o  Employee  #64439  (CSCO)  to  #39(Egnyte)  &  now  #9  !   •  Current  Focus:   o  Design,  build  &  ops  of  BioInformatics/Consumer  Infrastructure  on  AWS,   MongoDB,  Solr,  Drupal,GitHub,…   o  Big  Data  (more  of  variety,  variability,  context  &  graphs,  than  volume  or  velocity  –   so  far  !)   o  Overlay  based  semantic  search  &  ranking   •  Other  related  Presentations   o  http://goo.gl/P1rhc  Big  Data  Engineering  Top  10  Pragmatics  (Summary)   o  http://goo.gl/0SQDV  The  Art  of  Big  Data  (Detailed)   o  http://goo.gl/EaUKH  The  Hitchhiker’s  Guide  to  Kaggle  OSCON  2011  Tutorial  
  • 5.
    Twitter Tips –A Baker’s Dozen 1.  Twitter  APIs  are  (more  or  less)  congruent  &  symmetric   2.  Twitter  is  usually  right  &  simple  -­‐  recheck  when  you  get  unexpected  results   before  blaming  Twitter   o  I  was  getting  numbers  when  I  was  expecting  screen_names  in  user  objects.   o  Was  ready  to  send  blasting  e-­‐mails  to  Twitter  team.  Decided  to  check  one  more  time   and  found  that  my  parameter  key  was  wrong-­‐screen_name  instead  of  user_id   o  Always test with one or two records before a long run ! - learned the hard way 3.  Twitter  APIs  are  very  powerful  –  consistent  use  can  bear  huge  data   o  In  a  week,  you  can  pull  in  4-­‐5  million  users  &  some  tweets  !     o  Night runs are far more faster & error-free 4.  Use  a  NOSQL  data  store  as  a  command  buffer  &  data  buffer   o  Would  make  it  easy  to  work  with  Twitter  at  scale   o  I  use    MongoDB   The o  Keep  the  schema  simple  &  no  fancy  transformation   End •  And  as  far  as  possible  same  as  the  ( json)  response       Beg As Th inni o  Use  NOSQL  CLI  for  trimming  records  et  al   ng e
  • 6.
    Twitter Tips –A Baker’s Dozen 5.  Always  use  a  big  data  pipeline   o  Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize o  That  way  you  can  orthogonally  extend,  with  functional  components  like  command  buffers,   validation  et  al     6.  Use  functional  approach  for  a  scalable  pipeline   o  Compose  your  data  big  pipeline  with  well  defined  granular  functions,  each  doing  only  one  thing   o  Don’t  overload  the  functional  components  (i.e.  no  collect,  unroll  &  store  as  a  single  component)   o  Have  well  defined  functional  components  with  appropriate  caching,  buffering,  checkpoints  &   restart  techniques   •  This did create some trouble for me, as we will see later 7.  Crawl-­‐Store-­‐Validate-­‐Recrawl-­‐Refresh  cycle   o  The  equivalent  of  the  traditional  ETL   o  Validation  stage  &  validation  routines  are  important   •  Cannot  expect  perfect  runs   •  Cannot  manually  look  at  data  either,  when  data  is  at  scale   8.  Have  control  numbers  to  validate  runs  &  monitor  them   o  I still remember control numbers which start with the number of punch cards in the input deck &d then follow that number through the various runs ! o  There will be a separate printout of the control numbers that will be kept in the operations files
  • 7.
    Twitter Tips –A Baker’s Dozen 9.  Program  defensively     o  more so for a REST-based-Big Data-Analytics systems o  Expect  failures  at  the  transport  layer  &  accommodate  for  them     10.  Have  Erlang-­‐style  supervisors  in  your  pipeline   o  Fail  fast  &  move  on   o  Don’t  linger  and  try  to  fix  errors  that  cannot  be  controlled  at  that  layer   o  A  higher  layer  process  will  circle  back  and  do  incremental  runs  to   correct  missing  spiders  and  crawls   o  Be  aware  of  visibility  &  lack  of  context.  Validate  at  the  lowest  layer  that   has  enough  context  to  take  corrective  actions   o  I have an example in part 2 11.  Data  will  never  be  perfect   o  Know  your  data  &  accommodate  for  it’s  idiosyncrasies     •  for  example:  0  followers,  protected  users,  0  friends,…  
  • 8.
Twitter Tips – A Baker's Dozen
12.  Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
   o  See a MongoDB example in Part 2
13.  Don't bombard the URL
   o  Wait a few seconds between successive calls. This will end up with a scalable system, eventually
   o  I found 10 seconds to be the sweet spot. 5 seconds gave a retry error. Was able to work with 5 seconds with wait & retry. Then, the rate limit started kicking in!
14.  Always measure the elapsed time of your API runs & processing
   o  Kind of an early warning when something is wrong
15.  Develop incrementally; don't fail to check "cut & paste" errors
  • 9.
Twitter Tips – A Baker's Dozen
16.  The Twitter big data pipeline has lots of opportunities for parallelism
   o  Leverage data parallelism frameworks like MapReduce
   o  But first:
      §  Prototype as a linear system,
      §  Optimize and tweak the functional modules & cache strategies,
      §  Note down stages and tasks that can be parallelized and
      §  Then parallelize them
   o  For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
17.  Pay attention to handoffs between stages
   o  They might require transformation – for example collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
   o  But resist the urge to overload collect with transform
   o  i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents
   o  Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18.  Have a good log management system to capture and wade through logs
  • 10.
Twitter Tips – A Baker's Dozen
19.  Understand the underlying network characteristics for the inference you want to make
   o  Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
   o  The Twitter Network is more of an Interest Network
   o  So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
   o  But others, like Cliques and Bipartite Graphs, do
  • 11.
Twitter Gripes
1.  Need richer APIs for #tags
   o  Somewhat similar to users viz. followers, friends et al
   o  Might make sense to make #tags a top level object with its own semantics
2.  HTTP Error Return is not uniform
   o  Returns 400 Bad Request instead of 420
   o  Granted, there is enough information to figure this out
3.  Need an easier way to get screen_name from user_id
4.  "following" vs. "friends_count" i.e. "following" is a dummy variable
   o  There are a few like this, most probably for backward compatibility
5.  Parameter Validation is not uniform
   o  Gives "404 Not Found" instead of "406 Not Acceptable" or "413 Too Long" or "416 Range Unacceptable"
6.  Overall, more validation would help
   o  Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
  • 12.
A Fork
•  NLP, NLTK & a deep dive into Tweets for Sentiment Analysis
•  Not enough time for both
•  I chose the Social Graph route
  • 13.
A minute about Twitter as platform & its evolution
https://dev.twitter.com/blog/delivering-consistent-twitter-experience
•  "The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community." – Chenda, CBS News
•  ".. we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you're on Twitter.com or elsewhere on the web" – Michael
My Wish & Hope
•  I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
•  I did like the fact that tweets are part of LinkedIn. I still used Twitter more than LinkedIn
   o  I don't think showing Tweets in LinkedIn took anything away from the Twitter experience
   o  The LinkedIn experience & the Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that
•  I sincerely hope that the platform grows with a rich developer ecosystem
•  An orthogonally extensible platform is essential
•  Of course, along with a congruent user experience – "… core Twitter consumption experience through consistent tools"
  • 14.
Setup
•  For Hands-on Today
   o  Python 2.7.3
   o  easy_install -v requests
      •  http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
   o  easy_install -v requests-oauth
   o  Hands-on programs at https://github.com/xsankar/oscon2012-handson
•  For advanced data science with social graphs
   o  easy_install -v networkx
   o  easy_install -v numpy
   o  easy_install -v nltk
      •  Not for this tutorial, but good for sentiment analysis et al
   o  MongoDB
      •  I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS
   o  graphviz – http://www.graphviz.org/; easy_install pygraphviz
   o  easy_install pydot
  • 15.
Thanks To these Giants …
  • 16.
Problem Domain For this tutorial
•  Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
   o  Not for Twitter based apps for real time tweets
   o  Not web sites with real time tweets
•  By looking at the domain in aggregate to derive inferences & actionable recommendations
•  Which also means you need to be deliberate & systematic (i.e. not look at a fluctuation as a trend, but dig deeper before pronouncing a trend)
  • 17.
    Agenda I.  Mechanics  :  Twitter  API  (1:30  PM  -­‐  3:00  PM)     o  Essential  Fundamentals  (Rate  Limit,  HTTP  Codes  et  al)   o  Objects   o  API   o  Hands-­‐on  (2:45  PM  -­‐  3:00  PM)   II.  Break  (3:00  PM  -­‐  3:30  PM)   III.  Twitter  Social  Graph  Analysis  (3:30  PM  -­‐  5:00  PM)   o  Underlying  Concepts   o  Social  Graph  Analysis  of  @clouderati   §  Stages,  Strategies  &  Tasks   §  Code  Walk  thru    
  • 18.
  • 19.
Twitter API : Read These First
•  Using the Twitter Brand
   o  New logo & associated guidelines: https://twitter.com/about/logos
   o  Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
   o  Developer Rules of the Road: https://dev.twitter.com/terms/api-terms
•  Read These Links First
   1.  https://dev.twitter.com/docs/things-every-developer-should-know
   2.  https://dev.twitter.com/docs/faq
   3.  Field Guide to Objects: https://dev.twitter.com/docs/platform-objects
   4.  Security: https://dev.twitter.com/docs/security-best-practices
   5.  Media Best Practices: https://dev.twitter.com/media
   6.  Consolidated Page: https://dev.twitter.com/docs
   7.  Streaming APIs: https://dev.twitter.com/docs/streaming-apis
   8.  How to Appeal (Not that you all would need it!): https://support.twitter.com/articles/72585
•  Only one version of the Twitter APIs
  • 20.
    API  Status  Page •  https://dev.twitter.com/status   •  https://dev.twitter.com/issues   •  https://dev.twitter.com/discussions  
  • 21.
  • 22.
Open This First
•  Install pre-reqs as per the setup slide
•  Run
   o  oscon2012_open_this_first.py
   o  To test connectivity – a "canary query"
•  Run
   o  oscon2012_rate_limit_status.py
   o  Use http://www.epochconverter.com to check reset_time
•  Formats: xml, json, atom & rss
  • 23.
Twitter API
•  REST – Core Data, Core Twitter Objects: Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet. Rate Limit: 150/350
•  Streaming – Near-realtime, High Volume: Public Streams, User Streams, Site Streams
•  Search – Search & Trends: Keywords, Specific User, Trends. Rate Limit: Complexity & Frequency
•  Firehose – Follow users, topics, data mining
  • 24.
  • 25.
Rate Limits
•  By API type & Authentication Mode

   API         No authC                 authC     Error
   REST        150/hr                   350/hr    400
   Search      Complexity & Frequency   -N/A-     420
   Streaming   Up to 1%
   Firehose    none                     none
  • 26.
Rate Limit Header
{
   "status": "200 OK",
   "vary": "Accept-Encoding",
   "x-frame-options": "SAMEORIGIN",
   "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6",
   "x-ratelimit-class": "api",
   "x-ratelimit-limit": "150",
   "x-ratelimit-remaining": "149",
   "x-ratelimit-reset": "1340467358",
   "x-runtime": "0.04144",
   "x-transaction": "2b49ac31cf8709af",
   "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba"
}
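These x-ratelimit-* headers are all the bookkeeping you need. A minimal sketch (Python 2 + requests, against the v1 endpoint used elsewhere in this deck) of reading them off a response; only the header names and endpoint come from the slides, the rest is illustrative:

    import requests

    # Canary call: any REST response carries the x-ratelimit-* headers
    r = requests.get('https://api.twitter.com/1/account/rate_limit_status.json')
    print r.headers.get('x-ratelimit-limit')      # e.g. "150" when anonymous
    print r.headers.get('x-ratelimit-remaining')  # e.g. "149"
    print r.headers.get('x-ratelimit-reset')      # epoch seconds; check with epochconverter.com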
  • 27.
Rate Limit-ed Header
{
   "cache-control": "no-cache, max-age=300",
   "content-encoding": "gzip",
   "content-length": "150",
   "content-type": "application/json; charset=utf-8",
   "date": "Wed, 04 Jul 2012 00:48:25 GMT",
   "expires": "Wed, 04 Jul 2012 00:53:25 GMT",
   "server": "tfe",
   …
   "status": "400 Bad Request",
   "vary": "Accept-Encoding",
   "x-ratelimit-class": "api",
   "x-ratelimit-limit": "150",
   "x-ratelimit-remaining": "0",
   "x-ratelimit-reset": "1341363230",
   "x-runtime": "0.01126"
}
  • 28.
    Rate  Limit  Example • Run   o  oscon2012_rate_limit_02.py   •  It  iterates  through  a  list  to  get  followers     •  List  is  2072  long  
  • 29.
{
   …
   "date": "Wed, 04 Jul 2012 00:54:16 GMT",
   "status": "200 OK",
   "vary": "Accept-Encoding",
   "x-frame-options": "SAMEORIGIN",
   "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8",
   "x-ratelimit-class": "api",
   "x-ratelimit-limit": "150",
   "x-ratelimit-remaining": "147",
   "x-ratelimit-reset": "1341366831",
   "x-runtime": "0.02768",
   "x-transaction": "f1bafd60112dddeb",
   "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}
Last time, it gave me 5 min. Now the reset timer is 1 hour – 150 calls, not authenticated
  • 30.
And the Rate Limit kicked in
{
   "cache-control": "no-cache, max-age=300",
   "content-encoding": "gzip",
   "content-type": "application/json; charset=utf-8",
   "date": "Wed, 04 Jul 2012 00:55:04 GMT",
   …
   "status": "400 Bad Request",
   "transfer-encoding": "chunked",
   "vary": "Accept-Encoding",
   "x-ratelimit-class": "api",
   "x-ratelimit-limit": "150",
   "x-ratelimit-remaining": "0",
   "x-ratelimit-reset": "1341366831",
   "x-runtime": "0.01342"
}
  • 31.
API with OAuth
{
   …
   "date": "Wed, 04 Jul 2012 01:32:01 GMT",
   "etag": ""dd419c02ed00fc6b2a825cc27wbe040"",
   "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
   "last-modified": "Wed, 04 Jul 2012 01:32:01 GMT",
   "pragma": "no-cache",
   "server": "tfe",
   …
   "status": "200 OK",
   "vary": "Accept-Encoding",
   "x-access-level": "read",
   "x-frame-options": "SAMEORIGIN",
   "x-mid": "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",
   "x-ratelimit-class": "api_identified",
   "x-ratelimit-limit": "350",
   "x-ratelimit-remaining": "349",
   "x-ratelimit-reset": "1341369121",
   "x-runtime": "0.05539",
   "x-transaction": "9f8508fe4c73a407",
   "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}
OAuth: "api_identified", 1 hr reset, 350 calls
  • 32.
Rate Limit resets during consecutive calls
{
   …
   "date": "Thu, 05 Jul 2012 14:56:05 GMT",
   …
   "x-ratelimit-class": "api_identified",
   "x-ratelimit-limit": "350",
   "x-ratelimit-remaining": "133",
   "x-ratelimit-reset": "1341500165",
   …
}
********  2416
{
   …
   "date": "Thu, 05 Jul 2012 14:56:18 GMT",
   …
   "status": "200 OK",
   …
   "x-ratelimit-class": "api_identified",
   "x-ratelimit-limit": "350",
   "x-ratelimit-remaining": "349",
   "x-ratelimit-reset": "1341503776",
}
********  2417
(the reset timer moved +1 hour between the two calls)
  • 33.
Unexplained Errors
Traceback (most recent call last):
   File "oscon2012_get_user_info_01.py", line 39, in <module>
       r = client.get(url, params=payload)
   File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
   File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
   File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C…
•  While trying to get details of 1,000,000 users, I get this error – usually 10-6 AM PST
•  Got around it by "Trap & wait 5 seconds"
•  Night runs are relatively error free
  • 34.
A Day in the life of the Twitter Rate Limit
{
   …
   "date": "Fri, 06 Jul 2012 03:41:09 GMT",
   "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
   "server": "tfe",
   "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
   "status": "400 Bad Request",
   "vary": "Accept-Encoding",
   "x-ratelimit-class": "api_identified",
   "x-ratelimit-limit": "350",
   "x-ratelimit-remaining": "0",
   "x-ratelimit-reset": "1341546334",
   "x-runtime": "0.01918"
}
Missed by 4 min!
Error, sleeping
{
   …
   "date": "Fri, 06 Jul 2012 03:46:12 GMT",
   …
   "status": "200 OK",
   …
   "x-ratelimit-class": "api_identified",
   "x-ratelimit-limit": "350",
   "x-ratelimit-remaining": "349",
   …
}
OK after 5 min sleep
  • 35.
Strategies
I have no exotic strategies, so far!
1.  Obvious: Track elapsed time & sleep when the rate limit kicks in (a sketch follows)
2.  Combine authenticated & non-authenticated calls
3.  Use multiple API types
4.  Cache
5.  Store & get only what is needed
6.  Checkpoint & buffer request commands
7.  Distributed data parallelism – for example AWS instances
http://www.epochconverter.com/ <- useful to debug the timer
Please share your tips and tricks for conserving the Rate Limit
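A minimal sketch of strategy #1 (track & sleep), built on the x-ratelimit-* headers shown earlier; the wrapper name and the 10-second cushion are my assumptions, not from the tutorial's code:

    import time
    import requests

    def rate_limited_get(url, params):
        # Hypothetical wrapper: make the call, then sleep until
        # x-ratelimit-reset if the hourly allowance is exhausted
        r = requests.get(url, params=params)
        if r.headers.get('x-ratelimit-remaining') == '0':
            reset_at = int(r.headers.get('x-ratelimit-reset', '0'))
            wait = max(reset_at - int(time.time()), 0) + 10  # small cushion
            print 'Rate limit exhausted; sleeping %d seconds' % wait
            time.sleep(wait)
        return r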
  • 36.
  • 37.
    Authentication •  Three  modes   o  Anonymous   o  HTTP  Basic  Auth   o  OAuth   •  As  of  Aug  31,  2010,  only  Anonymous  or  OAuth  are   supported   •   OAuth  enables  the  user  to  authorize  an  application   without  sharing  credentials   •  Also  has  the  ability  to  revoke   •  Twitter  supports  OAuth  1.0a   •  OAuth  2.0  is  the  new  standard,  much  simpler   o  No  timeframe  for  Twitter  support,  yet      
  • 38.
OAuth Pragmatics
•  Helpful Links
   o  https://dev.twitter.com/docs/auth/oauth
   o  https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
   o  https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
   o  http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
•  Discussion of OAuth's internal mechanisms is better left for another day
•  For headless applications to get an OAuth token, go to https://dev.twitter.com/apps
•  Create an application & get four credential pieces
   o  Consumer Key, Consumer Secret, Access Token & Access Token Secret
•  All the frameworks have support for OAuth. So plug in these values & use the framework's calls
•  I used the requests-oauth library like so:
  • 39.
request-oauth
Get a client using the token, key & secret from dev.twitter.com/apps:

    def get_oauth_client():
        consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
        consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
        access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
        access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
        header_auth = True
        oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key, consumer_secret, header_auth)
        client = requests.session(hooks={'pre_request': oauth_hook})
        return client

Use the client instead of requests:

    def get_followers(user_id):
        url = 'https://api.twitter.com/1/followers/ids.json'
        payload = {"user_id": user_id}  # if cursor is needed: {"cursor": -1, "user_id": scr_name}
        r = requests.get(url, params=payload)
        return r

    def get_followers_with_oauth(user_id, client):
        url = 'https://api.twitter.com/1/followers/ids.json'
        payload = {"user_id": user_id}  # if cursor is needed: {"cursor": -1, "user_id": scr_name}
        r = client.get(url, params=payload)
        return r

Ref: http://pypi.python.org/pypi/requests-oauth
  • 40.
OAuth Authorize screen
•  The user authenticates with Twitter & grants access to Forbes Social
•  Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account
  • 41.
  • 42.
HTTP status Codes
•  0 – Never made it to the Twitter servers (library error)
•  200 OK
•  304 Not Modified
•  400 Bad Request
   o  Check the error message for explanation
   o  REST Rate Limit!
•  401 UnAuthorized
   o  Beware – you could get this for other reasons as well
•  403 Forbidden
   o  Hit Update Limit (> max Tweets/day, following too many people)
•  404 Not Found
•  406 Not Acceptable
•  413 Too Long
•  416 Range Unacceptable
•  420 Enhance Your Calm
   o  Rate Limited
•  500 Internal Server Error
•  502 Bad Gateway
   o  Down for maintenance
•  503 Service Unavailable
   o  Overloaded – "Fail whale"
•  504 Gateway Timeout
   o  Overloaded
https://dev.twitter.com/docs/error-codes-responses
  • 43.
HTTP Status Code – Example
{
   "cache-control": "no-cache, max-age=300",
   "content-encoding": "gzip",
   "content-length": "91",
   "content-type": "application/json; charset=utf-8",
   "date": "Sat, 23 Jun 2012 00:06:56 GMT",
   "expires": "Sat, 23 Jun 2012 00:11:56 GMT",
   "server": "tfe",
   …
   "status": "401 Unauthorized",
   "vary": "Accept-Encoding",
   "www-authenticate": "OAuth realm="https://api.twitter.com"",
   "x-ratelimit-class": "api",
   "x-ratelimit-limit": "0",
   "x-ratelimit-remaining": "0",
   "x-ratelimit-reset": "1340413616",
   "x-runtime": "0.01997"
}
{
   "errors": [
       {
           "code": 53,
           "message": "Basic authentication is not supported"
       }
   ]
}
Detailed error message in JSON! I like this
  • 44.
HTTP Status Code – Confusing Example
GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
•  Spelling mistake
   o  Should be screen_name
•  But a confusing error!
•  Should be 406 Not Acceptable or 413 Too Long, showing a parameter error
{
   …
   "pragma": "no-cache",
   "server": "tfe",
   …
   "status": "404 Not Found",
   …
}
{
   "errors": [
       {
           "code": 34,
           "message": "Sorry, that page does not exist"
       }
   ]
}
  • 45.
HTTP Status Code – Example
{
   "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
   "content-encoding": "gzip",
   "content-length": "112",
   "content-type": "application/json;charset=utf-8",
   "date": "Sat, 23 Jun 2012 01:23:47 GMT",
   "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
   …
   "status": "401 Unauthorized",
   "www-authenticate": "OAuth realm="https://api.twitter.com"",
   "x-frame-options": "SAMEORIGIN",
   "x-ratelimit-class": "api",
   "x-ratelimit-limit": "150",
   "x-ratelimit-remaining": "147",
   "x-ratelimit-reset": "1340417742",
   "x-transaction": "d545a806f9c72b98"
}
{
   "error": "Not authorized",
   "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
}
Sometimes the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12 – clearly a parameter error (i.e. extra parameters)
  • 46.
  • 47.
Twitter Platform Objects
•  Users – have Followers (are followed by) & Friends (follow)
•  Tweets – Status Updates; embed Entities
•  Entities – @user_mentions, urls, media, #hashtags
•  TimeLine – temporally ordered Tweets
•  Places
https://dev.twitter.com/docs/platform-objects
  • 48.
Tweets
•  A.k.a Status Updates
•  Interesting fields
   o  coordinates <- geo location
   o  created_at
   o  entities (will see later)
   o  id, id_str
   o  possibly_sensitive
   o  user (will see later)
      •  perspectival attributes embedded within a child object of an unlike parent – hard to maintain at scale
      •  https://dev.twitter.com/docs/faq#6981
   o  withheld_in_countries
      •  https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
https://dev.twitter.com/docs/platform-objects/tweets
  • 49.
A word about id, id_str
•  June 1, 2010
   o  Snowflake, the id generator service
   o  "The full ID is composed of a timestamp, a worker number, and a sequence number"
   o  Had problems with JavaScript handling numbers > 53 bits
   o  "id": 819797
   o  "id_str": "819797"
http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
  • 50.
Tweets – example
•  Let us run oscon2012-tweets.py
•  Example of tweet
   o  coordinates
   o  id
   o  id_str
  • 51.
Users
•  followers_count
•  geo_enabled
•  id, id_str
•  name, screen_name
•  protected
•  status, statuses_count
•  withheld_in_countries
https://dev.twitter.com/docs/platform-objects/users
  • 52.
Users – Let us run some examples
•  Run
   o  oscon_2012_users.py
      •  Lookup users by screen_name
   o  oscon12_first_20_ids.py
      •  Lookup users by user_id
•  Inspect the results
   o  id, name, status, status_count, protected, followers (for top 10 followers), withheld users
•  Can use this information for customizing the user's screen in your web app
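A minimal sketch of such a lookup – not the actual hands-on script – using the v1 users/lookup endpoint seen elsewhere in this deck (the screen_names are illustrative; up to 100 per call):

    import json
    import requests

    url = 'https://api.twitter.com/1/users/lookup.json'
    payload = {'screen_name': 'clouderati,oscon,OReillyMedia'}
    r = requests.get(url, params=payload)
    for user in json.loads(r.text):  # one user object per name
        print user['id_str'], user['screen_name'], user['followers_count'], user['protected']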
  • 53.
Entities
•  Metadata & Contextual Information
•  You could parse tweets yourself, but Entities parse these out for you as structured data
•  REST API/Search API – include_entities=1
•  Streaming API – included by default
•  hashtags, media, urls, user_mentions
https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
  • 54.
    Entities •  Run     o  oscon2012_entities.py   •  Inspect  hashtags,  urls  et  al    
  • 55.
Places
•  attributes
•  bounding_box
•  id (as a string!)
•  country
•  name
https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes
  • 56.
Places
•  Can search for tweets near a place, like so (see the sketch below):
•  Get the latlong of the convention center [45.52929,-122.66289]
   o  Tweets near that place
•  Tweets near San Jose [37.395715,-122.102308]
•  We will not go further here, but this is very useful
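A sketch of what such a call could look like, assuming the v1 Search API's geocode filter of "lat,long,radius"; the query term and radius are illustrative:

    import json
    import requests

    url = 'http://search.twitter.com/search.json'
    payload = {'q': 'oscon', 'geocode': '45.52929,-122.66289,1mi'}  # near the conv center
    r = requests.get(url, params=payload)
    for tweet in json.loads(r.text).get('results', []):
        print tweet['from_user'], ':', tweet['text']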
  • 57.
Timelines
•  Collections of tweets ordered by time
•  Use max_id & since_id for navigation
https://dev.twitter.com/docs/working-with-timelines
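A sketch of walking a timeline backwards with max_id, per the working-with-timelines doc above (v1 endpoint and field names; the screen_name is illustrative):

    import json
    import requests

    url = 'https://api.twitter.com/1/statuses/user_timeline.json'
    max_id = None
    while True:
        payload = {'screen_name': 'clouderati', 'count': 200}
        if max_id is not None:
            payload['max_id'] = max_id
        tweets = json.loads(requests.get(url, params=payload).text)
        if not tweets:
            break
        for t in tweets:
            print t['id_str'], t['created_at']
        max_id = tweets[-1]['id'] - 1  # step just below the oldest tweet seen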
  • 58.
    Other  Objects  & APIs •  Lists   •  Notifications   •  Friendships/exists  to  see  if  one  follows   the  other  
  • 59.
Twitter Platform Objects
•  Users – have Followers (are followed by) & Friends (follow)
•  Tweets – Status Updates; embed Entities
•  Entities – @user_mentions, urls, media, #hashtags
•  TimeLine – temporally ordered Tweets
•  Places
https://dev.twitter.com/docs/platform-objects
  • 60.
Hands-on Exercise (15 min)
•  Setup environment – slide #14
•  Sanity check environment & libraries
   o  oscon2012_open_this_first.py
   o  oscon2012_rate_limit_status.py
•  Get objects (show calls)
   o  Lookup users by screen_name – oscon12_users.py
   o  Lookup users by id – oscon12_first_20_ids.py
   o  Lookup tweets – oscon12_tweets.py
   o  Get entities – oscon12_entities.py
•  Inspect the results
•  Explore a little bit
•  Discussion
  • 61.
  • 62.
Twitter API
•  REST – Core Data, Core Twitter Objects: Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet. Rate Limit: 150/350
•  Streaming – Near-realtime, High Volume: Public Streams, User Streams, Site Streams
•  Search – Search & Trends: Keywords, Specific User, Trends. Rate Limit: Complexity & Frequency
•  Firehose – Follow users, topics, data mining
  • 63.
Twitter REST API
•  https://dev.twitter.com/docs/api
•  What we have been using so far is the REST API
•  Request-Response
•  Anonymous or OAuth
•  Rate Limited: 150/350
  • 64.
Twitter Trends
•  oscon2012-trends.py
•  Trends/weekly, Trends/monthly
•  Let us run some examples
   o  oscon2012_trends_daily.py
   o  oscon2012_trends_weekly.py
•  Trends & hashtags
   o  #hashtag euro2012
   o  http://hashtags.org/euro2012
   o  http://sproutsocial.com/insights/2011/08/twitter-hashtags/
   o  http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
   o  Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/
  • 65.
Brand Rank w/ Twitter
•  Walk through & results of the following
   o  oscon2012_brand_01.py
•  Followed 10 user-brands for a few days to find growth
•  Brand Rank
   o  Growth of a brand w.r.t. the industry
   o  A surge in popularity could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
•  API: url='https://api.twitter.com/1/users/lookup.json'
•  payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
  • 66.
Brand Rank w/ Twitter – Clouderati is very stable
  • 67.
Brand Rank w/ Twitter – Tech Brands
•  Google I/O showed a spike on 6/27-6/28
•  OReillyMedia shares some of the spike
•  Looking at a few days' worth of data, our best inference is that "oscon doesn't track with googleio"
•  "Clouderati doesn't track at all"
  • 68.
Brand Rank w/ Twitter – World of Soccer
•  FOXSoccer & UEFAcom track each other
•  The numbers seldom decrease, so calculating -ve velocity will not work
•  OTOH, if you see a -ve velocity, investigate
  • 69.
Brand Rank w/ Twitter – World of Basketball
•  NBA, MiamiHeat & okcthunder track each other
•  Used % rather than absolute numbers to compare
•  The hike from 7/6 to 7/10 is interesting
  • 70.
Brand Rank w/ Twitter – Rising Tide …
•  For some reason, all numbers are going up 7/6 thru 7/10 – except for clouderati!
•  Is a rising (Twitter) tide lifting all (well, almost all)?
  • 71.
Trivia : Search API
•  Search (search.twitter.com)
   o  Built by Summize, which was acquired by Twitter in 2008
   o  Summize described itself as "sentiment mining"
  • 72.
Search API
•  Very simple
   o  GET http://search.twitter.com/search.json?q=<blah>
•  Based on a search criteria
•  "The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
•  Recent = last 6-9 days worth of tweets
•  Anonymous call
•  Rate Limit
   o  Not no. of calls/hour, but Complexity & Frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search
  • 73.
    Search  API •  Filters   o  Search  URL  encoded   o  @  =  %40,  #=%23   o   emoticons    :)  and  :(,   o  http://search.twitter.com/search.atom?q=sometimes+%3A)   o  http://search.twitter.com/search.atom?q=sometimes+%3A(   •  Location  Filters,  date  filters   •  Content  searches  
  • 74.
Streaming API
•  Not request-response, but a stream
•  The Twitter frameworks have the support
•  Rate Limit: up to 1%
•  Stall warning if the client is falling behind
•  Good documentation links
   o  https://dev.twitter.com/docs/streaming-apis/connecting
   o  https://dev.twitter.com/docs/streaming-apis/parameters
   o  https://dev.twitter.com/docs/streaming-apis/processing
  • 75.
    Firehose •  ~  400  million  public  tweets/day   •  If  you  are  working  with  Twitter  firehose,  I  envy  you  !   •  If  you  hit  real  limits,  then  explore  the  firehose  route   •  AFAIK,  it  is  not  cheap,  but  worth  it  
  • 76.
API Best Practices
1.  Use JSON
2.  Use user_id rather than screen_name
   o  user_id is constant while screen_name can change
3.  max_id and since_id
   o  For example direct messages: if you have the last message, use since_id for the search
   o  max_id controls how far to go back
4.  Cache as much as you can
5.  Set the User-Agent header for debugging
I have listed a few good blogs that have API best practices in the reference section, at the end of this presentation. These are gathered from various books, blogs & other media I used for this tutorial. See References (at the end) for the sources.
  • 77.
Twitter API
•  REST – Core Data, Core Twitter Objects: Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet. Rate Limit: 150/350
•  Streaming – Near-realtime, High Volume: Public Streams, User Streams, Site Streams
•  Search – Search & Trends: Keywords, Specific User, Trends. Rate Limit: Complexity & Frequency
•  Firehose – Follow users, topics, data mining
Questions ?
  • 78.
Part II – Twitter Network Analysis (SNA)
  • 79.
Most important & the ugliest slide in this deck!
1. Collect -> 2. Store -> 3. Transform & Analyze -> 4. Model & Reason -> 5. Predict, Recommend & Visualize
•  Tip 1: Implement as a staged pipeline, never a monolith
•  Tip 3: Keep the schema simple; don't be afraid to transform. Validate the dataset & re-crawl/refresh
  • 80.
Trivia
•  Social Network Analysis originated as Sociometry & the social network was called a sociogram
•  Back then, Facebook was called SocioBinder!
•  Jacob Levy Moreno is considered the originator
   o  NYTimes, April 3, 1933, p. 17
  • 81.
Twitter Networks – Definitions
•  Nodes
   o  Users
   o  #tags
•  Edges
   o  Follows
   o  Friends
   o  @mentions
   o  #tags
•  Directed
  • 82.
Twitter Networks – Definitions
•  In-degree
   o  Followers
•  Out-degree
   o  Friends/Follow
•  Centrality Measures
•  Hubs & Authorities
   o  Hubs/Directories tell us where Authorities are
   o  "Of Mortals & Celebrities" is more "Twitter-style"
  • 83.
Twitter Networks – Properties
(figure: example follower graph with nodes A-N)
•  Concepts From Citation Networks
   o  Cocitation
      •  Common papers that cite a paper
      •  Common Followers
      •  C & G (followed by F & H)
   o  Bibliographic Coupling
      •  Cite the same papers
      •  Common Friends (i.e. follow the same person)
      •  D, E, F & H
  • 84.
Twitter Networks – Properties
(figure: the same follower graph, nodes A-N)
•  Concepts From Citation Networks
   o  Cocitation
      •  Common papers that cite a paper
      •  Common Followers
      •  C & G (followed by F & H)
   o  Bibliographic Coupling
      •  Cite the same papers
      •  Common Friends (i.e. follow the same person)
      •  D, E, F & H follow C
      •  H & F follow C & G
         o  So H & F have high coupling
         o  Hence, if H follows A, we can recommend F to follow A
  • 85.
Twitter Networks – Properties
•  Bipartite/Affiliation Networks
   o  Two disjoint subsets
   o  The bipartite concept is very relevant to the Twitter social graph
   o  Membership in Lists
      •  lists vs. users bipartite graph
   o  Common #Tags in Tweets
      •  #tags vs. members bipartite graph
   o  @mention together
      •  ? Can this be a bipartite graph
      •  ? How would we fold this ?
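A toy networkx sketch of the lists-vs-users idea (all node names invented); "folding" onto the user side links users who share a list membership:

    import networkx as nx
    from networkx.algorithms import bipartite

    B = nx.Graph()
    B.add_nodes_from(['cloud-list', 'bigdata-list'], bipartite=0)  # lists
    B.add_nodes_from(['alice', 'bob', 'carol'], bipartite=1)       # users
    B.add_edges_from([('cloud-list', 'alice'), ('cloud-list', 'bob'),
                      ('bigdata-list', 'bob'), ('bigdata-list', 'carol')])
    # Fold onto the user side: an edge means "shares at least one list"
    users = bipartite.projected_graph(B, ['alice', 'bob', 'carol'])
    print users.edges()  # alice-bob (cloud-list), bob-carol (bigdata-list)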
  • 86.
Other Metrics & Mechanisms
•  Kronecker Graph Models
   o  The Kronecker product is a way of generating self-similar matrices
   o  Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
   o  Application: generating models for analysis, prediction, anomaly detection et al
•  Erdos-Renyi Random Graphs
   o  Easy to build a Gn,p graph
   o  Assumes equal likelihood of edges between two nodes
   o  In a Twitter social network, we can create a more realistic expected distribution (adding the "social reality" dimension) by inspecting the #tags & @mentions
•  Network Diameter
•  Weak Ties
•  Follower velocity (+ve & -ve), Association strength
   o  Unfollow is not a reliable measure
   o  But an interesting property to investigate when it happens
Not covered here, but potential for an encore!
Ref: Jure Leskovec – Kronecker Graphs, Random Graphs
  • 87.
Twitter Networks – Properties
•  Twitter != LinkedIn, Twitter != Facebook
•  Twitter Network == Interest Network
•  Be cognizant of the above when you apply traditional network properties to Twitter
•  For example,
   o  Six degrees of separation doesn't make sense (most of the time) in Twitter – except maybe for Cliques
   o  Is diameter a reliable measure for a Twitter Network?
      •  Probably not
   o  Do cut sets make sense?
      •  Probably not
   o  But citation network principles do apply; we can learn from cliques
   o  Bipartite graphs do make sense
  • 88.
Cliques (1 of 2)
•  "Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other"
•  Cohesive subgroup, closely connected
•  Near-cliques rather than a perfect clique (k-plex, i.e. connected to at least n-k others)
•  k-plex cliques to discover subgroups in a sparse network; 1-plex being the perfect clique
Ref: Networks, An Introduction – Newman
  • 89.
Cliques (2 of 2)
•  k-core – at least k others in the subset; (n-k)-plex
•  k-clique – no more than k distance away
   o  Path inside or outside the subset
   o  k-clan or k-club (path inside the subset)
•  We will apply k-plex cliques for one of our hands-on – see the sketch below
Ref: Networks, An Introduction – Newman
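networkx enumerates perfect (maximal) cliques out of the box; a toy sketch – the k-plex relaxation is not built in, so it would be layered on top of something like this:

    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')])
    for clique in nx.find_cliques(G):  # maximal cliques (Bron-Kerbosch)
        print clique                   # ['a', 'b', 'c'] and ['c', 'd']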
  • 90.
Sentiment Analysis
•  Sentiment Analysis is an important & interesting body of work on the Twitter platform
   o  Collect Tweets
   o  Opinion Estimation – pass thru a Classifier, Sentiment Lexicons
      •  Naive Bayes / Max Entropy Classifier / SVM
   o  Aggregated Text Sentiment / Moving Average
•  I chose not to dive deeper because of time constraints
   o  Couldn't do justice to API, Social Network and Sentiment Analysis, all in 3 hrs
•  The next 3 slides have a couple of interesting examples
  • 91.
Sentiment Analysis
•  Twitter Mining for Airline Sentiment
•  Opinion Lexicon – +ve 2000, -ve 4800
http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
  • 92.
Need I say more?
"A bit of clever math can uncover interesting patterns that are not visible to the human eye"
http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf
  • 94.
  • 95.
Interesting Vectors of Exploration
1.  Find trending #tags & then related #tags – using cliques over co-#tag-citation, which infers topics related to trending topics
2.  Related #tag topics over a set of tweets by a user or group of users
3.  Analysis of In/Out flow, Tweet Flow
   –  Frequent @mention
4.  Find affiliation networks by List memberships, #tags or frequent @mentions
  • 96.
Interesting Vectors of Exploration
5.  Use centrality measures to determine mortals vs. celebrities
6.  Classify Tweet networks/cliques based on message passing characteristics
   –  Tweets vs. Retweets, no. of retweets, …
7.  Retweet Network
   –  Measure influence by retweet count & frequency
   –  Information contagion by looking at different retweet network subcomponents – who, when, how much, …
  • 97.
Twitter Network Graph Analysis – An Example
  • 98.
Analysis Story Board
•  @clouderati is a popular cloud related Twitter account
•  Goals:
   o  Analyze the social graph characteristics of the users who are following the account (in this tutorial)
      •  Dig one level deep, to the followers & friends of the followers of @clouderati
   o  How many cliques? How strong are they? (in this tutorial)
   o  Does the @mention support the clique inferences? (for you to explore!)
   o  What are the retweet characteristics? (for you to explore!)
   o  How does the #tag network graph look? (for you to explore!)
  • 99.
Twitter Analysis Pipeline Story Board – Stages, Strategies, APIs & Tasks
•  Stage 3: Get distinct user list, applying the set(union(list)) operation
•  Stage 4: Get & store user details (distinct user list); Unroll
   o  Note: Needed a command buffer to manage scale (~980,000 users)
   o  Note: The Unroll stage took time & missteps
•  Stage 5: For each @clouderati follower, find friend=follower intersection – set
•  Stage 6: Create social graph; apply network theory; infer cliques & other properties
  • 100.
@clouderati Twitter Social Graph
•  Stats (in retrospect, after the runs):
   o  Stage 1
      •  @clouderati has 2072 followers
   o  Stage 2
      •  Limiting followers to 5,000 per user
   o  Stage 3
      •  Digging to the 1st level (set union of followers & friends of the followers of @clouderati) explodes into ~980,000 distinct users
   o  MongoDB of the cache and intermediate datasets ~10 GB
   o  The database was hosted at AWS (Hi Mem XLarge – m2.xlarge), 8 X 15 GB, RAID 10, opened to the Internet with DB authentication
  • 101.
Code & Run Walk Through
Stage 1
o  Get @clouderati Followers
o  Store in MongoDB
o  Code:
   §  oscon_2012_user_list_spider_01.py
o  Challenges:
   §  Nothing fancy
   §  Get the record and store
o  Interesting Points:
   §  Would have had to recurse through a REST cursor if there were more than 5000 followers
   §  @clouderati has 2072 followers
  • 102.
Code & Run Walk Through
Stage 2
o  Crawl 1 level deep
o  Get friends & followers
o  Validate, re-crawl & refresh
o  Code:
   §  oscon_2012_user_list_spider_02.py
   §  oscon_2012_twitter_utils.py
   §  oscon_2012_mongo.py
   §  oscon_2012_validate_dataset.py
o  Challenges:
   §  Multiple runs, errors et al!
o  Interesting Points:
   §  Set operation between two mongo collections for the restart buffer
   §  Protected users; some had 0 followers or 0 friends
   §  Interesting operations for validate, re-crawl and refresh
   §  Added "status_code" to differentiate protected users
   §  {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}}
   §  Getting friends & followers of 2000 users is the hardest (or so I thought, until I got through the next stage!)
  • 103.
Validate-Recrawl-Refresh Logs
•  pymongo version = 2.2
•  Connected to DB!
•  …
•  2075
•  Error Friends : <type 'exceptions.KeyError'>
•  4ff3cd40e5557c00c7000000 - none has 2072 followers & 0 friends
•  Error Friends : <type 'exceptions.KeyError'>
•  4ff3a958e5557cfc58000000 - none has 2072 followers & 0 friends
•  Error Friends : <type 'exceptions.KeyError'>
•  4ff3ccdee5557c00b6000000 - none has 2072 followers & 0 friends
•  4ff3d3b9e5557c01b900001e - 371187804 has 0 followers & 0 friends
•  4ff3d3d8e5557c01b9000048 - 63488295 has 155 followers & 0 friends
•  4ff3d3d9e5557c01b9000049 - 342712617 has 0 followers & 0 friends
•  4ff3d3d9e5557c01b900004a - 21266738 has 0 followers & 0 friends
•  4ff3d3dae5557c01b900004b - 204652853 has 0 followers & 0 friends
•  …
•  4ff475cfe5557c1657000074 - 258944989 has 0 followers & 0 friends
•  4ff475d3e5557c165700007d - 327286780 has 0 followers & 0 friends
•  Looks like we have 132 not so good records
•  Elapsed Time = 0.546846
Notes:
o  1st run – 132 bad records
o  This is the classic Erlang-style supervisor
o  The crawl continues on transport errors without worrying about retry
o  Validate will recrawl & refresh as needed
  • 104.
    Code  &  Run Walk  Through o  Code:   §  oscon2012_analytics_01.py   Stage  3   o  Challenges:   o  Figure  out  the  right  Set  operations   o  Get  distinct  user  list   applying  the   set(union(list))  operation   o  Interesting  Points:   §  973,323  unique  users  !   §  Recursively  apply  set  union  over  400,00  lists   §  Set  operations  took  slightly  more  than  a  minute    
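A sketch of that set-union step, assuming pymongo 2.2-era API; the database, collection & field names here are hypothetical stand-ins for the actual ones:

    from pymongo import Connection  # pymongo 2.2 era

    db = Connection()['oscon']      # hypothetical database name
    distinct_users = set()
    for doc in db.t_followers.find({}, {'followers': 1}):  # hypothetical collection/field
        distinct_users |= set(doc.get('followers', []))    # set union, list by list
    print 'distinct users:', len(distinct_users)           # 973,323 in this run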
  • 105.
    Code  &  Run Walk  Through o  Code:   §  oscon2012_analytics_01.py  (focus  on  cmd  string  creation)   §  oscon2012_get_user_info_01.py   §  oscon2012_unroll_user_list_01.py   §  oscon2012_unroll_user_list_02.py   Stage  4   o  Challenges:   o  Get  &  Store  User  details   §  Where  do  I  start  ?   (distinct  user  list)   •  In  the  next  few  slides     o  Unroll   §  Took  me  a  few  days  to  get  it  right  (along  with  my  daily  job!)   §  Unfortunately  I  did  not  employ  parallelism  &  didn’t  use  my   MacPro  with  32  GB  memory.  So  the  runs  were  long   §  But  learned  hard  lessons  on  check  point  &  restart   o  Interesting  Points:   §  Tracking  Control  Numbers   §  Time  …  Marathon  unroll  run  19:33:33  !  
  • 106.
Twitter @ scale Pattern
•  Challenge:
   o  You want to get screen names, follower counts and other details for a million users
•  Problem:
   o  No easy REST API
   o  https://api.twitter.com/1/users/lookup.json will take 100 user_ids and give details
•  Solution (a sketch follows):
   o  This is a scalability challenge. Approach it like so
   o  Create a command buffer collection in MongoDB, splitting a million user_ids into batches of 100
   o  Have a "done" flag initialized to 0 for checkpoint & restart
   o  After each cmd str is executed, set "done":1
   o  For subsequent runs, ignore "done":1
   o  Also helps in control number tracking
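A minimal sketch of this pattern, reusing the api_str / seq_no / done names that appear on the control-number slides (the database name and the stand-in ids are illustrative):

    from pymongo import Connection

    db = Connection()['oscon']                # hypothetical database name
    user_ids = [str(u) for u in range(1000)]  # stand-in for ~1M real user_ids

    # Build the command buffer: one document per batch of 100, done=0
    for seq_no, i in enumerate(xrange(0, len(user_ids), 100)):
        cmd = ','.join(user_ids[i:i + 100])
        db.api_str.insert({'seq_no': seq_no, 'api_str': cmd, 'done': 0})

    # A run (or re-run) only picks up commands not yet marked done
    for doc in db.api_str.find({'done': 0}):
        # ... call users/lookup.json with user_id=doc['api_str'], store the result ...
        db.api_str.update({'_id': doc['_id']}, {'$set': {'done': 1}})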
  • 107.
  • 108.
Control Numbers
> db.t_users_info.count()
8122
> db.api_str.count({"done":0,"seq_no":{"$lt":8185}})
63
> db.api_str.find({"done":0,"seq_no":{"$lt":8185}},{"seq_no":1})
{ "_id" : ObjectId("4ff4daeae5557c28bf001d53"), "seq_no" : 5433 }
{ "_id" : ObjectId("4ff4daeae5557c28bf001d59"), "seq_no" : 5439 }
{ "_id" : ObjectId("4ff4daeae5557c28bf001d5f"), "seq_no" : 5445 }
{ "_id" : ObjectId("4ff4daebe5557c28bf001d74"), "seq_no" : 5466 }
{ "_id" : ObjectId("4ff4daece5557c28bf001d7a"), "seq_no" : 5472 }
{ "_id" : ObjectId("4ff4daece5557c28bf001d80"), "seq_no" : 5478 }
{ "_id" : ObjectId("4ff4daede5557c28bf001d90"), "seq_no" : 5494 }
{ "_id" : ObjectId("4ff4daefe5557c28bf001daf"), "seq_no" : 5525 }
o  The collection should have 8185 documents, but it has only 8122. Where did the rest go?
o  63 of them still have done=0
o  8122 + 63 = 8185! Aha, mystery solved – they fell through the cracks
o  Need a catch-all final run
  • 109.
Day in the life of a Control Number Detective – Run #1
•  Remember: 973,323 users. So, 9734 cmd strings (100 users per string)
•  > db.api_str.count()
•  9831
•  > db.api_str.count({"done":0})
•  239
•  > db.t_users_info.count()
•  9592
•  > db.api_str.count({"api_str":""})
•  97
•  So we should have 9831 - 97 = 9734 records
•  The second run should generate 9734 - 9592 = 142 calls (i.e. 350 - 142 = 208 rate-limit should remain). Let us see.
•  {
•     …
•     "x-ratelimit-class": "api_identified",
•     "x-ratelimit-limit": "350",
•     "x-ratelimit-remaining": "209",
•     …
•  }
•  Yep, 209 left
  • 110.
Day in the life of a Control Number Detective – Run #2
•  Remember: 973,323 users. So, 9734 cmd strings (100 users per string)
•  > db.t_users_info.count()
•  9728
•  > db.api_str.count({"api_str":""})
•  97
•  > db.api_str.count({"done":0})
•  103
•  > 9734 - 9728 = 6, same as 103 - 97!
•  Run once more!
•  > db.api_str.find({"done":0},{"seq_no":1})
•  …
•  { "_id" : ObjectId("4ff4dbd4e5557c28bf002e22"), "seq_no" : 9736 }
•  { "_id" : ObjectId("4ff4db05e5557c28bf001f47"), "seq_no" : 5933 }
•  { "_id" : ObjectId("4ff4db8be5557c28bf0028f6"), "seq_no" : 8412 }
•  { "_id" : ObjectId("4ff4dba2e5557c28bf002a8c"), "seq_no" : 8818 }
•  { "_id" : ObjectId("4ff4dbaee5557c28bf002b69"), "seq_no" : 9039 }
•  { "_id" : ObjectId("4ff4dbb8e5557c28bf002c1c"), "seq_no" : 9218 }
•  …
•  {
•     …
•     "x-ratelimit-limit": "350",
•     "x-ratelimit-remaining": "344",
•     …
•  }
•  Yep, 6 more records
•  > db.t_users_info.count()
•  9734
•  Good, got 9734!
Professor Layton would be proud! In fact, I have all four & plan to spend some time with them & Laphroaig!
  • 111.
    Monitor  runs  & track  control  numbers Unroll  run  8:48  PM  to  ~4:08  PM  next  day  !  
  • 112.
    Track  error  & the  document  numbers
  • 113.
Code & Run Walk Through
Stage 5
o  For each @clouderati follower, find friend=follower – set intersection
o  Code:
   §  oscon2012_find_strong_ties_01.py
   §  oscon2012_social_graph_stats_01.py
o  Challenges:
   §  None. Python set operations made this easy
o  Interesting Points:
   §  Even at this scale, a single machine is not enough
   §  Should have tried data parallelism
      •  This task is well suited to leverage data parallelism, as it is commutative & associative
   §  Was getting an invalid cursor error from MongoDB
   §  So had to do the updates in two steps
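The per-user computation itself is one Python set intersection; a toy sketch with invented ids:

    followers = set([101, 102, 103, 104])  # toy ids
    friends = set([102, 104, 105])
    strong_ties = followers & friends      # follower = friend
    print strong_ties                      # set([102, 104])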
  • 114.
Code & Run Walk Through
Stage 6
o  Create social graph
o  Apply network theory
o  Infer cliques & other properties
o  Code:
   §  oscon2012_find_cliques_01.py
o  Challenges:
   §  Memory!
o  Interesting Points:
   §  Lots of good information hidden in the data!
   §  Graph, List & set operations
   §  networkx has lots of interesting graph algorithms
   §  collections.Counter to the rescue
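"collections.Counter to the rescue" – a toy sketch of tallying the clique-size distribution shown on the next slide:

    from collections import Counter

    cliques = [['a', 'b'], ['a', 'b', 'c'], ['c', 'd'], ['b', 'c', 'd', 'e']]  # toy data
    print Counter(len(c) for c in cliques)  # Counter({2: 2, 3: 1, 4: 1})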
  • 115.
Twitter Social Graph Analysis of @clouderati
o  2072 followers; 973,323 unique users one level down, w/ followers/friends trimmed at 5,000
o  Strong ties
   o  follower=friend
   o  235,697 users, 462,419 edges
o  501,367 cliques
   o  8,906 cliques w/ > 10 users, among 253 unique users
   o  GeorgeReese is in 7,973 of them! See the list for the 1st 125
   o  krishnan 3,446, randy 2,197, joe 1,977, sam 1,937, jp 485, stu 403, urquhart 263, beaker 226, acroll 149, adrian 63, gevaperry 24
o  Of course, clique analysis does not tell us the whole story …
Clique Distribution = {2: 296521, 3: 58368, 4: 36421, 5: 28788, 6: 24197, 7: 20240, 8: 15997, 9: 11929, 10: 6576, 11: 1909, 12: 364, 13: 55, 14: 2}
  • 116.
Twitter Social Graph Analysis of @clouderati
o  Sorting by followers vs. sorting by strong ties is interesting
   o  Celebrity – very low strong ties
   o  Higher celebrity, low strong ties
   o  Medium celebrity, medium strong ties
  • 117.
Twitter Social Graph Analysis of @clouderati
o  A higher "Strong Ties" number is interesting
   §  It means a very high follower-friend intersection
   §  Reeves 62%, bgolden 85%
o  But a high clique count with a smaller "Strong Ties" number shows a more cohesive & stronger social graph
   §  e.g. Krishnan – 15% friends-followers
   §  Samj – 33%
  • 118.
Twitter Social Graph Analysis of @clouderati
o  Ideas for more exploration
   §  Include all followers (instead of stopping at the 5000 cap)
   §  Get tweets & track @mentions
   §  Frequent @mentions show stronger ties
   §  #tag analysis could show some interesting networks
  • 119.
Twitter Tips – A Baker's Dozen
1.  Twitter APIs are (more or less) congruent & symmetric
2.  Twitter is usually right & simple – recheck when you get unexpected results before blaming Twitter
   o  I was getting numbers when I was expecting screen_names in user objects.
   o  Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong – screen_name instead of user_id
   o  Always test with one or two records before a long run! – learned the hard way
3.  Twitter APIs are very powerful – consistent use can yield huge amounts of data
   o  In a week, you can pull in 4-5 million users & some tweets!
   o  Night runs are far faster & more error-free
4.  Use a NOSQL data store as a command buffer & data buffer
   o  Would make it easy to work with Twitter at scale
   o  I use MongoDB
   o  Keep the schema simple & no fancy transformation
      •  And as far as possible the same as the (json) response
   o  Use the NOSQL CLI for trimming records et al
The Beginning As The End
  • 120.
Twitter Tips – A Baker's Dozen
5.  Always use a big data pipeline
   o  Collect – Store – Transform & Analyze – Model & Reason – Predict, Recommend & Visualize
   o  That way you can orthogonally extend, with functional components like command buffers, validation et al
6.  Use a functional approach for a scalable pipeline
   o  Compose your big data pipeline with well defined granular functions, each doing only one thing
   o  Don't overload the functional components (i.e. no collect, unroll & store as a single component)
   o  Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques
      •  This did create some trouble for me, as we saw
7.  Crawl-Store-Validate-Recrawl-Refresh cycle
   o  The equivalent of the traditional ETL
   o  The validation stage & validation routines are important
      •  Cannot expect perfect runs
      •  Cannot manually look at data either, when data is at scale
8.  Have control numbers to validate runs & monitor them
   o  I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs!
   o  There will be a separate printout of the control numbers that will be kept in the operations files
  • 121.
Twitter Tips – A Baker's Dozen
9.  Program defensively
   o  More so for REST-based big data analytics systems
   o  Expect failures at the transport layer & accommodate for them
10.  Have Erlang-style supervisors in your pipeline
   o  Fail fast & move on
   o  Don't linger and try to fix errors that cannot be controlled at that layer
   o  A higher layer process will circle back and do incremental runs to correct missing spiders and crawls
   o  Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
   o  I have an example in part 2
11.  Data will never be perfect
   o  Know your data & accommodate for its idiosyncrasies
      •  for example: 0 followers, protected users, 0 friends, …
  • 122.
Twitter Tips – A Baker's Dozen
12.  Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
   o  See the MongoDB example in Part 2
13.  Don't bombard the URL
   o  Wait a few seconds between successive calls. This will end up with a scalable system, eventually
   o  I found 10 seconds to be the sweet spot. 5 seconds gave a retry error. Was able to work with 5 seconds with wait & retry. Then, the rate limit started kicking in!
14.  Always measure the elapsed time of your API runs & processing
   o  Kind of an early warning when something is wrong
15.  Develop incrementally; don't fail to check "cut & paste" errors
  • 123.
Twitter Tips – A Baker's Dozen
16.  The Twitter big data pipeline has lots of opportunities for parallelism
   o  Leverage data parallelism frameworks like MapReduce
   o  But first:
      §  Prototype as a linear system,
      §  Optimize and tweak the functional modules & cache strategies,
      §  Note down stages and tasks that can be parallelized and
      §  Then parallelize them
   o  For the example project, I did not leverage any parallel frameworks, but the opportunities were clearly evident
17.  Pay attention to handoffs between stages
   o  They might require transformation – for example collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
   o  But resist the urge to overload collect with transform
   o  i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents
   o  Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18.  Have a good log management system to capture and wade through logs
  • 124.
Twitter Tips – A Baker's Dozen
19.  Understand the underlying network characteristics for the inference you want to make
   o  Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
   o  The Twitter Network is more of an Interest Network
   o  So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
   o  But others, like Cliques and Bipartite Graphs, do
  • 125.
Twitter Gripes
1.  Need richer APIs for #tags
   o  Somewhat similar to users viz. followers, friends et al
   o  Might make sense to make #tags a top level object with its own semantics
2.  HTTP Error Return is not uniform
   o  Returns 400 Bad Request instead of 420
   o  Granted, there is enough information to figure this out
3.  Need an easier way to get screen_name from user_id
4.  "following" vs. "friends_count" i.e. "following" is a dummy variable
   o  There are a few like this, most probably for backward compatibility
5.  Parameter Validation is not uniform
   o  Gives "404 Not Found" instead of "406 Not Acceptable" or "416 Range Unacceptable"
6.  Overall, more validation would help
   o  Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
  • 126.
Thanks To these Giants …
  • 131.
I had a good time researching & preparing for this tutorial. I hope you learned a few new things & have a few vectors to follow.