Coursera +
AWS CloudSearch
    Frank Chen
    Software Engineer
About
•    Ed-Tech startup providing MOOCs
     o    Massive Open Online Courses
•    New company -- launched 4/18/12
     o    Less than a year old.

•    215 free courses from 33 top universities
     o  Princeton, Stanford, Penn, Duke, etc...
     o  From Cryptography to Modern and Contemporary American
        Poetry
•  2.5+ million users
     o    We reached a million users faster than Facebook and
          Pinterest.
•  ~9 million course enrollments
Platform Scale
•    Moderate-sized (>10,000 concurrent users)
•    65 courses running concurrently right now, each with tens of
     thousands of enrollments
•    >600 "pretty heavy" PHP/Python dynamic pages served
     per second sustained
     o    Might make backend calls to services (e.g. CloudSearch or SES -->
          want low latencies)
•    Various other services (70+ EC2 instances running
     at the moment)
•    Spiky traffic
     o    People procrastinate on deadlines - spiky on the weekends
Stack
•    PHP / Python / Scala backed by MySQL
•    Runs on AWS completely
•    Utilizes lots of AWS services
     o    EC2 / ELB for servers
     o    MySQL RDS for databases
     o    S3 for video and static hosting
     o    CloudFront for video / asset hosting
     o    SES for emails (>1 million emails every day)
     o    SQS for long-running tasks (video encoding, gradebook generation,
          etc...)
     o    SNS for notification services
     o    Route 53 for DNS
     o    CloudSearch for forum search
Why CloudSearch?
•    Big issue for us back in March / April. Solution then
     didn't work
     o    MySQL Full Text Search
          §  LIKE %x% AS NATURAL LANGUAGE?
          §  Really terrible results
          §  MyISAM (eww...)

•    Requirements:
     o    Fast searches (we call backend APIs - don't want to keep the users
          waiting too long)
     o    Good results (need to be relevant - don't waste the students' time)
     o    Low/no maintenance (we have enough instances to manage as is)
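
For reference, the two MySQL approaches this slide alludes to look roughly like the sketch below. The table and column names (`forum_posts`, `body`) are hypothetical, since the talk does not show the real schema, and the `%s` placeholders assume a DB-API driver such as MySQLdb. The second query needs a FULLTEXT index on the column, which is what tied the approach to MyISAM at the time.

```python
# Hypothetical schema: a forum_posts table with a text column named body.

# Substring match: cannot use a normal index and has no notion of relevance.
LIKE_QUERY = "SELECT id FROM forum_posts WHERE body LIKE %s"

# MySQL full-text search in natural language mode, ranked by score.
FULLTEXT_QUERY = (
    "SELECT id, MATCH(body) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
    "FROM forum_posts "
    "WHERE MATCH(body) AGAINST (%s IN NATURAL LANGUAGE MODE) "
    "ORDER BY score DESC LIMIT 20"
)


def like_params(term):
    """Parameters for LIKE_QUERY; escapes LIKE wildcards in user input."""
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return ("%" + escaped + "%",)


def fulltext_params(term):
    """Parameters for FULLTEXT_QUERY; the term appears twice in the SQL."""
    return (term, term)
```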
Why CloudSearch?
•  Alternatives we looked at:
   o  Apache Solr, Sphinx, fiddling with MySQL
•  Then CloudSearch was announced...
•  Early general adopter - we started using
  CloudSearch ~10 days after announcement
   o  We didn't get any heads-up about CS before the public
      announcement
   o  Wrote the code to use CloudSearch and import over our
      existing forum posts / comments in 2 or 3 days.
       §  From decision to production!
       §  Easy to use and great documentation
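
The import step can be sketched as follows. This is not the code from the talk: the field names (`course_id`, `author_id`, `title`, `body`) and the helpers `to_add_operation` and `build_batches` are assumptions for illustration. It builds JSON document batches in the "add" operation format of the current CloudSearch document API and keeps each batch under the 5 MB batch limit; the API version available at the time of the talk may have required additional per-document fields. The HTTP upload to the domain's document endpoint is left out.

```python
import json

MAX_BATCH_BYTES = 5 * 1024 * 1024  # CloudSearch caps a document batch at 5 MB


def to_add_operation(post):
    """Turn one forum post (a dict) into a CloudSearch 'add' operation.

    The field names are hypothetical; they must match the domain's index fields.
    """
    return {
        "type": "add",
        "id": "post_{}".format(post["id"]),
        "fields": {
            "course_id": post["course_id"],
            "author_id": post["author_id"],
            "title": post["title"],
            "body": post["body"],
        },
    }


def build_batches(posts, max_bytes=MAX_BATCH_BYTES):
    """Yield JSON strings, each a list of add operations of at most max_bytes.

    A single document larger than max_bytes is still emitted on its own;
    callers would need to reject or truncate such documents beforehand.
    """
    current, size = [], 2  # 2 bytes for the enclosing "[" and "]"
    for post in posts:
        encoded = json.dumps(to_add_operation(post))  # ASCII-only by default
        needed = len(encoded) + 1  # +1 for the separating comma
        if current and size + needed > max_bytes:
            yield "[" + ",".join(current) + "]"
            current, size = [], 2
        current.append(encoded)
        size += needed
    if current:
        yield "[" + ",".join(current) + "]"
```

Each yielded string would then be sent as the body of one batch upload request.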
CloudSearch Uses
      User facing forum search
CloudSearch Uses
•  Analytics
   o  Most frequent searches and other per-course statistics
      §  Informing instructors about these so they can clarify
         information
   o  Finding posts across forums
      §  Easy for CloudSearch, hard normally because of sharded
         scatter-gather problems
         •  Old way: Querying 600 databases on 4 RDS servers? Not fun
      §  Usage analysis
      §  Unexpected use: Instructors often want to find all their own
         posts so they can save / archive common answers
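
To illustrate why the "old way" is painful, here is a minimal scatter-gather sketch. It is an assumption-laden stand-in, not Coursera's code: each shard is an in-memory dict of post ID to body, scoring is a naive term count, and `search_shard` and `scatter_gather` are hypothetical names. With real per-course databases, every shard query would be a network round trip, and the slowest shard sets the overall latency.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term, limit):
    """Stand-in for one per-course database query.

    Returns up to `limit` (score, post_id) pairs, best first, where the score
    is simply how often the term occurs in the post body.
    """
    needle = term.lower()
    hits = [(body.lower().count(needle), post_id) for post_id, body in shard.items()]
    return heapq.nlargest(limit, (hit for hit in hits if hit[0] > 0))


def scatter_gather(shards, term, limit=10):
    """Query every shard in parallel (scatter), then merge the top hits (gather)."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term, limit), shards))
    return heapq.nlargest(limit, (hit for part in partials for hit in part))
```

With a single search index covering all forums, the fan-out and merge steps disappear and one query returns the globally ranked results.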
CloudSearch Scale
•  Moderate scale
•  ~1.5 million documents indexed
   o    All forum posts and comments


•  50,000+ searches a day
   o    Spiky! Depends on when homework is due.
Experience
        GREAT!
We Want...
•  "Did you mean..."
  o    Lots of typos from non-native speakers


•  Multilingual Tokenization / Search
  o    We are starting to run courses in other languages...


•  Find Similar Documents
Thank You!
    Questions?
frank@coursera.org
