
Scalding @ Coursera

Daniel Chia 
@DanielJHChia 
Software Engineer, Infrastructure


  1. Scalding @ Coursera
     Daniel Chia, @DanielJHChia
     Software Engineer, Infrastructure
  2. Overview
     • Context
     • Growing Needs
     • Hive / Pig / Scalding
  3. Technical (Online Stack)
     • 100% hosted on AWS
     • Service-oriented architecture
     • Mix of MySQL and Cassandra for persistence
     • Scala
  4. Existing Warehouse Streaming
  5. Future Warehouse Flow S3 Event Data
  6. Need 1: Expressive
     • Joins
     • Aggregations
     • Secondary sort
     • Multiple map-reduce
  7. Need 2: Semi-structured Data
     • Increased usage of Cassandra
     • Events data
  8. { "timestamp": 1411359695744, "membershipState": "LearnerEnrolled" }
  9. {
       "typeName": "multipart",
       "definition": {
         "assignmentParts": {
           "id1": {
             "typeName": "plainText",
             "order": 0,
             "definition": {
               "prompt": "Write a sentence describing what you think about cereal."
             }
           },
           "id2": {
             "typeName": "richText",
             "order": 1,
             "definition": {
               "prompt": "Write a long essay with lots of fancy formatting describing what you think about cereal."
             }
           },
           "id3": {
             "typeName": "url",
             "order": 2,
             "definition": {
               "prompt": "Post a link to your favorite cereal."
             }
           },
           "id4": {
             "typeName": "plainText",
             "order": 3,
             "definition": { …
  10. Choices
      • Hive
      • Pig
      • Scalding
  11. Hive
      • SQL-like language
      • Great for simple rollups and aggregations
      • Procedural transforms are difficult to express
  12. Pig
      • Mature
      • Procedural
      • Pig Latin + lots of UDFs
  13. Scalding – Pros
      • Succinct
      • Expressive
      • All code in one language
      • Re-use online data models
  14. Scalding – Pros
      • Easy to test
  15. Scalding – Cons
      • Have to learn Scala
      • Heavyweight for simple, experimental work
      • Many layers of abstraction above MapReduce
  16. Scalding – Example
      • User event data
      • Want to join with course and topic data
  17. Scalding – Example
        val events = TypedTsv … /* load data */ .toTypedPipe
        val courses = TypedTsv … .toTypedPipe
        val topics = TypedTsv … .toTypedPipe
  18. Scalding – Example
        events.groupBy(_.courseId)
          .leftJoin(courses.groupBy(_.courseId))
          .groupBy(_._2.topicId)
          .leftJoin(topics.groupBy(_.topicId))
          /* more analysis */
  19. Scalding – Example
        events.groupBy(_.courseId)
          .leftJoin(courses.groupBy(_.courseId))
          .groupBy(_._2.topicId)
          .leftJoin(topics.groupBy(_.topicId))
          /* more analysis */
  20. Scalding – Example
        events.groupBy(_.courseId)
          .leftJoin(courses.groupBy(_.courseId))
          .groupBy(_._2.topicId)
          .sketch(reducer = 100)
          .leftJoin(topics.groupBy(_.topicId))
  21. Scalding – Wish-list
      • More documentation
      • Scala 2.11 soon, please?
  22. Questions?
      We’re hiring! coursera.org/jobs
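
Slides 18 and 19 chain two left joins: events to courses on courseId, then that result to topics on topicId. A minimal in-memory sketch of that logic is below, written with plain Scala collections rather than Scalding. The case classes and their fields, the `leftJoin` helper, and the sample rows are all hypothetical, since the slides do not show the real data models. It only illustrates what the joined rows should look like and says nothing about how Scalding plans or runs the joins.

```scala
// Plain Scala collections, NOT Scalding: an in-memory model of the two left joins.
// Every type, field, and sample row here is a hypothetical stand-in.
object LeftJoinSketch {
  case class Event(userId: Long, courseId: Long)
  case class Course(courseId: Long, topicId: Long)
  case class Topic(topicId: Long, name: String)

  // Left join: keep every left row and attach the matching right row, if any.
  // Assumes at most one right row per key (a later duplicate would overwrite an earlier one).
  def leftJoin[L, R, K](left: Seq[L], right: Seq[R])(lk: L => K, rk: R => K): Seq[(L, Option[R])] = {
    val index: Map[K, R] = right.map(r => rk(r) -> r).toMap
    left.map(l => (l, index.get(lk(l))))
  }

  def enrich(
      events: Seq[Event],
      courses: Seq[Course],
      topics: Seq[Topic]): Seq[(Event, Option[Course], Option[Topic])] = {
    // First join: events with courses on courseId.
    val withCourse = leftJoin(events, courses)(_.courseId, _.courseId)

    // Second join: the key is the topicId of the matched course.
    // An event with no matching course has key None, so it matches no topic.
    val withTopic = leftJoin(withCourse, topics)(
      pair => pair._2.map(_.topicId),
      topic => Option(topic.topicId))

    withTopic.map { case ((event, course), topic) => (event, course, topic) }
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(1L, 10L), Event(2L, 11L), Event(3L, 99L))
    val courses = Seq(Course(10L, 100L), Course(11L, 101L))
    val topics = Seq(Topic(100L, "Statistics"))
    enrich(events, courses, topics).foreach(println)
  }
}
```

With the sample rows in `main`, the first event gets both a course and a topic, the second gets a course but no topic, and the third gets neither. All three events are kept, which is what distinguishes a left join from an inner join. Because `enrich` is a pure function over sequences, its result can be asserted on directly in a unit test.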
