Scala and Hadoop @ eBay

Slides from Adam Ilardi's presentation at the 5/21 NY Scala Meetup held at eBay NYC.

  • Introduce myself and eBay NYC
  • Laugh
  • We are starting to use Scala for live-site recs
  • Mention Option and Either. First-class functions. Mention how great traits are. I feel like Haskell will never break into the corporation; this is a great draft. All my life I’ve wanted a type-safe build system. And NOW I have it
  • They break backward compatibility. Weak IDE support: debugging, refactoring, etc. Explain the madness
  • Tell them about the example
  • The most complicated system for counting words (insert meme here). Explain why we use Hadoop: the data is huge. I can’t say when you want to make the jump to MapReduce, but I see growth in making it THE platform
  • Say why raw MapReduce stinks. Mention what Hive is and what Scoobi is
  • Explain why we didn’t go with Scoobi even though it’s all Scala
  • Scheduling and DAG creation. Where is my SOURCE?
  • Mention Azkaban
  • Can do parallel executions of tasks that don’t depend on each other. Supports static dependencies via cascades
  • Verbose. You still need to write a bunch of code.
  • Mention Scoobi and how it’s not super stable. Remind them how Scalding combines the best of Pig and Cascading
  • This is actual code to compute a user’s preferences. Explain a bit about user preferences
  • Mahout has some functions for this, but they are hard to set up and get going. Less precise than other state-of-the-art methods but still accurate. Scala Days talk with Chris Severs
  • Linear model. Talk about concept extraction. Use SQLite for ad hoc queries
  • Talk about the use of cascades. Talk about traps and counters
  • Scalding makes this 100 times easier because of cascades and flows
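
The notes on DAG creation and parallel execution can be pictured with a toy scheduler. This is only a sketch in plain Scala, not how Cascading or Azkaban actually plan work: `DagWaves`, `waves`, and the task names in the usage note are made up for illustration. It groups tasks into waves in which every task depends only on tasks from earlier waves, so the tasks inside one wave could run in parallel.

```scala
object DagWaves {
  /** Groups tasks into waves; each task in a wave depends only on tasks in earlier waves.
    * `deps` maps a (hypothetical) task name to the names of the tasks it needs first.
    */
  def waves(deps: Map[String, Set[String]]): List[Set[String]] = {
    @annotation.tailrec
    def loop(
        remaining: Map[String, Set[String]],
        done: Set[String],
        acc: List[Set[String]]
    ): List[Set[String]] =
      if (remaining.isEmpty) acc.reverse
      else {
        // A task is ready once everything it needs has already finished.
        val ready: Set[String] =
          remaining.collect { case (task, needs) if needs.subsetOf(done) => task }.toSet
        // Nothing ready while tasks remain means a cycle or an unknown dependency.
        require(ready.nonEmpty, "cycle or missing dependency among: " + remaining.keys.mkString(", "))
        loop(remaining.filter { case (task, _) => !ready(task) }, done ++ ready, ready :: acc)
      }

    loop(deps, Set.empty, Nil)
  }
}
```

With hypothetical tasks `sample`, `dedupe`, `clicks`, and `rank`, where `rank` needs both `dedupe` and `clicks` and those two each need `sample`, `waves` returns three waves: `sample` alone, then `dedupe` and `clicks` together, then `rank`.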
  • Scala and Hadoop @ eBay

    1. Scala and Hadoop @ eBay
    2. What we will cover
        • Polymorphic Function Values
        • Higher Kinded/Recursive Types
        • Cokleislis Star Operators
        • Scala Macros
    3. I have no clue what those things are
    4. What we will ACTUALLY cover
        • Why Scala
        • Why Hadoop
        • How we use Scala with Hadoop
        • Lots of CODE!
    5. Why Scala?
        • JVM
        • **Functional**
        • Expressive
        • How to convince your boss?
    6. Someone on Hacker News said Scala sucks
        • Compile Times
        • You changed List again?
        • Complicated
        • Leads to Madness
    7. Madness?

            trait Lazy[+T, P] {
              var creationParameters: P = None.asInstanceOf[P];
              lazy val lazyThing: Either[Throwable, T] =
                try { Right(create(creationParameters)) }
                catch { case e => Left(e) }
              def get(createParams: P): Either[Throwable, T] = {
                creationParameters = createParams
                lazyThing
              }
              def create(params: P): T
            }

    8. Madness?

            def getSingleInstance[T, P](params: P)(implicit lazyCreator: Lazy[T, P]): T = {
              lazyCreator.get(params) match {
                case Right(successValue) => successValue
                case Left(exception) => throw new StackException(exception)
              }
            }

    9. This is used by ONE client class
        • Show some self-restraint
    10. Hadoop
        • void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
        • void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
    11. BIG NUMBERS
        • Petabytes of data
        • 1k+ node Hadoop cluster
        • Multi-billion dollar merchandising business
        • Lots of users and items
    12. How should I use Map Reduce?
        • Raw map reduce
        • Pig
        • Hive
        • Cascading
        • Scoobi
        • Scalding
    13. Decision Time
        • “And every one that heareth these sayings of mine (great software engineers of the past), and doeth them not, shall be likened unto a foolish man, which built his house upon the sand.”
        • “And the rain descended, and the floods came, and the winds blew, and beat upon that house; and it fell: and great was the fall of it.”
    14. I believe!
        • Scalding combines the best of PIG and Cascading
    15. Good Pig

            A = LOAD 'input' AS (x, y, z);
            B = FILTER A BY x > 5;
            DUMP B;
            C = FOREACH B GENERATE y, z;
            STORE C INTO 'output';
            -- do joins and group by also

    16. Bad Pig

            DEFINE NV_terms `perl nv_terms2.pl` ship('$scripts/nv_terms2.pl');
            i5 = stream i4 through NV_terms as (leafcat:chararray, name:chararray, name1:chararray);
            i7 = foreach i5 generate leafcat,
                com.ebay.pigudf.sic.RtlUDF(0,0,0,$site_id,name) as name,
                com.ebay.pigudf.sic.RtlUDF(0,0,0,$site_id,name1) as name1;

    17. Other Pig Issues
        • Scheduling and DAG creation
    18. Cascading Rocks!
        • What is it?
        • Supports large workflows and reusable components
            – DAG generation
            – Parallel Executions
    19. Cascading code in Scala

            val masterPipe = new FilterURLEncodedStrings(masterPipe, "sqr")
            masterPipe = new FilterInappropriateQueries(masterPipe, "sqr")
            masterPipe = new GroupBy(masterPipe, CFields("user_id", "epoch_ts", "sqr"), sortFields)

    20. Someone should really code review this
    21. Cascading Issues
        This page intentionally left blank
    22. Scalding Time

            import com.twitter.scalding._

            class WordCountJob(args : Args) extends Job(args) {
              TextLine( args("input") )
                .flatMap('line -> 'word) { line : String => tokenize(line) }
                .groupBy('word) { _.size }
                .write( Tsv( args("output") ) )

              // Split a piece of text into individual words.
              def tokenize(text : String) : Array[String] = {
                // Lowercase each word and remove punctuation.
                text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
              }
            }

    23. Scalding @ eBay
        • Boilerplate reduction
        • Extensibility
        • New hires
    24. Practical Scalding Use
        • Pimp my pimp
        • Code generated boilerplate
        • Cascades
        • Traps
        • Testing!
    25. Pimp my pimp: the eBay job base class

            class eBayJob(args: Args) extends Job(args) with PipeBoilerPlate {
              implicit def pipe2eBayRichPipe(pipe: Pipe) = new eBayRichPipe(pipe)

              class eBayRichPipe(pipe: Pipe) extends RichPipe(pipe) with CommonFunctions

              trait CommonFunctions {
                import Dsl._
                import RichPipe.assignName

                def pipe: Pipe

                def reallyComplexFunction(field: Fields, param: Long) = {
                  // mind blowing code here
                }
              }
            }

    26. Seems a bit too readable for Scala

            CheckoutTransactionsPipe(/* default path logic */)
              .project(/* fields I need */)
              .countUserInteractions(/* params */)
              .doScoreCalculation(/* params */)
              .doConfidenceCalculation(/* params */)

    27. Collaborative Filtering
        • Typically hard to run on large datasets
    28. Structured Data Importance
        • Do people shop by brand?
        [Chart: Supply, Handbags and Purses; axis from 0 to 1.2]
    29. Markov Chains
        • Investigation of buying patterns in ~50 lines of code

            val purchases = "firsttime" :: x.take(500).toList
            val pairs = purchases zip purchases.tail
            val grouped = pairs.groupBy(x => x._1.toString + "-" + x._2.toString)
            val sizes = grouped.map { x => x._1 -> x._2.size }.toList

    30. Mining Search Queries
        • 20+ billion user queries - give me the top ones per user
        [Diagram labels: De-Dupe, Rank, Validate, Sample Data]
    31. Automation
        [Diagram labels: Hadoop Proxy, Batch Database Load Machines, Cassandra, Jenkins, MySql, Mongo]
    32. Questions?
        www.ebaynyc.com
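
Slides 10 and 22 show word counting at two levels: the raw Hadoop `map`/`reduce` signatures and the Scalding job. One rough way to see how they relate is to run the same phases on ordinary Scala collections. The following is a minimal sketch with no Hadoop or Scalding involved; `LocalWordCount` and its phase helpers are hypothetical names, and the regex assumes the `\\s` escapes that the slide text appears to have lost.

```scala
object LocalWordCount {
  // Same cleanup as the slide's tokenize: lowercase, drop punctuation, split on whitespace.
  // Empty tokens (for example from leading spaces) are filtered out.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+").filter(_.nonEmpty).toSeq

  // Map phase: emit a (word, 1) pair for every token of every line.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(line => tokenize(line).map(word => (word, 1)))

  // Shuffle phase: gather all values emitted for the same key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, group) => word -> group.map(_._2) }

  // Reduce phase: fold each key's values into a single count.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => word -> ones.sum }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(mapPhase(lines)))
}
```

Calling `LocalWordCount.wordCount(Seq("Hello, world!", "hello Hadoop"))` counts `hello` twice and `world` and `hadoop` once each. In the Scalding job the shuffle is implied by `groupBy`, which is a large part of why that version is so much shorter.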

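
The Markov chain snippet on slide 29 stops at counting adjacent purchase pairs. Below is a hedged sketch of how those counts might be turned into transition probabilities. The names `PurchaseChain`, `transitionCounts`, and `transitionProbabilities` are assumptions, as is the idea that each purchase is represented by a category string; the slide does not say what `x` contains. Tuple keys are used instead of the slide's `"a-b"` string keys so that categories containing a hyphen cannot collide.

```scala
object PurchaseChain {
  // Count how often each category is directly followed by each other category.
  def transitionCounts(purchases: List[String]): Map[(String, String), Int] = {
    // (previous, next) pairs; empty when there are fewer than two purchases.
    val pairs = purchases.zip(purchases.drop(1))
    pairs.groupBy(identity).map { case (pair, hits) => pair -> hits.size }
  }

  // Normalise the counts per starting category so each row sums to 1.
  def transitionProbabilities(purchases: List[String]): Map[String, Map[String, Double]] = {
    val counts = transitionCounts(purchases)
    counts.groupBy { case ((from, _), _) => from }.map { case (from, row) =>
      val total = row.values.sum.toDouble
      from -> row.map { case ((_, to), n) => to -> n / total }
    }
  }
}
```

For a made-up history such as `List("firsttime", "shoes", "handbags", "shoes", "handbags", "watches")`, the pair `("shoes", "handbags")` is counted twice, and `handbags` is followed by `shoes` and by `watches` with probability 0.5 each.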