Scala and Hadoop @ eBay
Slides from Adam Ilardi's presentation at the 5/21 NY Scala Meetup held at eBay NYC.

  • Introduce myself and eBay NYC.
  • Laugh.
  • We are starting to use Scala for live-site recommendations.
  • Mention Option and Either. First-class functions. Mention how great traits are. I feel like Haskell will never break into the corporation; this is a great draft. All my life I’ve wanted a type-safe build system, and NOW I have it.
  • They break backward compatibility. Weak IDE support: debugging, refactoring, etc. Explain the madness.
  • Tell them about the example.
  • The most complicated system for counting words (insert meme here). Explain why we use Hadoop: the data is huge. I can’t say when you want to make the jump to MapReduce, but I see growth in making it THE platform.
  • Say why raw MapReduce stinks. Mention what Hive is and what Scoobi is.
  • Explain why we didn’t go with Scoobi even though it’s all Scala.
  • Scheduling and DAG creation. Where is my SOURCE?
  • Mention Azkaban.
  • Can do parallel executions of tasks that don’t depend on each other. Supports static dependencies via cascades.
  • Verbose. You still need to write a bunch of code.
  • Mention Scoobi and how it’s not super stable. Remind them how Scalding combines the best of Pig and Cascading.
  • This is actual code to compute a user’s preferences. Explain a bit about user preferences.
  • Mahout has some functions for this, but they are hard to set up and get going. Less precise than other state-of-the-art methods but still accurate. Scala Days talk with Chris Severs.
  • Linear model. Talk about concept extraction. Use SQLite for ad hoc queries.
  • Talk about the use of cascades. Talk about traps and counters.
  • Scalding makes this 100 times easier because of cascades and flows.

Scala and Hadoop @ eBay: Presentation Transcript

  • Scala and Hadoop @ eBay
  • What we will cover
      • Polymorphic Function Values
      • Higher Kinded/Recursive Types
      • Cokleislis Star Operators
      • Scala Macros
  • I have no clue what those things are
  • What we will ACTUALLY cover
      • Why Scala
      • Why Hadoop
      • How we use Scala with Hadoop
      • Lots of CODE!
  • Why Scala?
      • JVM
      • **Functional**
      • Expressive
      • How to convince your boss?
  • Someone on Hacker News said Scala sucks
      • Compile Times
      • You changed List again?
      • Complicated
      • Leads to Madness
  • Madness?
      trait Lazy[+T, P] {
        var creationParameters: P = None.asInstanceOf[P];
        lazy val lazyThing: Either[Throwable, T] = try {
          Right(create(creationParameters))
        } catch { case e => Left(e) }
        def get(createParams: P): Either[Throwable, T] = {
          creationParameters = createParams
          lazyThing
        }
        def create(params: P): T
      }
  • Madness?
      def getSingleInstance[T, P](params: P)(implicit lazyCreator: Lazy[T, P]): T = {
        lazyCreator.get(params) match {
          case Right(successValue) => successValue
          case Left(exception) => throw new StackException(exception)
        }
      }
  • This is used by ONE client class
      • Show some self-restraint
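For a single client class, one restrained alternative is a plain `lazy val` wrapped in `scala.util.Try`. The sketch below is only an illustration of that idea, not the code eBay ended up with: `ExpensiveThing`, `SingleClient`, and the `creations` counter are hypothetical names, and it assumes the creation parameters are known when the client is constructed.

```scala
import scala.util.Try

object LazyDemo {
  // Hypothetical resource that is costly to build; not from the talk.
  final case class ExpensiveThing(name: String)

  // Demo-only counter so we can observe how many times creation runs.
  var creations = 0

  class SingleClient(params: String) {
    // Evaluated at most once, on first access; a failure is captured in the Try
    // instead of being thrown, so the result (good or bad) is cached.
    lazy val thing: Try[ExpensiveThing] = Try {
      creations += 1
      require(params.nonEmpty, "params must not be empty")
      ExpensiveThing(params)
    }
  }
}
```

Reading `thing` twice on the same client runs the creation block once, and callers pattern match on `Success` or `Failure` rather than on a custom `Either` wrapper.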
  • Hadoop
      • void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
      • void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
  • BIG NUMBERS
      • Petabytes of data
      • 1k+ node Hadoop cluster
      • Multi-billion dollar merchandising business
      • Lots of users and items
  • How should I use Map Reduce?
      • Raw map reduce
      • Pig
      • Hive
      • Cascading
      • Scoobi
      • Scalding
  • Decision Time
      • “And every one that heareth these sayings of mine (great software engineers of the past), and doeth them not, shall be likened unto a foolish man, which built his house upon the sand.”
      • “And the rain descended, and the floods came, and the winds blew, and beat upon that house; and it fell: and great was the fall of it.”
  • I believe!
      • Scalding combines the best of PIG and Cascading
  • Good Pig
      A = LOAD input AS (x, y, z);
      B = FILTER A BY x > 5;
      DUMP B;
      C = FOREACH B GENERATE y, z;
      STORE C INTO output;
      // do joins and group by also
  • Bad Pig
      DEFINE NV_terms `perl nv_terms2.pl` ship($scripts/nv_terms2.pl);
      i5 = stream i4 through NV_terms as (leafcat:chararray, name:chararray, name1:chararray);
      i7 = foreach i5 generate leafcat,
          com.ebay.pigudf.sic.RtlUDF(0,0,0,$site_id,name) as name,
          com.ebay.pigudf.sic.RtlUDF(0,0,0,$site_id,name1) as name1;
  • Other Pig Issues
      • Scheduling and DAG creation
  • Cascading Rocks!
      • What is it?
      • Supports large workflows and reusable components
          • DAG generation
          • Parallel Executions
  • Cascading code in Scala
      val masterPipe = new FilterURLEncodedStrings(masterPipe, "sqr")
      masterPipe = new FilterInappropriateQueries(masterPipe, "sqr")
      masterPipe = new GroupBy(masterPipe, CFields("user_id", "epoch_ts", "sqr"), sortFields)
  • Someone should really code review this
  • Cascading Issues
      • This page intentionally left blank
  • Scalding Time
      class WordCountJob(args : Args) extends Job(args) {
        TextLine( args("input") )
          .flatMap('line -> 'word) { line : String => tokenize(line) }
          .groupBy('word) { _.size }
          .write( Tsv( args("output") ) )

        // Split a piece of text into individual words.
        def tokenize(text : String) : Array[String] = {
          // Lowercase each word and remove punctuation.
          text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
        }
      }
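The Scalding job has the same flatMap-then-groupBy shape as ordinary Scala collection code, which is part of why it reads so easily. The sketch below runs that shape locally on an in-memory `Seq`, with no Hadoop or Scalding dependency. `LocalWordCount` and `count` are hypothetical names, and the extra filter for empty tokens is an addition that the slide's tokenizer does not have.

```scala
object LocalWordCount {
  // Lowercase, strip punctuation, split on whitespace (as on the slide),
  // then drop empty tokens (an addition for this sketch).
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+").filter(_.nonEmpty)

  // flatMap lines into words, group identical words, count each group.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => tokenize(line).toList)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

Calling `LocalWordCount.count(Seq("Hello, world!", "hello Scala"))` counts `hello` twice and `world` and `scala` once each.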
  • Scalding @ eBay
      • Boilerplate reduction
      • Extensibility
      • New hires
  • Practical Scalding Use
      • Pimp my pimp
      • Code generated boilerplate
      • Cascades
      • Traps
      • Testing!
  • class eBayJob
      class eBayJob(args: Args) extends Job(args) with PipeBoilerPlate {
        implicit def pipe2eBayRichPipe(pipe: Pipe) = new eBayRichPipe(pipe)
        class eBayRichPipe(pipe: Pipe) extends RichPipe(pipe) with CommonFunctions
        trait CommonFunctions {
          import Dsl._
          import RichPipe.assignName
          def pipe: Pipe
          def reallyComplexFunction(field: Fields, param: Long) = {
            // mind blowing code here
          }
        }
      }
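The slide relies on the "enrich my library" pattern: an implicit conversion wraps a pipe in a richer type that mixes in shared functions. The sketch below shows the same pattern in plain Scala so it can run without Cascading or Scalding. `MiniPipe`, `CommonSteps`, `atLeast`, and `topN` are hypothetical stand-ins, not eBay's or Scalding's API.

```scala
object EnrichDemo {
  // Stand-in for a pipeline of (key, score) records; not Cascading's Pipe.
  final case class MiniPipe(rows: List[(String, Int)])

  // Shared, reusable steps live in a trait, like CommonFunctions on the slide.
  trait CommonSteps {
    def pipe: MiniPipe
    // Keep only rows whose score is at least `min`.
    def atLeast(min: Int): MiniPipe = MiniPipe(pipe.rows.filter(_._2 >= min))
    // Keep the `n` highest-scoring rows, highest first.
    def topN(n: Int): MiniPipe = MiniPipe(pipe.rows.sortBy(row => -row._2).take(n))
  }

  // The implicit class adds those steps to every MiniPipe in scope.
  implicit class RichMiniPipe(val pipe: MiniPipe) extends CommonSteps

  // Each step returns a MiniPipe, so the enrichment applies again and calls chain.
  def demo: MiniPipe =
    MiniPipe(List("a" -> 1, "b" -> 5, "c" -> 3)).atLeast(2).topN(1)
}
```

Because every step returns the plain type, new steps can be added to the trait without touching the jobs that already chain the existing ones.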
  • CheckoutTransactionsPipe
      CheckoutTransactionsPipe(/* default path logic */)
        .project(/* fields I need */)
        .countUserInteractions(/* params */)
        .doScoreCalculation(/* params */)
        .doConfidenceCalculation(/* params */)
      • Seems a bit too readable for Scala
  • Collaborative Filtering
      • Typically hard to run on large datasets
  • Structured Data Importance
      • Do people shop by brand?
      [Chart: Supply; Handbags and Purses]
  • Markov Chains
      • Investigation of buying patterns in ~50 lines of code
      val purchases = "firsttime" :: x.take(500).toList
      val pairs = purchases zip purchases.tail
      val grouped = pairs.groupBy(x => x._1.toString + "-" + x._2.toString)
      val sizes = grouped map { x => { x._1 -> x._2.size } } toList
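The snippet on the slide counts how often one purchase follows another. A self-contained sketch of that counting step is below, under a few assumptions: a user's history is a `List[String]` of category names, pairs are keyed as tuples rather than joined strings, and the `probabilities` step that normalises counts into transition probabilities is an addition not shown on the slide. All names are hypothetical.

```scala
object TransitionCounts {
  // Count how often each state is immediately followed by another,
  // starting from the synthetic "firsttime" state used on the slide.
  def transitions(history: List[String]): Map[(String, String), Int] = {
    val purchases = "firsttime" :: history
    val pairs = purchases zip purchases.tail
    pairs.groupBy(identity).map { case (pair, hits) => pair -> hits.size }
  }

  // Turn raw counts into P(next | current) by dividing by the total
  // number of transitions leaving each current state.
  def probabilities(counts: Map[(String, String), Int]): Map[(String, String), Double] = {
    val totals = counts.groupBy(_._1._1).map { case (from, m) => from -> m.values.sum }
    counts.map { case ((from, to), n) => (from, to) -> n.toDouble / totals(from) }
  }
}
```

For the history `List("shoes", "bags", "shoes", "hats")`, `shoes` is followed once by `bags` and once by `hats`, so each of those transitions gets probability 0.5.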
  • Mining Search Queries
      • 20+ billion user queries - give me the top ones per user
      [Diagram: De-Dupe, Rank, Validate, Sample Data]
  • Automation
      [Diagram: Hadoop Proxy, Batch Database Load, Machines, Cassandra, Jenkins, MySql, Mongo]
  • Questions?
      www.ebaynyc.com