Scala and Hadoop @ eBay
What we will cover
• Polymorphic Function Values
• Higher Kinded/Recursive Types
• Cokleislis Star Operators
• Scala Macros
I have no clue what those things are
What we will ACTUALLY cover
• Why Scala
• Why Hadoop
• How we use Scala with Hadoop
• Lots of CODE!
Why Scala?
• JVM
• **Functional**
• Expressive
• How to convince your boss?
Someone on Hacker News said
Scala sucks
• Compile Times
• You changed List again?
• Complicated
• Leads to Madness
Madness?
trait Lazy[+T, P] {
var creationParameters: P = None.asInstanceOf[P];
lazy val lazyThing: Either[Throwable, T] = try {
Right(create(creationParameters)) }
catch { case e => Left(e) }
def get(createParams: P): Either[Throwable, T] = {
creationParameters = createParams
lazyThing
}
def create(params: P): T
}
Madness?
def getSingleInstance[T, P](params: P)(implicit
lazyCreator: Lazy[T, P]): T = {
lazyCreator.get(params) match {
case Right(successValue) => successValue
case Left(exception) => throw new
StackException(exception)
}
}
This is used by ONE client class
• Show some self-restraint
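For comparison, a more restrained way to get one-time, failure-capturing initialisation is a `lazy val` wrapped in `scala.util.Try`. This is only a sketch, not the code the slides refer to; `RestrainedLazy` and `Once` are hypothetical names introduced for illustration.

```scala
import scala.util.Try

object RestrainedLazy {
  // Hypothetical wrapper: runs `create` at most once, on first access,
  // and keeps either the value or the non-fatal exception it threw.
  final class Once[T](create: () => T) {
    lazy val value: Try[T] = Try(create())
  }
}
```

A caller would write `new RestrainedLazy.Once(() => expensiveSetup())` (with `expensiveSetup` standing in for whatever needs creating) and pattern match on `value`, with no implicit parameters or mutable state involved.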
Hadoop
• void map(K1 key, V1 value,
OutputCollector<K2, V2> output, Reporter
reporter)
• void reduce(K2 key, Iterator<V2> values,
OutputCollector<K3, V3> output, Reporter
reporter)
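Those two signatures are the whole contract: map emits intermediate key/value pairs, the framework groups them by key, and reduce folds each group. A minimal in-memory sketch of that flow in plain Scala, with no Hadoop dependency; `MiniMapReduce`, `run` and `wordCount` are hypothetical names, not Hadoop API:

```scala
// In-memory model of the map -> shuffle -> reduce contract.
// Hypothetical helper for illustration; not part of Hadoop.
object MiniMapReduce {
  def run[K1, V1, K2, V2, V3](input: Seq[(K1, V1)])(
      mapper: (K1, V1) => Seq[(K2, V2)])(
      reducer: (K2, Seq[V2]) => V3): Map[K2, V3] = {
    // Map phase: every input record may emit zero or more pairs.
    val emitted = input.flatMap { case (k, v) => mapper(k, v) }
    // Shuffle phase: collect the emitted values under their key.
    val shuffled = emitted.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Reduce phase: fold each key's values into a single result.
    shuffled.map { case (k, vs) => k -> reducer(k, vs) }
  }

  // Word count expressed against that contract.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val records = lines.zipWithIndex.map { case (line, offset) => (offset, line) }
    run(records)(
      (offset: Int, line: String) =>
        line.split("\\s+").filter(_.nonEmpty).toSeq.map(word => (word, 1))
    )(
      (word: String, ones: Seq[Int]) => ones.sum
    )
  }
}
```

Even in this toy form, the logic that matters is two small functions; everything else is plumbing, which is the part the higher-level tools try to hide.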
BIG NUMBERS
• Petabytes of data
• 1k+ node Hadoop cluster
• Multi-billion dollar merchandising business
• Lots of users and items 
How should I use Map Reduce?
• Raw map reduce 
• Pig
• Hive
• Cascading
• Scoobi
• Scalding 
Decision Time
• “And every one that heareth these sayings of
mine (great software engineers of the past),
and doeth them not, shall be likened unto a
foolish man, which built his house upon the
sand.”
• “And the rain descended, and the floods
came, and the winds blew, and beat upon that
house; and it fell: and great was the fall of it.”
I believe!
• Scalding combines the best of PIG and
Cascading
Good Pig
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';
// do joins and group by also
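The same dataflow reads almost line for line as Scala collection operations, which is part of the appeal of a Scala DSL over Hadoop. A small in-memory sketch; the `Row` case class and its field types are assumptions, since the Pig script leaves x, y and z untyped:

```scala
object GoodPigInScala {
  // Assumed schema for the relation A; the types are illustrative only.
  final case class Row(x: Int, y: String, z: String)

  def transform(a: Seq[Row]): Seq[(String, String)] = {
    val b = a.filter(_.x > 5)      // B = FILTER A BY x > 5;
    b.map(row => (row.y, row.z))   // C = FOREACH B GENERATE y, z;
  }
}
```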
Bad Pig
DEFINE NV_terms `perl nv_terms2.pl`
ship('$scripts/nv_terms2.pl');
i5 = stream i4 through NV_terms as (leafcat:chararray,
name:chararray, name1:chararray);
i7 = foreach i5 generate leafcat,
com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name) as
name,
com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name1) as
name1;
Other Pig Issues
• Scheduling and DAG creation
Cascading Rocks!
• What is it?
• Supports large workflows and reusable
components
– DAG generation
– Parallel Executions
Cascading code in Scala
val masterPipe = new
FilterURLEncodedStrings(masterPipe, "sqr")
masterPipe = new
FilterInappropriateQueries(masterPipe, "sqr")
masterPipe = new GroupBy(masterPipe,
CFields("user_id", "epoch_ts", "sqr"),
sortFields)
Someone should really code review this
Cascading Issues
This page intentionally left blank
Scalding Time
class WordCountJob(args : Args) extends Job(args) {
TextLine( args("input") )
.flatMap('line -> 'word) { line : String => tokenize(line) }
.groupBy('word) { _.size }
.write( Tsv( args("output") ) )
// Split a piece of text into individual words.
def tokenize(text : String) : Array[String] = {
// Lowercase each word and remove punctuation.
text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}
}
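To try the tokenizer and the counting logic without a cluster, the same steps can be run on an in-memory collection. A sketch in plain Scala with no Scalding dependency; `LocalWordCount` and `count` are hypothetical names, the regex escapes are written out in full (`\\s`), and the extra filter for empty tokens is an addition that the job itself does not have:

```scala
object LocalWordCount {
  // Lowercase the text, strip punctuation, then split on runs of whitespace.
  def tokenize(text: String): Array[String] =
    text.toLowerCase
      .replaceAll("[^a-zA-Z0-9\\s]", "")
      .split("\\s+")
      .filter(_.nonEmpty) // added here: drop the empty token a leading space produces

  // In-memory stand-in for flatMap('line -> 'word) followed by groupBy('word) { _.size }.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => tokenize(line).toSeq)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```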
Scalding @ eBay
• Boilerplate reduction
• Extensibility
• New hires
Practical Scalding Use
• Pimp my pimp
• Code generated boilerplate
• Cascades
• Traps
• Testing!
class eBayJob(args: Args) extends Job(args) with PipeBoilerPlate {
implicit def pipe2eBayRichPipe(pipe: Pipe) = new eBayRichPipe(pipe)
class eBayRichPipe(pipe: Pipe) extends RichPipe(pipe) with
CommonFunctions
trait CommonFunctions {
import Dsl._
import RichPipe.assignName
def pipe: Pipe
def reallyComplexFunction(field: Fields, param: Long) = {
//mind blowing code here
}
}}
CheckoutTransactionsPipe(//default path logic)
.project(//fields I need)
.countUserInteractions(//params)
.doScoreCalculation(//params)
.doConfidenceCalculation(//params)
Seems a bit too readable for Scala
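The readable chain works because of the enrichment pattern in eBayJob: an implicit conversion wraps the pipe in a class that carries the domain-specific steps. A self-contained sketch of the same idea without Cascading or Scalding; `Events`, `RichEvents`, `onlyUser` and `interactionsFor` are hypothetical stand-ins, and only the method name `countUserInteractions` is borrowed from the slide:

```scala
object EnrichmentSketch {
  // Stand-in for a Pipe: a plain list of (userId, itemId) events.
  final case class Events(rows: Seq[(String, String)])

  // Stand-in for eBayRichPipe: extra steps made available on Events
  // through an implicit class, so callers can chain them fluently.
  implicit class RichEvents(events: Events) {
    def onlyUser(userId: String): Events =
      Events(events.rows.filter(_._1 == userId))

    def countUserInteractions: Map[String, Int] =
      events.rows.groupBy(_._1).map { case (user, rs) => user -> rs.size }
  }

  // The call site reads as a pipeline, with no wrapper type in sight.
  def interactionsFor(userId: String, events: Events): Map[String, Int] =
    events.onlyUser(userId).countUserInteractions
}
```

Outside the object, `import EnrichmentSketch._` brings the implicit class into scope, much as extending the job base class does for the pipe conversion.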
Collaborative Filtering
• Typically hard to run on large datasets
Structured Data Importance
• Do people shop by brand?
[Chart: Supply, Handbags and Purses (axis 0 to 1.2)]
Markov Chains
• Investigation of buying patterns in ~50 lines of
code
val purchases = "firsttime" :: x.take(500).toList
val pairs = purchases zip purchases.tail
val grouped = pairs.groupBy(x =>
x._1.toString+"-"+x._2.toString)
val sizes = grouped map { x => {
x._1 -> x._2.size
}} toList
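A runnable version of that fragment, extended with the normalisation step that turns transition counts into probabilities. It assumes that `x` on the slide is one user's purchase history as a sequence of item or category names; `PurchaseTransitions`, `counts` and `probabilities` are hypothetical names, and tuple keys replace the string concatenation used above:

```scala
object PurchaseTransitions {
  // Count how often one purchase is directly followed by another.
  def counts(history: Seq[String]): Map[(String, String), Int] = {
    val purchases = "firsttime" +: history
    purchases
      .zip(purchases.tail)
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }
  }

  // Divide each count by the total for its starting state.
  def probabilities(history: Seq[String]): Map[(String, String), Double] = {
    val c = counts(history)
    val totals = c
      .groupBy { case ((from, _), _) => from }
      .map { case (from, transitions) => from -> transitions.values.sum }
    c.map { case ((from, to), n) => (from, to) -> n.toDouble / totals(from) }
  }
}
```

For the history `Seq("shoes", "bag", "shoes", "hat")`, the state "shoes" is followed once by "bag" and once by "hat", so both transitions get probability 0.5.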
Mining Search Queries
• 20+ billion user queries - give me the top ones
per user
[Diagram: De-Dupe, Rank, Validate, Sample Data]
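The core of the per-user ranking can be sketched on an in-memory collection. This assumes "top" means most frequent and that the log is a sequence of (userId, query) pairs; `TopQueries` and `topPerUser` are hypothetical names, and at 20+ billion records the same grouping would run as a distributed job rather than in one JVM:

```scala
object TopQueries {
  // (userId, query) log entries in, each user's n most frequent queries out.
  def topPerUser(log: Seq[(String, String)], n: Int): Map[String, Seq[String]] =
    log.groupBy(_._1).map { case (user, entries) =>
      val ranked = entries
        .map(_._2)
        .groupBy(identity)                          // de-dupe: one entry per distinct query
        .map { case (query, hits) => (query, hits.size) }
        .toSeq
        .sortBy { case (query, hits) => (-hits, query) } // rank: most frequent first, ties by name
        .take(n)
        .map(_._1)
      user -> ranked
    }
}
```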
Automation
Hadoop Proxy
Batch Database Load
Machines
Cassandra
Jenkins
MySql
Mongo
Questions?
www.ebaynyc.com


Editor's Notes

  1. Introduce myself and eBay NYC.
  2. Laugh.
  3. We are starting to use Scala for live site recs.
  4. Mention Option and Either. First-class functions. Mention how great traits are. I feel like Haskell will never break into the corporation; this is a great draft. All my life I’ve wanted a type-safe build system, and NOW I have it.
  5. They break backward compatibility. Weak IDE support: debugging, refactoring, etc. Explain the madness.
  6. Tell them about the example.
  7. The most complicated system for counting words (insert meme here). Explain why we use Hadoop. Data is huge. I can’t say when you want to make the jump to map reduce, but I see growth in making it THE platform.
  8. Say why raw map reduce stinks. Mention what Hive is and what Scoobi is.
  9. Explain why we didn’t go with Scoobi even though it’s all Scala.
  10. Scheduling and DAG creation. Where is my SOURCE?
  11. Mention Azkaban.
  12. Can do parallel executions of tasks that don’t depend on each other. Supports static dependencies via cascades.
  13. Verbose. You still need to write a bunch of code.
  14. Mention Scoobi and how it’s not super stable. Remind them about how it combines the best of PIG and Cascading.
  15. This is actual code to compute a user’s preferences. Explain a bit about user preferences.
  16. Mahout has some functions for this, but they are hard to set up and get going. Less precise than other state-of-the-art methods but still accurate. Scala Days talk with Chris Severs.
  17. Linear model. Talk about concept extraction. Use SQLite for ad hoc queries.
  18. Talk about the use of cascades. Talk about traps and counters.
  19. Scalding makes this 100 times easier because of cascades and flows.