SlideShare a Scribd company logo
1 of 29
Download to read offline
Scala in Hulu’s Data
Platform
Prasan Samtani
prasan.samtani@hulu.com
Wednesday, November 20, 13
Disclaimer
Wednesday, November 20, 13
• Streaming video service
• Next-day same-season TV
• ~4.5 million subscribers
• ~25 million unique visitors
per month
• > 1 billion ads/month
Wednesday, November 20, 13
Beacons
Mobile
device
Living room
device
Browser
Beacon
Beacon
Beacon
Log
collection
server
Wednesday, November 20, 13
What’s in a beacon
80	
  2013-­‐04-­‐01	
   00:00:00	
  
/v3/playback/start?
bitrate=650
&cdn=Akamai
&channel=Anime
&clichéent=Explorer
&computerguid=EA8FA1000232B8F6986C3E0BE55E9333
&contenLd=5003673
…
Wednesday, November 20, 13
Data pipeline
Beacon'
Log'collec+on'Beacon'
Beacon'
MapReduce'
jobs'
Flat'file'of'
beacons'
categorized'by'
type'on'HDFS'
'
Base'facts'in'
Hive'
'
Aggrega+ng'
process'
Published,'
queryEable'
databases'
Wednesday, November 20, 13
Base-facts
computerguid 00-755925-925755-925
userid    5238518
video_id    289696
content_partner_id     398
distribution_partner_id    602
distro_platform_id    14
is_on_hulu    0
onhulu_offhulu_plus_id    2
package_id    1000
…
hourid 383149
watched 76426
Wednesday, November 20, 13
Hadoop & MapReduce
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Mapper
Video id=35
Minutes = 5
Video id=32
Minutes = 8
Video id=35
Minutes = 20
Video id=25
Minutes = 3
Wednesday, November 20, 13
Hadoop & MapReduce
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Mapper
Video id=35
Minutes = 5
Video id=32
Minutes = 8
Video id=35
Minutes = 20
Video id=25
Minutes = 3
Wednesday, November 20, 13
Hadoop & MapReduce
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Mapper
Video id=35
Minutes = 5
Video id=32
Minutes = 8
Video id=35
Minutes = 20
Video id=25
Minutes = 3
Wednesday, November 20, 13
Hadoop & MapReduce
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Mapper
Video id=35
Minutes = 5
Video id=32
Minutes = 8
Video id=35
Minutes = 20
Video id=25
Minutes = 3
Wednesday, November 20, 13
Hadoop & MapReduce
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Mapper
Video id=35
Minutes = 5
Video id=32
Minutes = 8
Video id=35
Minutes = 20
Video id=25
Minutes = 3
Wednesday, November 20, 13
Hadoop & MapReduce
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Mapper
Video id=35
Minutes = 5
Video id=32
Minutes = 8
Video id=35
Minutes = 20
Video id=25
Minutes = 3
Video id=25
Minutes = 3
Wednesday, November 20, 13
Hadoop & MapReduce
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Mapper
Video id=35
Minutes = 5
Video id=32
Minutes = 8
Video id=35
Minutes = 20
Video id=25
Minutes = 3
Video id=25
Minutes = 3
Video id=32
Minutes = 8
Wednesday, November 20, 13
Hadoop & MapReduce
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Mapper
Video id=35
Minutes = 5
Video id=32
Minutes = 8
Video id=35
Minutes = 20
Video id=25
Minutes = 3
Video id=25
Minutes = 3
Video id=32
Minutes = 8
Video id=35
Minutes = 25
Wednesday, November 20, 13
Aggregations
• Base-facts live in
relational tables in
HIVE
• Aggregations are
now a matter of
performing the right
HiveQL queries
once data is ready
Wednesday, November 20, 13
Dispatching jobs to the
cluster
• The job scheduler has a single function: To
dispatch jobs to the cluster when they are ready,
according to some notion of priority
• Written in Scala
• Lots of database interaction
Wednesday, November 20, 13
Why Scala?
• Good concurrency support
• More importantly, easy to construct new ideas and
think in terms of higher level abstractions
• “Programs are for people to read, and only
incidentally for computers to execute” - Harold
Abelson
Wednesday, November 20, 13
object Job extends Table[(Int, String, Option[Int], Option[Int], Int, Int)]("job") {
def part1 = allowPickup ~ maxDate ~ regionId ~ ignoreTrailingData ~ jobId ~ jobType ~ fileBatch ~ ignoreLeadingData ~ minDateRang
def part2 = harpyOutputBasePath ~ enabled ~ dependsOn <>(JobData2, JobData2.unapply _)
def all = (part1, part2)
def * = jobId ~ name ~ dependsOn ~ jobType ~ runPriority ~ isKPI
def allowPickup = column[Int]("allowpickup")
def maxDate = column[Option[Date]]("maxdate")
def regionId = column[Int]("regionid")
def ignoreTrailingData = column[Int]("ignoretrailingdata")
def jobId = column[Int]("jobid")
def jobType = column[Option[Int]]("jobtype")
def fileBatch = column[Int]("filebatch")
def ignoreLeadingData = column[Int]("ignoreleadingdata")
def minDateRangeStart = column[Long]("mindaterangestart")
def inputFileType = column[String]("inputfiletype")
def customJar = column[Option[String]]("custom_jar")
def priority = column[Option[Int]]("priority")
def runPriority = column[Int]("runpriority")
def isKPI = column[Int]("isKPI")
def hoursBeforeDay = column[Option[Int]]("hoursbeforeday")
def errorThreshold = column[Option[Int]]("error_threshold")
def hoursAfterDay = column[Option[Int]]("hoursafterday")
def minDate = column[Option[Date]]("mindate")
def hardStopAction = column[Int]("hardstop_action")
def name = column[String]("name")
def harpyOutputBasePath = column[Option[String]]("harpy_output_basepath")
def enabled = column[Int]("enabled")
def dependsOn = column[Option[Int]]("dependson")
}
case class JobData1(allowPickup: Int, maxDate: Option[Date], regionId: Int, ignoreTrailingData: Int, jobId: Int, jobType: Option[In
case class JobData2(harpyOutputBasePath: Option[String], enabled: Int, dependsOn: Option[Int])
case class JobData(data1: JobData1, data2: JobData2) {
def allowPickup = data1.allowPickup
def maxDate = data1.maxDate
def regionId = data1.regionId
def ignoreTrailingData = data1.ignoreTrailingData
def jobId = data1.jobId
def jobType = data1.jobType
def fileBatch = data1.fileBatch
def ignoreLeadingData = data1.ignoreLeadingData
def minDateRangeStart = data1.minDateRangeStart
def inputFileType = data1.inputFileType
Wednesday, November 20, 13
Introducing slickint
package org.zenbowman.slickdemo
import scala.slick.driver.MySQLDriver.simple._
import java.sql.Date
table=Job, primaryKey=jobId, dbname=job, *=jobId ~ name ~ dependsOn ~ jobType ~ runPriority ~ isKPI
jobId : Int : jobid
name : String : name
enabled : Int : enabled
minDateRangeStart : Long : mindaterangestart
fileBatch : Int : filebatch
dependsOn : Int? : dependson
allowPickup : Int : allowpickup
jobType : Int? : jobtype
minDate : Date? : mindate
maxDate : Date? : maxdate
hoursBeforeDay : Int? : hoursbeforeday
hoursAfterDay : Int? : hoursafterday
hardStopAction : Int : hardstop_action
inputFileType : String : inputfiletype
ignoreTrailingData : Int : ignoretrailingdata
ignoreLeadingData : Int : ignoreleadingdata
priority : Int? : priority
errorThreshold : Int? : error_threshold
runPriority : Int : runpriority
regionId : Int : regionid
harpyOutputBasePath : String? : harpy_output_basepath
customJar : String? : custom_jar
isKPI : Int : isKPI
Wednesday, November 20, 13
Generating the Slick code
• To generate the Scala code:
• python slickint.py [input-filename].slickint > [output-filename].scala
• Outstanding issues (wouldn’t mind some help)
• Insertion into tables with > 22 columns
• Automatic schema detection from the database (at
least for MySQL/Postgres)
• github.com/zenbowman/slickint
Wednesday, November 20, 13
Handling external
dependencies
• Jobs can be dependent on whether an external
process has run
• We want a generic method for determining whether
a job has run based on the status of one or more
databases
Wednesday, November 20, 13
Gate-keeper: simple case
result = "SELECT 1 FROM hbaseingest.ingeststatus
WHERE lastfullupdate > ‘$currentHour'
AND hbasetable='dma'"
with “jdbc:mysql://****:3306/hbaseingest?
zeroDateTimeBehavior=convertToNull”
user “****” password "****";
Wednesday, November 20, 13
Gate-keeper: complex case
doomsdayrun = "SELECT min(ended_at_utc) FROM doomsday.run WHERE
started_at_utc >= '$nextHour'" with
with “jdbc:mysql://***:3306/doomsday?
useGmtMillisForDatetimes=true&useJDBCCompliantTimezoneShift=tru
e&useTimezone=false”
user “****” password "*****";
result = "SELECT CASE WHEN COUNT(*) = 0 THEN 1 ELSE 0 END FROM
hbaseingest.ingeststatus WHERE lastdiffupdate <
'$doomsdayrun';"
with “jdbc:mysql://****:3306/hbaseingest?
zeroDateTimeBehavior=convertToNull&useLegacyDatetimeCode=false&
useGmtMillisForDatetimes=true&useJDBCCompliantTimezoneShift=tru
e&useTimezone=false”
user “****” password "****";
Wednesday, November 20, 13
What kind of your system does
your language encourage?
Wednesday, November 20, 13
Parser combinators
object GateKeeperConcepts {
case class GateKeeperVariable(name: String)
trait GateKeeperEvaluator {
def getSql: String
def getResult(rs: ResultSet): Option[String]
}
case class SimpleSqlEvaluator(sqlString: String) extends GateKeeperEvaluator {
def getSql: String = sqlString
def getResult(rs: ResultSet) = { ... }
}
case class ConnectionParameters(connectionString: String, userName: String, password:
String)
case class GateKeepingExpression(leftHandSide: GateKeeperVariable, rightHandSide:
GateKeeperEvaluator, connection: ConnectionParameters)
case class GateKeepingSpec(expressions: Seq[GateKeepingExpression])
}
Wednesday, November 20, 13
The parser
object GateKeeperDSL extends StandardTokenParsers {
import GateKeeperConcepts._
lexical.reserved +=(
"with", "user", "password"
)
lazy val gateKeepingSpec: Parser[GateKeepingSpec] =
rep(gateKeepingExpression) ^^ GateKeepingSpec
lazy val gateKeepingExpression: Parser[GateKeepingExpression] =
ident ~ "=" ~ gateKeeperEvaluator ~ "with" ~ stringLit ~ "user" ~ stringLit ~ "password" ~
stringLit ~ ";" ^^ {
case name ~ _ ~ rhs ~ _ ~ conn ~ _ ~ username ~ _ ~ pwd ~ _ =>
GateKeepingExpression(GateKeeperVariable(name), rhs, ConnectionParameters(conn, username,
pwd))
}
lazy val gateKeeperEvaluator: Parser[GateKeeperEvaluator] =
simpleSqlEvaluator
lazy val simpleSqlEvaluator: Parser[SimpleSqlEvaluator] =
stringLit ^^ {
case sqlString => SimpleSqlEvaluator(sqlString)
}
}
Wednesday, November 20, 13
- Rob Pike and Brian Kernighan, The Practice of Programming
If you find yourself writing too much code to do
a mundane job, or if you have trouble
expressing the process comfortably, maybe you
are using the wrong language
Wednesday, November 20, 13
- Alan Perlis
Syntactic sugar causes cancer of the semicolon
Wednesday, November 20, 13

More Related Content

What's hot

Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Databricks
 
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...guest5b1607
 
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016DataStax
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Spark Summit
 
Norikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubyNorikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubySATOSHI TAGOMORI
 
nuclio Overview October 2017
nuclio Overview October 2017nuclio Overview October 2017
nuclio Overview October 2017iguazio
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
iguazio - nuclio Meetup Nov 30th
iguazio - nuclio Meetup Nov 30thiguazio - nuclio Meetup Nov 30th
iguazio - nuclio Meetup Nov 30thiguazio
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Brian O'Neill
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...Altinity Ltd
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskVíctor Zabalza
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Spark Summit
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza
 
Realtime Reporting using Spark Streaming
Realtime Reporting using Spark StreamingRealtime Reporting using Spark Streaming
Realtime Reporting using Spark StreamingSantosh Sahoo
 
The Hidden Life of Spark Jobs
The Hidden Life of Spark JobsThe Hidden Life of Spark Jobs
The Hidden Life of Spark JobsDataWorks Summit
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra PerfectSATOSHI TAGOMORI
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 

What's hot (20)

Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
 
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
 
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
Norikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubyNorikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In Ruby
 
nuclio Overview October 2017
nuclio Overview October 2017nuclio Overview October 2017
nuclio Overview October 2017
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
iguazio - nuclio Meetup Nov 30th
iguazio - nuclio Meetup Nov 30thiguazio - nuclio Meetup Nov 30th
iguazio - nuclio Meetup Nov 30th
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
 
Realtime Reporting using Spark Streaming
Realtime Reporting using Spark StreamingRealtime Reporting using Spark Streaming
Realtime Reporting using Spark Streaming
 
The Hidden Life of Spark Jobs
The Hidden Life of Spark JobsThe Hidden Life of Spark Jobs
The Hidden Life of Spark Jobs
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra Perfect
 
Scio
ScioScio
Scio
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 

Viewers also liked

Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)Jon Haddad
 
Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)Prasan Samtani
 
Big Data Introducción
Big Data IntroducciónBig Data Introducción
Big Data Introducciónbd4s
 
Introducción al Business Intelligence y al Big Data
Introducción al Business Intelligence y al Big DataIntroducción al Business Intelligence y al Big Data
Introducción al Business Intelligence y al Big DataDavid Hurtado
 
ConférenSquad #4 - Hulu et DASH par Baptiste Coudurier
ConférenSquad #4 - Hulu et DASH par Baptiste CoudurierConférenSquad #4 - Hulu et DASH par Baptiste Coudurier
ConférenSquad #4 - Hulu et DASH par Baptiste CoudurierJustindwah
 
Big data: a data sicentist view
Big data: a data sicentist viewBig data: a data sicentist view
Big data: a data sicentist viewfernandocalle
 
Big Data Architecture con Pentaho
Big Data Architecture con PentahoBig Data Architecture con Pentaho
Big Data Architecture con PentahoDatalytics
 
Scalable Media Workflows in the Cloud
Scalable Media Workflows in the CloudScalable Media Workflows in the Cloud
Scalable Media Workflows in the CloudAmazon Web Services
 
Big Data y el sector salud
Big Data y el sector saludBig Data y el sector salud
Big Data y el sector saludBEEVA_es
 
Big Data en el entorno Bancario
Big Data en el entorno BancarioBig Data en el entorno Bancario
Big Data en el entorno BancarioMartín Cabrera
 
Data Mining. Extracción de Conocimiento en Grandes Bases de Datos
Data Mining. Extracción de Conocimiento en Grandes Bases de DatosData Mining. Extracción de Conocimiento en Grandes Bases de Datos
Data Mining. Extracción de Conocimiento en Grandes Bases de DatosRoberto Espinosa
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluDataWorks Summit
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Comparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsComparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsDavid Portnoy
 

Viewers also liked (18)

Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)
 
Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)
 
Big Data Introducción
Big Data IntroducciónBig Data Introducción
Big Data Introducción
 
Introducción al Business Intelligence y al Big Data
Introducción al Business Intelligence y al Big DataIntroducción al Business Intelligence y al Big Data
Introducción al Business Intelligence y al Big Data
 
ConférenSquad #4 - Hulu et DASH par Baptiste Coudurier
ConférenSquad #4 - Hulu et DASH par Baptiste CoudurierConférenSquad #4 - Hulu et DASH par Baptiste Coudurier
ConférenSquad #4 - Hulu et DASH par Baptiste Coudurier
 
Big data: a data sicentist view
Big data: a data sicentist viewBig data: a data sicentist view
Big data: a data sicentist view
 
Big Data Architecture con Pentaho
Big Data Architecture con PentahoBig Data Architecture con Pentaho
Big Data Architecture con Pentaho
 
Scalable Media Workflows in the Cloud
Scalable Media Workflows in the CloudScalable Media Workflows in the Cloud
Scalable Media Workflows in the Cloud
 
Big Data y el sector salud
Big Data y el sector saludBig Data y el sector salud
Big Data y el sector salud
 
Big Data en el entorno Bancario
Big Data en el entorno BancarioBig Data en el entorno Bancario
Big Data en el entorno Bancario
 
Intro to column stores
Intro to column storesIntro to column stores
Intro to column stores
 
Data Mining. Extracción de Conocimiento en Grandes Bases de Datos
Data Mining. Extracción de Conocimiento en Grandes Bases de DatosData Mining. Extracción de Conocimiento en Grandes Bases de Datos
Data Mining. Extracción de Conocimiento en Grandes Bases de Datos
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at Hulu
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Big Data y Salud
Big Data y SaludBig Data y Salud
Big Data y Salud
 
Bases de Datos No Relacionales (NoSQL): Cassandra, CouchDB, MongoDB y Neo4j
Bases de Datos No Relacionales (NoSQL): Cassandra, CouchDB, MongoDB y Neo4jBases de Datos No Relacionales (NoSQL): Cassandra, CouchDB, MongoDB y Neo4j
Bases de Datos No Relacionales (NoSQL): Cassandra, CouchDB, MongoDB y Neo4j
 
Comparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsComparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse Platforms
 

Similar to Scala in hulu's data platform

CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourPeter Friese
 
Use Your MySQL Knowledge to Become a MongoDB Guru
Use Your MySQL Knowledge to Become a MongoDB GuruUse Your MySQL Knowledge to Become a MongoDB Guru
Use Your MySQL Knowledge to Become a MongoDB GuruTim Callaghan
 
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiMonitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiInfluxData
 
Zabbix LLD from a C Module by Jan-Piet Mens
Zabbix LLD from a C Module by Jan-Piet MensZabbix LLD from a C Module by Jan-Piet Mens
Zabbix LLD from a C Module by Jan-Piet MensNETWAYS
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeWim Godden
 
Everything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the WebEverything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the WebJames Rakich
 
Let your DBAs get some REST(api)
Let your DBAs get some REST(api)Let your DBAs get some REST(api)
Let your DBAs get some REST(api)Ludovico Caldara
 
Declarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with TerraformDeclarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with TerraformRadek Simko
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopSages
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformMartin Zapletal
 
MongoDB .local Toronto 2019: Using Change Streams to Keep Up with Your Data
MongoDB .local Toronto 2019: Using Change Streams to Keep Up with Your DataMongoDB .local Toronto 2019: Using Change Streams to Keep Up with Your Data
MongoDB .local Toronto 2019: Using Change Streams to Keep Up with Your DataMongoDB
 
Kotlin Developer Starter in Android projects
Kotlin Developer Starter in Android projectsKotlin Developer Starter in Android projects
Kotlin Developer Starter in Android projectsBartosz Kosarzycki
 
Kotlin Developer Starter in Android - STX Next Lightning Talks - Feb 12, 2016
Kotlin Developer Starter in Android - STX Next Lightning Talks - Feb 12, 2016Kotlin Developer Starter in Android - STX Next Lightning Talks - Feb 12, 2016
Kotlin Developer Starter in Android - STX Next Lightning Talks - Feb 12, 2016STX Next
 
Terrastore - A document database for developers
Terrastore - A document database for developersTerrastore - A document database for developers
Terrastore - A document database for developersSergio Bossa
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin
 
GPars For Beginners
GPars For BeginnersGPars For Beginners
GPars For BeginnersMatt Passell
 

Similar to Scala in hulu's data platform (20)

CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 Hour
 
Use Your MySQL Knowledge to Become a MongoDB Guru
Use Your MySQL Knowledge to Become a MongoDB GuruUse Your MySQL Knowledge to Become a MongoDB Guru
Use Your MySQL Knowledge to Become a MongoDB Guru
 
Latinoware
LatinowareLatinoware
Latinoware
 
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiMonitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
 
Zabbix LLD from a C Module by Jan-Piet Mens
Zabbix LLD from a C Module by Jan-Piet MensZabbix LLD from a C Module by Jan-Piet Mens
Zabbix LLD from a C Module by Jan-Piet Mens
 
Vocanic Map Reduce Lite
Vocanic Map Reduce LiteVocanic Map Reduce Lite
Vocanic Map Reduce Lite
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the code
 
Everything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the WebEverything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the Web
 
Mongo db for C# Developers
Mongo db for C# DevelopersMongo db for C# Developers
Mongo db for C# Developers
 
Let your DBAs get some REST(api)
Let your DBAs get some REST(api)Let your DBAs get some REST(api)
Let your DBAs get some REST(api)
 
Declarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with TerraformDeclarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with Terraform
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache Hadoop
 
JQuery Flot
JQuery FlotJQuery Flot
JQuery Flot
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive Platform
 
MongoDB .local Toronto 2019: Using Change Streams to Keep Up with Your Data
MongoDB .local Toronto 2019: Using Change Streams to Keep Up with Your DataMongoDB .local Toronto 2019: Using Change Streams to Keep Up with Your Data
MongoDB .local Toronto 2019: Using Change Streams to Keep Up with Your Data
 
Kotlin Developer Starter in Android projects
Kotlin Developer Starter in Android projectsKotlin Developer Starter in Android projects
Kotlin Developer Starter in Android projects
 
Kotlin Developer Starter in Android - STX Next Lightning Talks - Feb 12, 2016
Kotlin Developer Starter in Android - STX Next Lightning Talks - Feb 12, 2016Kotlin Developer Starter in Android - STX Next Lightning Talks - Feb 12, 2016
Kotlin Developer Starter in Android - STX Next Lightning Talks - Feb 12, 2016
 
Terrastore - A document database for developers
Terrastore - A document database for developersTerrastore - A document database for developers
Terrastore - A document database for developers
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with Clojure
 
GPars For Beginners
GPars For BeginnersGPars For Beginners
GPars For Beginners
 

Recently uploaded

SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Scala in hulu's data platform

  • 1. Scala in Hulu’s Data Platform Prasan Samtani prasan.samtani@hulu.com Wednesday, November 20, 13
  • 3. • Streaming video service • Next-day same-season TV • ~4.5 million subscribers • ~25 million unique visitors per month • > 1 billion ads/month Wednesday, November 20, 13
  • 5. What’s in a beacon 80  2013-­‐04-­‐01   00:00:00   /v3/playback/start? bitrate=650 &cdn=Akamai &channel=Anime &clichéent=Explorer &computerguid=EA8FA1000232B8F6986C3E0BE55E9333 &contenLd=5003673 … Wednesday, November 20, 13
  • 7. Base-facts computerguid 00-755925-925755-925 userid    5238518 video_id    289696 content_partner_id     398 distribution_partner_id    602 distro_platform_id    14 is_on_hulu    0 onhulu_offhulu_plus_id    2 package_id    1000 … hourid 383149 watched 76426 Wednesday, November 20, 13
  • 8. Hadoop & MapReduce Mapper Mapper Mapper Reducer Reducer Reducer Mapper Video id=35 Minutes = 5 Video id=32 Minutes = 8 Video id=35 Minutes = 20 Video id=25 Minutes = 3 Wednesday, November 20, 13
  • 9. Hadoop & MapReduce Mapper Mapper Mapper Reducer Reducer Reducer Mapper Video id=35 Minutes = 5 Video id=32 Minutes = 8 Video id=35 Minutes = 20 Video id=25 Minutes = 3 Wednesday, November 20, 13
  • 10. Hadoop & MapReduce Mapper Mapper Mapper Reducer Reducer Reducer Mapper Video id=35 Minutes = 5 Video id=32 Minutes = 8 Video id=35 Minutes = 20 Video id=25 Minutes = 3 Wednesday, November 20, 13
  • 11. Hadoop & MapReduce Mapper Mapper Mapper Reducer Reducer Reducer Mapper Video id=35 Minutes = 5 Video id=32 Minutes = 8 Video id=35 Minutes = 20 Video id=25 Minutes = 3 Wednesday, November 20, 13
  • 12. Hadoop & MapReduce Mapper Mapper Mapper Reducer Reducer Reducer Mapper Video id=35 Minutes = 5 Video id=32 Minutes = 8 Video id=35 Minutes = 20 Video id=25 Minutes = 3 Wednesday, November 20, 13
  • 13. Hadoop & MapReduce Mapper Mapper Mapper Reducer Reducer Reducer Mapper Video id=35 Minutes = 5 Video id=32 Minutes = 8 Video id=35 Minutes = 20 Video id=25 Minutes = 3 Video id=25 Minutes = 3 Wednesday, November 20, 13
  • 14. Hadoop & MapReduce Mapper Mapper Mapper Reducer Reducer Reducer Mapper Video id=35 Minutes = 5 Video id=32 Minutes = 8 Video id=35 Minutes = 20 Video id=25 Minutes = 3 Video id=25 Minutes = 3 Video id=32 Minutes = 8 Wednesday, November 20, 13
  • 15. Hadoop & MapReduce Mapper Mapper Mapper Reducer Reducer Reducer Mapper Video id=35 Minutes = 5 Video id=32 Minutes = 8 Video id=35 Minutes = 20 Video id=25 Minutes = 3 Video id=25 Minutes = 3 Video id=32 Minutes = 8 Video id=35 Minutes = 25 Wednesday, November 20, 13
  • 16. Aggregations • Base-facts live in relational tables in HIVE • Aggregations are now a matter of performing the right HiveQL queries once data is ready Wednesday, November 20, 13
  • 17. Dispatching jobs to the cluster • The job scheduler has a single function: To dispatch jobs to the cluster when they are ready, according to some notion of priority • Written in Scala • Lots of database interaction Wednesday, November 20, 13
  • 18. Why Scala? • Good concurrency support • More importantly, easy to construct new ideas and think in terms of higher level abstractions • “Programs are for people to read, and only incidentally for computers to execute” - Harold Abelson Wednesday, November 20, 13
  • 19. object Job extends Table[(Int, String, Option[Int], Option[Int], Int, Int)]("job") { def part1 = allowPickup ~ maxDate ~ regionId ~ ignoreTrailingData ~ jobId ~ jobType ~ fileBatch ~ ignoreLeadingData ~ minDateRang def part2 = harpyOutputBasePath ~ enabled ~ dependsOn <>(JobData2, JobData2.unapply _) def all = (part1, part2) def * = jobId ~ name ~ dependsOn ~ jobType ~ runPriority ~ isKPI def allowPickup = column[Int]("allowpickup") def maxDate = column[Option[Date]]("maxdate") def regionId = column[Int]("regionid") def ignoreTrailingData = column[Int]("ignoretrailingdata") def jobId = column[Int]("jobid") def jobType = column[Option[Int]]("jobtype") def fileBatch = column[Int]("filebatch") def ignoreLeadingData = column[Int]("ignoreleadingdata") def minDateRangeStart = column[Long]("mindaterangestart") def inputFileType = column[String]("inputfiletype") def customJar = column[Option[String]]("custom_jar") def priority = column[Option[Int]]("priority") def runPriority = column[Int]("runpriority") def isKPI = column[Int]("isKPI") def hoursBeforeDay = column[Option[Int]]("hoursbeforeday") def errorThreshold = column[Option[Int]]("error_threshold") def hoursAfterDay = column[Option[Int]]("hoursafterday") def minDate = column[Option[Date]]("mindate") def hardStopAction = column[Int]("hardstop_action") def name = column[String]("name") def harpyOutputBasePath = column[Option[String]]("harpy_output_basepath") def enabled = column[Int]("enabled") def dependsOn = column[Option[Int]]("dependson") } case class JobData1(allowPickup: Int, maxDate: Option[Date], regionId: Int, ignoreTrailingData: Int, jobId: Int, jobType: Option[In case class JobData2(harpyOutputBasePath: Option[String], enabled: Int, dependsOn: Option[Int]) case class JobData(data1: JobData1, data2: JobData2) { def allowPickup = data1.allowPickup def maxDate = data1.maxDate def regionId = data1.regionId def ignoreTrailingData = data1.ignoreTrailingData def jobId = data1.jobId def jobType = data1.jobType def fileBatch = data1.fileBatch def ignoreLeadingData = data1.ignoreLeadingData def minDateRangeStart = data1.minDateRangeStart def inputFileType = data1.inputFileType Wednesday, November 20, 13
  • 20. Introducing slickint package org.zenbowman.slickdemo import scala.slick.driver.MySQLDriver.simple._ import java.sql.Date table=Job, primaryKey=jobId, dbname=job, *=jobId ~ name ~ dependsOn ~ jobType ~ runPriority ~ isKPI jobId : Int : jobid name : String : name enabled : Int : enabled minDateRangeStart : Long : mindaterangestart fileBatch : Int : filebatch dependsOn : Int? : dependson allowPickup : Int : allowpickup jobType : Int? : jobtype minDate : Date? : mindate maxDate : Date? : maxdate hoursBeforeDay : Int? : hoursbeforeday hoursAfterDay : Int? : hoursafterday hardStopAction : Int : hardstop_action inputFileType : String : inputfiletype ignoreTrailingData : Int : ignoretrailingdata ignoreLeadingData : Int : ignoreleadingdata priority : Int? : priority errorThreshold : Int? : error_threshold runPriority : Int : runpriority regionId : Int : regionid harpyOutputBasePath : String? : harpy_output_basepath customJar : String? : custom_jar isKPI : Int : isKPI Wednesday, November 20, 13
  • 21. Generating the Slick code • To generate the Scala code: • python slickint.py [input-filename].slickint > [output-filename].scala • Outstanding issues (wouldn’t mind some help) • Insertion into tables with > 22 columns • Automatic schema detection from the database (at least for MySQL/Postgres) • github.com/zenbowman/slickint Wednesday, November 20, 13
  • 22. Handling external dependencies • Jobs can be dependent on whether an external process has run • We want a generic method for determining whether a job has run based on the status of one or more databases Wednesday, November 20, 13
  • 23. Gate-keeper: simple case result = "SELECT 1 FROM hbaseingest.ingeststatus WHERE lastfullupdate > ‘$currentHour' AND hbasetable='dma'" with “jdbc:mysql://****:3306/hbaseingest? zeroDateTimeBehavior=convertToNull” user “****” password "****"; Wednesday, November 20, 13
  • 24. Gate-keeper: complex case doomsdayrun = "SELECT min(ended_at_utc) FROM doomsday.run WHERE started_at_utc >= '$nextHour'" with with “jdbc:mysql://***:3306/doomsday? useGmtMillisForDatetimes=true&useJDBCCompliantTimezoneShift=tru e&useTimezone=false” user “****” password "*****"; result = "SELECT CASE WHEN COUNT(*) = 0 THEN 1 ELSE 0 END FROM hbaseingest.ingeststatus WHERE lastdiffupdate < '$doomsdayrun';" with “jdbc:mysql://****:3306/hbaseingest? zeroDateTimeBehavior=convertToNull&useLegacyDatetimeCode=false& useGmtMillisForDatetimes=true&useJDBCCompliantTimezoneShift=tru e&useTimezone=false” user “****” password "****"; Wednesday, November 20, 13
  • 25. What kind of your system does your language encourage? Wednesday, November 20, 13
  • 26. Parser combinators object GateKeeperConcepts { case class GateKeeperVariable(name: String) trait GateKeeperEvaluator { def getSql: String def getResult(rs: ResultSet): Option[String] } case class SimpleSqlEvaluator(sqlString: String) extends GateKeeperEvaluator { def getSql: String = sqlString def getResult(rs: ResultSet) = { ... } } case class ConnectionParameters(connectionString: String, userName: String, password: String) case class GateKeepingExpression(leftHandSide: GateKeeperVariable, rightHandSide: GateKeeperEvaluator, connection: ConnectionParameters) case class GateKeepingSpec(expressions: Seq[GateKeepingExpression]) } Wednesday, November 20, 13
  • 27. The parser object GateKeeperDSL extends StandardTokenParsers { import GateKeeperConcepts._ lexical.reserved +=( "with", "user", "password" ) lazy val gateKeepingSpec: Parser[GateKeepingSpec] = rep(gateKeepingExpression) ^^ GateKeepingSpec lazy val gateKeepingExpression: Parser[GateKeepingExpression] = ident ~ "=" ~ gateKeeperEvaluator ~ "with" ~ stringLit ~ "user" ~ stringLit ~ "password" ~ stringLit ~ ";" ^^ { case name ~ _ ~ rhs ~ _ ~ conn ~ _ ~ username ~ _ ~ pwd ~ _ => GateKeepingExpression(GateKeeperVariable(name), rhs, ConnectionParameters(conn, username, pwd)) } lazy val gateKeeperEvaluator: Parser[GateKeeperEvaluator] = simpleSqlEvaluator lazy val simpleSqlEvaluator: Parser[SimpleSqlEvaluator] = stringLit ^^ { case sqlString => SimpleSqlEvaluator(sqlString) } } Wednesday, November 20, 13
  • 28. - Rob Pike and Brian Kernighan, The Practice of Programming If you find yourself writing too much code to do a mundane job, or if you have trouble expressing the process comfortably, maybe you are using the wrong language Wednesday, November 20, 13
  • 29. - Alan Perlis Syntactic sugar causes cancer of the semicolon Wednesday, November 20, 13