Introducing Satisfaction:
The Next-Generation
Hadoop Scheduler
Jerome Banks
Big Data Engineer
08 April 2015
April SF Hadoop User’s Meetup
Satisfaction
Rise of the Data Product
● Industries of tomorrow produce “Data Products”
o Produces a product created from processing Big Data
o Accuracy and timeliness of data are key to its value
● Increasing pace of change
o Need to quickly prototype multiple different ideas
o Need to control the accidental complexity of chaos
Satisfaction
Enter the Solution !!!
● Satisfaction runs your Big Data Jobs!
o Geared for the Hadoop/Hive/Scala ecosystem
o Monitors job status
o Tracks job progress
● Development model for Data Scientists and Engineers
● Dashboard for Ops and C-Level Execs
Satisfaction
Motivation: Why another Scheduler ?
● Existing solutions hard-to-use and incomplete
o Mish-mash of Oozie, Jenkins, and shell scripts
● Data-driven Agile orgs produce increasing number of workflows
● Value of data makes SLA slippage very expensive
● Complexity of interactions between disparate Data sources
● Lack of understanding of org’s data assets
o How was this data generated ?
o Who is using this data and how ?
● Operational issues are painful for DevOps group
o Monitoring, Tracking
o Restarts and Alerting
Satisfaction
Next Generation Hadoop Scheduler
● Runs Hadoop/Hive Jobs
o Successor to Oozie/Azkaban/Jenkins
o Extensible to new technologies
● DevOps Infrastructure
o Job Monitoring/Notifications
o Job History/Log Capture
● Development Model
o DSL for Defining Workflows
o Packaging/Deployment for Big Data Apps
Satisfaction
Three sets of customers
● Data Engineers/Data Scientists
o Crunch “Big Data”
o Create “Data Products”
● DevOps
o Keep the trains running on time
o Fix things in bad weather
● C-Level Execs
o Want to know the status of the company
o How much data are we processing?
o Will we make SLA ?
o What are our data assets? What are they worth?
Satisfaction
Differentiators
● Backward-Chaining
o Define dependencies, not flow graphs
● Data-focused
o Specify DataOutputs, not Actions
o Don’t re-generate already existing data
● Extensible
o Can implement new Satisfiers for new technologies
▪ Spark, Shell, Scalding, etc.
o Scala DSL, not XML
▪ Simple things are simple, complex things are possible
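The backward-chaining and "don't re-generate" points can be illustrated with a minimal sketch. It is written in Python for brevity and is not Satisfaction's actual Scala DSL; the `Goal` class, the `satisfy` function, and the goal names are hypothetical. The idea shown: start from the goal you want, walk back through its dependencies, and run only the jobs whose output does not exist yet.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    """Hypothetical goal: a name, the data it produces, and what it depends on."""
    name: str
    output: str
    depends_on: list = field(default_factory=list)


def satisfy(goal, existing, run_log):
    """Work backward from the requested goal, skipping outputs that already exist."""
    if goal.output in existing:
        return  # data is already there, so nothing is regenerated
    for dep in goal.depends_on:
        satisfy(dep, existing, run_log)  # make sure the inputs exist first
    run_log.append(goal.name)  # stand-in for actually running the job
    existing.add(goal.output)


raw = Goal("load_raw", "raw_events")
clean = Goal("clean", "clean_events", [raw])
report = Goal("report", "daily_report", [clean])

existing = {"raw_events"}  # pretend the raw data was loaded earlier
run_log = []
satisfy(report, existing, run_log)
print(run_log)  # ['clean', 'report']
```

Only the top-level goal is requested; the order of execution falls out of the declared dependencies rather than from a hand-drawn flow graph.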
Satisfaction
Technology Overview
● Satisfaction is a Scala Application
o DSL for Workflows
o Akka Actor Dependency Engine
o Play! Frontend GUI
o Satisfier Extensions
● Data Products are Scala Projects
o Import needed Satisfiers in SBT
o Implement Flow in DSL
o Deploy to HDFS
Satisfaction
Architecture
Satisfaction
Key Concepts
● Goals And Satisfiers
● Witnesses and Variables
● Tracks and TrackDescriptors
Satisfaction
Goals and Satisfiers
● Developers define Goals to be satisfied
● Goals produce one or more DataOutputs
● Goals depend upon the DataOutputs of other Goals
● A Goal can be satisfied with a specific Satisfier
o HiveSatisfier
o HadoopJobSatisfier
o ShellSatisfier
o SparkSatisfier
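The split between a Goal and its Satisfier can be sketched as a small plug-in interface. This is an illustration in Python, not the project's Scala API: the class names mirror the slide, but the method names, constructor arguments, and the choice to pass variables through the environment are all assumptions.

```python
import os
import subprocess
import sys
from abc import ABC, abstractmethod


class Satisfier(ABC):
    """Hypothetical extension point: one implementation per technology."""

    @abstractmethod
    def satisfy(self, variables: dict) -> bool:
        """Do the work for the given variables; return True on success."""


class ShellSatisfier(Satisfier):
    """Runs a command, exposing the variables as environment variables."""

    def __init__(self, argv):
        self.argv = argv

    def satisfy(self, variables):
        env = {**os.environ, **variables}
        return subprocess.run(self.argv, env=env).returncode == 0


class Goal:
    """A goal delegates the actual work to whichever Satisfier it was given."""

    def __init__(self, name, satisfier):
        self.name = name
        self.satisfier = satisfier

    def satisfy(self, variables):
        return self.satisfier.satisfy(variables)


# The Python interpreter stands in for an arbitrary shell command.
command = [sys.executable, "-c", "import os; print('processing', os.environ['dt'])"]
goal = Goal("daily_load", ShellSatisfier(command))
ok = goal.satisfy({"dt": "20150408"})
print(ok)  # True
```

Supporting a new technology then means adding another `Satisfier` subclass; the goals and their dependencies stay unchanged.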
Satisfaction
Witnesses and Variables
● Goals define a set of Variables
o dt, hour, topic, network
● A Witness reifies a DataOutput to get a DataInstance
● To satisfy a Goal, you need to specify an appropriate Witness
● One can define rules to depend on Goals with a mapped Witness
o Depend upon yesterday’s data
o Depend upon all instances of a group
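One way to picture a Witness is as an assignment of values to a Goal's Variables, which turns a DataOutput into one concrete DataInstance. The sketch below is in Python and rests on assumptions: the path template, the `yyyymmdd` date format, and the reading of "all instances of a group" as "every hour of a day" are illustrative, not taken from the project.

```python
from datetime import datetime, timedelta

# Hypothetical DataOutput: a path template over the variables dt and hour.
TEMPLATE = "/data/events/dt={dt}/hour={hour}"


def reify(template, witness):
    """Fill in the variables to get one concrete data instance."""
    return template.format(**witness)


def yesterday(witness):
    """Mapped witness: same variables, with dt shifted back one day."""
    day = datetime.strptime(witness["dt"], "%Y%m%d") - timedelta(days=1)
    return {**witness, "dt": day.strftime("%Y%m%d")}


def all_hours(witness):
    """Mapped witness: one witness per hour of the same day."""
    return [{**witness, "hour": f"{h:02d}"} for h in range(24)]


w = {"dt": "20150408", "hour": "03"}
print(reify(TEMPLATE, w))             # /data/events/dt=20150408/hour=03
print(reify(TEMPLATE, yesterday(w)))  # /data/events/dt=20150407/hour=03
print(len(all_hours(w)))              # 24
```

A dependency rule is then just a function from the witness of the downstream Goal to the witness (or witnesses) of the upstream Goal.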
Satisfaction
Tracks and TrackDescriptors
● A Track specifies the Goals, and the TopLevel DataOutput
● A TrackDescriptor specifies a specific release of a Track
o Includes Version, User, and Variant
● Track can be “pimped out” with various traits
o Job Scheduling
o Retry-Logic
o Notifications
● Developer creates a repo for a project, and defines the Track
o Uses sbt to build and upload a Track to HDFS
Satisfaction
Development Model
● Data Engineers/Scientists define workflows as Tracks
o Scala Git Repo project for each Track
o Specify DataOutputs and dependencies for the Track
o Define top level Goals in Scala DSL
o Define ETL as Hive scripts (or Scalding, Hadoop, etc.)
▪ Add as resources to project
▪ Can define UDFs as project code
o Deploy to HDFS in track Directory
Satisfaction
DEMO !!!
Satisfaction
Next Steps:
● Currently in production internally
● Source code available
o http://github.com/tagged/satisfaction
● Still a work in progress
o Documentation
o Bug fixes
o UI improvements
● Additional Satisfiers (Spark, MLlib, Scalding)
● Job Progress and SLA Tracking TBD
Thank you!
