Overview of Oozie QE Qualification Process


                    Michelle Chiang
                    07/18/2012
Agenda

•   What is Oozie
•   Qualification stages
•   Challenges
•   Future tasks
•   Q&A
What is Oozie?

• Scalable, secure workflow scheduling
  system for Hadoop.
  – Three levels of jobs
    • Workflow job
       – Supports actions such as MR, Pig, Java, DistCp
    • Coordinator job
       – Scheduling
    • Bundle job
       – Monitors the status of coordinator jobs
Job Submission to Hadoop

• [Diagram] An Oozie client submits jobs to the Oozie server through
  1. the CLI,
  2. the Java client API, or
  3. the WS (REST) API.
• The Oozie server launches a launcher mapper on the Hadoop cluster (via the
  JobTracker), and the launcher starts the actual M/R job.
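
• For illustration, a minimal sketch of the first and third interfaces (the Java
  client API is omitted); the host name, port, and property file are placeholders:

  # 1. CLI: submit and start a workflow job
  oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

  # 3. WS (REST) API: the same server exposes an HTTP interface, e.g. list recent wf jobs
  curl "http://oozie-host:11000/oozie/v1/jobs?jobtype=wf&len=10"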
QE Qualification Process

•   Develop test plan in design cycle
•   Design and implement test cases
•   Execute tests
•   Prepare release notes & certification
•   Support production deployment and
    customers’ FAQs
Develop Test Plan

• Prepare a test plan for new features
  defined in the PRD, or
• Prepare a test plan for selected new
  features checked into the Apache source
• Define test strategy
• Test plan is reviewed by QE and Dev
Test plan example

• Test plan for the “shell” action

  Case ID    Execution       Expected results                          Comment
  Ticket #   Shell action    1. Read env var, compare action data      Pass/Fail, bug#/JIRA#
                             2. Read config env var
                             3. hadoop fs -ls; hadoop fs -cp
  Test_sh*   Bash shell      1, 2, 3
             Perl script     1, 2
             Python script   1, 2
             Java            1, 2
             C++             1, 2
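
• A hedged sketch of what the script run by the shell action might look like for
  cases 1–3; the variable names (PATH1, INPUT_DIR, OUTPUT_DIR) are hypothetical:

  #!/bin/bash
  # Case 1: emit an env var as key=value so <capture-output/> exposes it as action data
  echo "PATH1=${PATH1:-Reset}"
  # Case 2: read a variable from the action's configuration/environment
  # (OOZIE_ACTION_CONF_XML is assumed to point at the action conf file)
  echo "CONF_XML=$OOZIE_ACTION_CONF_XML"
  # Case 3: run Hadoop fs commands from inside the action
  hadoop fs -ls "$INPUT_DIR"
  hadoop fs -cp "$INPUT_DIR/data.txt" "$OUTPUT_DIR/data.txt"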
Design and implement test cases

• [Flow] Design test case → Prepare test data → Build → Verify/Bug → Automate → Demo
Unit tests

• Unit tests
  – 784 unit tests
  – code coverage: 72%
  – Checked in with code by developers
  – Executed by CI build as a Jenkins job
Functional tests

• Functional tests (including regression
  tests) as of 3.2.0:
  – Use real systems (hadoop, oozie), not
    minicluster or minioozie
  – 1129 shell-based tests
  – 146 Java OozieClient API tests (in testNG)
  – Runtime: 36 hours, on 2 servers/clusters
     • Manual setup time: 20min
Shell-based tests

• Assumptions:
  – secure hadoop cluster is up
  – oozie server is configured and up.
• 2 types of tests
  – Individualized feature tests
     • Customized validation
     • Self-contained
  – 1 script drives many tests
     • Good for repetitive testing, e.g., schema tests
Example: run.sh

• Prepare: generate the job property file based on the given conf and template
• Upload: delete existing data, and upload the application/data to HDFS
• Submit: submit the Oozie jobs
• Verify: check that the jobs finish successfully
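
• A minimal sketch of such a driver (not the actual run.sh); host names, ports,
  and paths below are made up:

  #!/bin/bash
  set -e
  APP_DIR=shell-action-test
  HDFS_APP_DIR=/user/$USER/oozie-qe/$APP_DIR

  # Prepare: generate job.properties from a template and the given conf
  sed -e "s|@NAMENODE@|hdfs://nn-host:8020|" \
      -e "s|@JOBTRACKER@|jt-host:50300|" \
      $APP_DIR/job.properties.template > $APP_DIR/job.properties

  # Upload: delete existing data and push the application to HDFS
  hadoop fs -rmr $HDFS_APP_DIR || true
  hadoop fs -put $APP_DIR $HDFS_APP_DIR

  # Submit: start the workflow and capture the job id
  JOBID=$(oozie job -oozie http://oozie-host:11000/oozie \
          -config $APP_DIR/job.properties -run | cut -d' ' -f2)

  # Verify: poll until the job leaves RUNNING, then check for SUCCEEDED
  while oozie job -oozie http://oozie-host:11000/oozie -info $JOBID \
        | grep '^Status' | grep -q RUNNING; do
    sleep 30
  done
  oozie job -oozie http://oozie-host:11000/oozie -info $JOBID \
    | grep '^Status' | grep -q SUCCEEDED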
Test validation (1)

• Add validation into the workflow.xml
  – Apply decision node to check
     • wf:actionData
     • fs:exists
     • Other EL functions
  – Apply Java action to verify
     • capture-output
Test validation (2)

• Add validation into run.sh
  – Apply oozie client commands to check
     • Job status, log, configurations, definition, dryrun
  – Apply shell commands to parse results
     • Download output data, parse and compare
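
• A hedged snippet of the kind of checks run.sh can do with the oozie client and
  shell commands; the URL, paths, and file names are placeholders:

  export OOZIE_URL=http://oozie-host:11000/oozie

  # Check the final job status, definition, and log with oozie client commands
  oozie job -info $JOBID | grep '^Status' | grep -q SUCCEEDED
  oozie job -definition $JOBID | grep -q '<workflow-app'
  if oozie job -log $JOBID | grep -q ERROR; then echo "errors in job log"; exit 1; fi

  # Download the output data, then parse and compare against expected results
  hadoop fs -cat /user/$USER/oozie-qe/output/part-* > actual.txt
  diff actual.txt expected.txt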
Integration

• Integration tests:
  – 15 tests within the Hadoop ecosystem
     • Including Hadoop, Pig, HCatalog, DistCp
  – Runtime: ~5 hours (oozie tests only)
     • Manual setup time: 30min
     • Plus, test package preparation & test run: 3 hr
  – Examples
     • Oozie and MapReduce
     • Oozie and Pig
     • Oozie and Hcatalog
Stress tests (1)

• Performance/stress/longevity tests:
  – 10 tests
  – Runtime:
     • 12 hours for performance/stress tests
     • 7 days for longevity testing.
     • Manual setup & analysis time: ~ 10min per test
Stress tests (2)

• Performance metrics:
  – job submission rate
  – status update
  – no failed jobs
  – number of jobs submitted vs. completed
• Longevity tests:
  – 300 wf jobs/min for 7 days
    (300 × 60 × 24 × 7 ≈ 3M jobs)
Memory tests

• Memory/stress tests:
  – 3 tests
  – Runtime: ~ 10 hours.
    • Manual setup & analysis: 30min per test
  – Examples:
    • Purge a large number of wf/coord/bundle jobs
    • Query a coord job with 100k actions
    • Query a coord job with 8k actions by N threads
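
  – For example, the coordinator queries can be driven with the oozie client as
    sketched below (job ids, URL, and thread count are placeholders):

    export OOZIE_URL=http://oozie-host:11000/oozie

    # Single large query: fetch ~100k actions of one coordinator in one call
    oozie job -info $COORD_JOBID -len 100000 > /dev/null

    # Concurrent queries: N client processes hitting the same 8k-action coordinator
    for i in $(seq 1 $N_THREADS); do
      oozie job -info $COORD_JOBID -len 8000 > /dev/null &
    done
    wait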
Upgrade/installation tests

• Upgrade tests:
  – 14 tests
  – Runtime: 4 hours (manual setup: 2hr)
  – steps:
    • Submit wf/coord/bundle jobs
    • Shut down oozie server
    • Upgrade database schema, oozie version, oozie
      config
    • Restart oozie server
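
  – A rough sketch of that sequence, assuming the stock oozied.sh/ooziedb.sh
    scripts and made-up paths and property files:

    # 1. Submit long-running wf/coord/bundle jobs so old jobs span the upgrade
    oozie job -oozie http://oozie-host:11000/oozie -config coord.properties -run

    # 2. Shut down the old Oozie server
    $OLD_OOZIE_HOME/bin/oozied.sh stop

    # 3. Upgrade the database schema, Oozie version, and Oozie config
    $NEW_OOZIE_HOME/bin/ooziedb.sh upgrade -run
    cp oozie-site.xml $NEW_OOZIE_HOME/conf/

    # 4. Restart the (new) Oozie server and verify the old jobs keep running
    $NEW_OOZIE_HOME/bin/oozied.sh start
    oozie jobs -oozie http://oozie-host:11000/oozie -jobtype coordinator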
Release notes and certification

• Release notes
  – New features
  – Package version and new settings
  – New db schema
• Certification
  – Number of tests executed and pass
    rate
  – Known issues
Production and customer support

• Document FAQs, e.g., usage of new
  features
• Support production deployment issues
• Meet customers’ SLA requirements
Lessons learned (1)

• Add a time-out to the test script
  – In case the test fails to reach the expected status
• Carefully time the verification step to
  catch transient states
  – Job status transitions, e.g., from PREP to
    RUNNING to PAUSED
Lessons learned (2)

• Increase hadoop capacity
  – Modify hadoop queue capacity property
  – Modify user limit
• Increase database active connections
Lessons learned (3)

• Accumulate a large number of jobs for
  testing
  – Increase the materialization window
  – Reduce the materialization lookup interval
  – Tune the coordinator job’s frequency and duration
• Also, check database memory usage
Lessons learned (4)

• Check the Oozie job log, Tomcat server log,
  and Hadoop JobTracker log for debugging
• Dev adds debugging statements
Challenges - production issues

• Reproduce and debug issues in the QE
  environment
• Set up the QE environment as close to
  production as possible
  – Recent story: using CNAME for oozie URL.
Challenges – backward compatibility

• Oozie always guarantees backward
  compatibility
  – Web-service API
  – Job definitions
  – Client API
• Verify old jobs continue to run in the new
  release
Challenges – multiple versions

• Compatibility of multiple versions of other
  components
  – Hadoop API
  – Pig
  – Hcatalog
Work in Progress (1)

• Increase test coverage
  – Java-based, TestNG framework
  – Server-side Oozie white-box testing
  – Improved web-service API testing
Work in Progress (2)

• Hadoop 2.x integration testing,
  including HDFS federation.
• Memory monitoring framework
• Performance benchmark framework
• Of course, new oozie releases
Open sourcing

• Short term: shell-based tests
  – Review file/data structure
  – Add readme, copyright, etc.
  – Work in progress
• Long term: Java-based tests
  – oozie-core, oozie-client, oozie-ws
Y! Oozie QE team


                        QE Architect
 Jane Q. Chen     qianchen@yahoo-inc.com

                        QE Engineer
 Marcy Chen        marchen@yahoo-inc.com

                        QE Engineer
Michelle Chiang    mchiang@yahoo-inc.com
Acknowledgement

• All oozie developers in the community!




     http://incubator.apache.org/oozie/

     oozie-dev@incubator.apache.org
Thank you!

• Q&A
• oozie-users@incubator.apache.org
Backup slides
An Oozie Workflow

• [Diagram] A sample workflow: start → FS job (mkdir) → fork into a MapReduce
  streaming job and a Pig job → join, followed by a decision node (Case1 / Case2),
  a MapReduce job, a Java action, and an FS job (chmod) before end.
Oozie ‘Wordcount’ Workflow Example

• Non-Oozie (single map-reduce job), from a gateway host:

  [yourid@gwgd2211 ~]$ hadoop jar hadoop-examples.jar wordcount \
      -Dmapred.job.queue.name=queue_name inputDir outputDir

• Oozie (workflow.xml): Start → MapReduce wordcount → (OK) End, (ERROR) Kill
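
• For completeness, a hedged sketch of the corresponding job.properties and
  submission command; host names, ports, and the application path are placeholders:

  cat > job.properties <<'EOF'
  nameNode=hdfs://nn-host:8020
  jobTracker=jt-host:50300
  queueName=queue_name
  oozie.wf.application.path=${nameNode}/user/${user.name}/wordcount-wf
  EOF

  oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run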
Example: shell-action workflow.xml

• [Flow] Shell action → does wf:actionData match? → true: end, false: kill

  <shell xmlns="uri:oozie:shell-action:0.1">
      <!-- skip lines -->
      <exec>${SCRIPT}</exec>
      <argument>-classpath</argument>
      <argument>./${SCRIPTFILE}:$CLASSPATH</argument>
      <argument>script</argument>
      <file>${SCRIPTFILE}#${SCRIPTFILE}</file>
      <capture-output/>
  </shell>

  <decision name="decision1">
      <switch>
          <case to="end">${wf:actionData('shell-sh')['PATH1'] == 'Reset'}</case>
          <default to="fail" />
      </switch>
  </decision>
Integration tests

• Check compatibility with other components
• Verify there are no system failures, e.g., NN, JT,
  HCat server
• Run standalone utilities to narrow down
  issues
  – For example, pig, distcp
• Check Oozie’s launcher log on the JobTracker
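
• Illustrative standalone runs (script names and paths are made up):

  # Run the same Pig script directly, bypassing Oozie
  pig -x mapreduce wordcount.pig

  # Run DistCp directly between the same source and target paths
  hadoop distcp /user/$USER/src /user/$USER/dst

  # If both pass, the failure is likely in the Oozie action or launcher,
  # so check the launcher mapper's task log on the JobTracker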
Production environment

• Total number of nodes: 42K+
• Total number of Clusters: 25+
   – 1 oozie server per cluster
• Total number of processed jobs ≈ 750K/month
