Sge

A lab talk on Sun Grid Engine.

Transcript

  • 1. Scaling, Grid Engine and Running UIMA on the Cluster
    Chris Roeder 11/2010
  • 2. The Scaling Problem
    “Does the solution scale?” asks whether larger versions of the problem (often more data) can be handled by a given piece of software.
    “Scaling” is a loose collection of techniques for improving or implementing a solution’s scalability.
    The choice of technique depends on the critical resource (CPU, memory, or I/O) and on how easily the task can be broken into pieces.
    This talk focuses on Scaling as it applies to UIMA NLP processing (notwithstanding OpenDMAPv2).
    It is a work in progress.
  • 3. Scaling NLP
    Processing one file is independent of processing another: text in, annotations out.
    Multi-threaded
    More than one thread of execution in one process
    Pipelines share memory and can step on each other.
    Ex.: the Stanford tools crash because of concurrency issues
    (“was not an issue in 2001”)
    <casProcessors casPoolSize="4" processingUnitThreadCount="2">
    Multi-process
    Separate JVMs, each with a single thread
    Memory is not shared, no crushed toes
    <casProcessors casPoolSize="3" processingUnitThreadCount="1">
    Repeatedly starting the JVM and pipeline has a cost, but it works.
    Many machines
    More memory, more cores
    Independence means they won’t miss being on the same machine
    Independent machines (Cluster) are cheaper than integrated (Enki)
  • 4. Hardware
    Local Cluster (Colfax)
    A rack of machines with software (SGE) to integrate
    Integrated CPUs (Enki)
    Much like a rack, but motherboards are tied together and can share memory
    Gigabit Ethernet delivers on the order of 300 Mb/sec
    The motherboard runs up to 4.8 GB/sec
    Virtual Cluster
    Virtualization software allows for a single machine to appear as many, offers flexibility, security
    Cloud
    A virtual cluster on the net: Amazon EC2
  • 5. Hardware: CCP’s Colfax Cluster
    Runs Linux (Fedora/Red Hat)
    6 machines (amc-colfax, amc-colfaxnd[1-5])
    2 cpus (Intel), 4 cores each, 48 cores total
    Intel motherboard
    16GB memory each, 96 GB total
    5TB shared (over NFS) disk array, RAID5
    Named after the assembler: Colfax International
  • 6. (Sun|Oracle) Grid Engine (SGE)
    Manages a queue of jobs, optimizing resource utilization
    Starts individual processes for a job
    Often used with Message Passing Interface (MPI) for processes that cooperate
    Used here to start “Array Jobs”
    Each job processes a portion of a large array of work to be done.
  • 7. SGE Job
    An SGE job is a script and a command line
    Command line specifies resources for scheduling
    Memory
    others
    Script is run once for each process started
    Is not pure shell, but more/less a shell script (next slide)
    Job is assigned an ID number
  • 8. more/less a shell script?
    Put these lines at top for SGE:
    #$ -N stanford_out
    Standard out goes to a file with this prefix
    #$ -S /bin/bash
    The shell to use (so no “shebang” line: #!/bin/sh)
    #$ -cwd
    Runs from the current directory
    #$ -j y
    Merge stdout and stderr to one file
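    Putting the directives together, a minimal complete job script might look like this sketch (the echo body is just a placeholder):

        #$ -N stanford_out    # job name; stdout goes to stanford_out.o<job-ID>
        #$ -S /bin/bash       # interpret the script with bash (hence no shebang)
        #$ -cwd               # run from the directory qsub was called from
        #$ -j y               # merge stderr into the stdout file

        echo "hello from $(hostname) at $(date)"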
  • 9. Submit a Job: qsub
    qsub -t 1-200000:20000 sge_stanford_out.sh
    -t Index Range
    Do array items from 1 to 200 thousand, stepping by 20k: 10 processes
    Do this with the sge_stanford_out.sh script
    How does the script know what files to process?
    $SGE_TASK_ID (first file number for this task)
    $SGE_TASK_STEPSIZE
    For example, one task sees $SGE_TASK_ID=20001 and $SGE_TASK_STEPSIZE=20000, so it handles the 20,000 files starting at 20001.
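    A sketch of how a job script can turn those two variables into its slice of the work (process_file stands in for the real per-file command and is hypothetical):

        # Compute this task's range from SGE's array-job variables.
        START=$SGE_TASK_ID
        END=$(( SGE_TASK_ID + SGE_TASK_STEPSIZE - 1 ))

        echo "task $SGE_TASK_ID: processing files $START through $END"
        for i in $(seq "$START" "$END"); do
            process_file "doc_${i}.txt"   # hypothetical per-file command
        done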
  • 10. sge_stanford_out.sh
    Will evolve into a generic UIMA job-submission script
    The script modifies a template CPE file, creating a CPE descriptor for each process
    The CPE specifies the starting document number and the number of documents to process
    http://wikis.sun.com/display/gridengine62u2/How+to+Submit+an+Array+Job+From+the+Command+Line
    [roederc@amc-colfax sge_scripts]$ qsub -t 1-50:3 sge_stanford_out.sh
    Your job-array 130.1-50:3 ("stanford_out") has been submitted
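    A minimal sketch of that template step, assuming the template holds placeholder tokens START_INDEX and BATCH_SIZE (the token names and the template path are hypothetical; classpath setup is elided; SimpleRunCPE is UIMA's example CPE runner):

        # Fill a per-task CPE descriptor from a template, then run it.
        TEMPLATE=../desc/cpe/cpe_template.xml          # hypothetical template
        CPE=../desc/cpe/temp_cpe_${SGE_TASK_ID}.xml

        sed -e "s/START_INDEX/${SGE_TASK_ID}/" \
            -e "s/BATCH_SIZE/${SGE_TASK_STEPSIZE}/" \
            "$TEMPLATE" > "$CPE"

        java org.apache.uima.examples.cpe.SimpleRunCPE "$CPE"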
  • 11. qstat
    [roederc@amc-colfax sge_scripts]$ qstat
    job-ID prior name user state submit/start at queue slots ja-task-ID
    -----------------------------------------------------------------------------------------------------------------
    130 0.00000 stanford_o roederc qw 11/02/2010 12:39:01 1 1-49:3
    [roederc@amc-colfax sge_scripts]$ qmon
    [roederc@amc-colfax sge_scripts]$ qstat
    job-ID prior name user state submit/start at queue slots ja-task-ID
    -----------------------------------------------------------------------------------------------------------------
    130 0.55500 stanford_o roederc r 11/02/2010 12:39:10 all.q@amc-colfaxnd4.ucdenver.p 1 4
    130 0.55500 stanford_o roederc r 11/02/2010 12:39:10 all.q@amc-colfaxnd2.ucdenver.p 1 7
    130 0.55500 stanford_o roederc r 11/02/2010 12:39:10 all.q@amc-colfaxnd5.ucdenver.p 1 10
    130 0.55500 stanford_o roederc r 11/02/2010 12:39:10 all.q@amc-colfaxnd3.ucdenver.p 1 13
    130 0.55500 stanford_o roederc r 11/02/2010 12:39:10 all.q@amc-colfaxnd1.ucdenver.p 1 16
    130 0.55500 stanford_o roederc r 11/02/2010 12:39:10 all.q@amc-colfaxnd5.ucdenver.p 1 19
    130 0.55500 stanford_o roederc r 11/02/2010 12:39:10 all.q@amc-colfaxnd2.ucdenver.p 1 22
    130 0.55500 stanford_o roederc r 11/02/2010 12:39:10 all.q@amc-colfaxnd4.ucdenver.p 1 25
    130 0.55500 stanford_o roederc r 11/02/2010 12:39:10 all.q@amc-colfax.ucdenver.pvt 1 28
    130 0.55500 stanford_o roederc r 11/02/2010 12:39:10 all.q@amc-colfaxnd3.ucdenver.p 1 31
  • 12. The qdel command
    Use it to kill a job
    qdel <job num>
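    For example, to remove the array job submitted above (job 130) and all of its remaining tasks:

        qdel 130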
  • 13. Failures?
    Q: What if a job fails?
    (A: it stops)
    This is an open problem
    For now, that process dies, leaving its share of the files unprocessed
    We need to cull the unprocessed files and try again
    The usual cause is running out of memory
    Future: a DB-driven collection reader with a CAS consumer that reports completion
  • 14. Example 1:
    Distribute a simple script on the cluster:
    test_sge.sh
    qsub test_sge.sh
    Runs it once
    qsub -t 1-5:1 test_sge.sh
    Runs it five times
    qsub -t 100-500:100 test_sge.sh
    Also runs it five times
    Gives index starts spaced by 100 ($SGE_TASK_ID = 100, 200, 300, 400, 500)
  • 15. Example 2: Run UIMA on the Cluster
    sge_stanford_out.sh:
    Calls a script with a template CPE and an index range:
    run_cpe_cluster_stanford_out.sh
    Modifies the CPE template, creating a CPE descriptor for each sub-range
    Sets up the environment, calls SimpleRunCPE (Java)
    Note the temp_cpe_<n>.xml files in ../desc/cpe
    Start a number of terminals, run “top” in each to see cpu and memory usage.
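    As an alternative to opening terminals by hand, a one-shot snapshot across the nodes (node names from slide 5; assumes passwordless ssh):

        for node in amc-colfax amc-colfaxnd{1..5}; do
            echo "== $node =="
            ssh "$node" "top -b -n 1 | head -15"   # batch mode, single iteration
        done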
  • 16. Hadoop
    Inspired by Lisp’s map and reduce
    Map: apply a function to each element of a collection
    Reduce: combine the results into one
    Known for optimizing by moving the processing to the data rather than the data to the processing
    Google uses similar code (MapReduce).
    Hadoop is open source and is used by Yahoo and Amazon.
    Its specialized interfaces make it better suited to greenfield development
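    The shape of the computation, shown with the classic word-count example as a shell pipeline (an analogy only, not Hadoop code):

        # "Map": emit one word per line.  "Reduce": count occurrences per word.
        tr -s '[:space:]' '\n' < input.txt | sort | uniq -c | sort -rn | head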
  • 17. What about “The Cloud”
    Amazon’s Elastic Compute Cloud (EC2) is a cluster on the internet that can be rented by the hour
    Very Dynamic
    Set up nodes when you start using them
    Expect them to disappear when you stop
    You must have machine configuration management sussed: you have to re-install everything.
    Use S3 for long-term storage
    Starts at $0.10/hour
  • 18. Colfax Cluster
    6 machines
    5TB disk array
  • 19. Enki
    8TB RAID
    CPU
