Overview of Spark for HPC

I originally gave this presentation as an internal briefing at SDSC, based on my experience working with Spark to solve scientific problems.

Slide notes:
  • groupByKey: group the values for each key in the RDD into a single sequence
  • mapValues: apply a map function to all values of key/value pairs without modifying the keys (or their partitioning)
  • collect: return a list containing all elements of the RDD
  • computeContribs, the helper function used in the PageRank example at the end of the deck:

        def computeContribs(urls, rank):
            """Calculates URL contributions to the rank of other URLs."""
            num_urls = len(urls)
            for url in urls:
                yield (url, rank / num_urls)
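  • A minimal sketch tying these three operations together (assuming a pyspark shell where sc already exists; the data and variable names are made up for illustration):

        # A small key/value RDD of (url, neighbor) link pairs
        links = sc.parallelize([('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'a')])

        # groupByKey: gather all neighbors of each url into a single sequence
        grouped = links.groupByKey()

        # mapValues: transform only the values, leaving keys and partitioning alone
        degree = grouped.mapValues(lambda neighbors: len(list(neighbors)))

        # collect: pull every element of the RDD back to the driver as a Python list
        print degree.collect()    # e.g. [('a', 2), ('c', 1), ('b', 1)]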
  • Overview of Spark for HPC: slide transcript

    1. Introduction to Spark. Glenn K. Lockwood, San Diego Supercomputer Center, July 2014
    2. Outline
       I. Hadoop/MapReduce Recap and Limitations
       II. Complex Workflows and RDDs
       III. The Spark Framework
       IV. Spark on Gordon
       V. Practical Limitations of Spark
    3. Map/Reduce Parallelism
       [Diagram: independent blocks of data, each processed in parallel by its own task (tasks 0 through 5)]
    4. Magic of HDFS
       [Diagram]
    5. Hadoop Workflow
       [Diagram]
    6. Shuffle/Sort
       1. Map: convert raw input into key/value pairs; output goes to local disk (the "spill")
       2. Shuffle/Sort: all reducers retrieve all spilled records from all mappers over the network
       3. Reduce: for each unique key, do something with all the corresponding values; output goes to HDFS
       [Diagram: three map tasks feeding three reduce tasks through the shuffle, spilling to disk in between]
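       To make the three phases concrete, here is a hedged, pure-Python sketch of the same data flow (plain Python, not Hadoop code; the data and names are illustrative):

         from collections import defaultdict

         records = ['the quick brown fox', 'the lazy dog']

         # 1. Map: convert raw input into key/value pairs (one (word, 1) pair per word)
         mapped = [(word, 1) for line in records for word in line.split()]

         # 2. Shuffle/Sort: bring all values for the same key together
         shuffled = defaultdict(list)
         for key, value in mapped:
             shuffled[key].append(value)

         # 3. Reduce: for each unique key, combine all of its values
         reduced = dict((key, sum(values)) for key, values in shuffled.items())
         print reduced    # {'the': 2, 'quick': 1, ...}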
    7. MapReduce: Two Fundamental Limitations
       1. MapReduce prescribes the workflow: you map, then you reduce. You cannot reduce, then map, or anything else (see the first point).
       2. Full* data dump to disk between workflow steps: mappers deliver output on local disk (mapred.local.dir), reducers pull input over the network from other nodes' local disks, and output goes right back to local disks via HDFS.
       * Combiners do local reductions to prevent a full, unreduced dump of data to local disk
    8. Beyond MapReduce
       • What if the workflow could be arbitrary in length? (map-map-reduce, reduce-map-reduce)
       • What if higher-level map/reduce operations could be applied? (sampling or filtering of a large dataset, mean and variance of a dataset, sum/subtract all elements of a dataset, the SQL JOIN operator)
    9. Beyond MapReduce: Complex Workflows
       • What if the workflow could be arbitrary in length (map-map-reduce, reduce-map-reduce)? How can you do this without flushing intermediate results to disk after every operation?
       • What if higher-level map/reduce operations could be applied (sampling or filtering of a large dataset, mean and variance of a dataset, sum/subtract all elements of a dataset, the SQL JOIN operator)? How can you ensure fault tolerance for all of these baked-in operations?
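       Most of the higher-level operations listed above already exist as one-liners in the RDD API; a hedged sketch (assuming an existing SparkContext sc and toy data):

         nums  = sc.parallelize(range(1000))
         pairs = sc.parallelize([(i % 10, i) for i in range(1000)])

         sampled  = nums.sample(False, 0.1)              # sample ~10% without replacement
         evens    = nums.filter(lambda x: x % 2 == 0)    # filtering
         mean     = nums.mean()                          # built-in statistics actions
         variance = nums.variance()
         total    = nums.sum()                           # sum all elements
         joined   = pairs.join(pairs)                    # SQL-style JOIN on the key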
    10. MapReduce Fault Tolerance
       • Mapper failure: (1) re-run the map task and spill to disk, (2) block until finished, (3) reducers proceed as normal
       • Reducer failure: (1) re-fetch spills from all mappers' disks, (2) re-run the reducer task
    11. Performing Complex Workflows
       • How can you do complex workflows without flushing intermediate results to disk after every operation?
         1. Cache intermediate results in memory
         2. Allow users to specify persistence in memory and partitioning of the dataset across nodes
       • How can you ensure fault tolerance?
         1. Coarse-grained atomicity via partitions (transform chunks of data, not record-by-record)
         2. Use transaction logging; forget replication
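       In PySpark those two knobs look roughly like this (a sketch, not taken from the slides; the numbers are arbitrary):

         from pyspark import StorageLevel

         # A toy (key, value) RDD standing in for real intermediate results
         pairs = sc.parallelize([(i % 8, i) for i in range(10000)])

         # Control how elements are grouped into partitions across the workers, then ask
         # Spark to keep the result in memory instead of flushing it to disk between steps
         pairs = pairs.partitionBy(16).persist(StorageLevel.MEMORY_ONLY)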
    12. Resilient Distributed Dataset (RDD)
       • Comprised of distributed, atomic partitions of elements
       • Apply transformations to generate new RDDs
       • RDDs are immutable (read-only)
       • RDDs can only be created from persistent storage (e.g., HDFS, POSIX, S3) or by transforming other RDDs

         # Create an RDD from a file on HDFS
         text = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt')

         # Transform the RDD of lines into an RDD of words (one word per element)
         words = text.flatMap( lambda line: line.split() )

         # Transform the RDD of words into an RDD of key/value pairs
         keyvals = words.map( lambda word: (word, 1) )

       • sc is a SparkContext object that describes our Spark cluster
       • lambda declares a "lambda function" in Python (an anonymous function, as in Perl and other languages)
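       Continuing the slide's pipeline, one more transformation plus an action finishes a word count (a sketch under the same assumptions as the slide above):

         # keyvals is the (word, 1) RDD built above; reduceByKey is still a transformation
         counts = keyvals.reduceByKey(lambda a, b: a + b)

         # take() is an action, so this is where Spark actually computes the pipeline
         print counts.take(5)    # a handful of (word, count) pairs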
    13. Potential RDD Workflow
       [Diagram]
    14. RDD Transformation vs. Action
       • Transformations are lazy: nothing actually happens when this code is evaluated
       • RDDs are computed only when an action is called on them, e.g.:
         • Calculate statistics over the elements of an RDD (count, mean)
         • Save the RDD to a file (saveAsTextFile)
         • Reduce elements of an RDD into a single object or value (reduce)
       • This allows you to define partitioning/caching behavior after defining the RDD but before calculating its contents
    15. RDD Transformation vs. Action
       • Must insert an action here to get the pipeline to execute
       • Actions create files or objects:

         # The saveAsTextFile action dumps the contents of an RDD to disk
         >>> rdd.saveAsTextFile('hdfs://master.ibnet0/user/glock/output.txt')

         # The count action returns the number of elements in an RDD
         >>> num_elements = rdd.count(); num_elements; type(num_elements)
         215136
         <type 'int'>
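       Slide 14's point about ordering can be made concrete: because nothing runs until an action, caching can be requested after the RDD is defined but before anything is computed (hedged sketch reusing the file path from slide 12):

         words = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') \
                   .flatMap(lambda line: line.split())    # transformations only: no work yet

         words.cache()       # request in-memory persistence before any computation happens

         n = words.count()   # first action: the file is read, split, and cached here
         m = words.count()   # second action: served from the in-memory copy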
    16. Resiliency: The 'R' in 'RDD'
       • No replication of in-memory data
       • Restrict transformations to coarse granularity
       • Partition-level operations simplify data lineage
    17. Resiliency: The 'R' in 'RDD'
       • Reconstruct missing data from its lineage
       • Data in RDDs are deterministic since partitions are immutable and atomic
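       The lineage Spark keeps for this reconstruction is visible from the API (a sketch; sc assumed):

         rdd = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') \
                 .flatMap(lambda line: line.split()) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)

         # Each RDD remembers the chain of transformations that produced it; a lost
         # partition is rebuilt by replaying that chain on the surviving parent data.
         print rdd.toDebugString()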
    18. Resiliency: The 'R' in 'RDD'
       • Long lineages or complex interactions (reductions, shuffles) can be checkpointed
       • RDD immutability means checkpointing can happen in a nonblocking (background) fashion
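       Checkpointing a long lineage is a two-call affair in PySpark (sketch; the HDFS path is illustrative, not from the slides):

         # Checkpoints must go to a fault-tolerant store such as HDFS
         sc.setCheckpointDir('hdfs://master.ibnet0:54310/user/glock/checkpoints')

         rdd = sc.parallelize(range(100000)).map(lambda x: x * x)
         rdd.checkpoint()   # mark for checkpointing; it happens at the next action
         rdd.count()        # the action triggers both the computation and the checkpoint
         print rdd.isCheckpointed()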
    19. SPARK: AN IMPLEMENTATION OF RDDS (section divider)
    20. Spark Framework
       • Master/worker model
       • The Spark Master is analogous to the Hadoop JobTracker (MRv1) or Application Master (MRv2)
       • The Spark Worker is analogous to the Hadoop TaskTracker
       • Relies on "3rd party" storage for RDD generation (hdfs://, s3n://, file://, http://)
       • Spark clusters take three forms:
         • Standalone mode: workers communicate directly with the master via a spark://master:7077 URI
         • Mesos: mesos://master:5050 URI
         • YARN: no HA; complicated job launch
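       In standalone mode, pointing a PySpark driver at the cluster is just a matter of the master URI (hedged sketch; the host name and app name are placeholders):

         from pyspark import SparkConf, SparkContext

         # spark://master:7077 is the standalone-mode URI from the slide;
         # mesos://master:5050 or a YARN setup would replace it for the other cluster forms
         conf = SparkConf().setMaster('spark://master:7077').setAppName('rdd-demo')
         sc = SparkContext(conf=conf)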
    21. Spark on Gordon: Configuration
       1. Standalone mode is the simplest configuration and execution model (similar to MRv1)
       2. Leverage existing HDFS support in myHadoop for storage
       3. Combine #1 and #2 to extend myHadoop to support Spark:

         $ export HADOOP_CONF_DIR=/home/glock/hadoop.conf
         $ myhadoop-configure.sh
         ...
         myHadoop: Enabling experimental Spark support
         myHadoop: Using SPARK_CONF_DIR=/home/glock/hadoop.conf/spark
         myHadoop: To use Spark, you will want to type the following commands:
             source /home/glock/hadoop.conf/spark/spark-env.sh
             myspark start
    22. Spark on Gordon: Storage
       • Spark can use HDFS:

         $ start-dfs.sh    # after you run myhadoop-configure.sh, of course
         ...
         $ pyspark
         >>> mydata = sc.textFile('hdfs://localhost:54310/user/glock/mydata.txt')
         >>> mydata.count()
         982394

       • Spark can use POSIX file systems too:

         $ pyspark
         >>> mydata = sc.textFile('file:///oasis/scratch/glock/temp_project/mydata.txt')
         >>> mydata.count()
         982394

       • S3 Native (s3n://) and HTTP (http://) also work
       • file:// input will be served in chunks to Spark workers via the Spark driver's built-in httpd
    23. Spark on Gordon: Running
       Spark treats several languages as first-class citizens:

         Feature        Scala   Java   Python
         Interactive    YES     NO     YES
         Shark (SQL)    YES     YES    YES
         Streaming      YES     YES    NO
         MLlib          YES     YES    YES
         GraphX         YES     YES    NO

       R is a second-class citizen; a basic RDD API is available outside of CRAN (http://amplab-extras.github.io/SparkR-pkg/)
    24. myHadoop/Spark on Gordon (1/2)

         #!/bin/bash
         #PBS -l nodes=2:ppn=16:native:flash
         #PBS -l walltime=00:30:00
         #PBS -q normal

         ### Environment setup for Hadoop
         export MODULEPATH=/home/glock/apps/modulefiles:$MODULEPATH
         module load hadoop/2.2.0
         export HADOOP_CONF_DIR=$HOME/mycluster.conf
         myhadoop-configure.sh

         ### Start HDFS. Starting YARN isn't necessary since Spark will be running in
         ### standalone mode on our cluster.
         start-dfs.sh

         ### Load in the necessary Spark environment variables
         source $HADOOP_CONF_DIR/spark/spark-env.sh

         ### Start the Spark masters and workers. Do NOT use the start-all.sh provided
         ### by Spark, as they do not correctly honor $SPARK_CONF_DIR
         myspark start
    25. myHadoop/Spark on Gordon (2/2)

         ### Run our example problem.
         ### Step 1. Load data into HDFS (Hadoop 2.x does not make the user's HDFS home
         ### dir by default, which is different from Hadoop 1.x!)
         hdfs dfs -mkdir -p /user/$USER
         hdfs dfs -put /home/glock/hadoop/run/gutenberg.txt /user/$USER/gutenberg.txt

         ### Step 2. Run our Python Spark job. Note that Spark implicitly requires
         ### Python 2.6 (some features, like MLlib, require 2.7)
         module load python scipy
         /home/glock/hadoop/run/wordcount-spark.py

         ### Step 3. Copy output back out
         hdfs dfs -get /user/$USER/output.dir $PBS_O_WORKDIR/

         ### Shut down Spark and HDFS
         myspark stop
         stop-dfs.sh

         ### Clean up
         myhadoop-cleanup.sh

       Wordcount submit script and Python code online: https://github.com/glennklockwood/sparktutorial
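       The real wordcount-spark.py is in the linked repository; the sketch below is only a guess at what such a standalone PySpark script generally contains (the paths reuse those from the job script above, and the master URL is assumed to come from the environment set up by spark-env.sh):

         #!/usr/bin/env python
         # Hypothetical stand-in for wordcount-spark.py; see the GitHub repo for the real script
         import os
         from pyspark import SparkConf, SparkContext

         sc = SparkContext(conf=SparkConf().setAppName('wordcount'))

         text = sc.textFile('hdfs://localhost:54310/user/%s/gutenberg.txt' % os.environ['USER'])
         counts = text.flatMap(lambda line: line.split()) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
         counts.saveAsTextFile('hdfs://localhost:54310/user/%s/output.dir' % os.environ['USER'])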
    26. PRACTICAL LIMITATIONS (section divider)
    27. Major Problems with Spark
       1. Still smells like a CS project
       2. Debugging is a dark art
       3. Not battle-tested at scale
    28. #1: Spark Smells Like CS
       • Components are constantly breaking
         • Graph.partitionBy broken in 1.0.0 (SPARK-1931)
       • Some components never worked
         • SPARK_CONF_DIR (start-all.sh) doesn't work (SPARK-2058)
         • stop-master.sh doesn't work
         • Spark with YARN will break with large data sets (SPARK-2398)
         • spark-submit for standalone mode doesn't work (SPARK-2260)
    29. #1: Spark Smells Like CS
       • Really obvious usability issues: read an RDD, then write it out = unhandled exception with cryptic Scala errors from Python (SPARK-1690)

         >>> data = sc.textFile('file:///oasis/scratch/glock/temp_project/gutenberg.txt')
         >>> data.saveAsTextFile('hdfs://gcn-8-42.ibnet0:54310/user/glock/output.dir')
         14/04/30 16:23:07 ERROR Executor: Exception in task ID 19
         scala.MatchError: 0 (of class java.lang.Integer)
             at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110)
             at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
             at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
             at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
             ...
             at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
             at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
             at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
             at java.lang.Thread.run(Thread.java:722)
    30. #2: Debugging is a Dark Art

         >>> data.saveAsTextFile('hdfs://s12ib:54310/user/glock/gutenberg.out')
         Traceback (most recent call last):
           File "<stdin>", line 1, in <module>
           File "/N/u/glock/apps/spark-0.9.0/python/pyspark/rdd.py", line 682, in saveAsTextFile
             keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
           File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
           File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
         py4j.protocol.Py4JJavaError: An error occurred while calling o23.saveAsTextFile.
         : org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
             at org.apache.hadoop.ipc.Client.call(Client.java:1070)
             at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
             at $Proxy7.getProtocolVersion(Unknown Source)
             at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
             at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)

       Cause: Spark built against Hadoop 2 DFS trying to access data on Hadoop 1 DFS
    31. #2: Debugging is a Dark Art

         >>> data.count()
         14/04/30 16:15:11 ERROR Executor: Exception in task ID 12
         org.apache.spark.api.python.PythonException: Traceback (most recent call last):
           File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/worker.py",
             serializer.dump_stream(func(split_index, iterator), outfile)
           File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
             self.serializer.dump_stream(self._batched(iterator), stream)
           File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
             for obj in iterator:
           File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
             for item in iterator:
           File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", lin
             if acc is None:
         TypeError: an integer is required
             at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
             at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
             ...

       Cause: the master was using Python 2.6, but the workers were only able to find Python 2.4
    32. #2: Debugging is a Dark Art

         >>> data.saveAsTextFile('hdfs://user/glock/output.dir/')
         14/04/30 17:53:20 WARN scheduler.TaskSetManager: Loss was due to org.apache.spark.api.p
         org.apache.spark.api.python.PythonException: Traceback (most recent call last):
           File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/worker.py",
             serializer.dump_stream(func(split_index, iterator), outfile)
           File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/serializers.
             for obj in iterator:
           File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", lin
             if not isinstance(x, basestring):
         SystemError: unknown opcode
             at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
             at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
             at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
             at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
             at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
             at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
             ...

       Cause: the master was using Python 2.6, but the workers were only able to find Python 2.4
    33. #2: Spark Debugging Tips
       • $SPARK_LOG_DIR/app-* contains master/worker logs with failure information
       • Try to find the salient error amidst the stack traces
       • Google that error; odds are, it is a known issue
       • Stick any required environment variables ($PATH, $PYTHONPATH, $JAVA_HOME) in $SPARK_CONF_DIR/spark-env.sh to rule out these problems
       • If all else fails, look at the Spark source code
    34. #3: Spark Isn't Battle Tested
       • Companies (Cloudera, SAP, etc.) jumping on the Spark bandwagon with disclaimers about scaling
       • Spark does not handle multitenancy well at all. Wait scheduling is considered the best way to achieve memory/disk data locality
       • Largest Spark clusters ~ hundreds of nodes
    35. Spark Take-Aways
       • Facts
         • Data is represented as resilient distributed datasets (RDDs), which remain in-memory and read-only
         • RDDs are comprised of elements
         • Elements are distributed across physical nodes in user-defined groups called partitions
         • RDDs are subject to transformations and actions
         • Fault tolerance achieved by lineage, not replication
       • Opinions
         • Spark is still in its infancy but its progress is promising
         • Good for evaluating; good for Gordon and Comet
    36. PAGERANK EXAMPLE (INCOMPLETE) (section divider)
    37. Lazy Evaluation + In-Memory Caching = Optimized JOIN Operations
       Start every webpage with a rank R = 1.0, then:
       1. For each webpage linking to N neighbor webpages, have it "contribute" R/N to each of its N neighbors
       2. For each webpage, set its rank R to (0.15 + 0.85 * contributions)
       3. Repeat
       [Author's placeholder on the slide: "insert flow diagram here"]
    38. Lazy Evaluation + In-Memory Caching = Optimized JOIN Operations

         lines = sc.textFile('hdfs://master.ibnet0:54310/user/glock/links.txt')

         # Load key/value pairs of (url, link), eliminate duplicates, and partition them such
         # that all common keys are kept together. Then retain this RDD in memory.
         # (tuple() added so that distinct() can hash each (url, link) pair)
         links = lines.map(lambda urls: tuple(urls.split())).distinct().groupByKey().cache()

         # Create a new RDD of key/value pairs of (url, rank) and initialize all ranks to 1.0
         ranks = links.map(lambda (url, neighbors): (url, 1.0))

         # Calculate and update URL rank
         for iteration in range(10):
             # Calculate URL contributions to their neighbors
             contribs = links.join(ranks).flatMap(
                 lambda (url, (urls, rank)): computeContribs(urls, rank))

             # Recalculate URL ranks based on neighbor contributions
             ranks = contribs.reduceByKey(add).mapValues(lambda rank: 0.15 + 0.85*rank)

         # Print all URLs and their ranks
         for (link, rank) in ranks.collect():
             print '%s has rank %s' % (link, rank)
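       The example is marked incomplete because it assumes two names defined elsewhere: computeContribs is the generator from the slide notes at the top of this page, and add is the standard-library operator used as the reduce function:

         from operator import add   # reduceByKey(add) above sums each URL's contributions

         def computeContribs(urls, rank):
             """Calculates URL contributions to the rank of other URLs."""
             num_urls = len(urls)
             for url in urls:
                 yield (url, rank / num_urls)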
