Interpreting the Data: Parallel Analysis with Sawzall


  • Sawzall is an interpreted, procedural, domain-specific programming language, used specifically by Google, to handle huge quantities of data. It assumes certain things about the problem space and hides details about: the machines that are to be used, how the machines are coordinated, and how the machines exchange data. To ease implementation it restricts the data types, restricts the control structures, and runs native code on the executing target.
  • American psychologist and philosopher
  • Awk and Python are not panaceas; for instance, they have no inherent facilities for processing data on multiple machines. The analysis in the first phase is expressed in a new procedural programming language that executes one record at a time, in isolation, to calculate query results for each record. The second phase is restricted to a set of predefined aggregators that process the intermediate results generated by the first phase. Because the querying operations are commutative across records, the order in which the records are processed is unimportant; we can therefore work through the input in arbitrary order.
  • Five racks of 50-55 working computers each, with four disks per machine. Such a configuration might have a hundred terabytes of data to be processed, distributed across some or all of the machines.
  • Evaluate each record individually. Associative: (A+B)+C = A+(B+C).
  • A few lines of code; very high throughput.
  • Methodology: aggregators collate and reduce the intermediate values to create the final results. In a typical run, the majority of machines will run Sawzall and a smaller fraction will run the aggregators.
  • The first three lines declare the aggregators count, total, and sum of squares. The program converts the input record from its external representation into a native floating-point number, which is stored in a local variable x. Aggregators are called tables in Sawzall.
  • DDL (Data Description Language): a language for defining data structures, especially database schemas.
  • An integral tag to identify a field in binary representation.
  • Example
  • Each chunk is replicated, usually 3 times, on different machines so GFS can recover seamlessly from disk or machine failure.
  • The Workqueue is similar to several other systems such as Condor. We often overlay a Workqueue cluster and a GFS cluster on the same set of machines. Since GFS is a storage system, its CPUs are often lightly loaded, and the free computing cycles can be used to run Workqueue jobs.
  • On a thousand-machine cluster, a MapReduce implementation can sort a terabyte of data at an aggregate rate of a gigabyte per second. Our data processing system is built on top of MapReduce.
  • With Sawzall, the resulting code is much simpler and shorter, by a factor of ten or more, than the corresponding C++ code in MapReduce. The syntax of statements and expressions is borrowed largely from C; for loops, while loops, if statements and so on take their familiar form. The basic types are: integer, a 64-bit signed value; float, a double-precision IEEE value; special integer-like types called time and fingerprint; bytes, similar to a C array of unsigned char; and string, which is defined to hold characters from the Unicode character set. There is no "character" type. The time type has microsecond resolution, and the libraries include convenient functions for decomposing and manipulating these values. The fingerprint type represents an internally-computed hash of another value, which makes it easy to construct data structures such as aggregators indexed by the fingerprint of a datum. Arrays are indexed by integer values, while maps are like associative arrays or Python dictionaries and may be indexed by any type, with the indices unordered and the storage for the elements created on demand. Finally, tuples represent arbitrary groupings of data, like a C struct or Pascal record. A typedef-like mechanism allows any type to be given a shorthand name.
  • Although at the statement level Sawzall is a fairly ordinary language, it has two very unusual features, both in some sense outside the language itself. One is emit, which sends data to an external aggregator that gathers the results from each record and correlates and processes the result.
  • Take the input variable, parse it into a data structure by a conversion operation, examine the data, and emit some values. Given a set of logs of the submissions to our source code management system, this program will show how the rate of submission varies through the week, at one-minute resolution.
  • Occasionally, ideas arise for new aggregators to be supported by Sawzall. Adding new aggregatorsis fairly easy, although not trivial.
  • The command is called saw and its basic parameters define the name of the Sawzall program to be run, a pattern that matches the names of the files that hold the data records to be processed, and a set of files to receive the output. A typical job might be instantiated like this:
  • Roughly speaking, which is the most linked-to page?
  • Query distribution. The output of this program is an array of values suitable for creating a map.
  • This means it has no memory of other records it has processed (except through values emitted to the aggregators, outside the language).
  • Sawzall does not provide any form of exception processing. A predicate, def(), can be used to test if a value is defined.
  • Interpreted language, limited by I/O. Benchmarks: the Mandelbrot set, to measure basic arithmetic and loop performance, and a recursive function. Still slower than Java: Sawzall is about 1.6 times slower than interpreted Java, 21 times slower than compiled Java, and 51 times slower than compiled C++.
  • The key performance metric for the system is not single-machine speed but how the speed scales as we add machines when processing a large data set.
  • Performance scales well as we add machines
  • Sawzall has become one of the most widely used programming languages at Google. Most Sawzall programs are of modest size, but the largest are several thousand lines long and generate dozens of multi-dimensional tables in a single run.
  • Interpreting the Data: Parallel Analysis with Sawzall

    1. Tilani Gunawardena
    2. “When the only tool you own is a hammer, every problem begins to resemble a nail.” -Abraham Harold Maslow
    3. Input: Chubby, Bigtable, GFS. Processing: Sawzall, MapReduce. Output: Bigtable, GFS.
    4. Large data sets (large, dynamic, unwieldy) with a flat but regular structure, spanning multiple disks and machines. The bottleneck lies in I/O, not CPUs. Parallelize to improve throughput: task division and distribution, keeping computation near the data, tolerance of many kinds of failures.
    5. Process data records that are present on many machines. Distribute the calculation across all the machines to achieve high throughput. Two phases for calculation: the analysis phase and the aggregation phase.
    6. Five racks of 50-55 working computers each, with four disks per machine. Such a configuration might have a hundred terabytes of data to be processed, distributed across some or all of the machines.
    7. If the query is commutative, records can be processed in any order. If aggregation is commutative and associative, intermediate values can be grouped arbitrarily or even aggregated in stages. The overall flow is one of filtering, aggregating, and collating; each stage typically involves less data than the previous.
    8. Gigabytes to many terabytes of data. Hundreds or even thousands of machines in parallel, some executing the query while others aggregate the results. An analysis may consume months of CPU time, but with a thousand machines that will only take a few hours of real time.
    9. The system’s design is influenced by two observations. If the querying operations are commutative across records, the order in which the records are processed is unimportant. If the aggregation operations are commutative, the order in which the intermediate values are processed is unimportant.
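The two observations above can be illustrated with a minimal Python sketch (this is not Sawzall or Google's implementation, just an illustration): when the aggregation operator is commutative and associative, records may be processed in any order and partial results may be grouped arbitrarily across machines, yet the final answer is identical.

```python
import random
from functools import reduce

records = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical per-record values

# Process in the original order on one "machine".
in_order = reduce(lambda a, b: a + b, records)

# Shuffle the records, split them into arbitrary groups (as separate
# machines would see them), aggregate each group, then merge the partials.
shuffled = records[:]
random.shuffle(shuffled)
partials = [sum(shuffled[:3]), sum(shuffled[3:6]), sum(shuffled[6:])]
staged = sum(partials)

# Commutativity + associativity guarantee both strategies agree.
assert in_order == staged == 31
```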
    10. Input is divided into pieces and processed separately. A Sawzall interpreter is instantiated for each piece of data. The Sawzall program operates on each input record individually. The output of the program for each record is the intermediate value. The intermediate value is combined with values from other records.
    11. Input: a set of files that contain records, where each record contains one floating-point number. Output: the number of records, the sum of the values, and the sum of the squares of the values.

       count: table sum of int;
       total: table sum of float;
       sum_of_squares: table sum of float;
       x: float = input;
       emit count <- 1;
       emit total <- x;
       emit sum_of_squares <- x*x;
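The same computation can be sketched in Python to show the model: each record is examined in isolation, and each emit adds a value into a sum aggregator. This is an illustrative analogue, not Sawzall itself; the sample records are hypothetical.

```python
# Each record holds one floating-point number in external (string) form.
records = ["1.5", "2.0", "3.5"]

count = 0            # count: table sum of int
total = 0.0          # total: table sum of float
sum_of_squares = 0.0  # sum_of_squares: table sum of float

for record in records:
    x = float(record)       # convert the external representation to a float
    count += 1              # emit count <- 1
    total += x              # emit total <- x
    sum_of_squares += x * x  # emit sum_of_squares <- x*x

assert count == 3
assert total == 7.0
assert sum_of_squares == 18.5  # 2.25 + 4.0 + 12.25
```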
    12. Protocol Buffers, Google File System, Workqueue, MapReduce
    13. Protocol Buffers are used to define the messages communicated between servers and to describe the format of permanent records stored on disk. A DDL describes protocol buffers and defines the content of the messages. The protocol compiler takes the DDL and generates code to manipulate the protocol buffers.
    14. The generated code is compiled and linked with the application. Protocol buffer types are roughly analogous to C structs, but the DDL has two additional properties: a distinguishing integral tag, and an indication of whether a field is required or optional.
    15. The following describes a protocol buffer with two required fields. Field x has tag 1, and field y has tag 2.

       parsed message Point {
         required int32 x = 1;
         required int32 y = 2;
       };

       To extend this two-dimensional point, one can add a new, optional field with a new tag. All existing Points stored on disk remain readable; they are compatible with the new definition since the new field is optional.

       parsed message Point {
         required int32 x = 1;
         required int32 y = 2;
         optional string label = 3;
       };
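Why tagging makes this backward compatible can be sketched in Python (a simplified model, not the real protocol buffer wire format): fields are keyed by their integral tag, so a reader that knows about the new optional tag 3 still decodes old records that lack it.

```python
# Messages modeled as tag -> value maps (a stand-in for the binary encoding).
old_point = {1: 10, 2: 20}             # written before "label" existed
new_point = {1: 10, 2: 20, 3: "origin"}

def decode_point(msg):
    # Tags 1 and 2 are required; tag 3 ("label") is optional, so a missing
    # tag simply yields no value instead of a decode failure.
    return {"x": msg[1], "y": msg[2], "label": msg.get(3)}

assert decode_point(old_point) == {"x": 10, "y": 20, "label": None}
assert decode_point(new_point)["label"] == "origin"
```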
    16. The data sets are often stored in GFS. GFS provides a reliable distributed storage system that can grow to petabyte scale by keeping data in 64MB “chunks”. Each chunk is replicated, usually 3 times, on different machines, stored on disks spread across thousands of machines. GFS runs as an application-level file system with a traditional hierarchical naming scheme.
    17. Software that handles the scheduling of a job that runs on a cluster of machines. It creates a large-scale time-sharing system from an array of computers and their disks. It schedules jobs, allocates resources, reports status, and collects the results.
    18. MapReduce is a software library for applications that run on the Workqueue. Primary services: it provides an execution model for programs that operate on many data items in parallel; it isolates the application from the details of running a distributed program; and when possible it schedules the computations so each unit runs on the machine or rack that holds its GFS data, reducing the load on the network.
    19. High-level language, software libraries, scheduling software, application file system
    20. Basic types: integer (int), floating point (float), time, fingerprint (an internally computed hash of another value), bytes, and string. Compound types: arrays, maps, and tuples.
    21. A Sawzall program defines the operations to be performed on a single record of the data; there is nothing in the language to enable examining multiple input records simultaneously. The only output primitive in the language is the emit statement.
    22. Given a set of logs of the submissions to our source code management system, this program will show how the rate of submission varies through the week, at one-minute resolution:

       proto "p4stat.proto"
       submitsthroughweek: table sum[minute: int] of count: int;
       log: P4ChangelistStats = input;
       t: time = log.time; # microseconds
       minute: int = minuteof(t)+60*(hourof(t)+24*(dayofweek(t)-1));
       emit submitsthroughweek[minute] <- 1;

       Sample output:

       submitsthroughweek[0] = 27
       submitsthroughweek[1] = 31
       submitsthroughweek[2] = 52
       submitsthroughweek[3] = 41
       ...
       submitsthroughweek[10079] = 34
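The bucketing arithmetic in that program can be checked with a short Python sketch (the log records here are hypothetical): a timestamp's day-of-week, hour, and minute map to a minute-of-week index in [0, 10080), exactly as in minuteof(t)+60*(hourof(t)+24*(dayofweek(t)-1)).

```python
def minute_of_week(dayofweek, hour, minute):
    # dayofweek is 1-based (1 = first day of the week), as in the program above.
    return minute + 60 * (hour + 24 * (dayofweek - 1))

# Emulate "emit submitsthroughweek[minute] <- 1" over three fake records.
submitsthroughweek = {}
for day, hour, minute in [(1, 0, 0), (1, 0, 1), (1, 0, 0)]:
    idx = minute_of_week(day, hour, minute)
    submitsthroughweek[idx] = submitsthroughweek.get(idx, 0) + 1

assert submitsthroughweek == {0: 2, 1: 1}
assert minute_of_week(7, 23, 59) == 10079  # last minute of the week
```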
    23. Frequency of submits to the source code repository through the week.
    24. List of aggregators: Collection, Sample, Sum, Maximum, Quantile, Top, Unique
    25. The system operates in a batch-execution style: the user submits a job, the job runs on a fixed set of files, and the user collects the output. The input format and output destination are given as arguments to the command that submits the job. The command is called “saw”, and its parameters are as follows:

       saw --program code.szl
           --workqueue testing
           --input_files /gfs/cluster1/2005-02-0[1-7]/submits.*
           --destination /gfs/cluster2/$USER/output@100
    26. The user collects the data using the following; the program merges the output data and prints the final results:

       dump --source /gfs/cluster2/$USER/output@100 --format csv
    27. Implementation: Sawzall works in the map phase; aggregators work in the reduce phase.
    28. Chaining: the output from a Sawzall job is sent as input to another.
    29. Task: process a web document repository to find, for each web domain, which page has the highest page rank.

       proto "document.proto"
       max_pagerank_url: table maximum(1) [domain: string] of url: string weight pagerank: int;
       doc: Document = input;
       emit max_pagerank_url[domain(doc.url)] <- doc.url weight doc.pagerank;
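The maximum-with-weight aggregator above can be sketched in Python: per index (the domain), it keeps the value (the URL) whose weight (the pagerank) is largest. The document records below are hypothetical, and domain() is a simplified stand-in.

```python
docs = [
    {"url": "a.com/x", "pagerank": 5},
    {"url": "a.com/y", "pagerank": 9},
    {"url": "b.com/z", "pagerank": 2},
]

def domain(url):
    # Simplified: take everything before the first slash.
    return url.split("/")[0]

# domain -> (best weight so far, url carrying that weight)
max_pagerank_url = {}
for doc in docs:
    d = domain(doc["url"])
    best = max_pagerank_url.get(d)
    if best is None or doc["pagerank"] > best[0]:
        max_pagerank_url[d] = (doc["pagerank"], doc["url"])

assert max_pagerank_url["a.com"] == (9, "a.com/y")
assert max_pagerank_url["b.com"] == (2, "b.com/z")
```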
    30. Task: look at a set of search query logs and construct a map showing how the queries are distributed around the globe.

       proto "querylog.proto"
       queries_per_degree: table sum[lat: int][lon: int] of int;
       log_record: QueryLogProto = input;
       loc: Location = locationinfo(log_record.ip);
       emit queries_per_degree[int(][int(loc.lon)] <- 1;
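The two-dimensional sum table can be sketched in Python: one counter per (lat, lon) degree cell. The IP-to-location table below is a hypothetical stand-in for locationinfo().

```python
# Hypothetical stand-in for locationinfo(): IP -> (latitude, longitude).
ip_locations = {"1.2.3.4": (52.5, 13.4), "5.6.7.8": (52.9, 13.1)}

# Emulate "emit queries_per_degree[int(][int(loc.lon)] <- 1".
queries_per_degree = {}  # (int lat, int lon) -> query count
for ip in ["1.2.3.4", "5.6.7.8", "1.2.3.4"]:
    lat, lon = ip_locations[ip]
    cell = (int(lat), int(lon))  # truncate to whole degrees
    queries_per_degree[cell] = queries_per_degree.get(cell, 0) + 1

# All three queries fall in the same one-degree cell.
assert queries_per_degree == {(52, 13): 3}
```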
    31. It runs on one record at a time. It is routine for a Sawzall job to be executing on a thousand machines simultaneously, yet the system requires no explicit communication between those machines.
    32. Sawzall is statically typed. The main reason is dependability: Sawzall programs can consume hours, even months, of CPU time in a single run, and a late-arising dynamic type error can be expensive. The language also handles undefined values, e.g. from a division by zero, conversion errors, or I/O problems, using the def() predicate or a run-time flag, and handles logical quantifiers.
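Sawzall's undefined-value model, with no exceptions and a def() predicate, can be approximated in Python (an illustrative analogue, not Sawzall semantics in full): a failed operation yields an undefined marker instead of raising, and a def()-style test guards later use.

```python
UNDEF = object()  # stand-in for Sawzall's undefined value

def safe_div(a, b):
    # Division by zero yields an undefined value rather than an exception.
    return UNDEF if b == 0 else a / b

def is_def(v):
    # Analogue of Sawzall's def() predicate.
    return v is not UNDEF

results = [safe_div(10, 2), safe_div(1, 0), safe_div(9, 3)]
defined = [v for v in results if is_def(v)]  # skip undefined values

assert defined == [5.0, 3.0]
assert not is_def(safe_div(1, 0))
```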
    33. Similar to C and Pascal. A type-safe scripting language. Code is much shorter than C++. Pure value semantics, no reference types. Statically typed. No exception processing.
    34. Comparing the single-CPU speed of the Sawzall interpreter: Sawzall is faster than Python, Ruby, and Perl, but slower than interpreted Java, compiled Java, and compiled C++. The following table shows the microbenchmark:

                              Sawzall   Python   Ruby     Perl
       Mandelbrot runtime     12.09s    45.42s   73.59s   38.68s
       Factor                 1.00      3.75     6.09     3.20
       Fibonacci runtime      11.73s    38.12s   47.18s   75.73s
       Factor                 1.00      3.24     4.02     6.46
    35. The main measurement is not single-CPU speed but aggregate system speed as machines are added to process large datasets. Experiment: a 450GB sample of compressed query log data, counting the occurrences of certain words using a Sawzall program.
    36. Test run on 50-600 2.4GHz Xeon computers. The system scales up: at 600 machines the aggregate throughput was 1.06GB/s of compressed data, or about 3.2GB/s of raw input. Each additional machine adds about 0.98 machines' worth of throughput.
    37. The solid line is elapsed time; the dashed line is the product of machines and elapsed time.
    38. Why put a language above MapReduce? MapReduce is very effective; what’s missing? Why not just attach an existing language such as Python to MapReduce? The goal is to make programs clearer, more compact, and more expressive. The original motivation was parallelism: separating out the aggregators provides a model for distributed processing. With Awk or Python, the user has to write the aggregators; capturing the aggregators in the language (and its environment) means that the programmer never has to provide one, unlike when using MapReduce directly. Sawzall programs tend to be around 10 to 20 times shorter than the equivalent MapReduce programs in C++ and significantly easier to write. There is also the ability to add domain-specific features, custom debugging and profiling interfaces, etc.
    39. How much data processing does it do? In March 2005, on a Workqueue cluster with 1500 Xeon CPUs, 32,580 Sawzall jobs were launched, using an average of 220 machines each, with 18,636 failures. The jobs read a total of 3.2×10^15 bytes of data (2.8PB) and wrote 9.9×10^12 bytes (9.3TB). The average job therefore processed about 100GB.
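The quoted average is easy to verify from the other figures on the slide; a quick arithmetic check:

```python
# 3.2e15 bytes read across 32,580 jobs works out to roughly 100GB per job.
total_read_bytes = 3.2e15
jobs = 32580
avg_per_job = total_read_bytes / jobs  # about 9.8e10 bytes

assert 9.0e10 < avg_per_job < 1.1e11  # i.e. roughly 100GB
```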
    40. Traditional data processing is done by storing the information in a relational database and processing it with SQL queries. With Sawzall, the data sets are usually too large to fit in a relational database. Sawzall is very different from SQL, combining a fairly traditional procedural language with an interface to efficient aggregators. SQL is excellent at database join operations, while Sawzall is not. Brook [3] is a language for data processing, specifically graphical image processing.
    41. Aurora [4] is a stream processing system that supports a (potentially large) set of standing queries on streams of data. Hancock [7] is a stream processing system that concentrates on efficient operation of a single thread instead of massive parallelism.
    42. Some of the larger or more complex analyses would be helped by more aggressive compilation, perhaps to native machine code. Other future directions: an interface to query an external database; more direct support for join operations (some analyses join data from multiple input sources; joining is supported but requires extra chaining steps); and a more radical system model that eliminates the batch-processing mode entirely.
    43. Provide an expressive interface to a novel set of aggregators that capture many common data processing and data reduction problems. The ability to write short, clear programs that are guaranteed to work well on thousands of machines in parallel; the user needs to know nothing about parallel programming. CPU time is rarely the limiting factor; most programs are small and spend most of their time in I/O and native run-time code. Scalability: linear growth in performance as we add machines. Big data sets need lots of machines; it’s gratifying that lots of machines can translate into big throughput.
    44. #!/bin/bash
       limit=100
       sum=0
       i=0
       while [ "$i" -le $limit ]
       do
         sum=`expr $sum + $i`
         i=`expr $i + 1`
       done
       echo $sum
    45. #define NUMCALC 25
       #include <stdio.h>
       #include <mpi.h>

       int main(int argc, char *argv[])
       {
           int myrank, sum, n, partial;
           MPI_Status status;
           sum = 0;
           MPI_Init(&argc, &argv);
           MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
           if (myrank == 0) {
               printf("starting...\n");
               for (n = 1; n <= NUMCALC; n++) {
                   MPI_Recv(&partial, 1, MPI_INT, n, 0, MPI_COMM_WORLD, &status);
                   sum += partial;
               }
               printf("sum is: %d\n", sum);
           } else {
               for (n = 1; n <= 4; n++)
                   sum += ((myrank - 1) * 4 + n);
               MPI_Send(&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
           }
           MPI_Finalize();
           return 0;
       }