Megadata With Python and Hadoop
 


July 2010 Triangle Hadoop Users Group presentation


Presentation Transcript

  • Processing Megadata
    With Python and Hadoop
    July 2010 TriHUG
    Ryan Cox
    www.asciiarmor.com
  • 0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
    0029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF104991999999999999999999
    0029029070999991901010120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00941+99999102001ADDGF108991999999999999999999
    0029029070999991901010206004+64333+023450FM-12+000599999V0201801N008219999999N0000001N9-00611+99999101831ADDGF108991999999999999999999
    0029029070999991901010213004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00561+99999101761ADDGF108991999999999999999999
    0029029070999991901010220004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999101751ADDGF108991999999999999999999
    0029029070999991901010306004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00671+99999101701ADDGF106991999999999999999999
    0029029070999991901010313004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999101741ADDGF108991999999999999999999
    0029029070999991901010320004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00281+99999101741ADDGF108991999999999999999999
    0029029070999991901010406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102311ADDGF108991999999999999999999
    0029029070999991901010413004+64333+023450FM-12+000599999V0202301N008219999999N0000001N9-00441+99999102261ADDGF108991999999999999999999
    0029029070999991901010420004+64333+023450FM-12+000599999V0202001N011819999999N0000001N9-00391+99999102231ADDGF108991999999999999999999
    0029029070999991901010506004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9+00001+99999101821ADDGF104991999999999999999999
    0029029070999991901010513004+64333+023450FM-12+000599999V0202701N002119999999N0000001N9+00061+99999102591ADDGF104991999999999999999999
    0029029070999991901010520004+64333+023450FM-12+000599999V0202301N004119999999N0000001N9+00001+99999102671ADDGF104991999999999999999999
    0029029070999991901010606004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102751ADDGF103991999999999999999999
    0029029070999991901010613004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102981ADDGF100991999999999999999999
    0029029070999991901010620004+64333+023450FM-12+000599999V0203201N002119999999N0000001N9-00111+99999103191ADDGF100991999999999999999999
    0029029070999991901010706004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999103341ADDGF100991999999999999999999
    0029029070999991901010713004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999103321ADDGF100991999999999999999999
    0029029070999991901010720004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00441+99999103321ADDGF100991999999999999999999
    0029029070999991901010806004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00281+99999103221ADDGF108991999999999999999999
    0029029070999991901010813004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999103201ADDGF108991999999999999999999
    0035029070999991901010820004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00331+99999102991ADDGF108991999999999999999999MW1701
    0029029070999991901010906004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999102871ADDGF108991999999999999999999
    0029029070999991901010913004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102661ADDGF108991999999999999999999
    0029029070999991901010920004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999102391ADDGF108991999999999999999999
    0029029070999991901011006004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00441+99999101601ADDGF100991999999999999999999
    0029029070999991901011013004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00441+99999101481ADDGF100991999999999999999999
    0029029070999991901011020004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00441+99999101381ADDGF100991999999999999999999
    0029029070999991901011106004+64333+023450FM-12+000599999V0202501N006219999999N0000001N9-00391+99999101061ADDGF100991999999999999999999
    0029029070999991901011113004+64333+023450FM-12+000599999V0202701N008219999999N0000001N9-00501+99999101141ADDGF100991999999999999999999
    0029029070999991901011120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999101261ADDGF100991999999999999999999
    0029029070999991901011206004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9-00391+99999101311ADDGF104991999999999999999999
    0029029070999991901011213004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00331+99999102071ADDGF103991999999999999999999
    0029029070999991901011220004+64333+023450FM-12+000599999V0202901N009819999999N0000001N9-00221+99999102191ADDGF100991999999999999999999
    0029029070999991901011306004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9+00001+99999101661ADDGF100991999999999999999999
    0029029070999991901011313004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00061+99999102351ADDGF100991999999999999999999
    0029029070999991901011320004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9-00171+99999102321ADDGF100991999999999999999999
    0029029070999991901011406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999102721ADDGF100991999999999999999999
    0029029070999991901011413004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00391+99999102551ADDGF100991999999999999999999
    0029029070999991901011420004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999102261ADDGF100991999999999999999999
    0029029070999991901011506004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00061+99999101831ADDGF108991999999999999999999
    0029029070999991901011513004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9+00171+99999101541ADDGF108991999999999999999999
    0035029070999991901011520004+64333+023450FM-12+000599999V0202301N015919999999N0000001N9+00221+99999101321ADDGF108991999999999999999999MW1721
    ~130 GB NCDC climate Dataset
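The records on this slide are fixed-width text, and the deck's snippets slice three fields out of each line by character offset: the year at `[15:19]`, the air temperature at `[87:92]` (tenths of a degree Celsius, with `+9999` meaning missing) and a one-character quality code at `[92:93]`. A minimal Python 3 sketch of that slicing follows. The names `Reading` and `parse_record` are illustrative and not from the talk, and the sample record is synthetic: only the three sliced fields carry real values.

```python
from typing import NamedTuple, Optional


class Reading(NamedTuple):
    year: str
    temp_tenths_c: int  # air temperature in tenths of a degree Celsius


def parse_record(line: str) -> Optional[Reading]:
    """Return the year and temperature of one record, or None if unusable."""
    line = line.rstrip("\n")
    year, temp, quality = line[15:19], line[87:92], line[92:93]
    # "+9999" marks a missing temperature; only these quality codes are kept.
    if temp == "+9999" or quality not in ("0", "1", "4", "5", "9"):
        return None
    return Reading(year, int(temp))


if __name__ == "__main__":
    # Synthetic record (an assumption for illustration): zero padding
    # everywhere except the year, temperature and quality fields.
    sample = "0" * 15 + "1901" + "0" * 68 + "-0078" + "1"
    print(parse_record(sample))
```

Checking the quality code against a tuple rather than the string `"01459"` is a deliberate choice: an empty slice from a short line is a substring of any string, so `"" in "01459"` is true and would let a truncated record through.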
  • high_temp = 0
    for line in open('1901'):
        line = line.strip()
        # Fixed-width record: year, air temperature, quality code
        (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
        # "+9999" marks a missing reading
        if temp != "+9999" and quality in "01459":
            high_temp = max(high_temp, float(temp))
    print high_temp
    How can we make this scale?
    (and do more interesting things)
  • Jeffrey Dean, Google, 2004
    “Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”
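The quoted passage describes three steps: map each record to intermediate key/value pairs, bring together the values that share a key, then reduce each group. A single-process Python 3 sketch of that flow is below. `map_reduce` and `emit_year_temp` are illustrative names, the input pairs are invented, and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, map_fn, reduce_fn):
    """Run map_fn over records, group values by key, fold each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):  # map: record -> (key, value) pairs
            groups[key].append(value)      # grouping: collect values per key
    # reduce: combine all values that share a key
    return {key: reduce(reduce_fn, values) for key, values in groups.items()}


def emit_year_temp(record):
    """Hypothetical map function: a record is already a (year, temp) pair."""
    year, temp = record
    yield year, temp


if __name__ == "__main__":
    readings = [("1901", -78), ("1901", 22), ("1902", -11)]
    print(map_reduce(readings, emit_year_temp, max))
```

Because each call to `map_fn` looks at one record and each reduction looks at one key, both loops could be split across machines without changing the result, which is the property the quote relies on.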
  • def mapper(line):
        line = line.strip()
        (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
        if temp != "+9999" and quality in "01459":
            return float(temp)
        return None
    output = map(mapper, open('1901'))
    print reduce(max, output)
    MapReduce in pure Python
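The snippet on this slide leans on two Python 2 behaviours: `reduce` is a builtin, and in CPython 2 `None` compares lower than any number, so `max` quietly skips the rejected records. A hedged Python 3 sketch of the same idea imports `reduce` from `functools` and filters the `None` results explicitly. It takes any iterable of lines instead of opening the `1901` file, and `fake_record` is a made-up helper that pads a synthetic record to the right offsets.

```python
from functools import reduce


def mapper(line):
    line = line.strip()
    year, temp, quality = line[15:19], line[87:92], line[92:93]
    if temp != "+9999" and quality and quality in "01459":
        return float(temp)
    return None


def high_temp(lines):
    """Highest usable temperature in lines, or None if there is none."""
    temps = [t for t in map(mapper, lines) if t is not None]
    return reduce(max, temps) if temps else None


def fake_record(year, temp, quality="1"):
    """Hypothetical helper: zero padding around the three fields used above."""
    return "0" * 15 + year + "0" * 68 + temp + quality


if __name__ == "__main__":
    lines = [fake_record("1901", "-0078"), fake_record("1901", "+0022")]
    print(high_temp(lines))
```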
  • import re
    import sys
    for line in sys.stdin:
        val = line.strip()
        (year, temp, q) = (val[15:19], val[87:92], val[92:93])
        if (temp != "+9999" and re.match("[01459]", q)):
            print "%s %s" % (year, temp)
    mapper.py
    import sys
    # Start below any real reading so an all-negative first key is handled
    (last_key, max_val) = (None, -sys.maxint)
    for line in sys.stdin:
        (key, val) = line.strip().split(" ")
        if last_key and last_key != key:
            # Key changed: emit the finished group and start the next one
            print "%s %s" % (last_key, max_val)
            (last_key, max_val) = (key, int(val))
        else:
            (last_key, max_val) = (key, max(max_val, int(val)))
    if last_key:
        print "%s %s" % (last_key, max_val)
    reducer.py
    cat dataFile | mapper.py | sort | reducer.py
    Hadoop Streaming
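The shell pipeline on this slide works because `sort` puts all lines with the same key next to each other, which is the same guarantee the reducer script depends on when it runs under Hadoop Streaming. A Python 3 sketch that mimics the three stages in one process is below, using `itertools.groupby` over the sorted mapper output. The function names and the `record` helper are illustrative, and unlike the slide's reducer this one takes each key's maximum from its own values instead of starting from a constant.

```python
from itertools import groupby


def run_mapper(lines):
    """Emit 'year temp' for every usable record (the job of mapper.py)."""
    for line in lines:
        val = line.strip()
        year, temp, q = val[15:19], val[87:92], val[92:93]
        if temp != "+9999" and q and q in "01459":
            yield "%s %s" % (year, temp)


def run_reducer(sorted_lines):
    """Emit 'year max' per key; input must already be sorted by key."""
    pairs = (line.split(" ") for line in sorted_lines)
    for year, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s %d" % (year, max(int(temp) for _, temp in group))


def pipeline(lines):
    # sorted() stands in for the `sort` step between mapper and reducer
    return list(run_reducer(sorted(run_mapper(lines))))


def record(year, temp, quality="1"):
    """Hypothetical helper: zero padding around the three fields used above."""
    return "0" * 15 + year + "0" * 68 + temp + quality


if __name__ == "__main__":
    data = [record("1902", "-0011"), record("1901", "-0078"),
            record("1901", "+0022")]
    for out in pipeline(data):
        print(out)
```

Testing the two stages this way, before submitting a cluster job, catches parsing mistakes cheaply.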
  • Dumbo
  • def mapper(key, value):
        line = value.strip()
        (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
        if temp != "+9999" and quality in "01459":
            yield year, int(temp)
    def reducer(key, values):
        yield key, max(values)
    if __name__ == "__main__":
        import dumbo
        # The reducer is passed twice so it also serves as the combiner
        dumbo.run(mapper, reducer, reducer)
    Dumbo
    • Ability to pass around Python objects
    • Job / Iteration Abstraction
    • Counter / Status Abstraction
    • Simplified Joining mechanism
    • Ability to use non-Java combiners
    • Built-in library of mappers / reducers
    • Excellent way to model MR algorithms
    Dumbo
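The Dumbo example passes `reducer` as the third argument to `dumbo.run`, which we take here to be the combiner slot mentioned in the feature list. Reusing the reducer that way is safe for `max`, because the maximum of per-mapper maxima equals the maximum of all values. A plain Python 3 sketch of that property follows, with no Dumbo import. `combine` and the two `split_*` lists are made-up illustrations of two mappers' local output.

```python
def reducer(key, values):
    yield key, max(values)


def combine(pairs):
    """Group (key, value) pairs by key and apply reducer to each group."""
    by_key = {}
    for key, value in pairs:
        by_key.setdefault(key, []).append(value)
    return [out for key in sorted(by_key) for out in reducer(key, by_key[key])]


# Invented output of two map tasks
split_a = [("1901", -78), ("1901", 22), ("1902", -11)]
split_b = [("1901", 317), ("1902", 244)]

# Reduce everything at once, versus combine locally and then reduce
without_combiner = combine(split_a + split_b)
with_combiner = combine(combine(split_a) + combine(split_b))

if __name__ == "__main__":
    print(without_combiner == with_combiner)
```

With the combiner, the first split forwards two pairs instead of three. A mean would not survive the same treatment unless each partial result also carried its count.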
  • CLI – API – Web Console
    Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
    Elastic MapReduce
  • Elastic MapReduce
  • Use Hadoop’s JobTracker or Amazon’s ‘Debugger’
    Elastic MapReduce
  • CloudWatch metrics
    Elastic MapReduce
  • CloudWatch metrics
    Elastic MapReduce
  • Quiz: How would you do this?
  • MapReduce algorithms ARE different
    BFS(G, s) // G is the graph and s is the starting node
        for each vertex u ∈ V[G] - {s}
            do color[u] ← WHITE // color of vertex u
               d[u] ← ∞ // distance from source s to vertex u
               π[u] ← NIL // predecessor of u
        color[s] ← GRAY
        d[s] ← 0
        π[s] ← NIL
        Q ← Ø // Q is a FIFO queue
        ENQUEUE(Q, s)
        while Q ≠ Ø // iterates as long as there are gray vertices
            do u ← DEQUEUE(Q)
               for each v ∈ Adj[u]
                   do if color[v] = WHITE // discover the undiscovered adjacent vertices
                         then color[v] ← GRAY // enqueued whenever painted gray
                              d[v] ← d[u] + 1
                              π[v] ← u
                              ENQUEUE(Q, v)
               color[u] ← BLACK // painted black whenever dequeued
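The queue in this pseudocode is shared, mutable state, which a MapReduce job does not have. One common reformulation runs a full map/reduce pass per BFS level: the map step forwards each node's adjacency list and offers `distance + 1` to its neighbours, and the reduce step keeps the smallest distance offered to each node. The Python 3 sketch below simulates those rounds in one process. It is offered as an assumption about the point of the slide, not as the talk's own code, and `bfs_round`, `bfs` and the sample graph are illustrative.

```python
import math
from collections import defaultdict


def bfs_round(nodes):
    """One map/reduce pass. nodes maps id -> (distance, [neighbours])."""
    emitted = defaultdict(list)
    # map: pass the node structure along, offer distance + 1 to neighbours
    for node, (dist, adj) in nodes.items():
        emitted[node].append(("node", dist, adj))
        if dist != math.inf:
            for neighbour in adj:
                emitted[neighbour].append(("dist", dist + 1, None))
    # reduce: per node, keep the adjacency list and the smallest distance
    result = {}
    for node, values in emitted.items():
        adj = next((a for kind, _, a in values if kind == "node"), [])
        result[node] = (min(d for _, d, _ in values), adj)
    return result


def bfs(graph, source):
    """Repeat rounds until no distance changes; returns id -> distance."""
    nodes = {n: (0 if n == source else math.inf, adj)
             for n, adj in graph.items()}
    while True:
        updated = bfs_round(nodes)
        if updated == nodes:  # converged
            return {n: d for n, (d, _) in updated.items()}
        nodes = updated


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": []}
    print(bfs(graph, "a"))
```

Every round rescans the whole graph, so the work per round is larger than in the queue-based version, and the number of rounds grows with the depth of the graph from the source. That trade is what makes the algorithm "different".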
  • > m = function() { emit(this.user_id, 1); }
    > r = function(k,vals) { return 1; }
    > res = db.events.mapReduce(m, r, { query : {type:'sale'} });
    > db[res.result].find().limit(2)
    { "_id" : 8321073716060 , "value" : 1 }
    { "_id" : 7921232311289 , "value" : 1 }
    MongoDB
    > {ok, [R]} = Client:mapred([{<<"groceries">>, <<"mine">>},
                                 {<<"groceries">>, <<"yours">>}],
                                [{'map', {'qfun', Count}, 'none', false},
                                 {'reduce', {'qfun', Merge}, 'none', true}]).
    Riak
    MapReduce Elsewhere
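The MongoDB shell session on this slide emits `(user_id, 1)` for each event of type `sale` and reduces every group to `1`, so the result holds one document per distinct buyer. A Python 3 sketch of the same computation over an in-memory list is below. The event dictionaries are invented sample data, `distinct_buyers` is an illustrative name, and no MongoDB driver is involved.

```python
from collections import defaultdict


def distinct_buyers(events):
    """One {'_id': user_id, 'value': 1} document per user with a sale."""
    emitted = defaultdict(list)
    for event in events:
        if event["type"] == "sale":              # query: {type: 'sale'}
            emitted[event["user_id"]].append(1)  # m: emit(this.user_id, 1)
    # r: every group collapses to 1, whatever its size
    return [{"_id": user_id, "value": 1} for user_id in sorted(emitted)]


if __name__ == "__main__":
    events = [
        {"user_id": 7, "type": "sale"},
        {"user_id": 7, "type": "sale"},
        {"user_id": 3, "type": "view"},
        {"user_id": 5, "type": "sale"},
    ]
    print(distinct_buyers(events))
```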
  • Hadoop: The Definitive Guide
    http://www.hadoopbook.com
    Dumbo
    http://dumbotics.com/
    http://github.com/klbostee/dumbo/
    Elastic MapReduce
    http://aws.amazon.com/
    Boto
    http://github.com/boto
    Getting Started Slides
    http://www.slideshare.net/pacoid/getting-started-on-hadoop
    Learn More
  • DEMO