Big Data com Python
Python Workshop on-line at Mutirao Python on-line via pycursos.com


https://www.youtube.com/watch?feature=player_embedded&v=DFh6l-h6-gw


Big Data com Python: Presentation Transcript

  • Big Data with Python: A gentle and simple introduction
    Marcel Caraciolo (@marcelcaraciolo)
    Developer, scientist, contributor to the Crab recsys project, has worked with Python for 6 years, interested in mobile, education, machine learning and dataaaaa!
    Recife, Brazil - http://aimotion.blogspot.com
  • Disclaimer: some slides were taken from the materials of Bill Howe's course, Introduction to Data Science (https://class.coursera.org/datasci-001/).
  • About me
    - Co-founder of Crab, a Python recsys library
    - Chief Scientist at Atepassar, an e-learning social network
    - Co-founder and instructor at PyCursos, teaching Python on-line
    - Co-founder of Pingmind, on-line infrastructure for MOOCs
    - Interested in Python, mobile, e-learning and machine learning!
  • What is Big Data?
  • Big Data
    "Big Data is any data that is expensive to manage and hard to extract value from."
    Michael Franklin, Thomas M. Siebel Professor of Computer Science, Director of the Algorithms, Machines and People Lab, University of California, Berkeley
  • Big Data
    "The keepers of big data say they do it for the consumer's benefit. But data have a way of being used for purposes other than originally intended."
    Erik Larson, 1989, Harper's magazine
  • Big Data, lots of data.
  • Challenges
    - Volume: the size of the data
    - Velocity: the latency of data processing relative to the growing demand for interactivity
    - Variety: the diversity of sources, formats, quality, structures
  • Where does big data come from?
    - "data exhaust" from customers
    - new and pervasive sensors
    - the ability to "keep everything"
  • Where does big data come from? (photo in public domain)
  • Where does big data come from? (4/28/13, Bill Howe, UW)
  • Scalability
  • What does scalable mean?
    Operationally:
    - In the past: works even if data doesn't fit in main memory
    - Now: can make use of 1000s of cheap computers
    Algorithmically:
    - In the past: if you have N data items, you must do no more than N^m operations (polynomial time algorithms)
    - Now: if you have N data items, you must do no more than N^m/k operations, for some large k ("polynomial time algorithms must be parallelized")
    - Soon: if you have N data items, you should do no more than N * log(N) operations
  • Example: given a set of DNA sequences, find all sequences equal to GATTACGATATTA.
  • Query: GATTACGATATTA. First record: TACCTGCCGTAA.
  • Time = 0: TACCTGCCGTAA = GATTACGATATTA? No.
  • Time = 1: CCCCCAATGAC = GATTACGATATTA? No.
  • Time = 17: GATTACGATATTA contains GATTACGATATTA? Yes! Send it to the output.
  • 40 records, 40 comparisons. N records, N comparisons. The algorithmic complexity is order N: O(N).
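The scan described on these slides can be sketched in a few lines of Python. This is only an illustration: the function name `linear_search` and the tiny record list are assumptions, not part of the original deck.

```python
def linear_search(records, query):
    """Return every record equal to `query`, plus the number of comparisons made."""
    matches = []
    comparisons = 0
    for record in records:
        comparisons += 1  # one comparison per record, whether or not it matches
        if record == query:
            matches.append(record)
    return matches, comparisons


# Hypothetical sample data; the slides use 40 records.
records = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA", "CTGTACACAACCT"]
matches, comparisons = linear_search(records, "GATTACGATATTA")
print(matches, comparisons)
```

With N records the loop always performs N comparisons, which is the O(N) behaviour the slide points out.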
  • What if we sort the sequences?
  • Time = 0: start at the 50% mark. CTGTACACAACCT < GATTACGATATTA. No match. Skip to the 75% mark.
  • GGATACACATTTA > GATTACGATATTA. No match.
  • GATATTTTAAGC < GATTACGATATTA. No match. Skip back to the 68.75% mark.
  • GATTACGATATTA = GATTACGATATTA. Match! Walk through the records until we fail to match.
  • How many comparisons did we do? 40 records, only 4 comparisons. N records, log(N) comparisons. This algorithm is O(log(N)): far better scalability.
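A minimal sketch of the sorted lookup, assuming the records fit in a Python list. It uses the standard library's `bisect_left` to jump to the first candidate and then walks forward until the match fails, as the slides describe; the function name `sorted_search` and the sample records are illustrative.

```python
from bisect import bisect_left


def sorted_search(sorted_records, query):
    """Find all records equal to `query` in an already sorted list."""
    # Binary search: O(log N) comparisons to reach the first candidate position.
    i = bisect_left(sorted_records, query)
    matches = []
    # Walk through the records until we fail to match.
    while i < len(sorted_records) and sorted_records[i] == query:
        matches.append(sorted_records[i])
        i += 1
    return matches


records = sorted([
    "GGATACACATTTA", "CTGTACACAACCT", "GATTACGATATTA",
    "GATATTTTAAGC", "GATTACGATATTA",
])
print(sorted_search(records, "GATTACGATATTA"))
```

Sorting costs O(N log N) once, so this pays off when the same dataset is queried many times.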
  • Relational Databases
    Databases are good at "needle in a haystack" problems:
    - Extracting small results from big datasets
    - Transparently provide old-style scalability
    - Your query will always* finish, regardless of dataset size.
    - Indexes are easily built and automatically used when appropriate
    CREATE INDEX seq_idx ON sequence(seq);
    SELECT seq FROM sequence WHERE seq = 'GATTACGATATTA';
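The same index and query can be tried from Python with the standard library's `sqlite3` module. This is a sketch under assumptions: the slide does not say which database is used, and the in-memory database and three sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database (an assumption; the slide names no specific engine).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (seq TEXT)")
conn.executemany(
    "INSERT INTO sequence (seq) VALUES (?)",
    [("TACCTGCCGTAA",), ("CCCCCAATGAC",), ("GATTACGATATTA",)],
)

# The index from the slide; the query planner may use it without being told to.
conn.execute("CREATE INDEX seq_idx ON sequence(seq)")

query = "SELECT seq FROM sequence WHERE seq = ?"
rows = conn.execute(query, ("GATTACGATATTA",)).fetchall()
print(rows)

# EXPLAIN QUERY PLAN reports how SQLite intends to run the query.
for step in conn.execute("EXPLAIN QUERY PLAN " + query, ("GATTACGATATTA",)):
    print(step)

conn.close()
```

Passing the sequence as a bound parameter (`?`) avoids quoting mistakes in the SQL string.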
  • Example, new task: read trimming. Given a set of DNA sequences, trim the final n bps of each sequence and generate a new dataset.
  • Time = 0: TACCTGCCGTAA becomes TACCT
  • Time = 1: CCCCCAATGAC becomes CCCCC
  • Time = 17: GATTACGATATTA becomes GATTA
  • Can we use an index? No. We have to touch every record no matter what. The task is fundamentally O(N). Can we do any better?
  • Time = 0, 1, 2, 3, ..., 7. How much time did this take? 7 cycles. 40 records, 6 workers: O(N/k).
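A small sketch of the O(N/k) idea. Threads from `concurrent.futures` stand in for the k computers here (an assumption made for brevity; a real deployment would spread the reads across machines), and the names `trim` and `parallel_trim` and the choice n = 4 are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil


def trim(read, n):
    """Drop the final n base pairs of a read."""
    return read[:-n] if n > 0 else read


def parallel_trim(reads, n, k):
    """Apply trim to every read using k workers; results keep the input order."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        return list(pool.map(lambda read: trim(read, n), reads))


reads = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA"]
trimmed = parallel_trim(reads, n=4, k=6)
print(trimmed)

# With 40 records and 6 workers, the busiest worker handles ceil(40 / 6) records.
cycles = ceil(40 / 6)
print(cycles)
```

The arithmetic at the end matches the slide: 40 records over 6 workers takes 7 cycles, because each record is independent of the others.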
  • Schematic of a Parallel "Read Trimming" Task
    - You are given short "reads": genomic sequences about 35-75 characters each
    - Distribute the reads among k computers
    - f is a function to trim a read; apply it to every item
    - Now we have a big distributed set of trimmed reads
  • New Task: Convert 405k TIFF images to PNG
    - You are given TIFF images
    - Distribute the images among k computers
    - f is a function to convert TIFF to PNG; apply it to every item
    - Now we have a big distributed set of converted images
    http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
  • New Task: Run thousands of simulations
    - You have sets of parameters for thousands of small simulations
    - Divide the parameter sets among k computers
    - f runs the simulation and produces some output; apply it to every item
    - Now we have a big distributed set of simulation results
    http://escience.washington.edu/get-help-now/dave-williams-simulating-muscle-dynamics-cloud
  • Find the most common word in each document
    - You have millions of documents
    - Distribute the documents among k computers
    - f finds the most common word in a single document
    - Now we have a big distributed list of (doc_id, word) pairs
  • Consider a slightly more general program to compute the word frequency of every word in a single document
    Abridged Declaration of Independence
    A Declaration By the Representatives of the United States of America, in General Congress Assembled.
    When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change.
    We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying its foundation on such principles and organizing its power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. The history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.
    Output: (people, 2) (government, 6) (assume, 1) (history, 2) …
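A single-document word frequency function might look like the sketch below. The tokenization rule (lowercase, letters and apostrophes only) is an assumption, so counts on the full text can differ slightly from the ones shown on the slide.

```python
import re
from collections import Counter


def word_frequency(document):
    """Return a Counter mapping each word in the document to its frequency."""
    # Assumed tokenization: lowercase the text and keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(words)


# A short excerpt of the document, used here only to keep the example small.
text = "it is their right, it is their duty, to throw off such government"
freq = word_frequency(text)
print(freq.most_common(3))
```

`Counter` already behaves like the set of (word, freq) pairs the slide asks for; `freq.items()` yields exactly those pairs.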
  • Compute the word frequency of 5M documents
    - You have millions of documents
    - Distribute the documents among k computers
    - For each document, f returns a set of (word, freq) pairs
    - Now we have a big distributed list of sets of word freqs.
  • There's a pattern here...
    - A function that maps a read to a trimmed read
    - A function that maps a TIFF image to a PNG image
    - A function that maps a set of parameters to a simulation result
    - A function that maps a document to its most common word
    - A function that maps a document to a histogram of word frequencies
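The pattern in this list is "apply one function f to every item independently". A minimal sketch, with hypothetical helper names (`apply_to_all`, `most_common_word`, `trim_read`) and made-up sample data:

```python
from collections import Counter


def apply_to_all(f, items):
    """The shared pattern: call f on every item, independently of the others."""
    return [f(item) for item in items]


def most_common_word(document):
    """Map a (non-empty) document to its most common word."""
    return Counter(document.lower().split()).most_common(1)[0][0]


def trim_read(read, n=4):
    """Map a read to a trimmed read by dropping its final n characters."""
    return read[:-n]


docs = {"doc1": "the people and the government", "doc2": "data data everywhere"}
pairs = apply_to_all(lambda kv: (kv[0], most_common_word(kv[1])), docs.items())
print(pairs)

trimmed = apply_to_all(trim_read, ["TACCTGCCGTAA", "GATTACGATATTA"])
print(trimmed)
```

Because each call to f touches only its own item, the list comprehension could be swapped for a distributed map without changing f.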
  • What if we want to compute the word frequency across all documents?
    Documents: US Constitution, Declaration of Independence, Articles of Confederation
    Output: (people, 78) (government, 123) (assume, 23) (history, 38) …
  • Compute the word frequency across 5M documents
    - You have millions of documents
    - Distribute the documents among k computers
    - map: for each document, return a set of (word, freq) pairs
    - Now what? We don't want a bunch of little histograms, we want one big histogram.
    - How can we make sure that a single computer has access to every occurrence of a given word regardless of which document it appeared in?
  • Compute the word frequency across 5M documents
    - Distribute the documents among k computers
    - map: for each document, return a set of (word, freq) pairs
    - Now we have a big distributed list of sets of word freqs.
    - reduce: now just count the occurrences of each word
    - We have our distributed histogram
  • Some distributed algorithm… Map, (Shuffle), Reduce (5/15/13, Bill Howe, eScience Institute)
  • MapReduce Programming Model (slide source: Google, Inc.)
    - Input & Output: each a set of key/value pairs
    - Programmer specifies two functions:
      map (in_key, in_value) -> list(out_key, intermediate_value)
        Processes input key/value pair; produces set of intermediate pairs
      reduce (out_key, list(intermediate_value)) -> list(out_value)
        Combines all intermediate values for a particular key; produces a set of merged output values (usually just one)
    - Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
  • Example: What does this do? (slide source: Google, Inc.)
    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, 1);
    reduce(String output_key, Iterator intermediate_values):
      // output_key: word
      // output_values: ????
      int result = 0;
      for each v in intermediate_values:
        result += v;
      Emit(result);
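The pseudocode above can be mirrored in plain Python to see what it computes. This is a single-process sketch, not Hadoop or Google's implementation: `run_mapreduce` is a hypothetical driver that simulates the shuffle with a dictionary, and unlike the slide's `Emit(result)` the reducer here also emits the word so the output is readable.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """For each word in the document, emit the intermediate pair (word, 1)."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum all the intermediate values collected for one word."""
    yield word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical single-process driver: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: gather values per key
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results


docs = [("d1", "a b a"), ("d2", "b c")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)  # [('a', 2), ('b', 2), ('c', 1)]
```

So the answer to the slide's question is a word count: the reducer receives every 1 emitted for a given word and adds them up.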
  • Example: Document Processing. The input is the Abridged Declaration of Independence. (3/12/09, Bill Howe, eScience Institute)
  • Example: Word length histogram. How many big, medium, and small words are used in the Abridged Declaration of Independence?
  • Example: Word length histogram
    - Big = Yellow = 10+ letters
    - Medium = Red = 5..9 letters
    - Small = Blue = 2..4 letters
    - Tiny = Pink = 1 letter
  • Example: Word length histogram. Split the document into chunks (Chunk 1, Chunk 2) and process each chunk on a different computer.
  • Example: Word length histogram. Each map task emits (key, value) pairs:
    - Map Task 1 (204 words): (yellow, 17) (red, 77) (blue, 107) (pink, 3)
    - Map Task 2 (190 words): (yellow, 20) (red, 71) (blue, 93) (pink, 6)
  • Word length histogram example (3/12/09 Bill Howe, eScience Institute 19)
    Map task 1 output: (yellow, 17) (red, 77) (blue, 107) (pink, 3)
    Map task 2 output: (yellow, 20) (red, 71) (blue, 93) (pink, 6)
    Shuffle step, input to the reduce tasks: (yellow, 17) (yellow, 20) (red, 77) (red, 71) (blue, 93) (blue, 107) (pink, 6) (pink, 3)
    Reduce output: (yellow, 37) (red, 148) (blue, 200) (pink, 9)
    Input text:
    A Declaration By the Representatives of the United States of America, in General Congress Assembled.
    When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change.
    We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying its foundation on such principles and organizing its power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.
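The word length histogram can be sketched in plain Python. This is only an illustration: the slide names the color buckets but not their cut-offs, so the thresholds in `bucket` are assumptions, and `map_task`, `shuffle`, and `reduce_task` are hypothetical helper names, not part of Hadoop or mrjob.

```python
from collections import defaultdict


def bucket(word):
    """Assign a word to a color bucket by length (cut-offs are assumed)."""
    n = len(word)
    if n >= 10:
        return "yellow"
    if n >= 5:
        return "red"
    if n >= 2:
        return "blue"
    return "pink"


def map_task(text):
    """Map: count the words of one input split per bucket."""
    counts = defaultdict(int)
    for word in text.split():
        counts[bucket(word)] += 1
    return list(counts.items())


def shuffle(map_outputs):
    """Shuffle: group the values of all map outputs by key."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: sum the partial counts of one key."""
    return key, sum(values)


# The two map outputs shown on the slide.
map1 = [("yellow", 17), ("red", 77), ("blue", 107), ("pink", 3)]
map2 = [("yellow", 20), ("red", 71), ("blue", 93), ("pink", 6)]
totals = dict(reduce_task(k, v) for k, v in shuffle([map1, map2]).items())
print(totals)
```

Feeding the slide's two map outputs through `shuffle` and `reduce_task` gives the totals on the slide: 37 yellow, 148 red, 200 blue, and 9 pink.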
  • MR Phases: each Map and Reduce task has multiple phases:
  • MR Phases
    MAP input: (did1, v1) (did2, v2) (did3, v3) ...
    MAP output: (w1, 1) (w2, 1) (w1, 1) … (w1, 1) (w2, 1) …
    Shuffle: (w1, (1, 1, 1, …, 1)) (w2, (1, 1, …)) (w3, (1, …)) …
    REDUCE output: (w1, 25) (w2, 77) (w3, 12) …
  • MR Phases: Hadoop in One Slide (src: Huy Vo, NYU Poly)
  • Taxonomy of Parallel Architectures (5/15/13 Bill Howe, eScience Institute 22)
    Annotations: "Easiest to program, but $$$$"; "Scales to 1000s of computers"
  • 5/15/13 Bill Howe, UW 32
  • Large-Scale Data Processing
    •  Many tasks process big data, produce big data
    •  Want to use hundreds or thousands of CPUs
       –  ... but this needs to be easy
       –  Parallel databases exist, but they are expensive, difficult to set up, and do not necessarily scale to hundreds of nodes.
    •  MapReduce is a lightweight framework, providing:
       –  Automatic parallelization and distribution
       –  Fault-tolerance
       –  I/O scheduling
       –  Status and monitoring
  • MapReduce Contemporaries
    •  Dryad (Microsoft)
       –  Relational Algebra
    •  Pig (Yahoo)
       –  Near Relational Algebra over MapReduce
    •  Hive (Facebook)
       –  SQL over MapReduce
    •  Cascading
       –  Relational Algebra
    •  Clustera
       –  U of Wisconsin
    •  HBase
       –  Indexing on HDFS
  • Talk is cheap, show me the code
  • Examples: Build an Inverted Index (5/15/13 Bill Howe, UW)
    Input:
        tweet1, ("I love pancakes for breakfast")
        tweet2, ("I dislike pancakes")
        tweet3, ("What should I eat for breakfast?")
        tweet4, ("I love to eat")
    Desired output:
        "pancakes", (tweet1, tweet2)
        "breakfast", (tweet1, tweet3)
        "eat", (tweet3, tweet4)
        "love", (tweet1, tweet4)
        …
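A minimal sketch of the inverted index as a map and a reduce step, simulated in plain Python on the four tweets from the slide. The function names and the tokenizing regular expression are assumptions made for this illustration.

```python
import re
from collections import defaultdict

tweets = {
    "tweet1": "I love pancakes for breakfast",
    "tweet2": "I dislike pancakes",
    "tweet3": "What should I eat for breakfast?",
    "tweet4": "I love to eat",
}


def map_phase(tweet_id, text):
    """Map: emit (word, tweet_id) once per distinct word in the tweet."""
    for word in set(re.findall(r"[a-z']+", text.lower())):
        yield word, tweet_id


def reduce_phase(word, tweet_ids):
    """Reduce: the posting list of a word is the sorted list of its tweets."""
    return word, sorted(tweet_ids)


# Shuffle: group the emitted tweet ids by word.
groups = defaultdict(list)
for tweet_id, text in tweets.items():
    for word, tid in map_phase(tweet_id, text):
        groups[word].append(tid)

index = dict(reduce_phase(word, ids) for word, ids in groups.items())
print(index["pancakes"])
```

The key is the word and the value is the tweet id, so the shuffle alone already gathers every tweet that mentions a word; the reducer only has to order the list.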
  • Examples: Relational Join
    Employee
        Name    SSN
        Sue     999999999
        Tony    777777777
    Assigned Departments
        EmpSSN       DepName
        999999999    Accounts
        777777777    Sales
        777777777    Marketing
    Employee ⨝ Assigned Departments
        Name    SSN          EmpSSN       DepName
        Sue     999999999    999999999    Accounts
        Tony    777777777    777777777    Sales
        Tony    777777777    777777777    Marketing
  • Examples: Relational Join in MapReduce: Before Map Phase
    Employee
        Name    SSN
        Sue     999999999
        Tony    777777777
    Assigned Departments
        EmpSSN       DepName
        999999999    Accounts
        777777777    Sales
        777777777    Marketing
    Key idea: Lump all the tuples together into one dataset
        Employee, Sue, 999999999
        Employee, Tony, 777777777
        Department, 999999999, Accounts
        Department, 777777777, Sales
        Department, 777777777, Marketing
    Question on the slide: What is this for?
  • Examples: Relational Join in MapReduce: Map Phase
    Input:
        Employee, Sue, 999999999
        Employee, Tony, 777777777
        Department, 999999999, Accounts
        Department, 777777777, Sales
        Department, 777777777, Marketing
    Output:
        key=999999999, value=(Employee, Sue, 999999999)
        key=777777777, value=(Employee, Tony, 777777777)
        key=999999999, value=(Department, 999999999, Accounts)
        key=777777777, value=(Department, 777777777, Sales)
        key=777777777, value=(Department, 777777777, Marketing)
    Question on the slide: Why do we use this as the key?
  • Examples: Relational Join in MapReduce: Reduce Phase
    Input:
        key=999999999, values=[(Employee, Sue, 999999999), (Department, 999999999, Accounts)]
        key=777777777, values=[(Employee, Tony, 777777777), (Department, 777777777, Sales), (Department, 777777777, Marketing)]
    Output:
        Sue, 999999999, 999999999, Accounts
        Tony, 777777777, 777777777, Sales
        Tony, 777777777, 777777777, Marketing
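The three join slides can be followed end to end in a small plain-Python simulation. It is a sketch, not Hadoop code: `map_phase`, `reduce_phase`, and the in-memory shuffle are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Before the map phase: all tuples lumped into one dataset, tagged with
# the name of the relation they came from.
records = [
    ("Employee", "Sue", "999999999"),
    ("Employee", "Tony", "777777777"),
    ("Department", "999999999", "Accounts"),
    ("Department", "777777777", "Sales"),
    ("Department", "777777777", "Marketing"),
]


def map_phase(record):
    """Map: key every tuple by the join attribute (the SSN)."""
    relation = record[0]
    ssn = record[2] if relation == "Employee" else record[1]
    return ssn, record


def reduce_phase(ssn, values):
    """Reduce: pair each employee tuple with each department tuple."""
    employees = [v for v in values if v[0] == "Employee"]
    departments = [v for v in values if v[0] == "Department"]
    for _, name, emp_ssn in employees:
        for _, dep_ssn, dep_name in departments:
            yield name, emp_ssn, dep_ssn, dep_name


# Shuffle: group the mapped tuples by key.
groups = defaultdict(list)
for record in records:
    key, value = map_phase(record)
    groups[key].append(value)

joined = [row for ssn, values in groups.items() for row in reduce_phase(ssn, values)]
for row in joined:
    print(", ".join(row))
```

The relation tag answers the first question on the slides: inside a reducer it is the only way to tell employee tuples from department tuples. The SSN is the key because it is the join attribute, so matching tuples from both relations reach the same reducer.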
  • Examples: Relational Join in MapReduce, again (5/15/13 Bill Howe, Data Science, Autumn 2012 19)
    Order(orderid, account, date)
        1, aaa, d1
        2, aaa, d2
        3, bbb, d3
    LineItem(orderid, itemid, qty)
        1, 10, 1
        1, 20, 3
        2, 10, 5
        2, 50, 100
        3, 20, 1
    Map (each tuple is keyed by orderid and tagged with its relation name):
        Order
        1, aaa, d1 → 1 : "Order", (1, aaa, d1)
        2, aaa, d2 → 2 : "Order", (2, aaa, d2)
        3, bbb, d3 → 3 : "Order", (3, bbb, d3)
        Line
        1, 10, 1 → 1 : "Line", (1, 10, 1)
        1, 20, 3 → 1 : "Line", (1, 20, 3)
        2, 10, 5 → 2 : "Line", (2, 10, 5)
        2, 50, 100 → 2 : "Line", (2, 50, 100)
        3, 20, 1 → 3 : "Line", (3, 20, 1)
    Reducer for key 1
        Input: "Order", (1, aaa, d1); "Line", (1, 10, 1); "Line", (1, 20, 3)
        Output: (1, aaa, d1, 1, 10, 1); (1, aaa, d1, 1, 20, 3)
  • Examples: Simple Social Network Analysis: Count Friends
    Input: Jim, Sue; Sue, Jim; Lin, Joe; Joe, Lin; Jim, Kai; Kai, Jim; Jim, Lin; Lin, Jim
    MAP output: Jim, 1; Sue, 1; Lin, 1; Joe, 1; Jim, 1; Kai, 1; Jim, 1; Lin, 1
    SHUFFLE output: Jim, (1, 1, 1); Lin, (1, 1); Sue, (1); Kai, (1); Joe, (1)
    REDUCE, desired output: Jim, 3; Lin, 2; Sue, 1; Kai, 1; Joe, 1
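The friend count can be sketched with the shuffle written as "sort, then group", which is roughly what a MapReduce framework does between the two phases. The variable names are chosen for this illustration only.

```python
from itertools import groupby
from operator import itemgetter

edges = [
    ("Jim", "Sue"), ("Sue", "Jim"), ("Lin", "Joe"), ("Joe", "Lin"),
    ("Jim", "Kai"), ("Kai", "Jim"), ("Jim", "Lin"), ("Lin", "Jim"),
]

# Map: emit (person, 1) for every friendship edge that starts at that person.
mapped = [(person, 1) for person, _friend in edges]

# Shuffle: sort by key so that equal keys are adjacent, then group them.
shuffled = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))

# Reduce: sum the ones of each group.
friend_counts = {person: sum(one for _, one in group) for person, group in shuffled}
print(friend_counts)
```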
  • Examples: Matrix Multiplication [worked example of A X B = C; the matrix entries are not recoverable]
  • Examples: Matrix Multiply in MapReduce
    C = A X B
    A has dimensions L,M
    B has dimensions M,N
    •  In the map phase:
       –  for each element (i,j) of A, emit ((i,k), A[i,j]) for k in 1..N
       –  for each element (j,k) of B, emit ((i,k), B[j,k]) for i in 1..L
    •  In the reduce phase, emit
       –  key = (i,k)
       –  value = Sum_j (A[i,j] * B[j,k])
  • Examples: A X B = C
    •  One reducer per output cell
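The matrix multiply recipe can be sketched as follows. Two details are assumptions added for this illustration: indices are 0-based, and every emitted value carries the matrix name and the index j, which the reducer needs in order to pair A[i,j] with B[j,k] (the slide leaves that bookkeeping implicit).

```python
from collections import defaultdict


def map_phase(A, B):
    """Map: send every element to each output cell (i, k) it contributes to."""
    L, M, N = len(A), len(B), len(B[0])
    for i in range(L):
        for j in range(M):
            for k in range(N):
                yield (i, k), ("A", j, A[i][j])
    for j in range(M):
        for k in range(N):
            for i in range(L):
                yield (i, k), ("B", j, B[j][k])


def reduce_phase(values):
    """Reduce: one call per output cell, computing Sum_j A[i,j] * B[j,k]."""
    a = {j: v for tag, j, v in values if tag == "A"}
    b = {j: v for tag, j, v in values if tag == "B"}
    return sum(a[j] * b[j] for j in a)


def matmul(A, B):
    # Shuffle: group the emitted values by output cell.
    groups = defaultdict(list)
    for key, value in map_phase(A, B):
        groups[key].append(value)
    C = [[0] * len(B[0]) for _ in A]
    for (i, k), values in groups.items():
        C[i][k] = reduce_phase(values)
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = matmul(A, B)
print(C)
```

Each element of A is replicated N times and each element of B is replicated L times, which is the price of having one independent reducer per output cell.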
  • Meeting Hadoop
  • Cluster Computing
    •  Large number of commodity servers, connected by high speed, commodity network
    •  Rack: holds a small number of servers
    •  Data center: holds many racks
  • Cluster Computing (slide src: Dan Suciu and Magda Balazinska)
    •  Massive parallelism:
       –  100s, or 1000s, or 10000s servers
       –  Many hours
    •  Failure:
       –  If the mean time between failures of a server is 1 year
       –  Then 10000 servers have about one failure / hour (10000 failures spread over the 8760 hours of a year)
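The failure estimate is simple arithmetic, sketched here under the assumption that failures are independent and spread evenly over the year. The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def failures_per_hour(servers, mtbf_years=1.0):
    """Expected failures per hour in a cluster of identical servers."""
    return servers / (mtbf_years * HOURS_PER_YEAR)


rate = failures_per_hour(10_000)
print(round(rate, 2))  # 1.14
```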
  • Distributed File System (DFS)
    •  For very large files: TBs, PBs
    •  Each file is partitioned into chunks, typically 64MB
    •  Each chunk is replicated several times (≥3), on different racks, for fault tolerance
    •  Implementations:
       –  Google's DFS: GFS, proprietary
       –  Hadoop's DFS: HDFS, open source
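To get a feel for these numbers, here is a rough sketch of how many chunks a file turns into. It assumes binary units (1 TB = 1024 * 1024 MB) and exactly three replicas; both are assumptions for illustration, and the function names are hypothetical.

```python
import math

CHUNK_MB = 64
REPLICAS = 3


def chunk_count(file_mb):
    """Number of chunks needed to hold a file of the given size in MB."""
    return math.ceil(file_mb / CHUNK_MB)


def stored_chunks(file_mb):
    """Number of chunk copies the cluster stores, replicas included."""
    return chunk_count(file_mb) * REPLICAS


one_tb_in_mb = 1024 * 1024
print(chunk_count(one_tb_in_mb), stored_chunks(one_tb_in_mb))
```

Under these assumptions a 1 TB file becomes 16384 chunks, or 49152 stored copies.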
  • Hadoop
  • What is Hadoop? [the body text of this slide is not recoverable]
  • [two Hadoop slides whose text is not recoverable]
  • Use cases: What is Hadoop used for?
    Big social data analysis; Text mining, patterns search; Machine log analysis; Geo-spatial analysis; Genome Analysis; Drug Discovery; Fraud and compliance management; Video and image analysis; Trend Analysis
  • Who uses Hadoop? Long list: Amazon/A9, Facebook, Google, IBM, Disney, Last.fm, New York Times, Yahoo!, Twitter, Linked in
  • Hadoop ecosystem: HBase, MapReduce, HDFS, Chukwa, Hive, Pig, Avro, ZooKeeper, Sqoop, Flume, Oozie, Whirr
  • [two Hadoop slides whose text is not recoverable]
  • Hadoop distributed file system (HDFS)
  • Hadoop Components: Job tracker, Task Tracker, Name Node, Map, Reduce, HDFS
  • MapReduce: http://labs.google.com/papers/mapreduce.html
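  • MapReduce example
        from mrjob.job import MRJob
        import re

        # Words are runs of word characters and apostrophes.
        WORD_RE = re.compile(r"[\w']+")


        class MRWordFreqCount(MRJob):

            def mapper(self, _, line):
                # Emit (word, 1) for every word in the input line.
                for word in WORD_RE.findall(line):
                    yield (word.lower(), 1)

            def reducer(self, word, counts):
                # Sum the ones emitted for each word.
                yield (word, sum(counts))


        if __name__ == '__main__':
            MRWordFreqCount.run()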
  • The mrjob project
    Created by Yelp's engineering team
    Fully open source
    Written entirely in Python
    Uses MapReduce for processing
    Runs on both Amazon EMR and Hadoop
  • Goals of mrjob
    If you want to learn MapReduce, it is for you
    If you have a huge problem that needs a lot of processing and you don't feel like fiddling with Hadoop
    If you already have a Hadoop cluster and want to run Python scripts
    If you want to migrate your Python code from Hadoop to EMR
    If you don't want to write Python (impossible!), it is not for you!
  • Important steps
        sudo easy_install mrjob
  • Let's go to a demo... The text of Obama's 2009 inauguration speech.
  • MapReduce performance
  • More information
    http://packages.python.org/mrjob/
    https://github.com/Yelp/mrjob
  • Distributed Computing with mrJob (https://github.com/Yelp/mrjob)
    Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
  • Atepassar Recommendations (http://atepassar.com)
    Problem: how to recommend new friends?
  • Atepassar Recommendations (http://atepassar.com)
  • Atepassar Recommendations (http://atepassar.com)
    Input (facebook_data.csv), one "user;friend,friend,..." line per user:
        marcel;jonas,maria,jose,amanda
        maria;carol,fabiola,amanda,marcel
        amanda;paula,patricia,maria,marcel
        carol;maria,jose,patricia
        fabiola;maria
        paula;fabio,amanda
        patricia;amanda,carol
        jose;marcel,carol
        jonas;marcel,fabio
        fabio;jonas,paula
        carla
    Command:
        $ python friends_recommender.py -r emr --num-ec2-instances 5 facebook_data.csv > output.dat
    Output (output.dat):
        "marcel" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["patricia", 1], ["paula", 1]]
        "maria" [["jose", 2], ["patricia", 2], ["jonas", 1], ["paula", 1]]
        "patricia" [["maria", 2], ["jose", 1], ["marcel", 1], ["paula", 1]]
        "paula" [["jonas", 1], ["marcel", 1], ["maria", 1], ["patricia", 1]]
        "amanda" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["jonas", 1], ["jose", 1]]
        "carol" [["amanda", 2], ["marcel", 2], ["fabiola", 1]]
        "fabio" [["amanda", 1], ["marcel", 1]]
        "fabiola" [["amanda", 1], ["carol", 1], ["marcel", 1]]
        "jonas" [["amanda", 1], ["jose", 1], ["maria", 1], ["paula", 1]]
        "jose" [["maria", 2], ["amanda", 1], ["jonas", 1], ["patricia", 1]]
  • Atepassar Recommendations (http://atepassar.com): the input, command, and output of the previous slide are shown again, with these input lines called out:
        marcel;jonas,maria,jose,amanda
        carol;maria,jose,patricia
        fabio;jonas,paula
        ...
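The output on these slides is consistent with counting mutual friends. The sketch below is not the author's friends_recommender.py; it is a plain-Python simulation written for this illustration, with hypothetical function names, and it assumes the friendship lists are symmetric, as they are in the sample data.

```python
from collections import defaultdict
from itertools import permutations

DATA = """marcel;jonas,maria,jose,amanda
maria;carol,fabiola,amanda,marcel
amanda;paula,patricia,maria,marcel
carol;maria,jose,patricia
fabiola;maria
paula;fabio,amanda
patricia;amanda,carol
jose;marcel,carol
jonas;marcel,fabio
fabio;jonas,paula
carla"""


def parse(line):
    """Split 'user;friend,friend,...' into a user and a list of friends."""
    user, _, friends = line.partition(";")
    return user, [f for f in friends.split(",") if f]


def map_phase(user, friends):
    # Flag existing friendships with 0 so the reducer can exclude them.
    for friend in friends:
        yield user, (friend, 0)
    # Any two friends of this user have this user as a mutual friend.
    for a, b in permutations(friends, 2):
        yield a, (b, 1)


def reduce_phase(values):
    """Count mutual friends per candidate, skipping existing friends."""
    already = {name for name, flag in values if flag == 0}
    counts = defaultdict(int)
    for name, flag in values:
        if flag == 1 and name not in already:
            counts[name] += 1
    # Most mutual friends first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def recommend(text):
    groups = defaultdict(list)  # shuffle: group emitted values by user
    for line in text.splitlines():
        user, friends = parse(line)
        for key, value in map_phase(user, friends):
            groups[key].append(value)
    return {user: reduce_phase(values) for user, values in groups.items()}


recs = recommend(DATA)
print(recs["marcel"])
```

For marcel this reproduces the first output line of the slide: carol with two mutual friends, then fabio, fabiola, patricia, and paula with one each. carla has no friends listed, so nothing is emitted for her.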
  • Atepassar Recommendations (http://atepassar.com)
    fsim(u,i) = w1 * f1(u,i) + w2 * f2(u,i) + b
    https://gist.github.com/3970945#file_atepassar_recommender.py
  • Atepassar Recommendations (http://atepassar.com)
    Celery - for scheduling the collector and executor jobs.
    mrJob - for MapReduce and access to Hadoop
    MongoDb - for storing the recommendations
    Boto - access to the files on S3.
  • The best part!
  • Interesting projects
    Data structures for fast manipulation
    - slicing
    - indexing
    - subsetting
    Handling missing data
    Aggregations, time series
    Pandas: a data analysis library for Python, poised to give R a run for its money… http://pandas.pydata.org/
  • Interesting projects
    Another framework for distributed computing with Python and MapReduce.
    Created by the Nokia Research Center.
    Its backend is written in Erlang (functional, concurrent, and very scalable!)
    It does not use HDFS, the most widely used file system, but a new one of its own (DDFS).
    http://discoproject.org/
  • Interesting projects: Python & MongoDb
    http://api.mongodb.org/python/2.0/examples/map_reduce.html
    MongoDb - non-relational (NoSQL) database
    Has native, built-in support for MapReduce.
    You write the code in JS, which is not very readable and ties you to Mongo ...
        >>> from bson.code import Code
        >>> reduce = Code("function (key, values) {"
        ...               "  var total = 0;"
        ...               "  for (var i = 0; i < values.length; i++) {"
        ...               "    total += values[i];"
        ...               "  }"
        ...               "  return total;"
        ...               "}")
  • Interesting projects: Dumbo
    https://github.com/klbostee/dumbo/wiki/Short-tutorial
    One of the first libraries built on top of MapReduce and Python.
    Hard to get started with, and it is outdated :(
  • Interesting projects: Pydoop
    A Python wrapper on top of Hadoop for distributed computing.
    http://pydoop.sourceforge.net/docs/index.html
    Nice, but it takes some work to configure.
  • Interesting projects
    MapReduce with Python on Google App Engine
    https://developers.google.com/appengine/docs/python/dataprocessing/
    Still experimental, and it ties you to the App Engine platform.
  • Interesting projects
    Machine learning algorithms
    Supervised & unsupervised
    Preprocessing, data extraction
    Classifier evaluation, Pipeline, feature selection.
    http://scikit-learn.org/stable/
  • Interesting projects
    Natural language processing
    Several tools for tokenization, POS tagging, named entity recognition, classifiers, etc.
    Several corpora available!
    http://nltk.org/
  • Interesting projects: keep an eye on...
    Pipeline for distributed Natural Language Processing, made in Python
    https://github.com/NAMD/pypln
  • Interesting projects
    Our Project's Home Page: http://muricoca.github.com/crab
  • Future Releases
    Planned Release 0.13
        New home for python-recsys: https://github.com/python-recsys/crab
    Planned Release 0.14
        Support for Item-Based Recommenders using MapReduce with MrJob
    New committers: vinnigracindo, ocelma, fcurella
  • Join us!
    1. Read our Wiki Page: https://github.com/muricoca/crab/wiki/Developer-Resources
    2. Check out our current sprints and open issues: https://github.com/muricoca/crab/issues
    3. Forks, Pull Requests mandatory
    4. Join us at irc.freenode.net #muricoca or at our discussion list: http://groups.google.com/group/scikit-crab
  • Several others ...
    Matplotlib, StatsModels, SciPy, NumPy, NetworkX, PyBrain, milk, Orange
  • Recommended books
  • Recommended books
    http://shop.oreilly.com/product/0636920022640.do?cmp=il-radar-ebooks-big-data-now-radar
    For free...
  • Recommended books
    For free...
    http://infolab.stanford.edu/~ullman/mmds/book.pdf
  • Recommended articles
    For free...
    http://aimotion.blogspot.com.br/2012/10/atepassar-recommendations-recommending.html
    http://aimotion.blogspot.com.br/2012/08/introduction-to-recommendations-with.html
  • Recommended articles
    For free...
    https://github.com/marcelcaraciolo/recsys-mapreduce-mrjob
  • BigData with Python: A gentle and simple introduction
    Marcel Caraciolo, @marcelcaraciolo
    Developer, scientist, contributor to the Crab recsys project, has worked with Python for 6 years, interested in mobile, education, machine learning and dataaaaa!
    Recife, Brazil - http://aimotion.blogspot.com