BigData with PythonA gentle and simple introductionMarcel Caraciolo@marcelcaracioloDeveloper, Cientist, contributor to the...
https://class.coursera.org/datasci-001/DisclaimerAlguns slides foramretirados das aptsdo curso de Bill Howe -Introduction ...
About meCo-founder of Crab - Python recsys libraryCientist Chief at Atepassar, e-learning social networkCo-Founder and Ins...
What is BigData ?
Big Data“Big Data is any data that is expensive to manageand hard to extract value from.”Michael Franklin Thomas M. Siebel...
Big DataErik Larson, 1989, Harper’s magazine“The keepers of big data say they do it for theconsumer’s benefit. But data ha...
!"#$%"$#&()")(*$+,(!"#$%"$#&()")(!"#$%&(%$)*%+%),-+.),#$$)*#/(#*)01.#2%03)!41-%$)*%+%5)6$70)5)1$-18)0+9#%25):%1.-(#)7#(#9%...
ChallengesVolumethe size of the dataVelocitythe latency of data processing relative to thegrowing demand for interactivity...
!"#$%&&$%"()*+",*+$-./0$.,12()$$.&3"4$ .)1,5"4$.&12)$.&12)$
Where does bigdata comes from ?“data exhaust” from customersnew and pervasive sensorsthe ability to “keep everything”
!"#$%&()*"+$#"!"#$%$&()*+,"-.%(/%"01(!"#$%&"()%*+%2"-&*3&(.4.+*(/"5607-8(,"-./%(-0"&12%3/&124.(1%500%6&216%./7%!.281&%0#.(...
Where does bigdata comes from ?photo in public domain
Where does bigdata comes from ?4/28/13 Bill Howe, UW 11
!"#$%&()#*+$$,-.#(/$$$011$*2))23$-.45#$67#&7$-7$8$9-&."$:;<:=$$<$23$>$?3@#&3#@$67#&7$"-5#$-$,-.#(/$$-..63@$$$$9&#$@"-3$>;$...
!"#$%&!!!())!*$++$,-!./&/0!12)!*$++$,-!34$+5!6#&&6/!!789!:$++$,-!/&4;<!=.&$&/!4!345!!>!"?!3464!@,!4-4+5/$/!A&-&46&3!34$+5!...
!"#$%&&(!!!)*+!#,-.&/0!.$(1!!2++!#,-.&/0!3#$4!!5$6-44$/0!3#$4!0&378!9-(!!*+++!.$(!!):!;<!0&#&!9-(!&/&783!=$/$(&#$0!0&378!!...
Scalability
What doesScalable mean ?Operationally:In the past: Works even if data doesn’t fit in main memoryNow: Can make use of 1000s...
ExampleGiven a set of DNA sequences, find allsequences equal to GATTACGATATTA .
GATTACGATATTATACCTGCCGTAA
TACCTGCCGTAA = GATTACGATATTA?GATTACGATATTAtime = 0No.Time = 0
Time = 1CCCCCAATGAC = GATTACGATATTA?GATTACGATATTAtime = 1No.
Time = 17GATTACGATATTA contains GATTACGATATTA?GATTACGATATTAtime = 17Yes!Send it to the output.
GATTACGATATTA40 records, 40 comparionsN records, N comparisonsThe algorithmic complexity is order N: O(N)
GATTACGATATTAAAAATCCTGCAAAACGCCTGCATTTACGTCAATTTTCGTAATTWhat if we sort the sequences?What if we sort the sequences ?
CTGTACACAACCT < GATTACGATATTAGATTACGATATTANo match.Skip to 75% markStart at the 50% marktime = 00% 100%CTGTACACAACCTTime = 0
GGATACACATTTA > GATTACGATATTAGATTACGATATTANo match.0% 100%GGATACACATTTA
GATATTTTAAGC < GATTACGATATTAGATTACGATATTANo match.Skip back to 68.75% mark0% 100%GATATTTTAAGC
GATTACGATATTA = GATTACGATATTAGATTACGATATTA0% 100%Match!Walk through the records until wefail to match.
GATTACGATATTA0% 100%How many comparisons did we do?40 records, only 4 comparisonsN records, log(N) comparisonsThis algorit...
RelationalDatabasesRelational Databases•  Databases are good at Needle in Haystack problems:–  Extracting small results fr...
ExampleNew task: Read TrimmingGiven a set of DNA sequences,trim the final n bps of each sequenceGenerate a new dataset
GATTACGATATTATACCTGCCGTAA
TACCTGCCGTAA becomes TACCTtime = 0Time = 0
Time = 1CCCCCAATGAC becomes CCCCCtime = 1
Time = 17GATTACGATATTA becomes GATTAtime = 17
Can we use an index?No. We have to touch every record no matter what.The task is fundamentally O(N)Can we do any better?
Time = 0
Time = 1
Time = 2
Time = 3
Time = 7time = 7How much time did this take?7 cycles40 records, 6 workersO(N/k)time = 7How much time did this ta7 cycles40...
Schematic of a Parallel “Read Trimming” TaskYou are given short“reads”: genomicsequences about35-75 characters eachDistrib...
You are given TIFFimagesDistribute the imagesamong k computersf f f f f ff is a function toconvert TIFF toPNG; apply it to...
You have sets ofparameters forthousands of smallsimulationsDivide the parametersets among kcomputersf f f f f ff is runs t...
You have millionsof documentsDistribute thedocuments among kcomputersf f f f f ff finds the mostcommon word in asingle doc...
Consider a slightly more general programto compute the word frequency of everyword in a singe documentConsider a slightly ...
You have millionsof documentsDistribute thedocuments among kcomputersf f f f f fFor each documentf returns a set of(word, ...
There’s a pattern...There’s a pattern here….•  A function that maps a read to a trimmed read•  A function that maps a TIFF...
What if we want to compute the wordfrequency across all documents?US Constitution Declaration ofIndependenceArticles ofCon...
You have millionsof documentsDistribute thedocuments amongk computersmap map map map map mapFor each document,return a set...
Compute the word frequency across 5M docsDistribute thedocuments among kcomputersmap map map map map mapFor each document,...
Compute the word frequency across 5M docs5/15/13 Bill Howe, eScience Institute 11Some distributed algorithm…Map(Shuffle)Re...
Compute the word frequency across 5M docs5/15/13 Bill Howe, eScience Institute 12MapReduce Programming Model•  Input & Out...
Compute the word frequency across 5M docs5/15/13 Bill Howe, eScience Institute 13Example: What does this do?map(String inp...
Example3/12/09 Bill Howe, eScience Institute 14Abridged Declaration of IndependenceA Declaration By the Representatives of...
Word length histogram example3/12/09 Bill Howe, eScience Institute 14Abridged Declaration of IndependenceA Declaration By ...
Word length histogram exampleAbridged Declaration of IndependenceA Declaration By the Representatives of the United States...
Word length histogram exampleAbridged Declaration of IndependenceA Declaration By the Representatives of the United States...
Word length histogram example(yellow, 20)(red, 71)(blue, 93)(pink, 6 )Abridged Declaration of IndependenceA Declaration By...
Word length histogram example3/12/09 Bill Howe, eScience Institute 19(yellow, 17)(red, 77)(blue, 107)(pink, 3)(yellow, 20)...
MR Phases!"#$%&(")$*+&,&MR Phases•  Each Map and Reduce task has multiple phases:
MR PhasesMAP REDUCE(w1,1)(w2,1)(w1,1)…(w1,1)(w2,1)…(did1,v1)(did2,v2)(did3,v3). . . .(w1, (1,1,1,…,1))(w2, (1,1,…))(w3,(1…...
MR PhasesHadoop inOne Slidesrc: Huy Vo,NYU Poly
Taxonomy ofParallel Architectures5/15/13 Bill Howe, eScience Institute 22Taxonomy of Parallel ArchitecturesEasiest to prog...
5/15/13 Bill Howe, UW 32
5/15/13 Bill Howe, UW 25
5/15/13 Bill Howe, eScience Institute 26Large-Scale Data Processing•  Many tasks process big data, produce big data•  Want...
5/15/13 Bill Howe, eScience Institute 35MapReduce Contemporaries•  Dryad (Microsoft)–  Relational Algebra•  Pig (Yahoo)–  ...
Talk is cheap,Show me the code
ExamplesMore Examples: Build an Inverted Index5/15/13 Bill Howe, UW 14Input:tweet1, (“I love pancakes for breakfast”)tweet...
ExamplesMore Examples: Relational Join5/15/13 Bill Howe, UW 15Name SSNSue 999999999Tony 777777777EmpSSN DepName999999999 A...
ExamplesRelational Join in MapReduce: Before Map Phase5/15/13 Bill Howe, UW 16Name SSNSue 999999999Tony 777777777EmpSSN De...
ExamplesRelational Join in MapReduce: Map Phase5/15/13 Bill Howe, UW 17Employee, Sue, 999999999Employee, Tony, 777777777De...
ExamplesRelational Join in MapReduce: Reduce Phase5/15/13 Bill Howe, UW 18key=999999999, values=[(Employee, Sue, 999999999...
ExamplesRelational Join in MapReduce, again5/15/13 Bill Howe, Data Science, Autumn 2012 19Order(orderid, account, date)1, ...
ExamplesSimple Social Network Analysis:Count FriendsJim, SueSue, JimLin, JoeJoe, LinJim, KaiKai, JimJim, LinLin, JimInput ...
ExamplesMatrix Multiplication!" #" $" %&"" &" %#" !"!" %&"$" #"%#" %&"(" $"X =!" %)"&#" $"
ExamplesMatrix Multiply in MapReduceC = A X BA has dimensions L,MB has dimensions M,N•  In the map phase:–  for each eleme...
ExamplesABXA B=•  One reducer per output cell
Meeting Hadoop
Cluster Computing•  Large number of commodity servers,connected by high speed, commoditynetwork•  Rack: holds a small numb...
Cluster Computing•  Massive parallelism:– 100s, or 1000s, or 10000s servers– Many hours•  Failure:– If medium-time-between...
Distributed File System (DFS)•  For very large files: TBs, PBs•  Each file is partitioned into chunks,typically 64MB•  Eac...
Hadoop
!"#$%&%(#)**+%,%!"#$%&"#()*(+(%"(&"#(,-.%/#-/0,#12,"(,3#4-("#*%4/,%&0/#*-#$."5,2-#44%)32)()#/62,721-2882*%/9.(,*6(,#:
!"#$%&()"*&+&*",)-&$(!"#$%%!&((#)&#&*!+)(,-%..//$)0(,123(),45(%#4)0&)6(%&4(,)07&4(&$)484(&94:1(%*#,7&0))6432",)-&$(4;#%))%...
!"#$%&()*+),)!"##$%&"()*+,-.&/01&*23$4,5%&678193:&1:08&5&93;/"##$%&<= )&,819>;$3;&-&./0&)01)2.(00$)$&034/)
!"#$%&"#"$$()&*$+"$,&-../$0"#-$1.2$3+456.%+&7$-&*&$&8&79"+"$:#;*$<+8+84=$/&>#28"$"#&2%)$?&%)+8#$7.4$&8&79"+"$@#.A"/&%+B&7$...
!"#$%&&$()*##+$,$-#./$-0&1$2$34)5#.637$$2$8)9:##;$$2$<##/-$$2$=>?$$2$@0&.A$2$B)&1CD4$$2$EF$G#H;$I04&$$2$G)"##J$$2$IF0KH$2$...
!"#$$%&($)*)+,&-&!"#$%& #()%*+,%-!./0&12+34#&!56%& 758&96:;&<;;=%%(%:&0>;;(&/?+@%& A;B5%& C25::&
!"#$$%&#()*+,-$.&/&
!"#$%&(&!""#$%&()*+&,#-&.&/&010#
!"#$%&()&&!""#!"#$$%&#()*+,)-#&./-&(0()-1&
!"#$$%&$(%$)*)+,&-&.$/&+0"12*0&3",2&30"12*0&4"(*&4$#*&5"%&6*#71*&!89:&&&&
MapReducehttp://labs.google.com/papers/mapreduce.html
Exemplo de MapReducefrom mrjob.job import MRJobimport reWORD_RE = re.compile(r"[w]+")class MRWordFreqCount(MRJob):def mapp...
Projeto MrJobCriado pela Equipe de Engenharia doYelpTotalmente Open-SourceTodo em PythonUtiliza Map-Reduce para Processame...
Objetivos do MrJobsSe você quer aprender MapReduce, ele é para vocêSe você tem um problema cavalar e precisa de muitoproce...
Passos importantessudo easy_install mrjob
Vamos a uma demo...Texto da posse de Obama em 2009.
Desempenho MapReduce
Mais informaçõeshttp://packages.python.org/mrjob/https://github.com/Yelp/mrjob
Distributed Computing with mrJobhttps://github.com/Yelp/mrjobElsayed et al: Pairwise Document Similarity in Large Collecti...
Distributed Computing with mrJobhttps://github.com/Yelp/mrjobElsayed et al: Pairwise Document Similarity in Large Collecti...
Atepassar Recommendationshttp://atepassar.comProblema: Como recomendar novos amigos ?
Atepassar Recommendationshttp://atepassar.com
Atepassar Recommendationshttp://atepassar.commarcel;jonas,maria,jose,amandamaria;carol,fabiola,amanda,marcelamanda;paula,p...
Atepassar Recommendationshttp://atepassar.commarcel;jonas,maria,jose,amandamaria;carol,fabiola,amanda,marcelamanda;paula,p...
Atepassar Recommendationshttp://atepassar.comfsim(u,i) = w1 f1(u,i) + w2 f2(u,i) + b,https://gist.github.com/3970945#file_a...
Atepassar Recommendationshttp://atepassar.comCelery - para agendamento dos jobs coletores eexecutores.mrJob - para mapredu...
A melhor parte!
Projetos interessantesEstrutura de dados para manipulação rápida- slicing- indexing- subsetingHandling missing dataAgregaç...
Projetos interessantesOutro framework para computação distribuída comPython com MapReduce.Criado pelo Instituto Nokia.Back...
Projetos interessanteshttp://api.mongodb.org/python/2.0/examples/map_reduce.htmlPython & MongoDbMongoDb - Banco de Dados N...
Projetos interessanteshttps://github.com/klbostee/dumbo/wiki/Short-tutorialDumboUma das primeiras bibliotecas em cima doMa...
Projetos interessantesUm wrapper em Python em cima do Hadooppara computação distribuída.http://pydoop.sourceforge.net/docs...
Projetos interessantesMapreduce com Python na Google AppEnginehttps://developers.google.com/appengine/docs/python/dataproc...
Projetos interessantesAlgoritmos de aprendizagem de máquinaSupervisionados & Não supervisionadosPré-processamento, extraçã...
Projetos interessantesProcessamento de linguagem naturalVárias ferramentas para tokenização,pos tagging, named entity reco...
Projetos interessantesProcessamento de linguagem naturalVárias ferramentas para tokenização,pos tagging, named entity reco...
Projetos interessantesOur Project’s Home Pagehttp://muricoca.github.com/crab
Future ReleasesPlanned Release 0.13New home for python-recsys:https://github.com/python-recsys/crabPlanned Release 0.14Sup...
Join us!1. Read our Wiki Pagehttps://github.com/muricoca/crab/wiki/Developer-Resources2. Check out our current sprints and...
Vários outros ...MatplotlibStatsModelsScipyNumpyMatplotlibNetworkXPyBrainmilkOrange
Livros recomendados
Livros recomendadoshttp://shop.oreilly.com/product/0636920022640.do?cmp=il-radar-ebooks-big-data-now-radarFor free...
Livros recomendadosFor free...http://infolab.stanford.edu/~ullman/mmds/book.pdf
Artigos recomendadosFor free...http://aimotion.blogspot.com.br/2012/10/atepassar-recommendations-recommending.htmlhttp://a...
Artigos recomendadosFor free...http://aimotion.blogspot.com.br/2012/10/atepassar-recommendations-recommending.htmlhttp://a...
BigData with PythonA gentle and simple introductionMarcel Caraciolo@marcelcaracioloDeveloper, Cientist, contributor to the...
Big Data com Python
Big Data com Python
Upcoming SlideShare
Loading in...5
×

Big Data com Python

10,134

Published on

Python Workshop on-line at Mutirao Python on-line via pycursos.com

https://www.youtube.com/watch?feature=player_embedded&v=DFh6l-h6-gw

Published in: Technology, Education

Big Data com Python

  1. 1. BigData with PythonA gentle and simple introductionMarcel Caraciolo@marcelcaracioloDeveloper, Cientist, contributor to the Crab recsys project,works with Python for 6 years, interested at mobile,education, machine learning and dataaaaa!Recife, Brazil - http://aimotion.blogspot.com
  2. 2. https://class.coursera.org/datasci-001/DisclaimerAlguns slides foramretirados das aptsdo curso de Bill Howe -Introduction to DataScience
  3. 3. About meCo-founder of Crab - Python recsys libraryCientist Chief at Atepassar, e-learning social networkCo-Founder and Instructor of PyCursos, teaching Python on-lineCo-Founder of Pingmind, on-line infrastructure for MOOC’sInterested at Python, mobile, e-learning and machine learning!
  4. 4. What is BigData ?
  5. 5. Big Data“Big Data is any data that is expensive to manageand hard to extract value from.”Michael Franklin Thomas M. SiebelProfessor of Computer Science Director of the Algorithms,Machines and People Lab University of Berkeley
  6. 6. Big DataErik Larson, 1989, Harper’s magazine“The keepers of big data say they do it for theconsumer’s benefit. But data have a way ofbeing used for purposes other than originallyintended.”
  7. 7. !"#$%"$#&()")(*$+,(!"#$%"$#&()")(!"#$%&(%$)*%+%),-+.),#$$)*#/(#*)01.#2%03)!41-%$)*%+%5)6$70)5)1$-18)0+9#%25):%1.-(#)7#(#9%+#*5);2$)#+1<3)Big Data, muitos dados.
  8. 8. ChallengesVolumethe size of the dataVelocitythe latency of data processing relative to thegrowing demand for interactivityVarietythe diversity of sources, formats, quality,structures
  9. 9. !"#$%&&$%"()*+",*+$-./0$.,12()$$.&3"4$ .)1,5"4$.&12)$.&12)$
  10. 10. Where does bigdata comes from ?“data exhaust” from customersnew and pervasive sensorsthe ability to “keep everything”
  11. 11. !"#$%&()*"+$#"!"#$%$&()*+,"-.%(/%"01(!"#$%&"()%*+%2"-&*3&(.4.+*(/"5607-8(,"-./%(-0"&12%3/&124.(1%500%6&216%./7%!.281&%0#.(1%91:.-/*1(9-.%7/,(;-(5*5"+(/"5607-8(9211/%7.&.%(1/&21%:#37%6&.&1%723;1%<#.6)%!1-2=%,>!?@%:(3.#%AB!%!"#$%*(.-.%7/,(<$8(=.&.(
  12. 12. Where does bigdata comes from ?photo in public domain
  13. 13. Where does bigdata comes from ?4/28/13 Bill Howe, UW 11
  14. 14. !"#$%&()#*+$$,-.#(/$$$011$*2))23$-.45#$67#&7$-7$8$9-&."$:;<:=$$<$23$>$?3@#&3#@$67#&7$"-5#$-$,-.#(/$$-..63@$$$$9&#$@"-3$>;$(2))23$A2#.#7$8$.3@#3@$BC#($$)23/7=$3#C7$7@&2#7=$()D$A7@7=$3@#7=$A"@$$-)(6*7=$#@.EF$7"-&#G$#-."$*3@"E$$$H)G7$>;%I$8$G-@-$8&$-3-)J727=$-GG7$<:$!I$8$$.*A&#77#G$G-@-$G-2)J$
  15. 15. !"#$%&!!!())!*$++$,-!./&/0!12)!*$++$,-!34$+5!6#&&6/!!789!:$++$,-!/&4;<!=.&$&/!4!345!!>!"?!3464!@,!4-4+5/$/!A&-&46&3!34$+5!!!"<&!B,:+&*C!
  16. 16. !"#$%&&(!!!)*+!#,-.&/0!.$(1!!2++!#,-.&/0!3#$4!!5$6-44$/0!3#$4!0&378!9-(!!*+++!.$(!!):!;<!0&#&!9-(!&/&783!=$/$(&#$0!0&378!!!>(&03?-/&7!0&#&!#-(&=$1!#$6,/3@.$!A!&/&783!#--7!B.#!0-!/-#!C-(D!&#!#,$$!6&7$!E!!>,$!F(-G7$4H!
  17. 17. Scalability
  18. 18. What doesScalable mean ?Operationally:In the past: Works even if data doesn’t fit in main memoryNow: Can make use of 1000s of cheap computersAlgorithmically:In the past: If you have N data items, you must do no morethan Nm operations -- polynomial time algorithmsNow: If you have N data items, you must do no more than Nm/k operations,for some large k -- “polynomial time algorithms must be parallelized”Soon: If you have N data items, you should do no more than N * log(N) operations
  19. 19. ExampleGiven a set of DNA sequences, find allsequences equal to GATTACGATATTA .
  20. 20. GATTACGATATTATACCTGCCGTAA
  21. 21. TACCTGCCGTAA = GATTACGATATTA?GATTACGATATTAtime = 0No.Time = 0
  22. 22. Time = 1CCCCCAATGAC = GATTACGATATTA?GATTACGATATTAtime = 1No.
  23. 23. Time = 17GATTACGATATTA contains GATTACGATATTA?GATTACGATATTAtime = 17Yes!Send it to the output.
  24. 24. GATTACGATATTA40 records, 40 comparionsN records, N comparisonsThe algorithmic complexity is order N: O(N)
  25. 25. GATTACGATATTAAAAATCCTGCAAAACGCCTGCATTTACGTCAATTTTCGTAATTWhat if we sort the sequences?What if we sort the sequences ?
  26. 26. CTGTACACAACCT < GATTACGATATTAGATTACGATATTANo match.Skip to 75% markStart at the 50% marktime = 00% 100%CTGTACACAACCTTime = 0
  27. 27. GGATACACATTTA > GATTACGATATTAGATTACGATATTANo match.0% 100%GGATACACATTTA
  28. 28. GATATTTTAAGC < GATTACGATATTAGATTACGATATTANo match.Skip back to 68.75% mark0% 100%GATATTTTAAGC
  29. 29. GATTACGATATTA = GATTACGATATTAGATTACGATATTA0% 100%Match!Walk through the records until wefail to match.
  30. 30. GATTACGATATTA0% 100%How many comparisons did we do?40 records, only 4 comparisonsN records, log(N) comparisonsThis algorithm is O(log(N)) Far better scalability
  31. 31. RelationalDatabasesRelational Databases•  Databases are good at Needle in Haystack problems:–  Extracting small results from big datasets–  Transparently provide old style scalability–  Your query will always* finish, regardless of dataset size.–  Indexes are easily built and automatically used when appropriateCREATE INDEX seq_idx ON sequence(seq);SELECT seqFROM sequenceWHERE seq = GATTACGATATTA ;
  32. 32. ExampleNew task: Read TrimmingGiven a set of DNA sequences,trim the final n bps of each sequenceGenerate a new dataset
  33. 33. GATTACGATATTATACCTGCCGTAA
  34. 34. TACCTGCCGTAA becomes TACCTtime = 0Time = 0
  35. 35. Time = 1CCCCCAATGAC becomes CCCCCtime = 1
  36. 36. Time = 17GATTACGATATTA becomes GATTAtime = 17
  37. 37. Can we use an index?No. We have to touch every record no matter what.The task is fundamentally O(N)Can we do any better?
  38. 38. Time = 0
  39. 39. Time = 1
  40. 40. Time = 2
  41. 41. Time = 3
  42. 42. Time = 7time = 7How much time did this take?7 cycles40 records, 6 workersO(N/k)time = 7How much time did this ta7 cycles40 records, 6 workersO(N/k)
  43. 43. Schematic of a Parallel “Read Trimming” TaskYou are given short“reads”: genomicsequences about35-75 characters eachDistribute the readsamong k computersf f f f f ff is a function totrim a read; apply itto every itemNow we have a bigdistributed set oftrimmed readsSchematic of a Parallel “Read Trimming” TaskYou are given short“reads”: genomicsequences about35-75 characters eachDistribute the readsamong k computersf f f f f ff is a function totrim a read; apply itto every itemNow we have a bigdistributed set oftrimmed readsSchematic of a Parallel “Read Trimming” TaskYou are given short“reads”: genomicsequences about35-75 characters eachDistribute the readsamong k computersf f f f f ff is a function totrim a read; apply itto every itemNow we have a bigdistributed set oftrimmed readsSchematic of a Parallel “Read Trimming” TaskYou are given short“reads”: genomicsequences about35-75 characters eachDistribute the readsamong k computersf f f f f ff is a function totrim a read; apply itto every itemNow we have a bigdistributed set oftrimmed reads
  44. 44. You are given TIFFimagesDistribute the imagesamong k computersf f f f f ff is a function toconvert TIFF toPNG; apply it toevery itemNow we have a bigdistributed set ofconverted imagesNew Task: Convert 405k TIFF images to PNGhttp://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/Convert 405k TIFF images to PNG
  45. 45. You have sets ofparameters forthousands of smallsimulationsDivide the parametersets among kcomputersf f f f f ff is runs thesimulation andproduces someoutput; apply it toevery itemNow we have a bigdistributed set ofsimulation resultsNew Task: Run thousands of simulationshttp://escience.washington.edu/get-help-now/dave-williams-simulating-muscle-dynamics-cloudRun thousands of simulations
  46. 46. You have millionsof documentsDistribute thedocuments among kcomputersf f f f f ff finds the mostcommon word in asingle documentNow we have a bigdistributed list of(doc_id, word) pairsFind the most common word in each documentFind the most common word in eachdocument
  47. 47. Consider a slightly more general programto compute the word frequency of everyword in a singe documentConsider a slightly more general program to compute theword frequency of every word in a single documentAbridged Declaration of IndependenceA Declaration By the Representatives of the United States of America, in General Congress Assembled.When in the course of human events it becomes necessary for a people to advance from that subordination inwhich they have hitherto remained, and to assume among powers of the earth the equal and independent stationto which the laws of nature and of natures god entitle them, a decent respect to the opinions of mankindrequires that they should declare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent; that from that equalcreation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, andthe pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their justpower from the consent of the governed; that whenever any form of government shall become destructive ofthese ends, it is the right of the people to alter or to abolish it, and to institute new government, laying itsfoundation on such principles and organizing its power in such form, as to them shall seem most likely to effecttheir safety and happiness. Prudence indeed will dictate that governments long established should not bechanged for light and transient causes: and accordingly all experience hath shewn that mankind are moredisposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they areaccustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuinginvariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, tothrow off such government and to provide new guards for future security. Such has been the patient sufferingsof the colonies; and such is now the necessity which constrains them to expunge their former systems ofgovernment. the history of his present majesty is a history of unremitting injuries and usurpations, among whichno one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct objectthe establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candidworld, for the truth of which we pledge a faith yet unsullied by falsehood.(people, 2)(government, 6)(assume, 1)(history, 2)…
  48. 48. You have millionsof documentsDistribute thedocuments among kcomputersf f f f f fFor each documentf returns a set of(word, freq) pairsCompute the word frequency of 5M documentsNow we have a bigdistributed list ofsets of word freqs.
  49. 49. There’s a pattern...There’s a pattern here….•  A function that maps a read to a trimmed read•  A function that maps a TIFF image to a PNG image•  A function that maps a set of parameters to a simulation result•  A function that maps a document to its most common word•  A function that maps a document to a histogram of wordfrequencies
  50. 50. What if we want to compute the wordfrequency across all documents?US Constitution Declaration ofIndependenceArticles ofConfederation(people, 78)(government, 123)(assume, 23)(history, 38)…
  51. 51. You have millionsof documentsDistribute thedocuments amongk computersmap map map map map mapFor each document,return a set of (word,freq) pairsCompute the word frequency across 5M documentsNow what?•  But we don’t want a bunch of little histograms –we want one big histogram.•  How can we make sure that a single computerhas access to every occurrence of a given wordregardless of which document it appeared in?Compute the word frequency across 5M docs
  52. 52. Compute the word frequency across 5M docsDistribute thedocuments among kcomputersmap map map map map mapFor each document,return a set of(word, freq) pairsNow we have a bigdistributed list ofsets of word freqs.reduce reduce reduce reduceNow just count theoccurrences of each word44 3We have ourdistributed histogramCompute the word frequency across 5M documents
  53. 53. Compute the word frequency across 5M docs5/15/13 Bill Howe, eScience Institute 11Some distributed algorithm…Map(Shuffle)Reduce
  54. 54. Compute the word frequency across 5M docs5/15/13 Bill Howe, eScience Institute 12MapReduce Programming Model•  Input & Output: each a set of key/value pairs•  Programmer specifies two functions:–  Processes input key/value pair–  Produces set of intermediate pairs–  Combines all intermediate values for a particular key–  Produces a set of merged output values (usually just one)map (in_key, in_value) -> list(out_key, intermediate_value)reduce (out_key, list(intermediate_value)) -> list(out_value)Inspired by primitives from functional programminglanguages such as Lisp, Scheme, and Haskellslide source: Google, Inc.
  55. 55. Compute the word frequency across 5M docs5/15/13 Bill Howe, eScience Institute 13Example: What does this do?map(String input_key, String input_value):// input_key: document name// input_value: document contentsfor each word w in input_value:EmitIntermediate(w, 1);reduce(String output_key, Iterator intermediate_values):// output_key: word// output_values: ????int result = 0;for each v in intermediate_values:result += v;Emit(result);slide source: Google, Inc.
  56. 56. Example3/12/09 Bill Howe, eScience Institute 14Abridged Declaration of IndependenceA Declaration By the Representatives of the United States of America, in General Congress Assembled.When in the course of human events it becomes necessary for a people to advance from that subordination inwhich they have hitherto remained, and to assume among powers of the earth the equal and independent stationto which the laws of nature and of natures god entitle them, a decent respect to the opinions of mankindrequires that they should declare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent; that from that equalcreation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, andthe pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their justpower from the consent of the governed; that whenever any form of government shall become destructive ofthese ends, it is the right of the people to alter or to abolish it, and to institute new government, laying itsfoundation on such principles and organizing its power in such form, as to them shall seem most likely to effecttheir safety and happiness. Prudence indeed will dictate that governments long established should not bechanged for light and transient causes: and accordingly all experience hath shewn that mankind are moredisposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they areaccustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuinginvariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, tothrow off such government and to provide new guards for future security. Such has been the patient sufferingsof the colonies; and such is now the necessity which constrains them to expunge their former systems ofgovernment. the history of his present majesty is a history of unremitting injuries and usurpations, among whichno one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct objectthe establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candidworld, for the truth of which we pledge a faith yet unsullied by falsehood.Example: Document Processing
  57. 57. Word length histogram example3/12/09 Bill Howe, eScience Institute 14Abridged Declaration of IndependenceA Declaration By the Representatives of the United States of America, in General Congress Assembled.When in the course of human events it becomes necessary for a people to advance from that subordination inwhich they have hitherto remained, and to assume among powers of the earth the equal and independent stationto which the laws of nature and of natures god entitle them, a decent respect to the opinions of mankindrequires that they should declare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent; that from that equalcreation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, andthe pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their justpower from the consent of the governed; that whenever any form of government shall become destructive ofthese ends, it is the right of the people to alter or to abolish it, and to institute new government, laying itsfoundation on such principles and organizing its power in such form, as to them shall seem most likely to effecttheir safety and happiness. Prudence indeed will dictate that governments long established should not bechanged for light and transient causes: and accordingly all experience hath shewn that mankind are moredisposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they areaccustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuinginvariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, tothrow off such government and to provide new guards for future security. Such has been the patient sufferingsof the colonies; and such is now the necessity which constrains them to expunge their former systems ofgovernment. the history of his present majesty is a history of unremitting injuries and usurpations, among whichno one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct objectthe establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candidworld, for the truth of which we pledge a faith yet unsullied by falsehood.Example: Document Processing3/12/09 Bill Howe, eScience Institute 15Abridged Declaration of IndependenceA Declaration By the Representatives of the United States of America, in General Congress Assembled.When in the course of human events it becomes necessary for a people to advance from that subordination inwhich they have hitherto remained, and to assume among powers of the earth the equal and independent stationto which the laws of nature and of natures god entitle them, a decent respect to the opinions of mankindrequires that they should declare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent; that from that equalcreation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, andthe pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their justpower from the consent of the governed; that whenever any form of government shall become destructive ofthese ends, it is the right of the people to alter or to abolish it, and to institute new government, laying itsfoundation on such principles and organizing its power in such form, as to them shall seem most likely to effecttheir safety and happiness. Prudence indeed will dictate that governments long established should not bechanged for light and transient causes: and accordingly all experience hath shewn that mankind are moredisposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they areaccustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuinginvariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, tothrow off such government and to provide new guards for future security. Such has been the patient sufferingsof the colonies; and such is now the necessity which constrains them to expunge their former systems ofgovernment. the history of his present majesty is a history of unremitting injuries and usurpations, among whichno one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct objectthe establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candidworld, for the truth of which we pledge a faith yet unsullied by falsehood.Example: Word length histogramHow many big , medium , and small words are used?
  58. 58. Word length histogram exampleAbridged Declaration of IndependenceA Declaration By the Representatives of the United States of America, in GeneralCongress Assembled.When in the course of human events it becomes necessary for a people to advance fromthat subordination in which they have hitherto remained, and to assume among powers ofthe earth the equal and independent station to which the laws of nature and of naturesgod entitle them, a decent respect to the opinions of mankind requires that they shoulddeclare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent;that from that equal creation they derive rights inherent and inalienable, among which arethe preservation of life, and liberty, and the pursuit of happiness; that to secure theseends, governments are instituted among men, deriving their just power from the consentof the governed; that whenever any form of government shall become destructive of theseends, it is the right of the people to alter or to abolish it, and to institute new government,laying its foundation on such principles and organizing its power in such form, as tothem shall seem most likely to effect their safety and happiness. Prudence indeed willdictate that governments long established should not be changed for light and transientcauses: and accordingly all experience hath shewn that mankind are more disposed tosuffer while evils are sufferable, than to right themselves by abolishing the forms towhich they are accustomed. But when a long train of abuses and usurpations, begun at adistinguished period, and pursuing invariably the same object, evinces a design to reducethem to arbitrary power, it is their right, it is their duty, to throw off such government andto provide new guards for future security. Such has been the patient sufferings of thecolonies; and such is now the necessity which constrains them to expunge their formersystems of government. the history of his present majesty is a history of unremittinginjuries and usurpations, among which no one fact stands single or solitary to contradictthe uniform tenor of the rest, all of which have in direct object the establishment of anabsolute tyranny over these states. To prove this, let facts be submitted to a candid world,for the truth of which we pledge a faith yet unsullied by falsehood.Big = Yellow = 10+ lettersMedium = Red = 5..9 lettersSmall = Blue = 2..4 lettersTiny = Pink = 1 letterExample: Word length histogram
  59. 59. Word length histogram exampleAbridged Declaration of IndependenceA Declaration By the Representatives of the United States of America, in GeneralCongress Assembled.When in the course of human events it becomes necessary for a people to advance fromthat subordination in which they have hitherto remained, and to assume among powers ofthe earth the equal and independent station to which the laws of nature and of naturesgod entitle them, a decent respect to the opinions of mankind requires that they shoulddeclare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent;that from that equal creation they derive rights inherent and inalienable, among which arethe preservation of life, and liberty, and the pursuit of happiness; that to secure theseends, governments are instituted among men, deriving their just power from the consentof the governed; that whenever any form of government shall become destructive of theseends, it is the right of the people to alter or to abolish it, and to institute new government,laying its foundation on such principles and organizing its power in such form, as tothem shall seem most likely to effect their safety and happiness. Prudence indeed willdictate that governments long established should not be changed for light and transientcauses: and accordingly all experience hath shewn that mankind are more disposed tosuffer while evils are sufferable, than to right themselves by abolishing the forms towhich they are accustomed. But when a long train of abuses and usurpations, begun at adistinguished period, and pursuing invariably the same object, evinces a design to reducethem to arbitrary power, it is their right, it is their duty, to throw off such government andto provide new guards for future security. Such has been the patient sufferings of thecolonies; and such is now the necessity which constrains them to expunge their formersystems of government. the history of his present majesty is a history of unremittinginjuries and usurpations, among which no one fact stands single or solitary to contradictthe uniform tenor of the rest, all of which have in direct object the establishment of anabsolute tyranny over these states. To prove this, let facts be submitted to a candid world,for the truth of which we pledge a faith yet unsullied by falsehood.Split the document intochunks and process eachchunk on a differentcomputerChunk 1Chunk 2Example: Word length histogram
  60. 60. Word length histogram example(yellow, 20)(red, 71)(blue, 93)(pink, 6 )Abridged Declaration of IndependenceA Declaration By the Representatives of the United States of America, in GeneralCongress Assembled.When in the course of human events it becomes necessary for a people to advance fromthat subordination in which they have hitherto remained, and to assume among powers ofthe earth the equal and independent station to which the laws of nature and of naturesgod entitle them, a decent respect to the opinions of mankind requires that they shoulddeclare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent;that from that equal creation they derive rights inherent and inalienable, among which arethe preservation of life, and liberty, and the pursuit of happiness; that to secure theseends, governments are instituted among men, deriving their just power from the consentof the governed; that whenever any form of government shall become destructive of theseends, it is the right of the people to alter or to abolish it, and to institute new government,laying its foundation on such principles and organizing its power in such form, as tothem shall seem most likely to effect their safety and happiness. Prudence indeed willdictate that governments long established should not be changed for light and transientcauses: and accordingly all experience hath shewn that mankind are more disposed tosuffer while evils are sufferable, than to right themselves by abolishing the forms towhich they are accustomed. But when a long train of abuses and usurpations, begun at adistinguished period, and pursuing invariably the same object, evinces a design to reducethem to arbitrary power, it is their right, it is their duty, to throw off such government andto provide new guards for future security. Such has been the patient sufferings of thecolonies; and such is now the necessity which constrains them to expunge their formersystems of government. the history of his present majesty is a history of unremittinginjuries and usurpations, among which no one fact stands single or solitary to contradictthe uniform tenor of the rest, all of which have in direct object the establishment of anabsolute tyranny over these states. To prove this, let facts be submitted to a candid world,for the truth of which we pledge a faith yet unsullied by falsehood.Map Task 1(204 words)Map Task 2(190 words)(key, value)(yellow, 17)(red, 77)(blue, 107)(pink, 3)Example: Word length histogram
  61. 61. Word length histogram example3/12/09 Bill Howe, eScience Institute 19(yellow, 17)(red, 77)(blue, 107)(pink, 3)(yellow, 20)(red, 71)(blue, 93)(pink, 6 )Reduce tasks(yellow, 17)(yellow, 20)(red, 77)(red, 71)(blue, 93)(blue, 107)(pink, 6)(pink, 3)Example: Word length histogramA Declaration By the Representatives of the United States of America, in GeneralCongress Assembled.When in the course of human events it becomes necessary for a people to advance fromthat subordination in which they have hitherto remained, and to assume among powers ofthe earth the equal and independent station to which the laws of nature and of naturesgod entitle them, a decent respect to the opinions of mankind requires that they shoulddeclare the causes which impel them to the change.We hold these truths to be self-evident; that all men are created equal and independent;that from that equal creation they derive rights inherent and inalienable, among which arethe preservation of life, and liberty, and the pursuit of happiness; that to secure theseends, governments are instituted among men, deriving their just power from the consentof the governed; that whenever any form of government shall become destructive of theseends, it is the right of the people to alter or to abolish it, and to institute new government,laying its foundation on such principles and organizing its power in such form, as tothem shall seem most likely to effect their safety and happiness. Prudence indeed willdictate that governments long established should not be changed for light and transientcauses: and accordingly all experience hath shewn that mankind are more disposed tosuffer while evils are sufferable, than to right themselves by abolishing the forms towhich they are accustomed. But when a long train of abuses and usurpations, begun at adistinguished period, and pursuing invariably the same object, evinces a design to reducethem to arbitrary power, it is their right, it is their duty, to throw off such government andto provide new guards for future security. Such has been the patient sufferings of thecolonies; and such is now the necessity which constrains them to expunge their formersystems of government. the history of his present majesty is a history of unremittinginjuries and usurpations, among which no one fact stands single or solitary to contradictthe uniform tenor of the rest, all of which have in direct object the establishment of anabsolute tyranny over these states. To prove this, let facts be submitted to a candid world,for the truth of which we pledge a faith yet unsullied by falsehood.Map task 1Map task 2Shuffle step(yellow, 37)(red, 148)(blue, 200)(pink, 9)
  62. 62. MR Phases!"#$%&(")$*+&,&MR Phases•  Each Map and Reduce task has multiple phases:
  63. 63. MR PhasesMAP REDUCE(w1,1)(w2,1)(w1,1)…(w1,1)(w2,1)…(did1,v1)(did2,v2)(did3,v3). . . .(w1, (1,1,1,…,1))(w2, (1,1,…))(w3,(1…))…………(w1, 25)(w2, 77)(w3, 12)…………Shuffle!"#$%&()%"**$"(+%,&-.$/%%012%3,%45+,%+$3)%6&78%9:;%
  64. 64. MR PhasesHadoop inOne Slidesrc: Huy Vo,NYU Poly
  65. 65. Taxonomy ofParallel Architectures5/15/13 Bill Howe, eScience Institute 22Taxonomy of Parallel ArchitecturesEasiest to program,but $$$$Scales to 1000s of computers
  66. 66. 5/15/13 Bill Howe, UW 32
  67. 67. 5/15/13 Bill Howe, UW 25
  68. 68. 5/15/13 Bill Howe, eScience Institute 26Large-Scale Data Processing•  Many tasks process big data, produce big data•  Want to use hundreds or thousands of CPUs–  ... but this needs to be easy–  Parallel databases exist, but they are expensive, difficult to setup, and do not necessarily scale to hundreds of nodes.•  MapReduce is a lightweight framework, providing:–  Automatic parallelization and distribution–  Fault-tolerance–  I/O scheduling–  Status and monitoring
  69. 69. 5/15/13 Bill Howe, eScience Institute 35MapReduce Contemporaries•  Dryad (Microsoft)–  Relational Algebra•  Pig (Yahoo)–  Near Relational Algebra over MapReduce•  HIVE (Facebook)–  SQL over MapReduce•  Cascading–  Relational Algebra•  Clustera–  U of Wisconsin•  Hbase–  Indexing on HDFSMapReduce Contemporaries
  70. 70. Talk is cheap,Show me the code
  71. 71. ExamplesMore Examples: Build an Inverted Index5/15/13 Bill Howe, UW 14Input:tweet1, (“I love pancakes for breakfast”)tweet2, (“I dislike pancakes”)tweet3, (“What should I eat for breakfast?”)tweet4, (“I love to eat”)Desired output:“pancakes”, (tweet1, tweet2)“breakfast”, (tweet1, tweet3)“eat”, (tweet3, tweet4)“love”, (tweet1, tweet4)…
  72. 72. ExamplesMore Examples: Relational Join5/15/13 Bill Howe, UW 15Name SSNSue 999999999Tony 777777777EmpSSN DepName999999999 Accounts777777777 Sales777777777 MarketingEmplyee ⨝ Assigned DepartmentsEmployeeAssigned DepartmentsName SSN EmpSSN DepNameSue 999999999 999999999 AccountsTony 777777777 777777777 SalesTony 777777777 777777777 Marketing
  73. 73. ExamplesRelational Join in MapReduce: Before Map Phase5/15/13 Bill Howe, UW 16Name SSNSue 999999999Tony 777777777EmpSSN DepName999999999 Accounts777777777 Sales777777777 MarketingEmployeeAssigned DepartmentsKey idea: Lump all the tuplestogether into one datasetEmployee, Sue, 999999999Employee, Tony, 777777777Department, 999999999, AccountsDepartment, 777777777, SalesDepartment, 777777777, MarketingWhat is this for?
  74. 74. ExamplesRelational Join in MapReduce: Map Phase5/15/13 Bill Howe, UW 17Employee, Sue, 999999999Employee, Tony, 777777777Department, 999999999, AccountsDepartment, 777777777, SalesDepartment, 777777777, Marketingkey=999999999, value=(Employee, Sue, 999999999)key=777777777, value=(Employee, Tony, 777777777)key=999999999, value=(Department, 999999999, Accounts)key=777777777, value=(Department, 777777777, Sales)key=777777777, value=(Department, 777777777, Marketing)why do we use this as the key?
  75. 75. ExamplesRelational Join in MapReduce: Reduce Phase5/15/13 Bill Howe, UW 18key=999999999, values=[(Employee, Sue, 999999999),(Department, 999999999, Accounts)]key=777777777, values=[(Employee, Tony, 777777777),(Department, 777777777, Sales),(Department, 777777777, Marketing)]Sue, 999999999, 999999999, AccountsTony, 777777777, 777777777, SalesTony, 777777777, 777777777, Marketing
  76. 76. ExamplesRelational Join in MapReduce, again5/15/13 Bill Howe, Data Science, Autumn 2012 19Order(orderid, account, date)1, aaa, d12, aaa, d23, bbb, d3LineItem(orderid, itemid, qty)1, 10, 11, 20, 32, 10, 52, 50, 1003, 20, 1tagged withrelation nameOrder1, aaa, d1 ! 1 : “Order”, (1,aaa,d1)2, aaa, d2 ! 2 : “Order”, (2,aaa,d2)3, bbb, d3 ! 3 : “Order”, (3,bbb,d3)Line1, 10, 1 ! 1 : “Line”, (1, 10, 1)1, 20, 3 ! 1 : “Line”, (1, 20, 3)2, 10, 5 ! 2 : “Line”, (2, 10, 5)2, 50, 100 ! 2 : “Line”, (2, 50, 100)3, 20, 1 ! 3 : “Line”, (3, 20, 1)MapReducer for key 1“Order”, (1,aaa,d1)“Line”, (1, 10, 1)“Line”, (1, 20, 3)(1, aaa, d1, 1, 10, 1)(1, aaa, d1, 1, 20, 3)
  77. 77. ExamplesSimple Social Network Analysis:Count FriendsJim, SueSue, JimLin, JoeJoe, LinJim, KaiKai, JimJim, LinLin, JimInput Desired OutputJim, 3Lin, 2Sue, 1Kai, 1Joe, 1Jim, 1Sue, 1Lin, 1Joe, 1Jim, 1Kai, 1Jim, 1Lin, 1MAP REDUCEJim, (1, 1, 1)Lin, (1, 1)Sue, (1)Kai, (1)Joe, (1)SHUFFLE
  78. 78. ExamplesMatrix Multiplication!" #" $" %&"" &" %#" !"!" %&"$" #"%#" %&"(" $"X =!" %)"&#" $"
  79. 79. ExamplesMatrix Multiply in MapReduceC = A X BA has dimensions L,MB has dimensions M,N•  In the map phase:–  for each element (i,j) of A, emit ((i,k), A[i,j]) for k in 1..N–  for each element (j,k) of B, emit ((i,k), B[j,k]) for i in 1..L•  In the reduce phase, emit–  key = (i,k)–  value = Sumj (A[i,j] * B[j,k])5/15/13 Bill Howe, Data Science, Autumn 2012 21
  80. 80. ExamplesABXA B=•  One reducer per output cell
  81. 81. Meeting Hadoop
  82. 82. Cluster Computing•  Large number of commodity servers,connected by high speed, commoditynetwork•  Rack: holds a small number of servers•  Data center: holds many racks!"#$%&(#))%&*"&+!"#$%&()*+(",-(+./0"123145"67$%8("1",-(+./09":;1;<"/0=>4""/?"#@0@0A"/?"#$99@B("C$8$9(89;"D>"E$F$$G$0"$0)"H==G$0""),,(-.,(/0123,(456,7852(9,:3;-,(
  83. 83. Cluster Computing•  Massive parallelism:– 100s, or 1000s, or 10000s servers– Many hours•  Failure:– If medium-time-between-failure is 1 year– Then 10000 servers have one failure / hour24slide src: Dan Suciu and Magda Balazinska
  84. 84. Distributed File System (DFS)•  For very large files: TBs, PBs•  Each file is partitioned into chunks,typically 64MB•  Each chunk is replicated several times(!3), on different racks, for faulttolerance•  Implementations:– Google’s DFS: GFS, proprietary– Hadoop’s DFS: HDFS, open source25slide src: Dan Suciu and Magda Balazinska
  85. 85. Hadoop
  86. 86. !"#$%&%(#)**+%,%!"#$%&"#()*(+(%"(&"#(,-.%/#-/0,#12,"(,3#4-("#*%4/,%&0/#*-#$."5,2-#44%)32)()#/62,721-2882*%/9.(,*6(,#:
  87. 87. !"#$%&()"*&+&*",)-&$(!"#$%%!&((#)&#&*!+)(,-%..//$)0(,123(),45(%#4)0&)6(%&4(,)07&4(&$)484(&94:1(%*#,7&0))6432",)-&$(4;#%))%#4)0&)6(%&2177&4(104(#**#<)0=#>))"?300107@///4)64&,+&,4)0=#>))"
  88. 88. !"#$%&()*+),)!"##$%&"()*+,-.&/01&*23$4,5%&678193:&1:08&5&93;/"##$%&<= )&,819>;$3;&-&./0&)01)2.(00$)$&034/)
  89. 89. !"#$%&"#"$$()&*$+"$,&-../$0"#-$1.2$3+456.%+&7$-&*&$&8&79"+"$:#;*$<+8+84=$/&>#28"$"#&2%)$?&%)+8#$7.4$&8&79"+"$@#.A"/&%+B&7$&8&79"+"$@#8.<#$C8&79"+"$D204$D+"%.E#29$F2&0-$&8-$%.</7+&8%#$<&8&4#<#8*$G+-#.$&8-$+<&4#$&8&79"+"$:2#8-$C8&79"+"$
  90. 90. !"#$%&&$()*##+$,$-#./$-0&1$2$34)5#.637$$2$8)9:##;$$2$<##/-$$2$=>?$$2$@0&.A$2$B)&1CD4$$2$EF$G#H;$I04&$$2$G)"##J$$2$IF0KH$2$B0.;*$0.$$$$$
  91. 91. !"#$$%&($)*)+,&-&!"#$%& #()%*+,%-!./0&12+34#&!56%& 758&96:;&<;;=%%(%:&0>;;(&/?+@%& A;B5%& C25::&
  92. 92. !"#$$%&#()*+,-$.&/&
  93. 93. !"#$%&(&!""#$%&()*+&,#-&.&/&010#
  94. 94. !"#$%&()&&!""#!"#$$%&#()*+,)-#&./-&(0()-1&
  95. 95. !"#$$%&$(%$)*)+,&-&.$/&+0"12*0&3",2&30"12*0&4"(*&4$#*&5"%&6*#71*&!89:&&&&
  96. 96. MapReducehttp://labs.google.com/papers/mapreduce.html
  97. 97. Exemplo de MapReducefrom mrjob.job import MRJobimport reWORD_RE = re.compile(r"[w]+")class MRWordFreqCount(MRJob):def mapper(self, _, line):for word in WORD_RE.findall(line):yield (word.lower(), 1)def reducer(self, word, counts):yield (word, sum(counts))if __name__ == __main__:MRWordFreqCount().run()
  98. 98. Projeto MrJobCriado pela Equipe de Engenharia doYelpTotalmente Open-SourceTodo em PythonUtiliza Map-Reduce para ProcessamentoPermite rodar tanto no Amazon EMR como no Hadoop
  99. 99. Objetivos do MrJobsSe você quer aprender MapReduce, ele é para vocêSe você tem um problema cavalar e precisa de muitoprocessamento e não está afim de mexer em HadoopSe você já tem um cluster Hadoop e quer rodar scriptsPythonSe você quer migrar seu código Python do Hadooppara o EMRSe você não quer escrever Python (Impossível!), não épara você!
  100. 100. Passos importantessudo easy_install mrjob
  101. 101. Vamos a uma demo...Texto da posse de Obama em 2009.
  102. 102. Desempenho MapReduce
  103. 103. Mais informaçõeshttp://packages.python.org/mrjob/https://github.com/Yelp/mrjob
  104. 104. Distributed Computing with mrJobhttps://github.com/Yelp/mrjobElsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
  105. 105. Distributed Computing with mrJobhttps://github.com/Yelp/mrjobElsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
  106. 106. Atepassar Recommendationshttp://atepassar.comProblema: Como recomendar novos amigos ?
  107. 107. Atepassar Recommendationshttp://atepassar.com
  108. 108. Atepassar Recommendationshttp://atepassar.commarcel;jonas,maria,jose,amandamaria;carol,fabiola,amanda,marcelamanda;paula,patricia,maria,marcelcarol;maria,jose,patriciafabiola;mariapaula;fabio,amandapatricia;amanda,caroljose;marcel,caroljonas;marcel,fabiofabio;jonas,paulacarla"marcel" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["patricia", 1], ["paula", 1]]"maria" [["jose", 2], ["patricia", 2], ["jonas", 1], ["paula", 1]]"patricia" [["maria", 2], ["jose", 1], ["marcel", 1], ["paula", 1]]"paula" [["jonas", 1], ["marcel", 1], ["maria", 1], ["patricia", 1]]"amanda" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["jonas", 1], ["jose", 1]]"carol" [["amanda", 2], ["marcel", 2], ["fabiola", 1]]"fabio" [["amanda", 1], ["marcel", 1]]"fabiola" [["amanda", 1], ["carol", 1], ["marcel", 1]]"jonas" [["amanda", 1], ["jose", 1], ["maria", 1], ["paula", 1]]"jose" [["maria", 2], ["amanda", 1], ["jonas", 1], ["patricia", 1]]$python friends_recommender.py - r emr --num-ec2-instances 5 facebook_data.csv > output.dat
  109. 109. Atepassar Recommendationshttp://atepassar.commarcel;jonas,maria,jose,amandamaria;carol,fabiola,amanda,marcelamanda;paula,patricia,maria,marcelcarol;maria,jose,patriciafabiola;mariapaula;fabio,amandapatricia;amanda,caroljose;marcel,caroljonas;marcel,fabiofabio;jonas,paulacarla"marcel" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["patricia", 1], ["paula", 1]]"maria" [["jose", 2], ["patricia", 2], ["jonas", 1], ["paula", 1]]"patricia" [["maria", 2], ["jose", 1], ["marcel", 1], ["paula", 1]]"paula" [["jonas", 1], ["marcel", 1], ["maria", 1], ["patricia", 1]]"amanda" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["jonas", 1], ["jose", 1]]"carol" [["amanda", 2], ["marcel", 2], ["fabiola", 1]]"fabio" [["amanda", 1], ["marcel", 1]]"fabiola" [["amanda", 1], ["carol", 1], ["marcel", 1]]"jonas" [["amanda", 1], ["jose", 1], ["maria", 1], ["paula", 1]]"jose" [["maria", 2], ["amanda", 1], ["jonas", 1], ["patricia", 1]]$python friends_recommender.py - r emr --num-ec2-instances 5 facebook_data.csv > output.datmarcel;jonas,maria,jose,amandacarol;maria,jose,patriciafabio;jonas,paula...
  110. 110. Atepassar Recommendationshttp://atepassar.comfsim(u,i) = w1 f1(u,i) + w2 f2(u,i) + b,https://gist.github.com/3970945#file_atepassar_recommender.py
  111. 111. Atepassar Recommendationshttp://atepassar.comCelery - para agendamento dos jobs coletores eexecutores.mrJob - para mapreduce e acesso ao HadoopMongoDb - para armazenamento das recomendaçõesBoto - acesso aos files do S3.
  112. 112. A melhor parte!
  113. 113. Projetos interessantesEstrutura de dados para manipulação rápida- slicing- indexing- subsetingHandling missing dataAgregações, Séries TemporaisPandas: a data analysis library for Python,poised to give R a run for its money… http://pandas.pydata.org/
  114. 114. Projetos interessantesOutro framework para computação distribuída comPython com MapReduce.Criado pelo Instituto Nokia.Backend dele é escrito em Erlang (funcional,concorrente e bem escalável!)Não utiliza o FileSystem mais usado HDFS e sim umnovo padrão por eles (DDFS).http://discoproject.org/
  115. 115. Projetos interessanteshttp://api.mongodb.org/python/2.0/examples/map_reduce.htmlPython & MongoDbMongoDb - Banco de Dados Não relacional (NoSQL)Possui suporte nativo built-in para fazer MapReduce.Escrever o código em JS e não é muito legível e fica preso aoMongo ...>>> reduce = Code("function (key, values) {"... " var total = 0;"... " for (var i = 0; i < values.length; i++) {"... " total += values[i];"... " }"... " return total;"... "}")
  116. 116. Projetos interessanteshttps://github.com/klbostee/dumbo/wiki/Short-tutorialDumboUma das primeiras bibliotecas em cima doMapReduce e Python.Complicado para começar e está desatualizada :(
  117. 117. Projetos interessantesUm wrapper em Python em cima do Hadooppara computação distribuída.http://pydoop.sourceforge.net/docs/index.htmlLegal, mas dá um trabalho para configurar.
  118. 118. Projetos interessantesMapreduce com Python na Google AppEnginehttps://developers.google.com/appengine/docs/python/dataprocessing/Ainda experimental e fica “preso” à plataformaAppEngine.
  119. 119. Projetos interessantesAlgoritmos de aprendizagem de máquinaSupervisionados & Não supervisionadosPré-processamento, extração de dadosAvaliação de classificadores, Pipeline,seleção de atributos.http://scikit-learn.org/stable/
  120. 120. Projetos interessantesProcessamento de linguagem naturalVárias ferramentas para tokenização,pos tagging, named entity recognition,classificadores, etc.Vários corpus disponíveis!http://nltk.org/
  121. 121. Projetos interessantesProcessamento de linguagem naturalVárias ferramentas para tokenização,pos tagging, named entity recognition,classificadores, etc.Vários corpus disponíveis!http://nltk.org/Fiquem de olho...Pipeline for distributed Natural Language Processing, made in Pythonhttps://github.com/NAMD/pypln
  122. 122. Projetos interessantesOur Project’s Home Pagehttp://muricoca.github.com/crab
  123. 123. Future ReleasesPlanned Release 0.13New home for python-recsys:https://github.com/python-recsys/crabPlanned Release 0.14Support to Item-Based Recommenders using MapReduce with MrJobNew commiters: vinnigracindo, ocelma, fcurella
  124. 124. Join us!1. Read our Wiki Pagehttps://github.com/muricoca/crab/wiki/Developer-Resources2. Check out our current sprints and open issueshttps://github.com/muricoca/crab/issues3. Forks, Pull Requests mandatory4. Join us at irc.freenode.net #muricoca or at ourdiscussion listhttp://groups.google.com/group/scikit-crab
  125. 125. Vários outros ...MatplotlibStatsModelsScipyNumpyMatplotlibNetworkXPyBrainmilkOrange
  126. 126. Livros recomendados
  127. 127. Livros recomendadoshttp://shop.oreilly.com/product/0636920022640.do?cmp=il-radar-ebooks-big-data-now-radarFor free...
  128. 128. Livros recomendadosFor free...http://infolab.stanford.edu/~ullman/mmds/book.pdf
  129. 129. Artigos recomendadosFor free...http://aimotion.blogspot.com.br/2012/10/atepassar-recommendations-recommending.htmlhttp://aimotion.blogspot.com.br/2012/08/introduction-to-recommendations-with.html
  130. 130. Artigos recomendadosFor free...http://aimotion.blogspot.com.br/2012/10/atepassar-recommendations-recommending.htmlhttp://aimotion.blogspot.com.br/2012/08/introduction-to-recommendations-with.htmlhttps://github.com/marcelcaraciolo/recsys-mapreduce-mrjob
  131. 131. BigData with PythonA gentle and simple introductionMarcel Caraciolo@marcelcaracioloDeveloper, Cientist, contributor to the Crab recsys project,works with Python for 6 years, interested at mobile,education, machine learning and dataaaaa!Recife, Brazil - http://aimotion.blogspot.com
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×