The Big Data Developer (@pavlobaron)

Slides of the talk I gave at BigDataCon 2012 in Mayence, Germany, and at SEACON 2012 in Hamburg.

Transcript

  • 1. Big Big Big Big Big data developer
  • 2. Pavlo Baron, pavlo.baron@codecentric.de, @pavlobaron
  • 3. Disclaimer: I will not flame-war, rant, bash or do anything similar around concrete tools. I'm tired of doing that
  • 4. Hey, dude. I'm a big data developer. We have enormous data. I continuously read articles on highscalability.com
  • 5. Big data is easy. It's like normal data, but, well, even bigger.
  • 6. Aha. So this is not big, right?
  • 7. That's big, huh?
  • 8. Petabytes of data every hour on different continents, with complex relations and with the need to analyze them in almost real time for anomalies and to visualize them for your management
  • 9. You can easily call this big data. Everything below this you can call Mickey Mouse data
  • 10. Good news is: you can intentionally grow – collect more data. And you should!
  • 11. np, dude! Data is data. Same principles
  • 12. Really? Let's consider storage, ok?
  • 13. np, dude! Storage is just a database
  • 14. Really? Storage capacity of one single box, no matter how big it is, is limited
  • 15. np, dude! You're talking about SQL, right? SQL is boring grandpa stuff and doesn't scale
  • 16. Oh. When you expect big data, you need to scale very far and thus build on distribution, combining a theoretically unlimited number of machines into one single distributed storage
  • 17. There is no way around it. Except you invent the BlackHoleDB
  • 18. np, dude! NoSQL scales and is cool. They achieve this through sharding, ya know? Sharding hides this distribution stuff
  • 19. Hem. Building upon distribution is much harder than anything you've seen or done before
  • 20. Except you fed a crowd with 7 loaves of bread and walked upon the water
  • 21. np, dude! From the three of CAP, I'll just pick two. As easy as this
  • 22. The only thing that is absolutely certain about distributed systems is that parts of them will fail and you will have no idea where and what the hell is going on
  • 23. So your P must be a given in a distributed system. And you want to play with C vs. A, not just take black or white
  • 24. np, dude! Sharding works seamlessly. I don't need to take care of anything
  • 25. Seriously? For example, one of the hardest challenges with big data is to distribute/shard parts over several machines while still having fast traversals and reads, thus keeping related data together. Valid for graph and any other data store, also NoSQL, kind of
  • 26. Another hard challenge with sharding is to avoid naive hashing. Naive hashing would make you depend on the number of nodes and would not allow you to easily add or remove nodes to/from the system (see the consistent-hashing sketch after the transcript)
  • 27. And still, the trade-off between data locality, consistency, availability, read/write/search speed, latency etc. is hard
  • 28. np, dude! NoSQL would write asynchronously and do map/reduce to find data
  • 29. Of course. You will love eventual consistency, especially when you need a transaction around a complex money transfer
  • 30. np, dude! I don't have money to transfer. But I need to store lots of data. I can throw any amount of it at NoSQL, and it will just work
  • 31. Really? So you'd just throw something into your database and hope it works?
  • 32. What if you throw and miss the target?
  • 33. Data locality, redundancy, consistent hashing and eventual consistency, combined with use-case-driven storage design, are key principles in succeeding with a huge distributed data storage. That's big data development
  • 34. How about data provisioning?
  • 35. np, dude! It's the database being the bottleneck, not my web servers
  • 36. When you have thousands or millions of parallel requests per second, begging for data, the first mile will (also) quickly become the bottleneck. Requests will get queued and discarded as soon as your server doesn't bring data fast enough through the pipe
  • 37. np, dude! I'll get me some bad-ass sexy hardware
  • 38. I bet you will. But under high load, your hardware will more or less quickly start to crack
  • 39. You'll burn your hard disks, boards and cards. And wires. And you'll heat up to a maximum
  • 40. It's not about sexy hardware, but about being able to quickly replace it. Ideally while the system keeps running
  • 41. But anyway. Keeping the first mile scalable and fast would lead to some expensive network infrastructure. You need to get the maximum out of your servers in order to reduce their number
  • 42. np, dude! I will use an event-driven, C10K-problem-solving, awesome web server. Or I'll write one on my own
  • 43. Maybe. But when your users are coming from all over the world, it won't help you much, since the network latency from them to your server will kill them
  • 44. You would have to go for a CDN one day, statically pre-computing content. You would use their infrastructure and reduce the number of hits on your own servers to a minimum
  • 45. np, dude! I'll push my whole platform out to the cloud. It's even more flexible and scales like hell
  • 46. Well. You cannot reliably predict on which physical machine, and actually how close to the data, your program will run. Whenever virtual machines or storage fragments get moved, your world stops
  • 47. You can easily force data locality and shorter stop-the-world phases by paying higher bills
  • 48. Data locality, geographic spatiality, dedicated virtualization and content pre-computability, combined with use-case-driven cloudification, are key principles in succeeding with provisioning of huge data amounts. That's big data development
  • 49. Let's talk about processing, ok?
  • 50. np, dude! All easily done by a map/reduce tool
  • 51. Almost agreed. map/reduce has two basic phases, namely “map” and “reduce”
  • 52. The slowest of those two is definitely “split”. Moving data from one huge pile to another before map/reduce is damn expensive (see the map/reduce sketch after the transcript)
  • 53. np, dude! I'll write my data straight to the storage of my map/reduce tool. It will then tear through it
  • 54. It can. But what if you need to search during the map phase – full-text, meta?
  • 55. np, dude! I'll use a cool indexing search engine or library. It can find my data in a snap
  • 56. Would it? A very hard challenge is to partition the index and to couple its related parts to the corresponding data. With data locality, of course, having index pieces on the related machines
  • 57. Data and index locality and direct filling of data pots as data flies by, combined with use-case-driven technology usage, are key principles in succeeding with processing of huge data amounts. That's big data development
  • 58. So, how about analytics, dude?
  • 59. np, dude! It's a classic use case for map/reduce. I can do this afterwards and on the fly
  • 60. Are you sure? So, you expect one tool to do both real-time and post-fact analytics?
  • 61. What did you smoke last night?
  • 62. You don't want to believe in map/reduce in (near) real time, do you?
  • 63. np, dude! I'll get me some rocket-fast hardware
  • 64. I'm sure you will. But: You cannot predict and fix the map/reduce time. You cannot ensure the completeness of data. You cannot guarantee causality knowledge
  • 65. Distribution has got your soul
  • 66. If you need to predict better, to be able to know about data/event causality, and to be fast, you need to CEP data streams as the data flies by. There is no (simple, fast) way around it (see the stream-processing sketch after the transcript)
  • 67. But the most important thing is: None of the BI tools you know will adequately support your NoSQL data store, so you're all alone in the world of proprietary, immature tool combinations. The world of pain.
  • 68. np, dude! My map/reduce tool can even hide math from me, so I can concentrate on doing stuff
  • 69. There is no point in fearing math/statistics. You just need it
  • 70. Separation of immediate and post-fact analytics and CEP of data streams as data flies by, combined with use-case-driven technology usage and statistical knowledge, are key principles in succeeding with analytics of huge data amounts. That's big data development
  • 71. Oh, we forgot visualization
  • 72. np, dude! I just have no idea about it
  • 73. Me neither. I just know that you can't visualize huge data amounts using classic spreadsheets. There are better ways, tools and ideas to do this – find them. That's big data development
  • 74. hem, dude... You're a smart-ass. Is that what you want to say?
  • 75. Almost. In one of my humble moments I would suggest you do the following:
  • 76. Stop thinking you gain adequately deep knowledge through reading half-baked blog posts. Get yourself some of those:
  • 77. Know and use the full stack: statistics, visualization, distribution, network, different languages, tools and chains, data stores, different platforms, OS, storage, machine, algorithms, math
  • 78. Know your point of pain. You must be Twitter, Facebook or Google to have them all at the same time. If you're none of them, you can have one or two. Or even none. Go for them with the right tool chain
  • 79. The first and most important tool in the chain is your brain
  • 80. Thank you
  • 81. Most images originate from istockphoto.com, except a few taken from Wikipedia or Flickr (CC), from product pages, or generated through public online generators
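
A few sketches follow for the slides that reference them. First, slide 26 warns against naive hashing (key modulo node count), which reshuffles almost all data whenever a node joins or leaves. Below is a minimal consistent-hashing sketch in Python, assuming invented node names and an arbitrary virtual-node count; a real store would add replication and rebalancing on top of this.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string onto a fixed integer ring using MD5."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Place nodes on a hash ring so that adding or removing a node only
    moves the keys between it and its neighbour, instead of reshuffling
    nearly everything the way hash(key) % number_of_nodes does."""

    def __init__(self, nodes, vnodes_per_node=64):
        self._positions = []   # sorted ring positions, for bisect
        self._ring = []        # (position, node), aligned with _positions
        for node in nodes:
            self.add_node(node, vnodes_per_node)

    def add_node(self, node, vnodes_per_node=64):
        for i in range(vnodes_per_node):
            pos = _hash("%s#%d" % (node, i))
            idx = bisect.bisect(self._positions, pos)
            self._positions.insert(idx, pos)
            self._ring.insert(idx, (pos, node))

    def node_for(self, key):
        """The first node clockwise from the key's position owns the key."""
        idx = bisect.bisect(self._positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    # Data locality (slide 25): route by a partition key such as the user id,
    # not by the full record key, so related records land on the same node.
    for user in ("user-42", "user-43", "user-44"):
        print(user, "->", ring.node_for(user))
```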
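
Second, slides 51–52 point at the cost of the "split" step, i.e. moving data into the processing system before map/reduce can even start. The toy word count below only makes the named phases visible in a single process; the input documents are invented, and a real framework distributes exactly these steps across machines.

```python
from collections import defaultdict
from itertools import chain

# "split": on a real cluster this is the expensive part - moving the input
# from wherever it lives into the processing system's storage.
documents = [
    "big data is easy",
    "big data is like normal data but bigger",
]

# "map": emit (key, value) pairs independently per input chunk.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = chain.from_iterable(map_phase(d) for d in documents)

# shuffle: group values by key (also network-heavy on a real cluster).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# "reduce": combine the grouped values per key.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # e.g. {'big': 2, 'data': 3, 'is': 2, ...}
```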
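
Third, slide 66 argues that (near) real-time answers require processing the stream as the data flies by rather than running batch analytics after the fact. The sketch below shows only the core idea, a sliding-window counter that flags an anomaly while events arrive; the window size and threshold are arbitrary, and in practice you would lean on a CEP or stream-processing engine rather than rolling your own.

```python
import time
from collections import deque


class SlidingWindowAlarm:
    """Count events seen in the last `window_seconds` and flag an anomaly
    as soon as the count crosses `threshold` - no batch job, no waiting."""

    def __init__(self, window_seconds=60, threshold=100):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def on_event(self, timestamp=None):
        now = time.time() if timestamp is None else timestamp
        self._timestamps.append(now)
        # Drop everything that has slid out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.threshold:
            return "anomaly: %d events in %ds" % (len(self._timestamps), self.window_seconds)
        return None


if __name__ == "__main__":
    alarm = SlidingWindowAlarm(window_seconds=10, threshold=5)
    for second in range(8):                       # a synthetic burst, one event per second
        verdict = alarm.on_event(timestamp=float(second))
        if verdict:
            print(verdict)
```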