19. Hem. Building upon distribution is much harder than anything you've seen or done before
20. Except you fed a crowd with seven loaves and walked upon the water
21. np, dude! Of the three in CAP, I'll just pick two. As easy as that
22. The only thing that is absolutely certain about distributed systems is that parts of them will fail, and you will have no idea where and what the hell is going on
23. So your P must be a given in a distributed system. And you want to play with C vs. A, not just take black or white
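What "playing with C vs. A" can look like in practice is quorum tuning, Dynamo-style. A toy sketch, not any real store's API (the class and the numbers are illustrative): with N replicas, choosing R + W > N forces every read quorum to overlap the latest write quorum, while smaller R and W trade that certainty for availability and speed.

```python
import random

class QuorumStore:
    """Toy replicated register with tunable read/write quorums (Dynamo-style)."""

    def __init__(self, n=3, w=2, r=2):
        self.n, self.w, self.r = n, w, r
        self.replicas = [(0, None)] * n   # (version, value) per replica

    def write(self, value):
        # Ack after W replicas applied the write; the rest stay stale for now.
        version = max(v for v, _ in self.replicas) + 1
        for i in random.sample(range(self.n), self.w):
            self.replicas[i] = (version, value)

    def read(self):
        # Ask R replicas and return the value with the highest version seen.
        answers = [self.replicas[i] for i in random.sample(range(self.n), self.r)]
        return max(answers)[1]
```

With n=3, w=2, r=2 any two quorums intersect, so reads always see the latest write (leaning C). With w=1, r=1 reads and writes are faster and tolerate more dead replicas, but a read may hit a replica the write never reached (leaning A).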
24. np, dude! Sharding works seamlessly. I don't need to take care of anything
25. Seriously? For example, one of the hardest challenges with big data is to distribute/shard parts over several machines while still having fast traversals and reads, thus keeping related data together. Valid for graph and any other data store, NoSQL included, kind of
26. Another hard challenge with sharding is to avoid naive hashing. Naive hashing makes you depend on the number of nodes and does not allow you to easily add or remove nodes to/from the system
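Consistent hashing is the usual way out: nodes and keys are hashed onto the same ring, a key belongs to the first node clockwise from it, so adding or removing a node only remaps the keys on one arc. A minimal sketch (class name and vnode count are illustrative):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map any string to a point on the hash ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []   # sorted ring points, for bisect
        self._ring = []     # (point, node), kept parallel to _points
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        # Many virtual points per node smooth out the key distribution.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            idx = bisect.bisect(self._points, point)
            self._points.insert(idx, point)
            self._ring.insert(idx, (point, node))

    def remove_node(self, node: str):
        keep = [(p, n) for p, n in self._ring if n != node]
        self._ring = keep
        self._points = [p for p, _ in keep]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash (wraps around).
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]
```

Compare with naive `hash(key) % len(nodes)`: there, going from 3 to 4 nodes remaps almost every key; on the ring, only roughly 1/N of the keys move to the new node.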
27. And still, the trade-off between data locality, consistency, availability, read/write/search speed, latency etc. is hard
28. np, dude! NoSQL would write asynchronously and do map/reduce to find data
29. Of course. You will love eventual consistency, especially when you need a transaction around a complex money transfer
30. np, dude! I don't have money to transfer. But I need to store lots of data. I can throw any amount of it at NoSQL, and it will just work
31. Really? So you'd just throw something into your database and hope it works?
32. What if you throw andmiss the target?
33. Data locality, redundancy, consistent hashing and eventual consistency, combined with use-case-driven storage design, are key principles in succeeding with huge distributed data storage. That's big data development
34. How about data provisioning?
35. np, dude! It's the database that's the bottleneck, not my web servers
36. When you have thousands or millions of parallel requests per second begging for data, the first mile will (also) quickly become the bottleneck. Requests will get queued and discarded as soon as your server doesn't push data through the pipe fast enough
37. np, dude! I'll get me some bad-ass sexy hardware
38. I bet you will. But under high load, your hardware will sooner or later start to crack
39. You'll burn your hard disks, boards and cards. And wires. And you'll heat up to the maximum
40. It's not about sexy hardware, but about being able to quickly replace it. Ideally while the system keeps running
41. But anyway. Keeping the first mile scalable and fast would require some expensive network infrastructure. You need to get the maximum out of your servers in order to reduce their number
42. np, dude! I'll use an awesome event-driven web server that solves the C10K problem. Or I'll write one of my own
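The event-driven idea can be sketched with Python's asyncio: one event loop multiplexes all connections instead of one thread per client, which is the usual answer to C10K. Host, port and the canned response below are placeholders:

```python
import asyncio

async def handle(reader, writer):
    # One cheap coroutine per connection, no thread per client.
    await reader.readline()   # read the request line
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    # A single process can hold tens of thousands of open connections
    # because idle ones cost no thread, only a small coroutine.
    server = await asyncio.start_server(handle, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

# asyncio.run(main())   # blocks, serving many clients from one loop
```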
43. Maybe. But when your users come from all over the world, it won't help you much, since the network latency from them to your server will kill them
44. You would have to go for a CDN one day, statically pre-computing content. You would use their infrastructure and reduce the number of hits on your own servers to a minimum
45. np, dude! I'll push my whole platform out to the cloud. It's even more flexible and scales like hell
46. Well. You cannot reliably predict on which physical machine, and how close to the data, your program will run. Whenever virtual machines or storage fragments get moved, your world stops
47. You can easily force data locality and shorter stop-the-world phases by paying higher bills
48. Data locality, geographic spatiality, dedicated virtualization and content pre-computability, combined with use-case-driven cloudification, are key principles in succeeding with provisioning of huge data amounts. That's big data development
49. Let's talk about processing, ok?
50. np, dude! All easily done by a map/reduce tool
51. Almost agreed. map/reduce has two basic phases: "map" and "reduce"
52. The slowest of those two is definitely "split". Moving data from one huge pile to another before map/reduce is damn expensive
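The point in miniature, with word count as the stock example (pure Python, no framework): the "map" and "reduce" functions themselves are trivial; the grouping between them, the shuffle/split that moves every pair to its reducer, is where the data movement and the cost live.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # "map": turn each input record into (key, value) pairs.
    for word in chunk.split():
        yield word.lower(), 1

def shuffle(pairs):
    # The in-between phase: group all pairs by key so each reducer
    # sees one key's values. On a cluster this means moving data
    # between machines, and that is the expensive part.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # "reduce": fold each key's values into a result.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data is big", "data flies by"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(c) for c in chunks)))
```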
53. np, dude! I'll write my data straight to the storage of my map/reduce tool. It will then tear through it
54. It can.But what if you need to search duringthe map phase – full-text, meta?
55. np, dude! I'll use a cool indexing search engine or library. It can find my data in a snap
56. Would it? A very hard challenge is to partition the index and to couple its related parts to the corresponding data. With data locality, of course, having index pieces on the related machines
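A sketch of that coupling (toy in-memory code, routing by hash of the document id): each shard carries both its documents and the matching slice of the inverted index, so a term lookup never leaves the machine holding the data, and queries scatter-gather across shards.

```python
class Shard:
    """One 'machine': its documents plus the local slice of the index."""

    def __init__(self):
        self.docs = {}    # doc_id -> text
        self.index = {}   # term -> set of local doc ids

    def store(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.index.setdefault(term, set()).add(doc_id)

    def search(self, term):
        # Local lookup only: no hop to a central index server.
        return [self.docs[d] for d in self.index.get(term.lower(), ())]

shards = [Shard() for _ in range(3)]

def store(doc_id, text):
    # Route document and its index entries to the same shard.
    shards[hash(doc_id) % len(shards)].store(doc_id, text)

def search(term):
    # Scatter-gather: every shard answers from its own index piece.
    return [doc for s in shards for doc in s.search(term)]
```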
57. Data and index locality and direct filling of data pots as data flies by, combined with use-case-driven technology usage, are key principles in succeeding with processing of huge data amounts. That's big data development
58. So, how about analytics, dude?
59. np, dude! It's a classic use case for map/reduce. I can do it afterwards and on the fly
60. Are you sure? So you expect one tool to do both real-time and post-fact analytics?
61. What did you smoke last night?
62. You don't want to believe in map/reduce in (near) real time, do you?
63. np, dude! I'll get me some rocket-fast hardware
64. I'm sure you will. But: You cannot predict and fix the map/reduce time. You cannot ensure the completeness of data. You cannot guarantee causality knowledge
65. Distribution has got your soul
66. If you need to predict better, to know about data/event causality, and to be fast, you need to CEP data streams as data flies by. There is no (simple, fast) way around it
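CEP here means evaluating rules over the stream as each event arrives, instead of piling data up and batching later. A toy sliding-window rule (class name and thresholds are made up): fire when more than `threshold` events land inside the last `window` seconds.

```python
from collections import deque

class SlidingWindowCEP:
    """Fire when event rate in the last `window` seconds exceeds `threshold`."""

    def __init__(self, window=60.0, threshold=100):
        self.window = window
        self.threshold = threshold
        self.events = deque()   # timestamps of events still inside the window

    def on_event(self, ts):
        # Process each event as it flies by: admit it, expire the old
        # ones, and decide immediately. No post-fact batch job needed.
        self.events.append(ts)
        while self.events and self.events[0] <= ts - self.window:
            self.events.popleft()
        return len(self.events) > self.threshold   # complex event detected?
```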
67. But the most important thing is: none of the BI tools you know will adequately support your NoSQL data store, so you're all alone in the world of proprietary, immature tool combinations. The world of pain.
68. np, dude! My map/reduce tool can even hide the math from me, so I can concentrate on doing stuff
69. There is no point in fearing math/statistics. You just need them
70. Separation of immediate and post-fact analytics, and CEP of data streams as data flies by, combined with use-case-driven technology usage and statistical knowledge, are key principles in succeeding with analytics of huge data amounts. That's big data development
71. Oh, we forgot visualization
72. np, dude! I just have no idea about it
73. Me neither. I just know that you can't visualize huge data amounts with classic spreadsheets. There are better ways, tools and ideas to do this – find them. That's big data development
74. Hem, dude... You're a smart-ass. Is that what you want to say?
75. Almost. In one of my humble moments I would suggest you do the following:
76. Stop thinking you gain adequately deep knowledge by reading half-baked blog posts. Get yourself some of these:
77. Statistics, visualization, distribution, network, different languages, tools and chains, data stores, different platforms, OS, storage, machine, algorithms, math – know and use the full stack
78. Know your point of pain. You must be Twitter, Facebook or Google to have them all at the same time. If you're none of them, you can have one or two. Or even none. Go for them with the right tool chain
79. The first and most important tool in the chain is your brain
80. Thank you
81. Most images originate from istockphoto.com, except a few taken from Wikipedia or Flickr (CC) and product pages, or generated through public online generators