The Big Data Developer (@pavlobaron)Presentation Transcript
Big Big Big Big Bigdata developer
Pavlo Baronpavlo.baron@codecentric.de @pavlobaron
Disclaimer:I will not flame-war,rant, bash or doanything similar aroundconcrete tools.Im tired of doingthat
Hey, dude.Im a big datadeveloper. We haveenormous data.I continuouslyread articles onhighscalability.com
Big datais easy.Its like normal data,but, well,even bigger.
Aha. So thisis not big, right?
Thats big, huh?
Peta bytes of data every houron different continents, withcomplex relations and withthe need to analyze themin almost real time for anomaliesand to visualize them foryour management
You can easily call this big data.Everything below this you cancall Mickey Mouse data
Good news is: you canintentionally grow – collectmore data. And you should!
np, dude!Data is data.Same principles
Really?Lets consider storage, ok?
np, dude!Storage is just adatabase
Really?Storage capacity of onesingle box, no matter how bigit is, is limited
Oh.When you expect big data, youneed to scale very far and thusbuild on distribution andcombine theoreticallyunlimited amount of machines toone single distributed storage
There is no way around.Except you invent the BlackHoleDB
Hem.Building upon distribution ismuch harder than anything youveseen or done before
Except you fed a crowd with7 breads and walked upon thewater
np, dude!From three ofCAP, Ill justpick two.As easy as this
The only thing that is absolutelycertain about distributed systems isthat parts of them will fail and youwill have no idea where and whatthe hell is going on
So your P must be a given in adistributed system. And you want toplay with C vs. A, not just take blackor white
np, dude!Sharding worksseamlessly. Idont need to takecare of anything
Seriously?For example, one of the hardestchallenges with big datais to distribute/shard parts overseveral machines still having fasttraversals and reads, thuskeeping related data together.Valid for graph and any other datastore, also NoSQL, kind of
Another hard challenge with shardingis to avoid naive hashing.Naive hashing would make you dependon the number of nodes and would notallow you to easily add or removenodes to/from the system
And still, the trade-off between datalocality, consistency, availability,read/write/search speed, latency etc.is hard
np, dude!NoSQL wouldwrite asynchronouslyand do map/reduceto find data
Of course.You will love eventualconsistency, especially when you needa transaction around a complexmoney transfer
np, dude!I dont have moneyto transfer. But Ineed to store lotsof data.I can throw anyamount of it atNoSQL, and itwill just work
Really?So youd just throwsomething into your databaseand hope it works?
What if you throw andmiss the target?
Data locality, redundancy,consistent hashing and eventualconsistency combined with usecase driven storage designare key principles in succeedingwith a huge distributed data storage.Thats big data development
How about data provisioning?
np, dude!Its databasebeing the bottleneck, not myweb servers
When you have thousands or millionsparallel requests per second, beggingfor data, the first mile will (also) quicklybecome the bottle neck.Requests will get queued and discardedas soon as your server doesnt bringdata fast enough through the pipe
np, dude!Ill get me somebad-ass sexyhardware
I bet you will.But under high load, your hardwarewill more or less quickly start to crack
Youll burn your hard disks, boards andcards. And wires. And youll heat upto a maximum
Its not about sexy hardware, butabout being able to quickly replace it.Ideally while the system keeps running
But anyway.To keep the first mile scalable andfast, would lead to some expensivenetwork infrastructure.You need to get the maximum outof your servers in order to reducetheir number
np, dude!I will use an eventdriven C10Kproblem solvingawesome webserver. Or Ill writeone on my own
Maybe.But when your users are coming fromall over the world, it wont help youmuch since the network latency fromthem to your server will kill them
You would have to go for a CDN oneday, statically pre-computing content.You would use their infrastructureand reduce the number of hits onyour own servers to a minimum
np, dude!Ill push my wholeplatform out to thecloud. Its evenmore flexible andscales like hell
Well.You cannot reliably predict on whichphysical machine and actually howclose to the data you program will run.Whenever virtual machines orstorage fragments get moved, yourworld stops
You can easily force data locality andshorter stop-the-world-phasesby paying higher bills
Data locality, geographic spatiality,dedicated virtualization and contentpre-computability combined with usecase driven cloudificationare key principles in succeedingwith provisioning of huge data amounts.Thats big data development
Lets talk about processing, ok?
np, dude!All easily doneby a map/reducetool
Almost agreed.map/reduce has two basic phases:even “map” and “reduce”
The slowest of those twois definitely “split”.Moving data from one huge pile toanother before map/reduce isdamn expensive
np, dude!Ill write my datastraight to thestorage of mymap/reduce tool.It will then tear
It can.But what if you need to search duringthe map phase – full-text, meta?
np, dude!Ill use a coolindexing searchengine or library.It can find my datain a snap
Would it?A very hard challenge is to partitionthe index and to couple its relatedparts to the corresponding data.With data locality of course, havingindex pieces on the related machines
Data and index locality and direct fillingof data pots as data flies by combinedwith use case driven technologyusage are key principles in succeedingwith processing of huge data amounts.Thats big data development
So, how about analytics, dude?
np, dude!Its classic use casefor map/reduce.I can do thisafterwards and onthe fly
Are you sure?So, you expect one tool to do both,real-time and post fact analytics?
What did you smoke last night?
You dont want to believein map/reduce in (near) real-time,dont you?
np, dude!Ill get me somerocket fasthardware
Im sure you will. But:You cannot predict and fix themap/reduce time.You cannot ensurethe completeness of data.You cannotguarantee causality knowledge
Distribution has got your soul
If you need to predict better,to be able to know about data/eventcausality, to be fast you need to CEPdata streams as data flies by.There is no (simple, fast) way around
But the most important thing is:None of the BI tools you know willadequately support your NoSQLdata store, so youre all alonein the world of proprietaryimmature tool combinations.The world of pain.
np, dude!My map/reducetool can even hidemath from me, soI can concentrateon doing stuff
There is no point in fearingmath/statistics. You just need it
Separation of immediate andpost fact analytics and CEP ofdata streams as data flies by combinedwith use case driven technologyusage and statistical knowledgeare key principles in succeedingwith analytics of huge data amounts.Thats big data development
Oh, we forgot visualization
np, dude!I just have noidea about it
Me neither.I just know that you cant visualizehuge data amounts using classicspreadsheets. There are better ways,tools, ideas to do this – find themThats big data development
hem, dude...Youre a smart-ass.Is it that you wantto say?
Almost.In one of my humble moments I wouldsuggest you to do the following:
Stop thinking you gain adequately deepknowledge through reading half-bakedblog posts. Get yourself some of those:
Statistics, Visualization Distribution Network Different languages Tools, chainsKnow and use full stack Data stores Different platforms OS Storage Machine Algorithms Math
Know your point of pain.You must be Twitter, Facebook orGoogle to have them all same time.If youre none of them, you can haveone or two. Or even none.Go for them with the right chain tool
First and the most important tool in thechain is your brain
Thank you
Most images originate from istockphoto.com except few ones takenfrom Wikipedia or Flickr (CC) and product pages or generated through public online generators
1–10 of 11 previous next Post a comment
1–10 of 11 previous next