100% Big Data, 0% Hadoop, 0% Java (@pavlobaron)


Talk I gave at GoTo Aarhus 2012, QCon San Francisco 2012 and TechMesh London 2012


  100% Big Data, 0% Hadoop, 0% Java. Pavlo Baron, codecentric AG
  
  So here is the short story...
  sitting there, listening...
  presented as Houdini magic...
  so you're telling me it's smoke and mirrors?
  Smells like a bunch of queues, pipes and filters...
  Looks like some NLP...
  Sounds like some math...
  Seems like some basic ML...
  methinks: I can tinker that. I have 2 nights in the hotel...
  Fire!
  Know the use cases...
  Consume a feed where people say what they think before they think what they say...
  Drink Big Data warm, straight from the fire hose...
  Then fork for immediate notification and batch analytics...
  Some bubbles (pipeline diagram; labels: feeds, queue, filter, sentiment analysis, formalize, store, fork, queue, aggregate, report, react, alert, queue, react, map/reduce)
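The bubbles above can be read as a small pipeline: filter, analyze, formalize, then fork. The sketch below is only an illustration with in-memory lists standing in for the queues; the word lists, the `is_garbage` rule and all function names are assumptions made for this example, not the code from the talk.

```python
from collections import Counter

# Toy lexicons, invented for this sketch only.
NEGATIVE_WORDS = {"hate", "awful", "broken"}
POSITIVE_WORDS = {"love", "great", "awesome"}


def is_garbage(text):
    """Filter stage: drop messages with fewer than two words."""
    return len(text.split()) < 2


def sentiment(text):
    """Very naive lexicon lookup standing in for real sentiment analysis."""
    words = set(text.lower().split())
    if words & NEGATIVE_WORDS:
        return "neg"
    if words & POSITIVE_WORDS:
        return "pos"
    return "neutral"


def formalize(text):
    """Turn chaotic text into a fixed-shape record."""
    return {"text": text, "sentiment": sentiment(text)}


def run_pipeline(feed):
    alerts = []  # immediate branch: react/alert
    store = []   # batch branch: store, then aggregate/report
    for text in feed:
        if is_garbage(text):
            continue
        record = formalize(text)
        # The fork: every record is stored, negative ones also alert at once.
        if record["sentiment"] == "neg":
            alerts.append(record)
        store.append(record)
    report = Counter(record["sentiment"] for record in store)
    return alerts, store, report


feed = ["I love this phone", "lol", "this update is awful", "just landed in Aarhus"]
alerts, store, report = run_pipeline(feed)
```

In a real setup each list would be a queue between independent processes, so the alert path never waits for the batch path.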
  Some tech
    Languages: Python, Erlang
    Feeds: Tweepy, crawlers, feed readers
    Queueing: RabbitMQ through Pika
    Store: Riak through protobufs
    Map/reduce: modified Disco to run workers on Riak nodes data-locally
    Active event push to the browser through RabbitMQ's Web-STOMP
    jVectorMap, jQuery, stomp and sock.js in the browser
    wpcorpus built with RabbitMQ, indexed with PyTables
  Some math
    Analytics: NLP with NLTK
    Algo training: nltk-trainer with pickle=true
    Algos: naive Bayes, decision tree, binary classification based on trigram frequencies
    simple name and antiword filtering based on public and own corpora
    sentiment analysis based on public and own corpora
    troll filtering using Wikipedia as active corpus
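To make the naive Bayes item concrete, here is a minimal standard-library sketch of a multinomial naive Bayes text classifier with add-one smoothing. It is a stand-in for illustration, not the NLTK or nltk-trainer code used in the talk; the class name and the four training sentences are invented.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace-split words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            words = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus summed log likelihoods, with add-one smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = NaiveBayes()
classifier.train([
    ("great phone love it", "pos"),
    ("what a great day", "pos"),
    ("awful service hate it", "neg"),
    ("this is awful", "neg"),
])
```

With a real corpus the training set would be thousands of labeled documents rather than four toy sentences.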
  Some numbers... ...'cause numbers are sexy
  When numbers become too sexy for your [hat|car|cat], they mutate into numbers pr0n
  Some numbers, revisited
    I'm not into numbers pr0n
    numbers need to be just good enough for what you're trying to solve
  But it's still the easiest way to impress, especially without solving a concrete problem
  So, finally, some numbers (on my MBA)
    Feed: ~10000 chaotic text msg/min
    Store: ~8000 formalized msg/min, N=3, quorum, 3 nodes
    Analytics: ~7000 msg/min (filtered, pos/neg aggregation, location based aggregation)
    Demo: ~1500000 tweets, pos/neg aggregation, stream processing in ~7min, map/red in ~15sec
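For readers who prefer per-second figures, the numbers above convert as follows. This is plain arithmetic on the slide values; the helper names are made up, and the quorum formula floor(N/2) + 1 is the usual majority definition, assumed here to be what "quorum" means on the slide.

```python
def per_second(messages, minutes):
    """Convert a message count over some minutes into messages per second."""
    return messages / (minutes * 60)


def quorum(n_val):
    """Majority quorum for n_val replicas: floor(n_val / 2) + 1."""
    return n_val // 2 + 1


feed_rate = per_second(10_000, 1)      # live feed
demo_rate = per_second(1_500_000, 7)   # demo stream processing
print(round(feed_rate), round(demo_rate), quorum(3))  # prints: 167 3571 2
```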
  Some lessons learned
  The Beliebers...
  More than 60% of the Twitter sample stream is useless garbage...
  A further 20% are trolls...
  So I ended up implementing wpcorpus - an active NLP corpus based on Wikipedia, using its categories and their combinations as classes and anti-classes
  Real names...
  Absurd profile bios...
  Location...
  Language...
    For trigrams in NLTK, use Spanish as "anti-class" to tell English/German from the rest
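The anti-class idea can be sketched without NLTK: train character-trigram profiles for the wanted languages plus one for the anti-class, and reject any text whose best match is the anti-class. Everything below (function names, the tiny training sentences, the smoothing) is an assumption for illustration, not the setup from the talk.

```python
import math
from collections import Counter


def trigrams(text):
    """Character trigrams of the lowercased text, padded with spaces."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Build one trigram frequency profile per label."""
    return {label: Counter(trigrams(text)) for label, text in samples.items()}


def score(profile, text, vocab_size):
    """Smoothed log-likelihood of the text under one profile."""
    total = sum(profile.values())
    return sum(math.log((profile[gram] + 1) / (total + vocab_size))
               for gram in trigrams(text))


def classify(profiles, text, anti_class="es"):
    """Return the best label, or None if the anti-class wins."""
    vocab_size = len(set().union(*profiles.values()))
    best = max(profiles, key=lambda label: score(profiles[label], text, vocab_size))
    return None if best == anti_class else best


profiles = train({
    "en": "the quick brown fox jumps over the lazy dog and this is what people think about the weather",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist was die leute über das wetter denken",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esto es lo que la gente piensa del tiempo",
})
```

The anti-class gives "the rest" somewhere to land, so texts in neither wanted language are not forced into English or German.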
  Disco workers on Riak nodes...
    PITA and a lot of tinkering, but necessary for data locality
    Extending Disco is relatively easy, but changing it is hard...
    Flooding, asynchronous, separate key/value listing in low-level Riak goes very well with Erlang port based Python/Erlang message exchange in Disco. Not.
    Low-level consumption of vnode data needed probabilistic correction due to N=3
    Extended Disco to use RabbitMQ between the worlds (h/t Dan North for the idea)
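One possible reading of the N=3 remark: with three replicas per object, reading vnode data directly sees each object up to three times, so raw counts overshoot. The sketch below shows that effect and a simple divide-by-N estimate. It is a guess at what "probabilistic correction" could mean, not the method from the talk; the record layout and function names are hypothetical.

```python
from collections import Counter


def raw_counts(vnode_records):
    """Count labels over every replica seen, duplicates included."""
    return Counter(label for _key, label in vnode_records)


def corrected_counts(vnode_records, n_val=3):
    """Estimate true counts by dividing raw counts by the replica count."""
    return {label: count / n_val
            for label, count in raw_counts(vnode_records).items()}


def exact_counts(vnode_records):
    """Exact alternative: deduplicate by key, at the cost of holding all keys."""
    unique = dict(vnode_records)
    return Counter(unique.values())


# Hypothetical (key, label) pairs as read from all vnodes.
# Tweet t3 has only two replicas visible, so the estimate is not exact.
records = ([("t1", "pos")] * 3) + ([("t2", "neg")] * 3) + ([("t3", "pos")] * 2)
estimate = corrected_counts(records)
exact = exact_counts(records)
```

The estimate is only approximate whenever a replica is missing, which is why it counts as a correction and not a guarantee.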
  Mixing Python and Erlang in one project...
    Forgetting punctuation in Erlang code all the time when quickly switching from Python
    Terribly missing pattern matching in Python
    Considering embedding Python in Erlang, but it might become a double PITA then
  Sentiment analysis...
  Well, actually, strong sentiment analysis...
  Very unreliable given the human nature...
  In addition to the NLTK's movie reviews corpus, use these for "neg" classification
  FAQ
  Q: Why the heck are you doing this?
  A:
    Because I can
    Because I want
    Because I want to learn
    Because I want to go deep on low-level
    Because I value speed over abstraction
    Because it's very interesting to combine computer science with math
  Q: Why not just use Hadoop?
  A:
    Because I didn't want to run this on the JVM
    Because I have 2 use cases, and only one of them is suitable for batch map/reduce
  Q: Why didn't you want to run this on the JVM?
  A: well, technically speaking, the Big Data area is growing on the JVM
    Hadoop
    Pig
    Storm, Kafka, Esper
    Mahout
    OpenNLP
  A: but I didn't want this Big Data on my drive
    ~/.m2
  A: and I am evaluating some alternatives to the ecosystem
  Q: Why are you queueing at all? Others do gazillions of msg/sec without queues
  A:
    I could, if instead of filters and batch analytics of chaotic text it were just about building trivial sums from fixed-size ticks
    With numbers that can grow like this, you want to protect any sort of reliable data store, RDBMS or NoSQL, from getting flooded by writes
    Because I need to do some pipes and filters
    Because I'm mixing and crossing borders of data sources and technologies
    Because (almost) all frameworks that you might consider also do some queueing or buffering
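The point about protecting the store from write floods can be illustrated with a bounded in-process queue and a writer that batches. This is a standard-library sketch only: `FakeStore`, `writer`, `run` and the batch sizes are invented for the example, and a broker such as RabbitMQ would play the role of `queue.Queue` in a real deployment.

```python
import queue
import threading


class FakeStore:
    """Stand-in for a real data store; records how often it is written to."""

    def __init__(self):
        self.rows = []
        self.write_calls = 0

    def write_batch(self, batch):
        self.write_calls += 1
        self.rows.extend(batch)


def writer(buffer, store, batch_size):
    """Drain the buffer and write in batches until the None sentinel arrives."""
    batch = []
    while True:
        item = buffer.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            store.write_batch(batch)
            batch = []
    if batch:  # flush the remainder
        store.write_batch(batch)


def run(messages, batch_size=100, max_buffered=500):
    buffer = queue.Queue(maxsize=max_buffered)
    store = FakeStore()
    worker = threading.Thread(target=writer, args=(buffer, store, batch_size))
    worker.start()
    for message in messages:
        buffer.put(message)  # blocks when the buffer is full: back-pressure
    buffer.put(None)
    worker.join()
    return store


store = run(range(1050))
```

Here 1050 messages reach the store in 11 write calls instead of 1050, and the producer is slowed down whenever the buffer is full.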
  Q: Why did you use Erlang and Python?
  A:
    Because reliability and distribution are built into the Erlang VM and I don't need separate coordinators or to reinvent the wheel
    Because both Python and Erlang are "functional" enough for what I need day to day
    Because Python has for many years been the platform of choice for scientists, so clever and mature math libraries are available
    Because Disco is on Python and Erlang, Riak and RabbitMQ are on Erlang
  Q: isn't Python slow like hell?
  A:
    it's not operating at the speed of light
    yes, it is slower at some points
    I've also been testing PyPy to improve performance in case I should need it, 'cause right now it works just fast enough without explicit bottlenecks in the given architecture, even on one single MBA
  Q: MBA is boring. Can you make it real web scale?
  A:
    well, to be precise, I'm operating on web data
    I can scale queues with RabbitMQ
    I can scale storage with Riak
    I can scale the map/reduce supported analytics with Disco/Riak
    I can scale data sources/feeds, machines, hardware, networks, infrastructure, logins etc. You name it
  Q: what's in the future?
  A:
    I don't have my crystal ball with me
    I'm toying with the idea of writing a Pig Latin engine in Python called "Sau" (German for pig), to offer data scientists a comfortable interface and to allow them to run existing Pig scripts on this stack
    I could add more data sources, improve throughput where necessary and work on some low-level Disco modifications to change the way it utilizes Erlang in my case
    I will integrate my Disco extensions with Disco upstream one day
  Q: what do we learn about Big Data here?
  A: Big Data is about the "what", followed by the "how" and enabled by the "what with"
  A: It's about gathering data, filtering out most of it as garbage, analyzing it, gaining useful information out of it immediately, finding new ways to gather and use information and deriving steps for business improvements, strategy planning, doing soft intelligence aka enterprise level stalking or, even more important, helping make the world a better place - it's up to you
  A: It's not about building SkyNet - even if this will be built one day, it will be pretty boring. It's about building recommender and decision support systems, thus letting machines do stupid, repeated jobs fast and human beings make high quality decisions
  A: It's not about plain numbers. It's about numbers that are good enough to carry the solution. Not less, but also not more than that
    Consider the dilemma: if you want to be as precise and as fast as possible in your "Big Data", you don't crunch it blindly on tons of machines. Instead, you go low-level, optimize the stack and filter upfront
    If it's not leading to value in near real time, it's useless, though probably Big
  A: It's a huge field for geeks with aspiration to learn new things, dig into math and computer science, play with different platforms and tools and pick the right tool chain
  Oh, and did the demo run?
  Thank you!
  Most images originate from istockphoto.com, except a few taken from Wikipedia or Flickr (CC) and product pages, or generated through public online generators