
100% Big Data, 0% Hadoop, 0% Java (@pavlobaron)


Talk I gave at GoTo Aarhus 2012, QCon San Francisco 2012 and TechMesh London 2012

Published in: Technology, News & Politics

  1. 1. 100% Big Data, 0% Hadoop, 0% Java. Pavlo Baron, codecentric AG. Saturday, April 27, 13
  3. 3. So here is the short story...
  4. 4. sitting there, listening...
  5. 5. presented as Houdini magic...
  6. 6. So you’re telling me it’s smoke and mirrors?
  7. 7. Smells like a bunch of queues, pipes and filters...
  8. 8. Looks like some NLP...
  9. 9. Sounds like some math...
  10. 10. Seems like some basic ML...
  11. 11. Methinks: I can tinker that. I have 2 nights in the hotel...
  12. 12. Fire!
  13. 13. Know the use cases...
  14. 14. Consume a feed where people say what they think before they think what they say...
  15. 15. Drink Big Data warm, straight from the fire hose...
  16. 16. Then fork for immediate notification and batch analytics...
  17. 17. Some bubbles (pipeline diagram; node labels: feeds, queue, filter, sentiment analysis, formalize, store, aggregate, report, react, alert, fork, map/reduce)
  18. 18. Some tech. Languages: Python, Erlang. Feeds: Tweepy, crawlers, feed readers. Queueing: RabbitMQ through Pika. Store: Riak through protobufs. Map/reduce: modified Disco to run workers on Riak nodes, data-locally. Active event push to the browser through RabbitMQ’s Web-STOMP. jVectorMap, jQuery, stomp and sock.js in the browser. wpcorpus built with RabbitMQ, indexed with PyTables.
  19. 19. Some math. Analytics: NLP with NLTK. Algo training: nltk-trainer with pickle=true. Algos: naive Bayes, decision tree, binary classification based on trigram frequencies. Simple name and antiword filtering based on public and own corpora. Sentiment analysis based on public and own corpora. Troll filtering using Wikipedia as active corpus.
  20. 20. Some numbers... ...‘cause numbers are sexy
  21. 21. When numbers become too sexy for your [hat|car|cat], they mutate into numbers pr0n
  22. 22. Some numbers, revisited. I’m not into numbers pr0n. Numbers need to be just good enough for what you’re trying to solve.
  23. 23. But it’s still the easiest way to impress, especially without solving a concrete problem
  24. 24. So, finally, some numbers (on my MBA). Feed: ~10000 chaotic text msg/min. Store: ~8000 formalized msg/min, N=3, quorum, 3 nodes. Analytics: ~7000 msg/min (filtered, pos/neg aggregation, location based aggregation). Demo: ~1500000 tweets, pos/neg aggregation, stream processing in ~7min, map/red in ~15sec.
  25. 25. Some lessons learned
  26. 26. The Beliebers...
  27. 27. More than 60% of the Twitter sample stream is useless garbage...
  28. 28. Further 20% are trolls...
  29. 29. So I ended up implementing wpcorpus - active NLP corpus based on Wikipedia, using its categories and their combination as classes and anti-classes
  30. 30. Real names...
  31. 31. Absurd profile bios...
  32. 32. Location...
  33. 33. Language... For trigrams in NLTK, use Spanish as “anti-class” to tell English/German from the rest
  34. 34. Disco workers on Riak nodes... PITA and a lot of tinkering, but necessary for data locality. Extending Disco is relatively easy, but changing it is hard... Flooding, asynchronous, separate key/value listing in low-level Riak goes very well with Erlang port based Python/Erlang message exchange in Disco. Not. Low-level vnode-data-consume needed probabilistic correction due to N=3. Extended Disco to use RabbitMQ between the worlds (h/t Dan North for the idea).
  35. 35. Mixing Python and Erlang in one project... Forgetting punctuation in Erlang code all the time when quickly switching from Python. Terribly missing pattern matching in Python. Considering embedding Python in Erlang, but it might become a double PITA then.
  36. 36. Sentiment analysis...
  37. 37. Well, actually, strong sentiment analysis...
  38. 38. Very unreliable given the human nature...
  39. 39. In addition to the NLTK’s movie reviews corpus, use these for “neg” classification
  40. 40. FAQ
  41. 41. Q: Why the heck are you doing this?
  42. 42. A: Because I can. Because I want. Because I want to learn. Because I want to go deep on low-level. Because I value speed over abstraction. Because it’s very interesting to combine computer science with math.
  43. 43. Q: Why not just use Hadoop?
  44. 44. A: Because I didn’t want to run this on the JVM. Because I have 2 use cases, and only one of them is suitable for batch map/reduce.
  45. 45. Q: Why didn’t you want to run this on the JVM?
  46. 46. A: well, technically speaking, the Big Data area is growing on the JVM: Hadoop, Pig, Storm, Kafka, Esper, Mahout, OpenNLP
  47. 47. A: but I didn’t want this Big Data on my drive: ~/.m2
  48. 48. A: and I am evaluating some alternatives to the ecosystem
  49. 49. Q: Why are you queueing at all? Others do gazillions of msg/sec without queues
  50. 50. A: I could, if instead of filters and batch analytics of chaotic text, it would be just about building trivial sums from fixed-sized ticks. With growable numbers like this, you want to protect any sort of reliable data store from getting flooded by writes, RDBMS or NoSQL store. Because I need to do some pipes and filters. Because I’m mixing and crossing borders of data sources and technologies. Because (almost) all frameworks that you might consider also do some queueing or buffering.
  51. 51. Q: Why did you use Erlang and Python?
  52. 52. A: Because reliability and distribution are built into the Erlang VM and I don’t need separate coordinators or to reinvent the wheel. Because both Python and Erlang are “functional” enough for what I need day-by-day. Because Python has been the platform of choice for scientists for many years, so clever and mature math libraries are available. Because Disco is on Python and Erlang, and Riak and RabbitMQ are on Erlang.
  53. 53. Q: isn’t Python slow like hell?
  54. 54. A: It’s not operating at the speed of light. Yes, it is slower at some points. I’ve also been testing PyPy to improve performance in case I should need it, ‘cause right now it works just fast enough without explicit bottlenecks in the given architecture, even on one single MBA.
  55. 55. Q: MBA is boring. Can you make it real web scale?
  56. 56. A: Well, to be precise, I’m operating on web data. I can scale queues with RabbitMQ. I can scale storage with Riak. I can scale the map/reduce supported analytics with Disco/Riak. I can scale data sources/feeds, machines, hardware, networks, infrastructure, logins etc. You name it.
  57. 57. Q: what’s in the future?
  58. 58. A: I don’t have my crystal ball with me. I’m toying with the idea of writing a Pig Latin engine in Python called “Sau” (German for pig), to offer data scientists a comfortable interface and to allow them to run existing Pig scripts on this stack. I could add more data sources, improve throughput where necessary and work on some low level Disco modifications to change the way it utilizes Erlang in my case. I will integrate my Disco extensions with Disco upstream one day.
  59. 59. Q: what do we learn about Big Data here?
  60. 60. A: Big Data is about the “what”, followed by the “how” and enabled by the “what with”
  61. 61. A: It’s about gathering data, filtering out most of it as garbage, analyzing it, gaining useful information out of it immediately, finding new ways to gather and use information and deriving steps for business improvements, strategy planning, doing soft intelligence aka enterprise level stalking or, even more important, helping make the world a better place - it’s up to you
  62. 62. A: It’s not about building SkyNet - even if this will be built one day, it will be pretty boring. It’s about building recommender and decision support systems, thus letting machines do stupid, repeated jobs fast and human beings make high quality decisions
  63. 63. A: It’s not about plain numbers. It’s about numbers that are good enough to carry the solution. Not less, but also not more than that. Consider the dilemma: if you want to be as precise and as fast as possible in your “Big Data”, you don’t crunch it blindly on tons of machines. Instead, you go at the low-level and optimize the stack, filter upfront. If it’s not leading to value in near realtime, it’s useless, though probably Big
  64. 64. A: It’s a huge field for geeks with aspiration to learn new things, dig into math and computer science, play with different platforms and tools and pick the right tool chain
  65. 65. Oh, and did the demo run?
  66. 66. Thank you!
  67. 67. Most images originate from istockphoto.com except few ones taken from Wikipedia or Flickr (CC) and product pages or generated through public online generators
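
The language trick from slide 33 (a Spanish “anti-class” that soaks up whatever is neither English nor German) can be sketched roughly like this. It is a minimal illustration and not the code from the talk: it uses plain Python with a naive Bayes score over character trigrams instead of NLTK, the three training strings are made-up toy corpora, and the names `trigrams`, `train` and `classify` are hypothetical.

```python
import math
from collections import Counter

def trigrams(text):
    """Return the character trigrams of a lower-cased, space-padded text."""
    padded = "  " + text.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(samples):
    """Build one trigram frequency profile (a Counter) per class label."""
    return {label: Counter(trigrams(text)) for label, text in samples.items()}

def classify(text, profiles, anti_class="es"):
    """Pick the class with the best add-one smoothed log score.

    Returns None when the anti-class wins, i.e. the text is treated
    as "the rest" and would be filtered out.
    """
    vocab = set().union(*profiles.values())
    scores = {}
    for label, counts in profiles.items():
        total = sum(counts.values())
        scores[label] = sum(
            math.log((counts[g] + 1) / (total + len(vocab)))
            for g in trigrams(text)
        )
    best = max(scores, key=scores.get)
    return None if best == anti_class else best

# Toy corpora (assumption): a real setup would train on far more text.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and this is "
          "what people think about the weather today",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "das ist was die leute heute über das wetter denken",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esto "
          "es lo que la gente piensa hoy sobre el tiempo",
}
profiles = train(samples)

for msg in ("what people think about the weather",
            "was die leute über das wetter denken",
            "lo que la gente piensa sobre el perro"):
    print(repr(msg), "->", classify(msg, profiles))
```

The point of the anti-class is that the classifier never has to model “everything else” explicitly: one well-chosen third class competes with the two wanted ones, and anything it wins is dropped before the more expensive filters run.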