Parsing binaries and protocols with erlang


Published on

Delivered by Bhasker V Kode at

Official talk page at

Erlang's support for handling binaries and pattern matching makes it a great choice for parsing everything from IPv4 packets to payloads from the Memcached protocol, SWF files, or databases like Tokyo Cabinet. From a functional programming perspective, there are various ways of building these parsers that take advantage of the concurrent and recursive nature inherent to the language. The talk covers these, along with other challenges gathered while validating the storage & retrieval options for our distributed crawler and submitting patches to projects like Medici & Tora (Erlang-based Tokyo Cabinet clients). It will also touch upon Tokyo Cabinet's support for mapreduce with Lua, and notes from building your own custom formats, plus our internal mapreduce-esque and caching frameworks used in building a multi-million impression platform utilizing under a gig of RAM per node.

Notes on:
- trends in disk/memory/bandwidth
- why erlang, RAM, binaries
- garbage collection in the erlang VM
- message passing
- use-cases

Published in: Technology, News & Politics


  1. “Parsing binaries and protocols with erlang ?!” Bhasker V Kode, co-founder & CTO at at December 4th, 2009
  2. “WHY ... ?!”
  3. “BUT I'm BUILDING webapps !?!”
  4. “Everything's quick enough :D”
  5. “doh!”
  6. “ha! of course i knew that... err.... but people scale... that's what they do ..... that's our way out !!! scaling out ... scaling up ... auto scaling even...!!! :O”
  7. “scale UP ...! more RAM seems to stall those silly CPU-unit warnings my hosting provider gives... bring on those infinite loops & polling crons. RealTimeWeb FTW!”
  8. “scaling OUT, maybe with a distributed filesystem, and figure out a way for nodes to talk, and... replication... and location transparency during weekends... and commodity hardware which i can't pay for”
  9. More data becoming archival NOT by choice, but forced to. Not pushed to handling streams of data well (even hadoop!) #bigdata If you're not compromising, you're not pushing enough. Disk's loss must be someone else's gain. Fixed-length e.g.'s at fb, twitter, google
  10. Erlang for RAM on the web is the new Embedded C
  11. “THE NEWS TODAY. Once popular retro format 'binary' continues to go unnoticed after brief sightings on wallpapers during the matrix trilogy ....” pssst! in files of any mime/content type; in db's that accept binary; in RAM, via caching engines; compact for n/w transfer & storage; the answer to unicode
  12. “fine! Binaries are everywhere, disks are not keeping up, and i've got more cores on my nodes every year.”
  13. “But i'm still not going near a strict, dynamically typed functional programming language with support for concurrency, communication, and distribution, automatic memory management & supports multiple platforms !!!”
  14. Erlang!!! over-rated ? OR under-appreciated ? “[87, 84, 70] :O !”
  15. What happens when you start an erlang shell. SMP didn't exist before erlang build R11 ('06)
  16. “ahh... so processes are pseudo threads in the erlang VM that are light weight & the base of erlang programs, having their own heap and message inbox, & are meant for message passing with erlang primitives. Also the developer can configure how many cores are used based on # of schedulers, which run processes.”
  17. Max of 1024 schedulers can be set => your erlang src today should utilize boxes with up to 1024 cores
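The scheduler settings mentioned on slides 16 and 17 can be inspected from inside a running VM. The following is a minimal sketch, not taken from the talk: the module name `sched_info` and the map keys are made up for illustration, while `erlang:system_info/1` is the standard BIF.

```erlang
-module(sched_info).
-export([summary/0]).

%% Report how many scheduler threads the running VM has.
%% Module name and map keys are illustrative assumptions.
summary() ->
    #{smp_support       => erlang:system_info(smp_support),
      schedulers        => erlang:system_info(schedulers),
      schedulers_online => erlang:system_info(schedulers_online),
      %% may be the atom 'unknown' on platforms where it cannot be detected
      logical_cpus      => erlang:system_info(logical_processors)}.
```

The number of schedulers is normally chosen at startup, for example `erl +S 4` to start the VM with four scheduler threads.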
  18. Let M = msgs to random users. Let N = 100,000 users. Route M msgs to the right N users!
      typical one-node approach:
        for i to M
          for j to N
            if match, add_update
      actor approach:
        N concurrent processes listening to all msgs
        As a new msg arrives, msg pass to all N pids
        in each concurrent process: if match, add_update
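The actor approach on slide 18 might look roughly like the sketch below. It is an illustration under assumptions, not the speaker's code: the module name `router`, the `{To, Body}` message shape and the `inbox/1` helper are all invented here, and "add_update" is reduced to prepending to a list.

```erlang
-module(router).
-export([start/1, broadcast/2, inbox/1, user/2]).

%% Spawn one process per user id; each keeps its own list of matched messages.
start(UserIds) ->
    [spawn(?MODULE, user, [Id, []]) || Id <- UserIds].

%% Pass every incoming message to all user processes.
broadcast(Pids, Msg) ->
    [Pid ! {msg, Msg} || Pid <- Pids],
    ok.

%% Ask one user process what it has kept so far (hypothetical helper).
inbox(Pid) ->
    Pid ! {inbox, self()},
    receive {Pid, Kept} -> Kept after 1000 -> timeout end.

user(Id, Kept) ->
    receive
        {msg, {To, Body}} when To =:= Id ->
            user(Id, [Body | Kept]);            % match: add_update
        {msg, _Other} ->
            user(Id, Kept);                     % not addressed to this user
        {inbox, From} ->
            From ! {self(), lists:reverse(Kept)},
            user(Id, Kept)
    end.
```

Each user process does its own matching, so the nested loop of the one-node approach turns into N independent receive loops that the schedulers can spread across cores. The processes in this sketch never terminate; a real system would add a stop message or supervision.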
  19. 3 papers to rule them all & 1 garbage collection method to free them!
  20. 3 papers to rule them all & 1 garbage collection method to free them!
  21. 3 papers to rule them all & 1 garbage collection method to free them!
  22. EUREKA!!! we have a winner
  23. “ahh... so this is what no shared memory in erlang means: light weight processes being garbage collected easily since they don't have references to data in each other's process heap, & messages copied or shared based on their size, likelihood of reuse and also optimized for binary. tellmemore!!”
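One concrete case of "copied or shared based on size" from slide 23 is binaries: in the standard VM, binaries larger than 64 bytes live outside the process heap and are reference counted, so matching out a piece of one creates a small reference instead of copying bytes. The sketch below only illustrates that behaviour; the module name `bin_share` is an assumption, while `binary:copy/1,2` and `binary:referenced_byte_size/1` are standard library functions.

```erlang
-module(bin_share).
-export([demo/0]).

demo() ->
    Big = binary:copy(<<0>>, 1000),             % 1000 bytes, stored off-heap
    <<Head:10/binary, _/binary>> = Big,         % sub-binary: no bytes copied
    Shared = binary:referenced_byte_size(Head), % size of the binary Head points into
    Own    = binary:referenced_byte_size(binary:copy(Head)), % after an explicit copy
    {byte_size(Head), Shared, Own}.
```

The practical consequence is that holding on to a small slice keeps the whole large binary alive until the slice is collected or explicitly copied with `binary:copy/1`.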
  24. “How do you spawn a process?”
  25. “Where can you spawn a process?”
  26. “Can a spawned process talk back to the caller?”
  27. “Can a spawned process listen as long as i want it to?” “Can a spawned process stop listening when I want it to?” “Can a spawned process spawn more processes?”
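The questions on slides 24 to 27 can all be answered with a few lines of plain Erlang. This is a minimal sketch with made-up names (the `echo` module and its message shapes are assumptions), showing spawning, replying to the caller, listening in a loop, stopping, and spawning from a spawned process.

```erlang
-module(echo).
-export([start/0]).

start() ->
    Parent = self(),
    %% spawn/1 starts a process on the local node;
    %% spawn/4 (Node, Module, Function, Args) can start one on another connected node.
    Pid = spawn(fun loop/0),
    Pid ! {Parent, hello},                      % include our pid so it can talk back
    Reply = receive {Pid, R} -> R after 1000 -> timeout end,
    Pid ! {Parent, spawn_child},
    Child = receive {Pid, {child, C}} -> C after 1000 -> timeout end,
    Pid ! stop,                                 % stop listening when we want it to
    {Reply, is_pid(Child)}.

loop() ->
    receive
        {From, spawn_child} ->
            Child = spawn(fun() -> ok end),     % a spawned process can spawn more
            From ! {self(), {child, Child}},
            loop();
        {From, Msg} ->
            From ! {self(), {echo, Msg}},
            loop();                             % tail call: keeps listening
        stop ->
            ok                                  % no recursion, so the process ends
    end.
```

Calling `echo:start()` returns the echoed reply together with a flag confirming that the child pid came back.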
  28. “So though erlang gives a library called OTP & a db called mnesia for making life easier - you can parse or create binaries easily, make client-server programs, distributed rpc calls, tail-recursive servers, message/priority queues for flow control, talk to ports and other langs, or create any data structure explicitly (a) in-memory (b) on-disk of any connected node!”
  29. “show me the demo's”
      ● Process related
        – Message queues, Client-server, RPC, Timeouts
      ● Binary
        – Binary pattern matching, Parse swf/mp3 for metadata
        – Networking, comm. with C, Tokyocabinet client e.g.
      ● Process + Binary!
        – Building a production ready in-memory CDN consistently faster than Am4z0n cl0udfr0nt, in stages
          open & gzip < concat js's < inmemory < streaming?
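The demo code itself is not part of this transcript. As a hedged illustration of "parse mp3 for metadata" from slide 29, the sketch below reads an ID3v1 tag, which is the fixed 128-byte block at the end of an MP3 file beginning with the bytes "TAG". Choosing ID3v1 is an assumption (the talk may have parsed a different tag format), and the module name `id3v1` is invented.

```erlang
-module(id3v1).
-export([parse/1]).

%% ID3v1 layout: "TAG", title(30), artist(30), album(30), year(4),
%% comment(30), genre(1) = 128 bytes at the very end of the file.
parse(Mp3) when is_binary(Mp3), byte_size(Mp3) >= 128 ->
    Skip = byte_size(Mp3) - 128,
    case Mp3 of
        <<_:Skip/binary, "TAG", Title:30/binary, Artist:30/binary,
          Album:30/binary, Year:4/binary, _Comment:30/binary, _Genre:8>> ->
            {ok, #{title => trim(Title), artist => trim(Artist),
                   album => trim(Album), year => Year}};
        _ ->
            no_tag
    end;
parse(_) ->
    no_tag.

%% Fields are padded with zero bytes; keep only the text before the first one.
trim(Field) ->
    [Text | _] = binary:split(Field, <<0>>),
    Text.
```

With a file read via `file:read_file/1`, `id3v1:parse(Bin)` returns either `{ok, Map}` or `no_tag`. The whole layout is expressed as a single pattern, which is the point the slides make about binaries.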
  30. “Binary pattern matching ?”
      <<Value:Size/Type-Signedness-Endianness-unit:Unit>>
      <<1:32>> = <<0,0,0,1>>.
      <<1:32/unsigned-little>> = <<1,0,0,0>>.
      <<_:8, "mnesia">> = <<"Amnesia">>.
      So Bin in <<Bin/binary>> could be unicode characters (English, hindi, tamil) or JPGs or http headers or basically segments of binaries
      NewBinary = <<Segment1/binary, Segment2/binary>>.
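The segment syntax on slide 30 scales up to whole protocol headers. As a sketch of the IPv4 case mentioned in the abstract, the function below matches the fixed 20-byte header and skips any options; the module name `ipv4` and the shape of the returned map are assumptions made for illustration.

```erlang
-module(ipv4).
-export([parse/1]).

%% Field widths follow the IPv4 header layout; IHL counts 32-bit words.
parse(<<4:4, IHL:4, _TOS:8, TotalLen:16,
        _Id:16, _Flags:3, _FragOffset:13,
        TTL:8, Proto:8, _Checksum:16,
        SrcA, SrcB, SrcC, SrcD,
        DstA, DstB, DstC, DstD,
        Rest/binary>>) when IHL >= 5 ->
    OptLen = (IHL - 5) * 4,                     % option bytes beyond the fixed header
    case Rest of
        <<_Options:OptLen/binary, Payload/binary>> ->
            {ok, #{ttl => TTL, protocol => Proto, total_length => TotalLen,
                   src => {SrcA, SrcB, SrcC, SrcD},
                   dst => {DstA, DstB, DstC, DstD},
                   payload => Payload}};
        _ ->
            {error, truncated}
    end;
parse(_) ->
    {error, not_ipv4}.
```

Sub-byte fields such as the 4-bit version and the 13-bit fragment offset need no shifting or masking, because the sizes in the pattern are given in bits.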
  31. summary of tech at
      ● LYME stack since ~dec 07, 4 (-1) nodes (64bit 4GB)
      ● python crawler + associated NLP parsers, indexes now in tokyo cabinet, inverted indexes in erlang's mnesia db with binaries of 5 diff indian languages + multiple content-types, cpu time-splicing algos, priority queues for heat-seeking algo, flowcontrol, caching engines, cyclic queues, map-reduces with non-blocking gathers, headless-firefox for thumbnails, patches to tokyocabinet client 'medici'
      ● Beta in Jan 09, 1 million hovers/month in May '09
      ● 2-4 developers + several interns across ~2 years