Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Ad-hoc Big-Data Analysis with Lua

1,475 views

Published on

Industrial big-data analysis tools like Hadoop are often unwieldy, cumbersome to use and expensive to maintain, especially if you're new to the field. It is well and good if you have a cluster set up and a handy team of engineers to maintain it. But what if you're a researcher, alone with a big dataset, a Linux machine (or, better, several) and no clue how to do Java?

In this talk I am sharing a few handy tricks on how to quickly process and run preliminary analysis on large datasets with nothing more than LuaJIT and some shell magic.

Published in: Data & Analytics
  • Be the first to comment

Ad-hoc Big-Data Analysis with Lua

  1. 1. Ad-hoc Big-Data Analysis with Lua And LuaJIT Alexander Gladysh <ag@logiceditor.com> @agladysh Lua Workshop 2015 Stockholm 1 / 32
  2. 2. Outline Introduction The Problem A Solution Assumptions Examples The Tools Notes Questions? 2 / 32
  3. 3. Alexander Gladysh CTO, co-owner at LogicEditor In löve with Lua since 2005 3 / 32
  4. 4. The Problem You have a dataset to analyze, which is too large for "small-data" tools, and have no resources to setup and maintain (or pay for) the Hadoop, Google Big Query etc. but you have some processing power available. 4 / 32
  5. 5. Goal Pre-process the data so it can be handled by R or Excel or your favorite analytics tool (or Lua!). If the data is dynamic, then learn to pre-process it and build a data processing pipeline. 5 / 32
  6. 6. An Approach Use Lua! And (semi-)standard tools, available on Linux. Go minimalistic while exploring, avoid frameworks, Then move on to an industrial solution that fits your newly understood requirements, Or roll your own ecosystem! ;-) 6 / 32
  7. 7. Assumptions 7 / 32
  8. 8. Data Format Plain text Column-based (csv-like), optionally with free-form data in the end Typical example: web-server log files 8 / 32
  9. 9. Data Format Example: Raw Data 2015/10/15 16:35:30 [info] 14171#0: *901195 [lua] index:14: 95c1c06e626b47dfc705f8ee6695091a 109.74.197.145 *.example.com GET 123456.gif?q=0&step=0&ref= HTTP/1.1 example.com NB: This is a single, tab-separated line from a time-sorted file. 9 / 32
  10. 10. Data Format Example: Intermediate Data alpha.example.com 5 beta.example.com 7 gamma.example.com 1 NB: These are several tab-separated lines from a key-sorted file. 10 / 32
  11. 11. Hardware As usual, more is better: Cores, cache, memory speed and size, HDD speeds, networking speeds... But even a modest VM (or several) can be helpful. Your fancy gaming laptop is good too ;-) 11 / 32
  12. 12. OS Linux (Ubuntu) Server. This approach will, of course, work for other setups. 12 / 32
  13. 13. Filesystem Ideally, have data copies on each processing node, using identical layouts. Fast network should work too. 13 / 32
  14. 14. Examples 14 / 32
  15. 15. Bash Script Example time pv /path/to/uid-time-url-post.gz | pigz -cdp 4 | cut -d$’t’ -f 1,3 | parallel --gnu --progress -P 10 --pipe --block=16M $(cat <<"EOF" luajit ~me/url-to-normalized-domain.lua EOF ) | LC_ALL=C sort -u -t$’t’ -k2 --parallel 6 -S20% | luajit ~me/reduce-value-counter.lua | LC_ALL=C sort -t$’t’ -nrk2 --parallel 6 -S20% | pigz -cp4 >/path/to/domain-uniqs_count-merged.gz 15 / 32
  16. 16. Lua Script Example: url-to-normalized-domain.lua for l in io.lines() do local key, value = l:match("^([^t]+)t(.*)") if value then value = url_to_normalized_domain(value) end if key and value then io.write(key, "t", value, "n") end end 16 / 32
  17. 17. Lua Script Example: reduce-value-counter.lua 1/3 -- Assumes input sorted by VALUE -- a foo --> foo 3 -- a foo bar 2 -- b foo quo 1 -- a bar -- c bar -- d quo 17 / 32
  18. 18. Lua Script Example: reduce-value-counter.lua 2/3 local last_key = nil, accum = 0 local flush = function(key) if last_key then io.write(last_key, "t", accum, "n") end accum = 0 last_key = key -- may be nil end 18 / 32
  19. 19. Lua Script Example: reduce-value-counter.lua 3/3 for l in io.lines() do -- Note reverse order! local value, key = l:match("^(.-)t(.*)$") assert(key and value) if key ~= last_key then flush(key) collectgarbage("step") end accum = accum + 1 end flush() 19 / 32
  20. 20. Tying It All Together Basically: You work with sorted data, mapping and reducing it line-by-line, in parallel where at all possible, while trying to use as much of available hardware resources as practical, and without running out of memory. 20 / 32
  21. 21. The Tools 21 / 32
  22. 22. The Tools parallel sort, uniq, grep cut, join, comm pv compression utilities LuaJIT 22 / 32
  23. 23. LuaJIT? Up to a point: 2.1 helps to speed things up, FFI bogs down development speed. Go plain Lua first (run it with LuaJIT), then roll your own ecosystem as needed ;-) 23 / 32
  24. 24. Parallel xargs for parallel computation can run your jobs in parallel on a single machine or on a "cluster" 24 / 32
  25. 25. Compression gzip: default, bad lz4: fast, large files pigz: fast, parallelizable xz: good compression, slow ...and many more, be on lookout for new formats! 25 / 32
  26. 26. GNU sort Tricks LC_ALL=C sort -t$’t’ --parallel 4 -S60% -k3,3nr -k2,2 -k1,1nr Disable locale. Specify delimiter. Note that parallel x4 with 60% memory will consume 0.6 * log(4) = 120% of memory. When doing multi-key sort, specify parameters after key number. 26 / 32
  27. 27. grep http://stackoverflow.com/questions/9066609/fastest-possible-grep 27 / 32
  28. 28. Notes and Remarks 28 / 32
  29. 29. Why Lua? Perl, AWK are traditional alternatives to Lua, but, if you’re not very disciplined and experienced, they are much less maintainable. 29 / 32
  30. 30. Start Small! Always run your scripts on small representative excerpts from your datasets, not only while developing them locally, but on actual data-processing nodes too. Saves time and helps you learn the bottlenecks. Sometimes large run still blows in your face though: Monitor resource utilization at run-time. 30 / 32
  31. 31. Discipline! Many moving parts, large turn-around times, hard to keep tabs. Keep journal: Write down what you run and what time it took. Store actual versions of your scripts in a source control system. Don’t forget to sanity-check the results you get! 31 / 32
  32. 32. Questions? Alexander Gladysh, ag@logiceditor.com 32 / 32

×