SlideShare a Scribd company logo
1 of 32
Download to read offline
Ad-hoc Big-Data Analysis with Lua
And LuaJIT
Alexander Gladysh <ag@logiceditor.com>
@agladysh
Lua Workshop 2015
Stockholm
1 / 32
Outline
Introduction
The Problem
A Solution
Assumptions
Examples
The Tools
Notes
Questions?
2 / 32
Alexander Gladysh
CTO, co-owner at LogicEditor
In löve with Lua since 2005
3 / 32
The Problem
You have a dataset to analyze,
which is too large for "small-data" tools,
and have no resources to setup and maintain (or pay for) the
Hadoop, Google Big Query etc.
but you have some processing power available.
4 / 32
Goal
Pre-process the data so it can be handled by R or Excel or
your favorite analytics tool (or Lua!).
If the data is dynamic, then learn to pre-process it and build a
data processing pipeline.
5 / 32
An Approach
Use Lua!
And (semi-)standard tools, available on Linux.
Go minimalistic while exploring, avoid frameworks,
Then move on to an industrial solution that fits your newly
understood requirements,
Or roll your own ecosystem! ;-)
6 / 32
Assumptions
7 / 32
Data Format
Plain text
Column-based (csv-like), optionally with free-form data in the
end
Typical example: web-server log files
8 / 32
Data Format Example: Raw Data
2015/10/15 16:35:30 [info] 14171#0: *901195
[lua] index:14: 95c1c06e626b47dfc705f8ee6695091a
109.74.197.145 *.example.com
GET 123456.gif?q=0&step=0&ref= HTTP/1.1 example.com
NB: This is a single, tab-separated line from a time-sorted file.
9 / 32
Data Format Example: Intermediate Data
alpha.example.com 5
beta.example.com 7
gamma.example.com 1
NB: These are several tab-separated lines from a key-sorted file.
10 / 32
Hardware
As usual, more is better: Cores, cache, memory speed and
size, HDD speeds, networking speeds...
But even a modest VM (or several) can be helpful.
Your fancy gaming laptop is good too ;-)
11 / 32
OS
Linux (Ubuntu) Server.
This approach will, of course, work for other setups.
12 / 32
Filesystem
Ideally, have data copies on each processing node, using
identical layouts.
Fast network should work too.
13 / 32
Examples
14 / 32
Bash Script Example
time pv /path/to/uid-time-url-post.gz 
| pigz -cdp 4 
| cut -d$’t’ -f 1,3 
| parallel --gnu --progress -P 10 --pipe --block=16M 
$(cat <<"EOF"
luajit ~me/url-to-normalized-domain.lua
EOF
) 
| LC_ALL=C sort -u -t$’t’ -k2 --parallel 6 -S20% 
| luajit ~me/reduce-value-counter.lua 
| LC_ALL=C sort -t$’t’ -nrk2 --parallel 6 -S20% 
| pigz -cp4 >/path/to/domain-uniqs_count-merged.gz
15 / 32
Lua Script Example: url-to-normalized-domain.lua
for l in io.lines() do
local key, value = l:match("^([^t]+)t(.*)")
if value then
value = url_to_normalized_domain(value)
end
if key and value then
io.write(key, "t", value, "n")
end
end
16 / 32
Lua Script Example: reduce-value-counter.lua 1/3
-- Assumes input sorted by VALUE
-- a foo --> foo 3
-- a foo bar 2
-- b foo quo 1
-- a bar
-- c bar
-- d quo
17 / 32
Lua Script Example: reduce-value-counter.lua 2/3
local last_key = nil, accum = 0
local flush = function(key)
if last_key then
io.write(last_key, "t", accum, "n")
end
accum = 0
last_key = key -- may be nil
end
18 / 32
Lua Script Example: reduce-value-counter.lua 3/3
for l in io.lines() do
-- Note reverse order!
local value, key = l:match("^(.-)t(.*)$")
assert(key and value)
if key ~= last_key then
flush(key)
collectgarbage("step")
end
accum = accum + 1
end
flush()
19 / 32
Tying It All Together
Basically:
You work with sorted data,
mapping and reducing it line-by-line,
in parallel where at all possible,
while trying to use as much of available hardware resources as
practical,
and without running out of memory.
20 / 32
The Tools
21 / 32
The Tools
parallel
sort, uniq, grep
cut, join, comm
pv
compression utilities
LuaJIT
22 / 32
LuaJIT?
Up to a point:
2.1 helps to speed things up,
FFI bogs down development speed.
Go plain Lua first (run it with LuaJIT),
then roll your own ecosystem as needed ;-)
23 / 32
Parallel
xargs for parallel computation
can run your jobs in parallel on a single machine
or on a "cluster"
24 / 32
Compression
gzip: default, bad
lz4: fast, large files
pigz: fast, parallelizable
xz: good compression, slow
...and many more,
be on lookout for new formats!
25 / 32
GNU sort Tricks
LC_ALL=C 
sort -t$’t’ --parallel 4 -S60% 
-k3,3nr -k2,2 -k1,1nr
Disable locale.
Specify delimiter.
Note that parallel x4 with 60% memory will consume 0.6 *
log(4) = 120% of memory.
When doing multi-key sort, specify parameters after key
number.
26 / 32
grep
http://stackoverflow.com/questions/9066609/fastest-possible-grep
27 / 32
Notes and Remarks
28 / 32
Why Lua?
Perl, AWK are traditional alternatives to Lua, but, if you’re not
very disciplined and experienced, they are much less maintainable.
29 / 32
Start Small!
Always run your scripts on small representative excerpts from
your datasets, not only while developing them locally, but on
actual data-processing nodes too.
Saves time and helps you learn the bottlenecks.
Sometimes large run still blows in your face though:
Monitor resource utilization at run-time.
30 / 32
Discipline!
Many moving parts, large turn-around times, hard to keep tabs.
Keep journal: Write down what you run and what time it took.
Store actual versions of your scripts in a source control system.
Don’t forget to sanity-check the results you get!
31 / 32
Questions?
Alexander Gladysh, ag@logiceditor.com
32 / 32

More Related Content

What's hot

Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data AnalyticsSupersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analyticsmason_s
 
Key-Value-Stores -- The Key to Scaling?
Key-Value-Stores -- The Key to Scaling?Key-Value-Stores -- The Key to Scaling?
Key-Value-Stores -- The Key to Scaling?Tim Lossen
 
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Hadoop User Group
 
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan PachenkoPGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan PachenkoEqunix Business Solutions
 
Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Linaro
 
Koichi Suzuki - Postgres-XC Dynamic Cluster Management @ Postgres Open
Koichi Suzuki - Postgres-XC Dynamic Cluster  Management @ Postgres OpenKoichi Suzuki - Postgres-XC Dynamic Cluster  Management @ Postgres Open
Koichi Suzuki - Postgres-XC Dynamic Cluster Management @ Postgres OpenPostgresOpen
 
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScyllaDB
 
BKK16-409 VOSY Switch Port to ARMv8 Platforms and ODP Integration
BKK16-409 VOSY Switch Port to ARMv8 Platforms and ODP IntegrationBKK16-409 VOSY Switch Port to ARMv8 Platforms and ODP Integration
BKK16-409 VOSY Switch Port to ARMv8 Platforms and ODP IntegrationLinaro
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkkbajda
 
BKK16-505 Kernel and Bootloader Consolidation and Upstreaming
BKK16-505 Kernel and Bootloader Consolidation and UpstreamingBKK16-505 Kernel and Bootloader Consolidation and Upstreaming
BKK16-505 Kernel and Bootloader Consolidation and UpstreamingLinaro
 
Ostd.ksplice.talk
Ostd.ksplice.talkOstd.ksplice.talk
Ostd.ksplice.talkUdo Seidel
 
In-memory OLTP storage with persistence and transaction support
In-memory OLTP storage with persistence and transaction supportIn-memory OLTP storage with persistence and transaction support
In-memory OLTP storage with persistence and transaction supportAlexander Korotkov
 
Redis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech TalkRedis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech TalkRed Hat Developers
 
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergen
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike SteenbergenMeet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergen
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergendistributed matters
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDBSage Weil
 
Boosting Machine Learning with Redis Modules and Spark
Boosting Machine Learning with Redis Modules and SparkBoosting Machine Learning with Redis Modules and Spark
Boosting Machine Learning with Redis Modules and SparkDvir Volk
 

What's hot (20)

kpatch.kgraft
kpatch.kgraftkpatch.kgraft
kpatch.kgraft
 
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data AnalyticsSupersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
 
Key-Value-Stores -- The Key to Scaling?
Key-Value-Stores -- The Key to Scaling?Key-Value-Stores -- The Key to Scaling?
Key-Value-Stores -- The Key to Scaling?
 
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
 
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan PachenkoPGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
 
Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509
 
Koichi Suzuki - Postgres-XC Dynamic Cluster Management @ Postgres Open
Koichi Suzuki - Postgres-XC Dynamic Cluster  Management @ Postgres OpenKoichi Suzuki - Postgres-XC Dynamic Cluster  Management @ Postgres Open
Koichi Suzuki - Postgres-XC Dynamic Cluster Management @ Postgres Open
 
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
 
BKK16-409 VOSY Switch Port to ARMv8 Platforms and ODP Integration
BKK16-409 VOSY Switch Port to ARMv8 Platforms and ODP IntegrationBKK16-409 VOSY Switch Port to ARMv8 Platforms and ODP Integration
BKK16-409 VOSY Switch Port to ARMv8 Platforms and ODP Integration
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talk
 
RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
 
Foss Gadgematics
Foss GadgematicsFoss Gadgematics
Foss Gadgematics
 
BKK16-505 Kernel and Bootloader Consolidation and Upstreaming
BKK16-505 Kernel and Bootloader Consolidation and UpstreamingBKK16-505 Kernel and Bootloader Consolidation and Upstreaming
BKK16-505 Kernel and Bootloader Consolidation and Upstreaming
 
Ostd.ksplice.talk
Ostd.ksplice.talkOstd.ksplice.talk
Ostd.ksplice.talk
 
In-memory OLTP storage with persistence and transaction support
In-memory OLTP storage with persistence and transaction supportIn-memory OLTP storage with persistence and transaction support
In-memory OLTP storage with persistence and transaction support
 
librados
libradoslibrados
librados
 
Redis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech TalkRedis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech Talk
 
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergen
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike SteenbergenMeet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergen
Meet Spilo, Zalando’s HIGH-AVAILABLE POSTGRESQL CLUSTER - Feike Steenbergen
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
 
Boosting Machine Learning with Redis Modules and Spark
Boosting Machine Learning with Redis Modules and SparkBoosting Machine Learning with Redis Modules and Spark
Boosting Machine Learning with Redis Modules and Spark
 

Viewers also liked

Nurit Leshem Studio
Nurit Leshem StudioNurit Leshem Studio
Nurit Leshem StudioNurit Leshem
 
Quick functional UI sketches with Lua templates and mermaid.js
Quick functional UI sketches with Lua templates and mermaid.jsQuick functional UI sketches with Lua templates and mermaid.js
Quick functional UI sketches with Lua templates and mermaid.jsAlexander Gladysh
 
Design Yo' Self : Why should designers care about Open Source Software?
Design Yo' Self : Why should designers care about Open Source Software?Design Yo' Self : Why should designers care about Open Source Software?
Design Yo' Self : Why should designers care about Open Source Software?celinecelines semaan vernon
 
Course outline
Course outlineCourse outline
Course outlinecocolatto
 
Partner Training: Nonprofit Industry
Partner Training: Nonprofit IndustryPartner Training: Nonprofit Industry
Partner Training: Nonprofit IndustryGrace Dunlap
 
Rekapitulasi 2010 06 28
Rekapitulasi 2010 06 28Rekapitulasi 2010 06 28
Rekapitulasi 2010 06 28bramanian
 
Who Are You? Branding Your Nonprofit
Who Are You? Branding Your NonprofitWho Are You? Branding Your Nonprofit
Who Are You? Branding Your NonprofitGrace Dunlap
 
ABA Information- and Communications Technologies in Austria
ABA Information- and Communications Technologies in AustriaABA Information- and Communications Technologies in Austria
ABA Information- and Communications Technologies in AustriaABA - Invest in Austria
 
网页制作基础
网页制作基础网页制作基础
网页制作基础loo2k
 
Presentation2
Presentation2Presentation2
Presentation2cocolatto
 
Presentation1
Presentation1Presentation1
Presentation1cocolatto
 
Learning in the 21st century a national report of online learning
Learning in the 21st century a national report of online learningLearning in the 21st century a national report of online learning
Learning in the 21st century a national report of online learningtspicuzza
 
Open-V
Open-VOpen-V
Open-VOPEN-V
 

Viewers also liked (20)

Nurit Leshem Studio
Nurit Leshem StudioNurit Leshem Studio
Nurit Leshem Studio
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
 
Quick functional UI sketches with Lua templates and mermaid.js
Quick functional UI sketches with Lua templates and mermaid.jsQuick functional UI sketches with Lua templates and mermaid.js
Quick functional UI sketches with Lua templates and mermaid.js
 
CRIS and Institutional Repository Integration: Standardising Open Access
CRIS and Institutional Repository Integration: Standardising Open AccessCRIS and Institutional Repository Integration: Standardising Open Access
CRIS and Institutional Repository Integration: Standardising Open Access
 
Design Yo' Self : Why should designers care about Open Source Software?
Design Yo' Self : Why should designers care about Open Source Software?Design Yo' Self : Why should designers care about Open Source Software?
Design Yo' Self : Why should designers care about Open Source Software?
 
Course outline
Course outlineCourse outline
Course outline
 
PRESIMETRICS
PRESIMETRICSPRESIMETRICS
PRESIMETRICS
 
Partner Training: Nonprofit Industry
Partner Training: Nonprofit IndustryPartner Training: Nonprofit Industry
Partner Training: Nonprofit Industry
 
Rekapitulasi 2010 06 28
Rekapitulasi 2010 06 28Rekapitulasi 2010 06 28
Rekapitulasi 2010 06 28
 
Meet Christian Farioli
Meet Christian FarioliMeet Christian Farioli
Meet Christian Farioli
 
Who Are You? Branding Your Nonprofit
Who Are You? Branding Your NonprofitWho Are You? Branding Your Nonprofit
Who Are You? Branding Your Nonprofit
 
Caesar's quotes
Caesar's quotesCaesar's quotes
Caesar's quotes
 
ABA Information- and Communications Technologies in Austria
ABA Information- and Communications Technologies in AustriaABA Information- and Communications Technologies in Austria
ABA Information- and Communications Technologies in Austria
 
网页制作基础
网页制作基础网页制作基础
网页制作基础
 
Presentation2
Presentation2Presentation2
Presentation2
 
Presentation1
Presentation1Presentation1
Presentation1
 
L298 h
L298 hL298 h
L298 h
 
Learning in the 21st century a national report of online learning
Learning in the 21st century a national report of online learningLearning in the 21st century a national report of online learning
Learning in the 21st century a national report of online learning
 
Open-V
Open-VOpen-V
Open-V
 
Social media in schools
Social media in schoolsSocial media in schools
Social media in schools
 

Similar to Ad-hoc Big-Data Analysis with Lua

11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance ComputersDave Hiltbrand
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurSiddharth Mathur
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xrkr10
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Introduction to Docker (as presented at December 2013 Global Hackathon)
Introduction to Docker (as presented at December 2013 Global Hackathon)Introduction to Docker (as presented at December 2013 Global Hackathon)
Introduction to Docker (as presented at December 2013 Global Hackathon)Jérôme Petazzoni
 
Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyOlivier Bourgeois
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Towards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingTowards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingSaliya Ekanayake
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentationAmrut Patil
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurSiddharth Mathur
 
Java in containers
Java in containersJava in containers
Java in containersMartin Baez
 

Similar to Ad-hoc Big-Data Analysis with Lua (20)

Training
TrainingTraining
Training
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
2014 hadoop wrocław jug
2014 hadoop   wrocław jug2014 hadoop   wrocław jug
2014 hadoop wrocław jug
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Introduction to Docker (as presented at December 2013 Global Hackathon)
Introduction to Docker (as presented at December 2013 Global Hackathon)Introduction to Docker (as presented at December 2013 Global Hackathon)
Introduction to Docker (as presented at December 2013 Global Hackathon)
 
Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development Efficiency
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Towards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingTowards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and Benchmarking
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentation
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
 
Java in containers
Java in containersJava in containers
Java in containers
 

Recently uploaded

concept of soil quality & soil health.pptx
concept of soil quality & soil health.pptxconcept of soil quality & soil health.pptx
concept of soil quality & soil health.pptxpranavmishrafzd
 
MANAGING RESOURCES FOR BUSINESS ANALYTICS BA4206 ANNA UNIVERSITY
MANAGING RESOURCES FOR BUSINESS ANALYTICS BA4206 ANNA UNIVERSITYMANAGING RESOURCES FOR BUSINESS ANALYTICS BA4206 ANNA UNIVERSITY
MANAGING RESOURCES FOR BUSINESS ANALYTICS BA4206 ANNA UNIVERSITYFreelance
 
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdfRabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdfNeo4j
 
testingsdadadadaaddadadadadadadadaad.pdf
testingsdadadadaaddadadadadadadadaad.pdftestingsdadadadaaddadadadadadadadaad.pdf
testingsdadadadaaddadadadadadadadaad.pdfDSP Mutual Fund
 
Báo cáo Connected Consumer Quý 4 năm 2023
Báo cáo Connected Consumer Quý 4 năm 2023Báo cáo Connected Consumer Quý 4 năm 2023
Báo cáo Connected Consumer Quý 4 năm 2023MarketingTrips
 
Introductio to Data Science and types of data
Introductio to Data Science and types of dataIntroductio to Data Science and types of data
Introductio to Data Science and types of dataManishaPatil932723
 
Data Discovery With Power Query in excel
Data Discovery With Power Query in excelData Discovery With Power Query in excel
Data Discovery With Power Query in excelKapilSidhpuria3
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
prediction of default payment next month using a logistic approach
prediction of default payment next month using a logistic approachprediction of default payment next month using a logistic approach
prediction of default payment next month using a logistic approachAdekunleJoseph4
 
Film cover research.pptx for media courseowrk
Film cover research.pptx for media courseowrkFilm cover research.pptx for media courseowrk
Film cover research.pptx for media courseowrk494f574xmv
 
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsbaAdobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsbas73678sri
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
Unlocking Anticipatory Text Generation- A Constrained Approach for Large Lan...
Unlocking Anticipatory Text Generation-  A Constrained Approach for Large Lan...Unlocking Anticipatory Text Generation-  A Constrained Approach for Large Lan...
Unlocking Anticipatory Text Generation- A Constrained Approach for Large Lan...Ingeol Baek
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
INTRODUCTION TO BUSINESS ANALYTICS BA4206 ANNA UNIVERSITY
INTRODUCTION TO BUSINESS ANALYTICS BA4206 ANNA UNIVERSITYINTRODUCTION TO BUSINESS ANALYTICS BA4206 ANNA UNIVERSITY
INTRODUCTION TO BUSINESS ANALYTICS BA4206 ANNA UNIVERSITYFreelance
 
Adobe Scan 06-Mar-2024 (1).pdf shavashwvw
Adobe Scan 06-Mar-2024 (1).pdf shavashwvwAdobe Scan 06-Mar-2024 (1).pdf shavashwvw
Adobe Scan 06-Mar-2024 (1).pdf shavashwvws73678sri
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...ThinkInnovation
 
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdfNeo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdfNeo4j
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 

Recently uploaded (20)

concept of soil quality & soil health.pptx
concept of soil quality & soil health.pptxconcept of soil quality & soil health.pptx
concept of soil quality & soil health.pptx
 
MANAGING RESOURCES FOR BUSINESS ANALYTICS BA4206 ANNA UNIVERSITY
MANAGING RESOURCES FOR BUSINESS ANALYTICS BA4206 ANNA UNIVERSITYMANAGING RESOURCES FOR BUSINESS ANALYTICS BA4206 ANNA UNIVERSITY
MANAGING RESOURCES FOR BUSINESS ANALYTICS BA4206 ANNA UNIVERSITY
 
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdfRabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
 
testingsdadadadaaddadadadadadadadaad.pdf
testingsdadadadaaddadadadadadadadaad.pdftestingsdadadadaaddadadadadadadadaad.pdf
testingsdadadadaaddadadadadadadadaad.pdf
 
Báo cáo Connected Consumer Quý 4 năm 2023
Báo cáo Connected Consumer Quý 4 năm 2023Báo cáo Connected Consumer Quý 4 năm 2023
Báo cáo Connected Consumer Quý 4 năm 2023
 
Introductio to Data Science and types of data
Introductio to Data Science and types of dataIntroductio to Data Science and types of data
Introductio to Data Science and types of data
 
Data Discovery With Power Query in excel
Data Discovery With Power Query in excelData Discovery With Power Query in excel
Data Discovery With Power Query in excel
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
prediction of default payment next month using a logistic approach
prediction of default payment next month using a logistic approachprediction of default payment next month using a logistic approach
prediction of default payment next month using a logistic approach
 
Film cover research.pptx for media courseowrk
Film cover research.pptx for media courseowrkFilm cover research.pptx for media courseowrk
Film cover research.pptx for media courseowrk
 
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsbaAdobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
Unlocking Anticipatory Text Generation- A Constrained Approach for Large Lan...
Unlocking Anticipatory Text Generation-  A Constrained Approach for Large Lan...Unlocking Anticipatory Text Generation-  A Constrained Approach for Large Lan...
Unlocking Anticipatory Text Generation- A Constrained Approach for Large Lan...
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
INTRODUCTION TO BUSINESS ANALYTICS BA4206 ANNA UNIVERSITY
INTRODUCTION TO BUSINESS ANALYTICS BA4206 ANNA UNIVERSITYINTRODUCTION TO BUSINESS ANALYTICS BA4206 ANNA UNIVERSITY
INTRODUCTION TO BUSINESS ANALYTICS BA4206 ANNA UNIVERSITY
 
Adobe Scan 06-Mar-2024 (1).pdf shavashwvw
Adobe Scan 06-Mar-2024 (1).pdf shavashwvwAdobe Scan 06-Mar-2024 (1).pdf shavashwvw
Adobe Scan 06-Mar-2024 (1).pdf shavashwvw
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdfNeo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 

Ad-hoc Big-Data Analysis with Lua

  • 1. Ad-hoc Big-Data Analysis with Lua And LuaJIT Alexander Gladysh <ag@logiceditor.com> @agladysh Lua Workshop 2015 Stockholm 1 / 32
  • 3. Alexander Gladysh CTO, co-owner at LogicEditor In löve with Lua since 2005 3 / 32
  • 4. The Problem You have a dataset to analyze, which is too large for "small-data" tools, and have no resources to setup and maintain (or pay for) the Hadoop, Google Big Query etc. but you have some processing power available. 4 / 32
  • 5. Goal Pre-process the data so it can be handled by R or Excel or your favorite analytics tool (or Lua!). If the data is dynamic, then learn to pre-process it and build a data processing pipeline. 5 / 32
  • 6. An Approach Use Lua! And (semi-)standard tools, available on Linux. Go minimalistic while exploring, avoid frameworks, Then move on to an industrial solution that fits your newly understood requirements, Or roll your own ecosystem! ;-) 6 / 32
  • 8. Data Format Plain text Column-based (csv-like), optionally with free-form data in the end Typical example: web-server log files 8 / 32
  • 9. Data Format Example: Raw Data 2015/10/15 16:35:30 [info] 14171#0: *901195 [lua] index:14: 95c1c06e626b47dfc705f8ee6695091a 109.74.197.145 *.example.com GET 123456.gif?q=0&step=0&ref= HTTP/1.1 example.com NB: This is a single, tab-separated line from a time-sorted file. 9 / 32
  • 10. Data Format Example: Intermediate Data alpha.example.com 5 beta.example.com 7 gamma.example.com 1 NB: These are several tab-separated lines from a key-sorted file. 10 / 32
  • 11. Hardware As usual, more is better: Cores, cache, memory speed and size, HDD speeds, networking speeds... But even a modest VM (or several) can be helpful. Your fancy gaming laptop is good too ;-) 11 / 32
  • 12. OS Linux (Ubuntu) Server. This approach will, of course, work for other setups. 12 / 32
  • 13. Filesystem Ideally, have data copies on each processing node, using identical layouts. Fast network should work too. 13 / 32
  • 15. Bash Script Example time pv /path/to/uid-time-url-post.gz | pigz -cdp 4 | cut -d$’t’ -f 1,3 | parallel --gnu --progress -P 10 --pipe --block=16M $(cat <<"EOF" luajit ~me/url-to-normalized-domain.lua EOF ) | LC_ALL=C sort -u -t$’t’ -k2 --parallel 6 -S20% | luajit ~me/reduce-value-counter.lua | LC_ALL=C sort -t$’t’ -nrk2 --parallel 6 -S20% | pigz -cp4 >/path/to/domain-uniqs_count-merged.gz 15 / 32
  • 16. Lua Script Example: url-to-normalized-domain.lua for l in io.lines() do local key, value = l:match("^([^t]+)t(.*)") if value then value = url_to_normalized_domain(value) end if key and value then io.write(key, "t", value, "n") end end 16 / 32
  • 17. Lua Script Example: reduce-value-counter.lua 1/3 -- Assumes input sorted by VALUE -- a foo --> foo 3 -- a foo bar 2 -- b foo quo 1 -- a bar -- c bar -- d quo 17 / 32
  • 18. Lua Script Example: reduce-value-counter.lua 2/3 local last_key = nil, accum = 0 local flush = function(key) if last_key then io.write(last_key, "t", accum, "n") end accum = 0 last_key = key -- may be nil end 18 / 32
  • 19. Lua Script Example: reduce-value-counter.lua 3/3 for l in io.lines() do -- Note reverse order! local value, key = l:match("^(.-)t(.*)$") assert(key and value) if key ~= last_key then flush(key) collectgarbage("step") end accum = accum + 1 end flush() 19 / 32
  • 20. Tying It All Together Basically: You work with sorted data, mapping and reducing it line-by-line, in parallel where at all possible, while trying to use as much of available hardware resources as practical, and without running out of memory. 20 / 32
  • 22. The Tools parallel sort, uniq, grep cut, join, comm pv compression utilities LuaJIT 22 / 32
  • 23. LuaJIT? Up to a point: 2.1 helps to speed things up, FFI bogs down development speed. Go plain Lua first (run it with LuaJIT), then roll your own ecosystem as needed ;-) 23 / 32
  • 24. Parallel xargs for parallel computation can run your jobs in parallel on a single machine or on a "cluster" 24 / 32
  • 25. Compression gzip: default, bad lz4: fast, large files pigz: fast, parallelizable xz: good compression, slow ...and many more, be on lookout for new formats! 25 / 32
  • 26. GNU sort Tricks LC_ALL=C sort -t$’t’ --parallel 4 -S60% -k3,3nr -k2,2 -k1,1nr Disable locale. Specify delimiter. Note that parallel x4 with 60% memory will consume 0.6 * log(4) = 120% of memory. When doing multi-key sort, specify parameters after key number. 26 / 32
  • 29. Why Lua? Perl, AWK are traditional alternatives to Lua, but, if you’re not very disciplined and experienced, they are much less maintainable. 29 / 32
  • 30. Start Small! Always run your scripts on small representative excerpts from your datasets, not only while developing them locally, but on actual data-processing nodes too. Saves time and helps you learn the bottlenecks. Sometimes large run still blows in your face though: Monitor resource utilization at run-time. 30 / 32
  • 31. Discipline! Many moving parts, large turn-around times, hard to keep tabs. Keep journal: Write down what you run and what time it took. Store actual versions of your scripts in a source control system. Don’t forget to sanity-check the results you get! 31 / 32