Open Source Tools for Big Data

OPEN SOURCE TOOLS FOR BIG DATA
Helsinki 19.9.2017
Teemu Heikkilä
Emblica

EMBLICA
We’re a super small company of 5 people
We’re into Data Engineering,
DevOps and ML
We’re hiring!

Let’s start with something simple first

…but it still has a meaning.

We are not really at Facebook scale,
so is it worth talking about big data
tools?

Because what works with petabytes of
data almost certainly works with
gigabytes

Helsinki City bike station usage
17M rows of JSON

You will get:
Fault tolerance, reliability, scalability and proven
models for processing any amount of data

… but it doesn't necessarily mean you need
fancy frameworks

History of data processing
with free software

HISTORY OF HADOOP
1997 - START
Doug Cutting started to develop the first version of Lucene.
2001 - OPEN SOURCED
Cutting open sourced Lucene and it was moved under the Apache Foundation.
Mike Cafarella joined Cutting to start Apache Nutch, a project to index the whole internet.
2003 - GFS
Google published a whitepaper about solving the storage problems of web indexing.
Cafarella and Cutting implemented the whitepaper as part of the Nutch project.
2006 - HADOOP
Cutting, by then at Yahoo!, moved the NDFS and MapReduce related codebase under a new project called Hadoop.
NOW

Ideas (whitepapers) -> FOSS implementations
DFS -> HDFS
MR -> Hadoop MR
BigTable -> HBase
Dynamo -> Cassandra

ACTIVITY DATA
Clickstreams
App usage
Application specific usage
Music listening
Video streaming
Money usage
Credit cards
Transactions

SENSOR DATA
Locations
Spatial data
Sensor metrics
IoT devices
Industrial and consumer
Time series

UNSTRUCTURED DATA
Machine logs,
Unstructured text,
natural language
Sound, Photos, Video

Use cases
What are you using those fancy logos for?

CASE 1: EVENT SOURCING SQL-DATABASES
Working legacy systems that used a
MySQL database as real-time data
storage.
No historical data saved ever.
Delete means delete
Update means update
We could touch the legacy code to
save the changes
But we don’t have to

Maxwell’s daemon
Reads MySQL replication binary log
Produces stream of JSON-formatted changes
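
A minimal sketch of what reading such a change stream could look like. The sample lines follow the general shape of Maxwell's JSON output (`database`, `table`, `type`, `ts`, `data`, plus `old` on updates); the `shop.users` table, its columns and the `describe_change` helper are hypothetical and only here for illustration.

```python
import json

# Hypothetical sample lines, shaped like Maxwell's JSON output.
SAMPLE_STREAM = """\
{"database": "shop", "table": "users", "type": "insert", "ts": 1505800000, "data": {"id": 1, "name": "Ada"}}
{"database": "shop", "table": "users", "type": "update", "ts": 1505800060, "data": {"id": 1, "name": "Ada L."}, "old": {"name": "Ada"}}
{"database": "shop", "table": "users", "type": "delete", "ts": 1505800120, "data": {"id": 1, "name": "Ada L."}}
"""

def parse_changes(lines):
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def describe_change(change):
    """Summarize one change event as a short human-readable string."""
    target = f'{change["database"]}.{change["table"]}'
    if change["type"] == "update":
        # "old" carries the previous values of the columns that changed.
        changed = sorted(change.get("old", {}))
        return f'update {target} id={change["data"]["id"]} changed={changed}'
    return f'{change["type"]} {target} id={change["data"]["id"]}'

# The history is append-only: the delete is just one more recorded event.
history = list(parse_changes(SAMPLE_STREAM.splitlines()))
summaries = [describe_change(c) for c in history]
```

Even though the row is gone from MySQL after the last event, all three changes remain in `history`, which is the point of capturing the binary log instead of touching the legacy code.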

KAFKA - DISTRIBUTED APPEND-ONLY LOG
Kafka was originally developed at
LinkedIn and open sourced in 2011
Distributed, append-only log
Great tool for reliably delivering
millions of arbitrarily formatted
messages
Scales by partitioning and adding new
nodes
(c) Ch.ko123 / CC BY 4.0
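
To make "append-only" and "scales by partitioning" concrete, here is a toy in-memory model. It is not Kafka's client API: the `PartitionedLog` class and the CRC32-based key hashing are assumptions made for this sketch, and real Kafka clients ship their own partitioner.

```python
import zlib

class PartitionedLog:
    """Toy in-memory stand-in for a partitioned, append-only log (not Kafka's API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash, so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a record and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset=0):
        """Read records from an offset onward; reading never removes anything."""
        return self.partitions[partition][offset:]

log = PartitionedLog(num_partitions=3)
first = log.append("station-17", "bike taken")
second = log.append("station-17", "bike returned")
```

Records with the same key land in the same partition in order, and consumers only keep track of an offset, so several readers can replay the same data independently.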

KAPPA ARCHITECTURE
+ Fast writes (queue/log)
+ Fast reads (in-memory)
- Latency
- Reliable event delivery is essential
(c) Apache Spark

MATERIALIZING EVENT SOURCES
Change stream -> Materialized ‘User’ table
Change stream -> Materialized ‘Resource’ table
Change stream -> Materialized ‘Usage’ table
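
Materializing a table from its change stream is essentially a fold over the events. A minimal sketch, assuming each change is a dict with a `type` (`insert`, `update` or `delete`) and a `data` row, similar to what Maxwell emits; the `materialize` function and the sample rows are hypothetical.

```python
def materialize(changes, key="id"):
    """Fold a change stream into the current state of one table."""
    table = {}
    for change in changes:
        row = change["data"]
        if change["type"] in ("insert", "update"):
            table[row[key]] = row       # upsert the latest version of the row
        elif change["type"] == "delete":
            table.pop(row[key], None)   # remove the row from the view only
    return table

user_changes = [
    {"type": "insert", "data": {"id": 1, "name": "Ada"}},
    {"type": "insert", "data": {"id": 2, "name": "Linus"}},
    {"type": "update", "data": {"id": 1, "name": "Ada L."}},
    {"type": "delete", "data": {"id": 2, "name": "Linus"}},
]

users_now = materialize(user_changes)          # state after all changes
users_earlier = materialize(user_changes[:2])  # state as of the second change
```

Because the stream itself is kept, replaying a prefix of it gives the table as it looked at an earlier point in time, which the original database could not offer.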

APACHE SPARK
Originally developed at the University
of California, Berkeley's AMPLab
General large-scale data processing
framework
Builds on the MapReduce model but
keeps intermediate results in memory
instead of writing them to slow disks
as Hadoop does
Supports lots of different data
sources
Programming APIs for Scala, Java or
Python
(c) Ch.ko123 / CC BY 4.0

EKS-STACK
Elasticsearch is based on Lucene, but
it’s more than just a search engine: it
can be used to provide real-time
analytics even for end users. It’s
usually used to store the aggregated
data
Kibana is a great tool for developers
and for internal use to discover and
analyze the data lying inside ES
Spark is used to process the events,
produce the needed aggregates and
ingest data into Elasticsearch so it can
be queried

CASE 2: EVERY ANALYTICS PIPELINE EVER
User agent -> Events -> Event Collector -> Processing -> Analytics

Event + State -> New State (Session)

New session: 
started 07:17:09, duration 0s, OPEN
Existing session: 
Existing session: 
Existing session:
started 07:17:09, duration 14s,
paused 07:17:23, CLOSED
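
The session example above can be sketched as a pure function from (previous state, event) to new state. This is an illustration under assumptions, not the code behind the slides: the `play` and `pause` event types, the field names and the `update_session` helper are hypothetical.

```python
from datetime import datetime

def update_session(state, event):
    """Combine the previous session state (or None) with one event."""
    ts = datetime.strptime(event["time"], "%H:%M:%S")
    if state is None:
        # The first event for this key opens a new session.
        return {"started": ts, "duration_s": 0, "paused": None, "status": "OPEN"}
    new_state = dict(state)  # copy, so the previous state is left untouched
    new_state["duration_s"] = int((ts - state["started"]).total_seconds())
    if event["type"] == "pause":
        new_state["paused"] = ts
        new_state["status"] = "CLOSED"
    return new_state

events = [
    {"type": "play", "time": "07:17:09"},
    {"type": "pause", "time": "07:17:23"},
]

state = None
for event in events:
    state = update_session(state, event)
```

A stream processor would keep one such state per session key and call the update function for every incoming event.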

You can find me at:
@theikkilap
teemu@emblica.fi
https://emblica.fi
Any questions?
Thanks!
Icons from Font Awesome project
