Ledion Bitincka from Splunk spoke at the AWS Big Data Meetup in Palo Alto, gave an overview of Splunk’s processing pipeline topology, and explained their approach to indexing data at scale.
Splunk talk at the AWS Big Data Meetup in Palo Alto on Nov 17 2015
1. Data Through Splunk
Ledion Bitincka (ledion@splunk.com)
Alex Batsakis (abatsakis@splunk.com)
Architects
2. Spelunking: to explore underground caves
Splunking: to explore machine data
Splunk: Make machine data accessible, usable and valuable to everyone.
3. What Does Machine Data Look Like?
Sources: Twitter, Care IVR, Middleware Error, Order Processing
4. Machine Data Contains Critical Insights
Sources: Twitter, Care IVR, Middleware Error, Order Processing
Fields called out: Customer ID, Order ID, Customer’s Tweet, Time Waiting On Hold, Twitter ID, Product ID, Company’s Twitter ID
5. Machine Data Contains Critical Insights
Sources: Twitter, Care IVR, Middleware Error, Order Processing
Fields called out: Order ID, Customer’s Tweet, Time Waiting On Hold, Product ID, Company’s Twitter ID, Customer ID, Twitter ID
6. Search, Investigate and Explore Your Data
Find and fix issues and incidents dramatically faster across your organization
Data sources: Energy, Manufacturing, Shipping, RFID, Web Services, Developers, App Support, Telecoms, Networking, Desktops, Servers, Security, Databases/DWH, Storage, Messaging, Online Shopping Carts, Clickstream, GPS/Cellular, Social Media
7. Turning Machine Data into Operational Intelligence
Search and Investigate
Proactive Monitoring and Alerting
Operational Visibility
Real-time Business Insight
Axis: Reactive to Proactive
9. Massive Linear Scalability to 100s of TBs/Day
Auto load-balanced forwarding to as many Splunk Indexers as you need to index TB/day
Offload search load to Splunk Search Heads
11. Consider this chunk of data from a log file: /var/log/secure.log
...
2013/07/01T14:30:24.234-0400 Brian pretends to be from South Africa
2013/07/01T14:31:24.234-0400 Sean is originally Canadian
2013/07/01T14:30:50.234-0400 Brian spends his time in:
  - Kentucky with phone number 345.567.3456
  - New Jersey
2013/07/01T14:32:24.234-0400 Matty has lived in the following cities:
  - Tijuana: 345 Main St.
  - Saskatchewan: 3 One Lane
  - Colombia: 567 White line Dr. Bogota
2013/07/01T14:33:24.234-0400 Cesar prefers Burbon Manhattans over beer
2013/07/01T14:33:24.234-0400 Matty loves GiGi Mellow Burgers
2013/07/01T14:33:24.234-0400 Sean is not the only one to not like them
...
12. Pipeline Data
Host: my_host
Index: my_index
_raw:
  2013/07/01T14:30:24.234-0400 Brian pretends to be from South Africa
  2013/07/01T14:31:24.234-0400 Sean is originally Canadian
  2013/07/01T14:30:50.234-0400 Brian spends his time in:
  ...
UTF-8
Line Broken
_conf: <key here>
13. Pipelines/Processors
Input pipelines: TCP/UDP pipeline, Tailing, FIFO pipeline, FSChange, Exec pipeline
Parsing Queue -> Parsing Pipeline: utf8, linebreaker, header
Agg Queue -> Merging Pipeline: aggregator
Typing Queue -> Typing Pipeline: regex replacement, annotator
Index Queue -> Index Pipeline: tcp out, syslog out, indexer
14. Queue
[Figure: a Queue holding pData items; labels: Thread, Process, Insert, Remove.]
✓ Queue size bounded by memory
✓ Variable size Pipeline Data
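The queue-and-pipeline arrangement described on these slides can be sketched with threads and bounded queues. This is a minimal illustration, not Splunk's implementation: the `PData` class, the two stand-in processors (`trim`, `tag_length`) and the queue size of 4 are all assumptions made for the example, and Python's `queue.Queue` bounds the number of items rather than memory.

```python
import queue
import threading
from dataclasses import dataclass, field

@dataclass
class PData:
    # Hypothetical pipeline-data record; field names loosely follow the slide.
    host: str
    index: str
    raw: str
    meta: dict = field(default_factory=dict)

STOP = None  # sentinel telling a pipeline thread to shut down

def run_pipeline(in_q, out_q, processor):
    """Remove pData from in_q, run one processor, insert the result into out_q."""
    while True:
        item = in_q.get()
        if item is STOP:
            out_q.put(STOP)  # propagate shutdown downstream
            return
        out_q.put(processor(item))

def trim(p):
    # Hypothetical stand-in for a parsing-stage processor.
    p.raw = p.raw.strip()
    return p

def tag_length(p):
    # Hypothetical stand-in for a typing-stage processor.
    p.meta["length"] = len(p.raw)
    return p

def process(lines, maxsize=4):
    parsing_q = queue.Queue(maxsize)  # bounded: producers block when full
    typing_q = queue.Queue(maxsize)
    index_q = queue.Queue()           # unbounded so it can be drained afterwards
    threads = [
        threading.Thread(target=run_pipeline, args=(parsing_q, typing_q, trim)),
        threading.Thread(target=run_pipeline, args=(typing_q, index_q, tag_length)),
    ]
    for t in threads:
        t.start()
    for line in lines:
        parsing_q.put(PData("my_host", "my_index", line))
    parsing_q.put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = index_q.get()
        if item is STOP:
            break
        results.append((item.raw, item.meta["length"]))
    return results
```

Because each queue is bounded, a slow downstream stage makes upstream `put` calls block, which is the back-pressure behaviour a bounded queue provides.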
15. Persistent Queue
[Figure: a Splunk Host with an Input Q, a Persistent Q, internal queues and a Tcpout Q sending pData to the Network. Callouts: “A Full Network”, “Internal Queues Full”, “Much Bigger Queue”.]
16. Indexing
[Figure: the Pipelines/Processors diagram from slide 13, repeated: Parsing, Agg, Typing and Index queues feeding the Parsing, Merging, Typing and Index pipelines, ending at the indexer.]
17. What’s an index
Collective term used to describe rawdata and associated tsidx & metadata files.
18. Inside an index
[09:31:39] [1065]:: lbitincka@lbitincka: /opt/splunk/var/lib/splunk/_internaldb/
$ ls -l
total 0
drwx------   2 lbitincka  admin   68 Feb  6 12:57 colddb
drwx------  17 lbitincka  admin  578 Jul  1 09:31 db
drwx------  13 lbitincka  admin  442 Jun 27 16:36 summary
drwx------   2 lbitincka  admin   68 Aug 24  2012 thaweddb
Callouts on the slide: Index name; Bucket locations
20. Inside a bucket
[10:31:32] [1092]:: lbitincka@lbitincka: /opt/splunk/var/lib/splunk/_internaldb/db/db_1371998025_1371214200_158/
$ ll
-rw-------   1 lbitincka  admin    27M Jun 21 16:49 1371847782-1371214200-1941140693112088843.tsidx
-rw-------   1 lbitincka  admin   7.1M Jun 26 12:43 1371998025-1371847783-907852835360656754.tsidx
-rw-------   1 lbitincka  admin   2.5M Jun 26 12:43 merged_lexicon.lex
-rw-------   1 lbitincka  admin   459K Jun 26 12:43 bloomfilter
-rw-------   1 lbitincka  admin   1.3K Jun 23 10:33 Sources.data
-rw-------   1 lbitincka  admin   615B Jun 23 10:33 SourceTypes.data
drwx------  17 lbitincka  admin   578B Jul  1 10:31 ..
drwx--x--x  16 lbitincka  admin   544B Jun 26 12:50 .
-rw-------   1 lbitincka  admin   451B Jun 23 10:31 Strings.data
drwx------   4 lbitincka  admin   136B Jun 26 12:42 rawdata
-rw-------   1 lbitincka  admin   116B Jun 23 10:33 Hosts.data
-rw-------   1 lbitincka  admin    76B Jun 23 10:33 splunk-autogen-params.dat
-rw-------   1 lbitincka  admin    52B Jun 26 12:50 bucket_info.csv
-rw-------   1 lbitincka  admin    49B Jun 26 12:43 optimize.result
-rw-------   1 lbitincka  admin    10B Jun 26 12:43 .rawSize
-rw-------   1 lbitincka  admin     8B Jun 26 12:43 .sizeManifest4.1
21. Metadata & Bloomfilters
*.data
– metadata about sources, sourcetypes and hosts of the events contained in each bucket
Bloomfilters
– Efficient data structure that authoritatively rules out buckets
  • i.e. tells you with 100% certainty that a query term is NOT present in a bucket
– By default consulted by every search
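The "rules out buckets" property can be illustrated with a small Bloom filter. This is a generic sketch, not Splunk's on-disk bloomfilter format: the class, the bit-array size, the number of hashes and the use of SHA-256 are assumptions chosen for the example.

```python
import hashlib

class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, term):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term):
        # False means the term was definitely never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(term))

def buckets_to_search(term, filters):
    """Keep only the buckets whose filter cannot rule the term out."""
    return [name for name, bf in filters.items() if bf.might_contain(term)]
```

A `False` answer is authoritative, so the bucket can be skipped without opening its tsidx files; a `True` answer only means the bucket still has to be checked.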
22. Rawdata (not raw data)
Collection of compressed (gzipped) blocks, called slices,
– Concatenated together in a rawdata/journal.gz
– Think “cat chunkA.gz chunkB.gz ...chunkN.gz > journal.gz”
Slices contain the actual raw events.
Pool of concatenated slices can be seeked into
– Location offsets are pointed to by the values array pointers in tsidx.
Such organization allows us to zoom in to the right slice
– reduces the amount of decompression time & volume compared to having a single, massive rawdata file.
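The idea of concatenated gzip slices that can be seeked into can be sketched with Python's standard `gzip` module. The function names and the in-memory offset list are hypothetical, and the real journal.gz layout is not described on the slide, so this only illustrates the principle.

```python
import gzip

def build_journal(slices):
    """Compress each slice separately and concatenate the gzip members.

    Returns the journal bytes and the byte offset where each slice starts.
    """
    journal = bytearray()
    offsets = []
    for events in slices:
        offsets.append(len(journal))
        journal += gzip.compress("\n".join(events).encode("utf-8"))
    return bytes(journal), offsets

def read_slice(journal, offsets, i):
    """Decompress only slice i instead of the whole journal."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(journal)
    return gzip.decompress(journal[start:end]).decode("utf-8").split("\n")
```

Because each slice is an independent gzip member, a reader that knows the offset can skip straight to it, while the whole file still decompresses as one stream, just as with the `cat` analogy on the slide.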
23. TSIDX
Time series index (inverted index optimized for time)
Lexicon:
– Keywords within the specified time range
– Postings list array
Values array:
– Structure that contains posting values, seek address, _time etc.
– Seek address points to offsets in rawdata
Time is of transcendent importance in Splunk,
– tsidx filenames expose et and lt
– Values arrays arranged in time order as well
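Since the filenames expose the earliest time (et) and latest time (lt), a search can skip files whose range does not overlap the search window. The sketch below assumes the filename layout is `<lt>-<et>-<id>.tsidx`, which is inferred from the directory listing earlier in the deck and not confirmed by the slides; the function names are hypothetical.

```python
def parse_tsidx_name(filename):
    """Return (earliest, latest) epoch seconds from an assumed <lt>-<et>-<id>.tsidx name."""
    stem = filename[: -len(".tsidx")]
    latest, earliest, _unique_id = stem.split("-")
    return int(earliest), int(latest)

def overlaps(filename, search_earliest, search_latest):
    """True if the file's time range intersects the search time range."""
    et, lt = parse_tsidx_name(filename)
    return et <= search_latest and search_earliest <= lt

def files_to_open(filenames, search_earliest, search_latest):
    return [f for f in filenames if overlaps(f, search_earliest, search_latest)]
```

With the two tsidx files from the bucket listing, a search window of 1371900000 to 1371950000 would only need to open the second file.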
24. Lexicon
[The sample events from /var/log/secure.log shown on slide 11, with the following table.]
Term      Posting List
3         4
345       3,4
…         …
Africa    0
Brian     0,2
Bogota    4
…         …
Matty     5,6
Tijuana   4
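A lexicon of this kind is an inverted index from terms to the events that contain them. The sketch below builds one from the sample events, assuming zero-based posting numbers, one posting per multi-line event, and a simple lowercase alphanumeric tokenizer; none of these choices are stated on the slide, and the resulting numbers differ from the table, which is marked as illustrative.

```python
import re
from collections import defaultdict

# Message text of the seven sample events (timestamps omitted).
EVENTS = [
    "Brian pretends to be from South Africa",
    "Sean is originally Canadian",
    "Brian spends his time in:\n - Kentucky with phone number 345.567.3456\n - New Jersey",
    "Matty has lived in the following cities:\n - Tijuana: 345 Main St.\n"
    " - Saskatchewan: 3 One Lane\n - Colombia: 567 White line Dr. Bogota",
    "Cesar prefers Burbon Manhattans over beer",
    "Matty loves GiGi Mellow Burgers",
    "Sean is not the only one to not like them",
]

def build_lexicon(events):
    """Map each lowercase term to the sorted list of postings (event numbers)."""
    lexicon = defaultdict(list)
    for posting, text in enumerate(events):
        # set() so a term repeated inside one event is recorded only once
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            lexicon[term].append(posting)
    return dict(lexicon)
```

Postings are appended in event order, so each list is already sorted, which keeps later unions and intersections cheap.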
25. Values Array
[The sample events from /var/log/secure.log shown on slide 11, with the following table.]
Posting   Seek addr   _time        host      …
0         130         1372689024   my_host   …
1         150         1372689084   my_host   …
2         190         1372689050   my_host   …
3         389         1372689050   my_host   …
4         589         1372689050   my_host   …
5         800         1372689050   my_host   …
6         1399        1372689050   my_host   …
…         …           …            …         …
*all values for illustration purposes. Not necessarily accurate
26. Tsidx merging
Many small tsidx files due to data streaming
Searching is inefficient when going against many tsidx files
splunk-optimize
– Merging of small tsidx files into larger ones
– Consolidation of lexicons and posting lists
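Consolidating lexicons and posting lists can be sketched as follows. This is a simplified model, not what splunk-optimize actually does: each small index is represented as a hypothetical dict with a `lexicon` and a `values` list, and merging simply concatenates the values arrays and shifts the second index's postings.

```python
def merge_indexes(a, b):
    """Merge two small in-memory indexes into one.

    Each index is a dict with:
      'lexicon': term -> sorted list of postings (positions in 'values')
      'values':  list of per-event rows
    """
    shift = len(a["values"])  # postings from b move past the rows of a
    lexicon = {term: list(postings) for term, postings in a["lexicon"].items()}
    for term, postings in b["lexicon"].items():
        lexicon.setdefault(term, []).extend(p + shift for p in postings)
    return {"lexicon": lexicon, "values": a["values"] + b["values"]}
```

After the merge a term such as "apple" is looked up once in a single lexicon instead of once per small file, which is the search-time saving the slide describes.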
27. Putting it together
[Figure: three indexers (IDX 1, IDX 2, IDX 3). An index has a Home Path, a Cold Path and a Thawed Path, with buckets named hot_v1_100, hot_v1_101, db_lt_et_80, db_lt_et_101 and db_lt_et_70. A bucket holds *.data (Source/Sourcetype/Host Metadata, e.g. 1 source::/my/log, 2 source::/blah), *.tsidx and rawdata. The TSIDX panel shows a LEXICON (apple, beer, coke, cream, ice, java, …) and POSTING lists; the Rawdata panel shows the events “apple pie and ice cream is delicious” and “an apple a day keeps doctor away”. Other labels: et, lt, 150, 100.]
28. Bucket Lifecycle
Events -> hot bucket in $Home Path
[Hot Bucket is Full] -> rolls to warm, still in $Home Path
[Too Many Warms] -> rolls to $Cold Path [Cheaper Storage]
[Out of Space or Bucket is Old] -> $Frozen Path or Deleted
[Explicit User Action] -> $Thawed Path
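The lifecycle can be read as a small state machine. The sketch below encodes one reading of the slide, in which each trigger moves a bucket one stage along the path; the state names and the `roll` helper are assumptions for illustration and do not cover every transition a real deployment allows.

```python
# (current state, trigger from the slide) -> next state
TRANSITIONS = {
    ("hot", "hot bucket is full"): "warm",
    ("warm", "too many warms"): "cold",
    ("cold", "out of space or bucket is old"): "frozen",
    ("frozen", "explicit user action"): "thawed",
}

# Where a bucket in each state lives, using the slide's path names.
LOCATION = {
    "hot": "$Home Path",
    "warm": "$Home Path",
    "cold": "$Cold Path",
    "frozen": "$Frozen Path or Deleted",
    "thawed": "$Thawed Path",
}

def roll(state, trigger):
    """Return the next state, or the same state if the trigger does not apply."""
    return TRANSITIONS.get((state, trigger.lower()), state)
```

For example, `roll("warm", "Too Many Warms")` gives `"cold"`, whose location is `$Cold Path`.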
29. How do we search?
Consult the lexicon and combine the posting lists
– brian OR tijuana => (0, 2) OR (4) = (0, 2, 4)
Use values array to get seek address, _time, source and sourcetype for (0, 2, 4)
Use the seek addresses to read rawdata at offsets (130, 190, 589)
Send “results” to the search
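These steps can be traced in a few lines, using the illustrative lexicon and values array from the Lexicon and Values Array slides. The data structures and function names are hypothetical, and reading the rawdata itself is left out.

```python
# Illustrative values copied from the slides; not real tsidx contents.
LEXICON = {
    "africa": [0],
    "brian": [0, 2],
    "bogota": [4],
    "matty": [5, 6],
    "tijuana": [4],
}
VALUES = [
    {"seek": 130, "_time": 1372689024, "host": "my_host"},
    {"seek": 150, "_time": 1372689084, "host": "my_host"},
    {"seek": 190, "_time": 1372689050, "host": "my_host"},
    {"seek": 389, "_time": 1372689050, "host": "my_host"},
    {"seek": 589, "_time": 1372689050, "host": "my_host"},
    {"seek": 800, "_time": 1372689050, "host": "my_host"},
    {"seek": 1399, "_time": 1372689050, "host": "my_host"},
]

def search_or(*terms):
    """Union the posting lists of all terms (terms missing from the lexicon match nothing)."""
    postings = set()
    for term in terms:
        postings.update(LEXICON.get(term.lower(), []))
    return sorted(postings)

def seek_addresses(postings):
    """Look up each posting in the values array to find where to read rawdata."""
    return [VALUES[p]["seek"] for p in postings]
```

Here `search_or("brian", "tijuana")` yields postings 0, 2 and 4, which the values array maps to seek addresses 130, 190 and 589.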
30. Search Model Example
sourcetype=syslog ERROR | top user | fields - percent
sourcetype=syslog ERROR: Fetch events from disk, apply schema (Disk -> Intermediate results table)
top user: Summarize into table of top 10 users (-> Intermediate results table)
fields - percent: Remove column showing percentage (-> Final results table)
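The three stages of this search can be mimicked in plain Python to show how each command consumes the previous table. This is an analogy only: the event dictionaries and the function names `fetch`, `top` and `fields_minus` are invented for the example and are not Splunk APIs.

```python
from collections import Counter

def fetch(events, sourcetype, keyword):
    """Stage 1: keep events of one sourcetype whose raw text contains the keyword."""
    return [e for e in events
            if e["sourcetype"] == sourcetype and keyword in e["_raw"]]

def top(rows, field, limit=10):
    """Stage 2: summarize into the most common values with count and percent."""
    counts = Counter(r[field] for r in rows if field in r)
    total = sum(counts.values())
    return [{field: value, "count": count, "percent": 100.0 * count / total}
            for value, count in counts.most_common(limit)]

def fields_minus(rows, name):
    """Stage 3: drop one column from every row."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]

def run(events):
    # Equivalent in spirit to: sourcetype=syslog ERROR | top user | fields - percent
    return fields_minus(top(fetch(events, "syslog", "ERROR"), "user"), "percent")
```

Each function takes a table and returns a table, which is the same composition the pipe character expresses in the search string.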
31. What can we do with events?
It’s not just search …
SPL = Search Processing Language
– Inspired by *nix pipes
– Schema on read
– 130+ search commands for slicing through data
Versatile visualization library
Scheduling and alerting
…
32. LOB Owners/Executives, System Administrator, Operations Teams, Security Analysts, IT Executives, Application Developers, Auditors, Website/Business Analysts, Customer Support
IT Operations Management, Web Intelligence, Business Analytics, Application Management, Security and Compliance
33. Take it for a spin …
http://www.splunk.com/download/
- Download
- Try Splunk Cloud – AWS
WE’RE HIRING !! (in SF & valley)