The document discusses real-time analysis and visualization of streaming data using AnubisNetworks' StreamForce platform. It describes how StreamForce collects security events from various feeds and processes them with Node.js applications that aggregate and store the information in MongoDB and Redis. The stored data can then be queried and used to generate reports on infected machines, botnets, countries and other metrics.
2. Agenda
Who are we?
AnubisNetworks Stream
Stream Information Processing
Adding Valuable Information to Stream Events
3. Who are we?
Tiago Martins
AnubisNetworks
@Gank_101
João Gouveia
AnubisNetworks
@jgouv
Tiago Henriques
Centralway
@Balgan
4. Anubis StreamForce
Events (lots and lots of events)
Events are “volatile” by nature
They exist only if someone is listening
Remember?:
“If a tree falls in a forest and no one is
around to hear it, does it make a
sound?”
6. Anubis StreamForce
Problems (and ambitions) to tackle
The huge amount and variety of data to process
Mechanisms to share data across multiple systems,
organizations, teams, companies..
Common API for dealing with all this (both from a
producer and a consumer perspective)
7. Anubis StreamForce
Enter the security events CEP - StreamForce
High performance, scalable, Complex Event
Processor (CEP) – 1 node (commodity hw) = 50k
evt/second
Uses streaming technology
Follows a publish/subscribe model
8. Anubis StreamForce
Data format
Events are published in JSON format
Events are consumed in JSON format
13. Anubis CyberFeed 13
Feed galore!
Sinkhole data, traps, IP reputation, etc.
Bespoke feeds (create your own view)
Measure, group, correlate, de-duplicate ..
High volume (usually ~6,000 events per
second, more data being added frequently)
20. Challenge
Let's use the Stream to help
Group by machine and trojan
From peak ~4k/s to peak ~1k/s
Filter fields.
Geo location
We end up with
{"env":{"remote_addr":"207.215.48.83"},"trojanfamily":"W32Expiro","_geo_env_remote_addr":{"country_code":"US","country_name":"United States","city":"Los Angeles","latitude":34.0067,"longitude":-118.3455,"asn":7132,"asn_name":"AS for SBIS-AS"}}
22. Technologies 22
Applications
NodeJS
Server-side Javascript Platform.
V8 Javascript Engine.
http://nodejs.org/
Why?
Great for prototyping.
Fast and scalable.
Modules for (almost) everything.
23. Technologies 23
Databases
MongoDB
NoSQL Database.
Stores JSON-style documents.
GridFS
http://www.mongodb.org/
Why?
JSON from the
Stream, JSON in the
database.
Fast and scalable.
Redis
Key-value storage.
In-memory dataset.
http://redis.io/
Why?
Faster than MongoDB for
certain operations, like
keeping track of number of
infected machines.
Very fast and scalable.
25. Data Collection 25
Storage
Aggregate
information
MongoDB Redis
Worker
Worker
Worker
Processor
Process real time
events
Events come from the Stream.
Collector distributes events to Workers.
Workers persist event information.
Processor aggregates information and stores it for statistical and historical
analysis.
Collector
Stream
26. Data Collection 26
MongoDB
Real time information of infected machines.
Historical aggregated information.
Redis
Real time counters of infected machines.
27. Data Collection - Collector 27
Collector
Old data is periodically removed, e.g. machines that don't
produce events for more than 24 hours.
Sends events to Workers.
Decrements counters of removed information.
Send warnings
Country / ASN is no longer infected.
Botnet X decreased Y % of its size.
28. Data Collection - Worker 28
Worker
Creates new entries for unseen machines.
Adds information about new trojans / domains.
Updates the last time the machine was seen.
Processes events and updates the Redis counters
accordingly.
Needs to check MongoDB to determine if:
New entry – All counters incremented
Existing entry – Increment only the counters related to
that Trojan
Send warnings
Botnet X increased Y % in its size.
New infections seen on Country / ASN.
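The Worker's counter decision above can be sketched as follows. This is illustrative only: the key names and field layout are assumptions, not the real Redis schema from the deck. A new machine bumps every counter; a known machine with a new trojan bumps only the trojan-related ones.

```javascript
// Hypothetical sketch of the Worker's counter logic.
function countersToIncrement(event, isNewMachine) {
  const { trojanfamily } = event;
  const geo = event._geo_env_remote_addr;
  // Counters tied to this specific trojan (always incremented).
  const trojanKeys = [
    `trojan:${trojanfamily}`,
    `trojan:${trojanfamily}:country:${geo.country_code}`,
    `trojan:${trojanfamily}:asn:${geo.asn}`,
  ];
  if (!isNewMachine) return trojanKeys; // existing entry: trojan counters only
  // New entry: global counters are incremented as well.
  return ['total', `country:${geo.country_code}`, `asn:${geo.asn}`, ...trojanKeys];
}

const ev = {
  trojanfamily: 'W32Expiro',
  _geo_env_remote_addr: { country_code: 'US', asn: 7132 },
};
const newKeys = countersToIncrement(ev, true);  // all 6 counters
const oldKeys = countersToIncrement(ev, false); // only the 3 trojan counters
```

Each returned key would then be passed to a Redis `INCR` by the Worker.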
29. Data Collection - Processor
Processor
Processor retrieves real time counters from Redis.
Information is processed by:
Botnet;
ASN;
Country;
Botnet/Country;
Botnet/ASN/Country;
Total.
Persisting information to MongoDB creates a historic
database of counters that can be queried and
analyzed.
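The Processor's roll-up step can be sketched like this: raw per-(trojan, country, ASN) counters pulled from Redis are grouped into the aggregates listed above. The key layout is an assumption for illustration.

```javascript
// Sketch: aggregate raw counters into the groupings the Processor persists.
function aggregate(counters) {
  const out = { byTrojan: {}, byCountry: {}, byAsn: {}, total: 0 };
  for (const [key, n] of Object.entries(counters)) {
    const [trojan, country, asn] = key.split(':'); // assumed key layout
    out.byTrojan[trojan] = (out.byTrojan[trojan] || 0) + n;
    out.byCountry[country] = (out.byCountry[country] || 0) + n;
    out.byAsn[asn] = (out.byAsn[asn] || 0) + n;
    out.total += n;
  }
  return out;
}

const agg = aggregate({
  'W32Expiro:US:7132': 3,
  'W32Expiro:PT:3243': 2,
  'conficker_b:PT:3243': 5,
});
```

Each aggregate would then be written to MongoDB as a historic counter document.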
31. Data Collection - MongoDB
Collection for aggregated information (the historic counters database)
{
"_id" : ObjectId("519c0abac1172e813c004ac3"),
"0" : 744,
"1" : 745,
"3" : 748,
"4" : 748,
"5" : 746,
"6" : 745,
...
"10" : 745,
"11" : 742,
"12" : 746,
"13" : 750,
"14" : 753,
...
"metadata" : {
"country" : "CH",
"date" : "2013-05-22T00:00:00+0000",
"trojan" : "conficker_b",
"type" : "daily"
}
}
Preallocated entries for each hour
when the document is created.
If we don’t, MongoDB will keep
extending the documents by adding
thousands of entries every hour and it
becomes very slow.
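The preallocation trick above can be sketched as a document builder: when the daily document is created, every hourly slot is written up front with zero, so MongoDB never has to grow the document in place. Field names follow the example document; the function itself is illustrative.

```javascript
// Sketch: build a daily counter document with all 24 hourly slots preallocated.
function newDailyDocument(country, trojan, date) {
  const doc = { metadata: { country, trojan, date, type: 'daily' } };
  for (let hour = 0; hour < 24; hour++) {
    doc[String(hour)] = 0; // preallocate so updates never extend the document
  }
  return doc;
}

const doc = newDailyDocument('CH', 'conficker_b', '2013-05-22T00:00:00+0000');
// Later, an hourly update is an in-place $set / $inc on an existing field.
```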
32. Data Collection - MongoDB
Collection for 24 hours
4 MongoDB Shard instances
>3 Million infected machines
~2 GB of data
~558 bytes per document.
Indexes by
ip – helps inserts and updates.
ip_numeric – enables queries by CIDRs.
last – Faster removes for expired machines.
host – Hmm, is there any .gov?
country, family, asn – Speeds MongoDB
queries and also allows faster custom
queries.
Collection for aggregated information
Data for 119 days (25 May to 11 July)
> 18 Million entries
~6.5 GB of data
~366 bytes per object
~56 MB per day
Indexes by
metadata.country
metadata.trojan
metadata.date
metadata.asn
metadata.type, metadata.country, metadata.date, met.......
(all)
34. Data Collection - Redis
Redis performance on our machine
SET: 473036.88 requests per second
GET: 456412.59 requests per second
INCR: 461787.12 requests per second
Time to get real time data
Getting all the data from Families/ASN/Counters to the NodeJS application, ready to
be processed, takes around half a second
> 120 000 entries in… (very fast..)
Our current usage is
~3% CPU (of a 2.0 GHz core)
~480 MB of RAM
35. Data Collection - API
But! There is one more application..
How to easily retrieve stored data
MongoDB's REST API is a bit limited.
NodeJS HTTP + MongoDB + Redis
Redis
http://<host>/counters_countries
...
MongoDB
http://<host>/family_country
...
Custom MongoDB Queries
http://<host>/ips?f.ip_numeric=95.68.149.0/22
http://<host>/ips?f.country=PT
http://<host>/ips?f.host=bgovb
36. Data Collection - Limitations
Grouping information by machine and trojan doesn't allow studying
the real number of events per machine.
Can be useful to get an idea of the botnet operations or how many
machines are behind a single IP (everyone is behind a router).
Slow MongoDB impacts everything
Worker application needs to tolerate a slow MongoDB and discard some
information as a last resort.
Beware of slow disks! Data persistence occurs every 60 seconds (default)
and can take too much time, having a real impact on performance.
>10s to persist is usually very bad; something is wrong with the hard drives.
37. Data Collection - Evolution
Warnings
Which warnings to send? When? Thresholds?
Aggregate data by week, month, year.
Aggregate information in shorter intervals.
Data Mining algorithms applied to all the collected information.
Apply same principles to other feeds of the Stream.
Spam
Twitter
Etc..
38. Reports
What's happening in country X?
What about network 192.168.0.0/24?
Can you send me the report of Y every day at 7 am?
Ohh!! Remember the report I asked for last week?
Can I get a report for ASN AnubisNetwork?
39. Reports 39
HTTP API
Schedule
Get
Edit
Delete
List schedules
List reports
Check MongoDB for work.
Generate CSV report or store the JSON Document for
later querying.
Send email with link to files when report is ready.
Server
Generator
45. Globe – NodeJS 45
Stream NodeJS Browser
NodeJS
HTTP
Get JSON from Stream.
Socket.IO
Multiple protocol support (to bypass some proxies and handle
old browsers).
Redis
Get real time number of infected machines.
46. Globe – Browser 46
Stream NodeJS Browser
Browser
Socket.IO Client
Real time apps.
Websockets and other
types of transport.
WebGL
ThreeJS
Tween
jQuery
WebWorkers
Runs in the background.
Where to place the red dots?
Calculations from geolocation
to 3D point goes here.
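The WebWorker math hinted at above is a standard spherical-to-Cartesian conversion from geolocation to a point on the globe. The radius and axis conventions here are assumptions (Y-up, as ThreeJS commonly uses), not the deck's actual code.

```javascript
// Sketch: convert latitude/longitude to a 3D point on a sphere.
function latLonToVec3(latDeg, lonDeg, radius = 1) {
  const phi = (90 - latDeg) * Math.PI / 180;    // polar angle from the north pole
  const theta = (lonDeg + 180) * Math.PI / 180; // azimuth around the Y axis
  return {
    x: -radius * Math.sin(phi) * Math.cos(theta),
    y: radius * Math.cos(phi),
    z: radius * Math.sin(phi) * Math.sin(theta),
  };
}

// Example: the Los Angeles event from the slides.
const p = latLonToVec3(34.0067, -118.3455);
```

In the browser this calculation runs inside a WebWorker so rendering stays smooth.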
47. Globe – Evolution
Some kind of HUD to get better interaction and notifications.
Request actions by clicking on the globe.
Generate a report of infections in that area.
Request operations in a specific area.
Real time warnings
New Infections
Other types of warnings...
48. Adding Valuable Information to
Stream Events
How to distribute workload to other machines?
Adding value to the information we already have.
49. Minions
Typically the operations that would add value
are expensive in terms of resources
CPU
Bandwidth
Master-slave approach that distributes work
among distributed slaves we called Minions.
Master
Minion
Minion
Minion
Minion
50. Minions 50
Master receives work from Requesters and stores the work in MongoDB.
Minions request work.
Requesters receive real time information on the work from the Master or
they can ask for work information at a later time.
Process / Storage Minions
Master MongoDB
DNS
Scan
Minion
Minion
Requesters
Minion
51. Minions
Master has an API that allows custom Requesters to ask for
work and monitor the work.
Minions have a modular architecture
Easily create a custom module.
Information received from the Minions can then be
processed by the Requesters and
Sent to the Stream
Saved in the database
Used to update an existing database
Minion modules: DNS, Scanning, Data Mining
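The Master/Minion flow above can be sketched with an in-memory work queue. The real system persists work in MongoDB behind an HTTP API; everything here (class, method names, the DNS job) is illustrative.

```javascript
// Minimal sketch of the Master side of the Master/Minion work flow.
class Master {
  constructor() {
    this.pending = [];          // work submitted by Requesters
    this.results = new Map();   // results reported back by Minions
  }
  submit(work) { this.pending.push(work); return work.id; } // Requester submits work
  takeWork() { return this.pending.shift() || null; }       // a Minion asks for work
  reportResult(id, result) { this.results.set(id, result); } // Minion reports back
  getResult(id) { return this.results.get(id); }            // Requester polls later
}

const master = new Master();
master.submit({ id: 'job1', module: 'dns', target: 'example.com' });
const job = master.takeWork();                       // a Minion picks it up
master.reportResult(job.id, { resolved: '93.184.216.34' }); // hypothetical answer
const result = master.getResult('job1');
```

In the real system the Minion's `module` field would select one of its pluggable modules (DNS, scanning, data mining).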
52. Extras...
So what else could we possibly do using the Stream?
Distributed Portscanning
Distributed DNS Resolutions
Transmit images
Transmit videos
Realtime tools
Data agnostic. Throw stuff at it and it will deal with it.
54. Portscanning
Portscanning done right…
It's not only about your portscanner being able to throw 1 billion
packets per second.
Location = reliability of scans.
A distributed system for portscanning is much better. But it's not just
about having it distributed. It's about optimizing what it scans.
59. Portscanning problems...
Doing portscanning correctly brings along certain problems.
If you are not HD Moore or Dan Kaminsky, resource-wise you are gonna have a bad time
61. Portscanning problems...
You need lots of minions in different parts of the world
Doesn't actually require an amazing CPU or RAM if you do it correctly.
Storing all that data...
Querying that data...
Is it possible to have a cheap, distributed portscanning
system?
Internet scale: devices, systems, firewalls, IDS..
Hi, I'm going to present the next section of the presentation. So, how can we collect events from the Stream? What information can we gather from those events? How can we access those events in real time?
The challenge here is the large number of events per second; in total we currently have over 6000 events per second, and 4000 of these events are from a single feed called banktrojans, which is basically formed by infected machines. This is what an event from those machines looks like.
So, basically this is what we see..
And, this is what we want. We want to know where our targets are, where to look.
Infected machines are usually noisy and tend to produce a big number of events. We can use the Stream to help us: the group module groups the events that occur within 4 minutes of each other and originate from the same machine and trojan, so we can go from 4000 to 1000 events per second. Basically, we receive an event for a machine and trojan, and the next events will not be received because they are considered duplicates. Then we have the filter module to filter the fields we need; for example, we only care about the IP address, ASN, trojan, C&C domain and geolocation of the machine. How do we process and store these 1000 events per second?
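The 4-minute group/de-duplicate step described above can be sketched as a small stateful filter. This is a sketch of the idea, not the Stream's actual group module; names and the window handling are assumptions.

```javascript
// Sketch: collapse events from the same machine + trojan within a 4-minute window.
const WINDOW_MS = 4 * 60 * 1000;

function makeDeduper(windowMs = WINDOW_MS) {
  const lastSeen = new Map(); // "ip|trojan" -> timestamp of last emitted event
  return function dedupe(event, now) {
    const key = event.env.remote_addr + '|' + event.trojanfamily;
    const prev = lastSeen.get(key);
    if (prev !== undefined && now - prev < windowMs) return null; // duplicate: drop
    lastSeen.set(key, now);
    return event; // first event for this machine/trojan in the window
  };
}

// Example: three events from one machine; only the first and the one
// arriving after the window expires get through.
const dedupe = makeDeduper();
const e1 = { env: { remote_addr: '207.215.48.83' }, trojanfamily: 'W32Expiro' };
const out = [
  dedupe(e1, 0),       // emitted
  dedupe(e1, 60000),   // dropped (inside the 4-minute window)
  dedupe(e1, 250000),  // emitted again (window expired)
].filter(Boolean);
```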
First, some technical information about the technologies we use. For application development, we use NodeJS, a server-side JavaScript platform built on top of the V8 engine. It's fast, scalable and has modules for almost everything. For data storage, MongoDB is a NoSQL database that is fast and scalable. It can also store JSON-style documents and files in GridFS. And then we have Redis, a key-value store that is very fast and also scalable.
This is an overview of the Data Collection. We built 3 applications: Collector, Worker and Processor.
Events come from the Stream to the Collector. The Collector then distributes the workload to Workers that process and store the information in MongoDB and Redis. The Processor then gathers information from Redis and stores it in MongoDB for statistical and historical analysis.
So the Collector talks to these 3 components. It maintains the information in MongoDB, removing information about machines that don't produce events for more than 24 hours. It decrements counters in Redis, and while maintaining this information it is possible to send warnings. Workers receive events from the Collector and can run on any machine with a connection to the Collector and the database.
The Worker processes and stores the event in MongoDB, creating new entries or updating information about new trojans in an existing entry. It also updates the last time we saw an event for that machine. While updating MongoDB the Worker also needs to maintain the Redis counter information, incrementing the values for new entries or updating counters for a new trojan on a seen machine. While performing this task it can also determine whether there is a warning to be sent.
The last component is the Processor. It retrieves real-time counters from Redis, then processes and stores them in MongoDB, aggregated by botnet, ASN, country, etc. This information can then be analyzed and queried.
Let's now check the databases. The MongoDB collection that stores information about machines active in the last 24 hours looks like this: a JSON document with information about geolocation, IP address, trojans, last time seen, etc. There is also a numerical representation of the IP address that helps query specific network ranges.
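The `ip_numeric` trick mentioned above works because a 32-bit integer representation turns a CIDR query into a simple range query. A sketch of the conversion (the actual stored schema is not shown in the slides):

```javascript
// Sketch: IP -> 32-bit integer, and CIDR -> numeric range.
function ipToNumeric(ip) {
  return ip.split('.').reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

function cidrToRange(cidr) {
  const [base, bitsStr] = cidr.split('/');
  const size = 2 ** (32 - Number(bitsStr));
  const lo = Math.floor(ipToNumeric(base) / size) * size; // align to the block
  return { lo, hi: lo + size - 1 };
}

// A query like f.ip_numeric=95.68.149.0/22 could then translate to
// { ip_numeric: { $gte: lo, $lte: hi } } in MongoDB.
const range = cidrToRange('95.68.149.0/22');
```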
The aggregated information collection holds documents with this format. The metadata field holds information about the specific document, its type and the origin of the information; in this case, country and trojan. It has an entry per hour with the number of infections. These entries need to be preallocated with zeros, so every day a new document is created for a specific metadata with all the hours at 0. If we don't do this there will be a lot of document extends in MongoDB and it will become very slow.
Some more information about these collections. The 24-hours collection is sharded between 4 MongoDB instances, and in July it held information on over 3 million infected machines, which only takes 2 GB of disk to store. The aggregated information collected for 119 days had over 18 million entries and occupied around 6.5 GB of data; that's around 56 MB per day. These were the indexes created. We need to be very careful with these because they speed up reads but slow down writes. We want fast writes for the 24-hours collection, and for that reason we need to keep the indexes optimized; only the IP index is built in the foreground, all the others are built in the background. For the aggregated information collection we don't need to be as careful; we can add the indexes that allow us to perform faster queries.
Let's look at the Redis information. The counters look like this; they are concatenations of strings separated by colons, for example (example).
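The colon-separated counter keys described above can be sketched as a small key builder. The exact segment order and prefix are assumptions, since the notes leave the concrete example out.

```javascript
// Sketch: build colon-separated Redis counter keys.
function counterKey(...parts) {
  // Drop missing segments so partial keys (e.g. no ASN) still work.
  return parts.filter(p => p !== undefined && p !== null).join(':');
}

const byTrojanCountry = counterKey('infected', 'W32Expiro', 'US');
const byAsn = counterKey('infected', 'asn', 7132);
// With a Redis client one would then run e.g. client.incr(byTrojanCountry).
```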
Redis is very fast; we can retrieve all the information from the biggest set in around half a second. The insertion of data is also very fast while using very few of the machine's resources.
There was also the need to access all this information on demand, so an API was created that allows retrieving or querying information in both Redis and MongoDB.
So, there are a couple of limitations with these approaches. By grouping events to reduce the number of events per second, we are discarding information that could be studied to better understand what is behind those machines; for example, the number of events from a machine with a specific botnet could indicate how many machines are on that network (everyone has a router nowadays). Also, MongoDB can impact everything; it is fast but needs to be used carefully. We need the MongoDB shards to keep performance at acceptable levels. If we start getting 2 or 3 times the events we currently have, the Workers won't be able to persist all that information in time and will have to start discarding it at some point. The alternative to discarding is to add more shards. You need to constantly monitor your hard drives; if their performance decreases, bad things will happen: Mongo won't be able to persist the information in time and will start to slow everything down.
How can we evolve this solution? We can send more warnings with the information we have, but when? What thresholds should we use? We only aggregate information by hour and day; what about weeks, months, years? What about shorter intervals? We can also apply data mining algorithms to retrieve more information from the data we already collect. And of course, apply these principles to other feeds like Spam or Twitter.
So how do we extract information about a specific network or country? What about what happened last week?
Of course we used NodeJS and built 2 applications: one that is used as an API to access and request reports, and another that checks the database for requests, generates the reports and stores them. The reports are saved in CSV format, or the JSON document is stored for later querying. They are also sent by email, where we give a URL to download the files.
The collections that hold the CSV reports look like this. They have a scheduled work collection that keeps a record of the report it's generating and the reports it has already generated. The reports keep an array of files generated and saved in MongoDB's storage for files, called GridFS.
Then we have the JSON reports, which we call snapshots. The main differences are the count field in the snapshot, which holds the number of infected machines in that snapshot, and the results for that snapshot, which include the information about the machine and the metadata that identifies the origin of that entry. We could store an array of results in the snapshot collection, but it would be hard to use because it would have too many entries, possibly millions, and would just be useless.
How could we evolve the Reports? We can store reports in other formats, generate charts for a report with specific information, and start storing other types of reports, not just for botnets.
So, how can we visualize realtime events? Let's focus on the botnets again; it would be awesome if we could see the distribution of botnets through the world, receive warnings and monitor other information in realtime. For that purpose, there is a shiny globe (demo). We can see in realtime when infected machines produce events, monitor a top of countries most infected with a specific trojan, the number of events being generated every second and a total number of infections.
This information comes from the Stream; we group it by trojan and country. We don't really want to send ALL the events to the browser because some browsers would just crash. For that reason we also filter, keeping only the geolocation and trojan family. The information about the top infected countries comes from a KPI module that dynamically calculates the top in the stream.
Between the Stream and the Browser we have a NodeJS application that controls the flow of events to the browser, discarding events if too many are received and relaying the information to the Browser using the socket.io module. We also need to get the total number of infected machines from the Redis counters.
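The flow control mentioned above can be sketched as a per-second budget: once the relay has forwarded its quota for the current second, further events are discarded. The budget value and function names are assumptions.

```javascript
// Sketch: drop events once more than maxPerSecond have been forwarded
// in the current one-second window.
function makeThrottle(maxPerSecond) {
  let windowStart = 0;
  let sent = 0;
  return function allow(now) {
    if (now - windowStart >= 1000) { windowStart = now; sent = 0; } // new second
    if (sent >= maxPerSecond) return false; // discard: browser budget exhausted
    sent++;
    return true; // forward this event over socket.io
  };
}

// Example with a budget of 2 events/second.
const allow = makeThrottle(2);
const decisions = [allow(0), allow(10), allow(20), allow(1200)];
```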
At the browser end we use the socket.io client to receive the events, process those events using WebWorkers (calculating where to place the dots) and render everything using WebGL.
We can evolve the globe to create a more interactive experience where we could perform actions in realtime through the globe. We can also show warnings on the globe, for example about new infections.
How can we add valuable information to the information we already have?
Typically the operations that would add value are expensive; they need CPU and bandwidth. So we needed a master-slave approach that distributes the work among multiple slaves, which we called Minions.
Masters receive work from the Requesters and store that work in MongoDB. Minions will then request work and send the work results to the Master. The Master then sends updates directly to the Requester of the work and also stores the results in MongoDB.
The Master has an API that allows custom Requesters to ask for work and monitor the work results received from the Minions. The Minion application was built with a modular architecture in mind, so it is very easy to create a custom module. Information received by the Minions can then be injected into the Stream or stored in a database.
Getting the full picture from an infected machine or a network involves lots of steps: sinkholing that botnet; portscanning the target, which gives you an idea of whether the machine is connected directly to the internet or behind a gateway, whether there are shares available, and how this machine could possibly have been compromised (MS08-067?); DNS analysis.
We are going to focus on: portscanning, DNS resolutions, and realtime demos.
It's really cool to have a super fast scanner in a lab giving 1 quadrillion packets per second. However, this is the wrong way. The correct way: slow scans, geo-distributed. Scanning Angola from Australia = 60% of services time out and look closed. Scanning the USA from Russia or vice versa = just as bad.
Combining a Model B Raspberry Pi with the PwnPi distro and a custom set of scripts makes it a Minion: a cheap device that we can use to do distributed scanning, and we can even ask others to deploy it and contribute to our system. In the near future we intend to make this image available for others who want to contribute to our system.