Security analytics at web scale

OLAP stands for online analytical processing. It focuses on reporting and, more broadly, on answering schema-oriented queries quickly. Typical queries are “how many distinct infections were seen for a threat in a given month?” or “what was the longest duration last month for which a particular infection was seen in my enterprise?”
Contrast this with OLTP (online transaction processing), where the priority is storing a fast stream of transactions.
When we talk about OLAP, the star schema is the first thing that comes to mind, and it is an important concept in the relational OLAP world. Modeling OLAP data as a star schema means separating the data into fact and dimension tables. The central fact table holds a combination of dimension keys, which together identify a fact, and one or more measures that we want to calculate. A measure is often a derived field and can be computed with SQL GROUP BY clauses and aggregate functions.
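To make the fact/dimension split concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are assumptions made up for illustration; they are not taken from the system described in these notes. The query computes two measures, a distinct count and a maximum duration, with GROUP BY and aggregate functions.

```python
import sqlite3

# In-memory database; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_threat  (threat_id INTEGER PRIMARY KEY, infection_type TEXT);
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country TEXT);
-- Fact table: one row per detection, keyed by its dimensions.
CREATE TABLE fact_detection (
    threat_id      INTEGER REFERENCES dim_threat(threat_id),
    country_id     INTEGER REFERENCES dim_country(country_id),
    month_id       INTEGER,
    machine_id     TEXT,
    duration_hours INTEGER
);
INSERT INTO dim_threat  VALUES (1, 'GEN'), (2, 'TROJAN');
INSERT INTO dim_country VALUES (1, 'JP'), (2, 'JO');
INSERT INTO fact_detection VALUES
    (1, 1, 1, 'm1', 4), (1, 1, 1, 'm1', 7), (1, 1, 1, 'm2', 2),
    (1, 2, 1, 'm3', 9), (2, 1, 2, 'm1', 5);
""")

# Measures derived with GROUP BY plus aggregate functions.
rows = conn.execute("""
SELECT t.infection_type,
       f.month_id,
       COUNT(DISTINCT f.machine_id) AS distinct_infections,
       MAX(f.duration_hours)        AS max_duration
FROM fact_detection f
JOIN dim_threat t ON t.threat_id = f.threat_id
GROUP BY t.infection_type, f.month_id
ORDER BY t.infection_type, f.month_id
""").fetchall()

for row in rows:
    print(row)
```

The two example questions from the opening paragraph map onto the `COUNT(DISTINCT ...)` and `MAX(...)` columns respectively.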
We use Spark and HBase to implement a hybrid OLAP system. We call it hybrid because we store data in both relational (ROLAP) and multi-dimensional (MOLAP) formats.
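One way to picture the hybrid idea is as a routing decision: answer from a precomputed cuboid when one exists, otherwise aggregate the relational data on demand. The sketch below follows the "Is Cuboid Found?" decision from the slides, but it is only an illustration: plain Python dictionaries and lists stand in for HBase and Spark, and every name in it (`RAW_ROWS`, `MATERIALIZED`, `answer`) is hypothetical.

```python
from collections import Counter

DIMS = ("infection_type", "country", "month_id")

# Stand-in for the relational (ROLAP) store: one tuple per detection.
RAW_ROWS = [
    ("GEN", "JP", 1), ("GEN", "JP", 1), ("GEN", "JO", 1), ("GEN", "JP", 2),
]

def aggregate(rows, group_by):
    """Count rows per group; plays the role of an on-demand SQL GROUP BY."""
    idx = [DIMS.index(d) for d in group_by]
    return dict(Counter(tuple(r[i] for i in idx) for r in rows))

# Stand-in for the MOLAP store: only some cuboids are precomputed.
MATERIALIZED = {
    ("infection_type", "country"): aggregate(RAW_ROWS, ("infection_type", "country")),
}

def answer(group_by):
    """Return (path, result): 'molap' on a cuboid hit, 'rolap' otherwise."""
    group_by = tuple(group_by)
    if group_by in MATERIALIZED:
        return "molap", MATERIALIZED[group_by]
    return "rolap", aggregate(RAW_ROWS, group_by)

print(answer(("infection_type", "country")))  # served from the precomputed cuboid
print(answer(("month_id",)))                  # falls back to on-demand aggregation
```

In a real deployment the fast path would be a key lookup and the fallback a distributed query, so the cost difference between the two branches is far larger than this toy suggests.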
MOLAP materialization is best visualized as a lattice. Each circular point in the lattice diagram is called a tile or cuboid. Each tile can be thought of as the equivalent of a GROUP BY clause in SQL; aggregates such as SUM or COUNT are implicit and not shown in the diagram. Reading the lattice from bottom to top, each level drops one of the three fields (Infection_type, country, monthId), so the 2-D cuboids are obtained by dropping one field at a time from the 3-D base cuboid. This is called roll-up. Conversely, starting from the top, i.e. the 0-D cuboid, and moving downwards adds one field to the grouping at each level; this is called drill-down. There is extensive literature on how to do roll-up and drill-down efficiently and on which cuboids to materialize. I would strongly recommend Han and Kamber's Data Mining book and the lattice paper by Harinarayan et al. for a deep understanding of this domain.
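The lattice and the roll-up operation can be sketched in a few lines of Python. This is an illustration under assumptions, not the production code: the function names are hypothetical, and the counts reuse three sample rows (GEN-JP-1, GEN-JP-2, GEN-JO-1) from the HBase view in the slides.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("Infection_type", "country", "monthId")

def lattice(dims):
    """All cuboids, from the base (all fields) up to the apex (no fields)."""
    return [c for k in range(len(dims), -1, -1) for c in combinations(dims, k)]

def roll_up(cuboid, dims, keep):
    """Aggregate a cuboid keyed on `dims` down to the fields in `keep`."""
    idx = [dims.index(d) for d in keep]
    out = defaultdict(int)
    for key, count in cuboid.items():
        out[tuple(key[i] for i in idx)] += count
    return dict(out)

# 3-D base cuboid: detection counts per (Infection_type, country, monthId).
base = {("GEN", "JP", 1): 10, ("GEN", "JP", 2): 12, ("GEN", "JO", 1): 2}

by_type_country = roll_up(base, DIMS, ("Infection_type", "country"))  # 2-D cuboid
apex = roll_up(base, DIMS, ())                                        # 0-D cuboid

print(len(lattice(DIMS)))   # 2**3 cuboids for three fields
print(by_type_country)
print(apex)
```

Summing works here because plain counts are additive. A distinct count is not additive in this way, which is why the slides store a HyperLogLog sketch next to each count.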

Security analytics at web scale

  1. Security Analytics at Web Scale
     pratim_mukherjee@symantec.com
  2. Why Should You Care!
     - Bangladesh Bank Chief Resigns After Cyber Theft of $81 million (New York Times, Mar 15, 2016)
     - Cybercrime is a key fraud risk in India (ey.com, Jan 20, 2016)
     - Target settles for $39 million over data breach (Cnn.com, Dec 2, 2015)
     - Anthem is warning consumers about its huge data breach (Los Angeles Times.com, Mar 2015)
     - Ashley Madison: Anyone Here!!
  3. What is Security Analytics
     - Incident Response: Identify root cause and fix vulnerabilities
     - Intrusion Detection: Monitor network and systems for malicious activities
     - Alert Prioritization: Reduce false positives to stop the threat with highest impact
     - Predicting Compromises: Predict attacks based on vulnerability, command & control activity and past infections
     - Access Analytics: Isolate unusual user behavior, e.g. concurrent geographical login
     - Simulation:
       - Simulate various attacks by doing internal pen testing and take precautions based on log mining
       - Simulate insider attack on data loss prevention software and take precautions based on its logs
  4. Web Scale - Dealing with Petabytes
     - No real time query on Petabytes
     - Reduce data in stages like a funnel
     [Diagram labels: Streaming Logs, Kafka, Log Parser, Hive, Semi Aggregates, HBase, MOLAP Cubes, Kafka Client]
  5. Hybrid OLAP
     - Relational OLAP (ROLAP)
       - SQL kind of queries from client front-end tools for a relational back-end database.
       - ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services
       - ROLAP technology tends to have greater scalability than MOLAP technology
     - Multi-dimensional OLAP (MOLAP)
       - Query materialized views, think about Partially Ordered Sets (POSET)
       - The advantage of using a data cube is that it allows fast indexing to pre-computed summarized data and usually much faster than ROLAP
       - Difficult to scale because of “curse of dimensionality”
  6. Visualization of MOLAP as Lattice
     - 0-D (apex) cuboid
     - 1-D cuboids: Infection_type, monthId, country
     - 2-D cuboids: (Infection_type,monthId), (country,monthId), (Infection_type,country)
     - 3-D (base) cuboid: (Infection_type,country,monthId)
  7. HBase MOLAP View
     Aggregate Column Family columns: Detection Count (COUNT Distinct), Hyperloglog Hash

     ROWKEY [Infection_type,country,monthId]   Detection Count   Hyperloglog Hash
     GEN-JP-1                                  10                4a44dc15364204a
     GEN-JP-2                                  12                e80e9039455cc
     GEN-JP-3                                  9                 f1e5233ade6af
     GEN-JP-4                                  15                a80fe80e90
     GEN-JP-5                                  5                 3ade6af1dd5
     GEN-JP-6                                  12                a44dc1536420
     GEN-JO-1                                  2                 ….
     GEN-JO-2                                  1                 ….
     GEN-JO-3                                  0                 ….
     GEN-JO-4                                  5                 …..
     GEN-JO-5                                  2                 …..
     GEN-JO-6                                  1                 …...

     **hashes are representative
  8. Probabilistic Data Structures
     - Hyperloglog
       - Used for approximate count distinct queries
       - Store HLL hash in 5 bytes in HBase columns
       - Apply monoid SUM pattern to rollup
     - Bloom Filter
       - Used for checking whether an incoming stream element is “not” a member of a set
       - False negatives never happen, i.e. a “definitely not in set” answer is always correct
       - Also used by HBase to ascertain whether an input row key is part of an HFile
     - Count-Min Sketch
       - Used for counting frequencies of specific elements in sub-linear space
     - Twitter’s Algebird library with Spark for HLL and CMS implementation
  9. Real-Time Query Response
     [Flow diagram labels: Incoming Query, Server, Query Controller, Is Cuboid Found? (Yes: Calcite HBase Adapter, HBaseQuery; No: Spark Driver on Jetty, SparkSQLQuery), HDFS/Hive/HBase, Response]
  10. Thank You
     - Questions/Comments
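The "monoid SUM" roll-up from the Probabilistic Data Structures slide can be illustrated with a small, self-contained HyperLogLog. This is a teaching sketch, not the Algebird implementation the slides mention: the register count (`P = 10`), the SHA-1 based hash, and all function names are assumptions chosen for brevity, and the storage layout differs from the 5-byte encoding on the slide. The point it shows is that two sketches combine by taking the register-wise maximum, with the empty sketch as identity, so per-month sketches can be rolled up without rescanning raw data.

```python
import hashlib
import math

P = 10          # number of index bits (assumption for this sketch)
M = 1 << P      # number of registers

def _hash64(item):
    """64-bit hash of a string, taken from the first 8 bytes of SHA-1."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")

def empty():
    """Identity element of the merge operation."""
    return [0] * M

def add(registers, item):
    h = _hash64(item)
    idx = h >> (64 - P)                        # top P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)           # remaining 54 bits
    rank = (64 - P) - rest.bit_length() + 1    # position of the leftmost 1 bit
    registers[idx] = max(registers[idx], rank)

def merge(a, b):
    """Monoid 'sum' of two sketches: register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:               # small-range correction
        return M * math.log(M / zeros)
    return raw

def sketch(items):
    regs = empty()
    for item in items:
        add(regs, item)
    return regs

# Two monthly sketches over overlapping sets of (hypothetical) machine ids.
month1 = sketch(f"machine-{i}" for i in range(3000))
month2 = sketch(f"machine-{i}" for i in range(2000, 5000))

rolled_up = merge(month1, month2)   # distinct machines across both months
print(round(estimate(rolled_up)))   # approximate; the true value is 5000
```

Because merging is associative and commutative, the roll-up can be applied in any order along the lattice, and the merged sketch is identical to one built directly from the union of the inputs.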