葉祐欣 (Evans Ye)
Big Data Conference 2015
Trend Micro Big Data Platform 

and Apache Bigtop
Who am I
• Apache Bigtop PMC member
• Apache Big Data Europe 2015 Speaker
• Software Engineer @ Trend Micro
• Develop big data apps & infra
• Some experience in Hadoop, HBase, Pig, Spark, Kafka, Fluentd, Akka, and Docker
Outline
• Quick Intro to Bigtop
• Trend Micro Big Data Platform
• Mission-specific Platform
• Big Data Landscape
• Bigtop 1.1 Release
Quick Intro to Bigtop
Linux Distributions
Hadoop Distributions
We’re fully open source!
How do I add patches?
From source code
to packages
Bigtop

Packaging
Bigtop feature set
Packaging, Testing, Deployment, Virtualization
for you to easily build your own Big Data Stack
Supported components
• $ git clone https://github.com/apache/bigtop.git
• $ docker run \
    --rm \
    --volume `pwd`/bigtop:/bigtop \
    --workdir /bigtop \
    bigtop/slaves:trunk-centos-7 \
    bash -l -c './gradlew rpm'
One click to build packages
• $ ./gradlew tasks
Easy to do CI
ci.bigtop.apache.org
RPM/DEB packages
www.apache.org/dist/bigtop
One click Hadoop provisioning
./docker-hadoop.sh -c 3
bigtop/deploy image on Docker hub
./docker-hadoop.sh -c 3
[Diagram: three containers are spun up and each is provisioned by running puppet apply.]
Just google bigtop provisioner
Should I use Bigtop?
If you want to build your own customised Big Data Stack
Curves ahead…
Pros & cons
• Bigtop
• You need a talented Hadoop team
• Self-service: troubleshoot, find solutions, develop patches
• Add any patch at any time you want (additional effort)
• Choose any version of any component you want (additional effort)
• Vendors (Hortonworks, Cloudera, etc)
• Better support, since they’re the ones who write the code!
• $
Trend Micro 

Big Data Platform
• Use Bigtop as the basis for our internal custom
distribution of Hadoop
• Apply community and private patches to upstream projects for business and operational needs
• Newest TMH7 is based on Bigtop 1.0 SNAPSHOT
Trend Micro Hadoop (TMH)
Working with community
made our life easier
• Contribute Bigtop Provisioner, packaging code,
puppet recipes, bugfixes, CI infra, anything!
• Knowing community status made a TMH7 release based on Bigtop 1.0 SNAPSHOT possible
Working with community
made our life easier
• Contribute feedback, evaluation, use case
through Production level adoption
• Leverage Bigtop smoke tests and integration tests with Bigtop Provisioner to evaluate TMH7
Trend Micro Big Data Stack
Powered by Bigtop
[Diagram: Storage: Hadoop HDFS. Resource Management: Hadoop YARN. Processing Engine: MapReduce. APIs and Interfaces: Pig, Oozie (Wuji), Ad-hoc Query, UDFs, HBase, Solr. In-house Apps: App A, B, C, D. Kerberos across the stack, running in the cloud. Deployment: Hadooppet (prod), Hadoocker (dev).]
Hadooppet
• Puppet recipes to deploy and manage the TMH Big Data Platform
• HDFS, YARN, HA auto-configured
• Kerberos, LDAP auto-configured
• Kerberos cross-realm authentication auto-configured (for distcp across secured clusters)
Hadoocker
• A DevOps toolkit for Hadoop app developers to develop and test their code on
• Big Data Stack preloaded images
  -> dev & test env without deployment
  -> supports end-to-end CI tests
• A Hadoop env for apps to test against a new Hadoop distribution
• https://github.com/evans-ye/hadoocker
Docker based dev & test env
[Diagram: an internal Docker registry serves TMH7 Hadoop server and client images. ./execute.sh loads sample data via hadoop fs -put; the Hadoop app under test talks to the cluster through RESTful APIs, alongside dependency services such as Solr and Oozie (Wuji).]
Mission-specific Platform
Use case
• Real-time streaming data flows in
• Look up external info as data flows in
• Detect threats/malicious activities on streaming data
• Correlate with other historical data (batch query) to gather more info
• Can also run batch detections over an arbitrary start/end time range
• Support investigation down to the raw log level
Lambda Architecture
[Diagram: receivers feed buffers; events are transformed and enriched with external info, then forked into a streaming path and a batch path.]
Kafka
• High-throughput, distributed publish-subscribe messaging system
• Supports multiple consumers attached to a topic
• Configurable partition (shard) count and replication factor
• Load-balancing within the same consumer group
• Each message is consumed only once per consumer group
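The fan-out and consumer-group semantics above can be sketched in a few lines of Python. `MiniBroker` is a hypothetical toy model, not Kafka's API: every group tracks its own committed offset (so each group sees the full stream), while messages within a group are load-balanced across members.

```python
from collections import defaultdict

class MiniBroker:
    """Toy model of Kafka fan-out: every consumer group sees the full
    stream, but within a group each message goes to exactly one member."""

    def __init__(self):
        self.log = []                      # append-only topic log
        self.offsets = defaultdict(int)    # committed offset per group
        self.rr = defaultdict(int)         # round-robin cursor per group

    def publish(self, msg):
        self.log.append(msg)

    def poll(self, group, members):
        """Deliver all unread messages for `group`, load-balanced over `members`."""
        delivered = []
        while self.offsets[group] < len(self.log):
            msg = self.log[self.offsets[group]]
            member = members[self.rr[group] % len(members)]
            delivered.append((member, msg))
            self.offsets[group] += 1       # each message consumed once per group
            self.rr[group] += 1
        return delivered

broker = MiniBroker()
for m in ("a", "b", "c"):
    broker.publish(m)

# Two independent groups: both receive the whole stream,
# but inside a group the messages are split across members.
g1 = broker.poll("streaming", ["w1", "w2"])
g2 = broker.poll("batch", ["x1"])
```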
Cassandra
• Distributed NoSQL key-value store, no SPOF
• Super fast on writes, well suited to continuously incoming data
• Decent read performance if you design it right
• Build your data model around your queries
• Spark Cassandra Connector
• Tunable consistency vs. availability (CAP theorem)
• Choose A over C for availability, and vice versa
Dynamo: Amazon’s Highly Available Key-value Store
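The consistency-vs-availability knob boils down to replica-overlap arithmetic: with n replicas, a write acknowledged by w nodes and a read contacting r nodes are guaranteed to intersect in at least one up-to-date replica only when r + w > n. A minimal sketch (the function name is ours, not a Cassandra API):

```python
def strongly_consistent(n, w, r):
    """With n replicas, a write acked by w nodes and a read from r nodes
    overlap in at least one up-to-date replica iff r + w > n."""
    return r + w > n

# QUORUM reads and writes on 3 replicas overlap -> read-your-writes
quorum_ok = strongly_consistent(3, 2, 2)
# ONE/ONE favors availability and latency over consistency
one_one_ok = strongly_consistent(3, 1, 1)
```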
Spark
• Fast, distributed, in-memory processing engine
• One system for both streaming and batch workloads
• Spark Streaming
Akka
• High performance concurrency framework for Java and Scala
• Actor model for message-driven processing
• Asynchronous by design to achieve high throughput
• Each message is handled in a single-threaded context (no locks or synchronization needed)
• Let-it-crash model for fault tolerance and auto-healing system
• Clustering mechanism to scale out
The Road to Akka Cluster, and Beyond
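The "single-threaded context, no locks" point can be sketched with one mailbox and one consumer thread. This is a conceptual Python model, not Akka's API: because only the actor's own thread touches its state, no locking is needed.

```python
import queue
import threading

class Actor:
    """Minimal actor: one mailbox, one thread, so message handling is
    single-threaded and the actor's state needs no locks."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self.count = 0                       # private state, one writer thread
        self.done = threading.Event()
        threading.Thread(target=self._run, daemon=True).start()

    def tell(self, msg):
        """Asynchronous fire-and-forget send."""
        self.mailbox.put(msg)

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:                  # poison pill stops the actor
                self.done.set()
                return
            self.count += 1                  # safe: only this thread mutates state

a = Actor()
for _ in range(1000):
    a.tell("ping")
a.tell(None)
a.done.wait()
```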
Akka Streams
• Akka Streams is a DSL library for streaming computation on Akka
• Materializer to transform each step into Actor
• Back-pressure enabled by default
[Diagram: Source -> Flow -> Sink]
The Reactive Manifesto
No back-pressure
[Diagram: a fast Source floods a slow Sink, which falls further and further behind.]
With back-pressure
[Diagram: the slow Sink signals demand upstream ("request 3"), so the fast Source only emits what the Sink can handle.]
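A bounded buffer gives the same effect as back-pressure: the fast source blocks instead of flooding the slow sink. A conceptual Python sketch, where the queue capacity of 3 plays the role of the Sink's "request 3" demand signal:

```python
import queue
import threading
import time

# When the slow sink hasn't drained its buffer, the fast source blocks
# on put() instead of piling up messages without bound.
buf = queue.Queue(maxsize=3)   # at most 3 elements in flight
produced, consumed = [], []

def source():
    for i in range(10):
        buf.put(i)             # blocks once 3 items are in flight
        produced.append(i)

def sink():
    while len(consumed) < 10:
        consumed.append(buf.get())
        time.sleep(0.01)       # deliberately slower than the source

t1 = threading.Thread(target=source)
t2 = threading.Thread(target=sink)
t1.start(); t2.start()
t1.join(); t2.join()
```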
Data pipeline with Akka Streams
• Scale up using balance and merge
source: http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-cookbook.html#working-with-flows
[Diagram: a balance stage fans elements out to three parallel workers; a merge stage recombines their outputs.]
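The balance/merge pattern is essentially fan-out to a worker pool plus a recombine. A minimal Python stand-in (not Akka Streams code; unlike a raw balance/merge, `pool.map` also preserves input order):

```python
from concurrent.futures import ThreadPoolExecutor

def worker(x):
    # stand-in for the per-element work done by each parallel worker
    return x * x

# "balance": fan elements out to 3 workers; "merge": recombine results.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(worker, range(5)))
```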
• Scale out using docker
Data pipeline with Akka Streams
$ docker-compose scale pipeline=3
Reactive Kafka
• Akka Streams wrapper for Kafka
• Commit processed offset back into Kafka
• Provide at-least-once delivery guarantee
https://github.com/softwaremill/reactive-kafka
Message delivery guarantee
• Actor Model: at-most-once
• Akka Persistence: at-least-once
• Persist log to external storage (like WAL)
• Reactive Kafka: at-least-once + back-pressure
• Write offset back into Kafka
• At-least-once + Idempotent writes = exactly-once
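The "at-least-once + idempotent writes = exactly-once" equation can be made concrete: if every write is keyed by a message id, a redelivered message becomes a no-op, so the observable effect is exactly-once. A sketch with hypothetical names:

```python
processed = {}   # message-id -> result: the idempotence ledger

def handle(msg_id, payload):
    """At-least-once delivery may hand us the same message twice;
    keying the write by message id makes the duplicate a no-op."""
    if msg_id in processed:          # duplicate redelivery
        return processed[msg_id]
    result = payload.upper()         # stand-in for the real side effect
    processed[msg_id] = result
    return result

# simulate a redelivery of message 1
deliveries = [(1, "a"), (2, "b"), (1, "a")]
for mid, p in deliveries:
    handle(mid, p)
```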
• Spark: both streaming and batch analytics
• Docker: resource management (fine for one app)
• Akka: fine-grained, elastic data pipelines
• Cassandra: batch queries
• Kafka: durable buffer, fan-out to multiple consumers
Recap: SDACK Stack
Your mileage may vary; we’re still evolving
Remember this:
The SMACK Stack
Toolbox for a wide variety of data processing scenarios
SMACK Stack
• Spark: fast and general engine for large-scale data
processing
• Mesos: cluster resource management system
• Akka: toolkit and runtime for building highly concurrent,
distributed, and resilient message-driven applications
• Cassandra: distributed, highly available database designed
to handle large amounts of data across datacenters
• Kafka: high-throughput, low-latency distributed pub-sub
messaging system for real-time data feeds
Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
Reference
• Spark Summit Europe 2015
• Streaming Analytics with Spark, Kafka,
Cassandra, and Akka (Helena Edelson)
• Big Data AW Meetup
• SMACK Architectures (Anton Kirillov)
Big Data Landscape
• Memory is faster than SSD/disk, and keeps getting cheaper
• In-Memory Computing & Fast Data
• Spark: in-memory batch/streaming engine
• Flink: in-memory streaming/batch engine
• Ignite: in-memory data fabric
• Geode (incubating): in-memory database
Big Data trends
• Off-heap storage is JVM process memory outside the heap, allocated and managed via native calls
• Size is not limited by the JVM heap (only by physical memory)
• Not subject to GC, which eliminates long GC pauses
• Project Tungsten, Flink, Ignite, Geode, HBase
Off-Heap, Off-Heap, Off-Heap
(Some) Apache Big Data Components
[Diagram: Storage: Hadoop HDFS, HBase, Cassandra (NoSQL). Resource Management: Hadoop YARN, Mesos, Slider. Processing Engines: Flink (Flink ML, Gelly), Spark (Streaming, MLlib, GraphX), Tez, Ignite. APIs and Interfaces: Pig, Hive, Phoenix, Trafodion, Geode. Kafka: messaging system. Ignite: in-memory data grid. Solr: search engine. Bigtop: Hadoop distribution. Ambari: Hadoop management.]
Bigtop 1.1 Release
Jan 2016, I expect…
Bigtop 1.1 Release
• Hadoop 2.7.1
• Spark 1.5.1
• Hive 1.2.1
• Pig 0.15.0
• Oozie 4.2.0
• Flume 1.6.0
• Zeppelin 0.5.5
• Ignite Hadoop 1.5.0
• Phoenix 4.6.0
• Hue 3.8.1
• Crunch 0.12
• …, 24 components included!
Hadoop 2.6
• Heterogeneous Storage
• SSD + hard drive
• Placement policies (all_ssd, hot, warm, cold)
• Archival Storage (cost saving)
• HDFS-7285 (Hadoop 3.0)
• Erasure coding cuts storage overhead from 3x to 1.5x
http://www.slideshare.net/Hadoop_Summit/reduce-storage-costs-by-5x-using-the-new-hdfs-tiered-storage-feature
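The 3x-to-1.5x figure is simple block arithmetic: 3-way replication stores every block three times, while a Reed-Solomon (6,3) layout, as proposed in HDFS-7285, stores 9 blocks (6 data + 3 parity) for every 6 blocks of user data:

```python
def storage_overhead(data_blocks, parity_blocks):
    """Raw bytes stored per byte of user data."""
    return (data_blocks + parity_blocks) / data_blocks

# 3-way replication: 1 data block plus 2 extra copies.
replication = storage_overhead(1, 2)
# Reed-Solomon (6,3): 6 data blocks plus 3 parity blocks.
erasure = storage_overhead(6, 3)
```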
Hadoop 2.7
• Transparent encryption (encryption zone)
• Available in 2.6
• Known issue: Encryption is sometimes done
incorrectly (HADOOP-11343)
• Fixed in 2.7
http://events.linuxfoundation.org/sites/events/files/slides/HDFS2015_Past_present_future.pdf
Rising star: Flink
• Streaming dataflow engine
• Treats batch computing as fixed-length streaming
• Exactly-once semantics via distributed snapshotting
• Event-time handling via watermarks
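Event-time handling by watermarks can be illustrated with a toy window operator: a window closes only once the watermark (max event time seen minus an allowed lag) passes its end, so moderately late events still land in the correct window. A conceptual sketch, not Flink's API:

```python
def window_counts(events, watermark_lag, window=10):
    """Count events per fixed event-time window; close a window only
    once the watermark (max event time - lag) passes its end."""
    open_windows, closed = {}, {}
    max_ts = 0
    for ts, _key in events:
        max_ts = max(max_ts, ts)
        start = ts - ts % window              # window containing this event
        open_windows[start] = open_windows.get(start, 0) + 1
        watermark = max_ts - watermark_lag
        for s in [s for s in open_windows if s + window <= watermark]:
            closed[s] = open_windows.pop(s)   # window is final, emit it
    return closed, open_windows

# out-of-order stream: ts=12 arrives before ts=8, but the watermark
# lag of 5 keeps window [0,10) open long enough to count it correctly
events = [(1, "a"), (12, "b"), (8, "c"), (25, "d")]
closed, pending = window_counts(events, watermark_lag=5)
```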
• Integrate and package Apache Flink
• Re-implement Bigtop Provisioner using docker-machine, compose, and swarm
• Deploy containers on multiple hosts
• Support any kind of base image for deployment
Bigtop Roadmap
Wrap up
• Hadoop Distribution
• Choose Bigtop if you want more control
• The SMACK Stack
• Toolbox for a wide variety of data processing scenarios
• Big Data Landscape
• In-memory, off-heap solutions are hot
Questions ?
Thank you !
