Trend Micro Big Data Platform and Apache Bigtop

* Quick Intro to Bigtop
* Trend Micro Big Data Platform
* Mission-specific Platform
* Big Data Landscape (3p)
* Bigtop 1.1 Release (6p)

Published in: Software

  1. 葉祐欣 (Evans Ye), Big Data Conference 2015: Trend Micro Big Data Platform and Apache Bigtop
  2. Who am I
      • Apache Bigtop PMC member
      • Apache Big Data Europe 2015 speaker
      • Software Engineer @ Trend Micro
      • Develops big data apps & infra
      • Has some experience in Hadoop, HBase, Pig, Spark, Kafka, Fluentd, Akka, and Docker
  3. Outline
      • Quick Intro to Bigtop
      • Trend Micro Big Data Platform
      • Mission-specific Platform
      • Big Data Landscape (3p)
      • Bigtop 1.1 Release (6p)
  4. Quick Intro to Bigtop
  5. Linux Distributions
  6. Hadoop Distributions
  7. Hadoop Distributions: We’re fully open sourced!
  8. How do I add patches?
  9. Bigtop packaging: from source code to packages
  10. Bigtop feature set: packaging, testing, deployment, and virtualization, for you to easily build your own Big Data Stack
  11. Supported components
  12. One click to build packages
      $ git clone https://github.com/apache/bigtop.git
      $ docker run \
          --rm \
          --volume `pwd`/bigtop:/bigtop \
          --workdir /bigtop \
          bigtop/slaves:trunk-centos-7 \
          bash -l -c './gradlew rpm'
  13. $ ./gradlew tasks
  14. Easy to do CI: ci.bigtop.apache.org
  15. RPM/DEB packages: www.apache.org/dist/bigtop
  16. One click Hadoop provisioning
      $ ./docker-hadoop.sh -c 3
  17. One click Hadoop provisioning
      $ ./docker-hadoop.sh -c 3
      (diagram: bigtop/deploy image on Docker Hub)
  18. One click Hadoop provisioning
      $ ./docker-hadoop.sh -c 3
      (diagram: bigtop/deploy image on Docker Hub; "puppet apply" shown three times)
      Just google "bigtop provisioner"
  19. Should I use Bigtop?
  20. If you want to build your own customised Big Data Stack
  21. Curves ahead…
  22. Pros & cons
      • Bigtop
          • You need a talented Hadoop team
          • Self-service: troubleshoot, find solutions, develop patches
          • Add any patch at any time you want (additional effort)
          • Choose any version of a component you want (additional effort)
      • Vendors (Hortonworks, Cloudera, etc.)
          • Better support, since they’re the ones who write the code!
          • $
  23. Trend Micro Big Data Platform
  24. Trend Micro Hadoop (TMH)
      • Use Bigtop as the basis for our internal custom distribution of Hadoop
      • Apply community and private patches to upstream projects for business and operational needs
      • The newest release, TMH7, is based on Bigtop 1.0 SNAPSHOT
  25. Working with community made our life easier
      • Knowing the community status made a TMH7 release based on Bigtop 1.0 SNAPSHOT possible
  26. Working with community made our life easier
      • Contribute Bigtop Provisioner, packaging code, puppet recipes, bugfixes, CI infra, anything!
      • Knowing the community status made a TMH7 release based on Bigtop 1.0 SNAPSHOT possible
  27. Working with community made our life easier
      • Leverage Bigtop smoke tests and integration tests with Bigtop Provisioner to evaluate TMH7
  28. Working with community made our life easier
      • Contribute feedback, evaluation, and use cases through production-level adoption
      • Leverage Bigtop smoke tests and integration tests with Bigtop Provisioner to evaluate TMH7
  29. Trend Micro Big Data Stack, Powered by Bigtop (diagram)
      • Layer labels: Storage, Resource Management, Processing Engine, APIs and Interfaces, In-house Apps, Deployment
      • Component labels: Hadoop HDFS, Hadoop YARN, MapReduce, HBase, Solr Cloud, Kerberos, Pig, Oozie, Wuji, Ad-hoc Query UDFs, App A, App B, App C, App D
      • Deployment: Hadooppet (prod), Hadoocker (dev)
  30. Hadooppet
      • Puppet recipes to deploy and manage the TMH Big Data Platform
      • HDFS, YARN, HA auto-configured
      • Kerberos, LDAP auto-configured
      • Kerberos cross-realm authentication auto-configured (for distcp to run across secured clusters)
  31. Hadoocker
      • A DevOps toolkit for Hadoop app developers to develop and test their code on
      • Big Data Stack preloaded images
          -> dev & test env w/o deployment
          -> support end-to-end CI test
      • A Hadoop env for apps to test against a new Hadoop distribution
      • https://github.com/evans-ye/hadoocker
  32. Docker based dev & test env (diagram)
      • Labels: ./execute.sh, internal Docker registry, TMH7 (Hadoop server, Hadoop client), Hadoop app, RESTful APIs, data, sample data, hadoop fs put
  33. Docker based dev & test env (diagram)
      • Labels: ./execute.sh, internal Docker registry, TMH7 (Hadoop server, Hadoop client), Hadoop app, RESTful APIs, data, sample data, hadoop fs put
      • Dependency service: Solr, Oozie (Wuji)
  34. Mission-specific Platform
  35. Use case
      • Real-time streaming data flows in
      • Look up external info when data flows in
      • Detect threats/malicious activities on streaming data
      • Correlate with other historical data (batch query) to gather more info
      • Can also run batch detections by specifying an arbitrary start time and end time
      • Support investigation down to the raw log level
  36. Lambda Architecture
  37. (diagram) receiver
  38. (diagram) receiver, buffer
  39. (diagram) receiver, buffer; transformation, lookup ext info
  40. (diagram) receiver, buffer; transformation, lookup ext info; batch, streaming
  41. (diagram) receiver, buffer; transformation, lookup ext info; batch, streaming
  42. Kafka
      • High-throughput, distributed publish-subscribe messaging system
      • Supports multiple consumers attached to a topic
      • Configurable partitions (shards) and replication factor
      • Load-balances within the same consumer group
          • Each message is consumed only once within the group
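The consumer-group behaviour on this slide can be sketched with a toy model. This is not the Kafka client API: `assign_partitions`, the group names, and the consumer names are all hypothetical, and real Kafka assignors differ in detail. The sketch only shows that every group sees all partitions, while inside one group each partition (and so each message) goes to exactly one consumer.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Toy round-robin assignment of partitions to the consumers of ONE group.

    Illustrative only; Kafka's real assignment strategies are more involved.
    """
    assignment = defaultdict(list)
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)


if __name__ == "__main__":
    partitions = [0, 1, 2, 3]
    # Two independent groups attached to the same topic (hypothetical names).
    groups = {"batch": ["b1"], "streaming": ["s1", "s2"]}
    for name, members in groups.items():
        # Each group covers every partition; members split them between them.
        print(name, assign_partitions(partitions, members))
```

Adding a second consumer to a group halves the partitions each one handles, which is the load balancing the slide refers to; adding a second group instead gives a full, independent copy of the stream.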
  43. Cassandra
      • Distributed NoSQL key-value storage, no SPOF
      • Super fast on write, suitable for data that keeps coming in
      • Decent read performance, if you design it right
          • Build the data model around your queries
      • Spark Cassandra Connector
      • Configurable C/A trade-off (CAP theorem)
          • Choose A over C for availability, and vice versa
      • Reference: Dynamo: Amazon’s Highly Available Key-value Store
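One common way to reason about the configurable consistency/availability trade-off in Dynamo-style stores is the quorum overlap rule: with N replicas, a read from R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. The slide does not state this rule; the function name and the example values below are assumptions for illustration.

```python
def read_sees_latest_write(n_replicas, write_acks, read_acks):
    """Quorum overlap rule (assumed here, not taken from the slide):
    read and write replica sets must intersect when R + W > N."""
    return read_acks + write_acks > n_replicas


if __name__ == "__main__":
    # Favour consistency: quorum writes and quorum reads on 3 replicas.
    print(read_sees_latest_write(3, 2, 2))
    # Favour availability/latency: one ack each, so reads may be stale.
    print(read_sees_latest_write(3, 1, 1))
```

Lower R and W keep the system answering when replicas are down (choosing A); higher values trade that away for consistent reads (choosing C).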
  44. Spark
      • Fast, distributed, in-memory processing engine
      • One system for streaming and batch workloads
      • Spark Streaming
  45. Akka
      • High performance concurrency framework for Java and Scala
      • Actor model for message-driven processing
      • Asynchronous by design to achieve high throughput
      • Each message is handled in a single-threaded context (no locks or synchronization needed)
      • Let-it-crash model for fault tolerance and a self-healing system
      • Clustering mechanism to scale out
      • Reference: The Road to Akka Cluster, and Beyond
  46. Akka Streams
      • Akka Streams is a DSL library for streaming computation on Akka
      • The materializer turns each step into an actor
      • Back-pressure enabled by default
      • Source -> Flow -> Sink
      • Reference: The Reactive Manifesto
  47. No back-pressure (diagram): Source: Fast!!! Sink: Slow… (>﹏<)’v( ̄︶ ̄)y
  48. No back-pressure (diagram): Source: Fast!!! Sink: Slow… (>﹏<)’’’’’v( ̄︶ ̄)y
  49. With back-pressure (diagram): Source: Fast!!! Sink: Slow…
  50. With back-pressure (diagram): Source: Fast!!! Sink: Slow… The sink signals demand upstream ("request 3")
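The "request 3" idea from the back-pressure slides can be sketched in plain Python. This is a simplified stand-in, not the Akka Streams API: the `Source` class, `run_with_demand`, and the batch size are hypothetical. The point is only that the source emits nothing until the sink asks, and never more than the sink asked for, so no unbounded pile of elements builds up between them.

```python
from collections import deque


class Source:
    """A fast producer that only emits when downstream signals demand."""

    def __init__(self, items):
        self._items = deque(items)

    def pull(self, demand):
        """Emit at most `demand` elements, as requested by downstream."""
        batch = []
        while self._items and len(batch) < demand:
            batch.append(self._items.popleft())
        return batch


def run_with_demand(source, demand=3):
    """A slow sink: request `demand` elements, process them, then ask again."""
    received = []
    max_in_flight = 0
    while True:
        batch = source.pull(demand)  # the "request 3" signal
        if not batch:
            break
        max_in_flight = max(max_in_flight, len(batch))
        received.extend(batch)  # sink finishes this batch before requesting more
    return received, max_in_flight


if __name__ == "__main__":
    items, in_flight = run_with_demand(Source(range(10)), demand=3)
    print(items, in_flight)
```

Without the demand signal, the source would push all ten elements at once and the slow sink would have to buffer or drop them, which is the situation the "No back-pressure" slides depict.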
  51. Data pipeline with Akka Streams
      • Scale up using balance and merge (diagram: balance -> worker, worker, worker -> merge)
      • Source: http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-cookbook.html#working-with-flows
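The balance/merge shape (fan out to a pool of identical workers, then fan results back in) can be approximated with a thread pool. This is an analogy rather than the Akka Streams graph DSL: `worker` is a hypothetical transformation, and `pool.map` returns results in input order, whereas a merge stage makes no ordering promise.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(element):
    """Hypothetical per-element transformation done by each worker."""
    return element * 2


def balance_and_merge(elements, parallelism=3):
    """Distribute elements over `parallelism` workers and collect the results."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() hands each element to whichever worker thread is free
        # ("balance") and gathers the outputs into one sequence ("merge").
        return list(pool.map(worker, elements))


if __name__ == "__main__":
    print(balance_and_merge(range(6)))
```

Raising `parallelism` scales the slow stage up within one process; the next slide covers scaling out across containers instead.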
  52. Data pipeline with Akka Streams
      • Scale out using docker
      $ docker-compose scale pipeline=3
  53. Reactive Kafka
      • Akka Streams wrapper for Kafka
      • Commits processed offsets back into Kafka
      • Provides an at-least-once delivery guarantee
      • https://github.com/softwaremill/reactive-kafka
  54. Message delivery guarantee
      • Actor Model: at-most-once
      • Akka Persistence: at-least-once
          • Persist log to external storage (like WAL)
      • Reactive Kafka: at-least-once + back-pressure
          • Write offset back into Kafka
      • At-least-once + idempotent writes = exactly-once
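The last bullet can be made concrete with a small simulation. It is a sketch under assumptions, not Reactive Kafka code: the log is a Python list, `IdempotentStore` is a hypothetical keyed store, and a "crash" is modelled by returning before the offset is committed. After the restart one message is delivered twice (at-least-once), yet the stored result is the same as if it had been processed exactly once, because writing the same key again changes nothing.

```python
class IdempotentStore:
    """Keyed store: writing the same (key, value) twice leaves one row."""

    def __init__(self):
        self.rows = {}
        self.writes = 0  # counts every delivery, including redeliveries

    def upsert(self, key, value):
        self.writes += 1
        self.rows[key] = value


def process(log, store, committed_offset, crash_before_commit_at=None):
    """Consume `log` from `committed_offset`; return the new committed offset.

    The offset is committed only AFTER the write, so a crash in between
    causes redelivery on restart rather than data loss.
    """
    for offset in range(committed_offset, len(log)):
        key, value = log[offset]
        store.upsert(key, value)
        if offset == crash_before_commit_at:
            return committed_offset  # crashed: write done, commit not done
        committed_offset = offset + 1
    return committed_offset


if __name__ == "__main__":
    log = [("a", 1), ("b", 2), ("c", 3)]
    store = IdempotentStore()
    offset = process(log, store, 0, crash_before_commit_at=1)
    offset = process(log, store, offset)  # restart: offset 1 is redelivered
    print(offset, store.writes, store.rows)
```

If the sink appended rows instead of upserting by key, the redelivered message would show up twice, which is why the idempotent write is the part that turns at-least-once into effectively exactly-once.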
  55. Recap: SDACK Stack
      • Spark: both streaming and batch analytics
      • Docker: resource management (fine for one app)
      • Akka: fine-grained, elastic data pipelines
      • Cassandra: batch queries
      • Kafka: durable buffer, fan-out to multiple consumers
  56. Your mileage may vary
  57. We’re still evolving
  58. Remember this:
  59. The SMACK Stack: a toolbox for a wide variety of data processing scenarios
  60. SMACK Stack
      • Spark: fast and general engine for large-scale data processing
      • Mesos: cluster resource management system
      • Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications
      • Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters
      • Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds
      • Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
  61. Reference
      • Spark Summit Europe 2015
          • Streaming Analytics with Spark, Kafka, Cassandra, and Akka (Helena Edelson)
      • Big Data AW Meetup
          • SMACK Architectures (Anton Kirillov)
  62. Big Data Landscape
  63. Big Data moving trend
      • Memory is faster than SSD/disk, and is getting cheaper
      • In-Memory Computing & Fast Data
          • Spark: in-memory batch/streaming engine
          • Flink: in-memory streaming/batch engine
          • Ignite: in-memory data fabric
          • Geode (incubating): in-memory database
  64. Off-Heap, Off-Heap, Off-Heap
      • Off-heap storage is JVM process memory outside of the heap, allocated and managed using native calls
          • Its size is not limited by the JVM heap (only by physical memory)
          • It is not subject to GC, which essentially removes long GC pauses
      • Project Tungsten, Flink, Ignite, Geode, HBase
  65. (Some) Apache Big Data Components (diagram)
      • Category labels: Storage, Resource Management, Processing Engine, APIs and Interfaces, messaging system, in-memory data grid, search engine, NoSQL, Hadoop Distribution, Hadoop Management
      • Component labels: Hadoop HDFS, Hadoop YARN, Mesos, Slider, Pig, Hive, Tez, Phoenix, Trafodion, Spark (Streaming, MLlib, GraphX), Flink (Flink ML, Gelly), Kafka, HBase, Cassandra, Ignite, Geode, Solr, Bigtop, Ambari
  66. Bigtop 1.1 Release: Jan 2016, I expect…
  67. Bigtop 1.1 Release
      • Hadoop 2.7.1
      • Spark 1.5.1
      • Hive 1.2.1
      • Pig 0.15.0
      • Oozie 4.2.0
      • Flume 1.6.0
      • Zeppelin 0.5.5
      • Ignite Hadoop 1.5.0
      • Phoenix 4.6.0
      • Hue 3.8.1
      • Crunch 0.12
      • …, 24 components included!
  68. Hadoop 2.6
      • Heterogeneous storage
          • SSD + hard drive
          • Placement policy (all_ssd, hot, warm, cold)
      • Archival storage (cost saving)
      • HDFS-7285 (Hadoop 3.0)
          • Erasure coding to cut storage overhead from 3X to 1.5X
      • http://www.slideshare.net/Hadoop_Summit/reduce-storage-costs-by-5x-using-the-new-hdfs-tiered-storage-feature
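The "3X to 1.5X" figure follows from simple arithmetic once a coding layout is fixed. The slide does not name one; the sketch below assumes a Reed-Solomon layout with 6 data units and 3 parity units, and the function names are made up for illustration.

```python
def replication_overhead(replicas):
    """Raw bytes stored per logical byte under plain replication."""
    return float(replicas)


def erasure_coding_overhead(data_units, parity_units):
    """Raw bytes stored per logical byte under a (data, parity) erasure code."""
    return (data_units + parity_units) / data_units


if __name__ == "__main__":
    # Default HDFS replication factor of 3.
    print(replication_overhead(3))
    # Assumed RS(6, 3) layout: 9 units stored for every 6 units of data.
    print(erasure_coding_overhead(6, 3))
```

Under that assumption the same data takes half the raw capacity, while still tolerating the loss of any three units in a stripe.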
  69. Hadoop 2.7
      • Transparent encryption (encryption zone)
          • Available in 2.6
          • Known issue: encryption is sometimes done incorrectly (HADOOP-11343)
          • Fixed in 2.7
      • http://events.linuxfoundation.org/sites/events/files/slides/HDFS2015_Past_present_future.pdf
  70. Rising star: Flink
      • Streaming dataflow engine
      • Treats batch computing as fixed-length streaming
      • Exactly-once by distributed snapshotting
      • Event-time handling by watermarks
  71. Bigtop Roadmap
      • Integrate and package Apache Flink
      • Re-implement Bigtop Provisioner using docker-machine, compose, swarm
          • Deploy containers on multiple hosts
          • Support any kind of base image for deployment
  72. Wrap up
  73. Wrap up
      • Hadoop Distribution
          • Choose Bigtop if you want more control
      • The SMACK Stack
          • Toolbox for a variety of data processing scenarios
      • Big Data Landscape
          • In-memory, off-heap solutions are hot
  74. Questions? Thank you!
