[Hadoop] NexR Terapot: Massive Email Archiving
1. Terapot: Massive Email Archiving with Hadoop & Friends
- Commercial Hadoop Application
Jason Han
Founder & CEO, NexR
jshan@nexr.co.kr
Next Revolution, Toward Open Platform
2. #2
NexR: Introduction
Offering Hadoop & Cloud Computing Platform and Services
Hadoop Provisioning & Management Hadoop & Cloud Computing Services
Academic Support
Massive Email Archiving MapReduce Workflow Program
Massive Data Storage & Processing Platform
Cloud Computing Platform
(Compatible with Amazon AWS)
icube-cc (Compute), icube-sc (Storage)
3. #3
Email Archiving: Objectives
Regulatory compliance
e-Discovery: Litigation and legal discovery
E-mail backup and disaster recovery
Messaging system & storage optimization
Monitoring of internal and external e-mail content
4. #4
Email Archiving: Architecture
[Diagram: email servers feed the email archiving servers (HA) through crawling and journaling. A DB server holds the metadata and indexes used for search & discovery. Aging email moves over the storage network to archival storage: DAS, SAN, NAS, and tape library.]
5. #5
Email Archiving: Challenges
Explosive growth of digital data
- 988 XB in 2010, 6 times the 2006 volume
- 95% (939 XB) is unstructured data, including email
- Increasing the cost and complexity of archiving
Requiring scalable & low cost archiving
Reinforcement of data retention regulation
- Retention, Disposal, e-Discovery, Security
- HIPAA (Healthcare) 21-23 yrs, SEC17 (Trading) 6 yrs,
OSHA (Toxic) 30 yrs, SOX (Finance) 5 yrs, J-SOX, K-SOX
Requiring scalable archiving & fast discovery
Needs for intelligent data management
- Knowledge management from email data
- Filtering, monitoring, data mining, etc
Requiring integration with intelligent system
7. #7
Email Archiving: Problems
[Diagram: the architecture from slide #4, annotated with its problems.]
- Centralized search is slow & not scalable
- Storage is expensive & not scalable
- Discovery from tape is slow
8. #8
Terapot: When Hadoop Met Email Archiving…
Scale-out architecture with Hadoop
- Hadoop HDFS for archiving email data
- Hadoop MapReduce for crawling & indexing
- Apache Lucene for search & discovery
[Diagram: email servers reach the email archiving servers (HA) through distributed crawling and journaling (journaling server). Hadoop MapReduce handles crawling, indexing, etc.; Hadoop HDFS handles archiving; a DB server holds the metadata; distributed search & discovery runs on top.]
9. #9
Terapot: Overview
Design Principles
Shared nothing architecture -> Unlimited scalability
Inexpensive hardware -> Low cost
Using open source software -> Fast development
Exploiting parallelism -> High performance
Integrating with analysis -> High intelligence
Features
Distributed massive email archiving
High scalability
thousands of servers, billions of emails
High Performance
Fast search: responses within 1-2 seconds for each user account
Fast discovery in parallel with MapReduce
High Intelligence
Email data mining, such as social network analysis
Supports both on-premise and cloud (hosted) versions
Development with various open source software
10. #10
Terapot: Open Source Software Stack
Frontend Layer: Apache Tomcat, Apache JAMES
Services: Crawling, Indexing, Searching, Email Mining, Downloading
Backend Layer: Zookeeper, Apache Lucene, Hive, MySQL, Hadoop MapReduce, Hadoop HDFS
11. #11
Terapot: Architecture
[Diagram: Terapot architecture, top to bottom]
Terapot Clients: SOAP, REST, JSON
Email Sources: POP3 server, mail server, NAS/NFS server, HTTP/FTP/SFTP
Terapot Frontend: Search Gateway, MailServer, MR Workflow Manager, Analyzer
Processing: Searching; Real-Time Indexing; Batch processing (Crawling, Indexing, Merging); Analysis (ETL, Mining)
Engines: Hadoop MapReduce, Lucene, & Hive
Storage: HDFS (email, index), Local (index)
12. #12
Terapot Data Archiving Flow
[Diagram: three layers, all backed by emails and index files in HDFS]
Search Layer (shards with their indexes):
1. Search emails
Real-Time Indexing Layer (Internet, SMTP server, Real-Time shard):
1. Send email
2. Deliver email
3. Push email
4. Save email & build index files in runtime
5. Forward email
6. Receive email
Batch Processing Layer (HTTP/FTP/SFTP and NAS/NFS servers, Crawler (MR), Indexing (MR)):
1. Fetch emails in parallel
2. Save emails
3. Build index files
13. #13
Terapot Data Analysis Flow
[Diagram: three layers]
Report Retrieval Layer (NexR Terapot Front, analysis data in MySQL):
1. View report for archiving data
Data Analysis Layer (Terapot Mining Engine):
1. Send HiveQL to analyze data
2. Generate report in MySQL
ETL Layer (Terapot Archiving Storage, analysis data in HDFS):
1. Fetch emails in parallel, transform (MR)
2. Store large data
14. #14
Technical Features
Distributed Archiving
Hadoop HDFS for storing email data
Compression and deduplication for storage space efficiency
Distributed Crawling & Indexing
Implemented by Hadoop MapReduce
Support both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc)
Support batch indexing & merging by MapReduce and real-time indexing for instant archiving
Distributed Search
Shard a search job and execute it in parallel
Searchable instantly on receiving an email (due to real-time indexing)
Parallel Download
Download full search results in parallel by MapReduce
Support various download protocols (Local FS, HDFS, FTP, SFTP, HTTP, etc)
Standard Client Interface
Support REST/SOAP and JSON interface
Management
Configurable MapReduce job scheduling (crawling, indexing, merging, etc)
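The "compression and deduplication" item under Distributed Archiving can be illustrated with a small sketch. The slides do not say how Terapot implements either one, so everything below is an assumption: a content hash (SHA-256) identifies duplicate messages, zlib stands in for the compression codec, and the hypothetical `DedupStore` class keeps its data in plain dictionaries rather than HDFS.

```python
import hashlib
import zlib


class DedupStore:
    """Toy content-addressed store: one compressed copy per distinct message."""

    def __init__(self):
        self.blobs = {}   # digest -> compressed bytes
        self.refs = {}    # archive key -> digest

    def put(self, key: str, raw: bytes) -> str:
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self.blobs:      # store each distinct body only once
            self.blobs[digest] = zlib.compress(raw)
        self.refs[key] = digest
        return digest

    def get(self, key: str) -> bytes:
        return zlib.decompress(self.blobs[self.refs[key]])


store = DedupStore()
body = b"Quarterly report attached.\n" * 50
store.put("alice/1", body)
store.put("bob/7", body)        # same message delivered to a second mailbox
store.put("bob/8", b"Lunch?")
```

Three archive keys end up pointing at two stored blobs, and the repeated body is held once in compressed form.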
15. #15
Crawling
Store Massive Email Data in HDFS through MapReduce
The Hadoop utility (dfs -put) only copies data sequentially
Each Crawling MR takes & stores a range of data in parallel
[Diagram: a crawling client submits the data location information as input. The input is split, and each Crawling MR task writes its {key,email}* pairs to HDFS.]
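The splitting step on this slide can be sketched as follows. This is a minimal illustration, not Terapot's code: threads stand in for MapReduce tasks, a list stands in for the mail source, a dictionary stands in for HDFS, and the names `split_ranges` and `crawl_range` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total: int, tasks: int):
    """Split [0, total) into at most `tasks` contiguous, near-equal ranges."""
    base, extra = divmod(total, tasks)
    ranges, start = [], 0
    for i in range(tasks):
        size = base + (1 if i < extra else 0)
        if size:
            ranges.append((start, start + size))
        start += size
    return ranges


def crawl_range(source, start, end):
    """One crawling task: fetch its range and emit (key, email) pairs."""
    return [(f"msg-{i:06d}", source[i]) for i in range(start, end)]


source = [f"body of email {i}" for i in range(10)]   # stand-in for a mail source
archive = {}                                         # stand-in for HDFS

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [
        pool.submit(crawl_range, source, s, e)
        for s, e in split_ranges(len(source), 3)
    ]
    for future in futures:
        archive.update(future.result())
```

Because each task owns a disjoint range, the tasks never contend for the same message, which is what lets the copy run in parallel instead of sequentially.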
16. #16
Indexing
Indexing Email Data with MapReduce
Each Indexing MR takes a range of data and makes a Lucene index in parallel
[Diagram: an indexing client submits the email data in HDFS as input. The input is split, and each Indexing MR task emits {key,index}* pairs.]
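To make the per-range indexing idea concrete, here is a small sketch in which each task builds its own index over its own slice of the data. It uses a plain in-memory inverted index as a stand-in for Lucene; the tokenizer and the function name `index_range` are assumptions made for illustration.

```python
import re
from collections import defaultdict


def index_range(pairs):
    """One indexing task: build an inverted index for its (key, email) pairs."""
    index = defaultdict(set)
    for key, text in pairs:
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(key)
    return dict(index)


emails = [
    ("msg-1", "Budget review on Monday"),
    ("msg-2", "Monday lunch?"),
    ("msg-3", "Final budget numbers"),
    ("msg-4", "Server maintenance window"),
]

# Two tasks, each indexing its own range of the data independently.
partial_indexes = [index_range(emails[:2]), index_range(emails[2:])]
```

Each partial index covers only its own range, so a later merge or a fan-out search is needed to answer a query over the whole archive.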
17. #17
Real-Time Indexing
Indexing Email Data in Runtime
Index in memory when a new email arrives
Flushing RT-Shard periodically into HDFS
[Diagram: incoming mail reaches the Mailet component in JAMES, which forwards the email data to the RT shards. Each RT shard keeps a local index and periodically flushes its emails into HDFS.]
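The index-in-memory-then-flush pattern can be sketched as below. This is an illustration under stated assumptions: the hypothetical `RealTimeShard` class flushes after a fixed number of emails instead of on a timer, a Python list stands in for HDFS, and whitespace splitting stands in for real tokenization.

```python
class RealTimeShard:
    """In-memory index for new mail, flushed in batches to durable storage."""

    def __init__(self, durable_store, flush_every=3):
        self.durable_store = durable_store   # list standing in for HDFS
        self.flush_every = flush_every       # count-based trigger (assumption)
        self.pending = []                    # emails not yet flushed
        self.terms = {}                      # term -> set of keys, in memory

    def on_email(self, key, text):
        """Index the email immediately, then flush if the batch is full."""
        self.pending.append((key, text))
        for term in text.lower().split():
            self.terms.setdefault(term, set()).add(key)
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        if self.pending:
            self.durable_store.append(list(self.pending))
            self.pending.clear()

    def search(self, term):
        return sorted(self.terms.get(term.lower(), set()))


hdfs = []
shard = RealTimeShard(hdfs, flush_every=2)
shard.on_email("m1", "Hello world")
hits_before_flush = shard.search("hello")   # searchable before any flush
shard.on_email("m2", "hello again")         # second email triggers a flush
```

The first email is searchable while it still exists only in memory, which is the property the slide describes as instant archiving.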
18. #18
Searching
Distributed Search
Indexes are split & stored in local disks
Each shard is responsible for searching a range of the index
[Diagram: a search client sends searches to the shards. Each shard searches its local index and reads the email from HDFS. Shards and RT shards update shard state & index information in Zookeeper, which sends notifications.]
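The fan-out described on this slide can be sketched as a query sent to every shard in parallel, followed by a merge. The sketch assumes each shard holds a simple term-to-keys mapping; the `Shard` class and `distributed_search` function are hypothetical, and shard discovery through Zookeeper is left out.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Searches only its own slice of the index (term -> list of email keys)."""

    def __init__(self, index):
        self.index = index

    def search(self, term):
        return self.index.get(term, [])


def distributed_search(shards, term):
    """Send the same query to every shard in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(lambda shard: shard.search(term), shards))
    return sorted(key for hits in partial for key in hits)


shards = [
    Shard({"budget": ["msg-1"]}),
    Shard({"budget": ["msg-3"], "lunch": ["msg-2"]}),
    Shard({}),
]
hits = distributed_search(shards, "budget")
```

Since every shard searches only its own range, adding shards shortens each shard's work rather than lengthening the query.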
19. #19
Parallel Downloading
Downloading Massive Search Results in Parallel
Support various types of communications for downloading
Downloading MR sorts search results globally & pushes into targets
[Diagram: a download client sends a download request. DL Map tasks read the search results from the shards and HDFS; DL Reduce tasks perform a distributed global sort and write the results to the targets (Local, HDFS, FTP, SFTP, HTTP). Some DL Map tasks write their results directly.]
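The slides do not explain how the global sort is implemented. One common MapReduce approach, assumed here, is range partitioning: the map side routes each record to a reducer by key range, each reducer sorts its own range, and the reducer outputs read in order form one globally sorted result. The function names and the boundary keys are hypothetical.

```python
import bisect


def partition(records, boundaries):
    """Map side: route each (key, value) record to a reducer by key range."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect.bisect_right(boundaries, key)].append((key, value))
    return buckets


def global_sort(shard_results, boundaries):
    """Reduce side: each reducer sorts its own key range independently."""
    reducers = [[] for _ in range(len(boundaries) + 1)]
    for records in shard_results:
        for i, bucket in enumerate(partition(records, boundaries)):
            reducers[i].extend(bucket)
    return [sorted(reducer) for reducer in reducers]


shard_results = [
    [("m07", "g"), ("m02", "b")],
    [("m05", "e"), ("m01", "a")],
    [("m09", "i"), ("m03", "c")],
]
parts = global_sort(shard_results, boundaries=["m04", "m07"])
flat_keys = [key for part in parts for key, _ in part]
```

Each reducer can push its sorted part to the download target on its own, so no single node has to hold the full result set.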
20. #20
Email Data Analysis
Analysis Process
ETL (Extract-Transform-Load) the email archive data into Hive table format
Analyze the data in Hive with various analysis algorithms
Generate the analysis result report
[Diagram: Terapot Mining loads the archiving data through ETL MR tasks, which write their results to Hive. Terapot Mining then executes HiveQL, generates the report, and writes the result to MySQL.]
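The extract-and-transform part of this process can be sketched with Python's standard `email` package. The row layout (sender, recipient, field) is an assumption chosen for illustration; the slides do not show Terapot's actual Hive table schema, and the load into Hive is omitted.

```python
from email import message_from_string
from email.utils import getaddresses


def extract_links(raw: str):
    """Turn one raw message into flat (sender, recipient, field) rows."""
    msg = message_from_string(raw)
    senders = [addr.lower() for _, addr in getaddresses(msg.get_all("From", []))]
    rows = []
    for field in ("To", "Cc"):
        for _, addr in getaddresses(msg.get_all(field, [])):
            for sender in senders:
                rows.append((sender, addr.lower(), field.upper()))
    return rows


raw = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.com, Carol <carol@example.org>\n"
    "Cc: dave@example.org\n"
    "Subject: Budget\n"
    "\n"
    "See attached.\n"
)
rows = extract_links(raw)
```

Flat rows like these map naturally onto a table, which is the shape a SQL-style engine such as Hive needs for aggregation.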
21. #21
Types of Analysis
Social Network Analysis
Personal Network Analysis
Computing distance between recipients or senders based on TO, CC, FROM links
Analyzing the statistics of mail frequency
Domain Analysis
Computing distance between recipient domains based on TO, CC, FROM links
Keyword Analysis (in progress)
Keyword frequency for each user
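The slides do not define "distance". A simple reading, assumed here, is the fewest email links separating two people, which a breadth-first search over the sender-recipient graph computes directly. The function names and the sample links are hypothetical; the same code would work for domain analysis if addresses were first mapped to their domains.

```python
from collections import defaultdict, deque


def build_graph(links):
    """Undirected graph: an edge joins a sender to each TO/CC recipient."""
    graph = defaultdict(set)
    for sender, recipient in links:
        graph[sender].add(recipient)
        graph[recipient].add(sender)
    return graph


def distance(graph, start, goal):
    """Fewest email links between two people, or None if unconnected."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if person == goal:
            return hops
        for neighbor in graph[person] - seen:
            seen.add(neighbor)
            queue.append((neighbor, hops + 1))
    return None


links = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
graph = build_graph(links)
alice_to_carol = distance(graph, "alice", "carol")
alice_to_erin = distance(graph, "alice", "erin")
```

Alice never mailed Carol directly, but both exchanged mail with Bob, so they are two links apart; Erin is unreachable from Alice.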
22. #22
Terapot Performance
Experimental Environment
11 Intel Servers: 1 Master + 10 Slaves
2 Xeon 2.0 GHz CPUs, 16 GB memory, 4 TB disk
The number of emails: 270 million (index size: 270 GB)
Results
Indexing in local disks
Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           1.4
134,434,596         25,094,796           1.4
201,651,894         37,642,194           1.4
268,869,192         50,189,592           1.4
Indexing in HDFS
Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           2.8
134,434,596         25,094,796           2.8
201,651,894         37,642,194           3.2
268,869,192         50,189,592           3.2
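Two ratios make the tables easier to read: the share of emails that each query matched, and how much slower the HDFS-backed index was than the local one. The snippet below only recomputes these from the figures in the tables; the variable names are illustrative.

```python
rows = [
    # (emails searched, results, local-disk seconds, HDFS seconds)
    (67_217_298, 12_547_398, 1.4, 2.8),
    (134_434_596, 25_094_796, 1.4, 2.8),
    (201_651_894, 37_642_194, 1.4, 3.2),
    (268_869_192, 50_189_592, 1.4, 3.2),
]

hit_ratios = [results / emails for emails, results, _, _ in rows]
slowdowns = [hdfs / local for _, _, local, hdfs in rows]

for (emails, _, _, _), ratio, slowdown in zip(rows, hit_ratios, slowdowns):
    print(f"{emails:>11,} emails: {ratio:.1%} matched, HDFS {slowdown:.2f}x slower")
```

The matched share is the same in every row (about 18.7%), so the result count grows fourfold while the local response time stays at 1.4 seconds. The HDFS-backed index is between 2.0 and about 2.3 times slower.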