[Hadoop] NexR Terapot: Massive Email Archiving

Terapot: Massive Email Archiving with Hadoop & Friends
- Commercial Hadoop Application

Jason Han
Founder & CEO, NexR
jshan@nexr.co.kr

Next Revolution, Toward Open Platform

#2

NexR: Introduction

Offering Hadoop & Cloud Computing Platform and Services

-  Hadoop Provisioning & Management
-  Hadoop & Cloud Computing Services
-  Academic Support
-  Massive Email Archiving
-  MapReduce Workflow Program
-  Massive Data Storage & Processing Platform
-  Cloud Computing Platform (Compatible with Amazon AWS): icube-cc (Compute), icube-sc (Storage)

#3

Email Archiving: Objectives

  Regulatory compliance

  e-Discovery: Litigation and legal discovery

  E-mail backup and disaster recovery

  Messaging system & storage optimization

  Monitoring of internal and external e-mail content

#4

Email Archiving: Architecture

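[Diagram: email servers feed the email archiving servers (HA) through crawling and journaling. The archiving servers keep metadata in a DB server, serve search & discovery from indexes, and move aging email across the storage network to archival storage (DAS, SAN, NAS, tape library).]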

#5

Email Archiving: Challenges

  Explosive growth of digital data
-  6-fold growth from 2006 to 2010, reaching 988 XB (exabytes)
-  95% (939 XB) is unstructured data, including email
-  Increasing cost and complexity of archiving
-> Requiring scalable & low-cost archiving

  Reinforcement of data retention regulation
-  Retention, Disposal, e-Discovery, Security
-  HIPAA (Healthcare) 21 ~ 23 yrs, SEC17 (Trading) 6 yrs, OSHA (Toxic) 30 yrs, SOX (Finance) 5 yrs, J-SOX, K-SOX
-> Requiring scalable archiving & fast discovery

  Needs for intelligent data management
-  Knowledge management from email data
-  Filtering, monitoring, data mining, etc.
-> Requiring integration with intelligent systems

#6

Email Archiving: Regulatory Compliance

#7

Email Archiving: Problems

[Diagram: the conventional archiving architecture (email servers, crawling and journaling, email archiving servers (HA), DB server, metadata, indexes, storage network, archival storage on DAS/SAN/NAS/tape library), annotated with three problems:]
-  Centralized search is slow & not scalable
-  Storage is expensive & not scalable
-  Discovery from tape is slow

#8

Terapot: When Hadoop Met Email Archiving…
  Scale-out architecture with Hadoop
-  Hadoop HDFS for archiving email data
-  Hadoop MapReduce for crawling & indexing
-  Apache Lucene for search & discovery

[Diagram: email servers feed the email archiving servers (HA) through distributed crawling and journaling. Hadoop MapReduce handles crawling, indexing, etc.; Hadoop HDFS handles archiving. A metadata DB server and a journaling server sit alongside, and distributed search & discovery runs across the cluster.]

#9

Terapot: Overview

  Design Principles

  Shared nothing architecture -> Unlimited scalability

  Inexpensive hardware -> Low cost

  Using open source software -> Fast development

  Exploiting parallelism -> High performance

  Integrating with analysis -> High intelligence

  Features

  Distributed massive email archiving

  High scalability

  thousands of servers, billions of emails

  High Performance

  Fast search under 1-2 seconds for each user account

  Fast discovery in parallel with MapReduce

  High Intelligence

  Email data mining, such as social network analysis

  Support for both on-premise and cloud (hosted) versions

  Development with various open source software

#10

Terapot: Open Source Software Stack

[Diagram: software stack by layer]
-  Frontend Layer: Apache Tomcat, Apache JAMES
-  Crawling, Indexing, Searching, Email Mining, Downloading
-  Zookeeper, Apache Lucene, Hive, MySQL
-  Hadoop MapReduce
-  Backend Layer: Hadoop HDFS

#11

Terapot: Architecture
[Diagram: Terapot clients reach the Terapot frontend over SOAP, REST, and JSON. Email sources are POP3 servers, mail servers, HTTP/FTP/SFTP, and NAS/NFS servers. The Terapot frontend consists of the Search Gateway, MailServer, MR Workflow Manager, and Analyzer. Beneath it sit searching and real-time indexing, batch processing (crawling, indexing, merging), and analysis (ETL, mining), all built on Hadoop MapReduce, Lucene, & Hive. Storage is HDFS (email, index) plus local disk (index).]

#12

Terapot Data Archiving Flow
Search Layer
1. Search emails (each shard searches its own index)

Real-Time Indexing Layer
1. Send email
2. Deliver email (from the Internet to the SMTP server)
3. Push email
4. Save email & build index files in runtime (real-time shard, HDFS)
5. Forward email
6. Receive email

Batch Processing Layer
1. Fetch emails in parallel (Crawler MR, from HTTP/FTP/SFTP and NAS/NFS servers)
2. Save emails (HDFS)
3. Build index files (Indexing MR)

#13

Terapot Data Analysis Flow

[Diagram components: NexR Terapot Front, Terapot Mining Engine, Terapot Archiving Storage; analysis data is held in MySQL and HDFS.]

Report Retrieval Layer
1. View report for archiving data

Data Analysis Layer
1. Send HiveQL to analysis data
2. Generate report in MySQL

ETL Layer
1. Fetch emails in parallel and transform (MR)
2. Store large data

#14

Technical Features

  Distributed Archiving

  Hadoop HDFS for storing email data

  Compression and deduplication for storage space efficiency

  Distributed Crawling & Indexing

  Implemented by Hadoop MapReduce

  Support both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc.)

  Support batch indexing & merging by MapReduce and real-time indexing for instant archiving

  Distributed Search

  Sharding a search job and executing it in parallel

  Searchable instantly on receiving an email (due to real-time indexing)

  Parallel Download

  Download full search results in parallel by MapReduce

  Support various download protocols (Local FS, HDFS, FTP, SFTP, HTTP, etc.)

  Standard Client Interface

  Support REST/SOAP and JSON interface

  Management

  Configurable MapReduce job scheduling (crawling, indexing, merging, etc)
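
The "compression and deduplication" item above can be illustrated with a small sketch. The slides do not say how Terapot implements either, so the content-hash approach, the `DedupStore` class, and the use of SHA-256 and gzip are all assumptions chosen for illustration; a dictionary stands in for HDFS.

```python
import gzip
import hashlib


class DedupStore:
    """Toy content-addressed store: one compressed copy per distinct message.

    Hypothetical illustration only; not Terapot's actual storage format.
    """

    def __init__(self):
        self.blobs = {}  # digest -> gzip-compressed bytes
        self.refs = {}   # archive key -> digest

    def put(self, key: str, raw: bytes) -> str:
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self.blobs:       # store the body only once
            self.blobs[digest] = gzip.compress(raw)
        self.refs[key] = digest
        return digest

    def get(self, key: str) -> bytes:
        return gzip.decompress(self.blobs[self.refs[key]])


store = DedupStore()
msg = b"Subject: quarterly report\r\n\r\n" + b"see attachment " * 200
store.put("alice/1", msg)
store.put("bob/7", msg)  # same message delivered to a second mailbox
```

Both keys resolve to the same compressed blob, so a message sent to many mailboxes costs one stored copy plus one small reference per mailbox.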

#15

Crawling

  Store Massive Email Data in HDFS through MapReduce

  The Hadoop utility (dfs -put) just copies data sequentially

  Each crawling MR task takes & stores a range of data in parallel

[Diagram: the crawling client splits the data location information (INPUT) into ranges; each Crawling MR task fetches its range and writes {key,email}* records to HDFS.]
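
A minimal sketch of the range-splitting idea on this slide, assuming a simple in-memory source. The function names (`split_ranges`, `crawl_range`, `crawl`), the key format, and the use of threads in place of MapReduce tasks are illustrative assumptions; a dictionary stands in for HDFS.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(n_items: int, n_splits: int):
    """Divide indices 0..n_items-1 into at most n_splits contiguous ranges."""
    size, extra = divmod(n_items, n_splits)
    ranges, start = [], 0
    for i in range(n_splits):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            ranges.append((start, end))
        start = end
    return ranges


def crawl_range(source, start, end):
    """One crawling task: fetch a range and emit (key, email) pairs."""
    return [(f"msg-{i:08d}", source[i]) for i in range(start, end)]


def crawl(source, n_tasks=4):
    archive = {}  # stands in for HDFS
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        futures = [pool.submit(crawl_range, source, s, e)
                   for s, e in split_ranges(len(source), n_tasks)]
        for f in futures:
            archive.update(f.result())
    return archive


source = [f"email body {i}" for i in range(10)]
archive = crawl(source, n_tasks=3)
```

Unlike a single sequential copy, each task only touches its own range, so adding tasks shortens the crawl as long as the source can serve the ranges concurrently.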

#16

Indexing

  Indexing Email Data with MapReduce

  Each indexing MR task takes a range of data and builds a Lucene index in parallel

[Diagram: the indexing client splits the email data in HDFS (INPUT) into ranges; each Indexing MR task builds the index for its range and emits {key,index}* records.]
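
To make the per-range indexing concrete, here is a sketch in which each task builds its own index and emits a (key, index) pair. It uses a toy inverted index rather than Lucene, and `tokenize`, `index_split`, and the sample emails are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase alphanumeric tokens; a stand-in for a real analyzer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_split(split_id, emails):
    """One indexing task: build an inverted index for its range of emails."""
    postings = defaultdict(set)
    for key, body in emails:
        for term in tokenize(body):
            postings[term].add(key)
    return split_id, {t: sorted(keys) for t, keys in postings.items()}


emails = [("m1", "Budget review on Friday"),
          ("m2", "Friday lunch?"),
          ("m3", "Budget approved"),
          ("m4", "Lunch moved to Monday")]

# Two splits, each indexed independently (sequentially here for simplicity).
splits = [emails[0:2], emails[2:4]]
indexes = dict(index_split(i, s) for i, s in enumerate(splits))
```

Because no split needs data from another, the tasks can run on different nodes; the resulting indexes are what a later merging step would combine.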

#17

Real-Time Indexing

  Indexing Email Data in Runtime

  Indexing in memory when a new email arrives

  Flushing RT-Shard periodically into HDFS
[Diagram: mail arrives at JAMES, where a forwarding Mailet component passes the email data to the RT shards. Each RT shard keeps a local index and its emails, and the real-time shards are flushed periodically into HDFS.]
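
The index-in-memory-then-flush pattern can be sketched as below. This is not Terapot's RT shard: the `RTShard` class, the JSON segment format, and the count-based flush threshold (standing in for the periodic, time-based flush on the slide) are assumptions, and a local temporary directory stands in for HDFS.

```python
import json
import os
import tempfile


class RTShard:
    """Toy real-time shard: index in memory, flush segments to a directory."""

    def __init__(self, flush_dir, flush_every=3):
        self.flush_dir = flush_dir      # stands in for an HDFS path
        self.flush_every = flush_every  # hypothetical threshold
        self.postings = {}              # term -> list of message keys
        self.pending = 0
        self.segments = 0

    def add(self, key, body):
        for term in set(body.lower().split()):
            self.postings.setdefault(term, []).append(key)
        self.pending += 1
        if self.pending >= self.flush_every:
            self.flush()

    def search(self, term):
        """Search only the not-yet-flushed, in-memory part of the index."""
        return list(self.postings.get(term.lower(), []))

    def flush(self):
        if not self.pending:
            return
        path = os.path.join(self.flush_dir, f"segment-{self.segments}.json")
        with open(path, "w") as f:
            json.dump(self.postings, f)
        self.segments += 1
        self.postings, self.pending = {}, 0


flush_dir = tempfile.mkdtemp()
shard = RTShard(flush_dir, flush_every=3)
shard.add("m1", "hello world")
hits_before = shard.search("hello")  # searchable right after arrival
shard.add("m2", "hello again")
shard.add("m3", "third message")     # third add triggers a flush
```

After the flush the in-memory index is empty; in this sketch the flushed segment would have to be searched by whatever component owns the persisted index.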

#18

Searching

  Distributed Search

  Indexes are split & stored on local disks

  Each shard is responsible for searching a range of the index

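[Diagram: the searching client sends a search to the shards. Each shard searches its local index and reads the matching email from HDFS. Shards and RT shards update their shard state & index information in Zookeeper, which issues notifications.]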
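
A scatter-gather sketch of the distributed search described on this slide. The `Shard` class, the `search_all` function, and the `registry` dictionary (a stand-in for the shard state kept in Zookeeper) are hypothetical names, and threads replace network calls.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Holds the postings for one range of the index."""

    def __init__(self, name, postings):
        self.name = name
        self.postings = postings  # term -> list of message keys

    def search(self, term):
        return self.postings.get(term, [])


def search_all(registry, term):
    """Fan the query out to every live shard and merge the hits."""
    live = [shard for shard, state in registry.values() if state == "up"]
    with ThreadPoolExecutor(max_workers=max(1, len(live))) as pool:
        partials = list(pool.map(lambda s: s.search(term), live))
    return sorted(key for part in partials for key in part)


# Shard name -> (shard, state); a real system would read this from Zookeeper.
registry = {
    "shard-0": (Shard("shard-0", {"budget": ["m1"], "friday": ["m1", "m2"]}), "up"),
    "shard-1": (Shard("shard-1", {"budget": ["m3"]}), "up"),
    "shard-2": (Shard("shard-2", {"budget": ["m9"]}), "down"),
}
hits = search_all(registry, "budget")
```

Shards marked as down are skipped, which is the kind of decision the shard state information makes possible.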

#19

Parallel Downloading

  Downloading Massive Search Results in Parallel

  Support various types of communications for downloading

  Downloading MR sorts search results globally & pushes them to the targets
[Diagram: the download client sends a download request. DL Map tasks read results from the shards, and DL Reduce tasks write the results to the targets (Local, HDFS, FTP, SFTP, HTTP); some DL Map tasks write results directly. Together the reducers perform a distributed global sort.]
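
One common way to get a globally sorted result from parallel reducers is range partitioning: route each record to a reducer by key range, then sort within each reducer. The slides do not state that Terapot does it this way, so the sketch below is an assumption, as are the `partition` and `download` names and the date-string keys.

```python
import bisect


def partition(key, boundaries):
    """Range partitioner: index of the reducer responsible for key."""
    return bisect.bisect_right(boundaries, key)


def download(shard_results, boundaries):
    """Return one sorted output part per reducer."""
    n_reducers = len(boundaries) + 1
    buckets = [[] for _ in range(n_reducers)]
    # Map side: each shard's results are routed by key range.
    for results in shard_results:
        for key, email in results:
            buckets[partition(key, boundaries)].append((key, email))
    # Reduce side: each reducer sorts its range and writes one output part.
    return [sorted(bucket) for bucket in buckets]


shard_results = [
    [("2009-03-02", "c"), ("2009-01-15", "a")],
    [("2009-02-20", "b"), ("2009-04-01", "d")],
]
parts = download(shard_results, boundaries=["2009-02-01", "2009-03-01"])
merged = [kv for part in parts for kv in part]
```

Since every key in part i is smaller than every key in part i+1, concatenating the parts in order yields a fully sorted result without a final single-node merge.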

#20

Email Data Analysis

  Analysis Process

  ETL (Extract-Transform-Load) of email archiving data into Hive table format

  Analyzing data using Hive with various analysis algorithms

  Generating the analysis result report

[Diagram: Terapot Mining runs ETL MR tasks that load the archiving data and write the results into Hive. Terapot Mining then executes HiveQL on Hive, generates the report, and writes the result to MySQL.]
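
The ETL step can be pictured as flattening each archived message into rows that a tab-delimited table could load. The row layout (key, sender, recipient, kind), the `etl_rows` and `to_tsv` helpers, and the sample message are assumptions for illustration; the slides do not give the actual Hive schema.

```python
from email import message_from_string
from email.utils import getaddresses


def etl_rows(key, raw):
    """Flatten one raw message into (key, sender, recipient, kind) rows."""
    msg = message_from_string(raw)
    senders = getaddresses(msg.get_all("From", []))
    sender = senders[0][1].lower() if senders else ""
    rows = []
    for kind in ("To", "Cc"):
        for _, addr in getaddresses(msg.get_all(kind, [])):
            rows.append((key, sender, addr.lower(), kind.upper()))
    return rows


def to_tsv(rows):
    """Serialize rows as tab-separated lines."""
    return "\n".join("\t".join(r) for r in rows)


raw = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.com, Carol <carol@example.org>\n"
    "Cc: dave@example.com\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
rows = etl_rows("m1", raw)
```

One row per sender-recipient link keeps the later analysis queries simple, since counting or joining on links needs no further parsing of headers.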

#21

Types of Analysis

  Social Network Analysis

  Personal Network Analysis

  Computing distance between recipients or senders based on TO, CC, FROM links

  Analyzing the statistics of mail frequency

  Domain Analysis

  Computing distance between recipients' domains based on TO, CC, FROM

  Keyword Analysis (in progress)

  Keyword frequency for each user
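
The analyses listed above can be sketched on a handful of sender-recipient links. The slides do not define "distance", so the sketch assumes it means the hop count of the shortest path in an undirected graph; the sample addresses and the `build_graph`, `distance`, and `domain` helpers are hypothetical.

```python
from collections import Counter, defaultdict, deque

# (sender, recipient) pairs taken from FROM -> TO/CC links; sample data only.
links = [
    ("alice@a.com", "bob@a.com"),
    ("alice@a.com", "bob@a.com"),
    ("bob@a.com", "carol@b.org"),
    ("carol@b.org", "dave@c.net"),
]


def build_graph(pairs):
    """Adjacency sets, treating every link as undirected."""
    graph = defaultdict(set)
    for src, dst in pairs:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph


def distance(graph, start, goal):
    """Hop count of the shortest path (BFS), or None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, hops + 1))
    return None


def domain(addr):
    return addr.rsplit("@", 1)[-1]


people = build_graph(links)                                       # personal network
domains = build_graph((domain(s), domain(d)) for s, d in links)   # domain network
frequency = Counter(links)                                        # mail frequency per link
```

The same graph code serves both the personal and the domain analysis; only the node identity changes from full address to domain.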

#22

Terapot Performance

  Experimental Environment

  11 Intel Servers: 1 Master + 10 Slaves

  Xeon 2.0 GHz 2 CPU, 16 GB Memory, 4 TB Disk

  The number of emails: 270 million (Index size: 270 GB)

  Results
Indexing in local disks

Number of Emails    Number of Results    Response Time (sec)
 67,217,298          12,547,398          1.4
134,434,596          25,094,796          1.4
201,651,894          37,642,194          1.4
268,869,192          50,189,592          1.4

Indexing in HDFS

Number of Emails    Number of Results    Response Time (sec)
 67,217,298          12,547,398          2.8
134,434,596          25,094,796          2.8
201,651,894          37,642,194          3.2
268,869,192          50,189,592          3.2
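
Two derived figures help when reading these results: the share of emails matched by the query, and how much slower the HDFS-resident index is than the local one. The short script below computes both from the numbers on the slide; the variable names are illustrative.

```python
# Rows from the two result tables: (emails, results, local_sec, hdfs_sec)
runs = [
    (67_217_298, 12_547_398, 1.4, 2.8),
    (134_434_596, 25_094_796, 1.4, 2.8),
    (201_651_894, 37_642_194, 1.4, 3.2),
    (268_869_192, 50_189_592, 1.4, 3.2),
]

for emails, results, local_sec, hdfs_sec in runs:
    hit_ratio = results / emails     # share of emails matched by the query
    slowdown = hdfs_sec / local_sec  # HDFS index relative to local index
    print(f"{emails:>11,}  hit ratio {hit_ratio:.1%}  HDFS/local {slowdown:.2f}x")

hit_ratios = [results / emails for emails, results, _, _ in runs]
slowdowns = [hdfs / local for _, _, local, hdfs in runs]
```

The hit ratio is the same in every row (about 18.7%), because each row is an exact multiple of the first. Response time with local indexes stays at 1.4 s while the data grows fourfold, and the HDFS-resident index is between 2.0 and roughly 2.3 times slower.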

#24

www.nexr.co.kr

Hadoop & Cloud Computing Company
