[Hadoop] NexR Terapot: Massive Email Archiving



Terapot: Massive Email Archiving with Hadoop & Friends
- Commercial Hadoop Application

Jason Han
Founder & CEO, NexR

Published in: Technology


  1. Terapot: Massive Email Archiving with Hadoop & Friends - Commercial Hadoop Application. Jason Han, Founder & CEO, NexR. Next Revolution, Toward Open Platform.
  2. NexR: Introduction
     - Offering Hadoop & Cloud Computing Platform and Services
     - Hadoop Provisioning & Management
     - Hadoop & Cloud Computing Services
     - Academic Support
     - Massive Email Archiving
     - MapReduce Workflow Program
     - Massive Data Storage & Processing Platform
     - Cloud Computing Platform (compatible with Amazon AWS): icube-cc (Compute), icube-sc (Storage)
  3. Email Archiving: Objectives
     - Regulatory compliance
     - e-Discovery: litigation and legal discovery
     - E-mail backup and disaster recovery
     - Messaging system & storage optimization
     - Monitoring of internal and external e-mail content
  4. Email Archiving: Architecture
     [Diagram labels: Email Servers; Crawling; Journaling; Email Archiving Servers (HA); Search & Discovery; Metadata DB; Indexes; Storage Network; Archival Storage (DAS, SAN, NAS, Tape Library); Email Aging]
  5. Email Archiving: Challenges
     - Explosive growth of digital data
       - 6 times more data in 2010 (988 XB) than in 2006
       - 95% (939 XB) is unstructured data, including email
       - Increasing cost and complexity of archiving
       - Result: requires scalable, low-cost archiving
     - Reinforcement of data retention regulation
       - Retention, disposal, e-Discovery, security
       - HIPAA (healthcare) 21-23 yrs, SEC17 (trading) 6 yrs, OSHA (toxic) 30 yrs, SOX (finance) 5 yrs, J-SOX, K-SOX
       - Result: requires scalable archiving & fast discovery
     - Need for intelligent data management
       - Knowledge management from email data
       - Filtering, monitoring, data mining, etc.
       - Result: requires integration with intelligent systems
  6. Email Archiving: Regulatory Compliance
  7. Email Archiving: Problems
     - Centralized search is slow & not scalable
     - Storage is expensive & not scalable
     - Discovery from tape is slow
     [Diagram labels: Email Servers; Crawling; Journaling; Email Archiving Servers (HA); Search & Discovery; Metadata DB; Indexes; Storage Network; Archival Storage (DAS, SAN, NAS, Tape Library); Email Aging]
  8. Terapot: When Hadoop Met Email Archiving…
     - Scale-out architecture with Hadoop
       - Hadoop HDFS for archiving email data
       - Hadoop MapReduce for crawling & indexing
       - Apache Lucene for search & discovery
     [Diagram labels: Email Servers; Email Archiving Servers (HA); Distributed Crawling; Journaling; Journaling Server; Hadoop MapReduce (Crawling, Indexing, etc.); Metadata DB; Hadoop HDFS (Archiving); Distributed Search & Discovery]
  9. Terapot: Overview
     - Design Principles
       - Shared-nothing architecture -> unlimited scalability
       - Inexpensive hardware -> low cost
       - Using open source software -> fast development
       - Exploiting parallelism -> high performance
       - Integrating with analysis -> high intelligence
     - Features
       - Distributed massive email archiving
       - High scalability
         - Thousands of servers, billions of emails
       - High performance
         - Fast search, under 1-2 seconds for each user account
         - Fast discovery in parallel with MapReduce
       - High intelligence
         - Email data mining, such as social network analysis
       - Supports both an on-premise version and a cloud (hosted) version
       - Developed with various open source software
  10. Terapot: Open Source Software Stack
      [Diagram labels, frontend to backend: Frontend Layer (Apache Tomcat, Apache JAMES); Crawling, Indexing, Searching, Email Mining, Downloading; Zookeeper, Apache Lucene, Hive, MySQL; Hadoop MapReduce; Hadoop HDFS; Backend Layer]
  11. Terapot: Architecture
      [Diagram labels:
      - Terapot Clients: HTTP/SOAP, REST, JSON
      - Email Sources: POP3 Mail Server, NAS/NFS Server, FTP/SFTP Server
      - Terapot Frontend: Search Gateway, MailServer, MR Workflow Manager, Analyzer
      - Batch processing: Crawling, Indexing, Merging
      - Analysis: ETL, Mining
      - Searching
      - Real-Time Indexing
      - Hadoop MapReduce, Lucene, & Hive
      - Storage: HDFS (email, index), Local (index)]
  12. Terapot Data Archiving Flow
      - Real-Time Indexing Layer
        1. Send email
        2. Deliver email
        3. Push email
        4. Save email & build index files in runtime
        5. Forward email
        6. Receive email
      - Batch Processing Layer
        1. Fetch emails in parallel
        2. Save emails
        3. Build index files
      - Search Layer
        1. Search emails
      [Diagram labels: Internet; SMTP Server; HTTP/NAS/NFS Server; FTP/SFTP Server; Crawler (MR); Indexing (MR); Real-Time Shard; Shards, each with an Index; HDFS (emails, Index)]
  13. Terapot Data Analysis Flow
      - Report Retrieval Layer
        1. View report for archiving data
      - Data Analysis Layer
        1. Send HiveQL to analysis data
        2. Generate report in MySQL
      - ETL Layer
        1. Fetch emails in parallel
        2. Store large data
      [Diagram labels: NexR Terapot Front; Terapot Mining Engine; Terapot Archiving Storage; Transform (MR); Analysis data; MySQL; HDFS]
  14. Technical Features
      - Distributed Archiving
        - Hadoop HDFS for storing email data
        - Compression and deduplication for storage space efficiency
      - Distributed Crawling & Indexing
        - Implemented with Hadoop MapReduce
        - Supports both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc.)
        - Supports batch indexing & merging by MapReduce, and real-time indexing for instant archiving
      - Distributed Search
        - Shards a search job and executes it in parallel
        - An email is searchable as soon as it is received (due to real-time indexing)
      - Parallel Download
        - Downloads full search results in parallel by MapReduce
        - Supports various download protocols (Local FS, HDFS, FTP, SFTP, HTTP, etc.)
      - Standard Client Interface
        - Supports REST/SOAP and JSON interfaces
      - Management
        - Configurable MapReduce job scheduling (crawling, indexing, merging, etc.)
  15. Crawling
      - Store massive email data in HDFS through MapReduce
        - The Hadoop utility (dfs -put) just copies data sequentially
        - Each Crawling MR takes & stores a range of data in parallel
      [Diagram labels: Crawling Client; Data Location Information (INPUT); Splitting; Crawling MR (x3); {key,email}*; HDFS]
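The range-based crawling on the slide above can be sketched roughly as follows. This is a minimal illustration, not Terapot's code: a thread pool stands in for the Crawling MR tasks, a plain dict stands in for HDFS, and the names `split_ranges`, `crawl_range`, and `crawl_in_parallel` as well as the `email-00000000` key format are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def split_ranges(total: int, num_tasks: int) -> List[Tuple[int, int]]:
    """Split [0, total) into at most num_tasks contiguous, near-equal ranges."""
    if total < 0 or num_tasks <= 0:
        raise ValueError("total must be >= 0 and num_tasks must be > 0")
    base, extra = divmod(total, num_tasks)
    ranges = []
    start = 0
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer items than tasks: stop creating empty ranges
        ranges.append((start, start + size))
        start += size
    return ranges


def crawl_range(source: List[str], span: Tuple[int, int]) -> List[Tuple[str, str]]:
    """One crawling task: read its range and emit {key, email} pairs."""
    start, end = span
    return [(f"email-{i:08d}", source[i]) for i in range(start, end)]


def crawl_in_parallel(source: List[str], num_tasks: int) -> Dict[str, str]:
    """Run one task per range and collect the pairs (dict stands in for HDFS)."""
    store: Dict[str, str] = {}
    spans = split_ranges(len(source), num_tasks)
    with ThreadPoolExecutor(max_workers=max(1, len(spans))) as pool:
        for pairs in pool.map(lambda span: crawl_range(source, span), spans):
            store.update(pairs)
    return store
```

Because the ranges do not overlap, tasks never write the same key, which is what lets them run independently instead of copying the data in one sequential pass.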
  16. Indexing
      - Indexing email data with MapReduce
        - Each Indexing MR takes a range of data and builds a Lucene index in parallel
      [Diagram labels: Indexing Client; Email Data in HDFS (INPUT); Splitting; Indexing MR (x3); {key,index}*]
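A rough sketch of per-range index building and a later merge step, under stated assumptions: a small in-memory inverted index stands in for a Lucene index, a thread pool stands in for the Indexing MR tasks, and `build_index`, `index_in_parallel`, and `merge_indexes` are hypothetical names. The merge function reflects the "batch indexing & merging" feature listed on slide 14, not a documented Terapot algorithm.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Index = Dict[str, List[str]]  # term -> sorted email keys


def build_index(emails: Dict[str, str]) -> Index:
    """One indexing task: build an inverted index for its range of emails."""
    postings = defaultdict(set)
    for key, body in emails.items():
        for term in re.findall(r"[a-z0-9]+", body.lower()):
            postings[term].add(key)
    return {term: sorted(keys) for term, keys in postings.items()}


def index_in_parallel(ranges: List[Dict[str, str]]) -> List[Index]:
    """Build one index per range, with the ranges processed concurrently."""
    if not ranges:
        return []
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        return list(pool.map(build_index, ranges))


def merge_indexes(indexes: List[Index]) -> Index:
    """Merge several per-range indexes into a single index."""
    merged = defaultdict(set)
    for index in indexes:
        for term, keys in index.items():
            merged[term].update(keys)
    return {term: sorted(keys) for term, keys in merged.items()}
```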
  17. Real-Time Indexing
      - Indexing email data in runtime
        - Indexing in memory when a new email arrives
        - Flushing the RT-Shard periodically into HDFS
      [Diagram labels: Mail; JAMES; Forwarding Mailet Component; RT Shards (emails, Local Index); Periodic Real-Time Shard flushing into HDFS; Email Data in HDFS]
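The in-memory indexing and periodic flush described above might look like the sketch below. Everything here is an assumption for illustration: the `RealTimeShard` class, its `sink` callback (a stand-in for writing to HDFS), and the choice to check the flush interval on each arrival rather than from a background timer.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Set


class RealTimeShard:
    """Indexes emails in memory on arrival and flushes them periodically."""

    def __init__(
        self,
        flush_interval: float,
        sink: Callable[[Dict[str, str]], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush_interval = flush_interval
        self._sink = sink      # hypothetical stand-in for an HDFS writer
        self._clock = clock    # injectable so the timing can be tested
        self._emails: Dict[str, str] = {}
        self._index: Dict[str, Set[str]] = defaultdict(set)
        self._last_flush = clock()

    def add(self, email_id: str, body: str) -> None:
        """Index a newly arrived email; it is searchable right after this call."""
        self._emails[email_id] = body
        for term in body.lower().split():
            self._index[term].add(email_id)
        if self._clock() - self._last_flush >= self._flush_interval:
            self.flush()

    def search(self, term: str) -> List[str]:
        """Return ids of buffered emails containing the term."""
        return sorted(self._index.get(term.lower(), ()))

    def flush(self) -> None:
        """Hand buffered emails to the sink and reset the in-memory state."""
        if self._emails:
            self._sink(dict(self._emails))
        self._emails.clear()
        self._index.clear()
        self._last_flush = self._clock()
```

In this sketch, flushed emails are no longer visible to the shard's own `search`; a real system would presumably serve them from the batch-built shards instead.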
  18. Searching
      - Distributed search
        - Indexes are split & stored on local disks
        - Each shard is responsible for searching a range of the index
      [Diagram labels: Searching Client; Search; Shards (Local Index); RT Shard; Read email from HDFS; Zookeeper (Notification; Update shard state & index information)]
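A minimal scatter-gather sketch of the sharded search above: the query goes to every shard in parallel and the partial results are merged. The `Shard` class and `distributed_search` function are hypothetical, the per-shard index is a plain dict rather than Lucene, and shard discovery through Zookeeper is left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set


class Shard:
    """Holds the index for one range of emails and searches only that range."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self._index: Dict[str, Set[str]] = {}
        for doc_id, text in docs.items():
            for term in set(text.lower().split()):
                self._index.setdefault(term, set()).add(doc_id)

    def search(self, term: str) -> List[str]:
        return sorted(self._index.get(term.lower(), set()))


def distributed_search(shards: List[Shard], term: str) -> List[str]:
    """Send the query to every shard in parallel and merge the partial hits."""
    if not shards:
        return []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: shard.search(term), shards)
        merged = [doc_id for partial in partials for doc_id in partial]
    return sorted(merged)
```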
  19. Parallel Downloading
      - Downloading massive search results in parallel
        - Supports various types of communication for downloading
        - The Downloading MR sorts search results globally & pushes them to the targets
      [Diagram labels: Download Client; Download Request; Shards; DL Map; DL Reduce; write result / write result directly; targets: Local, HDFS, FTP, SFTP, HTTP; HDFS; Distributed Global Sort]
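The slide does not say how the global sort is implemented. One common way to get a globally sorted result from parallel reducers is range partitioning: pick split points, route each record to the partition covering its key, and sort each partition independently. The sketch below assumes that technique; `choose_split_points` and `global_sort` are hypothetical names, and the split points are taken from the full record list rather than from a sample.

```python
import bisect
from typing import List


def choose_split_points(sample: List[str], num_reducers: int) -> List[str]:
    """Pick num_reducers - 1 keys that divide the sample into even ranges."""
    ordered = sorted(sample)
    if num_reducers <= 1 or not ordered:
        return []
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def global_sort(records: List[str], num_reducers: int) -> List[List[str]]:
    """Range-partition records, then sort each partition on its own.

    Every key in partition i is smaller than every key in partition i + 1,
    so concatenating the partitions in order gives a globally sorted list.
    """
    splits = choose_split_points(records, num_reducers)
    partitions: List[List[str]] = [[] for _ in range(len(splits) + 1)]
    for record in records:
        partitions[bisect.bisect_right(splits, record)].append(record)
    return [sorted(partition) for partition in partitions]
```

Each sorted partition could then be pushed to its download target independently, since no cross-partition merge is needed.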
  20. Email Data Analysis
      - Analysis process
        - ETL (Extract-Transform-Load) of email archiving data into Hive table format
        - Analyzing the data using Hive with various analysis algorithms
        - Generating the analysis result report
      [Diagram labels: Terapot Mining; execute HiveQL; Generate Report; HIVE; MySQL; Terapot Archiving Data; Load; ETL MR (x4); write result]
  21. Types of Analysis
      - Social Network Analysis
        - Personal Network Analysis
          - Computing distance between recipients or senders based on TO, CC, FROM links
          - Analyzing the statistics of mail frequency
        - Domain Analysis
          - Computing distance between recipients' domains based on TO, CC, FROM
      - Keyword Analysis (in progress)
        - Keyword frequency for each user
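The slide does not define "distance". As one plausible reading, the sketch below treats each FROM-to-TO/CC link as an undirected edge and measures distance as the number of hops between two addresses, alongside a per-pair mail frequency count. The email record layout (`from`, `to`, `cc` keys) and the function names are assumptions for the example.

```python
from collections import Counter, defaultdict, deque
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(emails: List[dict]) -> Tuple[Graph, Counter]:
    """Link each sender to its TO and CC recipients and count mails per pair."""
    graph: Graph = defaultdict(set)
    frequency: Counter = Counter()
    for email in emails:
        sender = email["from"]
        for recipient in email.get("to", []) + email.get("cc", []):
            if recipient == sender:
                continue  # ignore self-addressed copies
            graph[sender].add(recipient)
            graph[recipient].add(sender)
            frequency[(sender, recipient)] += 1
    return graph, frequency


def distance(graph: Graph, source: str, target: str) -> Optional[int]:
    """Hop count between two addresses (breadth-first search), or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None
```

Domain analysis could reuse the same functions by replacing each address with the part after `@` before building the graph.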
  22. Terapot Performance
      - Experimental Environment
        - 11 Intel servers: 1 master + 10 slaves
        - Xeon 2.0 GHz, 2 CPUs, 16 GB memory, 4 TB disk
        - Number of emails: 270 million (index size: 270 GB)
      - Results: indexing in local disks

        | Number of Emails | Number of Results | Response Time (sec) |
        |------------------|-------------------|---------------------|
        | 67,217,298       | 12,547,398        | 1.4                 |
        | 134,434,596      | 25,094,796        | 1.4                 |
        | 201,651,894      | 37,642,194        | 1.4                 |
        | 268,869,192      | 50,189,592        | 1.4                 |

      - Results: indexing in HDFS

        | Number of Emails | Number of Results | Response Time (sec) |
        |------------------|-------------------|---------------------|
        | 67,217,298       | 12,547,398        | 2.8                 |
        | 134,434,596      | 25,094,796        | 2.8                 |
        | 201,651,894      | 37,642,194        | 3.2                 |
        | 268,869,192      | 50,189,592        | 3.2                 |
  23. Demonstration
  24. Hadoop & Cloud Computing Company