HW09: Terapot Email Archiving with Hadoop


  1. Next Revolution Toward Open Platform
     Terapot: Massive Email Archiving with Hadoop & Friends (a commercial Hadoop application)
     Jaesun Han, Founder & CEO of NexR (jshan@nexrcorp.com)
  2. About NexR: offering a Hadoop & cloud computing platform and services.
     - Hadoop & cloud computing services: Hadoop provisioning & management, academic support
     - Massive data storage & processing platform: massive email archiving, MapReduce workflow programs
     - Cloud computing platform compatible with Amazon AWS: icube-cc (compute) and icube-sc (storage)
  3. What is Email Archiving? The objectives of email archiving:
     - Regulatory compliance
     - e-Discovery: litigation and legal discovery
     - Email backup and disaster recovery
     - Messaging system & storage optimization
     - Monitoring of internal and external email content
  4. The Architecture of Email Archiving (diagram): three stages between the email servers and the users.
     - Data acquisition: journaling and mailbox crawling from the email servers
     - Data processing: indexing and filtering on the email archiving server; indexes and email data are kept in archival storage
     - Data access: search and discovery for employees, auditors, and administrators
  5. The Challenges of Email Archiving
     - Explosive growth of digital data: sixfold growth from 2006 to 2010 (988 exabytes), 95% of it (939 EB) unstructured data including email, increasing the cost and complexity of archiving. This requires scalable, low-cost archiving.
     - Reinforcement of data retention regulation: retention, disposal, e-discovery, security. Examples: HIPAA (healthcare) 21-23 years, SEC 17a (trading) 6 years, OSHA (toxic exposure) 30 years, SOX (finance) 5 years, plus J-SOX and K-SOX. This requires scalable archiving and fast discovery.
     - Need for intelligent data management: knowledge management from email data; filtering, monitoring, data mining, etc. This requires integration with an intelligent system.
  6. New Requirements of Email Archiving: high scalability, low cost, high performance, and intelligence.
  7. Terapot: When Hadoop Met Email Archiving. A scale-out architecture with Hadoop:
     - Hadoop HDFS for archiving email data
     - Hadoop MapReduce for crawling & indexing
     - Apache Lucene for search & discovery
     Mail flows from the email servers, via journaling and distributed crawling, into HDFS; MapReduce jobs run crawling and indexing; distributed search & discovery sit on top.
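The deck shows no code, but a minimal sketch of the HDFS archiving idea might look like the following, using the Hadoop 0.20-era SequenceFile API. The path layout, key scheme, and class name are illustrative assumptions, not Terapot's actual format:

```java
// Hypothetical sketch: append raw MIME messages to a per-user SequenceFile
// in HDFS, keyed by Message-ID. Paths and naming are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class EmailArchiveWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // One archive file per user, as slide 10 describes.
        Path archive = new Path("/terapot/archive/user0001.seq");
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, archive, Text.class, BytesWritable.class);
        try {
            byte[] rawMime = "From: a@example.com\r\n\r\nhello".getBytes("UTF-8");
            writer.append(new Text("<msg-0001@example.com>"), new BytesWritable(rawMime));
        } finally {
            writer.close();
        }
    }
}
```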
  8. Features of Terapot
     - Distributed massive email archiving
     - High scalability through a shared-nothing architecture: thousands of servers, billions of emails
     - Low cost through inexpensive hardware: entry-level servers under $5,000
     - High performance through parallelism: search in under 1-2 seconds per user account, fast discovery in parallel with MapReduce
     - Intelligence through data mining: contact network analysis, content analysis, statistics
     - Both an on-premise version and a cloud (hosted) version
     - Developed with various open source software
  9. The Architecture of Terapot (diagram).
     - Email sources: mail servers (POP3), NAS/NFS, FTP/SFTP servers
     - Terapot clients talk to the Terapot frontend over HTTP/REST/SOAP with JSON
     - Frontend: MR workflow manager, mail server, search gateway, analyzer
     - Four key components: batch processing (crawling, indexing, merging), real-time indexing, search & discovery, and analysis (ETL, mining)
     - Built on Hadoop MapReduce, Lucene, and Hive; email is stored in HDFS, indexes on local file systems
  10. Batch Processing Component.
      - Crawling (MapReduce) pulls mail from the email sources into HDFS; archiving policies produce one archive file (a sequence file) per user, or several archive files per user per configured period.
      - Indexing (MapReduce) builds a temporary index file (a Lucene index) per user on the local file system.
      - Merging combines the per-user indexes into search index shards (shard 0, shard 1, ...), keeps a merged index file in HDFS for backup, and replicates each shard 3 ways.
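To make the indexing step concrete, here is a minimal sketch of a reducer that receives all archived messages for one user and writes a temporary Lucene index on local disk. It assumes the old-style Hadoop mapred API and Lucene 2.9-era classes; the field names and output directory are hypothetical, and the MIME handling is deliberately simplified:

```java
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Hypothetical reducer: one call per user, producing one temporary index.
public class IndexReducer extends MapReduceBase
        implements Reducer<Text, BytesWritable, Text, NullWritable> {

    public void reduce(Text user, Iterator<BytesWritable> emails,
                       OutputCollector<Text, NullWritable> out, Reporter reporter)
            throws IOException {
        File indexDir = new File("index-" + user.toString()); // task-local scratch dir
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_29), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        while (emails.hasNext()) {
            BytesWritable raw = emails.next();
            Document doc = new Document();
            doc.add(new Field("user", user.toString(),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            // Simplified: index the raw message text; a real system would parse MIME.
            doc.add(new Field("body",
                    new String(raw.getBytes(), 0, raw.getLength(), "UTF-8"),
                    Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.optimize(); // leave a compact index for the merging step
        writer.close();
        out.collect(user, NullWritable.get());
    }
}
```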
  11. Real-Time Indexing Component. The journaling server forwards incoming mail to the real-time indexing node, which archives each message and indexes it in memory immediately; the real-time index is periodically flushed to HDFS, where the batch processing component takes over both the archive and the index.
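One plausible way to buffer and flush such a real-time index with Lucene is sketched below, assuming a RAMDirectory buffer merged into a persistent on-disk index that can then be copied into HDFS. The slide does not say how Terapot implements this, so the whole shape is an assumption:

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class RealTimeIndexer {
    private RAMDirectory buffer;
    private IndexWriter bufferWriter;

    public RealTimeIndexer() throws Exception {
        openBuffer();
    }

    private void openBuffer() throws Exception {
        buffer = new RAMDirectory();
        bufferWriter = new IndexWriter(buffer,
                new StandardAnalyzer(Version.LUCENE_29), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // Called for every message the journaling server forwards.
    public synchronized void index(String user, String body) throws Exception {
        Document doc = new Document();
        doc.add(new Field("user", user, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        bufferWriter.addDocument(doc);
    }

    // Called periodically: merge the in-memory segments into the persistent
    // index (assumed to exist already), then start a fresh buffer.
    public synchronized void flush(File persistentIndex) throws Exception {
        bufferWriter.close();
        IndexWriter disk = new IndexWriter(FSDirectory.open(persistentIndex),
                new StandardAnalyzer(Version.LUCENE_29), false,
                IndexWriter.MaxFieldLength.UNLIMITED);
        disk.addIndexesNoOptimize(new Directory[] { buffer });
        disk.close();
        openBuffer();
    }
}
```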
  12. Search & Discovery Component. The search gateway locates index shards and assigns them to search nodes for distributed search; search nodes copy their assigned index shards from HDFS to the local file system; real-time indexing nodes answer queries over the freshest mail; ZooKeeper tracks shard status updates.
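A sketch of how search nodes and the gateway could coordinate through ZooKeeper, assuming ephemeral znodes under a /terapot/shards tree. The znode layout is my assumption; the deck only says ZooKeeper tracks shard status:

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ShardRegistry implements Watcher {
    private final ZooKeeper zk;

    public ShardRegistry(String quorum) throws Exception {
        zk = new ZooKeeper(quorum, 30000, this);
    }

    // Search node: advertise a shard it has copied to local disk. The ephemeral
    // znode disappears if the node dies, so stale shard locations vanish too.
    // Parent znodes (/terapot/shards/<shardId>) are assumed to exist.
    public void advertise(String shardId, String hostPort) throws Exception {
        zk.create("/terapot/shards/" + shardId + "/" + hostPort,
                new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Search gateway: find the nodes currently serving a shard, with a watch
    // so membership changes trigger process() below.
    public List<String> locate(String shardId) throws Exception {
        return zk.getChildren("/terapot/shards/" + shardId, true);
    }

    public void process(WatchedEvent event) {
        // React to nodes joining/leaving, e.g. refresh the gateway's routing table.
    }
}
```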
  13. Data Analysis Component: personal contact network analysis and domain statistics. An ETL step (MapReduce) loads the email archive files from HDFS into Hive tables; the mining engine issues Hive queries, which run as MapReduce jobs, and writes the analysis results to a database; a web reporter generates reports from them.
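For flavor, here is a domain-statistics query of the kind the analyzer might run, issued through the early Hive JDBC driver. The emails table and its columns are hypothetical, as is the connection endpoint:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DomainStats {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn =
                DriverManager.getConnection("jdbc:hive://localhost:10000/default");
        Statement stmt = conn.createStatement();
        // Hive compiles this GROUP BY into MapReduce jobs over the ETL'd tables.
        ResultSet rs = stmt.executeQuery(
                "SELECT sender_domain, COUNT(1) AS cnt " +
                "FROM emails " +
                "GROUP BY sender_domain " +
                "ORDER BY cnt DESC LIMIT 20");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}
```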
  14. Installation & Quantitative Analysis.
      - Cluster: 2 HA master nodes and 10 worker nodes (each worker runs a datanode, tasktracker, searcher, etc.)
      - Per-node hardware: CPU 2x Intel Xeon Nehalem E5504 2.0 GHz (8 cores); memory 9x 2 GB DDR3 PC3-10600 registered DIMM (18 GB); disk 4x 1 TB 7200 RPM SATA2 HDD (4 TB)
      - Assumptions: 1,000 employees, 16 emails per person per day, 215 KB average email size (142 KB content + 73 KB attachment), about 1.25 GB per employee per year
      - Storage: index size about 80% of the email size; compression ratio about 50%
      - Disk volume required for 1 year: email archive (HDFS) 1,881 GB; indexes (HDFS + local) 4,559 GB; total about 6.4 TB per year. 40 TB may cover 6 years of archiving.
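The slide's archive figure checks out if one assumes 3x HDFS replication on top of the quoted ~50% compression; the replication factor is my assumption, not stated on the slide. A back-of-the-envelope check:

```java
public class SizingCheck {
    public static void main(String[] args) {
        double avgEmailKB = 215;                 // 142 KB content + 73 KB attachment
        double perUserGB = 16 * avgEmailKB * 365 / (1024 * 1024);
        // ~1.20 GB; the slide rounds this to 1.25 GB per employee per year.
        System.out.printf("per employee per year: %.2f GB%n", perUserGB);

        double rawGB = 1000 * perUserGB;         // ~1,200 GB for 1,000 employees
        double archiveGB = rawGB * 0.5 * 3;      // ~50% compression, then 3x replication
        // ~1,796 GB; starting from the slide's rounded 1.25 GB/employee gives
        // 1,875 GB, close to the 1,881 GB quoted on the slide.
        System.out.printf("HDFS archive, 1 year: %.0f GB%n", archiveGB);
    }
}
```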
  15. Demonstration
  16. For more information: www.nexrcorp.com, www.terapot.com, jshan@nexrcorp.com, @jaesun_han. NexR, the Hadoop & Cloud Computing Company.
