Terapot: Massive Email Archiving with
          Hadoop & Friends
       - Commercial Hadoop Application



                    Jason Han
               Founder & CEO, NexR
                 jshan@nexr.co.kr




             Next Revolution, Toward Open Platform
#2

NexR: Introduction

   Offering Hadoop & Cloud Computing Platform and Services

      Hadoop Provisioning & Management                 Hadoop & Cloud Computing Services



                                                                                            Academic Support
                                         Massive Email Archiving   MapReduce Workflow           Program




                                                  Massive Data Storage & Processing Platform


                                                                            Cloud Computing Platform
                                                                          (Compatible with Amazon AWS)

                                                    icube-cc (Compute)          icube-sc (Storage)
#3

Email Archiving: Objectives


     Regulatory compliance

     e-Discovery: Litigation and legal discovery

     E-mail backup and disaster recovery

     Messaging system & storage optimization

     Monitoring of internal and external e-mail content
#4

Email Archiving: Architecture


      Email
     Servers


                                                         Crawling
                                Journaling



  DB                                                   Email Archiving
 Server                                                 Servers (HA)
                                                                            Search &
                                                                            Discovery
             Metadata                        Indexes
                                 Storage
                                 Network
                                                              Archival Storage
                  Aging                                             Email

                          DAS                      SAN
                                 NAS
   Tape Library
#5

Email Archiving: Challenges

              Explosive growth of digital data
               -  Sixfold growth from 2006 to 2010 (to 988 EB)
               -  95% (939 EB) is unstructured data, including email
               -  Increasing the cost and complexity of archiving
                Requiring scalable & low cost archiving



              Reinforcement of data retention regulation
               -  Retention, Disposal, e-Discovery, Security
               -  HIPAA(Healthcare) 21 ~ 23 yrs, SEC17(Trading) 6 yrs,
                  OSHA(Toxic) 30 yrs, SOX(Finance) 5 yrs, J-SOX, K-SOX
                Requiring scalable archiving & fast discovery


              Needs for intelligent data management
               -  Knowledge management from email data
               -  Filtering, monitoring, data mining, etc
                Requiring integration with intelligent system
#6

Email Archiving: Regulatory Compliance
#7

Email Archiving: Problems


   [Diagram: the architecture from "Email Archiving: Architecture", annotated with its problems]

   -  Search & Discovery: centralized search is slow & not scalable
   -  Storage (DAS, NAS, SAN): storage is expensive & not scalable
   -  Tape Library (aging): discovery from tape is slow
#8

Terapot: When Hadoop Met Email Archiving…
                  Scale-out architecture with Hadoop
                   -  Hadoop HDFS for archiving email data
                   -  Hadoop MapReduce for crawling & indexing
                   -  Apache Lucene for search & discovery


      Email
     Servers                                                              Email Archiving
                                                                           Servers (HA)
                                        Distributed Crawling
            Journaling


                                                                  Hadoop MapReduce
                                                                  (Crawling, Indexing, etc)

     Metadata
   DB       Journaling                                               Hadoop HDFS
              Server                                                     (Archiving)
  Server




                                 Distributed Search & Discovery
#9

Terapot: Overview

   Design Principles
  
       Shared nothing architecture        Unlimited scalability
  
       Inexpensive hardware               Low cost
  
       Using open source software          Fast development
  
       Exploiting parallelism             High performance
  
       Integrating with analysis          High intelligence


   Features
  
   Distributed massive email archiving
  
   High scalability
          
   thousands of servers, billions of emails
  
   High Performance
          
   Fast search under 1-2 seconds for each user account
          
   Fast discovery in parallel with MapReduce
  
   High Intelligence
          
   Email data mining, such as social network analysis
  
   Support both on-premise version and cloud(hosted) version
  
   Development with various open source software
#10

Terapot: Open Source Software Stack

                                  Frontend Layer


          Apache Tomcat                                 Apache JAMES



                                  Crawling   Indexing   Searching   Email Mining
                      Downloading
          Zookeeper




                                         Apache Lucene                 Hive
  MySQL




                                         Hadoop MapReduce


                                             Hadoop HDFS


                                  Backend Layer
#11

Terapot: Architecture
      Terapot Clients                                  Email Sources
                                                      HTTP/
  SOAP      REST   JSON                POP3                          Mail          NAS/
                                                    FTP/SFTP
                                       Server                       Server         NFS
                                                     Server




                                    Terapot Frontend

 Search Gateway     MailServer              MR Workflow Manager                Analyzer




                                                Batch processing              Analysis
    Searching           Real-Time
                                         Crawling    Indexing   Merging      ETL   Mining
                         Indexing

                   Hadoop MapReduce, Lucene, & Hive




                                                                                               HDFS
                                                                                            (email, index)
                                                                                               Local
                                                                                               (index)
#12

Terapot Data Archiving Flow
                                               1. Send email

                                                                                     6. Receive email
                                                           Internet


                                                  2. Deliver email                                  HTTP/
                                                                                                                       NAS/
                                                                                                   FTP/SFTP
                                                                     5. Forward email                                  NFS
                                                                                                    Server


                                                                 SMTP
              1. Search emails                                   Server
                                                                                                  1. Fetch emails in parallel


                                                          3. Push email
                                                                                               Crawler               Indexing
                                                                                                (MR)                   (MR)
                                                               Real-Time
 Shard                Shard       Shard                          Shard
                                                                             Index         2. Save emails
     Index                Index        Index
                                                          4. Save email &                                3. Build index files
                                                          build index files in runtime


emails       emails      emails   emails                              emails                   emails                    Index
                                                       HDFS
                                                                                               emails                    Index

                 Search Layer                        Real-Time Indexing Layer                    Batch Processing Layer
#13

Terapot Data Analysis Flow


                                                    Terapot                              Terapot
                                                 Mining Engine                      Archiving Storage




  1. View Report for Archiving data                  1. Send HiveQL               1. Fetch emails in parallel
                                                        to analyze data


                                        2. Generate
                                                                                          Transform
    NexR Terapot Front                     Report in MySQL                                  (MR)



                                                                                      2. Store large data

                                                             
   
              
                                                    Analysis data                        Analysis data
                                MySQL                                      HDFS
                                                    Analysis data                        Analysis data


   Report Retrieval Layer                      Data Analysis Layer                       ETL Layer
#14

Technical Features

   Distributed Archiving
   
   Hadoop HDFS for storing email data
   
   Compression and deduplication for storage space efficiency

   Distributed Crawling & Indexing
   
   Implemented by Hadoop MapReduce
   
   Support both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc.)
   
   Support batch indexing & merging by MapReduce and real-time indexing for instant archiving

   Distributed Search
   
   Sharding a search job and executing it in parallel
   
   Searchable instantly on receiving an email (due to real-time indexing)

   Parallel Download
   
   Download full search results in parallel by MapReduce
   
   Support various download protocols (Local FS, HDFS, FTP, SFTP, HTTP, etc.)

   Standard Client Interface
   
   Support REST/SOAP and JSON interface

   Management
   
   Configurable MapReduce job scheduling (crawling, indexing, merging, etc)
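
The "compression and deduplication" item above can be illustrated with a small sketch. The slides do not say how Terapot implements either, so this assumes content hashing (SHA-256) for deduplication and zlib for compression; the class `DedupStore` and its in-memory dicts are hypothetical stand-ins for the HDFS-backed store.

```python
import hashlib
import zlib

class DedupStore:
    """Stores each distinct email body once, compressed, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}  # digest -> compressed bytes (assumed stand-in for HDFS)
        self.refs = {}   # email key -> digest

    def put(self, key, raw: bytes):
        digest = hashlib.sha256(raw).hexdigest()
        # Identical content is stored only once; later copies just add a reference.
        if digest not in self.blobs:
            self.blobs[digest] = zlib.compress(raw)
        self.refs[key] = digest

    def get(self, key) -> bytes:
        return zlib.decompress(self.blobs[self.refs[key]])
```

With this layout, the same message delivered to many mailboxes costs one compressed blob plus one small reference per mailbox.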
#15

Crawling

   Store Massive Email Data in HDFS through MapReduce
  
   Hadoop utility (dfs -put) just copies data sequentially
  
   Each Crawling MR takes & stores a range of data in parallel




                                                    {key,email}*
                                         Crawling
      Crawling               Data          MR
                           Location
       Client            Information                               HDFS
      Splitting                          Crawling
                                           MR



                                         Crawling
                                           MR
       INPUT
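
A minimal sketch of the splitting step on this slide, assuming the crawling client knows the list of source locations up front. The function names `split_ranges`, `crawl_range` and `crawl_all` are hypothetical; a thread pool stands in for the crawling MapReduce tasks and a dict stands in for HDFS.

```python
from concurrent.futures import ThreadPoolExecutor

def split_ranges(locations, num_tasks):
    """Split the list of data locations into at most num_tasks contiguous ranges."""
    size = max(1, -(-len(locations) // num_tasks))  # ceiling division
    return [locations[i:i + size] for i in range(0, len(locations), size)]

def crawl_range(locations, fetch):
    """One crawling task: fetch every email in its range and emit (key, email) pairs."""
    return [(loc, fetch(loc)) for loc in locations]

def crawl_all(locations, fetch, num_tasks=3):
    """Run one crawling task per range in parallel and collect the pairs."""
    store = {}  # assumed stand-in for HDFS
    ranges = split_ranges(locations, num_tasks)
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        for pairs in pool.map(lambda r: crawl_range(r, fetch), ranges):
            store.update(pairs)
    return store
```

Because each task owns a disjoint range, the tasks need no coordination while they run, which is the contrast with a sequential copy.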
#16

Indexing

   Indexing Email Data with MapReduce
  
   Each Indexing MR takes a range of data and builds a Lucene index
      in parallel




                                                {key,index}*
                                     Indexing
      Indexing        Email Data        MR
        Client                                                 HDFS
      Splitting                      Indexing
                                        MR



                                     Indexing
                                        MR
      INPUT
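
The per-range indexing idea can be sketched as follows. This is an illustration only: a plain Python dict (term to set of email keys) stands in for a Lucene index, the function names are hypothetical, and the ranges are processed in a simple loop where the real system would run one MapReduce task per range.

```python
import re
from collections import defaultdict

def build_index(emails):
    """One indexing task: build an inverted index (term -> set of email keys) for its range."""
    index = defaultdict(set)
    for key, text in emails:
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(key)
    return dict(index)

def index_in_ranges(emails, range_size):
    """Split the emails into ranges and build one independent index per range."""
    ranges = [emails[i:i + range_size] for i in range(0, len(emails), range_size)]
    # Each call below touches only its own range, so the calls could run in parallel.
    return [build_index(r) for r in ranges]
```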
#17

Real-Time Indexing

   Indexing Email Data in Runtime
  
   Indexing in memory when a new email arrives
  
   Flushing RT-Shard periodically into HDFS
                                                          Periodic
                                 Real-Time Shard          flushing
                                                        into HDFS
                                                                     emails

                                                   Local Index
                    Forwarding
      Mailet        Email Data
                                            RT                       emails
    Component                              Shard                     HDFS
      JAMES
                                            RT                       emails
                                           Shard



        Mail
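
A rough sketch of a real-time shard, under stated assumptions: the class `RealTimeShard` and its fields are hypothetical, whitespace tokenization stands in for Lucene analysis, and a Python list of batches stands in for HDFS. It shows the two behaviors named on the slide: an email is indexed in memory on arrival, and pending emails are flushed once the flush interval has passed.

```python
import time

class RealTimeShard:
    """Holds new emails in memory so they are searchable at once; flushes periodically."""

    def __init__(self, flush_interval_sec, store):
        self.flush_interval_sec = flush_interval_sec
        self.store = store    # assumed stand-in for HDFS: a list of flushed batches
        self.pending = {}     # key -> email text, not yet flushed
        self.index = {}       # term -> set of keys (in-memory index)
        self.last_flush = time.monotonic()

    def on_email(self, key, text, now=None):
        """Index the email immediately, then flush if the interval has elapsed."""
        self.pending[key] = text
        for term in text.lower().split():
            self.index.setdefault(term, set()).add(key)
        now = time.monotonic() if now is None else now
        if now - self.last_flush >= self.flush_interval_sec:
            self.flush(now)

    def flush(self, now):
        """Write all pending emails out as one batch and reset the timer."""
        if self.pending:
            self.store.append(dict(self.pending))
            self.pending.clear()
        self.last_flush = now

    def search(self, term):
        return self.index.get(term.lower(), set())
```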
#18

Searching

   Distributed Search
  
   Indexes are split & stored in local disks
  
   Each shard is responsible for searching a range of the index



                                           Local Index


                                                         Read email
                                           Shard
  Searching
    Client               Search
                                                                      HDFS

                                           Shard

  Notification
                  Update shard state        RT
                  & index information
  Zookeeper                                Shard
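
The search path on this slide follows a scatter-gather pattern, sketched below. The names `Shard` and `distributed_search` are hypothetical, each shard's local index is modeled as a dict from term to email keys, and the Zookeeper-based shard state tracking is left out.

```python
from concurrent.futures import ThreadPoolExecutor

class Shard:
    """Searches only its own slice of the index (term -> list of email keys)."""

    def __init__(self, local_index):
        self.local_index = local_index

    def search(self, term):
        return self.local_index.get(term, [])

def distributed_search(shards, term):
    """Send the same query to every shard in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda s: s.search(term), shards))
    return sorted(key for hits in partials for key in hits)
```

Response time is then bounded by the slowest shard rather than by the total index size, which fits the flat response times reported for local-disk indexes later in the deck.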
#19

 Parallel Downloading

   Downloading Massive Search Results in Parallel
   
   Support various types of communications for downloading
   
   Downloading MR sorts search results globally & pushes into targets
                                                            write result directly
                                      write result                                  Local
                                                              DL
                                                              Map
                                                                                    HDFS
                                                              DL            DL
                                      write result            Map         Reduce
                              Shard
Download   Download Request
 Client                                                       DL            DL      FTP
                                                              Map         Reduce
                                      write result
                              Shard                           DL            DL
                                                              Map         Reduce
                                                                                    SFTP

                                                              DL
                                      write result            Map                   HTTP
                              Shard

                                                     HDFS       Distributed
                                                                Global Sort
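
One common way to get a distributed global sort is range partitioning: the map side routes each key to the reducer that owns its key range, and each reducer sorts only its own range. The sketch below assumes that technique; the slides do not state which method Terapot uses, and `global_sort` and `boundaries` are hypothetical names.

```python
import bisect

def global_sort(partial_results, boundaries):
    """partial_results: one list of keys per shard.
    boundaries: sorted split points; reducer i gets the keys below boundaries[i]
    (and not below boundaries[i - 1]); the last reducer gets the rest."""
    reducers = [[] for _ in range(len(boundaries) + 1)]
    # Map side: route every key to the reducer that owns its key range.
    for results in partial_results:
        for key in results:
            reducers[bisect.bisect_right(boundaries, key)].append(key)
    # Reduce side: each reducer sorts only its own range.
    return [sorted(bucket) for bucket in reducers]
```

Concatenating the reducer outputs in order yields a fully sorted result, so each reducer can push its part to the download target independently.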
#20

 Email Data Analysis

   Analysis Process
   
   ETL(Extract-Transform-Load) email archiving data to Hive table format
   
   Analyzing data using Hive with various analysis algorithms
   
   Generating the analysis result report


   [Diagram: Terapot Mining loads the archiving data through parallel ETL MR jobs, each of which
    writes its result; Terapot Mining then executes HiveQL on Hive and generates the report in MySQL]
#21

Types of Analysis

   Social Network Analysis
  
   Personal Network Analysis
    
   Computing distance between recipients or senders based on TO, CC, FROM links
    
   Analyzing the statistics of mail frequency
  
   Domain Analysis
    
   Computing distance between recipients' domains based on TO, CC, FROM

   Keyword Analysis (in progress)
  
   Keyword frequency for each user
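
As an illustration of the personal and domain network analysis above, the sketch below counts FROM-to-recipient links over the TO and CC fields. The slides do not define the distance measure, so only the raw link counts are computed here; the dict-based email records and both function names are assumptions, and the real system runs this kind of aggregation in Hive.

```python
from collections import Counter

def link_counts(emails):
    """Count FROM -> recipient links over the TO and CC fields."""
    links = Counter()
    for mail in emails:
        for rcpt in mail.get("to", []) + mail.get("cc", []):
            links[(mail["from"], rcpt)] += 1
    return links

def domain_link_counts(emails):
    """Same links, aggregated by the domain part of each address."""
    links = Counter()
    for (sender, rcpt), n in link_counts(emails).items():
        links[(sender.split("@")[-1], rcpt.split("@")[-1])] += n
    return links
```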
#22

Terapot Performance

   Experimental Environment
  
   11 Intel Servers: 1 Master + 10 Slaves
     
   Xeon 2.0 GHz 2 CPU, 16 GB Memory, 4 TB Disk
  
   The number of emails: 270 million (Index size: 270 GB)

   Results
     Indexing in local disks
         Number of Emails       Number of Results      Response Time (sec)

                   67,217,298             12,547,398                     1.4

                  134,434,596             25,094,796                     1.4

                  201,651,894             37,642,194                     1.4

                  268,869,192             50,189,592                     1.4


     Indexing in HDFS
         Number of Emails       Number of Results      Response Time (sec)

                   67,217,298             12,547,398                     2.8

                  134,434,596             25,094,796                     2.8

                  201,651,894             37,642,194                     3.2

                  268,869,192             50,189,592                     3.2
#23

Demonstration
#24




      www.nexr.co.kr

Hadoop & Cloud Computing
       Company

More Related Content

What's hot

SQLBits X SQL Server 2012 Rich Unstructured Data
SQLBits X SQL Server 2012 Rich Unstructured DataSQLBits X SQL Server 2012 Rich Unstructured Data
SQLBits X SQL Server 2012 Rich Unstructured DataMichael Rys
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Sasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Cidr11 paper32
Cidr11 paper32Cidr11 paper32
Cidr11 paper32jujukoko
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningJoão Gabriel Lima
 
2012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum22012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum2Wilfried Hoge
 
Liquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANALiquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANASAP Technology
 
Implementation of nosql for robotics
Implementation of nosql for roboticsImplementation of nosql for robotics
Implementation of nosql for roboticsJoão Gabriel Lima
 
The 25 Most Promising Open Source Projects
The 25 Most Promising Open Source ProjectsThe 25 Most Promising Open Source Projects
The 25 Most Promising Open Source Projectsaf83
 
"A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ..."A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ...Lucidworks (Archived)
 
Realtime hadoopsigmod2011
Realtime hadoopsigmod2011Realtime hadoopsigmod2011
Realtime hadoopsigmod2011iammutex
 
Building Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with CascadingBuilding Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with CascadingPaco Nathan
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceTed Dunning
 
How Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDBHow Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDBJeremy Taylor
 
MongoDB on Windows Azure
MongoDB on Windows AzureMongoDB on Windows Azure
MongoDB on Windows AzureJeremy Taylor
 
MongoDB on Windows Azure
MongoDB on Windows AzureMongoDB on Windows Azure
MongoDB on Windows AzureJeremy Taylor
 

What's hot (20)

HUG slides on NFS and ODBC
HUG slides on NFS and ODBCHUG slides on NFS and ODBC
HUG slides on NFS and ODBC
 
SQLBits X SQL Server 2012 Rich Unstructured Data
SQLBits X SQL Server 2012 Rich Unstructured DataSQLBits X SQL Server 2012 Rich Unstructured Data
SQLBits X SQL Server 2012 Rich Unstructured Data
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Sasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation Defense
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Cidr11 paper32
Cidr11 paper32Cidr11 paper32
Cidr11 paper32
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learning
 
2012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum22012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum2
 
Liquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANALiquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANA
 
Implementation of nosql for robotics
Implementation of nosql for roboticsImplementation of nosql for robotics
Implementation of nosql for robotics
 
The 25 Most Promising Open Source Projects
The 25 Most Promising Open Source ProjectsThe 25 Most Promising Open Source Projects
The 25 Most Promising Open Source Projects
 
"A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ..."A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ...
 
Realtime hadoopsigmod2011
Realtime hadoopsigmod2011Realtime hadoopsigmod2011
Realtime hadoopsigmod2011
 
Building Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with CascadingBuilding Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with Cascading
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
 
How Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDBHow Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDB
 
MongoDB on Windows Azure
MongoDB on Windows AzureMongoDB on Windows Azure
MongoDB on Windows Azure
 
Science & technology (s&t) cloud2
Science & technology (s&t) cloud2Science & technology (s&t) cloud2
Science & technology (s&t) cloud2
 
MongoDB on Windows Azure
MongoDB on Windows AzureMongoDB on Windows Azure
MongoDB on Windows Azure
 

Viewers also liked

Hw09 Terapot Email Archiving With Hadoop
Hw09   Terapot  Email Archiving With HadoopHw09   Terapot  Email Archiving With Hadoop
Hw09 Terapot Email Archiving With HadoopCloudera, Inc.
 
Scaling and Managing Cassandra with docker, CoreOS and Presto
Scaling and Managing Cassandra with docker, CoreOS and PrestoScaling and Managing Cassandra with docker, CoreOS and Presto
Scaling and Managing Cassandra with docker, CoreOS and PrestoVali-Marius Malinoiu
 
EV.Cloud Email Archiving
EV.Cloud Email ArchivingEV.Cloud Email Archiving
EV.Cloud Email Archivingcrussell79
 
Introduktion till Advanias e-arkivlösning, baserad på HP Records Manager 8 - ...
Introduktion till Advanias e-arkivlösning, baserad på HP Records Manager 8 - ...Introduktion till Advanias e-arkivlösning, baserad på HP Records Manager 8 - ...
Introduktion till Advanias e-arkivlösning, baserad på HP Records Manager 8 - ...Advania
 
Driving Business Performance with effective Enterprise Information Management
Driving Business Performance with effective Enterprise Information ManagementDriving Business Performance with effective Enterprise Information Management
Driving Business Performance with effective Enterprise Information ManagementRay Bachert
 
Archiving 2.0 - Retain Business Value
Archiving 2.0 - Retain Business ValueArchiving 2.0 - Retain Business Value
Archiving 2.0 - Retain Business ValueGWAVA
 
Deep Dive Into Email Archiving Products
Deep Dive Into Email Archiving ProductsDeep Dive Into Email Archiving Products
Deep Dive Into Email Archiving ProductsStephen Foskett
 
Exchange Architecture & Sizing
Exchange Architecture & SizingExchange Architecture & Sizing
Exchange Architecture & SizingGWAVA
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Case Study - Business Transformation
Case Study - Business TransformationCase Study - Business Transformation
Case Study - Business TransformationPeopleWiz Consulting
 

Viewers also liked (13)

Hw09 Terapot Email Archiving With Hadoop
Hw09   Terapot  Email Archiving With HadoopHw09   Terapot  Email Archiving With Hadoop
Hw09 Terapot Email Archiving With Hadoop
 
ElasticInbox
ElasticInboxElasticInbox
ElasticInbox
 
Scaling and Managing Cassandra with docker, CoreOS and Presto
Scaling and Managing Cassandra with docker, CoreOS and PrestoScaling and Managing Cassandra with docker, CoreOS and Presto
Scaling and Managing Cassandra with docker, CoreOS and Presto
 
EV.Cloud Email Archiving
EV.Cloud Email ArchivingEV.Cloud Email Archiving
EV.Cloud Email Archiving
 
Introduktion till Advanias e-arkivlösning, baserad på HP Records Manager 8 - ...
Introduktion till Advanias e-arkivlösning, baserad på HP Records Manager 8 - ...Introduktion till Advanias e-arkivlösning, baserad på HP Records Manager 8 - ...
Introduktion till Advanias e-arkivlösning, baserad på HP Records Manager 8 - ...
 
Driving Business Performance with effective Enterprise Information Management
Driving Business Performance with effective Enterprise Information ManagementDriving Business Performance with effective Enterprise Information Management
Driving Business Performance with effective Enterprise Information Management
 
Archiving 2.0 - Retain Business Value
Archiving 2.0 - Retain Business ValueArchiving 2.0 - Retain Business Value
Archiving 2.0 - Retain Business Value
 
Deep Dive Into Email Archiving Products
Deep Dive Into Email Archiving ProductsDeep Dive Into Email Archiving Products
Deep Dive Into Email Archiving Products
 
Exchange Architecture & Sizing
Exchange Architecture & SizingExchange Architecture & Sizing
Exchange Architecture & Sizing
 
Cassandra via-docker
Cassandra via-dockerCassandra via-docker
Cassandra via-docker
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Case Study - Business Transformation
Case Study - Business TransformationCase Study - Business Transformation
Case Study - Business Transformation
 

[Hadoop] NexR Terapot: Massive Email Archiving

  • 1. Terapot: Massive Email Archiving with Hadoop & Friends - Commercial Hadoop Application
       Jason Han, Founder & CEO, NexR (jshan@nexr.co.kr)
       Next Revolution, Toward Open Platform
  • 2. NexR: Introduction
       Offering Hadoop & Cloud Computing Platform and Services
       - Hadoop Provisioning & Management
       - Hadoop & Cloud Computing Services
       - Massive Email Archiving; MapReduce Workflow; Academic Support Program
       - Massive Data Storage & Processing Platform
       - Cloud Computing Platform (Compatible with Amazon AWS): icube-cc (Compute), icube-sc (Storage)
  • 3. Email Archiving: Objectives
       - Regulatory compliance
       - e-Discovery: litigation and legal discovery
       - E-mail backup and disaster recovery
       - Messaging system & storage optimization
       - Monitoring of internal and external e-mail content
  • 4. Email Archiving: Architecture
       [Diagram labels: Email Servers; Crawling; Journaling; DB Server; Email Archiving Servers (HA); Search & Discovery; Metadata; Indexes; Storage Network; Archival Storage; Email Aging; DAS, SAN, NAS, Tape Library]
  • 5. Email Archiving: Challenges
       Explosive growth of digital data
       - 6 times more in 2010 (988 XB) than in 2006
       - 95% (939 XB) is unstructured data, including email
       - Increasing cost and complexity of archiving
       → Requires scalable & low-cost archiving
       Reinforcement of data retention regulation
       - Retention, Disposal, e-Discovery, Security
       - HIPAA (Healthcare) 21-23 yrs, SEC17 (Trading) 6 yrs, OSHA (Toxic) 30 yrs, SOX (Finance) 5 yrs, J-SOX, K-SOX
       → Requires scalable archiving & fast discovery
       Needs for intelligent data management
       - Knowledge management from email data
       - Filtering, monitoring, data mining, etc.
       → Requires integration with intelligent systems
  • 7. Email Archiving: Problems
       [Architecture diagram from slide 4, annotated with three problems]
       - Centralized search is slow & not scalable
       - Storage is expensive & not scalable
       - Discovery from tape is slow
  • 8. Terapot: When Hadoop Met Email Archiving…
       Scale-out architecture with Hadoop
       - Hadoop HDFS for archiving email data
       - Hadoop MapReduce for crawling & indexing
       - Apache Lucene for search & discovery
       [Diagram labels: Email Servers; Email Archiving Servers (HA); Distributed Crawling; Journaling; Hadoop MapReduce (Crawling, Indexing, etc.); Metadata DB Server; Journaling Server; Hadoop HDFS (Archiving); Distributed Search & Discovery]
  • 9. Terapot: Overview
       Design Principles
       - Shared-nothing architecture → Unlimited scalability
       - Inexpensive hardware → Low cost
       - Using open source software → Fast development
       - Exploiting parallelism → High performance
       - Integrating with analysis → High intelligence
       Features
       - Distributed massive email archiving
       - High scalability: thousands of servers, billions of emails
       - High performance: fast search (under 1-2 seconds for each user account); fast discovery in parallel with MapReduce
       - High intelligence: email data mining, such as social network analysis
       - Supports both an on-premise version and a cloud (hosted) version
       - Developed with various open source software
  • 10. Terapot: Open Source Software Stack
       [Stack diagram, top to bottom]
       - Frontend Layer: Apache Tomcat, Apache JAMES
       - Crawling, Indexing, Searching, Email Mining, Downloading
       - Backend Layer: Zookeeper, Apache Lucene, Hive, MySQL, Hadoop MapReduce, Hadoop HDFS
  • 11. Terapot: Architecture
       [Diagram labels]
       - Terapot Clients: HTTP/SOAP, REST, JSON, POP3
       - Email Sources: Mail Server, NAS/NFS Server, FTP/SFTP Server
       - Terapot Frontend: Search Gateway, MailServer, MR Workflow Manager, Analyzer
       - Searching; Real-Time Indexing
       - Batch processing: Crawling, Indexing, Merging
       - Analysis: ETL, Mining
       - Hadoop MapReduce, Lucene, & Hive
       - Storage: HDFS (email, index), Local (index)
  • 12. Terapot Data Archiving Flow
       Search Layer
       - 1. Search emails (Shards, Index)
       Real-Time Indexing Layer
       - 1. Send email
       - 2. Deliver email
       - 3. Push email
       - 4. Save email & build index files in runtime
       - 5. Forward email
       - 6. Receive email
       Batch Processing Layer
       - 1. Fetch emails in parallel
       - 2. Save emails
       - 3. Build index files
       [Other diagram labels: Internet; HTTP/SMTP Server; NAS/NFS Server; FTP/SFTP Server; Crawler (MR); Indexing (MR); Real-Time Shard; HDFS (emails, Index)]
  • 13. Terapot Data Analysis Flow
       Report Retrieval Layer
       - 1. View report for archiving data (NexR Terapot Front, MySQL)
       Data Analysis Layer
       - 1. Send HiveQL to analysis data
       - 2. Generate report in MySQL
       ETL Layer
       - 1. Fetch emails in parallel
       - 2. Store large data
       [Other diagram labels: Terapot Mining Engine; Terapot Archiving Storage; Transform (MR); Analysis data; HDFS]
  • 14. Technical Features
       Distributed Archiving
       - Hadoop HDFS for storing email data
       - Compression and deduplication for storage space efficiency
       Distributed Crawling & Indexing
       - Implemented with Hadoop MapReduce
       - Supports both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc.)
       - Supports batch indexing & merging by MapReduce, and real-time indexing for instant archiving
       Distributed Search
       - Shards a search job and executes it in parallel
       - An email is searchable as soon as it is received (due to real-time indexing)
       Parallel Download
       - Downloads full search results in parallel by MapReduce
       - Supports various download protocols (Local FS, HDFS, FTP, SFTP, HTTP, etc.)
       Standard Client Interface
       - Supports REST/SOAP and JSON interfaces
       Management
       - Configurable MapReduce job scheduling (crawling, indexing, merging, etc.)
  • 15. Crawling
       Store massive email data in HDFS through MapReduce
       - The Hadoop utility (dfs -put) just copies data sequentially
       - Each Crawling MR takes & stores a range of data in parallel
       [Diagram labels: Crawling Client; Data Location Information (INPUT); Splitting; Crawling MR; {key,email}*; HDFS]
  • 16. Indexing
       Indexing email data with MapReduce
       - Each Indexing MR takes a range of data and builds a Lucene index in parallel
       [Diagram labels: Indexing Client; Email Data (INPUT); Splitting; Indexing MR; {key,index}*; HDFS]
  • 17. Real-Time Indexing
       Indexing email data at runtime
       - Index in memory when a new email arrives
       - Flush the RT-Shard periodically into HDFS
       [Diagram labels: Mail; JAMES; Forwarding Mailet Component; RT Shard (emails, Local Index); Periodic Real-Time Shard flushing into HDFS; Email Data; HDFS]
  • 18. Searching
       Distributed Search
       - Indexes are split & stored on local disks
       - Each shard is responsible for searching a range of the index
       [Diagram labels: Searching Client; Search; Shard (Local Index); RT Shard; Read email; HDFS; Zookeeper; Notification; Update shard state & index information]
  • 19. Parallel Downloading
       Downloading massive search results in parallel
       - Supports various types of communication for downloading
       - The Downloading MR sorts search results globally & pushes them to the targets
       [Diagram labels: Download Client; Download Request; Shard; DL Map; DL Reduce; Distributed Global Sort; write result / write result directly; targets: Local, HDFS, FTP, SFTP, HTTP]
  • 20. Email Data Analysis
       Analysis Process
       - ETL (Extract-Transform-Load) of email archiving data into Hive table format
       - Analyzing data using Hive with various analysis algorithms
       - Generating the analysis result report
       [Diagram labels: Terapot Archiving Data; ETL MR; Load; HIVE; Terapot Mining; execute HiveQL; write result; Generate Report; MySQL]
  • 21. Types of Analysis
       Social Network Analysis
       - Personal Network Analysis
         - Computing distance between recipients or senders based on TO, CC, FROM links
         - Analyzing the statistics of mail frequency
       - Domain Analysis
         - Computing distance between recipients' domains based on TO, CC, FROM
       Keyword Analysis (in progress)
       - Keyword frequency for each user
  • 22. Terapot Performance
       Experimental Environment
       - 11 Intel servers: 1 master + 10 slaves
       - Xeon 2.0 GHz, 2 CPUs, 16 GB memory, 4 TB disk
       - Number of emails: 270 million (index size: 270 GB)
       Results: Indexing in local disks
       Number of Emails   Number of Results   Response Time (sec)
       67,217,298         12,547,398          1.4
       134,434,596        25,094,796          1.4
       201,651,894        37,642,194          1.4
       268,869,192        50,189,592          1.4
       Results: Indexing in HDFS
       Number of Emails   Number of Results   Response Time (sec)
       67,217,298         12,547,398          2.8
       134,434,596        25,094,796          2.8
       201,651,894        37,642,194          3.2
       268,869,192        50,189,592          3.2
  • 24. www.nexr.co.kr - Hadoop & Cloud Computing Company
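
The distributed search described in slides 14 and 18 (each shard searches its own range of the index, and the job runs in parallel) can be illustrated with a minimal Python sketch. This is not Terapot's implementation. The shard contents, the function names `search_shard` and `distributed_search`, and the whole-word matching are hypothetical stand-ins for the per-shard Lucene indexes, and a thread pool stands in for the separate shard servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for each shard's local index:
# every shard holds a disjoint range of (email_id, text) pairs.
SHARDS = [
    [(1, "quarterly report attached"), (2, "lunch on friday")],
    [(3, "report for the audit team"), (4, "holiday schedule")],
    [(5, "audit report draft"), (6, "welcome aboard")],
]


def search_shard(shard, term):
    """Return the ids of emails in one shard whose text contains the term."""
    return [email_id for email_id, text in shard if term in text.split()]


def distributed_search(shards, term):
    """Fan the query out to every shard in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # One task per shard; list() collects every partial result.
        partial_results = list(pool.map(lambda s: search_shard(s, term), shards))
    # Merge step: concatenate the per-shard hits and sort by email id.
    return sorted(hit for hits in partial_results for hit in hits)


if __name__ == "__main__":
    print(distributed_search(SHARDS, "report"))  # prints [1, 3, 5]
```

Because each shard only scans its own range, the work per shard stays roughly constant as shards are added. That is consistent with the flat response times reported on slide 22, although the sketch itself makes no performance claim.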