Next Revolution
Toward Open Platform

                                      NYC 2011




                       www.nexr.com
NexR Introduction
 Big data analytics firm
   Working on Hadoop and big data for 5 years
   Provided the NexR Hadoop solution to all major Korean telcos
   (KT, SKT, LG U+)
   Leading the Korean Hadoop community and holding Hadoop
   conferences


 Products
   NexR Data Analytics Platform (NDAP)
   iCube Cloud: cloud computing platform (like OpenStack)
   Massive email archiving solution (presented at Hadoop World 2009)




Agenda
 Voice of Customer: KT CDR Analysis System
 KT requirements for system migration
 NexR Data Analytics Platform (NDAP) overview
 Oracle-to-Hive migration
 Enterprise Hive
 RHive
 Lessons learned
 Conclusion

                    You can download this presentation file:
                      http://www.nexr.com/hw11/ndap.pdf



Introduction to KT business

   1981.12 Establishment of KT Corporation
   2002.08 Privatization from a government-owned company
   2006.04 Commercial launch of the world's first Mobile WiMAX (WiBro)
   2008.11 Commercial launch of real-time IPTV
   2009.06 Merged with KTF
   2010.06 Cloud service launched

Business Domain
   – Mobile 2G, 3G
   – WiBro (Mobile WiMAX)
   – Internet access
   – IPTV
   – VoIP
   – Multimedia contents
   – Local and international telephone
   – Cloud service

Key figures: # of employees, sales (2010), telephone subscribers, broadband subscribers, mobile subscribers
Introduction to KT CDR data
• KT CDR (Call Detail Record)                                             Unit: TB

  Category   Data                                  Raw data    Summary     Size
                                                   (1 month)   (1 month)   (raw 1 yr + summary 2 yrs)
  Wireless   Unrated CDR (Voice, Data, SMS, MMS)   3.7         2.5         104
  Wireless   Rated CDR                             1.5         0.2         22
  Wireless   Wi-Fi                                 0.4         0.3         12
  Wireless   WiBro                                 1.5         1.0         42
  Wireline   Rated CDR                             1.5         1.5         55
  IPDR       IP-TV                                 1.5         0.1         19
  Total                                            10          5.6         254


• KT Subscriber Analysis System (SAS) for Wireless CDR
   Reporting, call detail summary, subscriber's call quality, call log search, etc.

   Implemented with a relational database on a high-end server
     - Data gathering and conversion on a server every tens of seconds
     - Daily batch extract-transform-load (ETL) with SQL queries
     - Near real-time search against an indexed column (call number)

   Hundreds of DB tables and over 3000 SQL queries accumulated over 10 years
Current KT CDR Analysis System Architecture


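[Figure: current architecture. Data sources (LALA2, NIBADA, ARGOS) feed a collector
server and a data converting step, which load a relational database on a high-end
server (raw data, dimension table, summary table, batch ETL). The database serves
real-time search and OLAP. Four points along this path are marked as bottlenecks.]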
New Challenges Faced
• Increasing data volume
   – Popular demand for smartphones and SNS
   – Need to do more complicated data analysis to beat the competition
   – Customer behavior analysis is needed
• Slow performance
   – Peak-time performance became unacceptable
   – Some CDRs were lost due to slow performance
• KT Cloud business launched
   – Cheaper new KT Cloud H/W is available
   – Open source requirements are increasing in the company


           Can a traditional DB give us an answer?
KT meets NexR for Big Data
• Scalability
   – Coping with increasing data volume and variety (wired, 2G, 3G, WiMAX,
     LTE, Wi-Fi, SMS, MMS, etc.)
   – Enabling horizontal scalability in every data path (data collection,
     data storage, ETL process, data search)
• Performance
   – Handling streamed CDR data in near real-time
   – Completing daily ETL tasks in a given time period regardless of data
     increase
• Cost-Efficiency
   – Reducing cost with inexpensive equipment

Replacing traditional RDB and DW with Hadoop and similar OSS
   – Project start (2011.4)
   – Applying NexR solution for CDR analysis (pilot)
Continuous Journey for KT’s Big Data
• Step-by-step approach with NexR

  Steps                                  Open      Coverage
  Hadoop CDR Analysis Platform           2012.1Q   Replacing representative data and SQLs;
                                                   unrated wireless CDR (pilot)
  Wireless CDRs                          2012      Changing all traditional applications to OSS;
                                                   adding more views and reports
  Data Integration, Advanced Analytics   2013      Rated CDRs, Internet access log, TV log;
                                                   advanced analytics
  External Data Sources                  2014      SNS, location, etc.;
                                                   data from KT subsidiaries
Rethinking KT’s Requirements
[Figure: evolution of KT's data requirements]
   Past:    KT CDR system with hundreds of DB tables and 3000 SQLs (for 10 years),
            used by SQL developers, DBAs, and business analysts (OLAP, SAS, etc.)
   Present: Data explosion
   Future:  Data integration (IPTV log, Internet access log, social data)
   Needs:   Data volume + data variety
   Who?     Data interface + data engineers

Big Data Analytics Requirements for Enterprise

    Data volume is only the basic requirement

    Data integration is the fundamental requirement
    (Structured data + Unstructured data)

    Need to preserve the existing data and apps
    Need to be familiar to enterprise data engineers
     (DBA, SQL developers, business analysts, etc)

    Smooth transition is also essential


                What’s the solution?
NexR Solution: Hadoop + Hive
 Hive is the best solution for a smooth transition from the database world
 to the Hadoop world

   Hive:   ANSI-SQL-based query engine, good for RDB migration
           Role: batch data processing (ETL, reporting, ad-hoc query)
   Hadoop: File-based data store, good for data integration
           Role: common data storage

   Enterprise data engineers (DBA, SQL developers): how to convert, how to adapt




NexR Data Analytics Platform (NDAP)
 Bringing databases into the Hadoop world
   Support for migrating data and logic from RDB to Hadoop
   Support for integrating RDB and Hadoop
   Offering Hive tools for DBAs and SQL developers


 Full package for big data analytics
   From data collection to batch data processing, real-time query, and
   even advanced analytics
   Leveraging open source technologies
   Horizontal scalability in every data processing path (collection,
   batch processing, real-time query, etc.)
   Incorporating real-world practices through collaboration with KT




NDAP Bird’s View
 Open source software used

   Component              Built on / features                            Role
   NDAP RHive             Integration of R and Hive                      Advanced analytics
   NDAP Enterprise Hive   Oracle-to-Hive, Hive workflow,                 Batch data processing
                          Hive performance monitor, query planner
   NDAP Data Store        HDFS, Sqoop-based data import/export           Common data storage
   NDAP Search            ElasticSearch-based distributed log search,    Real-time query
                          time-ranged index sharding
   NDAP Collector         Flume-based data collector,                    Streamed data collection
                          checkpointing for low-overhead agents
   NDAP Admin Center      Zookeeper-based distributed coordinator,       Coordination & management
                          Collectd-based system/app management
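
The "time-ranged index sharding" listed for NDAP Search can be illustrated with a small sketch. One common way to shard a log index by time is to write each record to an index named after its time bucket, so that a time-bounded search only touches the indexes that overlap the query range. The daily bucket size and the `cdr-YYYYMMDD` naming scheme below are assumptions made for illustration; neither is taken from NDAP.

```python
from datetime import datetime, timedelta
from typing import List

INDEX_PREFIX = "cdr-"  # hypothetical naming scheme, not from NDAP


def index_for(ts: datetime) -> str:
    """Return the name of the daily index that holds a record stamped ts."""
    return INDEX_PREFIX + ts.strftime("%Y%m%d")


def indexes_for_range(start: datetime, end: datetime) -> List[str]:
    """Return the daily indexes a search over [start, end] has to touch."""
    if end < start:
        raise ValueError("end must not precede start")
    names = []
    day = start.date()
    while day <= end.date():
        names.append(INDEX_PREFIX + day.strftime("%Y%m%d"))
        day += timedelta(days=1)
    return names
```

With this layout, a search covering a few hours around midnight reads two daily indexes instead of the whole log history, and old data can be dropped one index at a time.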


NDAP Architecture
[Figure: NDAP architecture]
   Data sources: Oracle databases, ODS, telco equipment (streaming data)
   NexR Data Analytics Platform (NDAP):
     Data Importer (from Oracle databases and ODS), Collector (from telco equipment)
     Data Store (Hadoop)
     Enterprise Hive: Oracle-to-Hive; Hawk (PerfMon, Query Plan); Lama (Hive Workflow)
     RHive
     Search (REST/JSON API)
     Data Exporter (to existing BI/DW: OLAP server, Oracle)
     Admin Center
   Applications: advanced analytics, DBA, ETL, ad-hoc query, reporting, OLAP, real-time query


NDAP Bird’s View – Today’s focus

   Today's talk
     NDAP RHive: integration of R and Hive
     NDAP Enterprise Hive: Oracle-to-Hive, Hive workflow, Hive performance monitor, query planner

   Refer to appendix
     NDAP Data Store: HDFS, Sqoop-based data import/export
     NDAP Search: Lucene-based distributed log search engine, time-ranged index sharding
     NDAP Collector: Flume-based data collector, checkpointing for low-overhead agents
     NDAP Admin Center: Zookeeper-based distributed coordinator, Collectd-based system/app management


Enterprise Hive




 Recreating Hive for Enterprise Data Engineers

 Two goals
   Migration of data and SQL from RDB(Oracle) to Hive
    Oracle-to-Hive support
   Rich environment for Hive developers, even DW/BI teams and DBA
    Performance monitor, query planner, workflow manager




Is Oracle-to-Hive trivial?
Simple Example
  Oracle: SELECT * FROM Employee e1, Dept d1 WHERE e1.ID = d1.Id
  Hive:   SELECT * FROM Employee e1 JOIN Dept d1 ON (e1.ID = d1.Id)




Typical Example
  SELECT /*+ PARALLEL(K1 16) USE_NL(K1 B) */
          ETL_DATE, CALL_DATE,
          CASE WHEN SUBSCRIBER_TYPE ='PREMIUM'
              THEN 'Y'
              ELSE NVL(TO_CHAR(B.I_NCN),'X')
          END AS I_NCN,
          I_INOUT,VALID_CNT, I_CFC_TYPE, ……
    FROM 3G_CALL_LOG K1
       , SASCOMM.PHONE_MAPPING B
   WHERE K1.i_etl_dt        = TO_DATE('[#SAS_YDAY#]','YYYYMMDD')
     AND K1.i_call_dt ||'' >= TO_DATE('[#SAS_YDAY#]','YYYYMMDD')
     AND K1.i_call_dt ||'' < TO_DATE('[#SAS_YDAY#]','YYYYMMDD') + 1
     and K1.I_INOUT in ('0','1')
     AND DECODE(K1.I_INOUT,'0',NVL(K1.I_OUT_CTN, I_CALLING_NUM),'1',K1.I_IN_CTN) = B.I_CTN(+)
     AND K1.CALL_DATE   >= B.SDATE(+)
     AND K1.CALL_DATE   <   B.EDATE(+);




Enterprise Hive – Oracle-to-Hive
 Enhancing Hive by
   Fixing Hive code (JIRA issues 2253, 2503, 2329, 2332, etc.)
   Adding Hive UDF and UDAF for Oracle compatibility


 Enterprise Hive provides
   Conversion rules, a guide and a process
   Oracle data types that are not supported in Hive
   Oracle functions that are not supported in Hive


 Three conversion points to consider
   Data model and data types
   Basic functions, aggregate and analytic functions
   SQL syntax


Oracle-to-Hive – Data Model, Types, Functions

  Data Model
     Oracle          Hive
     Table           Table
     Partition       Partition
     Sampling        Bucket

  Data Type
     Oracle          Hive
     NUMBER(n)       TINYINT / INT / BIGINT
     NUMBER(n,m)     FLOAT / DOUBLE
     VARCHAR2        STRING
     DATE            STRING in "yyyy-MM-dd HH:mm:ss" format

     Hive data types are designed to be converted into Java data types

  Basic Functions (Hive function syntax follows MySQL)
     Function type   Oracle                                       Hive
     Math            round, ceil, mod, power, sqrt, sin/cos       round, ceil, pmod, power, sqrt, sin/cos
     Character       substr, trim, lpad/rpad, ltrim/rtrim,        substr, trim, lpad/rpad, ltrim/rtrim,
                     replace                                      regexp_replace
     Null            coalesce, nvl, nvl2                          coalesce (no nvl, nvl2)

  Added Basic Functions (Hive UDF)
     Function type     Hive
     Condition         DECODE, GREATEST
     Null              NVL, NVL2
     Type conversion   TO_NUMBER, TO_CHAR, TO_DATE, INSTR4, DATE_FORMAT, LAST_DAY
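
To make concrete what the added NVL, NVL2, and DECODE functions have to reproduce, here is a minimal sketch of their Oracle semantics in Python, with `None` standing in for SQL NULL. It is an illustration of the behavior a compatibility UDF needs to match, not NexR's UDF code; the function names and signatures are chosen for this example only.

```python
from typing import Any


def nvl(value: Any, default: Any) -> Any:
    """Oracle NVL: return default when value is NULL (None here)."""
    return default if value is None else value


def nvl2(value: Any, if_not_null: Any, if_null: Any) -> Any:
    """Oracle NVL2: choose a result depending on whether value is NULL."""
    return if_null if value is None else if_not_null


def decode(expr: Any, *args: Any) -> Any:
    """Oracle DECODE: compare expr with each search value in turn.

    args is search1, result1, search2, result2, ..., optionally followed
    by a default. Without a default, an unmatched expr yields NULL.
    DECODE treats two NULLs as equal, unlike a plain SQL `=` comparison.
    """
    if len(args) % 2:                 # odd count: last argument is the default
        pairs, default = args[:-1], args[-1]
    else:
        pairs, default = args, None
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:            # None == None is True, matching DECODE
            return result
    return default
```

For example, `decode(nvl(job, "SALESMAN"), "SALESMAN", sal, 0)` mirrors the `decode(nvl(JOB, 'SALESMAN'), 'SALESMAN', sal, 0)` expression used in the conversion example of this talk.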


Oracle-to-Hive – SQL Syntax & Analytic Functions
 Most unsupported Oracle SQL syntax can be converted using join syntax

 IN subquery
   Oracle: SELECT * FROM Employee e WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d)
   Hive:   SELECT * FROM Employee e LEFT SEMI JOIN Dept d ON (e.DeptNo = d.DeptNo)

 NOT IN subquery
   Oracle: SELECT * FROM Employee e WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d)
   Hive:   SELECT e.* FROM Employee e LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo)
           WHERE d.DeptNo IS NULL

 JOIN
   Oracle: SELECT * FROM Employee e1, Dept d1 WHERE e1.ID = d1.Id
   Hive:   SELECT * FROM Employee e1 JOIN Dept d1 ON (e1.ID = d1.Id)

 RANK (analytic function)
   Oracle: SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM emp
   Hive:   SELECT e.name, e.dept, e.salary, RANK(e.dept, e.salary)
           FROM (SELECT name, dept, salary FROM emp DISTRIBUTE BY dept SORT BY dept, salary DESC) e

 MIN (aggregate function)
   Oracle: SELECT dept, MIN(salary) OVER (PARTITION BY dept) FROM emp
   Hive:   SELECT dept, tmp.m FROM emp JOIN (SELECT dept, MIN(salary) m
           FROM emp GROUP BY dept) tmp ON emp.dept = tmp.dept

 Oracle analytic functions are sometimes used for statistical processing (5% in the KT case)
 Implemented some analytic functions
   (RANK, DENSE_RANK, ROW_NUMBER, LAG, MIN, MAX, SUM)
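
The subquery-to-join rewrites above can be sanity-checked on a small data set before they are applied to production queries. The sketch below uses Python's built-in `sqlite3` module as a stand-in for both Oracle and Hive, so it checks only the logic of the rewrite, not either engine's behavior. The table contents are made up, and because SQLite has no `LEFT SEMI JOIN`, a plain inner join is used for the IN case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (ID INTEGER, Name TEXT, DeptNo INTEGER);
    CREATE TABLE Dept (DeptNo INTEGER, DName TEXT);
    INSERT INTO Employee VALUES (1, 'Kim', 10), (2, 'Lee', 20), (3, 'Park', 30);
    INSERT INTO Dept VALUES (10, 'Sales'), (20, 'Network');
""")


def rows(sql):
    """Run a query and return its rows sorted, so results compare order-free."""
    return sorted(conn.execute(sql).fetchall())


# NOT IN subquery versus outer join plus IS NULL filter.
not_in = rows("SELECT e.* FROM Employee e "
              "WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d)")
anti_join = rows("SELECT e.* FROM Employee e "
                 "LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo) "
                 "WHERE d.DeptNo IS NULL")

# IN subquery versus inner join; equivalent here because Dept.DeptNo is unique.
in_subquery = rows("SELECT e.* FROM Employee e "
                   "WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d)")
inner_join = rows("SELECT e.* FROM Employee e "
                  "JOIN Dept d ON (e.DeptNo = d.DeptNo)")

print(not_in == anti_join, in_subquery == inner_join)
```

Two caveats apply when generalizing this check. If `DeptNo` can be NULL on either side, `NOT IN` and the outer-join form return different rows, so the rewrite needs an explicit NULL filter. If the joined column is not unique, an inner join duplicates rows where `IN` (and `LEFT SEMI JOIN`) would not.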

Oracle-to-Hive – Example


           -- Oracle
           select /*+ use_nl(E emp_idx1) */
                  D.dname, E.empno, E.ename,
                  decode(nvl(JOB, 'SALESMAN'), 'SALESMAN', sal, 0) sal,
                  RANK() over (PARTITION BY D.deptno ORDER BY sal desc) ranking
           from   dept D, emp E
           where  D.deptno = E.deptno
           and    E.ename in (select ename
                              from bonus
                              where job in ('SALESMAN', 'CLERK'));


           -- Hive (nexr_rank is a UDF added by Enterprise Hive)
           select X.dname, X.empno, X.ename, X.sal,
                  nexr_rank(HASH(X.deptno, X.sal), X.sal) ranking
           from (
               select D.dname, D.deptno, E.empno, E.ename,
                  (case coalesce(JOB, 'SALESMAN') when 'SALESMAN' then sal else 0 end) sal
               from   dept D
               join   emp   E on (D.deptno = E.deptno)
               join   bonus B on (E.ename = B.ename)
               where  B.job in ('SALESMAN', 'CLERK')
           ) X
           distribute by hash(X.deptno, X.sal) sort by X.deptno, X.sal;




NDAP Process for RDB Migration

   Phase              Tasks
   Data Preparation   Oracle schema to Hive schema; data loading to Hive using Sqoop
   Conversion         Function conversion; SQL conversion (syntactic, by 1-on-1 conversion rules)
   Validation         Data compatibility check
   Optimization       Rewriting Hive queries semantically (when more performance is needed)




  The case of KT CDR migration
      Chose 100 representative ETL SQLs and converted them successfully
      Current step: 200-300 frequently used SQLs
      Next step (2012): 3000 SQLs



Enterprise Hive – Rich Environment for Hive

 Building up Hive Ecosystem by
   Adding assistant programs to help DBA and SQL developers


 Enterprise Hive provides

                      Hive performance monitor and query planner


                      Hive workflow manager




Hawk – Hive Performance Monitor
 Difficulty of Hive performance diagnostics
   Metrics and logs from Hive and Hadoop are separated
   No historical or statistical view of performance
 Hawk performance monitor for DBA
   Integrated view of a Hive query and the corresponding MapReduce jobs
   Hourly/daily/weekly/monthly performance views of each query

   [Figures: Hawk architecture; Hawk screenshot]


Hawk – Hive Query Planner
 Difficulty of the Hive default query planner
     Too complicated because it shows the details of MapReduce execution
     Aimed at Hive internal developers, not DBAs
 Hawk query planner for DBA
     Displaying a Hive query at the HQL operator level (familiar to DBAs)
     Showing a query and its performance result together

     [Figures: Hive default query planner; Hawk query planner with performance result]



Lama – Hive Workflow Manager

 Workflow development and management tool for Hive
   Managing data processing jobs for Hive
   Choosing Oozie as a core workflow engine
   Providing web-based interface
     Workflow editing & management, user management, job scheduling,
     project management, etc


 On-demand workflow change at runtime
   Need to fix and resume a workflow at runtime after a failure
   Not supported in most workflow engines
   Patching Oozie for suspend/resume per action (i.e., per Hive query)
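
The suspend/fix/resume behavior described above can be pictured with a toy in-memory runner: when an action fails, the workflow stops at that action, the operator swaps in a corrected version, and the run continues from the failed action instead of starting over. This is a sketch of the idea only. It does not use the Oozie API and is not NexR's patch; the `Workflow` class and its method names are hypothetical.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, Callable[[], None]]  # (action name, callable that runs it)


class Workflow:
    """Runs actions in order and remembers where it stopped on failure."""

    def __init__(self, actions: List[Action]) -> None:
        self.actions = list(actions)
        self.next_index = 0       # first action that has not completed yet
        self.suspended = False

    def run(self) -> List[str]:
        """Run from the first incomplete action; return names completed in this call."""
        done = []
        self.suspended = False
        while self.next_index < len(self.actions):
            name, func = self.actions[self.next_index]
            try:
                func()
            except Exception:
                # Keep next_index unchanged so the failed action is retried on resume.
                self.suspended = True
                break
            done.append(name)
            self.next_index += 1
        return done

    def replace_action(self, name: str, func: Callable[[], None]) -> None:
        """Swap in a fixed version of an action (e.g., a corrected Hive query)."""
        for i, (existing, _) in enumerate(self.actions):
            if existing == name:
                self.actions[i] = (name, func)
                return
        raise KeyError(name)
```

Calling `run()` again after `replace_action()` resumes at the failed action, so the work done by earlier actions is not repeated.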

 Future plan
   Supporting other data processing jobs like Pig, Sqoop, MapReduce,
   HDFS, SSH, and Java

NDAP Process for Batch Data Processing

   Phase         Tasks                                                      Tool
   Analysis      Analyze service request (SR);
                 Hive data and query modeling
   Development   Workflow development, testing, and validation             Lama Workflow Manager
   Execution     Workflow deployment & scheduling;                         Hawk Performance Monitor
                 performance monitoring
   Management    Performance diagnostics & optimization;                   Hawk Query Planner
                 workflow suspend/fix/resume on failure




Enterprise Hive Demo




R for Advanced Analytics
 R (GNU open source)
   Programming language and software environment for statistical
   computing and graphics (wikipedia)
   4,000+ R libraries (more than SAS’s functionality)
   Becoming a de facto standard among statisticians


 R for Big Data
   R runs on a single node
   Some parallel R packages exist
     snowfall, rpvm, rmpi, etc.


 Recent attempts to combine R and Hadoop
   RHIPE (Purdue), RHadoop (Revolution Analytics), Ricardo (IBM)


RHive
 Marrying R and Hive for Big Data Analytics
     Most R programmers are familiar with SQL
     Hive can hide the details of Hadoop and MapReduce
     Inspired by IBM Ricardo (R + Jaql)

     R:    strong for deep analytics, weak at massive data manipulation
     Hive: strong for massive data manipulation, lacks analytical functionality




                  Providing Hive interfaces in the R environment
      Allowing R programmers to use a familiar SQL for big data manipulation

                     Released as open source (Apache license version 2)
                           Source: https://github.com/nexr/RHive
                     CRAN: http://cran.r-project.org/web/packages/RHive

RHive API and Architecture
 RHive API
   rhive.connect(): connect R to Hive
   rhive.query(): send a Hive query and return the result
   rhive.export(): export R functions to R processes running on the MR nodes
   rhive.exportAll(): export R functions and R objects to R processes running on
   the MapReduce nodes
   rhive.close(): close a Hive connection

   [Figure: RHive architecture]




RHive Sample – Flight Delay Prediction
 R: Building a prediction model of flight delay using linear regression with a training data set (sampled from Hive)
 Hive: Running the prediction model (R objects) against the entire data set in Hive

 Data set: airline on-time performance (http://stat-computing.org/dataexpo/2009/)
   • Flight arrival and departure details for all commercial flights within the USA,
     from October 1987 to April 2008.

   library(RHive)
   rhive.connect("127.0.0.1")

   # get a training data set from Hive
   trainset <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines",fetchsize=30,limit=100)

   # convert to numeric, and remove rows with missing values
   trainset$arrdelay <- as.numeric(trainset$arrdelay)
   trainset$distance <- as.numeric(trainset$distance)
   trainset <- trainset[!(is.na(trainset$arrdelay) | is.na(trainset$distance)),]

   # create a prediction model using R model objects and internal functions
   model <- lm(arrdelay ~ distance + dayofweek,data=trainset)
   rhpredict <- function(arg1,arg2,arg3) {
     if(arg1 == "NULL" | arg2 == "NULL" | arg3 == "NULL")
       return(0.0)
     res <- predict.lm(model, data.frame(dayofweek=arg1, arrdelay=arg2, distance=arg3))
     return(as.numeric(res))
   }
   null <- "NULL"

   # set up R objects in Hive
   rhive.assign("null", null)
   rhive.assign("rhpredict", rhpredict)
   rhive.assign("model", model)

   # export the R prediction model and run it in Hive
   rhive.exportAll("rhpredict", c("10.1.3.2","10.1.3.3","10.1.3.4","10.1.3.5","10.1.3.6","10.1.3.7"))
   rhive.query("create table delaypredict as select R('rhpredict', dayofweek, arrdelay, distance, 0.0) from airlines")

RHive Demo




Lessons Learned
 Migrating an RDB to open source is complicated, time-consuming,
 and labor-intensive. It becomes feasible with practice and a
 defined migration process.
   Average time to convert one query (200~300 lines on average)
     8 hours -> 2 hours after 4 months (4 times faster)
   Prior database migration experience is an advantage
   (similar to an Oracle-to-MySQL migration)

 Current data engineers are not familiar with open source
 software like Hadoop. They want tools similar to the ones
 they already use.
   Open source software such as Hadoop and MapReduce is not easy for
   current IT managers. Open source projects are technology-driven, not
   demand-driven.
   Open source technologies need to be wrapped in familiar
   interfaces that hide the details.

Lessons Learned
 Open source software is not a panacea. Choosing the right
 open source project is the first significant step. Combining several
 OSS projects is common. Modifying OSS source code is
 inevitable when requirements are not negotiable.
   Combined two separate open source projects, Hive (batch data
   processing) and ElasticSearch (real-time query), on Hadoop as a
   common data store.
   Modified Hive, ElasticSearch, Flume, Oozie, Zookeeper, etc.

 Integrating various types of data is a critical issue for
 an enterprise. In particular, structured data from databases
 and the DW needs to be combined with unstructured data in order
 to better understand customers’ needs.
   A new environment has to accommodate existing data and business
   logic.
   RDB/DW and Hadoop each have pros and cons, so it is necessary to
   find the right mix.

Conclusion


   Big data analytics for telco and enterprises




 Smooth transition from RDB/DW to



     NexR Data Analytics Platform (NDAP)



NexR NDAP Team
   Jaesun Han                  Wonkuk Yang
   Sangmin Kwak                Sebong Oh
   JeongMin Kwon               SungHan Woo
   Keumju Kim                  Dongmin Yu
   Daegeun Kim                 Choonghyun Ryu
   Minseok Kim                 Bokju Yun
   Minwoo Kim                  Jonghee Lee
   Yeonseop Kim                HyungJoo, Lim
   Youngwoo Kim                HeeWon Jeon
   Hyeon-Cheol Nah             GooBum Jung
   SeungWoo Ryu                Sunghwan Cho
   Seoeun Park                 Junho Cho
   Young-Geun Park             ByungMyon Chae
   Eun-Sook Park               Yungtai Choi
   Chihoon Byun                Choi Jong-wook
   SeongHwa Ahn                Inho Han
    Youngbae An                Seonghak Hong
Thank you
              Presentation file: http://www.nexr.com/hw11/ndap.pdf


                                                         Contact
                                                         jason.han@nexr.com
                                                         twitter: @jaesun_han




 KT CDR System (Slide 4)
 NDAP Overview (Slide 14)
 Enterprise Hive (Slide 17)
 RHive (Slide 30)
 Appendix (Slide 37)


Appendix



NDAP Collector
 Flume-based scalable data collector
    Choosing Flume due to the flexible architecture (source, decorator, sink)
    Adding a checkpoint mode and rolling/dedup
 Adding a checkpoint reliability mode
    Chukwa’s checkpoint mechanism is grafted onto Flume
    Lower resource consumption in agents than Flume's E2E mode
    Minimizes the amount of log data retransmitted when an agent fails
 Rolling and deduplication
    Rolling fragmented log data periodically in Hadoop
    Removing duplicated log data in case of failover
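
The checkpoint and deduplication ideas above can be sketched roughly as follows. This is an illustrative Python model, not NDAP, Flume, or Chukwa code: the `Agent` and `Collector` classes, the offset-based checkpoint, and the `(source, offset)` dedup key are all assumptions made for the example. The agent remembers the next offset to send, so after a failure it resends only from the checkpoint, and the collector drops any event it has already stored.

```python
from dataclasses import dataclass, field


@dataclass
class Collector:
    """Hypothetical sink that removes duplicated events by (source, offset)."""
    seen: set = field(default_factory=set)
    stored: list = field(default_factory=list)

    def receive(self, source: str, offset: int, line: str) -> None:
        key = (source, offset)
        if key in self.seen:      # duplicate caused by a retransmission
            return
        self.seen.add(key)
        self.stored.append(line)


@dataclass
class Agent:
    """Hypothetical agent that checkpoints the next offset to send."""
    source: str
    lines: list
    checkpoint: int = 0

    def send(self, collector: Collector, upto: int, save_checkpoint: bool = True) -> None:
        # Always start from the checkpoint, never from the beginning of the log.
        for offset in range(self.checkpoint, upto):
            collector.receive(self.source, offset, self.lines[offset])
        if save_checkpoint:
            self.checkpoint = upto


def replay_after_failure() -> tuple:
    collector = Collector()
    agent = Agent("web01", ["a", "b", "c", "d", "e"])
    agent.send(collector, 2)                         # offsets 0-1 sent, checkpoint saved
    agent.send(collector, 4, save_checkpoint=False)  # offsets 2-3 sent, crash before checkpoint
    agent.send(collector, 5)                         # restart: resend from offset 2 only
    return collector.stored, agent.checkpoint
```

In this toy run only offsets 2 and 3 are retransmitted after the simulated crash, and the collector's dedup step keeps them from being stored twice.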


   [Figure: NDAP Collector architecture. Components shown: Flume Agent (Source, Decorator, Sink) with Checkpoint; Rolling/Dedup Manager (Rolling/Dedup MR, Scheduler, Workflow Manager); Zookeeper; MapReduce Execution; Data Store (Hadoop); Search. Flow label: "Log data & location".]



NDAP Search: Near Real-Time Indexing
 Near real-time indexing using RAM Index
   Adding RAM index for near real-time indexing in ElasticSearch
   Flushing RAM index into Disk index after a given time period or a buffer
   overflow
   When searching, both RAM index and disk index are examined
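
The RAM-index idea above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not ElasticSearch or Lucene code: `NearRealTimeIndex`, its parameters, and the use of a plain dictionary as the "disk" index are hypothetical. New documents are searchable as soon as they reach the RAM index, the RAM index is flushed when the buffer is full or a time period has elapsed, and a search examines both indexes.

```python
import time
from collections import defaultdict


class NearRealTimeIndex:
    """Toy inverted index: a RAM buffer that is flushed into a 'disk' index."""

    def __init__(self, max_buffered_docs=1000, flush_interval_sec=5.0, clock=time.monotonic):
        self.ram = defaultdict(set)    # term -> doc ids, searchable immediately
        self.disk = defaultdict(set)   # stands in for the on-disk index
        self.buffered_docs = 0
        self.max_buffered_docs = max_buffered_docs
        self.flush_interval_sec = flush_interval_sec
        self.clock = clock
        self.last_flush = clock()

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.ram[term].add(doc_id)
        self.buffered_docs += 1
        overflow = self.buffered_docs >= self.max_buffered_docs
        expired = self.clock() - self.last_flush >= self.flush_interval_sec
        if overflow or expired:
            self.flush()

    def flush(self):
        # Move everything from the RAM index into the disk index.
        for term, doc_ids in self.ram.items():
            self.disk[term] |= doc_ids
        self.ram.clear()
        self.buffered_docs = 0
        self.last_flush = self.clock()

    def search(self, term):
        term = term.lower()
        # .get avoids creating empty entries in the defaultdicts on lookup.
        return self.ram.get(term, set()) | self.disk.get(term, set())
```

The `clock` parameter exists only so the time-based flush can be exercised deterministically in a test.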



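   [Figure: near real-time indexing. Components shown: Indexer, IndexWriter, buffer, Disk Index, IndexReader, Searcher. Arrow labels: add, write, commit, create, search, read (from the buffer and from the Disk Index).]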


NDAP Search: Index Split Strategies
 Modifying ElasticSearch to add more index split schemes for log search
    Log searches are usually bounded by time, such as a day or a month
    Combining time-based index split and size-based index split
 Time-based index split
    Splitting an index according to a given time period
    Improving indexing and search performance
    Easy to implement auto-retention
 Size-based index split
    Splitting an index according to a given size
    Resolving a big index performance problem
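
One way to picture the combined split scheme is the sketch below. It is not the modified ElasticSearch code: the `SplitIndexRouter` class, the `YYYY.MM.DD-NNNN` naming, and the use of a document count as a stand-in for index size are assumptions for illustration. Each day gets its own partition, a new sequence is started when the current index is full, a time-bounded search touches only the matching partitions, and retention drops whole indices.

```python
from datetime import date, timedelta


def index_name(day: date, sequence: int) -> str:
    """Time partition plus size sequence, e.g. '2011.10.08-0001' (assumed format)."""
    return f"{day:%Y.%m.%d}-{sequence:04d}"


class SplitIndexRouter:
    def __init__(self, max_docs_per_index: int):
        self.max_docs = max_docs_per_index
        self.doc_counts = {}   # index name -> number of documents
        self.sequences = {}    # day -> current sequence number

    def route(self, day: date) -> str:
        """Return the index a new document for this day should go to."""
        seq = self.sequences.get(day, 1)
        name = index_name(day, seq)
        if self.doc_counts.get(name, 0) >= self.max_docs:
            seq += 1                       # size-based split: start a new sequence
            name = index_name(day, seq)
        self.sequences[day] = seq
        self.doc_counts[name] = self.doc_counts.get(name, 0) + 1
        return name

    def indices_for(self, start: date, end: date) -> list:
        """Indices that a search bounded by [start, end] has to examine."""
        names = []
        day = start
        while day <= end:
            for seq in range(1, self.sequences.get(day, 0) + 1):
                names.append(index_name(day, seq))
            day += timedelta(days=1)
        return names

    def expire_before(self, cutoff: date) -> list:
        """Auto-retention: drop every index whose day is older than cutoff."""
        dropped = []
        for day in sorted(d for d in self.sequences if d < cutoff):
            for seq in range(1, self.sequences.pop(day) + 1):
                name = index_name(day, seq)
                self.doc_counts.pop(name, None)
                dropped.append(name)
        return dropped
```

Because retention removes whole indices instead of deleting individual documents, it reduces to a lookup by date, which is why the time-based split makes auto-retention easy.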


   [Figure: index split layout. Search runs over time-based index partitions (2011.10.08, 2011.10.09, ... 2011.10.30); each partition is divided into size-based index sequences (0001, 0002, 0003); each sequence is stored as ElasticSearch index shards (labeled 0, 1, 3), each with one primary and two replicas.]


NDAP Admin Center: Distributed Coordinator
 Zookeeper-based distributed coordinator
    Zookeeper handles the coordination among NDAP components
    Patching several issues of Zookeeper and ZkClient
    Providing common libraries for NDAP components
      Group membership, master election, distributed lock, distributed queue
      Easier to use and more reliable than other recipes, especially for read-and-write problems
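
The group membership and master election recipes can be modeled in a few lines. The sketch below is an in-memory simulation only: it does not talk to Zookeeper, and `ElectionGroup` and its methods are hypothetical names, not the NDAP library. It follows the commonly used election rule in which every member registers with an increasing sequence number (as with ephemeral sequential nodes), the member with the lowest number is the master, and a member's entry disappears when its session ends.

```python
import itertools


class ElectionGroup:
    """In-memory stand-in for a Zookeeper election path (no real Zookeeper involved)."""

    def __init__(self):
        self._next_seq = itertools.count()
        self.members = {}   # sequence number -> member name

    def join(self, name):
        """Register a member and return its sequence number."""
        seq = next(self._next_seq)
        self.members[seq] = name
        return seq

    def leave(self, seq):
        """Models a session ending: the member's entry disappears."""
        self.members.pop(seq, None)

    def group(self):
        """Current group membership, oldest member first."""
        return [self.members[s] for s in sorted(self.members)]

    def master(self):
        """The member with the lowest sequence number, or None if the group is empty."""
        if not self.members:
            return None
        return self.members[min(self.members)]
```

A member that rejoins receives a new, higher sequence number, so it goes to the back of the line and does not reclaim mastership.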


   [Figure: distributed coordinator layers. Patched Zookeeper Ensemble; Zookeeper Client Thread and Patched ZkClient Thread (complex, unique, fragile); Zookeeper Recipes: Group Membership, Master Election, Distributed Lock, Distributed Queue (easy, reusable, fault tolerant); used by NDAP Search and NDAP Collector.]

NDAP Admin Center: System/App Management
 Collectd-based system and application monitoring
    Server resource monitoring: CPU, memory, disk, process, vmem, TCP connections, etc.
    Application monitoring: Hadoop, ElasticSearch, Flume, Zookeeper, Memcached, Collectd, etc.
    Plug-in architecture: add more applications such as NoSQL
 Resource-centric view
    Displaying every node's status for a specific resource (CPU, memory, etc.) on one screen
    Most system management tools (Ganglia, Nagios, etc.) offer a node-centric view
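
The difference between the two views is essentially a pivot of the same samples. The sketch below assumes monitoring samples arrive as a plain dictionary keyed by node; the function names and the data shape are hypothetical and are not part of Collectd or NDAP Admin. It turns node-centric samples into a resource-centric view and lists the nodes that exceed a threshold for one resource.

```python
from collections import defaultdict


def resource_centric(samples):
    """Pivot {node: {resource: value}} into {resource: {node: value}}."""
    view = defaultdict(dict)
    for node, resources in samples.items():
        for resource, value in resources.items():
            view[resource][node] = value
    return dict(view)


def over_threshold(view, resource, limit):
    """Nodes whose value for one resource exceeds the limit, highest value first."""
    nodes = view.get(resource, {})
    return sorted((n for n, v in nodes.items() if v > limit), key=lambda n: -nodes[n])


# Example node-centric samples (made-up values for illustration).
samples = {
    "node1": {"cpu": 35.0, "mem": 70.0},
    "node2": {"cpu": 92.0, "mem": 40.0},
    "node3": {"cpu": 88.0},
}
view = resource_centric(samples)
hot_cpu_nodes = over_threshold(view, "cpu", 80.0)
```

With the resource-centric view, a question such as "which nodes are short on CPU" is a single lookup instead of a scan over every node's page.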


   [Figure: monitoring flow. Components shown: Collectd Server (check threshold/severity), Management Dashboard, NDAP Admin.]


     Next Revolution
     Toward Open Platform                                                                        -45-

More Related Content

What's hot

Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKTaposh Roy
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)James Serra
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringDurga Gadiraju
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Carole Gunst
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceSnowflake Computing
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAPEDB
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015Apekshit Sharma
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 

What's hot (20)

Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Rdbms vs. no sql
Rdbms vs. no sqlRdbms vs. no sql
Rdbms vs. no sql
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a Service
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
RDD
RDDRDD
RDD
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
 

Viewers also liked

Big data analytics -hive
Big data analytics -hiveBig data analytics -hive
Big data analytics -hivekarthika karthi
 
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)Stéphane Fréchette
 
SnapLogic Big Data Integration
SnapLogic Big Data IntegrationSnapLogic Big Data Integration
SnapLogic Big Data IntegrationSnapLogic
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionMurtaza Doctor
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviationranjit banshpal
 
March Marketers: Research Trends Presentation
March Marketers: Research Trends PresentationMarch Marketers: Research Trends Presentation
March Marketers: Research Trends PresentationAlexandra Knoll
 
Hive contributors meetup apache sentry
Hive contributors meetup   apache sentryHive contributors meetup   apache sentry
Hive contributors meetup apache sentryBrock Noland
 
Hive Correlation Optimizer
Hive Correlation OptimizerHive Correlation Optimizer
Hive Correlation OptimizerYin Huai
 
Programming Hive Reading #4
Programming Hive Reading #4Programming Hive Reading #4
Programming Hive Reading #4moai kids
 
Hiveハンズオン
HiveハンズオンHiveハンズオン
HiveハンズオンSatoshi Noto
 
Big Data Analytics Using Hadoop
Big Data Analytics Using HadoopBig Data Analytics Using Hadoop
Big Data Analytics Using HadoopSrikanth VNV
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InSnapLogic
 
Programming Hive Reading #3
Programming Hive Reading #3Programming Hive Reading #3
Programming Hive Reading #3moai kids
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinityShashwat Shriparv
 
Join optimization in hive
Join optimization in hive Join optimization in hive
Join optimization in hive Liyin Tang
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Hive Object Model
Hive Object ModelHive Object Model
Hive Object ModelZheng Shao
 

Viewers also liked (20)

Big data analytics -hive
Big data analytics -hiveBig data analytics -hive
Big data analytics -hive
 
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
SnapLogic Big Data Integration
SnapLogic Big Data IntegrationSnapLogic Big Data Integration
SnapLogic Big Data Integration
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to Action
 
6.hive
6.hive6.hive
6.hive
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 
March Marketers: Research Trends Presentation
March Marketers: Research Trends PresentationMarch Marketers: Research Trends Presentation
March Marketers: Research Trends Presentation
 
Hive contributors meetup apache sentry
Hive contributors meetup   apache sentryHive contributors meetup   apache sentry
Hive contributors meetup apache sentry
 
Build & test Apache Hawq
Build & test Apache Hawq Build & test Apache Hawq
Build & test Apache Hawq
 
Hive Correlation Optimizer
Hive Correlation OptimizerHive Correlation Optimizer
Hive Correlation Optimizer
 
Programming Hive Reading #4
Programming Hive Reading #4Programming Hive Reading #4
Programming Hive Reading #4
 
Hiveハンズオン
HiveハンズオンHiveハンズオン
Hiveハンズオン
 
Big Data Analytics Using Hadoop
Big Data Analytics Using HadoopBig Data Analytics Using Hadoop
Big Data Analytics Using Hadoop
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
 
Programming Hive Reading #3
Programming Hive Reading #3Programming Hive Reading #3
Programming Hive Reading #3
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinity
 
Join optimization in hive
Join optimization in hive Join optimization in hive
Join optimization in hive
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hive Object Model
Hive Object ModelHive Object Model
Hive Object Model
 

Similar to Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data - Jason Han, NexR

Cloud Computing Service & Market Trends in Korea
Cloud Computing Service & Market Trends in KoreaCloud Computing Service & Market Trends in Korea
Cloud Computing Service & Market Trends in KoreaFanny Lee
 
Broadband World Forum 2012 Highlights
Broadband World Forum 2012 HighlightsBroadband World Forum 2012 Highlights
Broadband World Forum 2012 HighlightsAlan Quayle
 
WTSA-16_SG13_Presentation.pptx
WTSA-16_SG13_Presentation.pptxWTSA-16_SG13_Presentation.pptx
WTSA-16_SG13_Presentation.pptxlionofsouth
 
ITU-T Study Group 13 Introduction
ITU-T Study Group 13 IntroductionITU-T Study Group 13 Introduction
ITU-T Study Group 13 IntroductionITU
 
Colt SD-WAN experience learnings and future plans
Colt SD-WAN experience learnings and future plansColt SD-WAN experience learnings and future plans
Colt SD-WAN experience learnings and future plansColt Technology Services
 
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örnekleri
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım ÖrnekleriCDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örnekleri
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örneklerididemtopuz
 
Colt’s Carrier SDN & NFV: Experience, Learnings & Future Plans
Colt’s Carrier SDN & NFV: Experience, Learnings & Future PlansColt’s Carrier SDN & NFV: Experience, Learnings & Future Plans
Colt’s Carrier SDN & NFV: Experience, Learnings & Future PlansOpen Networking Summit
 
Gef 2012 InduSoft Presentation
Gef 2012  InduSoft PresentationGef 2012  InduSoft Presentation
Gef 2012 InduSoft PresentationAVEVA
 
GE Smallworld Network Inventory Overview
GE Smallworld Network Inventory OverviewGE Smallworld Network Inventory Overview
GE Smallworld Network Inventory Overviewcwilson5496
 
KT/KTDS Case Study: Open Source Database Adoption in Telecom
KT/KTDS Case Study: Open Source Database Adoption in TelecomKT/KTDS Case Study: Open Source Database Adoption in Telecom
KT/KTDS Case Study: Open Source Database Adoption in TelecomEDB
 
Low-Power Wide Area - Overview
Low-Power Wide Area - OverviewLow-Power Wide Area - Overview
Low-Power Wide Area - OverviewM2M Alliance e.V.
 
Mobile Networks - Evolving to all-IP Backbone
Mobile Networks - Evolving to all-IP BackboneMobile Networks - Evolving to all-IP Backbone
Mobile Networks - Evolving to all-IP BackboneHarry Mylonas
 
GE Smallworld Overview September2010
GE Smallworld Overview September2010GE Smallworld Overview September2010
GE Smallworld Overview September2010cwilson5496
 
Optimizing Cloud Computing with IPv6
Optimizing Cloud Computing with IPv6Optimizing Cloud Computing with IPv6
Optimizing Cloud Computing with IPv6John Rhoton
 
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj SoppannavarPurpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj SoppannavarData Con LA
 
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDBReal-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDBVoltDB
 

Similar to Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data - Jason Han, NexR (20)

Cloud Computing Service & Market Trends in Korea
Cloud Computing Service & Market Trends in KoreaCloud Computing Service & Market Trends in Korea
Cloud Computing Service & Market Trends in Korea
 
Broadband World Forum 2012 Highlights
Broadband World Forum 2012 HighlightsBroadband World Forum 2012 Highlights
Broadband World Forum 2012 Highlights
 
WTSA-16_SG13_Presentation.pptx
WTSA-16_SG13_Presentation.pptxWTSA-16_SG13_Presentation.pptx
WTSA-16_SG13_Presentation.pptx
 
ITU-T Study Group 13 Introduction
ITU-T Study Group 13 IntroductionITU-T Study Group 13 Introduction
ITU-T Study Group 13 Introduction
 
Javier Lecanda - Colt SDN/NFV Experience inca 201706
Javier Lecanda - Colt SDN/NFV Experience   inca 201706Javier Lecanda - Colt SDN/NFV Experience   inca 201706
Javier Lecanda - Colt SDN/NFV Experience inca 201706
 
Colt SD-WAN experience learnings and future plans
Colt SD-WAN experience learnings and future plansColt SD-WAN experience learnings and future plans
Colt SD-WAN experience learnings and future plans
 
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örnekleri
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım ÖrnekleriCDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örnekleri
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örnekleri
 
Colt’s Carrier SDN & NFV: Experience, Learnings & Future Plans
Colt’s Carrier SDN & NFV: Experience, Learnings & Future PlansColt’s Carrier SDN & NFV: Experience, Learnings & Future Plans
Colt’s Carrier SDN & NFV: Experience, Learnings & Future Plans
 
Gef 2012 InduSoft Presentation
Gef 2012  InduSoft PresentationGef 2012  InduSoft Presentation
Gef 2012 InduSoft Presentation
 
Infrastructure Strategies 2007
Infrastructure Strategies 2007Infrastructure Strategies 2007
Infrastructure Strategies 2007
 
GE Smallworld Network Inventory Overview
GE Smallworld Network Inventory OverviewGE Smallworld Network Inventory Overview
GE Smallworld Network Inventory Overview
 
Radisys offloading 10412_final
Radisys offloading 10412_finalRadisys offloading 10412_final
Radisys offloading 10412_final
 
KT/KTDS Case Study: Open Source Database Adoption in Telecom
KT/KTDS Case Study: Open Source Database Adoption in TelecomKT/KTDS Case Study: Open Source Database Adoption in Telecom
KT/KTDS Case Study: Open Source Database Adoption in Telecom
 
Low-Power Wide Area - Overview
Low-Power Wide Area - OverviewLow-Power Wide Area - Overview
Low-Power Wide Area - Overview
 
Mobile Networks - Evolving to all-IP Backbone
Mobile Networks - Evolving to all-IP BackboneMobile Networks - Evolving to all-IP Backbone
Mobile Networks - Evolving to all-IP Backbone
 
GE Smallworld Overview September2010
GE Smallworld Overview September2010GE Smallworld Overview September2010
GE Smallworld Overview September2010
 
Building a Digital Telco
Building a Digital TelcoBuilding a Digital Telco
Building a Digital Telco
 
Optimizing Cloud Computing with IPv6
Optimizing Cloud Computing with IPv6Optimizing Cloud Computing with IPv6
Optimizing Cloud Computing with IPv6
 
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj SoppannavarPurpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar
 
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDBReal-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19

Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data - Jason Han, NexR

  • 1. Next Revolution Toward Open Platform NYC 2011 www.nexr.com
  • 2. NexR Introduction. Big data analytics firm: working on Hadoop and big data for 5 years; provided a NexR Hadoop solution to all major Korean telcos (KT, SKT, LG U+); leading a Korean Hadoop community and holding Hadoop conferences. Products: NexR Data Analytics Platform (NDAP); iCube Cloud, a cloud computing platform (like OpenStack); massive email archiving solution (presented at Hadoop World 2009).
  • 3. Agenda. Voice of Customer: KT CDR Analysis System; KT requirements for system migration; NexR Data Analytics Platform (NDAP) overview; Oracle-to-Hive migration; Enterprise Hive; RHive; Lessons learned; Conclusion. You can download this presentation file: http://www.nexr.com/hw11/ndap.pdf
  • 4. Introduction to KT business. History: 1981.12 Establishment of KT Corporation; 2002.08 Privatization from government-owned company; 2006.04 Commercial launch of the world’s first Mobile WiMAX (WiBro); 2008.11 Commercial launch of real-time IPTV; 2009.06 Merged with KTF; 2010.06 Cloud service launched. Business domain: Mobile 2G, 3G; WiBro (Mobile WiMAX); Internet access; IPTV; VoIP; multimedia contents; local and international telephone; cloud service. Figures shown on the slide (values not captured in the transcript): number of employees, sales (2010), telephone subscribers, broadband subscribers, mobile subscribers.
  • 5. Introduction to KT CDR data
    • KT CDR (Call Detail Record), unit: TB
        Data                                           Raw data    Summary     Size (raw 1 yr
                                                       (1 month)   (1 month)   + summary 2 yrs)
        Wireless: unrated CDR (voice, data, SMS, MMS)  3.7         2.5         104
        Wireless: rated CDR                            1.5         0.2         22
        Wi-Fi                                          0.4         0.3         12
        WiBro                                          1.5         1.0         42
        Wireline: rated CDR                            1.5         1.5         55
        IPDR: IP-TV                                    1.5         0.1         19
        Total                                          10          5.6         254
    • KT Subscriber Analysis System (SAS) for wireless CDR
       – Reporting, call detail summary, subscriber’s call quality, call log search, etc
       – Implemented with a relational database on a high-end server: data gathering and converting in a server every tens of seconds; daily batch extract-transform-load (ETL) with SQL queries; near real-time search against an indexed column (call number)
       – Hundreds of DB tables and over 3000 SQL queries accumulated over 10 years
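The size column in the slide 5 table can be sanity-checked with a little arithmetic. The sketch below assumes that "raw 1 yr + summary 2 yrs" means 12 months of raw data plus 24 months of summary data; that reading, and the names `CDR_VOLUMES_TB`, `retained_size_tb`, and `size_report`, are illustrative assumptions rather than anything stated in the deck. Under that assumption the estimates land within about 1.5 TB of the stated figures, which is plausible given that the monthly values are rounded.

```python
# Monthly volumes in TB copied from the slide: (raw per month, summary per month, stated size).
CDR_VOLUMES_TB = {
    "Unrated CDR": (3.7, 2.5, 104),
    "Wireless rated CDR": (1.5, 0.2, 22),
    "Wi-Fi": (0.4, 0.3, 12),
    "WiBro": (1.5, 1.0, 42),
    "Wireline rated CDR": (1.5, 1.5, 55),
    "IP-TV": (1.5, 0.1, 19),
}

RAW_MONTHS = 12      # assumption: "1 yr" of raw data is kept
SUMMARY_MONTHS = 24  # assumption: "2 yrs" of summary data is kept


def retained_size_tb(raw_per_month, summary_per_month):
    """Estimate retained volume from monthly raw and summary volumes."""
    return raw_per_month * RAW_MONTHS + summary_per_month * SUMMARY_MONTHS


def size_report():
    """Return {data set: (estimated TB, TB stated on the slide)}."""
    return {
        name: (round(retained_size_tb(raw, summary), 1), stated)
        for name, (raw, summary, stated) in CDR_VOLUMES_TB.items()
    }


if __name__ == "__main__":
    for name, (estimated, stated) in size_report().items():
        print(f"{name}: estimated {estimated} TB, slide says {stated} TB")
```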
  • 6. Current KT CDR Analysis System Architecture. [Diagram. Components shown: data sources; collector server (data converting); relational database on a high-end server holding raw data, dimension tables, and summary tables, with batch ETL; real-time search; OLAP. Systems named: LALA2, NIBADA, ARGOS. Four points in the diagram are marked "Bottleneck".]
  • 7. New Challenges Faced • Increasing data volume – Popular demand for smartphones and SNS – Need to do more complicated data analysis to beat the competition – Customer behavior analysis is needed • Slow performance – Peak-time performance became unacceptable – Some CDRs were lost due to slow performance • KT cloud business launched – Cheaper new KT Cloud hardware is available – Open source requirements are increasing in the company. Can a traditional DB give us an answer?
  • 8. KT meets NexR for Big Data • Scalability – Coping with increasing data volume and variety (wired, 2G, 3G, WiMax, LTE, WiFi, SMS, MMS, etc) – Enabling horizontal scalability in every data path (data collection, data storage, ETL process, data search) • Performance – Handling streamed CDR data in near real-time – Completing daily ETL tasks in a given time period regardless of data increase • Cost-efficiency – Reducing cost with inexpensive equipment. Side box: replacing traditional RDB and DW with Hadoop and similar OSS; project start (2011.4); applying the NexR solution for CDR analysis (pilot).
  • 9. Continuous Journey for KT’s Big Data • Step-by-step approach with NexR. Steps and coverage: 2012.1Q, Open Hadoop CDR Analysis Platform (pilot): replacing representative data and SQLs; unrated wireless CDR. 2012, Wireless CDRs: change all traditional applications to OSS; add more views and reports. 2013, Data Integration and Advanced Analytics: rated CDRs, Internet access log, TV log; advanced analytics. 2014, External Data Sources: SNS, location, etc; data from KT subsidiaries.
  • 10. Rethinking KT’s Requirements. [Diagram on a past/present/future axis. Data: KT CDR system; data volume explosion; data integration and data variety (Internet access log, IPTV log, social data). Interface: SQL (hundreds of DB tables, 3000 SQLs for 10 years). Who: data engineers (DBA, SQL developers, business analysts using OLAP, SAS, etc).]
  • 11. Big Data Analytics Requirements for Enterprise. Data volume is only the basic requirement. Data integration is the fundamental requirement (structured data + unstructured data). Need to preserve the existing data and apps. Need to be familiar to enterprise data engineers (DBA, SQL developers, business analysts, etc). Smooth transition is also essential. What’s the solution?
  • 12. NexR Solution: Hadoop + Hive. Hive is the best solution for a smooth transition from the database world to the Hadoop world. Hive: ANSI-SQL-based query engine; good for RDB migration; batch data processing (ETL, reporting, ad-hoc query). Hadoop: common data storage; file-based data store; good for data integration. Two questions are posed: how to convert, and how to adapt (enterprise data engineers: DBA, SQL developers).
  • 13. NexR Data Analytics Platform (NDAP). Embracing database into Hadoop world: support for migrating data and logic from RDB to Hadoop; support for integrating RDB and Hadoop; Hive tools for DBA and SQL developers. Full package for big data analytics: from data collection to batch data processing, real-time query, and even advanced analytics. Leveraging open source technologies: horizontal scalability in every data processing path (collection, batch processing, real-time query, etc); injecting real-world practices through the collaboration with KT.
  • 14. NDAP Bird’s View (component, role, and the open source software used)
      NDAP RHive              Advanced analytics          Integration of R and Hive
      NDAP Enterprise Hive    Batch data processing       Oracle-to-Hive, Hive workflow, Hive performance monitor, query planner
      NDAP Data Store         Common data storage         HDFS, Sqoop-based data import/export
      NDAP Search             Real-time query             ElasticSearch-based distributed log search; time-ranged index sharding
      NDAP Collector          Streamed data collection    Flume-based data collector; checkpointing for low-overhead agents
      NDAP Admin Center       Coordination & management   Zookeeper-based distributed coordinator; Collectd-based system/app management
  • 15. NDAP Architecture. [Diagram. Data sources: Oracle, databases, ODS, telco equipment (streaming data). NexR Data Analytics Platform (NDAP): Collector, Data Importer, Data Exporter, Data Store (Hadoop), Search with a REST/JSON API, Enterprise Hive (Oracle-to-Hive; Lama Hive workflow; Hawk PerfMon and query plan), RHive, Admin Center. Applications: advanced analytics, DBA, ETL, ad-hoc query, reporting, existing BI/DW, OLAP server (OLAP, Oracle), real-time query.]
  • 16. NDAP Bird’s View – Today’s focus. Today’s talk: NDAP RHive (integration of R and Hive); NDAP Enterprise Hive (Oracle-to-Hive, Hive workflow, Hive performance monitor, query planner). Refer to appendix: NDAP Data Store (HDFS, Sqoop-based data import/export); NDAP Search (Lucene-based distributed log search engine, time-ranged index sharding); NDAP Collector (Flume-based data collector, checkpointing for low-overhead agents); NDAP Admin Center (Zookeeper-based distributed coordinator, Collectd-based system/app management).
  • 17. Enterprise Hive: Recreating Hive for Enterprise Data Engineers. Two goals: (1) migration of data and SQL from RDB (Oracle) to Hive, through Oracle-to-Hive support; (2) a rich environment for Hive developers, and even DW/BI teams and DBAs, through a performance monitor, query planner, and workflow manager.
  • 18. Is Oracle-to-Hive trivial?
    Simple example (Oracle, then Hive):
      SELECT * FROM Employee e1, Dept d1 WHERE e1.ID = d1.Id
      SELECT * FROM Employee e1 JOIN Dept d1 ON (e1.ID = d1.Id)
    Typical example (Oracle):
      SELECT /*+ PARALLEL(K1 16) USE_NL(K1 B) */
             ETL_DATE, CALL_DATE,
             CASE WHEN SUBSCRIBER_TYPE = 'PREMIUM' THEN 'Y'
                  ELSE NVL(TO_CHAR(B.I_NCN), 'X') END AS I_NCN,
             I_INOUT, VALID_CNT, I_CFC_TYPE, ……
      FROM 3G_CALL_LOG K1, SASCOMM.PHONE_MAPPING B
      WHERE K1.i_etl_dt = TO_DATE('[#SAS_YDAY#]','YYYYMMDD')
        AND K1.i_call_dt ||'' >= TO_DATE('[#SAS_YDAY#]','YYYYMMDD')
        AND K1.i_call_dt ||'' < TO_DATE('[#SAS_YDAY#]','YYYYMMDD') + 1
        AND K1.I_INOUT in ('0','1')
        AND DECODE(K1.I_INOUT,'0',NVL(K1.I_OUT_CTN, I_CALLING_NUM),'1',K1.I_IN_CTN) = B.I_CTN(+)
        AND K1.CALL_DATE >= B.SDATE(+)
        AND K1.CALL_DATE < B.EDATE(+);
  • 19. Enterprise Hive – Oracle-to-Hive. Enhancing Hive by: fixing Hive code (JIRA issues 2253, 2503, 2329, 2332, etc); adding Hive UDFs and UDAFs for Oracle compatibility. Enterprise Hive provides: conversion rules, a guide, and a process; Oracle data types that are not supported in Hive; Oracle functions that are not supported in Hive. Three conversion points to consider: data model and data types; basic functions, aggregate and analytic functions; SQL syntax.
  • 20. Oracle-to-Hive – Data Model, Types, Functions
    Data model
      Oracle        Hive
      Table         Table
      Partition     Partition
      Sampling      Bucket
    Data types (Hive data types are designed to be converted into Java data types)
      Oracle        Hive
      NUMBER(n)     TINYINT / INT / BIGINT
      NUMBER(n,m)   FLOAT / DOUBLE
      VARCHAR2      STRING
      DATE          STRING in "yyyy-MM-dd HH:mm:ss" format
    Basic functions (Hive follows MySQL function syntax)
      Function type   Oracle                                         Hive
      Math            round, ceil, mod, power, sqrt, sin/cos         round, ceil, pmod, power, sqrt, sin/cos
      Character       substr, trim, lpad/rpad, ltrim/rtrim, replace  substr, trim, lpad/rpad, ltrim/rtrim, regexp_replace
      Null            coalesce, nvl, nvl2                            coalesce (no nvl, nvl2)
    Added basic functions (Hive UDFs)
      Function type     Hive
      Condition         DECODE, GREATEST
      Null              NVL, NVL2
      Type conversion   TO_NUMBER, TO_CHAR, TO_DATE, INSTR4, DATE_FORMAT, LAST_DAY
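Slide 20 lists Oracle functions (NVL, NVL2, DECODE) that were added to Hive as UDFs. The sketch below illustrates the same idea of registering compatibility functions in an engine that lacks them. It uses Python's built-in `sqlite3` as a stand-in, not Hive, and it is not NexR's implementation; the helper name `connect_with_oracle_udfs` is hypothetical. The function semantics follow Oracle's documented behavior, except that no implicit type conversion is attempted.

```python
import sqlite3


def nvl(value, default):
    """Oracle NVL: return `default` when `value` is NULL."""
    return default if value is None else value


def nvl2(value, if_not_null, if_null):
    """Oracle NVL2: pick a result depending on whether `value` is NULL."""
    return if_null if value is None else if_not_null


def decode(expr, *args):
    """Oracle DECODE: (search, result) pairs, then an optional default."""
    if len(args) % 2 == 0:
        pairs, default = args, None
    else:
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        # None == None is True here, mirroring DECODE treating two NULLs as equal.
        if expr == search:
            return result
    return default


def connect_with_oracle_udfs():
    """Open an in-memory SQLite database with the three functions registered."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("nvl", 2, nvl)
    conn.create_function("nvl2", 3, nvl2)
    conn.create_function("decode", -1, decode)  # -1 accepts any number of arguments
    return conn


if __name__ == "__main__":
    conn = connect_with_oracle_udfs()
    print(conn.execute(
        "SELECT nvl(NULL, 'X'), decode('1', '0', 'out', '1', 'in', 'other')"
    ).fetchone())
```

With the functions registered, a query written against the Oracle names runs unchanged, which is the property the migration relies on.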
  • 21. Oracle-to-Hive – SQL Syntax & Analytic Functions
    Most unsupported Oracle SQL syntax can be converted using join syntax.
    IN subquery
      Oracle: SELECT * FROM Employee e WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d)
      Hive:   SELECT * FROM Employee e LEFT SEMI JOIN Dept d ON (e.DeptNo = d.DeptNo)
    NOT IN subquery
      Oracle: SELECT * FROM Employee e WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d)
      Hive:   SELECT e.* FROM Employee e LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo) WHERE d.DeptNo IS NULL
    JOIN
      Oracle: SELECT * FROM Employee e1, Dept d1 WHERE e1.ID = d1.Id
      Hive:   SELECT * FROM Employee e1 JOIN Dept d1 ON (e1.ID = d1.Id)
    RANK (analytic function)
      Oracle: SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM emp
      Hive:   SELECT e.name, e.dept, e.salary, RANK(e.dept, e.salary) FROM (SELECT name, dept, salary FROM emp DISTRIBUTE BY dept SORT BY dept, salary DESC) e
    MIN (aggregate function used analytically)
      Oracle: SELECT dept, MIN(salary) OVER (PARTITION BY dept) FROM emp
      Hive:   SELECT emp.dept, tmp.m FROM emp JOIN (SELECT dept, MIN(salary) m FROM emp GROUP BY dept) tmp ON emp.dept = tmp.dept
    Oracle analytic functions are sometimes used for statistical processing (5% of queries in the KT case). Some analytic functions were implemented: RANK, DENSE_RANK, ROW_NUMBER, LAG, MIN, MAX, SUM.
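The subquery-to-join rewrites in slide 21 can be checked on a toy data set. The sketch below again uses Python's built-in `sqlite3` as a stand-in for both Oracle and Hive, so it only demonstrates that the rewritten forms return the same rows; the table contents and the helper names `build_demo_db` and `rows` are made up for illustration. SQLite has no LEFT SEMI JOIN, so the IN case is rewritten as a plain JOIN, which matches here only because DeptNo is unique in Dept.

```python
import sqlite3


def build_demo_db():
    """Create a tiny Employee/Dept schema with no NULL keys."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Employee (Name TEXT, DeptNo INTEGER);
        CREATE TABLE Dept (DeptNo INTEGER, DName TEXT);
        INSERT INTO Employee VALUES ('kim', 10), ('lee', 20), ('park', 30), ('choi', 40);
        INSERT INTO Dept VALUES (10, 'sales'), (20, 'network');
    """)
    return conn


# Oracle-style NOT IN subquery and its outer-join rewrite.
NOT_IN_SUBQUERY = """
    SELECT e.Name FROM Employee e
    WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d)
    ORDER BY e.Name
"""
ANTI_JOIN = """
    SELECT e.Name FROM Employee e
    LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo)
    WHERE d.DeptNo IS NULL
    ORDER BY e.Name
"""

# Oracle-style IN subquery and a join rewrite (a plain JOIN replaces LEFT SEMI JOIN).
IN_SUBQUERY = """
    SELECT e.Name FROM Employee e
    WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d)
    ORDER BY e.Name
"""
INNER_JOIN = """
    SELECT e.Name FROM Employee e
    JOIN Dept d ON (e.DeptNo = d.DeptNo)
    ORDER BY e.Name
"""


def rows(conn, sql):
    """Run a query and return the first column of every row."""
    return [r[0] for r in conn.execute(sql)]


if __name__ == "__main__":
    conn = build_demo_db()
    print(rows(conn, NOT_IN_SUBQUERY), rows(conn, ANTI_JOIN))
    print(rows(conn, IN_SUBQUERY), rows(conn, INNER_JOIN))
```

One caveat worth keeping in mind when applying the NOT IN rule: in standard SQL, if the subquery returns a NULL, NOT IN yields no rows at all, while the outer-join form still returns the unmatched rows, so the two are equivalent only when the subquery column contains no NULLs.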
  • 22. Oracle-to-Hive – Example
    Oracle:
      select /*+ use_nl(E emp_idx1) */ D.dname, E.empno, E.ename,
             decode(nvl(JOB, 'SALESMAN'), 'SALESMAN', sal, 0) sal,
             RANK() over (PARTITION BY D.deptno ORDER BY sal desc) ranking
      from dept D, emp E
      where D.deptno = E.deptno
        and E.ename in (select ename from bonus where job in ('SALESMAN', 'CLERK'));
    Hive (nexr_rank is a UDF added by Enterprise Hive):
      select X.dname, X.empno, X.ename,
             nexr_rank(HASH(X.deptno, X.sal), X.sal) ranking
      from (
        select D.dname, D.deptno, E.empno, E.ename,
               (case coalesce(E.JOB, 'SALESMAN') when 'SALESMAN' then sal else 0 end) sal
        from dept D join emp E on (D.deptno = E.deptno)
             join bonus B on (E.ename = B.ename)
        where B.job in ('SALESMAN', 'CLERK')
      ) X
      distribute by hash(X.deptno, X.sal) sort by X.deptno, X.sal;
  • 23. NDAP Process for RDB Migration. Four stages: (1) Data preparation: Oracle schema to Hive schema; data loading to Hive using Sqoop. (2) Conversion: function conversion; SQL conversion (syntactically, by 1-to-1 conversion rules). (3) Validation: data compatibility check. (4) Optimization: rewriting Hive queries semantically (when more performance is needed). The case of KT CDR migration: chose 100 representative SQLs for ETL and converted them successfully; current step: the 200-300 most-used SQLs; next step (2012): 3000 SQLs.
  • 24. Enterprise Hive – Rich Environment for Hive. Building up a Hive ecosystem by adding assistant programs to help DBAs and SQL developers. Enterprise Hive provides: Hive performance monitor and query planner; Hive workflow manager.
  • 25. Hawk – Hive Performance Monitor. Difficulty of Hive performance diagnostics: metrics and logs from Hive and Hadoop are separated; there is no historical or statistical view of performance. Hawk performance monitor for DBA: integrated view of a Hive query and the corresponding MapReduce jobs; hourly/daily/weekly/monthly performance views of each query. [Figures: Hawk screenshot; Hawk architecture.]
  • 26. Hawk – Hive Query Planner. Difficulty of the Hive default query planner: too complicated because it shows the details of MapReduce execution; aimed at Hive internal developers, not DBAs. Hawk query planner for DBA: displays a Hive query at the HQL operator level (familiar to DBAs); shows the performance result together with the query. [Figures: Hive default query planner; Hawk query planner; performance result.]
  • 27. Lama – Hive Workflow Manager. Workflow development and management tool for Hive: managing data processing jobs for Hive; Oozie chosen as the core workflow engine; web-based interface (workflow editing and management, user management, job scheduling, project management, etc). On-demand workflow change at runtime: a workflow needs to be fixed and resumed at runtime after a failure; this is not supported in most workflow engines; Oozie was patched for suspend/resume per action (i.e., per Hive query). Future plan: supporting other data processing jobs like Pig, Sqoop, MapReduce, HDFS, SSH, and Java.
  • 28. NDAP Process for Batch Data Processing. Four stages: (1) Analysis: analyze the service request (SR); Hive data and query modeling. (2) Development: workflow development, testing, and validation. (3) Execution: workflow deployment and scheduling; workflow suspend/fix/resume in failure. (4) Management: performance diagnostics and optimization; performance monitoring. Tools: Lama Workflow Manager, Hawk Performance Monitor, Hawk Query Planner.
  • 29. Enterprise Hive Demo
  • 30. R for Advanced Analytics. R (GNU open source): programming language and software environment for statistical computing and graphics (Wikipedia); 4,000+ R libraries (more than SAS’s functionality); becoming a de facto standard among statisticians. R for big data: R runs on a single node; some parallel R packages exist (snowfall, rpvm, Rmpi, etc); recent attempts to combine R and Hadoop: RHIPE (Purdue), RHadoop (Revolution Analytics), Ricardo (IBM).
  • 31. RHive: Marrying R and Hive for Big Data Analytics. Most R programmers are familiar with SQL. Hive can hide the details of Hadoop and MapReduce. Inspired by IBM Ricardo (R + Jaql). R: strong for deep analytics, lacking in massive data manipulation. Hive: strong for massive data manipulation, lacking in analytical functionality. RHive provides Hive interfaces in the R environment, allowing R programmers to use familiar SQL for big data manipulation. Released as open source (Apache License, Version 2). Source: https://github.com/nexr/RHive CRAN: http://cran.r-project.org/web/packages/RHive
  • 32. RHive API and Architecture. RHive API: rhive.connect(): connect R to Hive; rhive.query(): send a Hive query and return the result; rhive.export(): export R functions to R processes running on the MapReduce nodes; rhive.exportAll(): export R functions and R objects to R processes running on the MapReduce nodes; rhive.close(): close a Hive connection. [Figure: RHive architecture.]
  • 33. RHive Sample – Flight Delay Prediction  R: Building a prediction model of flight delay using linear regression with a training data set (sampled from Hive)  Hive: Running the prediction model(R objects) with an entire data set in Hive 1 library(RHive) 2 rhive.connect("127.0.0.1") 3 4 # get a training data set from Hive 5 trainset <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines",fetchsize=30,limit=100) 6 7 # convert to numeric, and extract out missing values 8 trainset$arrdelay <- as.numeric(trainset$arrdelay) 9 trainset$distance <- as.numeric(trainset$distance) 10 trainset <- trainset[!(is.na(trainset$arrdelay) | is.na(trainset$distance)),] Data set: airline on-time performance 11 http://stat-computing.org/dataexpo/2009/ 12 # create a prediction model using R model objects and internal funtions • Flight arrival and departure details for 13 model <- lm(arrdelay ~ distance + dayofweek,data=trainset) all commercial flights within the USA, 14 rhpredict <- function(arg1,arg2,arg3) { from October 1987 to April 2008. 15 if(arg1 == "NULL" | arg2 == "NULL" | arg3 == "NULL") 16 return(0.0) 17 res <- predict.lm(model, data.frame(dayofweek=arg1, arrdelay=arg2, distance=arg3)) 18 return(as.numeric(res)) 19 } 20 null <- "NULL" 21 22 # set up R objects in Hive 23 rhive.assign("null", null) 24 rhive.assign("rhpredict", rhpredict) 25 rhive.assign("model", model) 26 27 # export the R prediction model and run it in Hive 28 rhive.exportAll("rhpredict", c("10.1.3.2","10.1.3.3","10.1.3.4","10.1.3.5","10.1.3.6","10.1.3.7")) 29 rhive.query("create table delaypredict as select R('rhpredict', dayofweek, arrdelay, distance, 0.0) from airlines") Next Revolution Toward Open Platform -33-
  • 34. RHive Demo
  • 35. Lessons Learned
      RDB migration to open source is complicated, time-consuming, and labor-intensive.
      - It becomes practical with experience and a defined migration process.
      - Average time to convert one query (200 to 300 lines on average): 8 hours at the start, 2 hours after 4 months (4 times faster)
      - Engineers with prior database migration experience have an advantage (the work is similar to an Oracle-to-MySQL migration).
      Current data engineers are not familiar with open source software like Hadoop.
      - They want to use software tools similar to the ones they already use.
      - Open source software such as Hadoop and MapReduce is not easy for current IT managers.
      Open source software is technology-driven, not demand-driven.
      - Open source technologies need to be wrapped in familiar interfaces that hide the details.
  • 36. Lessons Learned
      Open source software is not a panacea.
      - Choosing the right open source software is the first significant step.
      Combining several OSS projects is common, and modifying OSS source code is inevitable if requirements are not negotiable.
      - Two separate open source projects, Hive and ElasticSearch, were combined for batch data processing and real-time query, with Hadoop as the common data store.
      - Changes were made to Hive, ElasticSearch, Flume, Oozie, Zookeeper, etc.
      The integration of various types of data is a critical issue for an enterprise.
      - In particular, structured data from databases and the DW needs to be coupled with unstructured data in order to better understand customers' needs.
      It is necessary to embrace current data and business logic in a new environment.
      - RDB/DW and Hadoop each have pros and cons, so it is necessary to find the right mix.
  • 37. Conclusion
      - Big data analytics for telco and enterprises
      - Smooth transition from RDB/DW to NexR Data Analytics Platform (NDAP)
  • 38. NexR NDAP Team
      Jaesun Han, Wonkuk Yang, Sangmin Kwak, Sebong Oh, JeongMin Kwon, SungHan Woo, Keumju Kim, Dongmin Yu, Daegeun Kim, Choonghyun Ryu, Minseok Kim, Bokju Yun, Minwoo Kim, Jonghee Lee, Yeonseop Kim, HyungJoo Lim, Youngwoo Kim, HeeWon Jeon, Hyeon-Cheol Nah, GooBum Jung, SeungWoo Ryu, Sunghwan Cho, Seoeun Park, Junho Cho, Young-Geun Park, ByungMyon Chae, Eun-Sook Park, Yungtai Choi, Chihoon Byun, Choi Jong-wook, SeongHwa Ahn, Inho Han, Youngbae An, Seonghak Hong
  • 39. Thank you
      Presentation file: http://www.nexr.com/hw11/ndap.pdf
      Contact: jason.han@nexr.com, twitter: @jaesun_han
      Sections: KT CDR System (Slide 4), NDAP Overview (Slide 14), Enterprise Hive (Slide 17), RHive (Slide 30), Appendix (Slide 37)
  • 41. NDAP Collector
      Flume-based scalable data collector
      - Choosing Flume due to the flexible architecture (source, decorator, sink)
      - Adding a checkpoint mode and rolling/dedup
      Adding a checkpoint reliability mode
      - Chukwa's checkpoint is grafted onto Flume
      - Less resource consumption in agents than Flume E2E mode
      - Minimizing the amount of log data retransmitted when an agent fails
      Rolling and deduplication
      - Rolling fragmented log data periodically in Hadoop
      - Removing duplicated log data in case of failover
      [Diagram labels: Flume Agent (Source, Decorator, Sink), Checkpoint, Log data & location, Rolling/Dedup Manager, Zookeeper, Workflow Scheduler, MapReduce Execution Manager, Rolling/Dedup MR, Data Store (Hadoop), Search]
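To make the checkpoint and dedup ideas on this slide concrete, here is a minimal in-memory sketch. It does not use Flume's or Chukwa's actual APIs: `Collector`, `send_from_checkpoint`, `dedup`, and the failure injection are hypothetical names invented for the illustration, and the "checkpoint" is simply the offset of the next unacknowledged log line.

```python
class Collector:
    """Stores received events and tracks the last acknowledged offset."""

    def __init__(self):
        self.stored = []      # events landed in the data store, possibly duplicated
        self.checkpoint = 0   # offset of the next line the agent must send

    def receive(self, offset, line, ack=True):
        self.stored.append((offset, line))
        if ack:
            self.checkpoint = offset + 1


def send_from_checkpoint(log_lines, collector, fail_before_ack_at=None):
    """Send lines starting at the checkpoint; optionally crash before one ack."""
    for offset in range(collector.checkpoint, len(log_lines)):
        if offset == fail_before_ack_at:
            # The line was delivered, but the agent died before the ack.
            collector.receive(offset, log_lines[offset], ack=False)
            return False
        collector.receive(offset, log_lines[offset])
    return True


def dedup(events):
    """Keep the first copy of each offset, preserving order."""
    seen = set()
    unique = []
    for offset, line in events:
        if offset not in seen:
            seen.add(offset)
            unique.append((offset, line))
    return unique


log_lines = ["a", "b", "c", "d"]
collector = Collector()

# First attempt crashes at offset 2; only that line is left unacknowledged.
first_ok = send_from_checkpoint(log_lines, collector, fail_before_ack_at=2)
# After restart the agent resumes from the checkpoint, not from the beginning.
second_ok = send_from_checkpoint(log_lines, collector)

clean = dedup(collector.stored)
```

Resuming from the checkpoint retransmits one line instead of the whole log, and the later dedup pass removes the single duplicate that the failover produced.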
  • 42. NDAP Search: Near Real-Time Indexing
      Near real-time indexing using a RAM index
      - Adding a RAM index for near real-time indexing in ElasticSearch
      - Flushing the RAM index into the disk index after a given time period or on buffer overflow
      - When searching, both the RAM index and the disk index are examined
      [Diagram labels: Indexer (add), Searcher (search), IndexWriter (create, write, commit), IndexReader (read), buffer, Disk Index]
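The three bullets on this slide describe a simple contract: new documents are searchable from a RAM buffer immediately, the buffer is flushed to the disk index on a timer or when it fills, and a search consults both. The sketch below is a toy model of that contract only; it is not ElasticSearch or Lucene code, and `NearRealTimeIndex` and its parameters are hypothetical.

```python
import time


class NearRealTimeIndex:
    """Toy inverted index with a RAM buffer in front of a 'disk' index."""

    def __init__(self, max_buffered_docs=1000, flush_interval_s=5.0, clock=time.monotonic):
        self.ram = {}    # term -> set of doc ids, searchable immediately
        self.disk = {}   # term -> set of doc ids, stand-in for the committed index
        self.buffered = 0
        self.max_buffered_docs = max_buffered_docs
        self.flush_interval_s = flush_interval_s
        self.clock = clock
        self.last_flush = clock()

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.ram.setdefault(term, set()).add(doc_id)
        self.buffered += 1
        overflow = self.buffered >= self.max_buffered_docs
        expired = self.clock() - self.last_flush >= self.flush_interval_s
        if overflow or expired:
            self.flush()

    def flush(self):
        """Merge the RAM index into the disk index and reset the buffer."""
        for term, doc_ids in self.ram.items():
            self.disk.setdefault(term, set()).update(doc_ids)
        self.ram.clear()
        self.buffered = 0
        self.last_flush = self.clock()

    def search(self, term):
        """Examine both indexes, as the slide describes."""
        term = term.lower()
        return self.ram.get(term, set()) | self.disk.get(term, set())


index = NearRealTimeIndex(max_buffered_docs=2, flush_interval_s=3600)
index.add(1, "error disk full")
hits_before_flush = index.search("error")   # served from the RAM index
index.add(2, "error timeout")               # second doc overflows the buffer
index.add(3, "timeout again")               # stays in RAM until the next flush
hits_after_flush = index.search("timeout")  # combines disk and RAM results
```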
  • 43. NDAP Search: Index Split Strategies
      Modifying ElasticSearch to add more index split schemes for log search
      - Log searches are usually bounded by a time range, such as a day or a month
      - Combining time-based index split and size-based index split
      Time-based index split
      - Splitting an index according to a given time period
      - Improving indexing and search performance
      - Easy to implement auto-retention
      Size-based index split
      - Splitting an index according to a given size
      - Resolving the performance problem of a very large index
      [Diagram labels: Time-based Index Partitions (2011.10.08, 2011.10.09, 2011.10.30), Size-based Index Sequences (0001, 0002, 0003), ElasticSearch Index Shards (Primary, Replica, Replica), Search]
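One way to picture the combined scheme is as an index-naming rule: the date picks the partition, and a counter rolls over to a new sequence when the current one reaches a size limit. The sketch below assumes a `<date>-<sequence>` name format and a document-count limit; both are illustrative choices, and `IndexSplitter` is a hypothetical helper, not part of ElasticSearch or NDAP.

```python
from datetime import date

PARTITION_FORMAT = "%Y.%m.%d"  # matches the partition labels shown on the slide


class IndexSplitter:
    """Assigns documents to time-based partitions split into size-based sequences."""

    def __init__(self, max_docs_per_index):
        self.max_docs_per_index = max_docs_per_index
        self.doc_counts = {}  # partition -> list of document counts, one per sequence

    def index_for(self, day):
        """Return the index name that the next document of this day goes to."""
        partition = day.strftime(PARTITION_FORMAT)
        counts = self.doc_counts.setdefault(partition, [0])
        if counts[-1] >= self.max_docs_per_index:
            counts.append(0)  # size-based split: start the next sequence
        counts[-1] += 1
        return f"{partition}-{len(counts):04d}"

    def indices_between(self, start, end):
        """Only indexes inside the time range need to be searched."""
        first = start.strftime(PARTITION_FORMAT)
        last = end.strftime(PARTITION_FORMAT)
        names = []
        for partition in sorted(self.doc_counts):
            if first <= partition <= last:
                sequences = len(self.doc_counts[partition])
                names.extend(f"{partition}-{seq:04d}" for seq in range(1, sequences + 1))
        return names

    def drop_before(self, cutoff):
        """Auto-retention: discard whole partitions older than the cutoff."""
        limit = cutoff.strftime(PARTITION_FORMAT)
        expired = sorted(p for p in self.doc_counts if p < limit)
        for partition in expired:
            del self.doc_counts[partition]
        return expired


splitter = IndexSplitter(max_docs_per_index=2)
oct8_names = [splitter.index_for(date(2011, 10, 8)) for _ in range(3)]
oct9_name = splitter.index_for(date(2011, 10, 9))
searched = splitter.indices_between(date(2011, 10, 9), date(2011, 10, 9))
expired = splitter.drop_before(date(2011, 10, 9))
```

Because the zero-padded date strings sort chronologically, both the range query and the retention step reduce to plain string comparisons on partition names.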
  • 44. NDAP Admin Center: Distributed Coordinator
      Zookeeper-based distributed coordinator
      - Zookeeper handles the coordination among NDAP components
      - Patching several issues of Zookeeper and ZkClient
      Providing common libraries for NDAP components
      - Group membership, master election, distributed lock, distributed queue
      - Easy to use and more reliable than other recipes, especially for read-and-write problems
      [Diagram labels: Patched Zookeeper Ensemble; Zookeeper Client Thread (Complex, Unique, Fragile); Patched ZkClient Thread; Zookeeper Recipes: Group Membership, Master Election, Distributed Lock, Distributed Queue (Easy, Reusable, Fault Tolerant); NDAP Search; NDAP Collector]
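Group membership and master election are commonly built on ZooKeeper's ephemeral sequential nodes: each member registers a node, the member holding the lowest sequence number acts as master, and a member's node disappears when its session ends. The sketch below simulates that rule in memory; it does not talk to ZooKeeper, is not the NDAP library, and `ElectionGroup` and the member names are hypothetical.

```python
class ElectionGroup:
    """In-memory stand-in for a directory of ephemeral sequential nodes."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> member name

    def join(self, name):
        """Register a member and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = name
        return seq

    def leave(self, seq):
        """Simulates a session ending, which removes the ephemeral node."""
        self._members.pop(seq, None)

    def members(self):
        """Group membership: live members in join order."""
        return [self._members[seq] for seq in sorted(self._members)]

    def master(self):
        """Master election: the live member with the lowest sequence number."""
        if not self._members:
            return None
        return self._members[min(self._members)]


group = ElectionGroup()
first = group.join("collector-1")
second = group.join("collector-2")
third = group.join("search-1")

master_before = group.master()
group.leave(first)              # the current master fails
master_after = group.master()   # the next-lowest member takes over
live_members = group.members()
```

Sequence numbers are never reused, so a member that rejoins after a failure lines up behind the members that stayed alive.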
  • 45. NDAP Admin Center: System/App Management
      Collectd-based system and application monitoring
      - Server resource monitoring: CPU, memory, disk, process, vmem, TCP connections, etc.
      - Application monitoring: Hadoop, ElasticSearch, Flume, Zookeeper, Memcached, Collectd, etc.
      - Plug-in architecture: more applications, such as NoSQL stores, can be added
      Resource-centric view
      - Displaying the status of one specific resource (CPU, memory, etc.) for all nodes on a single screen
      - Most system management tools (Ganglia, Nagios, etc.) offer a node-centric view
      [Diagram labels: Server, Collectd, Check threshold/severity, NDAP Admin, Management Dashboard]
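The difference between a node-centric and a resource-centric view is essentially a pivot of the same samples: instead of listing every metric for one node, list one metric for every node and grade it against thresholds. A minimal sketch follows; the sample values, the threshold numbers, and the function names are made up for illustration and do not come from Collectd or NDAP.

```python
def resource_view(samples, resource):
    """Pivot node-centric samples into one resource across all nodes."""
    return {
        node: metrics[resource]
        for node, metrics in sorted(samples.items())
        if resource in metrics
    }


def severity(value, warn, critical):
    """Grade a reading against two thresholds."""
    if value >= critical:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


# Hypothetical node-centric samples: node -> {resource: percent used}.
samples = {
    "node1": {"cpu": 35.0, "mem": 62.0},
    "node2": {"cpu": 91.0, "mem": 40.0},
    "node3": {"cpu": 78.0},
}

cpu_view = resource_view(samples, "cpu")
cpu_status = {
    node: severity(value, warn=75.0, critical=90.0)
    for node, value in cpu_view.items()
}
mem_view = resource_view(samples, "mem")
```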