Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data - Jason Han, NexR

"This session will focus on the challenges of replacing existing Relational DataBase and Data Warehouse technologies with Open Source components. Jason Han will base his presentation on his experience migrating Korea Telecom (KT’s) CDR data from Oracle to Hadoop, which required converting many Oracle SQL queries to Hive HQL queries. He will cover the differences between SQL and HQL; the implementation of Oracle’s basic/analytics functions with MapReduce; the use of Sqoop for bulk loading RDB data into Hadoop; and the use of Apache Flume for collecting fast-streamed CDR data. He’ll also discuss Lucene and ElasticSearch for near-realtime distributed indexing and searching. You’ll learn tips for migrating existing enterprise big data to open source, and gain insight into whether this strategy is suitable for your own data.

Published in: Technology

  1. Next Revolution Toward Open Platform, NYC 2011, www.nexr.com
  2. NexR Introduction
     • Big data analytics firm
       – Working on Hadoop and big data for 5 years
       – Provided a NexR Hadoop solution to all major Korean telcos (KT, SKT, LG U+)
       – Leading a Korean Hadoop community and holding Hadoop conferences
     • Products
       – NexR Data Analytics Platform (NDAP)
       – iCube Cloud: cloud computing platform (like OpenStack)
       – Massive email archiving solution (presented at Hadoop World 2009)
  3. Agenda
     • Voice of Customer: KT CDR Analysis System
     • KT requirements for system migration
     • NexR Data Analytics Platform (NDAP) overview
     • Oracle-to-Hive migration
     • Enterprise Hive
     • RHive
     • Lessons learned
     • Conclusion
     You can download this presentation file: http://www.nexr.com/hw11/ndap.pdf
  4. Introduction to KT business
     • Timeline
       – 1981.12 Establishment of KT Corporation
       – 2002.08 Privatization from government-owned company
       – 2006.04 Commercial launch of the world's first Mobile WiMAX (WiBro)
       – 2008.11 Commercial launch of real-time IPTV
       – 2009.06 Merged with KTF
       – 2010.06 Cloud service launched
     • Business domain
       – Mobile 2G, 3G
       – WiBro (Mobile WiMAX)
       – Internet access
       – IPTV
       – VoIP
       – Multimedia contents
       – Local and international telephone
       – Cloud service
     • Key figures shown on the slide: # of employees, sales (2010), telephone subscribers, broadband subscribers, mobile subscribers
  5. Introduction to KT CDR data
     • KT CDR (Call Detail Record) volumes, unit: TB

       Category  Data                                 Raw data   Summary    Size (raw 1 yr
                                                      (1 month)  (1 month)  + summary 2 yrs)
       Wireless  Unrated CDR (voice, data, SMS, MMS)  3.7        2.5        104
       Wireless  Rated CDR                            1.5        0.2        22
       Wireless  Wi-Fi                                0.4        0.3        12
       Wireless  WiBro                                1.5        1.0        42
       Wireline  Rated CDR                            1.5        1.5        55
       IPDR      IP-TV                                1.5        0.1        19
       Total                                          10         5.6        254

     • KT Subscriber Analysis System (SAS) for wireless CDR
       – Reporting, call detail summary, subscriber's call quality, call log search, etc.
       – Implemented with a relational database on a high-end server
         - Data gathering and converting in a server every tens of seconds
         - Daily batch extract-transform-load (ETL) with SQL queries
         - Near real-time search against an indexed column (call number)
       – Hundreds of DB tables and over 3000 SQL queries accumulated over 10 years
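The "size" column above appears to be roughly one year of raw data plus two years of summary data. The sketch below checks that reading against the slide's own numbers. The formula (12 months of raw plus 24 months of summary) is an assumption drawn from the column header, and the function and dictionary names are illustrative only.

```python
# Monthly volumes in TB copied from the slide:
# name -> (raw per month, summary per month, size stated on the slide)
CDR_VOLUMES_TB = {
    "Wireless unrated CDR": (3.7, 2.5, 104),
    "Wireless rated CDR": (1.5, 0.2, 22),
    "Wi-Fi": (0.4, 0.3, 12),
    "WiBro": (1.5, 1.0, 42),
    "Wireline rated CDR": (1.5, 1.5, 55),
    "IP-TV": (1.5, 0.1, 19),
}


def retained_size_tb(raw_per_month, summary_per_month,
                     raw_months=12, summary_months=24):
    """Assumed retention rule: raw data kept 1 year, summaries kept 2 years."""
    return raw_per_month * raw_months + summary_per_month * summary_months


def estimates():
    """Estimated retained size per data type under the assumed rule."""
    return {name: retained_size_tb(raw, summary)
            for name, (raw, summary, _stated) in CDR_VOLUMES_TB.items()}


if __name__ == "__main__":
    for name, estimate in estimates().items():
        stated = CDR_VOLUMES_TB[name][2]
        print(f"{name}: estimated {estimate:.1f} TB, slide says {stated} TB")
```

The estimates land within about 2 TB of each stated size, which is consistent with the monthly figures on the slide being rounded.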
  6. Current KT CDR Analysis System Architecture
     [Diagram labels: data sources; collector server (data converting); relational database on a high-end server (raw data, dimension table, summary table); batch ETL; real-time search; OLAP; LALA2, NIBADA, ARGOS. Bottlenecks are marked at data converting, batch ETL, real-time search, and OLAP.]
  7. New Challenges Faced
     • Increasing data volume
       – Popular demand for smartphones and SNS
       – Need to do more complicated data analysis to beat the competition
       – Customer behavior analysis is needed
     • Slow performance
       – Peak-time performance became unacceptable
       – Some CDRs were lost due to slow performance
     • KT Cloud business launched
       – Cheaper new KT Cloud hardware is available
       – Open source requirements are increasing in the company
     Can a traditional DB give us an answer?
  8. KT meets NexR for Big Data
     • Scalability
       – Coping with increasing data volume and variety (wired, 2G, 3G, WiMax, LTE, WiFi, SMS, MMS, etc.)
       – Enabling horizontal scalability in every data path (data collection, data storage, ETL process, data search)
     • Performance
       – Handling streamed CDR data in near real-time
       – Completing daily ETL tasks in a given time period regardless of data increase
     • Cost-efficiency
       – Reducing the cost with inexpensive equipment
     Replacing traditional RDB and DW with Hadoop and similar OSS
       – Project start (2011.4)
       – Applying NexR solution for CDR analysis (pilot)
  9. Continuous Journey for KT's Big Data
     • Step-by-step approach with NexR (step, open date, coverage)
       – Hadoop CDR Analysis Platform (pilot), 2012 Q1
         - Replacing representative data and SQLs
         - Unrated wireless CDR
       – Wireless CDRs, 2012
         - Change all traditional applications to OSS
         - Add more views and reports
       – Data Integration and Advanced Analytics, 2013
         - Rated CDRs, Internet access log, TV log
         - Advanced analytics
       – External Data Sources, 2014
         - SNS, location, etc.
         - Data from KT subsidiaries
  10. Rethinking KT's Requirements
     [Diagram labels]
     • Data (past, present, future): KT CDR system; data volume explosion; data integration + data variety (Internet access log, IPTV log, social data)
     • Interface: SQL; hundreds of DB tables, 3000 SQLs (for 10 years)
     • Who? Data engineers: DBA, developers, business analysts (OLAP, SAS, etc.)
  11. Big Data Analytics Requirements for Enterprise
     • Data volume is only the basic requirement
     • Data integration is the fundamental requirement (structured data + unstructured data)
     • Need to preserve the existing data and apps
     • Need to be familiar to enterprise data engineers (DBA, SQL developers, business analysts, etc.)
     • Smooth transition is also essential
     What's the solution?
  12. NexR Solution: Hadoop + Hive
     Hive is the best solution for a smooth transition from the database world to the Hadoop world
     • Batch data processing (ETL, reporting, ad-hoc query)
       – ANSI-SQL-based query engine
       – Good for RDB migration
     • Common data storage
       – File-based data store
       – Good for data integration
     • Open questions: HOW TO CONVERT, HOW TO ADAPT (enterprise data engineers: DBA, SQL developers)
  13. NexR Data Analytics Platform (NDAP)
     • Embracing databases in the Hadoop world
       – Support for migrating data and logic from RDB to Hadoop
       – Support for integrating RDB and Hadoop
       – Offering Hive tools for DBAs and SQL developers
     • Full package for big data analytics
       – From data collection to batch data processing, real-time query, and even advanced analytics
     • Leveraging open source technologies
       – Horizontal scalability in every data processing path (collection, batch processing, real-time query, etc.)
     • Injecting real-world practices through the collaboration with KT
  14. NDAP Bird's View (open source components used)
     • NDAP RHive: advanced analytics
       – Integration of R and Hive
     • NDAP Enterprise Hive: batch data processing
       – Oracle-to-Hive, Hive workflow, Hive performance monitor, query planner
     • NDAP Data Store: common data storage
       – HDFS, Sqoop-based data import/export
     • NDAP Search: real-time query
       – ElasticSearch-based distributed log search
       – Time-ranged index sharding
     • NDAP Collector: streamed data collection
       – Flume-based data collector
       – Checkpointing for low-overhead agents
     • NDAP Admin Center: coordination & management
       – Zookeeper-based distributed coordinator
       – Collectd-based system/app management
  15. NDAP Architecture
     [Diagram labels]
     • Data sources: databases (Oracle), existing BI/DW (ODS, OLAP server), telco equipment (streaming data)
     • NexR Data Analytics Platform (NDAP): Data Importer, Data Exporter, Collector, Data Store (Hadoop), Enterprise Hive (Oracle-to-Hive; Hawk: PerfMon, Query Plan; Lama: Hive Workflow), RHive, Search (REST/JSON API), Admin Center
     • Applications: advanced analytics, DBA, ETL, ad-hoc query, reporting, OLAP, real-time query
  16. NDAP Bird's View – Today's focus
     • Today's talk
       – NDAP RHive: integration of R and Hive
       – NDAP Enterprise Hive: Oracle-to-Hive, Hive workflow, Hive performance monitor, query planner
     • NDAP Data Store: HDFS, Sqoop-based data import/export
     • Refer to appendix
       – NDAP Search: Lucene-based distributed log search engine, time-ranged index sharding
       – NDAP Collector: Flume-based data collector, checkpointing for low-overhead agents
       – NDAP Admin Center: Zookeeper-based distributed coordinator, Collectd-based system/app management
  17. Enterprise Hive
     • Recreating Hive for enterprise data engineers
     • Two goals
       – Migration of data and SQL from RDB (Oracle) to Hive: Oracle-to-Hive support
       – Rich environment for Hive developers, even DW/BI teams and DBAs: performance monitor, query planner, workflow manager
  18. Is Oracle-to-Hive trivial?
     • Simple example (Oracle, then Hive)
         SELECT * FROM Employee e1, Dept d1 WHERE e1.ID = d1.Id
         SELECT * FROM Employee e1 JOIN Dept d1 ON (e1.ID = d1.Id)
     • Typical example (Oracle)
         SELECT /*+ PARALLEL(K1 16) USE_NL(K1 B) */
                ETL_DATE, CALL_DATE,
                CASE WHEN SUBSCRIBER_TYPE = 'PREMIUM' THEN 'Y'
                     ELSE NVL(TO_CHAR(B.I_NCN), 'X') END AS I_NCN,
                I_INOUT, VALID_CNT, I_CFC_TYPE, ……
         FROM 3G_CALL_LOG K1, SASCOMM.PHONE_MAPPING B
         WHERE K1.i_etl_dt = TO_DATE([#SAS_YDAY#], 'YYYYMMDD')
           AND K1.i_call_dt || '' >= TO_DATE([#SAS_YDAY#], 'YYYYMMDD')
           AND K1.i_call_dt || '' < TO_DATE([#SAS_YDAY#], 'YYYYMMDD') + 1
           AND K1.I_INOUT IN (0, 1)
           AND DECODE(K1.I_INOUT, 0, NVL(K1.I_OUT_CTN, I_CALLING_NUM), 1, K1.I_IN_CTN) = B.I_CTN(+)
           AND K1.CALL_DATE >= B.SDATE(+)
           AND K1.CALL_DATE < B.EDATE(+);
  19. Enterprise Hive – Oracle-to-Hive
     • Enhancing Hive by
       – Fixing Hive code (JIRA issues 2253, 2503, 2329, 2332, etc.)
       – Adding Hive UDFs and UDAFs for Oracle compatibility
     • Enterprise Hive provides
       – Conversion rules, a guide, and a process
       – Oracle data types that are not supported in Hive
       – Oracle functions that are not supported in Hive
     • Three conversion points to consider
       – Data model and data types
       – Basic functions, aggregate and analytic functions
       – SQL syntax
  20. Oracle-to-Hive – Data Model, Types, Functions
     • Data model (Oracle to Hive)
       – Table to Table
       – Partition to Partition
       – Sampling to Bucket
     • Data type (Oracle to Hive)
       – NUMBER(n) to TINYINT/INT/BIGINT
       – NUMBER(n,m) to FLOAT/DOUBLE
       – VARCHAR2 to STRING
       – DATE to STRING in "yyyy-MM-dd HH:mm:ss" format
       – Hive data types are designed to be converted into Java data types
     • Basic functions (Hive follows MySQL function syntax)

       Function type  Oracle                                         Hive
       Math           round, ceil, mod, power, sqrt, sin/cos         round, ceil, pmod, power, sqrt, sin/cos
       Character      substr, trim, lpad/rpad, ltrim/rtrim, replace  substr, trim, lpad/rpad, ltrim/rtrim, regexp_replace
       Null           coalesce, nvl, nvl2                            coalesce (no nvl, nvl2)

     • Added basic functions (Hive UDF)
       – Condition: DECODE, GREATEST
       – Null: NVL, NVL2
       – Type conversion: TO_NUMBER, TO_CHAR, TO_DATE, INSTR4, DATE_FORMAT, LAST_DAY
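To make the "added basic functions" concrete, here is a minimal Python sketch of the semantics that the Oracle functions NVL, NVL2, and DECODE have and that a compatibility UDF would need to reproduce. This is not NexR's UDF code, which the slides do not show. Note one difference: Python evaluates all arguments eagerly, whereas Oracle's DECODE stops at the first match.

```python
def nvl(value, default):
    """Oracle NVL: return default when value is NULL (None here)."""
    return default if value is None else value


def nvl2(value, if_not_null, if_null):
    """Oracle NVL2: choose a result based on whether value is NULL."""
    return if_null if value is None else if_not_null


def decode(expr, *args):
    """Oracle DECODE(expr, search1, result1, ..., [default]).

    Returns the result paired with the first search value equal to expr,
    else the optional default, else None. DECODE treats two NULLs as equal,
    which matches Python's None == None.
    """
    if len(args) % 2 == 1:
        pairs, default = args[:-1], args[-1]
    else:
        pairs, default = args, None
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:
            return result
    return default


def calling_number(in_out, out_ctn, calling_num, in_ctn):
    """Mirrors the DECODE/NVL expression from the KT query on slide 18."""
    return decode(in_out, 0, nvl(out_ctn, calling_num), 1, in_ctn)
```

For example, `calling_number(0, None, "0101234", "0109999")` falls back to the calling number because the outgoing number is NULL.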
  21. Oracle-to-Hive – SQL Syntax & Analytic Functions
     Most unsupported Oracle SQL syntax can be converted with join syntax.
     • IN subquery
         Oracle: SELECT * FROM Employee e WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d)
         Hive:   SELECT * FROM Employee e LEFT SEMI JOIN Dept d ON (e.DeptNo = d.DeptNo)
     • NOT IN subquery
         Oracle: SELECT * FROM Employee e WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d)
         Hive:   SELECT e.* FROM Employee e LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo) WHERE d.DeptNo IS NULL
     • JOIN
         Oracle: SELECT * FROM Employee e1, Dept d1 WHERE e1.ID = d1.Id
         Hive:   SELECT * FROM Employee e1 JOIN Dept d1 ON (e1.ID = d1.Id)
     • RANK (analytic function)
         Oracle: SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM emp
         Hive:   SELECT e.name, e.dept, e.salary, RANK(e.dept, e.salary)
                 FROM (SELECT name, dept, salary FROM emp DISTRIBUTE BY dept SORT BY dept, salary DESC) e
     • MIN (aggregate function used as an analytic function)
         Oracle: SELECT dept, MIN(salary) OVER (PARTITION BY dept) FROM emp
         Hive:   SELECT emp.dept, tmp.m FROM emp
                 JOIN (SELECT dept, MIN(salary) m FROM emp GROUP BY dept) tmp ON emp.dept = tmp.dept
     • Oracle analytic functions are sometimes used for statistical processing (5% in the KT case)
       – Implemented some analytic functions (RANK, DENSE_RANK, ROW_NUMBER, LAG, MIN, MAX, SUM)
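Two of the rewrites above can be checked on a tiny data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for both Oracle and Hive, so it only demonstrates that the join forms return the same rows as the subquery forms. The table contents are made up for illustration. The NOT IN rewrite is equivalent only when the subquery column contains no NULLs, and the LEFT SEMI JOIN form is skipped because SQLite has no such syntax.

```python
import sqlite3


def run_demo():
    """Compare subquery-style queries with their join rewrites."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dept (deptno INTEGER);
        CREATE TABLE employee (name TEXT, deptno INTEGER, salary INTEGER);
        INSERT INTO dept VALUES (10), (20);
        INSERT INTO employee VALUES
            ('kim', 10, 300), ('lee', 10, 200),
            ('park', 20, 500), ('choi', 30, 400);
    """)
    queries = {
        # Oracle style: NOT IN subquery
        "not_in": """
            SELECT e.name FROM employee e
            WHERE e.deptno NOT IN (SELECT d.deptno FROM dept d)
            ORDER BY e.name""",
        # Hive style: LEFT OUTER JOIN plus IS NULL filter (anti-join)
        "anti_join": """
            SELECT e.name FROM employee e
            LEFT OUTER JOIN dept d ON (e.deptno = d.deptno)
            WHERE d.deptno IS NULL
            ORDER BY e.name""",
        # Hive style replacement for MIN(salary) OVER (PARTITION BY deptno)
        "min_per_dept": """
            SELECT e.name, tmp.m FROM employee e
            JOIN (SELECT deptno, MIN(salary) m FROM employee GROUP BY deptno) tmp
              ON e.deptno = tmp.deptno
            ORDER BY e.name""",
    }
    results = {key: conn.execute(sql).fetchall() for key, sql in queries.items()}
    conn.close()
    return results


if __name__ == "__main__":
    for key, rows in run_demo().items():
        print(key, rows)
```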
  22. Oracle-to-Hive – Example
     • Oracle
         select /*+ use_nl(E emp_idx1) */
                D.dname, E.empno, E.ename,
                decode(nvl(JOB, 'SALESMAN'), 'SALESMAN', sal, 0) sal,
                RANK() over (PARTITION BY D.deptno ORDER BY sal desc) ranking
         from dept D, emp E
         where D.deptno = E.deptno
           and E.ename in (select ename from bonus where job in ('SALESMAN', 'CLERK'));
     • Hive
         select X.dname, X.empno, X.ename,
                nexr_rank(HASH(X.deptno, X.sal), X.sal) ranking
         from (
           select D.dname, D.deptno, E.empno, E.ename,
                  (case coalesce(JOB, 'SALESMAN') when 'SALESMAN' then sal else 0 end) sal
           from dept D join emp E on (D.deptno = E.deptno)
                       join bonus B on (E.ename = B.ename)
           where B.job in ('SALESMAN', 'CLERK')
         ) X
         distribute by hash(D.deptno, E.sal) sort by D.deptno, E.sal;
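The slides do not show how the rank UDF is implemented. One common way to compute RANK over rows that have already been distributed by the partition key and sorted within it is a single stateful pass that remembers the previous row. The Python sketch below illustrates that general technique. It is an assumption about the approach, not the `nexr_rank` source, and the row layout is hypothetical.

```python
def rank_sorted_rows(rows, partition_key, order_key):
    """Assign RANK() to rows pre-sorted by (partition_key, order_key).

    Rows with equal order values in the same partition share a rank, and the
    next distinct value gets its position within the partition (1, 1, 3, ...).
    """
    ranked = []
    prev_partition = object()  # sentinel that equals no real partition value
    prev_value = None
    position = 0
    rank = 0
    for row in rows:
        partition, value = row[partition_key], row[order_key]
        if partition != prev_partition:
            position, rank = 1, 1          # new partition: restart counting
        else:
            position += 1
            if value != prev_value:
                rank = position            # ties keep the earlier rank
        ranked.append({**row, "ranking": rank})
        prev_partition, prev_value = partition, value
    return ranked


def demo():
    # Rows as a reducer would see them: grouped by deptno, salary descending.
    rows = [
        {"deptno": 10, "ename": "kim", "sal": 500},
        {"deptno": 10, "ename": "lee", "sal": 500},
        {"deptno": 10, "ename": "park", "sal": 300},
        {"deptno": 20, "ename": "choi", "sal": 400},
    ]
    return [r["ranking"] for r in rank_sorted_rows(rows, "deptno", "sal")]
```

The approach relies entirely on the input order, which is why the Hive rewrite needs the distribute-by and sort-by clauses ahead of the rank call.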
  23. NDAP Process for RDB Migration
     • Data preparation
       – Oracle schema to Hive schema
       – Data loading to Hive using Sqoop
     • Conversion
       – Function conversion
       – SQL conversion (syntactically, by 1-on-1 conversion rules)
     • Validation
       – Data compatibility check
     • Optimization
       – Rewriting Hive queries semantically (when more performance is needed)
     • The case of KT CDR migration
       – Chose 100 representative SQLs for ETL and converted them successfully
       – Current step: 200-300 mainly used SQLs
       – Next step (2012): 3000 SQLs
  24. Enterprise Hive – Rich Environment for Hive
     • Building up a Hive ecosystem by
       – Adding assistant programs to help DBAs and SQL developers
     • Enterprise Hive provides
       – Hive performance monitor and query planner
       – Hive workflow manager
  25. Hawk – Hive Performance Monitor
     • Difficulty of Hive performance diagnostics
       – Metrics and logs from Hive and Hadoop are separated
       – Lack of a historical and statistical view of performance
     • Hawk performance monitor for DBAs
       – Integrated view of a Hive query and the corresponding MapReduce jobs
       – Hourly/daily/weekly/monthly performance views of each query
     [Figures: Hawk screenshot; Hawk architecture]
  26. Hawk – Hive Query Planner
     • Difficulty of the Hive default query planner
       – Too complicated because it shows the details of MapReduce execution
       – Not for DBAs, but for Hive internal developers
     • Hawk query planner for DBAs
       – Displaying a Hive query at the HQL operator level (familiar to DBAs)
       – Showing the performance result together with the query
     [Figures: Hive default query planner; Hawk query planner; performance result]
  27. Lama – Hive Workflow Manager
     • Workflow development and management tool for Hive
       – Managing data processing jobs for Hive
       – Choosing Oozie as the core workflow engine
     • Providing a web-based interface
       – Workflow editing & management, user management, job scheduling, project management, etc.
     • On-demand workflow change at runtime
       – Need to fix and resume a workflow at runtime after a failure
       – Not supported in most workflow engines
       – Patching Oozie for suspend/resume per action (i.e., per Hive query)
     • Future plan
       – Supporting other data processing jobs like Pig, Sqoop, MapReduce, HDFS, SSH, and Java
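The suspend/fix/resume idea can be illustrated with a toy workflow runner. The sketch below is plain Python and does not use Oozie or Lama APIs. The class and method names are hypothetical, and the actions stand in for Hive queries. It shows only the control flow: stop at the failed action, let the operator replace it, and continue without re-running the actions that already succeeded.

```python
class Workflow:
    """Toy workflow that suspends at a failed action and resumes from it."""

    def __init__(self, actions):
        self.actions = list(actions)  # ordered (name, callable) pairs
        self.next_index = 0           # first action that has not succeeded yet
        self.history = []             # (name, status) for every attempt

    def replace_action(self, name, func):
        """Swap in a fixed implementation for a named action."""
        for i, (existing, _) in enumerate(self.actions):
            if existing == name:
                self.actions[i] = (name, func)
                return
        raise KeyError(name)

    def run(self):
        """Run remaining actions in order; suspend at the first failure."""
        while self.next_index < len(self.actions):
            name, func = self.actions[self.next_index]
            try:
                func()
            except Exception:
                self.history.append((name, "FAILED"))
                return "SUSPENDED"
            self.history.append((name, "OK"))
            self.next_index += 1
        return "SUCCEEDED"


def demo():
    executed = []

    def broken_query():
        raise RuntimeError("syntax error in summarize query")

    workflow = Workflow([
        ("load", lambda: executed.append("load")),
        ("summarize", broken_query),
        ("export", lambda: executed.append("export")),
    ])
    first = workflow.run()                     # suspends at "summarize"
    workflow.replace_action("summarize", lambda: executed.append("summarize"))
    second = workflow.run()                    # resumes; "load" is not re-run
    return first, second, executed, workflow.history
```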
  28. NDAP Process for Batch Data Processing
     • Analysis
       – Analyze service request (SR)
       – Hive data and query modeling
     • Development
       – Workflow development, testing, and validation
     • Execution
       – Workflow deployment & scheduling
       – Workflow suspend/fix/resume on failure
     • Management
       – Performance diagnostics & optimization
       – Performance monitoring
     • Tools: Lama Workflow Manager, Hawk Performance Monitor, Hawk Query Planner
  29. Enterprise Hive Demo
  30. R for Advanced Analytics
     • R (GNU open source)
       – Programming language and software environment for statistical computing and graphics (Wikipedia)
       – 4,000+ R libraries (more than SAS's functionality)
       – Becoming a de facto standard among statisticians
     • R for big data
       – R runs in a single node
       – Some parallel R packages: snowfall, rpvm, rmpi, etc.
       – Recent attempts to combine R and Hadoop: RHIPE (Purdue), RHadoop (RA), Ricardo (IBM)
  31. RHive
     • Marrying R and Hive for big data analytics
       – Most R programmers are familiar with SQL
       – Hive can hide the details of Hadoop and MapReduce
       – Inspired by IBM Ricardo (R + Jaql)
     • R: strong for deep analytics, but lacks massive data manipulation
     • Hive: strong for massive data manipulation, but lacks analytical functionality
     • Providing Hive interfaces in the R environment
       – Allowing R programmers to use familiar SQL for big data manipulation
     • Released as open source (Apache License, Version 2)
       – Source: https://github.com/nexr/RHive
       – CRAN: http://cran.r-project.org/web/packages/RHive
  32. RHive API and Architecture
     • RHive API
       – rhive.connect(): connect R to Hive
       – rhive.query(): send a Hive query and return the result
       – rhive.export(): export R functions to R processes running on the MapReduce nodes
       – rhive.exportAll(): export R functions and R objects to R processes running on the MapReduce nodes
       – rhive.close(): close a Hive connection
     [Figure: RHive architecture]
  33. RHive Sample – Flight Delay Prediction
     • R: building a prediction model of flight delay using linear regression with a training data set (sampled from Hive)
     • Hive: running the prediction model (R objects) on the entire data set in Hive
     • Data set: airline on-time performance, http://stat-computing.org/dataexpo/2009/
       – Flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008

         library(RHive)
         rhive.connect("127.0.0.1")

         # get a training data set from Hive
         trainset <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines", fetchsize=30, limit=100)

         # convert to numeric, and drop rows with missing values
         trainset$arrdelay <- as.numeric(trainset$arrdelay)
         trainset$distance <- as.numeric(trainset$distance)
         trainset <- trainset[!(is.na(trainset$arrdelay) | is.na(trainset$distance)),]

         # create a prediction model using R model objects and internal functions
         model <- lm(arrdelay ~ distance + dayofweek, data=trainset)
         rhpredict <- function(arg1, arg2, arg3) {
           if(arg1 == "NULL" | arg2 == "NULL" | arg3 == "NULL")
             return(0.0)
           res <- predict.lm(model, data.frame(dayofweek=arg1, arrdelay=arg2, distance=arg3))
           return(as.numeric(res))
         }
         null <- "NULL"

         # set up R objects in Hive
         rhive.assign("null", null)
         rhive.assign("rhpredict", rhpredict)
         rhive.assign("model", model)

         # export the R prediction model and run it in Hive
         rhive.exportAll("rhpredict", c("10.1.3.2","10.1.3.3","10.1.3.4","10.1.3.5","10.1.3.6","10.1.3.7"))
         rhive.query("create table delaypredict as select R(rhpredict, dayofweek, arrdelay, distance, 0.0) from airlines")
  34. RHive Demo
  35. Lessons Learned
     • RDB migration to open source is complicated, time-consuming, and labor-intensive.
       – It becomes feasible with practice and a migration process.
       – Average time to convert one query (200-300 lines on average): 8 hours at first, 2 hours after 4 months (4 times faster)
       – Advantageous to those who have experienced database migration (similar to Oracle-to-MySQL migration)
     • Current data engineers are not familiar with open source software like Hadoop. They want to use software tools similar to the ones they already use.
       – Open source software such as Hadoop and MapReduce is not easy for current IT managers.
       – Open source software is technology-driven, not demand-driven.
       – Open source software and technologies need to be wrapped in familiar interfaces in order to hide the details.
  36. Lessons Learned
     • Open source software is not a panacea. Choosing the right open source software is the first significant step. Combining several OSS projects is common. Modifying OSS source code is inevitable if requirements are not negotiable.
       – Combining two separate open source projects, Hive and ElasticSearch, for batch data processing and real-time query on Hadoop as a common data store
       – Changes to Hive, ElasticSearch, Flume, Oozie, Zookeeper, etc.
     • The integration of various types of data is a critical issue for an enterprise. In particular, the structured data of databases and DWs needs to be coupled with unstructured data in order to better understand customers' needs.
       – It is necessary to embrace current data and business logic in a new environment.
       – RDB/DW and Hadoop have their pros and cons, so it is necessary to find the right mix.
  37. Conclusion
     • Big data analytics for telcos and enterprises
     • Smooth transition from RDB/DW to NexR Data Analytics Platform (NDAP)
  38. NexR NDAP Team
     Jaesun Han, Wonkuk Yang, Sangmin Kwak, Sebong Oh, JeongMin Kwon, SungHan Woo, Keumju Kim, Dongmin Yu, Daegeun Kim, Choonghyun Ryu, Minseok Kim, Bokju Yun, Minwoo Kim, Jonghee Lee, Yeonseop Kim, HyungJoo, Lim, Youngwoo Kim, HeeWon Jeon, Hyeon-Cheol Nah, GooBum Jung, SeungWoo Ryu, Sunghwan Cho, Seoeun Park, Junho Cho, Young-Geun Park, ByungMyon Chae, Eun-Sook Park, Yungtai Choi, Chihoon Byun, Choi Jong-wook, SeongHwa Ahn, Inho Han, Youngbae An, Seonghak Hong
  39. Thank you
     • Presentation file: http://www.nexr.com/hw11/ndap.pdf
     • Contact: jason.han@nexr.com, twitter: @jaesun_han
     • Sections: KT CDR System (Slide 4), NDAP Overview (Slide 14), Enterprise Hive (Slide 17), RHive (Slide 30), Appendix (Slide 37)
  40. Appendix
  41. NDAP Collector
     • Flume-based scalable data collector
       – Choosing Flume due to its flexible architecture (source, decorator, sink)
       – Adding a checkpoint mode and rolling/dedup
     • Adding a checkpoint reliability mode
       – Chukwa's checkpoint is grafted onto Flume
       – Less resource consumption in agents than Flume E2E mode
       – Minimizing the amount of log data retransmitted when agents fail
     • Rolling and deduplication
       – Rolling fragmented log data periodically in Hadoop
       – Removing duplicated log data in case of failover
     [Diagram labels: Flume agent (source, decorator, sink); checkpoint (log data & location); Data Store (Hadoop); Search; Rolling/Dedup Manager (workflow scheduler, MapReduce execution manager, rolling/dedup MR); Zookeeper]
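The interplay of checkpointing and deduplication can be shown with a small simulation. The sketch below is plain Python, not Flume or Chukwa code. The record layout (source, offset, line) and both function names are assumptions made for illustration. After a failure the agent resends only what lies at or after its checkpoint, and a later pass drops the records that arrived twice.

```python
def unacknowledged(lines, checkpoint):
    """Return (offset, line) pairs at or after the checkpoint offset."""
    return [(offset, line) for offset, line in enumerate(lines)
            if offset >= checkpoint]


def deduplicate(records):
    """Keep the first copy of each (source, offset) record, preserving order."""
    seen = set()
    unique = []
    for source, offset, line in records:
        if (source, offset) in seen:
            continue
        seen.add((source, offset))
        unique.append((source, offset, line))
    return unique


def demo():
    lines = [f"cdr-{i}" for i in range(5)]
    source = "agent-1"
    # Before the failure, offsets 0..3 reached the store, but only 0..1 were
    # acknowledged, so the agent's checkpoint is still 2.
    stored = [(source, offset, line)
              for offset, line in unacknowledged(lines, 0)[:4]]
    checkpoint = 2
    # After restart the agent resends from the checkpoint, not from offset 0.
    resent = unacknowledged(lines, checkpoint)
    stored += [(source, offset, line) for offset, line in resent]
    final_lines = [line for _, _, line in deduplicate(stored)]
    return len(resent), final_lines
```

In this run only three of the five lines are retransmitted, and the duplicated offsets 2 and 3 are removed afterwards.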
  42. NDAP Search: Near Real-Time Indexing
     • Near real-time indexing using a RAM index
       – Adding a RAM index for near real-time indexing in ElasticSearch
       – Flushing the RAM index into the disk index after a given time period or on buffer overflow
       – When searching, both the RAM index and the disk index are examined
     [Diagram labels: Indexer (add), IndexWriter (create, write, commit), buffer, Disk Index, IndexReader (read), Searcher (search)]
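A toy inverted index makes the RAM-plus-disk arrangement easier to follow. The sketch below is plain Python and does not use Lucene or ElasticSearch APIs. The class name is hypothetical, both indexes are in-memory dictionaries standing in for the real structures, and only the buffer-overflow flush trigger is modeled (the time-based trigger from the slide is left out).

```python
class NearRealTimeIndex:
    """Toy inverted index with a small RAM buffer in front of a 'disk' index."""

    def __init__(self, max_buffered_docs=2):
        self.ram_index = {}    # term -> set of doc ids, searchable immediately
        self.disk_index = {}   # term -> set of doc ids, stands in for disk
        self.buffered_docs = 0
        self.max_buffered_docs = max_buffered_docs

    def add(self, doc_id, text):
        """Index a document in RAM; flush when the buffer is full."""
        for term in text.lower().split():
            self.ram_index.setdefault(term, set()).add(doc_id)
        self.buffered_docs += 1
        if self.buffered_docs >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        """Merge the RAM index into the disk index and clear the buffer."""
        for term, doc_ids in self.ram_index.items():
            self.disk_index.setdefault(term, set()).update(doc_ids)
        self.ram_index.clear()
        self.buffered_docs = 0

    def search(self, term):
        """Examine both indexes so unflushed documents are still found."""
        term = term.lower()
        hits = self.ram_index.get(term, set()) | self.disk_index.get(term, set())
        return sorted(hits)


def demo():
    index = NearRealTimeIndex(max_buffered_docs=2)
    index.add(1, "call dropped")
    before_flush = (index.search("call"), dict(index.disk_index))
    index.add(2, "call ok")        # second document triggers a flush
    index.add(3, "call busy")      # stays in RAM until the next flush
    return before_flush, index.search("call"), sorted(index.ram_index)
```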
  43. NDAP Search: Index Split Strategies
     • Modifying ElasticSearch to add more index split schemes for log search
       – Log searches usually have a time constraint, such as daily or monthly
       – Combining time-based index split and size-based index split
     • Time-based index split
       – Splitting an index according to a given time period
       – Improving indexing and search performance
       – Easy to implement auto-retention
     • Size-based index split
       – Splitting an index according to a given size
       – Resolving the performance problem of a big index
     [Diagram labels: time-based index partitions (2011.10.08, 2011.10.09, ..., 2011.10.30); size-based index sequences (0001, 0002, 0003); ElasticSearch index shards (primary and replicas); Search]
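The combined split scheme can be sketched as a small catalog that names indexes by day and sequence number. This is an illustration in plain Python, not the modified ElasticSearch code. The naming format `YYYY.MM.DD-NNNN`, the class name, and the use of a document count as the "size" limit are all assumptions.

```python
from datetime import date


def index_name(day, sequence):
    """Hypothetical naming scheme: time partition plus size sequence."""
    return f"{day:%Y.%m.%d}-{sequence:04d}"


class SplitIndexCatalog:
    """Tracks indexes split by day and, within a day, by document count."""

    def __init__(self, max_docs_per_index):
        self.max_docs = max_docs_per_index
        self.doc_counts = {}  # (day, sequence) -> number of documents

    def index_for_write(self, day):
        """Pick the index for a new document, rolling over when one is full."""
        sequences = [seq for (d, seq) in self.doc_counts if d == day]
        sequence = max(sequences) if sequences else 1
        if self.doc_counts.get((day, sequence), 0) >= self.max_docs:
            sequence += 1
        key = (day, sequence)
        self.doc_counts[key] = self.doc_counts.get(key, 0) + 1
        return index_name(day, sequence)

    def indices_for_range(self, start, end):
        """A time-bounded search touches only the partitions in range."""
        return sorted(index_name(d, seq) for (d, seq) in self.doc_counts
                      if start <= d <= end)

    def drop_older_than(self, cutoff):
        """Auto-retention: drop whole partitions older than the cutoff day."""
        expired = sorted(key for key in self.doc_counts if key[0] < cutoff)
        for key in expired:
            del self.doc_counts[key]
        return [index_name(d, seq) for d, seq in expired]


def demo():
    catalog = SplitIndexCatalog(max_docs_per_index=2)
    day1, day2 = date(2011, 10, 8), date(2011, 10, 9)
    written = [catalog.index_for_write(day1) for _ in range(3)]
    written.append(catalog.index_for_write(day2))
    searched = catalog.indices_for_range(day2, day2)
    dropped = catalog.drop_older_than(day2)
    return written, searched, dropped
```

A query restricted to one day then touches only that day's indexes, and retention amounts to dropping whole partitions.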
  44. NDAP Admin Center: Distributed Coordinator
     • Zookeeper-based distributed coordinator
       – Zookeeper handles the coordination among NDAP components
       – Patching several issues of Zookeeper and ZkClient
     • Providing common libraries for NDAP components
       – Group membership, master election, distributed lock, distributed queue
       – Easy to use and more reliable than other recipes, especially for read-and-write problems
     [Diagram labels: patched Zookeeper ensemble; Zookeeper client thread (complex, unique, fragile); patched ZkClient thread with Zookeeper recipes: group membership, master election, distributed lock, distributed queue (easy, reusable, fault tolerant); NDAP Search; NDAP Collector]
  45. NDAP Admin Center: System/App Management
     • Collectd-based system and application monitoring
       – Server resource monitoring: CPU, memory, disk, process, vmem, TCP connections, etc.
       – Application monitoring: Hadoop, ElasticSearch, Flume, Zookeeper, Memcached, Collectd, etc.
       – Plug-in architecture: add more applications such as NoSQL
     • Resource-centric view
       – Displaying all nodes' resource status on one screen for a specific resource (CPU, memory, etc.)
       – Most system management tools (Ganglia, Nagios, etc.) offer a node-centric view
     [Diagram labels: Collectd; check threshold/severity; Server Management Dashboard; NDAP Admin]
