Telecom companies currently store their data in databases or data warehouses, process it through ETL, and run statistics and analysis with OLAP tools or data mining engines. However, with the data explosion that has accompanied the spread of smartphones, traditional data stores such as databases (DB) and data warehouses (DW) are no longer sufficient to cope with this “Big Data”. As an alternative, storing data in Hadoop and running ETL and ad-hoc queries with Hive is being introduced, and China Mobile is often cited as the most representative example. This approach, however, has been adopted mainly by new projects, which face low barriers to applying the new Hive data model and HQL. By contrast, it is extremely difficult to replace an existing database with the combination of Hadoop and Hive when a large number of tables and SQL queries already exist. NexR is migrating a telecom company’s data from Oracle DB to Hadoop and converting many existing Oracle SQL queries to Hive HQL queries. Although HQL supports a syntax similar to ANSI SQL, it lacks a large portion of the basic functions and barely supports Oracle analytic functions such as rank(), which are used mainly in statistical analysis. Furthermore, differences in data types and in the handling of null values are also obstacles to the migration. In this presentation, we will share our experience of converting Oracle SQL to Hive HQL and developing additional functions with MapReduce. We will also introduce several ideas and trials for improving Hive performance.
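To make the rank() gap concrete, the following is a minimal Python sketch of what an Oracle-style RANK() OVER (PARTITION BY ... ORDER BY ...) has to compute when it is rebuilt by hand, roughly the work a reduce step would do after rows are grouped by the partition key. It is an illustration only, not the implementation described in the presentation; the function name rank_partitions and the sample columns (region, subscriber, usage_mb) are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def rank_partitions(rows, partition_key, order_key, descending=False):
    """Emulate RANK() OVER (PARTITION BY partition_key ORDER BY order_key).

    Tied rows share a rank and the following rank is skipped (1, 1, 3),
    which is the behaviour of the standard SQL RANK() function.
    """
    # Sort by partition key first, standing in for the shuffle that
    # brings all rows of one partition to the same reducer.
    by_partition = sorted(rows, key=itemgetter(partition_key))
    ranked = []
    for _, group in groupby(by_partition, key=itemgetter(partition_key)):
        ordered = sorted(group, key=itemgetter(order_key), reverse=descending)
        previous_value = object()  # sentinel that never equals a real value
        current_rank = 0
        for position, row in enumerate(ordered, start=1):
            # A new value starts a new rank equal to its 1-based position.
            if row[order_key] != previous_value:
                current_rank = position
                previous_value = row[order_key]
            ranked.append({**row, "rank": current_rank})
    return ranked


# Hypothetical sample data: subscriber usage per region.
rows = [
    {"region": "seoul", "subscriber": "a", "usage_mb": 300},
    {"region": "seoul", "subscriber": "b", "usage_mb": 300},
    {"region": "seoul", "subscriber": "c", "usage_mb": 100},
    {"region": "busan", "subscriber": "d", "usage_mb": 50},
]
result = rank_partitions(rows, "region", "usage_mb", descending=True)
```

Here subscribers a and b tie at rank 1 within their region and c receives rank 3, not 2. The sketch does not address null handling, which would need its own ordering rule in a real conversion.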