Leveraging a Hadoop Cluster for Carrier-Grade Applications




                             Copyright © 2011 Flytxt B.V. All rights reserved.   1/17/2012
No Personalization


Service
discovery




   600-800 GB of CDR per day
    ◦ GPRS signaling 50 GB/day
    ◦ 3G signaling 300 GB/day
    ◦ Voice 100 GB/day
    ◦ SMS 200 GB/day
   100-200 GB/day of web data
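
As a quick sanity check on the figures above, the per-source CDR volumes can be summed and compared against the quoted 600-800 GB/day range. This is only a minimal sketch; the dictionary and variable names are illustrative and do not come from the original deck.

```python
# Daily CDR volume per source in GB, as listed on the slide.
cdr_gb_per_day = {
    "GPRS signaling": 50,
    "3G signaling": 300,
    "Voice": 100,
    "SMS": 200,
}

# Web data is quoted as a range (low, high) in GB/day.
web_gb_per_day = (100, 200)

cdr_total = sum(cdr_gb_per_day.values())
total_low = cdr_total + web_gb_per_day[0]
total_high = cdr_total + web_gb_per_day[1]

print(f"CDR total: {cdr_total} GB/day")
print(f"CDR + web: {total_low}-{total_high} GB/day")
```

The listed sources add up to 650 GB/day, which sits inside the quoted CDR range; with web data the combined load is roughly 750-850 GB/day.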



Mammoth Data
                                         Data Analysis




   Framework for distributed processing of large data sets
    across clusters
   Consists of
    ◦ Hadoop Distributed File System aka HDFS (File system)
    ◦ Hadoop MapReduce (programming model)
   Characteristics
    ◦ Performance shall scale linearly
    ◦ Compute should move to data
    ◦ Simple core, Modular and Extensible
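
To make the MapReduce programming model concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce). It is not Hadoop's API: `run_job`, `volume_mapper`, `sum_reducer` and the `type,bytes` record layout are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def volume_mapper(record: str) -> Iterator[Pair]:
    """Map: parse one 'type,bytes' record (assumed layout) into a key/value pair."""
    record_type, size = record.split(",")
    yield record_type, int(size)


def sum_reducer(key: str, values: List[int]) -> Pair:
    """Reduce: collapse all values seen for one key into a single total."""
    return key, sum(values)


def run_job(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Pair]],
    reducer: Callable[[str, List[int]], Pair],
) -> Dict[str, int]:
    # Map phase: every record is processed independently of the others.
    mapped = [pair for record in records for pair in mapper(record)]

    # Shuffle phase: group values by key so each key reaches one reducer call.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: one reducer call per key.
    return dict(reducer(key, values) for key, values in groups.items())


records = ["sms,120", "voice,300", "sms,80", "gprs,50"]
result = run_job(records, volume_mapper, sum_reducer)
print(result)
```

Because each mapper call sees only its own record and each reducer call sees only its own key, a real cluster can spread both phases across nodes that already hold the data.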



   Current Bottleneck

    ◦ Data resides in multiple nodes/zones/VM instances, with no elegant,
      reliable and efficient way of extracting it

    ◦ Loading terabytes of data into a database is slow

    ◦ Parallel computing is not possible in conventional BI ETL

    ◦ User profile and application data reside in a DB that can scale
      only vertically




   Structured Data



         sqoop import --connect jdbc:mysql://db.example.com/website --table USERS --as-sequencefile



   Unstructured Data




   A distributed data collection server
    ◦   Scalable
    ◦   Configurable
    ◦   Extensible
    ◦   Manageable


   Built around the concept of flows
    ◦ A single flow corresponds to a type of data source
    ◦ Supports per-flow compression, batching and reliability settings


   Data come in through a source
    ◦ Optionally processed by one or more decorators
    ◦ And transmitted out via sink
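
The source, decorator and sink chain described above can be sketched as a simple pipeline. This is a conceptual illustration only, not the collection server's actual API: `line_source`, `drop_empty`, `tag_with`, `ListSink` and `run_flow` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator, List

Event = str
Decorator = Callable[[Iterable[Event]], Iterator[Event]]


def line_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: turn raw input lines into events."""
    for line in lines:
        yield line.rstrip("\n")


def drop_empty(events: Iterable[Event]) -> Iterator[Event]:
    """Decorator: filter out empty events."""
    for event in events:
        if event:
            yield event


def tag_with(flow_name: str) -> Decorator:
    """Decorator factory: prefix each event with the name of its flow."""
    def decorator(events: Iterable[Event]) -> Iterator[Event]:
        for event in events:
            yield f"{flow_name}:{event}"
    return decorator


class ListSink:
    """Sink: collects events in memory (a real sink would write to HDFS)."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def write(self, events: Iterable[Event]) -> None:
        self.events.extend(events)


def run_flow(source: Iterable[Event], decorators: List[Decorator], sink: ListSink) -> None:
    # Chain the optional decorators in order, then hand the stream to the sink.
    stream: Iterable[Event] = source
    for decorate in decorators:
        stream = decorate(stream)
    sink.write(stream)


sink = ListSink()
run_flow(line_source(["a\n", "\n", "b\n"]), [drop_empty, tag_with("cdr")], sink)
print(sink.events)
```

Keeping each flow as its own source, decorator list and sink is what allows settings such as batching or compression to be chosen per data source.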




   MapReduce is very powerful, but:
    ◦ It requires a Java programmer
    ◦ Users have to re-invent common functionality (join, filter, etc.)

   Execution engine atop Hadoop

   Pig provides a higher-level language, Pig Latin

   Opens the system to non-Java programmers

   Provides common operations like join, group, filter, sort




   Web log processing.
   Data processing for web search platforms.
   Ad hoc queries across large data sets.
   Rapid prototyping of algorithms for processing large data
    sets.
   Pig runs on the local machine; in the default MapReduce mode the job is
    executed on the Hadoop cluster (the -x local flag below runs it locally instead)
       $ cd /usr/share/cloudera/pig/
       $ bin/pig -x local
       grunt>
           log = LOAD 'excite-small.log' AS (user, timestamp, query);
           grpd = GROUP log BY user;
           cntd = FOREACH grpd GENERATE group, COUNT(log);
           STORE cntd INTO 'output';
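
For readers who do not know Pig Latin, the script above groups the log by user and counts the rows in each group. A rough plain-Python equivalent is sketched below, assuming tab-separated input with three fields (tab is the default delimiter of Pig's loader); the function name and sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_queries_per_user(lines: Iterable[str]) -> Dict[str, int]:
    """Count log rows per user, mirroring GROUP ... BY user plus COUNT."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            # Skip malformed rows; this is a simplification of Pig's behaviour.
            continue
        user, _timestamp, _query = fields
        counts[user] += 1
    return dict(counts)


# Made-up sample rows in (user, timestamp, query) layout.
sample = [
    "u1\t970916105432\tyahoo chat\n",
    "u2\t970916001949\tminnesota fishing\n",
    "u1\t970916105500\tweather\n",
]
print(count_queries_per_user(sample))
```

The Pig version is shorter and, unlike this loop, can be run unchanged on a cluster as MapReduce jobs.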




   System for querying and managing structured data
   Built on top of Hadoop
   Uses MapReduce for execution
   SQL-like syntax; supports
    ◦   FROM-clause subquery
    ◦   ANSI join (equi-join)
    ◦   Multi-table insert
    ◦   Multi group-by
    ◦   Sampling
    ◦   Object traversal
   Engagement
    ◦ Summarization
    ◦ Ad hoc analysis
    ◦ Spam detection
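
Two of the query features listed above, a subquery in the FROM clause and an equi-join, can be illustrated with a small summarization query. The sketch below runs it on SQLite through Python's `sqlite3` module purely as a stand-in so that it is runnable anywhere; it is not executed by the system described on this slide, and the `users` and `sms_log` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER, plan TEXT);
    CREATE TABLE sms_log (user_id INTEGER, msg_count INTEGER);
    INSERT INTO users VALUES (1, 'prepaid'), (2, 'postpaid');
    INSERT INTO sms_log VALUES (1, 5), (1, 7), (2, 3);
    """
)

# The inner SELECT is the FROM-clause subquery (per-user summarization);
# the JOIN condition is a plain equality, i.e. an equi-join.
query = """
SELECT u.plan, t.total_msgs
FROM (SELECT user_id, SUM(msg_count) AS total_msgs
      FROM sms_log
      GROUP BY user_id) t
JOIN users u ON (u.user_id = t.user_id)
ORDER BY u.plan
"""
rows = conn.execute(query).fetchall()
print(rows)
```

On a cluster, a query of this shape would be compiled into MapReduce jobs rather than run by a single database engine.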



Feature                          Hive                              Pig
Language                         SQL-like                          Pig Latin
Schemas/Types                    Yes (explicit)                    Yes (implicit)
Partitions                       Yes                               No
Server                           Optional (Thrift)                 No
User Defined Functions           Yes                               Yes
Custom Serializer/Deserializer   Yes                               Yes
DFS Direct Access                Yes (implicit)                    Yes (explicit)
Join/Order/Sort                  Yes                               Yes
Shell                            Yes                               Yes
Streaming                        Yes                               No
Web Interface                    Yes                               No
JDBC/ODBC                        Yes (limited)                     No




