Hadoop for carrier

Leveraging a Hadoop Cluster for Carrier-Grade Applications

Copyright © 2011 Flytxt B.V. All rights reserved. 1/17/2012

No Personalization

Service
discovery


 600- 800 GB of CDR per day
◦ GPRS Signaling 50GB/day
◦ 3G Signaling 300GB/day
◦ Voice 100GB/day
◦ SMS 200GB/day
 100 - 200 GB/day of Web Data

Mammoth Data
Data Analysis


 Framework for distributed processing of large data sets
across clusters
 Consists of
◦ Hadoop Distributed File System, aka HDFS (file system)
◦ Hadoop MapReduce (programming model)
 Characteristics
◦ Performance shall scale linearly
◦ Compute should move to data
◦ Simple core, Modular and Extensible
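
The MapReduce programming model can be illustrated with a small single-process sketch. This is not Hadoop code: the record layout and the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, chosen only to show the three stages that the framework distributes across a cluster.

```python
from collections import defaultdict

# Hypothetical CDR-like records: (record_type, size_in_bytes).
records = [
    ("sms", 120),
    ("voice", 300),
    ("sms", 80),
    ("gprs", 50),
    ("voice", 200),
]

def map_phase(record):
    """Emit one (key, value) pair per input record."""
    record_type, size = record
    yield record_type, size

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

mapped = [pair for record in records for pair in map_phase(record)]
totals = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(totals)
```

In a real cluster the map calls run on the nodes that hold the input blocks, which is what "compute should move to data" refers to.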


 Current Bottlenecks

◦ Data resides across multiple nodes, zones, and VM instances, with no elegant,
reliable, and efficient way of extracting it

◦ Loading terabytes of data into a database is slow

◦ Parallel computing is not possible in conventional BI ETL

◦ User profile and application data reside in a DB that can scale
only vertically


 Structured Data

 sqoop import --connect jdbc:mysql://db.example.com/website --table USERS --as-sequencefile

 Unstructured Data


 A distributed data collection server
◦ Scalable
◦ Configurable
◦ Extensible
◦ Manageable

 Built around the concept of flows
◦ A single flow corresponds to a type of data source
◦ Supports compression, batching & reliability setups per flow

 Data comes in through a source
◦ Optionally processed by one or more decorators
◦ And transmitted out via a sink
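
The source, decorator, and sink stages described above can be sketched as a chain of Python generators. This is a conceptual illustration only, not the collection server's API: the function names, the `voice-cdr` flow name, and the batch size are all hypothetical.

```python
import gzip
import json

def source(lines):
    """Source: yields raw events one at a time (here, from an in-memory list)."""
    for line in lines:
        yield line

def tag_decorator(events, flow_name):
    """Decorator: wraps each event with the name of the flow it belongs to."""
    for event in events:
        yield {"flow": flow_name, "body": event}

def batch_decorator(events, batch_size):
    """Decorator: groups events into lists of at most batch_size."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def gzip_sink(batches, out):
    """Sink: compresses each batch and appends it to the output list."""
    for batch in batches:
        out.append(gzip.compress(json.dumps(batch).encode("utf-8")))

raw = ["cdr-1", "cdr-2", "cdr-3"]
delivered = []
gzip_sink(batch_decorator(tag_decorator(source(raw), "voice-cdr"), 2), delivered)

# Decompress again to show what the sink received.
decoded = [json.loads(gzip.decompress(blob)) for blob in delivered]
print(decoded)
```

Each flow would get its own chain, which is how batching and compression can be configured per data source.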


 MapReduce is very powerful, but:
◦ It requires a Java programmer
◦ Users have to re-invent common functionality (join, filter, etc.)

 Execution engine atop Hadoop

 Pig provides a higher-level language, Pig Latin

 Opens the system to non-Java programmers

 Provides common operations like join, group, filter, sort


 Web log processing.
 Data processing for web search platforms.
 Ad hoc queries across large data sets.
 Rapid prototyping of algorithms for processing large data
sets.
 Pig is launched from the local machine; by default jobs execute on the Hadoop
cluster, while the -x local flag used here runs them locally
 $ cd /usr/share/cloudera/pig/
 $ bin/pig -x local
 grunt>
 log = LOAD 'excite-small.log' AS (user, timestamp, query);
 grpd = GROUP log BY user;
 cntd = FOREACH grpd GENERATE group, COUNT(log);
 STORE cntd INTO 'output';
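
For readers unfamiliar with Pig Latin, the sketch below shows roughly what the script computes, written as plain Python. The sample rows are hypothetical, and the function `count_queries_per_user` is an illustrative name, not part of Pig.

```python
from collections import Counter

# Hypothetical sample rows in the (user, timestamp, query) layout used above,
# tab-separated as Pig's default loader expects.
sample = (
    "u1\t970916105432\tyahoo chat\n"
    "u2\t970916001949\tweather\n"
    "u1\t970916105500\tnews\n"
)

def count_queries_per_user(text):
    """Return {user: number_of_rows}, mirroring GROUP ... BY user plus COUNT."""
    counts = Counter()
    for line in text.splitlines():
        user, _timestamp, _query = line.split("\t")
        counts[user] += 1
    return dict(counts)

result = count_queries_per_user(sample)
print(result)
```

The Pig version expresses the same grouping in three statements and leaves the parallel execution to the engine.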


 System for querying and managing structured data
 Built on top of Hadoop
 Uses MapReduce for execution
 SQL-like syntax; supports
◦ FROM clause subquery
◦ ANSI join (equi-join)
◦ Multi-table insert
◦ Multi group-by
◦ Sampling
◦ Object traversal
 Engagement
◦ Summarization
◦ Ad hoc analysis
◦ Spam detection
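
To give a feel for two of the listed features, a FROM clause subquery and an equi-join, the sketch below runs such a query against Python's built-in sqlite3 module as a stand-in. It does not use Hive; the `users` and `sms` tables and their contents are hypothetical, and the query is kept to syntax that is assumed to carry over to a Hive deployment largely unchanged.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for a Hive warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, circle TEXT);
    CREATE TABLE sms (user_id INTEGER, messages INTEGER);
    INSERT INTO users VALUES (1, 'north'), (2, 'south'), (3, 'north');
    INSERT INTO sms VALUES (1, 10), (1, 5), (2, 7), (3, 1);
""")

# Inner query: per-user message totals (the FROM clause subquery).
# Outer query: equi-join on user_id, then summarize per circle.
query = """
    SELECT u.circle, SUM(t.total) AS messages
    FROM (
        SELECT user_id, SUM(messages) AS total
        FROM sms
        GROUP BY user_id
    ) t
    JOIN users u ON u.user_id = t.user_id
    GROUP BY u.circle
    ORDER BY u.circle
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

This is the kind of summarization query listed above; on Hive it would be compiled into MapReduce jobs instead of running in a single process.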


Feature                          Hive               Pig
Language                         SQL-like           Pig Latin
Schemas/Types                    Yes (explicit)     Yes (implicit)
Partitions                       Yes                No
Server                           Optional (Thrift)  No
User Defined Functions           Yes                Yes
Custom Serializer/Deserializer   Yes                Yes
DFS Direct Access                Yes (implicit)     Yes (explicit)
Join/Order/Sort                  Yes                Yes
Shell                            Yes                Yes
Streaming                        Yes                No
Web Interface                    Yes                No
JDBC/ODBC                        Yes (limited)      No

