3.
Federico Leven
About Us
Lead Big Data Architect @ Nexius
Hadoop integration in the enterprise
Co-founder of Hadoop User Groups in
LATAM (Argentina and Chile)
VP of Delivery Services @ Nexius
Colin Train
Messi, asado and tango.
Previously worked @ Luminar Insights
Skype: @federicol
LinkedIn: https://ar.linkedin.com/in/sojovi
4.
Agenda
The Telco Operators' Data Challenges
Introduction to Flume
End-to-End Architecture (from
external sources to user reports)
NextGen Architecture
Summary - Q/A
6. Communications Service Providers (CSP)
Wired and wireless networks
Transport information electronically
Telecommunications Carriers, Cable Service Providers, Satellite
Broadcasting Operators
Challenges
- Churn
- Network Performance
- Service Usage
Questions they’re looking to answer
- Who are my valuable customers?
- Which customers are likely to cancel and why?
- How can I satisfy my customers so they don’t cancel?
- What services are my customers looking for?
7.
Big Data Sources: Volume, Variety, Velocity
CRM
POS
Care
Network RAN/CORE
VAS
Transmission
IN/Billing
ERP/Social Media
URL DPIs
Drive Test/Coverage Probes
CDR
DWH
8. Traditional Sources + New Sources
CDR
Voice
SMS
MMS
Data
Data Probes (DPI)
Network Perf.
Data (KPI)
Social Media
Web Logs
GPS Coordinates Data
9. A Classic EDW Architecture (Problem)
Social Media
Web Logs
ETL Server
EDW
10. Moving to the Hadoop Side of Things
EDW
Flume
Social Media
Web Logs
12. What is Flume ?
Distributed
Reliable
Easy to use
Flexible
Extensible
Move
Collect
HDFS
HBASE
Other
What about Kafka?
Overlaps many functions with Flume
More generic
More complex to implement
13. Key Definitions
CLIENT: The initial point where events are generated.
EVENT: Data unit transported by Flume. Headers plus data as a byte array.
AGENT: Main component in Flume. A container for Sources, Channels and Sinks to transport data (Events). It is a JVM process.
SOURCE / CHANNEL / SINK
22. From the Back-end to the Front-end 1
Generate Analytics for CSPs based on social media feeds
Analyze what customers are saying about their experience and satisfaction
23. From the Back-end to the Front-end 2
Rich Mapping
Bad sentiment concentration by geographic areas
Identify areas with bad service
Good afternoon everyone
Let’s get started. First, thanks for attending this presentation.
My name is Federico Leven, I’m the Big Data Architect @ Nexius. In the next slide I’ll talk a little bit more about me and my colleague Colin Train, who is our expert in the telco field.
The aim of this presentation is to cover several topics, centered mainly on Flume applied to telco operators and how to implement it in the telecommunications domain.
It is not our intention to go into every detail and subtlety of Flume; there is very good documentation, and books (some of which will be presented at the end).
We will also cover a real end-to-end implementation of our platform, from the external sources to the final user report.
COLIN
As I said, I’m the Big Data Architect @ Nexius.
My area of expertise is the integration of Hadoop in the enterprise. We know Hadoop is fast, stable and so on, but when it comes to implementing it among all the existing systems and IT infrastructure is when the problems start: how to keep the existing investment, optimizing what exists while adding Hadoop and its stack of technologies.
Let me tell you some of the things I’m currently doing. I’m from Argentina, better known for Messi, asado and tango.
I’ll let Colin start…
COLIN
COLIN
COLIN
COLIN
Here we have a list of the usual data sources a telco operator needs to collect in order to feed its enterprise DWH.
From internal systems such as CRM/ERP/Billing, to CDR data. A CDR (Call Detail Record) is the basic unit of information: every call, every message, every internet packet is generated as a record and stored.
I want to pay special attention to CDRs and to what we call the new data sources in the era of Web 2.0 and the Internet of Things.
CDR (Call Detail Record) data is a series of different record types collecting information about what is happening in the network of the telco operator.
We have all the records for voice calls: each call, one record. The same for SMS, MMS and data. We also have DPI (Deep Packet Inspection), which records and stores all the data traffic in the network: every time you send a WhatsApp message, browse a website on your mobile, or transfer any data from your installed apps, all the packets travelling through the network are stored as records, with source IP, destination IP, TCP/IP headers, payload, etc.
We also have KPIs, a set of metrics that measure the performance of the networks and allow the telco companies to monitor what is going on: whether there are problems, congestion, etc.
Each CDR type has a different format, usually structured but not uniform, and DPI and KPI records have their own formats.
So CDR is the classic example of a big data source. It has volume; in fact, so much volume that telco operators discard most of the available fields just to accommodate the data in the DWH. It has variety: at least six different formats. And it requires velocity, because at the rate it is generated, even in a batch platform, the speed needed to ingest and process all the CDRs goes beyond what an RDBMS can manage.
Now let’s add the new data sources they are starting to collect to enrich the platform and provide more powerful analytics: social media, web logs, and geolocated data both from devices in the network and from customer activity on it.
So, let’s think about a traditional DWH extracting, transforming and loading all these data sources: an ETL or ELT process.
This architecture, which worked well for a long time, is showing a lot of limitations, starting with the effort required to connect to and ingest all the data, now compounded by the new data sources.
Then everything is transformed and loaded into what we usually call the “staging area” of the DWH, and then processed, aggregated, etc. to build the final DWH.
In this presentation we are going to focus on the first arrow (collection servers to ETL), while still showing the entire architecture: moving out of this one and into the next (next slide).
This is how we can apply a modern data architecture.
First, let’s focus on Flume. For a telco operator, Flume has the advantage of being able to connect to many sources out of the box, with no custom components or development.
In the telco domain, the most used sources are:
Spooling Directory (reads data from a directory where files are being added)
Exec (executes a Linux command, usually a tail over a log file)
SSH / FTP (not built-in, but available to download from open-source developers)
One important characteristic is that Flume is Hadoop-independent. That means Flume can run on a host isolated from Hadoop, and by chaining multiple Flume agents we can go from an external server all the way to HDFS.
So, compared with the previous architecture, we have multiple advantages. First, Flume as a flexible and easy-to-use component for ingestion; no ETL is required to store data in HDFS. Then Hadoop to process and analyze the collected data. Then the data is moved to the DWH, which will be much smaller, supporting user reports as its primary purpose, for fast online SQL data access.
Flume is a distributed service.
It is reliable (it uses transactions, but reliability also depends on configuration).
It is used to collect and move data; the usual storage targets are HDFS and HBase, but there are others.
Use case for Flume: if you have data that needs to be collected and stored in HDFS, Hive or HBase, Flume is one of the best choices, if not the best.
Compared with Kafka, at first glance we can say that Kafka is more generic; it implements coordinated high availability via ZooKeeper, and it is more complex to implement. In the end, you have to investigate and select the component that fits your needs.
A Flume agent is a JVM process composed of three elements: Sources, Channels and Sinks, transporting a flow of Events.
To have an agent ready to run, we need to configure sources, channels, sinks and, optionally, interceptors.
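As a minimal sketch (not the actual production configuration; the agent name `a1` and all paths are placeholders), wiring one source, one channel and one sink in a Flume properties file could look like this:

```properties
# Name the components of agent "a1"
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Source: read files dropped into a spooling directory
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /data/cdr/incoming
a1.sources.s1.channels = c1

# Channel: in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/cdr
```

Such an agent is then started with `flume-ng agent --conf conf --conf-file a1.conf --name a1`.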
A source is the component in Flume that connects to external sources and retrieves events. The events are delivered to one or more channels. The most used types for telco are listed: Spooling Directory at the top of the list; Exec and Avro are other sources you will use in a telco IT infrastructure; and then custom sources.
The interceptor “captures” the events before they leave the source and allows you to manipulate them: modify them, add headers, change data, or even inspect the event payload to decide what to do, discarding it or sending it to a specific channel.
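For example, the built-in timestamp and host interceptors can be attached to a source like this (the component names `ts` and `hn` are illustrative):

```properties
a1.sources.s1.interceptors = ts hn
# Adds a "timestamp" header with the ingest time in milliseconds
a1.sources.s1.interceptors.ts.type = timestamp
# Adds a "host" header with the agent machine's hostname/IP
a1.sources.s1.interceptors.hn.type = host
```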
The channel is the “buffering” area in a Flume agent. Events are sent from the source to the channel inside a transaction, and likewise from the channel to the sink. The memory channel uses a “best effort” approach, meaning that an error in the Flume JVM will cause the loss of all the events in the channel. For reliable, persistent events, use the File or JDBC channel.
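A quick sketch of swapping the memory channel for the durable file channel (the directories are placeholders):

```properties
# File channel: events survive a JVM crash or restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.c1.dataDirs = /var/lib/flume/data
```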
The sink is where we store our events. We have a set of escape sequences (“variables” or masks) that can be used to create a variable output folder name using the day, month, year, hostname, etc. Another technique for reliability is to use sink groups, which will be explained in the next slides.
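As an illustration, an HDFS sink path can use those escape sequences, and a failover sink group adds a standby sink. Paths and names here are placeholders; the date escapes assume a timestamp header is present (e.g. set by the timestamp interceptor), and `%{host}` assumes the host interceptor is configured:

```properties
# Output folder varies by event date and originating host
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/cdr/%Y/%m/%d/%{host}
a1.sinks.k1.hdfs.fileType = DataStream

# Failover sink group: if k1 fails, events are routed to k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
```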
This is a basic data flow. In this diagram we have two Flume agents running in the same box, each consuming from an external source and storing into a different destination. They are isolated; there is no connection between them, and it would be the same as running each agent on a separate host. Of course, running multiple agents on one host has limitations in terms of the number of instances you can have, depending on the hardware you are using. That is a Flume sizing problem we are not covering here, but you need to know the number and size of the events over time, the source types, the number of Flume instances per host, and so on.
The interceptor is executed between the source and the channel. In the telco domain we usually use the interceptor to add a timestamp and the hostname of the Flume agent to the events, and also to select a channel, separating records with a wrong format from good data. This is called “multiplexing” and we will see it in the next slide.
Here we have two different ways to use multiple channels. One, the default behaviour when a source has multiple channels, is called replicating. Replicating means sending the events from the source to all the channels. You can use it to implement a kind of failover for sinks: because you are storing the events through all the sinks, if one sink fails, the other sinks will still store the events.
The other behaviour is called multiplexing, which allows us to decide which channel will receive each event. This usually requires an interceptor to find something in the event that gives us the criterion for choosing the right channel. As mentioned in the previous slide, I can use it to separate bad records from good records, storing the good records in HDFS and the bad records in the local filesystem for later analysis.
Replicating: reliability, no data loss, disaster recovery.
Multiplexing: partitioning of data.
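A sketch of the multiplexing case, assuming an interceptor has already set a hypothetical `quality` header on each event:

```properties
# Route events by the value of the "quality" header
a1.sources.s1.channels = goodCh badCh
a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = quality
a1.sources.s1.selector.mapping.OK = goodCh
# Anything else (e.g. malformed records) goes to the bad channel
a1.sources.s1.selector.default = badCh
```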
Flume has the ability to use one Flume agent as the external source and another Flume agent as the sink; this gives us the capacity to create a chain of agents that lets us connect from a set of external sources to the final destination.
In a telco infrastructure, I can have the following architecture:
We need to store in HDFS, in a folder structure partitioned by year/month, data coming from multiple collection servers holding CDR logs from different areas.
N Flume agents run on a set of collection servers, reading new files from a directory using the Spooling Directory source (e.g. an NFS mount). The collection servers have no direct access to Hadoop, so they connect to an agent running on a DataNode (to take advantage of locality and short-circuit writes) that consolidates the different Flume agent sources and stores the data in HDFS.
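A sketch of that two-tier chain, with hostnames, ports and paths as placeholders: each collection server runs a `col` agent whose Avro sink points at the Avro source of a consolidating `agg` agent on the DataNode.

```properties
# --- collection server agent ("col") ---
col.sources = cdr
col.channels = ch
col.sinks = fwd
col.sources.cdr.type = spooldir
col.sources.cdr.spoolDir = /data/cdr/spool
col.sources.cdr.channels = ch
col.channels.ch.type = file
# Avro sink forwards events to the consolidating agent
col.sinks.fwd.type = avro
col.sinks.fwd.channel = ch
col.sinks.fwd.hostname = datanode1.example.com
col.sinks.fwd.port = 4141

# --- consolidating agent on the DataNode ("agg") ---
agg.sources = in
agg.channels = ch
agg.sinks = store
# Avro source receives events from all collection servers
agg.sources.in.type = avro
agg.sources.in.bind = 0.0.0.0
agg.sources.in.port = 4141
agg.sources.in.channels = ch
agg.channels.ch.type = file
agg.sinks.store.type = hdfs
agg.sinks.store.channel = ch
# Year/month partitioning, as described above
agg.sinks.store.hdfs.path = hdfs://namenode:8020/cdr/%Y/%m
agg.sinks.store.hdfs.useLocalTimeStamp = true
```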
We can see here the skeleton of the methods to implement in order to create a custom Source, Sink or Interceptor.
The initialize method is used mainly to access the configuration of the Flume agent.
Start / Stop
Process is where events are received, processed if needed, and sent on: to the channel in the case of a source, or to the storage in the case of a sink.
The interceptor uses the intercept method, which handles events one by one; the List variant is the method that processes the batch of events received by the interceptor, iterating over them.
Now let’s see a full implementation; in this case, the implementation of our analytics platform. This is how you can deploy Flume and Hadoop as part of an enhanced analytics platform.
The goal of this analytics platform is to answer the questions we mentioned in the earlier slide and solve the challenges:
Challenges
Avoid Churn
Improve Network Performance
Understand Service Usage
Questions they’re looking to answer
Who are my valuable customers?
Which customers are likely to cancel and why?
How can I satisfy my customers so they don’t cancel?
What services are my customers looking for?
Starting from the sources, we have social media data. We use people’s interactions to determine what they are saying about a telco operator. There is a single Flume agent for each social network; in rough numbers, 70% of the data comes from Twitter, 20-25% from Facebook, and the rest from the other networks.
Then we have data collected from the billing system (to get information about subscribers, billing, contacts with customer care, etc.). This is collected by a single Flume agent from a spooling directory containing the data exports from the DB.
Then we have all the CDR data, mainly for voice, SMS and data. This data is available on a set of collection servers, so we run a consolidation architecture for each of the services. The same for KPI and DPI.
Once in HDFS, batch processes are executed over the data to classify the social media posts (using ML classifiers) by sentiment, topic and customer, and to try to match a user in the social network with a subscriber of the company. The result of this analysis, plus some data transformation and normalization, is written to HDFS and made available as Hive tables.
Then we have the EDW that supports the reports, which is updated with the latest results, adding new records or replacing aggregate tables with new aggregated results.
Then BI tools retrieve data from the EDW, providing faster SQL access and also taking advantage of the more powerful SQL provided by RDBMSs like Vertica, PDW or Oracle. This data transfer is done using the big data connectors for Hadoop that these RDBMSs provide (PDW: PolyBase; Vertica: Hadoop Connectors; Oracle: Big Data Connectors).
Focus on the multiplexing config and interceptor
This is how we are moving into real-time analytics. On top of Spark, and taking advantage of MLlib, we are replacing the batch-oriented architecture with a real-time one. Flume remains almost exactly the same, but now the sink is an Avro sink, tied to the host and port where the Spark application listens. So from the Flume side the changes are minimal; the architectural changes are on the Hadoop side.
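From the Flume side, the change amounts to replacing the HDFS sink with an Avro sink. The hostname and port below are placeholders for wherever the Spark Streaming Flume receiver is listening (the push-based approach, `FlumeUtils.createStream`, on the Spark side):

```properties
a1.sinks = spark
a1.sinks.spark.type = avro
a1.sinks.spark.channel = c1
# Host/port where the Spark Streaming application receives events
a1.sinks.spark.hostname = spark-receiver.example.com
a1.sinks.spark.port = 9988
```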