3.
Federico Leven
About Us
Lead Big Data Architect @ Nexius
Hadoop integration in the enterprise
Co-founder of Hadoop User Groups in
LATAM (Argentina and Chile)
VP of Delivery Services @ Nexius
Colin Train
Messi, asado and tango.
Previously worked @ Luminar Insights
Skype: @federicol
LinkedIn: https://ar.linkedin.com/in/sojovi
4.
Agenda
The Telco Operators' Data Challenges
Introduction to Flume
End-to-End Architecture (from
external sources to user reports)
NextGen Architecture
Summary - Q/A
6. Communications Service Providers (CSP)
Wired and wireless networks
Transport information electronically
Telecommunications Carriers, Cable Service Providers, Satellite
Broadcasting Operators
Challenges
- Churn
- Network Performance
- Service Usage
Questions they’re looking to answer
- Who are my valuable customers?
- Which customers are likely to cancel and why?
- How can I satisfy my customers so they don’t cancel?
- What services are my customers looking for?
7.
Big Data Sources: Volume, Variety, Velocity
CRM
POS
Care
Network RAN/CORE
VAS
Transmission
IN/Billing
ERP/Social Media
URL DPIs
Drive Test/Coverage Probes
CDR
DWH
8. Traditional Sources + New Sources
CDR
Voice
SMS
MMS
Data
Data Probes (DPI)
Network Perf.
Data (KPI)
Social Media
Web Logs
GPS Coordinates Data
9. A Classic EDW Architecture (Problem)
Social Media
Web Logs
ETL Server
EDW
10. Moving to the Hadoop Side of Things
EDW
Flume
Social Media
Web Logs
12. What is Flume ?
Distributed
Reliable
Easy to use
Flexible
Extensible
Move
Collect
HDFS
HBASE
Other
What about Kafka?
Overlaps many functions with Flume
More generic
More complex to implement
13. Key Definitions
CLIENT: The initial point where events are generated.
EVENT: Data unit transported by Flume. Headers plus data as a byte array.
AGENT: Main component in Flume. A container for Sources, Channels and Sinks to transport data (Events). It is a JVM process.
SOURCE / CHANNEL / SINK
22. From the Back-end to the Front-end 1
Generate Analytics for CSPs based on social media feeds
Analyze what customers are saying about their experience and satisfaction
23. From the Back-end to the Front-end 2
Rich Mapping
Bad sentiment concentration by geographic areas
Identify areas with bad service
Good afternoon everyone
Let’s get started. First, thanks for attending this presentation.
My name is Federico Leven, I’m the Big Data Architect @ Nexius. In the next slide I’ll talk a little bit more about me and my colleague Colin Train, who is our expert in the telco field.
The aim of this presentation is to cover several topics, centered mainly on Flume applied to telco operators and how to implement it in the telecommunications domain.
It is not our intention to go into every detail and subtlety of Flume; there is very good documentation, and books (some of which will be presented at the end).
We will also cover a real end-to-end implementation of our platform, from the external sources to the final user report.
COLIN
As I said, I’m the Big Data Architect @ Nexius.
My area of expertise is the integration of Hadoop in the enterprise. We know Hadoop is fast, stable and so on, but when it comes to implementing it among all the existing systems and IT infrastructure is when the problems start: how to keep the existing investment, optimizing what exists while adding Hadoop and its stack of technologies.
Let me tell you some of the things I’m currently doing. I’m from Argentina, better known for Messi, asado and tango.
I’ll let Colin start…
COLIN
COLIN
COLIN
COLIN
Here we have a list of the usual data sources a telco operator needs to collect in order to feed its enterprise DWH.
From internal systems such as CRM/ERP/Billing, to CDR data. A CDR (Call Detail Record) is the basic unit of information: every call, every message, every internet packet is generated as a record and stored.
I want to pay special attention to CDRs and to what we call the new data sources in the era of Web 2.0 and the Internet of Things.
CDR (Call Detail Record) data is a series of different record types collecting information about what is happening in the network of the telco operator.
We have all the records for voice calls: each call, one record. The same for SMS, MMS and data. We also have DPI (Deep Packet Inspection), which records and stores all the data traffic in the network: every time you send a WhatsApp message, browse a website on your mobile, or transfer any data from your installed apps, all the packets travelling through the network are stored as records, with source IP, destination IP, TCP/IP headers, payload, etc.
We also have KPIs, a set of metrics that measure the performance of the networks and allow the telco companies to monitor what is going on: whether there are problems, congestion, etc.
Each CDR type has a different format, usually structured but not uniform, and DPI and KPI records have their own formats.
So CDR is the classic example of a big data source. It has volume; in fact, so much volume that telco operators discard most of the available fields just to accommodate the data in the DWH. It has variety: at least six different formats. And it requires velocity, because at the rate it is generated, even in a batch platform, the speed needed to ingest and process all the CDRs goes beyond what an RDBMS can manage.
Now let’s add the new data sources they are starting to collect to enrich the platform and provide more powerful analytics: social media, web logs, and geolocated data both from devices in the network and from customer activity on it.
So, let’s think about a traditional DWH extracting, transforming and loading all these data sources: an ETL or ELT process.
This architecture, which worked well for a long time, is showing a lot of limitations, starting with the effort required to connect to and ingest all the data, now compounded by the new data sources.
Then everything is transformed and loaded into what we usually call the “staging area” of the DWH, and then processed, aggregated, etc. to build the final DWH.
In this presentation we are going to focus on the first arrow (collection servers to ETL), while still showing the entire architecture: moving out of this one and into the next (next slide).
This is how we can apply a modern data architecture.
First, let’s focus on Flume. For a telco operator, Flume has the advantage of being able to connect to many sources out of the box, with no custom components or development.
In the telco domain, the most used sources are:
Spooling Directory (reads data from a directory where files are being added)
Exec (executes a Linux command, usually a tail over a log file)
SSH / FTP (not built-in, but available to download from open-source developers)
One important characteristic is that Flume is Hadoop-independent. That means Flume can run on a host isolated from Hadoop, and by chaining multiple Flume agents we can go from an external server all the way to HDFS.
So, compared with the previous architecture, we have multiple advantages. First, Flume as a flexible and easy-to-use component for ingestion; no ETL is required to store data in HDFS. Then Hadoop to process and analyze the collected data. Then the data is moved to the DWH, which will be much smaller, supporting user reports as its primary purpose, for fast online SQL data access.
Flume is a distributed service.
It is reliable (it uses transactions, but reliability also depends on configuration).
It is used to collect and move data; the usual storage targets are HDFS and HBase, but there are others.
Use case for Flume: if you have data that needs to be collected and stored in HDFS, Hive or HBase, Flume is one of the best choices, if not the best.
Compared with Kafka, at first glance we can say that Kafka is more generic; it implements coordinated high availability via ZooKeeper, and it is more complex to implement. In the end, you have to investigate and select the component that fits your needs.
A Flume agent is a JVM process composed of three elements: Sources, Channels and Sinks, transporting a flow of Events.
To have an agent ready to run, we need to configure sources, channels, sinks and, optionally, interceptors.
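As a minimal sketch (not the actual production configuration; the agent name `a1` and all paths are placeholders), wiring one source, one channel and one sink in a Flume properties file could look like this:

```properties
# Name the components of agent "a1"
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Source: read files dropped into a spooling directory
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /data/cdr/incoming
a1.sources.s1.channels = c1

# Channel: in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/cdr
```

Such an agent is then started with `flume-ng agent --conf conf --conf-file a1.conf --name a1`.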
A source is the component in Flume that connects to external sources and retrieves events. The events are delivered to one or more channels. The most used types for telco are listed: Spooling Directory at the top of the list; Exec and Avro are other sources you will use in a telco IT infrastructure; and then custom sources.
The interceptor “captures” the events before they leave the source and allows you to manipulate them: modify them, add headers, change data, or even inspect the event payload to decide what to do, discarding it or sending it to a specific channel.
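For example, the built-in timestamp and host interceptors can be attached to a source like this (the component names `ts` and `hn` are illustrative):

```properties
a1.sources.s1.interceptors = ts hn
# Adds a "timestamp" header with the ingest time in milliseconds
a1.sources.s1.interceptors.ts.type = timestamp
# Adds a "host" header with the agent machine's hostname/IP
a1.sources.s1.interceptors.hn.type = host
```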
The channel is the “buffering” area in a Flume agent. Events are sent from the source to the channel inside a transaction, and likewise from the channel to the sink. The memory channel uses a “best effort” approach, meaning that an error in the Flume JVM will cause the loss of all the events in the channel. For reliable, persistent events, use the File or JDBC channel.
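A quick sketch of swapping the memory channel for the durable file channel (the directories are placeholders):

```properties
# File channel: events survive a JVM crash or restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.c1.dataDirs = /var/lib/flume/data
```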
The sink is where we store our events. We have a set of escape sequences (“variables” or masks) that can be used to create a variable output folder name using the day, month, year, hostname, etc. Another technique for reliability is to use sink groups, which will be explained in the next slides.
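As an illustration, an HDFS sink path can use those escape sequences, and a failover sink group adds a standby sink. Paths and names here are placeholders; the date escapes assume a timestamp header is present (e.g. set by the timestamp interceptor), and `%{host}` assumes the host interceptor is configured:

```properties
# Output folder varies by event date and originating host
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/cdr/%Y/%m/%d/%{host}
a1.sinks.k1.hdfs.fileType = DataStream

# Failover sink group: if k1 fails, events are routed to k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
```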
This is a basic data flow. In this diagram we have two Flume agents running in the same box, each consuming from an external source and storing into a different destination. They are isolated; there is no connection between them, and it would be the same as running each agent on a separate host. Of course, running multiple agents on one host has limitations in terms of the number of instances you can have, depending on the hardware you are using. That is a Flume sizing problem we are not covering here, but you need to know the number and size of the events over time, the source types, the number of Flume instances per host, and so on.
The interceptor is executed between the source and the channel. In the telco domain we usually use the interceptor to add a timestamp and the hostname of the Flume agent to the events, and also to select a channel, separating records with a wrong format from good data. This is called “multiplexing” and we will see it in the next slide.
Here we have two different ways to use multiple channels. One, the default behaviour when a source has multiple channels, is called replicating. Replicating means sending the events from the source to all the channels. You can use it to implement a kind of failover for sinks: because you are storing the events through all the sinks, if one sink fails, the other sinks will still store the events.
The other behaviour is called multiplexing, which allows us to decide which channel will receive each event. This usually requires an interceptor to find something in the event that gives us the criterion for choosing the right channel. As mentioned in the previous slide, I can use it to separate bad records from good records, storing the good records in HDFS and the bad records in the local filesystem for later analysis.
Replicating: reliability, no data loss, disaster recovery.
Multiplexing: partitioning of data.
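A sketch of the multiplexing case, assuming an interceptor has already set a hypothetical `quality` header on each event:

```properties
# Route events by the value of the "quality" header
a1.sources.s1.channels = goodCh badCh
a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = quality
a1.sources.s1.selector.mapping.OK = goodCh
# Anything else (e.g. malformed records) goes to the bad channel
a1.sources.s1.selector.default = badCh
```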
Flume has the ability to use one Flume agent as the external source and another Flume agent as the sink; this gives us the capacity to create a chain of agents that lets us connect from a set of external sources to the final destination.
In a telco infrastructure, I can have the following architecture:
We need to store in HDFS, in a folder structure partitioned by year/month, data coming from multiple collection servers holding CDR logs from different areas.
N Flume agents run on a set of collection servers, reading new files from a directory using the Spooling Directory source (e.g. an NFS mount). The collection servers have no direct access to Hadoop, so they connect to an agent running on a DataNode (to take advantage of locality and short-circuit writes) that consolidates the different Flume agent sources and stores the data in HDFS.
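A sketch of that two-tier chain, with hostnames, ports and paths as placeholders: each collection server runs a `col` agent whose Avro sink points at the Avro source of a consolidating `agg` agent on the DataNode.

```properties
# --- collection server agent ("col") ---
col.sources = cdr
col.channels = ch
col.sinks = fwd
col.sources.cdr.type = spooldir
col.sources.cdr.spoolDir = /data/cdr/spool
col.sources.cdr.channels = ch
col.channels.ch.type = file
# Avro sink forwards events to the consolidating agent
col.sinks.fwd.type = avro
col.sinks.fwd.channel = ch
col.sinks.fwd.hostname = datanode1.example.com
col.sinks.fwd.port = 4141

# --- consolidating agent on the DataNode ("agg") ---
agg.sources = in
agg.channels = ch
agg.sinks = store
# Avro source receives events from all collection servers
agg.sources.in.type = avro
agg.sources.in.bind = 0.0.0.0
agg.sources.in.port = 4141
agg.sources.in.channels = ch
agg.channels.ch.type = file
agg.sinks.store.type = hdfs
agg.sinks.store.channel = ch
# Year/month partitioning, as described above
agg.sinks.store.hdfs.path = hdfs://namenode:8020/cdr/%Y/%m
agg.sinks.store.hdfs.useLocalTimeStamp = true
```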
We can see here the skeleton of the methods to implement in order to create a custom Source, Sink or Interceptor.
The initialize method is used mainly to access the configuration of the Flume agent.
Start / Stop
Process is where events are received, processed if needed, and sent on: to the channel in the case of a source, or to the storage in the case of a sink.
The interceptor uses the intercept method, which handles events one by one; the List variant is the method that processes the batch of events received by the interceptor, iterating over them.
Now let’s see a full implementation; in this case, the implementation of our analytics platform. This is how you can deploy Flume and Hadoop as part of an enhanced analytics platform.
The goal of this analytics platform is to answer the questions we mentioned in the earlier slide and solve the challenges:
Challenges
Avoid Churn
Improve Network Performance
Understand Service Usage
Questions they’re looking to answer
Who are my valuable customers?
Which customers are likely to cancel and why?
How can I satisfy my customers so they don’t cancel?
What services are my customers looking for?
Starting from the sources, we have social media data. We use people’s interactions to determine what they are saying about a telco operator. There is a single Flume agent for each social network; in rough numbers, 70% of the data comes from Twitter, 20-25% from Facebook, and the rest from the other networks.
Then we have data collected from the billing system (to get information about subscribers, billing, contacts with customer care, etc.). This is collected by a single Flume agent from a spooling directory containing the data exports from the DB.
Then we have all the CDR data, mainly for voice, SMS and data. This data is available on a set of collection servers, so we run a consolidation architecture for each of the services. The same for KPI and DPI.
Once in HDFS, batch processes are executed over the data to classify the social media posts (using ML classifiers) by sentiment, topic and customer, and to try to match a user in the social network with a subscriber of the company. The result of this analysis, plus some data transformation and normalization, is written to HDFS and made available as Hive tables.
Then we have the EDW that supports the reports, which is updated with the latest results, adding new records or replacing aggregate tables with new aggregated results.
Then BI tools retrieve data from the EDW, providing faster SQL access and also taking advantage of the more powerful SQL provided by RDBMSs like Vertica, PDW or Oracle. This data transfer is done using the big data connectors for Hadoop that these RDBMSs provide (PDW: PolyBase; Vertica: Hadoop Connectors; Oracle: Big Data Connectors).
Focus on the multiplexing config and interceptor
This is how we are moving into real-time analytics. On top of Spark, and taking advantage of MLlib, we are replacing the batch-oriented architecture with a real-time one. Flume remains almost exactly the same, but now the sink is an Avro sink, tied to the host and port where the Spark application listens. So from the Flume side the changes are minimal; the architectural changes are on the Hadoop side.
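From the Flume side, the change amounts to replacing the HDFS sink with an Avro sink. The hostname and port below are placeholders for wherever the Spark Streaming Flume receiver is listening (the push-based approach, `FlumeUtils.createStream`, on the Spark side):

```properties
a1.sinks = spark
a1.sinks.spark.type = avro
a1.sinks.spark.channel = c1
# Host/port where the Spark Streaming application receives events
a1.sinks.spark.hostname = spark-receiver.example.com
a1.sinks.spark.port = 9988
```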