Apache Flume - Streaming data easily to Hadoop from any source for Telco operators
Federico Leven
Colin Train
About Us
Federico Leven
Lead Big Data Architect @ Nexius
Hadoop integration in the enterprise
Co-founder of Hadoop User Groups in LATAM (Argentina and Chile)
VP of Delivery Services @ Nexius
Colin Train
Messi, asado and tango.
Previously worked @ Luminar Insights
Skype: @federicol
LinkedIn: https://ar.linkedin.com/in/sojovi
Agenda
The Telco Operators' Data Challenges
Introduction to Flume
End-to-End Architecture (from external sources to user reports)
NextGen Architecture
Summary - Q/A
The Telco Operators’ Data Challenges
Communications Service Providers (CSP)
Wired and wireless networks
Transport information electronically
Telecommunications Carriers, Cable Service Providers, Satellite
Broadcasting Operators
Challenges
- Churn
- Network Performance
- Service Usage
Questions they’re looking to answer
- Who are my valuable customers?
- Which customers are likely to cancel and why?
- How can I satisfy my customers so they don’t cancel?
- What services are my customers looking for?
Big Data Sources: Volume, Variety, Velocity
CRM, POS, Care, Network, RAN/CORE, VAS, Transmission, IN/Billing, ERP, Social Media, URL DPIs, Drive Test/Coverage, Probes, CDR, DWH
Traditional Sources + New Sources
CDR (Voice, SMS, MMS, Data)
Data Probes (DPI)
Network Perf. Data (KPI)
Social Media
Web Logs
GPS Coordinates Data
A Classic EDW Architecture (Problem)
[Diagram: Social Media and Web Logs feed an ETL Server, which loads the EDW]
Moving to the Hadoop Side of Things
[Diagram components: Social Media, Web Logs, Flume, EDW]
Flume Introduction
What is Flume?
Distributed, reliable, easy to use, flexible, extensible
Collects and moves data to HDFS, HBase and other destinations
But Kafka?
- Overlaps many functions with Flume
- More generic
- More complex to implement
Key Definitions
CLIENT: The initial point where events are generated.
EVENT: Data unit transported by Flume. Headers + data as byte array.
AGENT: Main component in Flume. A container for Sources, Channels and Sinks to transport data (Events). Runs as a JVM process.
[Diagram: EVENT → SOURCE → CHANNEL → SINK → EVENT]
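The definitions above can be sketched in a few lines of plain Java. This is a toy model only, not Flume's API: `AgentSketch`, its nested `Event` class, the `host` header and the in-memory queue standing in for a channel are all hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical model of one agent: a source puts events on a channel, a sink drains it.
final class AgentSketch {

    // Hypothetical stand-in for an event: string headers plus an opaque byte[] body.
    static final class Event {
        final Map<String, String> headers = new HashMap<>();
        final byte[] body;

        Event(String text) {
            this.body = text.getBytes(StandardCharsets.UTF_8);
        }
    }

    private final Queue<Event> channel = new ArrayDeque<>();
    final List<String> delivered = new ArrayList<>();

    // Source side: wrap incoming data in an event and buffer it on the channel.
    void sourceReceive(String line) {
        Event e = new Event(line);
        e.headers.put("host", "localhost"); // illustrative header only
        channel.add(e);
    }

    // Sink side: take buffered events in arrival order and hand them to the destination.
    int sinkDrain() {
        int count = 0;
        Event e;
        while ((e = channel.poll()) != null) {
            delivered.add(new String(e.body, StandardCharsets.UTF_8));
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        AgentSketch agent = new AgentSketch();
        agent.sourceReceive("call record 1");
        agent.sourceReceive("call record 2");
        System.out.println(agent.sinkDrain() + " events delivered: " + agent.delivered);
    }
}
```

The channel is the only shared state between the two sides, so the source and the sink can run at different speeds.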
Agent Configurations
SOURCE: Spooling dir, EXEC, Netcat, AVRO, HTTP, Others, Custom
INTERCEPTOR: Timestamp, Regex, Static, Others, Custom
CHANNEL: Memory, JDBC, File, Others, Custom
SINK: HDFS, HBASE, Hive, Others, Custom
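As a rough illustration of what the Timestamp and Static interceptors do, the sketch below models an event as just its header map and applies a chain of header-modifying steps in order. It is plain Java, not Flume code: the class name, the `datacenter` header and the choice to keep an existing value in the static step are assumptions for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical model: an event is reduced to its headers, an interceptor is a function on them.
final class InterceptorChainSketch {

    // Timestamp-style step: stamp the event with a time in milliseconds.
    static UnaryOperator<Map<String, String>> timestamp(long nowMillis) {
        return headers -> {
            headers.put("timestamp", Long.toString(nowMillis));
            return headers;
        };
    }

    // Static-style step: add a fixed key/value, keeping any value already present.
    static UnaryOperator<Map<String, String>> staticHeader(String key, String value) {
        return headers -> {
            headers.putIfAbsent(key, value);
            return headers;
        };
    }

    // Apply the interceptors in order to a copy of the incoming headers.
    static Map<String, String> apply(List<UnaryOperator<Map<String, String>>> chain,
                                     Map<String, String> headers) {
        Map<String, String> out = new HashMap<>(headers);
        for (UnaryOperator<Map<String, String>> step : chain) {
            out = step.apply(out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<UnaryOperator<Map<String, String>>> chain =
                List.of(timestamp(System.currentTimeMillis()), staticHeader("datacenter", "UAE"));
        System.out.println(apply(chain, new HashMap<>()));
    }
}
```

Headers set this way are what a downstream selector or sink can later use for routing or for building output paths.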
Data Flow Model
[Diagram: two flows on one host. NFS → SOURCE → CHANNEL → SINK → HDFS, and HTTP API → SOURCE → CHANNEL → SINK → HBASE]
Data Flow Model (Multiplexing/Replicating)
[Diagram: inside one JVM Flume Agent, an External Source feeds a single SOURCE, which sends events to CHANNEL 1 → SINK 1 → HDFS and to CHANNEL 2 → SINK 2 → File]
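The difference between the two modes can be sketched as two small routing functions. This is plain Java written for illustration, not Flume's selector classes: the `kind` header, its values and the channel names are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical model of the two routing modes a source can use to pick channels.
final class SelectorSketch {

    // Replicating: every event is copied to every configured channel.
    static List<String> replicate(List<String> channels) {
        return List.copyOf(channels);
    }

    // Multiplexing: the value of one header picks the channels;
    // events with a missing or unmapped value go to the default channels.
    static List<String> multiplex(Map<String, String> headers, String headerName,
                                  Map<String, List<String>> mapping, List<String> defaults) {
        String value = headers.get(headerName);
        if (value == null) {
            return defaults;
        }
        return mapping.getOrDefault(value, defaults);
    }

    public static void main(String[] args) {
        List<String> all = List.of("channel1", "channel2");
        Map<String, List<String>> mapping = Map.of(
                "NORMAL", List.of("channel1"),
                "ERROR", List.of("channel2"));

        System.out.println(replicate(all));
        System.out.println(multiplex(Map.of("kind", "ERROR"), "kind", mapping, List.of("channel1")));
    }
}
```

Replication suits the case where the same data must land in two stores; multiplexing suits the case where different kinds of events need different destinations.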
Data Flow Model (Fan Out / Consolidation)
[Diagram: three SOURCE → CHANNEL → SINK agents. Inputs: NFS (HOST 1) and HTTP (HOST 2). Output: HDFS]
How-to: Java extensibility
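// Base classes and interfaces come from Flume (org.apache.flume.*).
// getSomeData() and storeSomeData() are placeholders for your own logic.

public class MySource extends AbstractSource implements Configurable, PollableSource {
    public void configure(Context context) { … }
    public void start() { … }
    public void stop() { … }
    public Status process() throws EventDeliveryException {
        // …
        Event e = getSomeData();
        getChannelProcessor().processEvent(e);  // hand the event to the configured channel(s)
        return Status.READY;
    }
}

public class MyInterceptor implements Interceptor {
    public void initialize() { … }
    public Event intercept(Event event) { … }
    public List<Event> intercept(List<Event> events) { … }
    public void close() { … }
}

public class MySink extends AbstractSink implements Configurable {
    public void configure(Context context) { … }
    public void start() { … }
    public void stop() { … }
    public Status process() throws EventDeliveryException {
        // …
        Channel ch = getChannel();
        Transaction txn = ch.getTransaction();  // takes must happen inside a transaction
        txn.begin();
        try {
            Event e = ch.take();                // null when the channel is empty
            if (e != null) storeSomeData(e);
            txn.commit();
            return (e != null) ? Status.READY : Status.BACKOFF;
        } catch (Exception ex) {
            txn.rollback();                     // the event stays in the channel for a retry
            return Status.BACKOFF;
        } finally {
            txn.close();
        }
    }
}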
End-to-End Architecture. From external source to user reports
Hadoop to EDW integration
[Diagram, left to right]
Sources: ERP, CRM, Billing, Customer, CDR, KPI, DPI
Ingest: Flume Agents
Hadoop: Tez-YARN and HDFS, Hive on Tez, ML classifiers and sentiment analysis (Topic, Sentiment)
EDW: Vertica, PDW, Oracle
BI Tools
Real Example of Flume Configuration
FacebookAgent_UAE.sources = FacebookPages_UAE
FacebookAgent_UAE.channels = MemChannelPages_UAE MemChannelPages_UAEError
FacebookAgent_UAE.sinks = HDFSFacebookPages_UAE HDFSFacebookPages_UAEError
FacebookAgent_UAE.sources.FacebookPages_UAE.type = com.nexius...Facebook_Pages
FacebookAgent_UAE.sources.FacebookPages_UAE.channels = MemChannelPages_UAE MemChannelPages_UAEError
FacebookAgent_UAE.sources.FacebookPages_UAE.consumerKey = XXXXXX
FacebookAgent_UAE.sources.FacebookPages_UAE.consumerSecret = XXXXX
FacebookAgent_UAE.sources.FacebookPages_UAE.postsLimit = 20
FacebookAgent_UAE.sources.FacebookPages_UAE.selector.type = multiplexing
FacebookAgent_UAE.sources.FacebookPages_UAE.selector.header = msgType
FacebookAgent_UAE.sources.FacebookPages_UAE.selector.mapping.NORMAL = MemChannelPages_UAE
FacebookAgent_UAE.sources.FacebookPages_UAE.selector.mapping.ERROR = MemChannelPages_UAEError
FacebookAgent_UAE.sinks.HDFSFacebookPages_UAE.channel = MemChannelPages_UAE
FacebookAgent_UAE.sinks.HDFSFacebookPages_UAE.type = hdfs
FacebookAgent_UAE.sinks.HDFSFacebookPages_UAEError.channel = MemChannelPages_UAEError
FacebookAgent_UAE.sinks.HDFSFacebookPages_UAEError.type = hdfs
FacebookAgent_UAE.sources.FacebookPages_UAE.interceptors = KeysReplace
FacebookAgent_UAE.sources.FacebookPages_UAE.interceptors.KeysReplace.type = com...Replace$Builder
FacebookAgent_UAE.sources.FacebookPages_UAE.interceptors.KeysReplace.map.from = fr0m
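To make the selector lines in this configuration concrete, the sketch below loads the four `selector.*` properties with `java.util.Properties` and looks up which channel a given `msgType` value maps to. It only illustrates how the keys are laid out; it is not how Flume itself resolves the mapping, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: reads the selector keys from the configuration above.
final class SelectorConfigSketch {
    static final String PREFIX = "FacebookAgent_UAE.sources.FacebookPages_UAE.selector.";

    // Parse properties-format text (key = value, one pair per line).
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the channel mapped to the given header value, or null if none is mapped.
    static String channelFor(Properties props, String headerValue) {
        return props.getProperty(PREFIX + "mapping." + headerValue);
    }

    public static void main(String[] args) {
        String conf = String.join("\n",
                PREFIX + "type = multiplexing",
                PREFIX + "header = msgType",
                PREFIX + "mapping.NORMAL = MemChannelPages_UAE",
                PREFIX + "mapping.ERROR = MemChannelPages_UAEError");
        Properties props = load(conf);
        System.out.println("ERROR events go to: " + channelFor(props, "ERROR"));
    }
}
```

In the deck's setup this is what lets one source write normal posts and error records to separate HDFS sinks.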
From the Back-end to the Front-end 1
Generate Analytics for CSPs based on social media feeds
Analyze what customers are saying about their experience and satisfaction
From the Back-end to the Front-end 2
Rich Mapping
Bad sentiment concentration by geographic areas
Identify areas with bad service
Next Generation Architecture
[Diagram, left to right]
Sources: ERP, CRM, Billing, Customer, CDR, KPI, DPI
Ingest: Flume Agents
Hadoop: Spark on YARN, HDFS, Hive on Tez, MLlib classifiers and sentiment analysis (Topic, Sentiment)
EDW: Vertica, PDW, Oracle
BI Tools
Bibliography and References
http://en.wikipedia.org/wiki/Call_detail_record
http://flume.apache.org/
http://blog.cloudera.com/blog/2014/03/letting-it-flow-with-spark-streaming/
http://hortonworks.com/hadoop/flume/
https://blogs.oracle.com/datawarehousing/entry/flume_and_hive_for_log
Thank you. Now, your questions!

More Related Content

What's hot

Large scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudLarge scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudDataWorks Summit
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with FlumeRatnakar Pawar
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeArvind Prabhakar
 
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019confluent
 
HBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Chris Nauroth
 
Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)W2O Group
 
High Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopHigh Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopDataWorks Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksDataWorks Summit/Hadoop Summit
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit/Hadoop Summit
 
Scaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/dayScaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/dayKarthik Ramasamy
 
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, EuropePatterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, EuropeFlip Kromer
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemGyula Fóra
 

What's hot (20)

Large scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudLarge scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloud
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache Flume
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Highlights Of Sqoop2
Highlights Of Sqoop2Highlights Of Sqoop2
Highlights Of Sqoop2
 
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019
 
HBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBase
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)
 
High Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopHigh Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into Hadoop
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
 
Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython NotebookCooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Scaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/dayScaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/day
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, EuropePatterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 

Similar to Apache Flume - Streaming data easily to Hadoop from any source for Telco operators

Networking and communications security – network architecture design
Networking and communications security – network architecture designNetworking and communications security – network architecture design
Networking and communications security – network architecture designEnterpriseGRC Solutions, Inc.
 
Splunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageSplunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageDamien Dallimore
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunk
 
Cwin16 tls-a micro-service deployment - v1.0
Cwin16 tls-a micro-service deployment - v1.0Cwin16 tls-a micro-service deployment - v1.0
Cwin16 tls-a micro-service deployment - v1.0Capgemini
 
WS-VLAM workflow
WS-VLAM workflowWS-VLAM workflow
WS-VLAM workflowguest6295d0
 
Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Timothy Spann
 
adaptTo() 2014 - Integrating Open Source Search with CQ/AEM
adaptTo() 2014 - Integrating Open Source Search with CQ/AEMadaptTo() 2014 - Integrating Open Source Search with CQ/AEM
adaptTo() 2014 - Integrating Open Source Search with CQ/AEMtherealgaston
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
Ogce Workflow Suite
Ogce Workflow SuiteOgce Workflow Suite
Ogce Workflow Suitesmarru
 
Linked services for the Web of Data
Linked services for the Web of DataLinked services for the Web of Data
Linked services for the Web of DataJohn Domingue
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunk
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC
 
IoTivity for Automotive IoT Interoperability
IoTivity for Automotive IoT InteroperabilityIoTivity for Automotive IoT Interoperability
IoTivity for Automotive IoT InteroperabilitySamsung Open Source Group
 
Open Source XMPP for Cloud Services
Open Source XMPP for Cloud ServicesOpen Source XMPP for Cloud Services
Open Source XMPP for Cloud Servicesmattjive
 
Cloud to hybrid edge cloud evolution Jun112020.pptx
Cloud to hybrid edge cloud evolution Jun112020.pptxCloud to hybrid edge cloud evolution Jun112020.pptx
Cloud to hybrid edge cloud evolution Jun112020.pptxMichel Burger
 
Getting insights from IoT data with Apache Spark and Apache Bahir
Getting insights from IoT data with Apache Spark and Apache BahirGetting insights from IoT data with Apache Spark and Apache Bahir
Getting insights from IoT data with Apache Spark and Apache BahirLuciano Resende
 
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...mfrancis
 
PowerPoint
PowerPointPowerPoint
PowerPointVideoguy
 
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...confluent
 

Similar to Apache Flume - Streaming data easily to Hadoop from any source for Telco operators (20)

Networking and communications security – network architecture design
Networking and communications security – network architecture designNetworking and communications security – network architecture design
Networking and communications security – network architecture design
 
Splunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageSplunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the message
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
 
Cwin16 tls-a micro-service deployment - v1.0
Cwin16 tls-a micro-service deployment - v1.0Cwin16 tls-a micro-service deployment - v1.0
Cwin16 tls-a micro-service deployment - v1.0
 
WS-VLAM workflow
WS-VLAM workflowWS-VLAM workflow
WS-VLAM workflow
 
Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020
 
adaptTo() 2014 - Integrating Open Source Search with CQ/AEM
adaptTo() 2014 - Integrating Open Source Search with CQ/AEMadaptTo() 2014 - Integrating Open Source Search with CQ/AEM
adaptTo() 2014 - Integrating Open Source Search with CQ/AEM
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Ogce Workflow Suite
Ogce Workflow SuiteOgce Workflow Suite
Ogce Workflow Suite
 
Linked services for the Web of Data
Linked services for the Web of DataLinked services for the Web of Data
Linked services for the Web of Data
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding Overview
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - Introduction
 
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
 
IoTivity for Automotive IoT Interoperability
IoTivity for Automotive IoT InteroperabilityIoTivity for Automotive IoT Interoperability
IoTivity for Automotive IoT Interoperability
 
Open Source XMPP for Cloud Services
Open Source XMPP for Cloud ServicesOpen Source XMPP for Cloud Services
Open Source XMPP for Cloud Services
 
Cloud to hybrid edge cloud evolution Jun112020.pptx
Cloud to hybrid edge cloud evolution Jun112020.pptxCloud to hybrid edge cloud evolution Jun112020.pptx
Cloud to hybrid edge cloud evolution Jun112020.pptx
 
Getting insights from IoT data with Apache Spark and Apache Bahir
Getting insights from IoT data with Apache Spark and Apache BahirGetting insights from IoT data with Apache Spark and Apache Bahir
Getting insights from IoT data with Apache Spark and Apache Bahir
 
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...
 
PowerPoint
PowerPointPowerPoint
PowerPoint
 
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 


Apache Flume - Streaming data easily to Hadoop from any source for Telco operators

  • 1. Apache Flume - Streaming data easily to Hadoop from any source for Telco operators Federico Leven Colin Train
  • 3. About Us: Federico Leven; Lead Big Data Architect @ Nexius; Hadoop integration in the enterprise; Co-founder of Hadoop User Groups in LATAM (Argentina and Chile); VP of Delivery Services @ Nexius; Colin Train; Messi, asado and tango; Previously worked @ Luminar Insights; Skype: @federicol; LinkedIn: https://ar.linkedin.com/in/sojovi
  • 4. Agenda: The Telco Operators' Data Challenges; Introduction to Flume; End-to-End Architecture (from external sources to user reports); NextGen Architecture; Summary - Q/A
  • 6. Communications Service Providers (CSP): Wired and wireless networks; Transport information electronically; Telecommunications Carriers, Cable Service Providers, Satellite Broadcasting Operators. Challenges: Churn; Network Performance; Service Usage. Questions they’re looking to answer: Who are my valuable customers? Which customers are likely to cancel and why? How can I satisfy my customers so they don’t cancel? What services are my customers looking for?
  • 7. Big Data Sources: Volume, Variety, Velocity. CRM POS Care Network RAN/CORE VAS Transmission IN/Billing ERP/Social Media URL DPIs Drive Test/Coverage Probes CDR DWH
  • 8. Traditional Sources + New Sources: CDR (Voice, SMS, MMS, Data); Data Probes (DPI); Network Perf. Data (KPI); Social Media; Web Logs; GPS Coordinates Data
  • 9. A Classic EDW Architecture (Problem) [Diagram labels: Social Media, Web Logs, ETL Server, EDW]
  • 10. Moving to the Hadoop Side of Things [Diagram labels: Social Media, Web Logs, Flume, EDW]
  • 12. What is Flume? Distributed; Reliable; Easy to use; Flexible; Extensible. Move, Collect. HDFS, HBASE, Other. But Kafka? Overlaps many functions with Flume; more generic; more complex to implement
  • 13. Key Definitions. CLIENT: the initial point where events are generated. EVENT: data unit transported by Flume; headers + data as byte array. AGENT: main component in Flume; a container for Sources, Channels and Sinks to transport data (Events); is a JVM. [Diagram labels: SOURCE, CHANNEL, SINK, EVENT]
  • 15. Data Flow Model [Diagram: a single HOST running two agents, each with Source → Channel → Sink passing Events; labels: NFS, HDFS, HTTP API, HBASE]
  • 16. Data Flow Model (Multiplexing/Replicating) [Diagram: one Flume Agent (JVM) whose Source reads from an External Source and feeds Channel 1 and Channel 2, drained by Sink 1 and Sink 2; labels: HDFS, File]
  • 17. Data Flow Model (Fan Out / Consolidation) [Diagram: agents (Source → Channel → Sink) on HOST 1 and HOST 2 passing Events along a chain of agents; labels: NFS, HTTP, HDFS]
  • 18. How-to: Java extensibility
      public class MySource extends AbstractSource {
        public void configure(Context …) { … }
        public void start() { … }
        public void stop() { … }
        public Status process() {
          ..
          Event e = getSomeData();
          getChannelProcessor().processEvent(e);
        }
      }

      public class MyInterceptor implements Interceptor {
        public void initialize() { … }
        public Event intercept(Event event) { … }
        public List<Event> intercept(List<Event> events) { … }
        public void close() { … }
      }

      public class MySink extends AbstractSink {
        public void configure(Context …) { … }
        public void start() { … }
        public void stop() { … }
        public Status process() {
          ..
          Event e = ch.take();
          storeSomeData(e);
        }
      }
  • 20. Hadoop to EDW integration [Diagram columns: Sources, Ingest, Hadoop, EDW. Labels: ERP, CRM, Billing, Customer, CDR, KPI, DPI; Flume Agent(s); TEZ-YARN and HDFS; Hive on Tez; ML classifiers and sentiment analysis (Topic, Sentiment); Vertica, PDW, Oracle; BI Tools]
  • 21. Real Example of Flume Configuration
      FacebookAgent_UAE.sources = FacebookPages_UAE
      FacebookAgent_UAE.channels = MemChannelPages_UAE MemChannelPages_UAEError
      FacebookAgent_UAE.sinks = HDFSFacebookPages_UAE HDFSFacebookPages_UAEError
      FacebookAgent_UAE.sources.FacebookPages_UAE.type = com.nexius...Facebook_Pages
      FacebookAgent_UAE.sources.FacebookPages_UAE.channels = MemChannelPages_UAE MemChannelPages_UAEError
      FacebookAgent_UAE.sources.FacebookPages_UAE.consumerKey = XXXXXX
      FacebookAgent_UAE.sources.FacebookPages_UAE.consumerSecret = XXXXX
      FacebookAgent_UAE.sources.FacebookPages_UAE.postsLimit=20
      FacebookAgent_UAE.sources.FacebookPages_UAE.selector.type = multiplexing
      FacebookAgent_UAE.sources.FacebookPages_UAE.selector.header = msgType
      FacebookAgent_UAE.sources.FacebookPages_UAE.selector.mapping.NORMAL = MemChannelPages_UAE
      FacebookAgent_UAE.sources.FacebookPages_UAE.selector.mapping.ERROR = MemChannelPages_UAEError
      FacebookAgent_UAE.sinks.HDFSFacebookPages_UAE.channel = MemChannelPages_UAE
      FacebookAgent_UAE.sinks.HDFSFacebookPages_UAE.type = hdfs
      FacebookAgent_UAE.sinks.HDFSFacebookPages_UAEError.channel = MemChannelPages_UAEError
      FacebookAgent_UAE.sinks.HDFSFacebookPages_UAEError.type = hdfs
      FacebookAgent_UAE.sources.FacebookPages_UAE.interceptors = KeysReplace
      FacebookAgent_UAE.sources.FacebookPages_UAE.interceptors.KeysReplace.type = com...Replace$Builder
      FacebookAgent_UAE.sources.FacebookPages_UAE.interceptors.KeysReplace.map.from = fr0m
  • 22. From the Back-end to the Front-end 1: Generate Analytics for CSPs based on social media feeds. Analyze what customers are saying about their experience and satisfaction
  • 23. From the Back-end to the Front-end 2: Rich Mapping. Bad sentiment concentration by geographic areas. Identify areas with bad service
  • 25. Next Generation Architecture [Diagram columns: Sources, Ingest, Hadoop, EDW. Labels: ERP, CRM, Billing, Customer, CDR, KPI, DPI; Flume Agent(s); SPARK ON YARN; HDFS; Hive on Tez; MLLib classifiers and sentiment analysis (Topic, Sentiment); Vertica, PDW, Oracle; BI Tools]

Editor's Notes

  1. Good afternoon everyone. Let’s get started; first, thanks for attending this presentation. My name is Federico Leven, I’m the Big Data Architect @ Nexius. In the next slide I’ll talk a little bit more about me and my colleague Colin Train, who is the expert in the telco field. The aim of this presentation is to cover several topics, centered mainly on Flume applied to telco operators and how to implement it for the telecommunications domain. The intention is not to go into the fine details and subtleties of Flume; there is very good documentation and there are good books (some of them will be presented at the end). We will also cover a real end-to-end implementation of our platform, from the external sources to the final user report.
  2. COLIN
  3. As I said, I’m the Big Data Architect @ Nexius. My area of expertise is the integration of Hadoop in the enterprise. We know Hadoop is pretty, fast, stable, etc., but the problems start when it has to be implemented among all the existing systems and IT infrastructure: how to protect the existing investment, optimizing what exists while adding Hadoop and its stack of technologies. Let me tell you some of the things that I’m currently doing. I’m from Argentina, better known for Messi, asado and tango. I’ll let Colin start… COLIN
  4. COLIN
  5. COLIN
  6. COLIN
  7. Here we have a list of the usual data sources a telco operator needs to collect in order to feed its enterprise DWH: from internal systems such as CRM/ERP/Billing to CDR data. CDR is the Call Detail Record, the basic unit of information: every call, every message and every internet packet generates a record that is stored.
  8. I want to pay special attention to CDR and to what we call the new data sources in the Internet 2.0 + Internet of Things era. CDR (Call Detail Record) is a series of different records collecting information about what is happening in the telco operator’s network. We have all the records for voice calls: each call, 1 record. The same goes for SMS, MMS and Data. We also have DPI (Deep Packet Inspection), which records and stores all the data traffic in the network: each time you send a WhatsApp message, browse a website on your mobile or transfer any data from your installed apps, all the packets travelling through the network are stored as records (source IP, destination IP, header, data, TCP and IP headers, etc.). We also have KPI, a set of metrics that measure the performance of the networks and allow the telco companies to monitor the network: what’s going on, whether there are problems, congestion, etc. Each CDR type has a different format, usually structured but not the same, and DPI and KPI have their own formats. So CDR is the classic example of a Big Data source. It has volume, in fact so much volume that telco operators discard most of the available fields to be able to accommodate the data in the DWH. It has variety: at least 6 different formats. It requires velocity, because at the rate it is generated, even on a batch platform, the speed needed to ingest and process all the CDR goes beyond what an RDBMS can manage. Now let’s add the new data sources they are starting to collect to enrich the data and provide a more powerful analytics platform: social media, web logs and geolocated data, both from devices in the network and from customer activity in the network.
  9. So, let’s think about a traditional DWH ingesting, extracting, transforming and loading all these data sources: an ETL or ELT process. This architecture, which worked well for a long time, is showing a lot of limitations, starting with the effort required to connect to and ingest all the data, now including the new data sources. Then everything is transformed and loaded into what we usually call the “staging area” in the DWH, and then processed, aggregated, etc. to produce the final DWH. In this presentation we are going to focus on the first arrow (collection servers to ETL) while showing the entire architecture, moving out from this one and into this one (next slide).
  10. This is how we can apply a modern data architecture. First let’s focus on Flume. For a telco operator, Flume has the advantage of being able to connect to many sources out of the box, with no customized component or development. In the telco domain, the most used sources are: Spooling Directory (reads data from a directory where files are being added); EXEC (executes a Linux command, usually a tail over a log file); SSH / FTP (not built in, but available to download from open source developers). One important characteristic is that Flume is independent of Hadoop. That means Flume can run on a host isolated from Hadoop, and by chaining multiple Flume agents we can go from an external server to HDFS. So, compared with the previous architecture, we have multiple advantages. First, Flume is a flexible and easy-to-use component for ingestion; no ETL is required to store data in HDFS. Then Hadoop processes and analyzes the data collected. Finally the data is moved to the DWH, which will be much smaller and whose primary purpose is supporting user reports with fast online SQL data access.
  11. Distributed service… Reliable (it uses transactions, but reliability also depends on the configuration). Used to collect or move data. Storage: HDFS and HBase, but also others. Use case for Flume: if you have data that needs to be collected and stored in HDFS / Hive or HBase, Flume is one of the best choices, if not the best. Compared with Kafka, at first we can say that Kafka is more generic, it implements coordinated high availability via ZooKeeper, and it is more complex to implement; but in the end you have to investigate and select the component that fits your needs.
  12. A Flume agent is a JVM composed of 3 elements: Sources, Channels and Sinks, transporting a flow of Events.
  13. To have an agent ready to run, we need to configure sources, channels, sinks and, optionally, interceptors. A source is the component in Flume that connects to the external sources and retrieves events. The events are delivered to the channel or channels. The most used types for telco are listed: Spooling Directory is at the top of the list, EXEC and Avro are other sources you’ll use in the telco IT infrastructure, plus custom sources. The interceptor “captures” the events before they leave the source and allows you to manipulate them: modify them, add headers, change data, or even search for some data in the events to decide what to do, discard them or send them to a specific channel. The channel is the “buffering” area in the Flume agent. Events are sent from the source to the channel inside a transaction, and the same happens from the channel to the sink. The Memory channel is “best effort”, meaning that a failure in the Flume JVM will cause the loss of all the events in the channel. For reliable and persistent events, use the File or JDBC channel. The sink is the component that stores our events. There is a set of “variables” or masks that can be used to create a variable output folder name using the day, month, year, hostname, etc. Another technique for reliability is to use sink groups, which will be explained in the next slides.
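    As a minimal sketch of the components described in this note, an agent that reads CDR files from a spooling directory, buffers them in a File channel and writes them to HDFS could be configured roughly as follows. The agent name, component names and all paths are hypothetical and only for illustration; the property keys are the standard ones for the Spooling Directory source, the File channel and the HDFS sink.

    ```properties
    # Hypothetical agent "cdrAgent": every name and path here is an assumption
    cdrAgent.sources = cdrSpool
    cdrAgent.channels = fileCh
    cdrAgent.sinks = hdfsOut

    # Spooling Directory source: picks up files dropped into the directory
    cdrAgent.sources.cdrSpool.type = spooldir
    cdrAgent.sources.cdrSpool.spoolDir = /data/cdr/incoming
    cdrAgent.sources.cdrSpool.channels = fileCh

    # File channel: buffered events survive a JVM failure (unlike the Memory channel)
    cdrAgent.channels.fileCh.type = file
    cdrAgent.channels.fileCh.checkpointDir = /var/flume/checkpoint
    cdrAgent.channels.fileCh.dataDirs = /var/flume/data

    # HDFS sink: %Y, %m and %d are escape sequences resolved per event
    cdrAgent.sinks.hdfsOut.type = hdfs
    cdrAgent.sinks.hdfsOut.channel = fileCh
    cdrAgent.sinks.hdfsOut.hdfs.path = /telco/cdr/%Y/%m/%d
    cdrAgent.sinks.hdfsOut.hdfs.fileType = DataStream
    # Use the agent's clock when events carry no timestamp header
    cdrAgent.sinks.hdfsOut.hdfs.useLocalTimeStamp = true
    ```

    Swapping `file` for `memory` in the channel type would trade durability for throughput, which is the "best effort" behaviour mentioned above.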
  14. This is a basic data flow. In this diagram we have 2 Flume agents running on the same box, each consuming from an external source and storing in a different destination. They are isolated: there’s no connection between them, and it should be the same as running each agent on a separate host. Of course, running multiple agents on one host has some limitations in terms of the number of instances you can have, depending on the hardware you are using. That is a Flume sizing problem that we are not covering here, but you have to know the number and size of the events over time, the source type, the number of Flume instances per host, etc. The interceptor is executed at the source/channel boundary. In the telco domain we usually use interceptors to add a timestamp and the hostname of the Flume agent to the events, and also to select a channel in order to separate records with a wrong format from good data. This is called “multiplexing” and we will see it in the next slide.
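    The timestamp and hostname enrichment mentioned in this note can be sketched with Flume's built-in `timestamp` and `host` interceptors. The agent and source names below are hypothetical, and the fragment assumes the source is declared elsewhere in the same configuration.

    ```properties
    # Hypothetical agent/source names; only the interceptor part is shown
    cdrAgent.sources.cdrSpool.interceptors = ts host

    # Adds a "timestamp" header with the time the event was processed
    cdrAgent.sources.cdrSpool.interceptors.ts.type = timestamp

    # Adds the agent's hostname (rather than its IP) under the header "hostname"
    cdrAgent.sources.cdrSpool.interceptors.host.type = host
    cdrAgent.sources.cdrSpool.interceptors.host.useIP = false
    cdrAgent.sources.cdrSpool.interceptors.host.hostHeader = hostname
    ```

    Interceptors run in the order they are listed, so a later interceptor can rely on headers set by an earlier one.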
  15. We have here 2 different ways to use multiple channels. One, the default behaviour when a source has multiple channels, is called replicating. Replicating means sending the events from the source to all the channels. You can use it to apply a kind of failover for sinks: because you are storing the events in all sinks, if one of the sinks fails, the other sink will still store the events. The other behaviour is called multiplexing and allows us to decide which channel will receive the event. This usually requires the participation of an interceptor, to find something in the event that gives us the criteria to decide which channel is the right one to receive it. As mentioned in the previous slide, I can use it to separate bad records from good records, storing the good records in HDFS and the bad records in the local filesystem for later analysis. Replicating: reliability, no data loss, disaster recovery. Multiplexing: partitioning of data.
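    The good/bad record split described in this note could look roughly like the fragment below. It assumes an interceptor (not shown) has already set a header, here hypothetically called `recordStatus`, to `OK` or `BAD`; the agent, channel and sink names and the paths are also assumptions, and the channel definitions are omitted.

    ```properties
    # Hypothetical names; "recordStatus" is assumed to be set by an interceptor
    cdrAgent.sources.cdrSpool.channels = goodCh badCh
    cdrAgent.sources.cdrSpool.selector.type = multiplexing
    cdrAgent.sources.cdrSpool.selector.header = recordStatus
    cdrAgent.sources.cdrSpool.selector.mapping.OK = goodCh
    cdrAgent.sources.cdrSpool.selector.mapping.BAD = badCh
    # Events without a matching header value fall back to this channel
    cdrAgent.sources.cdrSpool.selector.default = goodCh

    # Good records go to HDFS
    cdrAgent.sinks.hdfsOut.type = hdfs
    cdrAgent.sinks.hdfsOut.channel = goodCh
    cdrAgent.sinks.hdfsOut.hdfs.path = /telco/cdr/good

    # Bad records stay on the local filesystem for later analysis
    cdrAgent.sinks.localBad.type = file_roll
    cdrAgent.sinks.localBad.channel = badCh
    cdrAgent.sinks.localBad.sink.directory = /var/flume/bad-records
    ```

    Leaving out the `selector.*` lines gives the replicating behaviour, since `replicating` is the default selector type.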
  16. Flume can use one Flume agent as the external source and another Flume agent as the sink, which gives us the capacity to create a chain of agents that connects a set of external sources to the final destination. In the telco infrastructure, I can have the following architecture. We need to store in HDFS, in a folder structure partitioned by year/month, data coming from multiple collection servers with CDR logs from different areas. N Flume agents run on a set of collection servers, reading new files from a directory (e.g. on NFS) using the Spooling Directory source. The collection servers have no direct access to Hadoop, so they connect to an agent on a host running on a datanode (to take advantage of locality and short-circuit writes) that consolidates the events from the different Flume agents and stores the data in HDFS.
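    A sketch of this consolidation setup, under stated assumptions: each collection server runs an agent (hypothetically named `collector`) whose Avro sink forwards events to an Avro source on the consolidating agent (`hub`). Host names, ports and paths are illustrative only.

    ```properties
    # --- On each collection server: hypothetical agent "collector" ---
    collector.sources = cdrSpool
    collector.channels = ch
    collector.sinks = toHub
    collector.sources.cdrSpool.type = spooldir
    collector.sources.cdrSpool.spoolDir = /mnt/nfs/cdr
    collector.sources.cdrSpool.channels = ch
    collector.channels.ch.type = file
    # Avro sink: sends events to the consolidating agent (assumed host and port)
    collector.sinks.toHub.type = avro
    collector.sinks.toHub.channel = ch
    collector.sinks.toHub.hostname = datanode01.example.com
    collector.sinks.toHub.port = 4545

    # --- On the consolidating host: hypothetical agent "hub" ---
    hub.sources = fromCollectors
    hub.channels = ch
    hub.sinks = hdfsOut
    # Avro source: receives events from all the collector agents
    hub.sources.fromCollectors.type = avro
    hub.sources.fromCollectors.bind = 0.0.0.0
    hub.sources.fromCollectors.port = 4545
    hub.sources.fromCollectors.channels = ch
    hub.channels.ch.type = file
    # HDFS sink partitioned by year/month
    hub.sinks.hdfsOut.type = hdfs
    hub.sinks.hdfsOut.channel = ch
    hub.sinks.hdfsOut.hdfs.path = /telco/cdr/%Y/%m
    hub.sinks.hdfsOut.hdfs.useLocalTimeStamp = true
    ```

    With `useLocalTimeStamp = true` the partition reflects the time the hub writes the event; partitioning by the original collection time would instead need a timestamp header added on the collector side.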
  17. Here we can see the skeleton of the methods to implement in order to create a custom Source, Sink and Interceptor. The configure method is used mainly to access the configuration for the Flume agent. Start / Stop handle the component lifecycle. Process is where the events are received, processed if needed, and sent to the channel (for a source) or to the storage (for a sink). The interceptor uses the intercept method that handles a single event; the List variant processes the batch of events received by the interceptor, iterating over them.
  18. Now, let’s see a full implementation, in this case the implementation of the platform for analytics. This is how you can deploy Flume and Hadoop as part of an enhanced analytics platform. The goal of this analytics platform is to answer the questions we mentioned in the earlier slide and solve the challenges. Challenges: avoid churn; improve network performance; understand service usage. Questions they’re looking to answer: Who are my valuable customers? Which customers are likely to cancel and why? How can I satisfy my customers so they don’t cancel? What services are my customers looking for? Starting from the sources, we have social media data. We use people’s interactions to determine what they are saying about a telco operator. There’s a single Flume agent for each social network; in rough numbers, 70% comes from Twitter, 20-25% from Facebook, and the rest from the other networks. Then we have data collected from the billing system (to get info about subscribers, billing, contacts with customer care, etc.). This is collected by a single Flume agent from a spooling directory containing the data exports from the DB. Then we have all the CDR data, mainly for Voice, SMS and Data. This data is available on a set of collection servers, so we run a consolidation architecture for each of the services. The same goes for KPI and DPI. Once the data is in HDFS, batch processes are executed over it to classify the social media data (using ML API classifiers) by sentiment, topic and customer, and to try to match a user in the social network with a subscriber of the company. The result of this analysis, plus some data transformation and normalization, is written to HDFS and made available as Hive tables. Then we have the EDW that supports the reports, which are updated with the latest results, adding new records or replacing aggregate tables with new aggregate results. Then the BI tools retrieve data from the EDW, which provides faster SQL access and also lets them take advantage of the more powerful SQL provided by RDBMSs like Vertica, PDW or Oracle. This data transfer is done using the Big Data connectors for Hadoop that these RDBMSs provide (PDW: PolyBase; Vertica: Hadoop connectors; Oracle: Big Data Connectors).
  19. Focus on the multiplexing config and interceptor
  20. This is how we are moving into real-time analytics. On top of Spark, and taking advantage of MLlib, we are replacing the batch-oriented architecture with a real-time architecture. Flume remains almost exactly the same, but now the sink is an Avro sink, tied to a host and a port, which is the Spark application. So, from the Flume side the changes are minimal; the changes to the architecture are on the Hadoop side.
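    On the Flume side, the change described in this note might amount to a fragment like the one below. It is only a sketch: the agent, sink and channel names, the host and the port are assumptions, and it presumes the Spark application is listening for Avro events at that address.

    ```properties
    # Hypothetical names; the HDFS sink is replaced by an Avro sink
    socialAgent.sinks = toSpark
    socialAgent.sinks.toSpark.type = avro
    socialAgent.sinks.toSpark.channel = memCh
    # Assumed host and port where the Spark application receives events
    socialAgent.sinks.toSpark.hostname = spark-receiver.example.com
    socialAgent.sinks.toSpark.port = 41414
    ```

    Sources, interceptors and channels stay as they were, which is why the Flume-side change is small.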