SlideShare a Scribd company logo
1 of 27
1© Copyright 2014 EMC Corporation. All rights reserved.
Real Time Data Streaming
+
Speakers:
Sumit Gupta, Data Intelligene Engineer, EMC
Kartikeya Putturaya, Data Intelligence Engineer, EMC
ChandraSekarRao Venkata, Data Intelligence Engineer, EMC
2© Copyright 2014 EMC Corporation. All rights reserved.
Data Engineering at EMC IT
Stack
Distributed Frameworks: Apache Spark, Pivotal Hadoop, Apache Storm
Messaging Systems: Rabbit MQ, Apache Kafka
Relation Store: Greenplum
A glimpse on what we do
Predictive Maintenance of Exchange Servers - Monitoring over 145 exchange
servers in real time, with an analytics engine running on a 8 node cluster, processing
data volumes of ~100MB per 2 minutes
User Behavior Analytics for Network Threat Detection – Real time monitoring of
EMC’s internal networks and performing user behavior pattern analysis for threats,
again on a 8 node cluster, processing a stream of ~150MB of data any point of time
3© Copyright 2014 EMC Corporation. All rights reserved.
Predictive Maintenance of Exchange Servers
4© Copyright 2014 EMC Corporation. All rights reserved.
User Behavior Analytics for Network Threat Detection
5© Copyright 2014 EMC Corporation. All rights reserved.
Apache Kafka
6© Copyright 2014 EMC Corporation. All rights reserved.
Overview
An apache project initially developed at LinkedIn
Distributed publish-subscribe messaging system
• Designed for processing of real time activity stream data e.g. logs, metrics collections
• Written in Scala
• Does not follow JMS Standards, neither uses JMS APIs
Features
Persistent messaging
High-throughput
Supports both queue and topic semantics
Uses Zookeeper for forming a cluster of nodes (producer/consumer/broker)
and many more…
http://kafka.apache.org/
7© Copyright 2014 EMC Corporation. All rights reserved.
How it works
8© Copyright 2014 EMC Corporation. All rights reserved.
Real time transfer
Broker does not Push messages to Consumer, Consumer Polls messages from Broker.
9© Copyright 2014 EMC Corporation. All rights reserved.
Kafka maintains a feed of messages in categories called topics. For each topic Kafka cluster maintains a
partitioned log
10© Copyright 2014 EMC Corporation. All rights reserved.
Kafka Installation
Download
http://kafka.apache.org/downloads.html
Untar it
> tar -xzf kafka_<version>.tgz
> cd kafka_<version>
11© Copyright 2014 EMC Corporation. All rights reserved.
Start Servers
Start the Zookeeper server
> bin/zookeeper-server-start.sh config/zookeeper.properties
Pre-requisite: Zookeeper should be up and running.
Now Start the Kafka Server
> bin/kafka-server-start.sh config/server.properties
12© Copyright 2014 EMC Corporation. All rights reserved.
Create a topic
> bin/kafka-topics.sh --create --zookeeper
localhost:2181 --replication-factor 1 --partitions 1 --
topic test
List down all topics
> bin/kafka-topics.sh --list --zookeeper
localhost:2181
Output: test
Create/List Topics
13© Copyright 2014 EMC Corporation. All rights reserved.
Producer
Send some Messages
> bin/kafka-console-producer.sh --broker-list
localhost:9092 --topic test
Now type on console:
This is a message
This is another message
14© Copyright 2014 EMC Corporation. All rights reserved.
Consumer
Receive some Messages
> bin/kafka-console-consumer.sh --zookeeper
localhost:2181 --topic test --from-beginning
This is a message
This is another message
15© Copyright 2014 EMC Corporation. All rights reserved.
Copy configs
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
Changes in the config files.
config/server-1.properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
config/server-2.properties:
broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2
Multi-Broker Cluster
16© Copyright 2014 EMC Corporation. All rights reserved.
Start other Nodes with new configs
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &
Create a new topic with replication factor as 3
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --
replication-factor 3 --partitions 1 --topic my-replicated-topic
List down the all topics
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-
replicated-topic
Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs:
Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
Start with New Nodes
17© Copyright 2014 EMC Corporation. All rights reserved.
Spark Streaming
Makes it easy to build scalable fault-tolerant streaming
applications.
Ease of Use
Fault Tolerance
Combine streaming with batch and interactive queries.
18© Copyright 2014 EMC Corporation. All rights reserved.
19© Copyright 2014 EMC Corporation. All rights reserved.
20© Copyright 2014 EMC Corporation. All rights reserved.
Spark Steaming Programming Model
Spark streaming provides a high level abstraction called
Discretized Stream or DStream
- represents a stream of data
- implemented as a sequence of RDDS
21© Copyright 2014 EMC Corporation. All rights reserved.
22© Copyright 2014 EMC Corporation. All rights reserved.
Spark Streaming + Kafka
There are two approaches to receive the data from Kafka for spark streaming
• Receiver based approach
• Direct approach
23© Copyright 2014 EMC Corporation. All rights reserved.
24© Copyright 2014 EMC Corporation. All rights reserved.
#import Streaming Context and KafkaUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 1)
#create KafkaStream by passing zookeeper server address and topic SparkStreaming
kvs = KafkaUtils.createStream(ssc, "localhost:2181",
"spark-streaming-consumer", {“sparkStream":1})
#lines Dstream from KafkaStream
lines = kvs.map(lambda x: x[1])
#count Dstream from lines Dstream
counts = lines.flatMap(lambda line: line.split(" "))  .map(lambda word: (word, 1))
 .reduceByKey(lambda a, b: a+b)
counts.pprint()
ssc.start() ssc.awaitTermination()
25© Copyright 2014 EMC Corporation. All rights reserved.
26© Copyright 2014 EMC Corporation. All rights reserved.
from pyspark.streaming.kafka import KafkaUtils
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list":
brokers})
offsetRanges = []
def storeOffsetRanges(rdd):
global offsetRanges
offsetRanges = rdd.offsetRanges()
return rdd
def printOffsetRanges(rdd):
for o in offsetRanges:
print "%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset)
directKafkaStream
.transform(storeOffsetRanges)
.foreachRDD(printOffsetRanges)
27© Copyright 2014 EMC Corporation. All rights reserved.
Thank You

More Related Content

What's hot

Accelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux KernelAccelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux KernelThomas Graf
 
(NET301) New Capabilities for Amazon Virtual Private Cloud
(NET301) New Capabilities for Amazon Virtual Private Cloud(NET301) New Capabilities for Amazon Virtual Private Cloud
(NET301) New Capabilities for Amazon Virtual Private CloudAmazon Web Services
 
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.LF_OpenvSwitch
 
LF_DPDK17_ OpenVswitch hardware offload over DPDK
LF_DPDK17_ OpenVswitch hardware offload over DPDKLF_DPDK17_ OpenVswitch hardware offload over DPDK
LF_DPDK17_ OpenVswitch hardware offload over DPDKLF_DPDK
 
(NET403) Another Day, Another Billion Packets
(NET403) Another Day, Another Billion Packets(NET403) Another Day, Another Billion Packets
(NET403) Another Day, Another Billion PacketsAmazon Web Services
 
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPF
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPFCilium – Kernel Native Security & DDOS Mitigation for Microservices with BPF
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPFCynthia Thomas
 
Using PerfDHCP tool to scale DHCP in OpenStack Neutron
Using PerfDHCP tool to scale DHCP in OpenStack NeutronUsing PerfDHCP tool to scale DHCP in OpenStack Neutron
Using PerfDHCP tool to scale DHCP in OpenStack NeutronVikram G Hosakote
 
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...Amazon Web Services
 
MicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleMicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleSudhir Tonse
 
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNICIndonesia Network Operators Group
 
LF_OVS_17_Red Hat's perspective on OVS HW Offload Status
LF_OVS_17_Red Hat's perspective on OVS HW Offload StatusLF_OVS_17_Red Hat's perspective on OVS HW Offload Status
LF_OVS_17_Red Hat's perspective on OVS HW Offload StatusLF_OpenvSwitch
 
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLANIndonesia Network Operators Group
 
Docker storage designing a platform for persistent data
Docker storage designing a platform for persistent dataDocker storage designing a platform for persistent data
Docker storage designing a platform for persistent dataDocker, Inc.
 
LF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK
 
Transforming Infrastructure into Code - Importing existing cloud resources u...
Transforming Infrastructure into Code  - Importing existing cloud resources u...Transforming Infrastructure into Code  - Importing existing cloud resources u...
Transforming Infrastructure into Code - Importing existing cloud resources u...Shih Oon Liong
 
Morphology of Modern Data Center Networks - YaC 2013
Morphology of Modern Data Center Networks - YaC 2013Morphology of Modern Data Center Networks - YaC 2013
Morphology of Modern Data Center Networks - YaC 2013Cumulus Networks
 
Addressing DHCP and DNS scalability issues in OpenStack Neutron
Addressing DHCP and DNS scalability issues in OpenStack NeutronAddressing DHCP and DNS scalability issues in OpenStack Neutron
Addressing DHCP and DNS scalability issues in OpenStack NeutronVikram G Hosakote
 

What's hot (20)

Accelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux KernelAccelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux Kernel
 
(NET301) New Capabilities for Amazon Virtual Private Cloud
(NET301) New Capabilities for Amazon Virtual Private Cloud(NET301) New Capabilities for Amazon Virtual Private Cloud
(NET301) New Capabilities for Amazon Virtual Private Cloud
 
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.
 
LF_DPDK17_ OpenVswitch hardware offload over DPDK
LF_DPDK17_ OpenVswitch hardware offload over DPDKLF_DPDK17_ OpenVswitch hardware offload over DPDK
LF_DPDK17_ OpenVswitch hardware offload over DPDK
 
(NET403) Another Day, Another Billion Packets
(NET403) Another Day, Another Billion Packets(NET403) Another Day, Another Billion Packets
(NET403) Another Day, Another Billion Packets
 
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPF
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPFCilium – Kernel Native Security & DDOS Mitigation for Microservices with BPF
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPF
 
Using PerfDHCP tool to scale DHCP in OpenStack Neutron
Using PerfDHCP tool to scale DHCP in OpenStack NeutronUsing PerfDHCP tool to scale DHCP in OpenStack Neutron
Using PerfDHCP tool to scale DHCP in OpenStack Neutron
 
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...
 
BEST REST in OpenStack
BEST REST in OpenStackBEST REST in OpenStack
BEST REST in OpenStack
 
MicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleMicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scale
 
OVS v OVS-DPDK
OVS v OVS-DPDKOVS v OVS-DPDK
OVS v OVS-DPDK
 
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC
 
LF_OVS_17_Red Hat's perspective on OVS HW Offload Status
LF_OVS_17_Red Hat's perspective on OVS HW Offload StatusLF_OVS_17_Red Hat's perspective on OVS HW Offload Status
LF_OVS_17_Red Hat's perspective on OVS HW Offload Status
 
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN
 
FD.io - The Universal Dataplane
FD.io - The Universal DataplaneFD.io - The Universal Dataplane
FD.io - The Universal Dataplane
 
Docker storage designing a platform for persistent data
Docker storage designing a platform for persistent dataDocker storage designing a platform for persistent data
Docker storage designing a platform for persistent data
 
LF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloads
 
Transforming Infrastructure into Code - Importing existing cloud resources u...
Transforming Infrastructure into Code  - Importing existing cloud resources u...Transforming Infrastructure into Code  - Importing existing cloud resources u...
Transforming Infrastructure into Code - Importing existing cloud resources u...
 
Morphology of Modern Data Center Networks - YaC 2013
Morphology of Modern Data Center Networks - YaC 2013Morphology of Modern Data Center Networks - YaC 2013
Morphology of Modern Data Center Networks - YaC 2013
 
Addressing DHCP and DNS scalability issues in OpenStack Neutron
Addressing DHCP and DNS scalability issues in OpenStack NeutronAddressing DHCP and DNS scalability issues in OpenStack Neutron
Addressing DHCP and DNS scalability issues in OpenStack Neutron
 

Similar to Real time data processing with kafla spark integration

Apache Kafka
Apache KafkaApache Kafka
Apache KafkaJoe Stein
 
Storage networking fcf_co_eiscsivsn_technology
Storage networking fcf_co_eiscsivsn_technologyStorage networking fcf_co_eiscsivsn_technology
Storage networking fcf_co_eiscsivsn_technologyEMC
 
Software Define your Current Storage with Opensource
Software Define your Current Storage with OpensourceSoftware Define your Current Storage with Opensource
Software Define your Current Storage with OpensourceAntonio Romeo
 
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月VirtualTech Japan Inc.
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration storyJoan Viladrosa Riera
 
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraSpark Summit
 
Cisco CSIRT Case Study: Forensic Investigations with NetFlow
Cisco CSIRT Case Study: Forensic Investigations with NetFlowCisco CSIRT Case Study: Forensic Investigations with NetFlow
Cisco CSIRT Case Study: Forensic Investigations with NetFlowLancope, Inc.
 
2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANLdgoodell
 
H2O - the optimized HTTP server
H2O - the optimized HTTP serverH2O - the optimized HTTP server
H2O - the optimized HTTP serverKazuho Oku
 
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration StoryJoan Viladrosa Riera
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopDataWorks Summit
 
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHDEMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD{code}
 
Web scale infrastructures with kubernetes and flannel
Web scale infrastructures with kubernetes and flannelWeb scale infrastructures with kubernetes and flannel
Web scale infrastructures with kubernetes and flannelpurpleocean
 
Copr HD OpenStack Day India
Copr HD OpenStack Day IndiaCopr HD OpenStack Day India
Copr HD OpenStack Day Indiaopenstackindia
 
Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016Kendrick Coleman
 
Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesMesosphere Inc.
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoophadooparchbook
 
26.1.7 lab snort and firewall rules
26.1.7 lab   snort and firewall rules26.1.7 lab   snort and firewall rules
26.1.7 lab snort and firewall rulesFreddy Buenaño
 

Similar to Real time data processing with kafla spark integration (20)

Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Storage networking fcf_co_eiscsivsn_technology
Storage networking fcf_co_eiscsivsn_technologyStorage networking fcf_co_eiscsivsn_technology
Storage networking fcf_co_eiscsivsn_technology
 
Software Define your Current Storage with Opensource
Software Define your Current Storage with OpensourceSoftware Define your Current Storage with Opensource
Software Define your Current Storage with Opensource
 
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
 
Cisco CSIRT Case Study: Forensic Investigations with NetFlow
Cisco CSIRT Case Study: Forensic Investigations with NetFlowCisco CSIRT Case Study: Forensic Investigations with NetFlow
Cisco CSIRT Case Study: Forensic Investigations with NetFlow
 
2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL
 
H2O - the optimized HTTP server
H2O - the optimized HTTP serverH2O - the optimized HTTP server
H2O - the optimized HTTP server
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHDEMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD
 
Web scale infrastructures with kubernetes and flannel
Web scale infrastructures with kubernetes and flannelWeb scale infrastructures with kubernetes and flannel
Web scale infrastructures with kubernetes and flannel
 
Copr HD OpenStack Day India
Copr HD OpenStack Day IndiaCopr HD OpenStack Day India
Copr HD OpenStack Day India
 
Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016
 
Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data Services
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
26.1.7 lab snort and firewall rules
26.1.7 lab   snort and firewall rules26.1.7 lab   snort and firewall rules
26.1.7 lab snort and firewall rules
 

Recently uploaded

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Recently uploaded (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Real time data processing with kafla spark integration

  • 1. 1© Copyright 2014 EMC Corporation. All rights reserved. Real Time Data Streaming + Speakers: Sumit Gupta, Data Intelligene Engineer, EMC Kartikeya Putturaya, Data Intelligence Engineer, EMC ChandraSekarRao Venkata, Data Intelligence Engineer, EMC
  • 2. 2© Copyright 2014 EMC Corporation. All rights reserved. Data Engineering at EMC IT Stack Distributed Frameworks: Apache Spark, Pivotal Hadoop, Apache Storm Messaging Systems: Rabbit MQ, Apache Kafka Relation Store: Greenplum A glimpse on what we do Predictive Maintenance of Exchange Servers - Monitoring over 145 exchange servers in real time, with an analytics engine running on a 8 node cluster, processing data volumes of ~100MB per 2 minutes User Behavior Analytics for Network Threat Detection – Real time monitoring of EMC’s internal networks and performing user behavior pattern analysis for threats, again on a 8 node cluster, processing a stream of ~150MB of data any point of time
  • 3. 3© Copyright 2014 EMC Corporation. All rights reserved. Predictive Maintenance of Exchange Servers
  • 4. 4© Copyright 2014 EMC Corporation. All rights reserved. User Behavior Analytics for Network Threat Detection
  • 5. 5© Copyright 2014 EMC Corporation. All rights reserved. Apache Kafka
  • 6. 6© Copyright 2014 EMC Corporation. All rights reserved. Overview An apache project initially developed at LinkedIn Distributed publish-subscribe messaging system • Designed for processing of real time activity stream data e.g. logs, metrics collections • Written in Scala • Does not follow JMS Standards, neither uses JMS APIs Features Persistent messaging High-throughput Supports both queue and topic semantics Uses Zookeeper for forming a cluster of nodes (producer/consumer/broker) and many more… http://kafka.apache.org/
  • 7. 7© Copyright 2014 EMC Corporation. All rights reserved. How it works
  • 8. 8© Copyright 2014 EMC Corporation. All rights reserved. Real time transfer Broker does not Push messages to Consumer, Consumer Polls messages from Broker.
  • 9. 9© Copyright 2014 EMC Corporation. All rights reserved. Kafka maintains a feed of messages in categories called topics. For each topic Kafka cluster maintains a partitioned log
  • 10. 10© Copyright 2014 EMC Corporation. All rights reserved. Kafka Installation Download http://kafka.apache.org/downloads.html Untar it > tar -xzf kafka_<version>.tgz > cd kafka_<version>
  • 11. 11© Copyright 2014 EMC Corporation. All rights reserved. Start Servers Start the Zookeeper server > bin/zookeeper-server-start.sh config/zookeeper.properties Pre-requisite: Zookeeper should be up and running. Now Start the Kafka Server > bin/kafka-server-start.sh config/server.properties
  • 12. 12© Copyright 2014 EMC Corporation. All rights reserved. Create a topic > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 -- topic test List down all topics > bin/kafka-topics.sh --list --zookeeper localhost:2181 Output: test Create/List Topics
  • 13. 13© Copyright 2014 EMC Corporation. All rights reserved. Producer Send some Messages > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test Now type on console: This is a message This is another message
  • 14. 14© Copyright 2014 EMC Corporation. All rights reserved. Consumer Receive some Messages > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning This is a message This is another message
  • 15. 15© Copyright 2014 EMC Corporation. All rights reserved. Copy configs > cp config/server.properties config/server-1.properties > cp config/server.properties config/server-2.properties Changes in the config files. config/server-1.properties: broker.id=1 port=9093 log.dir=/tmp/kafka-logs-1 config/server-2.properties: broker.id=2 port=9094 log.dir=/tmp/kafka-logs-2 Multi-Broker Cluster
  • 16. 16© Copyright 2014 EMC Corporation. All rights reserved. Start other Nodes with new configs > bin/kafka-server-start.sh config/server-1.properties & > bin/kafka-server-start.sh config/server-2.properties & Create a new topic with replication factor as 3 > bin/kafka-topics.sh --create --zookeeper localhost:2181 -- replication-factor 3 --partitions 1 --topic my-replicated-topic List down the all topics > bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my- replicated-topic Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs: Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0 Start with New Nodes
  • 17. 17© Copyright 2014 EMC Corporation. All rights reserved. Spark Streaming Makes it easy to build scalable fault-tolerant streaming applications. Ease of Use Fault Tolerance Combine streaming with batch and interactive queries.
  • 18. 18© Copyright 2014 EMC Corporation. All rights reserved.
  • 19. 19© Copyright 2014 EMC Corporation. All rights reserved.
  • 20. 20© Copyright 2014 EMC Corporation. All rights reserved. Spark Steaming Programming Model Spark streaming provides a high level abstraction called Discretized Stream or DStream - represents a stream of data - implemented as a sequence of RDDS
  • 21. 21© Copyright 2014 EMC Corporation. All rights reserved.
  • 22. 22© Copyright 2014 EMC Corporation. All rights reserved. Spark Streaming + Kafka There are two approaches to receive the data from Kafka for spark streaming • Receiver based approach • Direct approach
  • 23. 23© Copyright 2014 EMC Corporation. All rights reserved.
  • 24. 24© Copyright 2014 EMC Corporation. All rights reserved. #import Streaming Context and KafkaUtils from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils sc = SparkContext(appName="PythonStreamingKafkaWordCount") ssc = StreamingContext(sc, 1) #create KafkaStream by passing zookeeper server address and topic SparkStreaming kvs = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {“sparkStream":1}) #lines Dstream from KafkaStream lines = kvs.map(lambda x: x[1]) #count Dstream from lines Dstream counts = lines.flatMap(lambda line: line.split(" ")) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a+b) counts.pprint() ssc.start() ssc.awaitTermination()
  • 25. 25© Copyright 2014 EMC Corporation. All rights reserved.
  • 26. 26© Copyright 2014 EMC Corporation. All rights reserved. from pyspark.streaming.kafka import KafkaUtils directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers}) offsetRanges = [] def storeOffsetRanges(rdd): global offsetRanges offsetRanges = rdd.offsetRanges() return rdd def printOffsetRanges(rdd): for o in offsetRanges: print "%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset) directKafkaStream .transform(storeOffsetRanges) .foreachRDD(printOffsetRanges)
  • 27. 27© Copyright 2014 EMC Corporation. All rights reserved. Thank You