SlideShare a Scribd company logo
1 of 27
1© Copyright 2014 EMC Corporation. All rights reserved.
Real Time Data Streaming
+
Speakers:
Sumit Gupta, Data Intelligene Engineer, EMC
Kartikeya Putturaya, Data Intelligence Engineer, EMC
ChandraSekarRao Venkata, Data Intelligence Engineer, EMC
2© Copyright 2014 EMC Corporation. All rights reserved.
Data Engineering at EMC IT
Stack
Distributed Frameworks: Apache Spark, Pivotal Hadoop, Apache Storm
Messaging Systems: Rabbit MQ, Apache Kafka
Relation Store: Greenplum
A glimpse on what we do
Predictive Maintenance of Exchange Servers - Monitoring over 145 exchange
servers in real time, with an analytics engine running on a 8 node cluster, processing
data volumes of ~100MB per 2 minutes
User Behavior Analytics for Network Threat Detection – Real time monitoring of
EMC’s internal networks and performing user behavior pattern analysis for threats,
again on a 8 node cluster, processing a stream of ~150MB of data any point of time
3© Copyright 2014 EMC Corporation. All rights reserved.
Predictive Maintenance of Exchange Servers
4© Copyright 2014 EMC Corporation. All rights reserved.
User Behavior Analytics for Network Threat Detection
5© Copyright 2014 EMC Corporation. All rights reserved.
Apache Kafka
6© Copyright 2014 EMC Corporation. All rights reserved.
Overview
An apache project initially developed at LinkedIn
Distributed publish-subscribe messaging system
• Designed for processing of real time activity stream data e.g. logs, metrics collections
• Written in Scala
• Does not follow JMS Standards, neither uses JMS APIs
Features
Persistent messaging
High-throughput
Supports both queue and topic semantics
Uses Zookeeper for forming a cluster of nodes (producer/consumer/broker)
and many more…
http://kafka.apache.org/
7© Copyright 2014 EMC Corporation. All rights reserved.
How it works
8© Copyright 2014 EMC Corporation. All rights reserved.
Real time transfer
Broker does not Push messages to Consumer, Consumer Polls messages from Broker.
9© Copyright 2014 EMC Corporation. All rights reserved.
Kafka maintains a feed of messages in categories called topics. For each topic Kafka cluster maintains a
partitioned log
10© Copyright 2014 EMC Corporation. All rights reserved.
Kafka Installation
Download
http://kafka.apache.org/downloads.html
Untar it
> tar -xzf kafka_<version>.tgz
> cd kafka_<version>
11© Copyright 2014 EMC Corporation. All rights reserved.
Start Servers
Start the Zookeeper server
> bin/zookeeper-server-start.sh config/zookeeper.properties
Pre-requisite: Zookeeper should be up and running.
Now Start the Kafka Server
> bin/kafka-server-start.sh config/server.properties
12© Copyright 2014 EMC Corporation. All rights reserved.
Create a topic
> bin/kafka-topics.sh --create --zookeeper
localhost:2181 --replication-factor 1 --partitions 1 --
topic test
List down all topics
> bin/kafka-topics.sh --list --zookeeper
localhost:2181
Output: test
Create/List Topics
13© Copyright 2014 EMC Corporation. All rights reserved.
Producer
Send some Messages
> bin/kafka-console-producer.sh --broker-list
localhost:9092 --topic test
Now type on console:
This is a message
This is another message
14© Copyright 2014 EMC Corporation. All rights reserved.
Consumer
Receive some Messages
> bin/kafka-console-consumer.sh --zookeeper
localhost:2181 --topic test --from-beginning
This is a message
This is another message
15© Copyright 2014 EMC Corporation. All rights reserved.
Copy configs
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
Changes in the config files.
config/server-1.properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
config/server-2.properties:
broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2
Multi-Broker Cluster
16© Copyright 2014 EMC Corporation. All rights reserved.
Start other Nodes with new configs
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &
Create a new topic with replication factor as 3
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --
replication-factor 3 --partitions 1 --topic my-replicated-topic
List down the all topics
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-
replicated-topic
Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs:
Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
Start with New Nodes
17© Copyright 2014 EMC Corporation. All rights reserved.
Spark Streaming
Makes it easy to build scalable fault-tolerant streaming
applications.
Ease of Use
Fault Tolerance
Combine streaming with batch and interactive queries.
18© Copyright 2014 EMC Corporation. All rights reserved.
19© Copyright 2014 EMC Corporation. All rights reserved.
20© Copyright 2014 EMC Corporation. All rights reserved.
Spark Steaming Programming Model
Spark streaming provides a high level abstraction called
Discretized Stream or DStream
- represents a stream of data
- implemented as a sequence of RDDS
21© Copyright 2014 EMC Corporation. All rights reserved.
22© Copyright 2014 EMC Corporation. All rights reserved.
Spark Streaming + Kafka
There are two approaches to receive the data from Kafka for spark streaming
• Receiver based approach
• Direct approach
23© Copyright 2014 EMC Corporation. All rights reserved.
24© Copyright 2014 EMC Corporation. All rights reserved.
#import Streaming Context and KafkaUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 1)
#create KafkaStream by passing zookeeper server address and topic SparkStreaming
kvs = KafkaUtils.createStream(ssc, "localhost:2181",
"spark-streaming-consumer", {“sparkStream":1})
#lines Dstream from KafkaStream
lines = kvs.map(lambda x: x[1])
#count Dstream from lines Dstream
counts = lines.flatMap(lambda line: line.split(" "))  .map(lambda word: (word, 1))
 .reduceByKey(lambda a, b: a+b)
counts.pprint()
ssc.start() ssc.awaitTermination()
25© Copyright 2014 EMC Corporation. All rights reserved.
26© Copyright 2014 EMC Corporation. All rights reserved.
from pyspark.streaming.kafka import KafkaUtils
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list":
brokers})
offsetRanges = []
def storeOffsetRanges(rdd):
global offsetRanges
offsetRanges = rdd.offsetRanges()
return rdd
def printOffsetRanges(rdd):
for o in offsetRanges:
print "%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset)
directKafkaStream
.transform(storeOffsetRanges)
.foreachRDD(printOffsetRanges)
27© Copyright 2014 EMC Corporation. All rights reserved.
Thank You

More Related Content

What's hot

Accelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux KernelAccelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux KernelThomas Graf
 
(NET301) New Capabilities for Amazon Virtual Private Cloud
(NET301) New Capabilities for Amazon Virtual Private Cloud(NET301) New Capabilities for Amazon Virtual Private Cloud
(NET301) New Capabilities for Amazon Virtual Private CloudAmazon Web Services
 
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.LF_OpenvSwitch
 
LF_DPDK17_ OpenVswitch hardware offload over DPDK
LF_DPDK17_ OpenVswitch hardware offload over DPDKLF_DPDK17_ OpenVswitch hardware offload over DPDK
LF_DPDK17_ OpenVswitch hardware offload over DPDKLF_DPDK
 
(NET403) Another Day, Another Billion Packets
(NET403) Another Day, Another Billion Packets(NET403) Another Day, Another Billion Packets
(NET403) Another Day, Another Billion PacketsAmazon Web Services
 
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPF
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPFCilium – Kernel Native Security & DDOS Mitigation for Microservices with BPF
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPFCynthia Thomas
 
Using PerfDHCP tool to scale DHCP in OpenStack Neutron
Using PerfDHCP tool to scale DHCP in OpenStack NeutronUsing PerfDHCP tool to scale DHCP in OpenStack Neutron
Using PerfDHCP tool to scale DHCP in OpenStack NeutronVikram G Hosakote
 
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...Amazon Web Services
 
MicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleMicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleSudhir Tonse
 
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNICIndonesia Network Operators Group
 
LF_OVS_17_Red Hat's perspective on OVS HW Offload Status
LF_OVS_17_Red Hat's perspective on OVS HW Offload StatusLF_OVS_17_Red Hat's perspective on OVS HW Offload Status
LF_OVS_17_Red Hat's perspective on OVS HW Offload StatusLF_OpenvSwitch
 
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLANIndonesia Network Operators Group
 
Docker storage designing a platform for persistent data
Docker storage designing a platform for persistent dataDocker storage designing a platform for persistent data
Docker storage designing a platform for persistent dataDocker, Inc.
 
LF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK
 
Transforming Infrastructure into Code - Importing existing cloud resources u...
Transforming Infrastructure into Code  - Importing existing cloud resources u...Transforming Infrastructure into Code  - Importing existing cloud resources u...
Transforming Infrastructure into Code - Importing existing cloud resources u...Shih Oon Liong
 
Morphology of Modern Data Center Networks - YaC 2013
Morphology of Modern Data Center Networks - YaC 2013Morphology of Modern Data Center Networks - YaC 2013
Morphology of Modern Data Center Networks - YaC 2013Cumulus Networks
 
Addressing DHCP and DNS scalability issues in OpenStack Neutron
Addressing DHCP and DNS scalability issues in OpenStack NeutronAddressing DHCP and DNS scalability issues in OpenStack Neutron
Addressing DHCP and DNS scalability issues in OpenStack NeutronVikram G Hosakote
 

What's hot (20)

Accelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux KernelAccelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux Kernel
 
(NET301) New Capabilities for Amazon Virtual Private Cloud
(NET301) New Capabilities for Amazon Virtual Private Cloud(NET301) New Capabilities for Amazon Virtual Private Cloud
(NET301) New Capabilities for Amazon Virtual Private Cloud
 
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.
LF_OVS_17_Enabling hardware acceleration in OVS-DPDK using DPDK Framework.
 
LF_DPDK17_ OpenVswitch hardware offload over DPDK
LF_DPDK17_ OpenVswitch hardware offload over DPDKLF_DPDK17_ OpenVswitch hardware offload over DPDK
LF_DPDK17_ OpenVswitch hardware offload over DPDK
 
(NET403) Another Day, Another Billion Packets
(NET403) Another Day, Another Billion Packets(NET403) Another Day, Another Billion Packets
(NET403) Another Day, Another Billion Packets
 
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPF
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPFCilium – Kernel Native Security & DDOS Mitigation for Microservices with BPF
Cilium – Kernel Native Security & DDOS Mitigation for Microservices with BPF
 
Using PerfDHCP tool to scale DHCP in OpenStack Neutron
Using PerfDHCP tool to scale DHCP in OpenStack NeutronUsing PerfDHCP tool to scale DHCP in OpenStack Neutron
Using PerfDHCP tool to scale DHCP in OpenStack Neutron
 
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...
AWS re:Invent 2016: Encryption: It Was the Best of Controls, It Was the Worst...
 
BEST REST in OpenStack
BEST REST in OpenStackBEST REST in OpenStack
BEST REST in OpenStack
 
MicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleMicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scale
 
OVS v OVS-DPDK
OVS v OVS-DPDKOVS v OVS-DPDK
OVS v OVS-DPDK
 
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC
02 - IDNOG04 - Sheryl Hermoso (APNIC) - IPv6 Deployment at APNIC
 
LF_OVS_17_Red Hat's perspective on OVS HW Offload Status
LF_OVS_17_Red Hat's perspective on OVS HW Offload StatusLF_OVS_17_Red Hat's perspective on OVS HW Offload Status
LF_OVS_17_Red Hat's perspective on OVS HW Offload Status
 
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN
20 - IDNOG03 - Franki Lim (ARISTA) - Overlay Networking with VXLAN
 
FD.io - The Universal Dataplane
FD.io - The Universal DataplaneFD.io - The Universal Dataplane
FD.io - The Universal Dataplane
 
Docker storage designing a platform for persistent data
Docker storage designing a platform for persistent dataDocker storage designing a platform for persistent data
Docker storage designing a platform for persistent data
 
LF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloads
 
Transforming Infrastructure into Code - Importing existing cloud resources u...
Transforming Infrastructure into Code  - Importing existing cloud resources u...Transforming Infrastructure into Code  - Importing existing cloud resources u...
Transforming Infrastructure into Code - Importing existing cloud resources u...
 
Morphology of Modern Data Center Networks - YaC 2013
Morphology of Modern Data Center Networks - YaC 2013Morphology of Modern Data Center Networks - YaC 2013
Morphology of Modern Data Center Networks - YaC 2013
 
Addressing DHCP and DNS scalability issues in OpenStack Neutron
Addressing DHCP and DNS scalability issues in OpenStack NeutronAddressing DHCP and DNS scalability issues in OpenStack Neutron
Addressing DHCP and DNS scalability issues in OpenStack Neutron
 

Similar to Real time data processing with kafla spark integration

Apache Kafka
Apache KafkaApache Kafka
Apache KafkaJoe Stein
 
Storage networking fcf_co_eiscsivsn_technology
Storage networking fcf_co_eiscsivsn_technologyStorage networking fcf_co_eiscsivsn_technology
Storage networking fcf_co_eiscsivsn_technologyEMC
 
Software Define your Current Storage with Opensource
Software Define your Current Storage with OpensourceSoftware Define your Current Storage with Opensource
Software Define your Current Storage with OpensourceAntonio Romeo
 
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月VirtualTech Japan Inc.
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration storyJoan Viladrosa Riera
 
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraSpark Summit
 
Cisco CSIRT Case Study: Forensic Investigations with NetFlow
Cisco CSIRT Case Study: Forensic Investigations with NetFlowCisco CSIRT Case Study: Forensic Investigations with NetFlow
Cisco CSIRT Case Study: Forensic Investigations with NetFlowLancope, Inc.
 
2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANLdgoodell
 
H2O - the optimized HTTP server
H2O - the optimized HTTP serverH2O - the optimized HTTP server
H2O - the optimized HTTP serverKazuho Oku
 
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration StoryJoan Viladrosa Riera
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopDataWorks Summit
 
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHDEMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD{code}
 
Web scale infrastructures with kubernetes and flannel
Web scale infrastructures with kubernetes and flannelWeb scale infrastructures with kubernetes and flannel
Web scale infrastructures with kubernetes and flannelpurpleocean
 
Copr HD OpenStack Day India
Copr HD OpenStack Day IndiaCopr HD OpenStack Day India
Copr HD OpenStack Day Indiaopenstackindia
 
Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016Kendrick Coleman
 
Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesMesosphere Inc.
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoophadooparchbook
 
26.1.7 lab snort and firewall rules
26.1.7 lab   snort and firewall rules26.1.7 lab   snort and firewall rules
26.1.7 lab snort and firewall rulesFreddy Buenaño
 

Similar to Real time data processing with kafla spark integration (20)

Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Storage networking fcf_co_eiscsivsn_technology
Storage networking fcf_co_eiscsivsn_technologyStorage networking fcf_co_eiscsivsn_technology
Storage networking fcf_co_eiscsivsn_technology
 
Software Define your Current Storage with Opensource
Software Define your Current Storage with OpensourceSoftware Define your Current Storage with Opensource
Software Define your Current Storage with Opensource
 
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月
OpenStackを利用したEnterprise Cloudを支える技術 - OpenStack最新情報セミナー 2016年5月
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
 
Cisco CSIRT Case Study: Forensic Investigations with NetFlow
Cisco CSIRT Case Study: Forensic Investigations with NetFlowCisco CSIRT Case Study: Forensic Investigations with NetFlow
Cisco CSIRT Case Study: Forensic Investigations with NetFlow
 
2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL
 
H2O - the optimized HTTP server
H2O - the optimized HTTP serverH2O - the optimized HTTP server
H2O - the optimized HTTP server
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHDEMC World 2016 - code.05 Automating your Physical Data Center with RackHD
EMC World 2016 - code.05 Automating your Physical Data Center with RackHD
 
Web scale infrastructures with kubernetes and flannel
Web scale infrastructures with kubernetes and flannelWeb scale infrastructures with kubernetes and flannel
Web scale infrastructures with kubernetes and flannel
 
Copr HD OpenStack Day India
Copr HD OpenStack Day IndiaCopr HD OpenStack Day India
Copr HD OpenStack Day India
 
Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016Automating Your Data Center with RackHD - EMC World 2016
Automating Your Data Center with RackHD - EMC World 2016
 
Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data Services
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
26.1.7 lab snort and firewall rules
26.1.7 lab   snort and firewall rules26.1.7 lab   snort and firewall rules
26.1.7 lab snort and firewall rules
 

Recently uploaded

Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfFIDO Alliance
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfFIDO Alliance
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FIDO Alliance
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsUXDXConf
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekCzechDreamin
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...FIDO Alliance
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKUXDXConf
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyJohn Staveley
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?Mark Billinghurst
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfFIDO Alliance
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka DoktorováCzechDreamin
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...CzechDreamin
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...CzechDreamin
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfFIDO Alliance
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeCzechDreamin
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 

Recently uploaded (20)

Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAK
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 

Real time data processing with kafla spark integration

  • 1. 1© Copyright 2014 EMC Corporation. All rights reserved. Real Time Data Streaming + Speakers: Sumit Gupta, Data Intelligene Engineer, EMC Kartikeya Putturaya, Data Intelligence Engineer, EMC ChandraSekarRao Venkata, Data Intelligence Engineer, EMC
  • 2. 2© Copyright 2014 EMC Corporation. All rights reserved. Data Engineering at EMC IT Stack Distributed Frameworks: Apache Spark, Pivotal Hadoop, Apache Storm Messaging Systems: Rabbit MQ, Apache Kafka Relation Store: Greenplum A glimpse on what we do Predictive Maintenance of Exchange Servers - Monitoring over 145 exchange servers in real time, with an analytics engine running on a 8 node cluster, processing data volumes of ~100MB per 2 minutes User Behavior Analytics for Network Threat Detection – Real time monitoring of EMC’s internal networks and performing user behavior pattern analysis for threats, again on a 8 node cluster, processing a stream of ~150MB of data any point of time
  • 3. 3© Copyright 2014 EMC Corporation. All rights reserved. Predictive Maintenance of Exchange Servers
  • 4. 4© Copyright 2014 EMC Corporation. All rights reserved. User Behavior Analytics for Network Threat Detection
  • 5. 5© Copyright 2014 EMC Corporation. All rights reserved. Apache Kafka
  • 6. 6© Copyright 2014 EMC Corporation. All rights reserved. Overview An apache project initially developed at LinkedIn Distributed publish-subscribe messaging system • Designed for processing of real time activity stream data e.g. logs, metrics collections • Written in Scala • Does not follow JMS Standards, neither uses JMS APIs Features Persistent messaging High-throughput Supports both queue and topic semantics Uses Zookeeper for forming a cluster of nodes (producer/consumer/broker) and many more… http://kafka.apache.org/
  • 7. 7© Copyright 2014 EMC Corporation. All rights reserved. How it works
  • 8. 8© Copyright 2014 EMC Corporation. All rights reserved. Real time transfer Broker does not Push messages to Consumer, Consumer Polls messages from Broker.
  • 9. 9© Copyright 2014 EMC Corporation. All rights reserved. Kafka maintains a feed of messages in categories called topics. For each topic Kafka cluster maintains a partitioned log
  • 10. 10© Copyright 2014 EMC Corporation. All rights reserved. Kafka Installation Download http://kafka.apache.org/downloads.html Untar it > tar -xzf kafka_<version>.tgz > cd kafka_<version>
  • 11. 11© Copyright 2014 EMC Corporation. All rights reserved. Start Servers Start the Zookeeper server > bin/zookeeper-server-start.sh config/zookeeper.properties Pre-requisite: Zookeeper should be up and running. Now Start the Kafka Server > bin/kafka-server-start.sh config/server.properties
  • 12. 12© Copyright 2014 EMC Corporation. All rights reserved. Create a topic > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 -- topic test List down all topics > bin/kafka-topics.sh --list --zookeeper localhost:2181 Output: test Create/List Topics
  • 13. 13© Copyright 2014 EMC Corporation. All rights reserved. Producer Send some Messages > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test Now type on console: This is a message This is another message
  • 14. 14© Copyright 2014 EMC Corporation. All rights reserved. Consumer Receive some Messages > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning This is a message This is another message
  • 15. 15© Copyright 2014 EMC Corporation. All rights reserved. Copy configs > cp config/server.properties config/server-1.properties > cp config/server.properties config/server-2.properties Changes in the config files. config/server-1.properties: broker.id=1 port=9093 log.dir=/tmp/kafka-logs-1 config/server-2.properties: broker.id=2 port=9094 log.dir=/tmp/kafka-logs-2 Multi-Broker Cluster
  • 16. 16© Copyright 2014 EMC Corporation. All rights reserved. Start other Nodes with new configs > bin/kafka-server-start.sh config/server-1.properties & > bin/kafka-server-start.sh config/server-2.properties & Create a new topic with replication factor as 3 > bin/kafka-topics.sh --create --zookeeper localhost:2181 -- replication-factor 3 --partitions 1 --topic my-replicated-topic List down the all topics > bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my- replicated-topic Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs: Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0 Start with New Nodes
  • 17. 17© Copyright 2014 EMC Corporation. All rights reserved. Spark Streaming Makes it easy to build scalable fault-tolerant streaming applications. Ease of Use Fault Tolerance Combine streaming with batch and interactive queries.
  • 18. 18© Copyright 2014 EMC Corporation. All rights reserved.
  • 19. 19© Copyright 2014 EMC Corporation. All rights reserved.
  • 20. 20© Copyright 2014 EMC Corporation. All rights reserved. Spark Steaming Programming Model Spark streaming provides a high level abstraction called Discretized Stream or DStream - represents a stream of data - implemented as a sequence of RDDS
  • 21. 21© Copyright 2014 EMC Corporation. All rights reserved.
  • 22. 22© Copyright 2014 EMC Corporation. All rights reserved. Spark Streaming + Kafka There are two approaches to receive the data from Kafka for spark streaming • Receiver based approach • Direct approach
  • 23. 23© Copyright 2014 EMC Corporation. All rights reserved.
  • 24. 24© Copyright 2014 EMC Corporation. All rights reserved. #import Streaming Context and KafkaUtils from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils sc = SparkContext(appName="PythonStreamingKafkaWordCount") ssc = StreamingContext(sc, 1) #create KafkaStream by passing zookeeper server address and topic SparkStreaming kvs = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {“sparkStream":1}) #lines Dstream from KafkaStream lines = kvs.map(lambda x: x[1]) #count Dstream from lines Dstream counts = lines.flatMap(lambda line: line.split(" ")) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a+b) counts.pprint() ssc.start() ssc.awaitTermination()
  • 25. 25© Copyright 2014 EMC Corporation. All rights reserved.
  • 26. 26© Copyright 2014 EMC Corporation. All rights reserved. from pyspark.streaming.kafka import KafkaUtils directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers}) offsetRanges = [] def storeOffsetRanges(rdd): global offsetRanges offsetRanges = rdd.offsetRanges() return rdd def printOffsetRanges(rdd): for o in offsetRanges: print "%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset) directKafkaStream .transform(storeOffsetRanges) .foreachRDD(printOffsetRanges)
  • 27. 27© Copyright 2014 EMC Corporation. All rights reserved. Thank You