
Real-time data processing with Kafka and Spark integration


Building a Spark Streaming application on top of Kafka: Kafka and Spark integration, how to start a Kafka server and configure it to run in cluster mode, and the receiver-based and direct approaches to reading data from Kafka.

Published in: Technology


  1. © Copyright 2014 EMC Corporation. All rights reserved. Real Time Data Streaming. Speakers: Sumit Gupta, Data Intelligence Engineer, EMC; Kartikeya Putturaya, Data Intelligence Engineer, EMC; ChandraSekarRao Venkata, Data Intelligence Engineer, EMC
  2. Data Engineering at EMC IT
     Stack
     • Distributed Frameworks: Apache Spark, Pivotal Hadoop, Apache Storm
     • Messaging Systems: RabbitMQ, Apache Kafka
     • Relational Store: Greenplum
     A glimpse of what we do
     • Predictive Maintenance of Exchange Servers: monitoring over 145 Exchange servers in real time, with an analytics engine running on an 8-node cluster that processes ~100 MB of data every 2 minutes
     • User Behavior Analytics for Network Threat Detection: real-time monitoring of EMC's internal networks and user behavior pattern analysis for threats, again on an 8-node cluster, processing a stream of ~150 MB of data at any point in time
  3. Predictive Maintenance of Exchange Servers
  4. User Behavior Analytics for Network Threat Detection
  5. Apache Kafka
  6. Overview
     An Apache project initially developed at LinkedIn
     Distributed publish-subscribe messaging system
     • Designed for processing real-time activity stream data, e.g. logs and metrics collections
     • Written in Scala
     • Does not follow the JMS standard and does not use JMS APIs
     Features
     • Persistent messaging
     • High throughput
     • Supports both queue and topic semantics
     • Uses ZooKeeper to form a cluster of nodes (producer/consumer/broker)
     • and many more…
     http://kafka.apache.org/
  7. How it works
  8. Real-time transfer
     The broker does not push messages to the consumer; the consumer polls the broker for messages.
  9. Kafka maintains feeds of messages in categories called topics. For each topic, the Kafka cluster maintains a partitioned log.
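The pull model and the partitioned log from slides 8 and 9 can be illustrated with a toy in-memory model. This is only a sketch: `TopicLog` and `Consumer` are hypothetical names and not part of any Kafka client API, and partition selection by CRC32 of the key is an assumption made for the example.

```python
import zlib
from collections import defaultdict


class TopicLog:
    """Toy stand-in for one topic: a list of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # Pick a partition from the key (assumed scheme) and append at the end.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(message)
        return partition, len(log) - 1  # (partition, offset of the new message)

    def read(self, partition, offset, max_messages=10):
        # Reading never removes messages; the log is persistent.
        return self.partitions[partition][offset:offset + max_messages]


class Consumer:
    """The consumer tracks its own position and pulls; the log never pushes."""

    def __init__(self, topic_log):
        self.topic_log = topic_log
        self.offsets = defaultdict(int)  # next offset to read, per partition

    def poll(self, partition):
        batch = self.topic_log.read(partition, self.offsets[partition])
        self.offsets[partition] += len(batch)
        return batch


log = TopicLog(num_partitions=1)
log.append("user-1", "This is a message")
log.append("user-1", "This is another message")

consumer = Consumer(log)
first = consumer.poll(0)
second = consumer.poll(0)  # nothing new has been appended since the first poll
```

Because the consumer owns its offset, a second consumer created on the same `TopicLog` would start from offset 0 and see both messages again.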
  10. Kafka Installation
      Download: http://kafka.apache.org/downloads.html
      Untar it:
      > tar -xzf kafka_<version>.tgz
      > cd kafka_<version>
  11. Start Servers
      Start the ZooKeeper server:
      > bin/zookeeper-server-start.sh config/zookeeper.properties
      Prerequisite: ZooKeeper must be up and running before the Kafka server is started.
      Now start the Kafka server:
      > bin/kafka-server-start.sh config/server.properties
  12. Create/List Topics
      Create a topic:
      > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
      List all topics:
      > bin/kafka-topics.sh --list --zookeeper localhost:2181
      Output:
      test
  13. Producer
      Send some messages:
      > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
      Now type on the console:
      This is a message
      This is another message
  14. Consumer
      Receive some messages:
      > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
      This is a message
      This is another message
  15. Multi-Broker Cluster
      Copy configs:
      > cp config/server.properties config/server-1.properties
      > cp config/server.properties config/server-2.properties
      Changes in the config files.
      config/server-1.properties:
          broker.id=1
          port=9093
          log.dir=/tmp/kafka-logs-1
      config/server-2.properties:
          broker.id=2
          port=9094
          log.dir=/tmp/kafka-logs-2
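The copy-and-edit step on slide 15 could also be scripted. The following is a minimal sketch, assuming a base file with the three keys shown on the slide; `BASE_PROPERTIES`, `broker_overrides`, and `derive_broker_config` are hypothetical names, and the sample base content and the "port = 9092 + broker id" rule are assumptions chosen to reproduce the slide's values.

```python
# Illustrative stand-in for config/server.properties (assumed content).
BASE_PROPERTIES = """\
# Server basics
broker.id=0
port=9092
log.dir=/tmp/kafka-logs
"""


def broker_overrides(broker_id):
    """Per-broker settings following the pattern on the slide (assumed rule)."""
    return {
        "broker.id": str(broker_id),
        "port": str(9092 + broker_id),
        "log.dir": "/tmp/kafka-logs-%d" % broker_id,
    }


def derive_broker_config(base_text, overrides):
    """Return base_text with overridden keys replaced; missing keys are appended."""
    remaining = dict(overrides)
    out = []
    for line in base_text.splitlines():
        stripped = line.strip()
        key = None
        if "=" in stripped and not stripped.startswith("#"):
            key = stripped.split("=", 1)[0].strip()
        if key in remaining:
            out.append("%s=%s" % (key, remaining.pop(key)))
        else:
            out.append(line)  # comments and untouched settings pass through
    for key, value in remaining.items():
        out.append("%s=%s" % (key, value))
    return "\n".join(out) + "\n"


server_1 = derive_broker_config(BASE_PROPERTIES, broker_overrides(1))
server_2 = derive_broker_config(BASE_PROPERTIES, broker_overrides(2))
```

Writing `server_1` and `server_2` to `config/server-1.properties` and `config/server-2.properties` would give the same result as the manual edits.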
  16. Start with New Nodes
      Start the other nodes with the new configs:
      > bin/kafka-server-start.sh config/server-1.properties &
      > bin/kafka-server-start.sh config/server-2.properties &
      Create a new topic with a replication factor of 3:
      > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
      Describe the new topic:
      > bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
      Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs:
          Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
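To read the `--describe` output on slide 16: `Leader` is the broker serving the partition, `Replicas` lists the brokers holding a copy, and `Isr` lists the replicas currently in sync. As a rough sketch, a per-partition line in that format could be checked like this; `parse_partition_line` and `is_fully_replicated` are hypothetical helpers, and the parsing assumes the exact field layout shown on the slide.

```python
import re


def parse_partition_line(line):
    """Parse one 'Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...' line."""
    fields = dict(re.findall(r"(\w+):\s*(\S+)", line))
    return {
        "topic": fields["Topic"],
        "partition": int(fields["Partition"]),
        "leader": int(fields["Leader"]),
        "replicas": [int(b) for b in fields["Replicas"].split(",")],
        "isr": [int(b) for b in fields["Isr"].split(",")],
    }


def is_fully_replicated(info):
    """True when every assigned replica is also in the in-sync replica set."""
    return set(info["replicas"]) == set(info["isr"])


healthy = parse_partition_line(
    "Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0"
)
# Hypothetical line for the same partition with broker 0 out of sync.
degraded = parse_partition_line(
    "Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2"
)
```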
  17. Spark Streaming
      Makes it easy to build scalable, fault-tolerant streaming applications.
      • Ease of use
      • Fault tolerance
      • Combine streaming with batch and interactive queries
  20. Spark Streaming Programming Model
      Spark Streaming provides a high-level abstraction called a Discretized Stream, or DStream, which
      • represents a stream of data
      • is implemented as a sequence of RDDs
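The idea of a DStream as a sequence of per-interval batches can be sketched in plain Python, without Spark. This is an illustration of the concept only: `discretize` and `word_count` are hypothetical helpers, and unlike Spark this toy version simply skips intervals in which no data arrived.

```python
from collections import Counter, defaultdict


def discretize(events, batch_interval):
    """Group (timestamp_seconds, line) events into consecutive batches."""
    batches = defaultdict(list)
    for timestamp, line in events:
        batches[int(timestamp // batch_interval)].append(line)
    return [batches[index] for index in sorted(batches)]


def word_count(batch):
    """The per-batch computation, applied independently to each batch."""
    return Counter(word for line in batch for word in line.split(" "))


events = [(0.2, "a b"), (0.7, "a"), (1.1, "b c"), (2.5, "c c")]
per_batch = [dict(word_count(batch)) for batch in discretize(events, batch_interval=1)]
```

Each element of `per_batch` plays the role of the result computed on one RDD of the stream; a 1-second `batch_interval` corresponds to `StreamingContext(sc, 1)` in PySpark.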
  22. Spark Streaming + Kafka
      There are two approaches to receiving data from Kafka in Spark Streaming:
      • Receiver-based approach
      • Direct approach
  24.
      # Import SparkContext, StreamingContext, and KafkaUtils
      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext
      from pyspark.streaming.kafka import KafkaUtils

      sc = SparkContext(appName="PythonStreamingKafkaWordCount")
      ssc = StreamingContext(sc, 1)  # 1-second batch interval

      # Receiver-based approach: create a Kafka stream from the ZooKeeper address,
      # a consumer group id, and a {topic: number of partitions to consume} map
      kvs = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {"sparkStream": 1})

      # lines DStream: each Kafka record is a (key, message) pair; keep the message
      lines = kvs.map(lambda x: x[1])

      # counts DStream: word counts computed from the lines DStream
      counts = lines.flatMap(lambda line: line.split(" ")) \
          .map(lambda word: (word, 1)) \
          .reduceByKey(lambda a, b: a + b)
      counts.pprint()

      ssc.start()
      ssc.awaitTermination()
  26.
      # Direct approach: ssc, topic, and brokers are assumed to be defined already
      from pyspark.streaming.kafka import KafkaUtils

      directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

      offsetRanges = []

      def storeOffsetRanges(rdd):
          # Remember the Kafka offset ranges that make up this batch
          global offsetRanges
          offsetRanges = rdd.offsetRanges()
          return rdd

      def printOffsetRanges(rdd):
          for o in offsetRanges:
              print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

      directKafkaStream \
          .transform(storeOffsetRanges) \
          .foreachRDD(printOffsetRanges)
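The offset ranges printed on slide 26 describe which slice of each Kafka partition a batch covers, from `fromOffset` (inclusive) to `untilOffset` (exclusive). The bookkeeping behind this can be sketched in plain Python. The sketch is not Spark's implementation: `OffsetRange` here is a local namedtuple that only mirrors the field names used on the slide, and `next_offset_ranges` is a hypothetical helper.

```python
from collections import namedtuple

# Local stand-in mirroring the attribute names printed on the slide.
OffsetRange = namedtuple("OffsetRange", "topic partition fromOffset untilOffset")


def next_offset_ranges(topic, consumed, latest):
    """Plan the next batch.

    consumed: {partition: next offset not yet processed}
    latest:   {partition: current end offset reported for the partition}
    Returns (ranges for the next batch, updated consumed map).
    """
    ranges = []
    updated = dict(consumed)
    for partition in sorted(latest):
        start = consumed.get(partition, 0)
        end = latest[partition]
        if end > start:  # only partitions with new data contribute a range
            ranges.append(OffsetRange(topic, partition, start, end))
            updated[partition] = end
    return ranges, updated


ranges, consumed = next_offset_ranges("test", {0: 5}, {0: 8, 1: 3})
for o in ranges:
    print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

# Planning again without new data yields no ranges.
empty_ranges, _ = next_offset_ranges("test", consumed, {0: 8, 1: 3})
```

Since each batch is defined by explicit offsets, the same ranges can be read again from Kafka if a batch has to be recomputed.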
  27. Thank You
