Apache Spark with Hortonworks Data Platform
Saptak Sen [@saptak]
Technical Product Manager
© Creative Commons Attribution-ShareAlike 3.0 Unported License
What will we do?
• Setup
• YARN Queues for Spark
• Zeppelin Notebook
• Spark Concepts
• Hands-On
Get the slides from: http://l.saptak.in/sparkseattle
What is Apache Spark?
A fast and general engine for large-scale data processing.
Setup
Download Hortonworks Sandbox
h"p://hortonworks.com/sandbox
Configure Hortonworks Sandbox
Open network settings for the VM
Open port for Zeppelin
Login to Ambari
Ambari is at port 8080. The username and password are admin/admin.
Open YARN Queue Manager
Create a dedicated queue for Spark
Set name to root.spark, Capacity to 50% and Max Capacity to 95%
Save and Refresh Queues
Adjust capacity of the default queue to 50% if required before saving
Open Apache Spark configuration
Set the 'spark' queue as the default queue for Spark
Restart All Affected Services
SSH into your Sandbox
Install the Zeppelin Service for Ambari
VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN
Restart Ambari to enumerate the new Services
service ambari restart
Go to Ambari to Add the Zeppelin Service
Select Zeppelin Notebook
Click Next and select the default configurations unless you know what you are doing.
Wait for Zeppelin deployment to complete
Launch Zeppelin from your browser
Zeppelin is at port 9995
Concepts
RDD
• Resilient
• Distributed
• Dataset
Program Flow
SparkContext
• Created by your driver program
• Responsible for making your RDDs resilient and distributed
• Instantiates RDDs
• The Spark shell creates an "sc" object for you (see the sketch below)
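In Zeppelin's pyspark interpreter or the PySpark shell the context already exists, so a first RDD is one call away (a trivial sketch; the variable names are made up):
# sc is provided by the shell / notebook; no need to construct it yourself
print(sc.appName)                      # confirm the context is alive
first_rdd = sc.parallelize(range(10))  # SparkContext instantiates the RDD
print(first_rdd.count())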
Creating RDDs
myRDD = sc.parallelize([1, 2, 3, 4])
myRDD = sc.textFile("hdfs:///tmp/shakepeare.txt")
# file://, s3n://
hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age from users")
Creating RDDs
• Can also create from:
• JDBC
• Cassandra
• HBase
• Elasticsearch
• JSON, CSV, sequence files, object files, ORC, Parquet, Avro (see the sketch below)
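A minimal sketch of a few of these loaders. The paths are hypothetical, and the DataFrame reader assumes Spark 1.4+ with a sqlContext available:
# Structured formats come back as DataFrames (hypothetical paths)
json_df = sqlContext.read.json("hdfs:///tmp/events.json")
parquet_df = sqlContext.read.parquet("hdfs:///tmp/events.parquet")

# CSV as a plain text RDD, split manually
lines_rdd = sc.textFile("hdfs:///tmp/events.csv").map(lambda line: line.split(","))

# Sequence files and pickled object files come back as RDDs
seq_rdd = sc.sequenceFile("hdfs:///tmp/events.seq")
obj_rdd = sc.pickleFile("hdfs:///tmp/events.pickle")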
Passing Functions to Spark
Many RDD methods accept a function as a parameter.
myRdd.map(lambda x: x*x)
is the same thing as
def sqrN(x):
    return x*x
myRdd.map(sqrN)
RDD Transformations
• map
• flatMap
• filter
• distinct
• sample
• union, intersection, subtract, cartesian (see the sketch below)
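A quick sketch of these transformations on a toy RDD (the data is made up for illustration):
nums = sc.parallelize([1, 2, 2, 3, 4, 5])
evens = sc.parallelize([2, 4, 6])

nums.map(lambda x: x * 10).collect()          # [10, 20, 20, 30, 40, 50]
nums.flatMap(lambda x: [x, x]).collect()      # every element emitted twice
nums.filter(lambda x: x % 2 == 0).collect()   # [2, 2, 4]
nums.distinct().collect()                     # duplicates removed
nums.sample(False, 0.5).collect()             # random ~50% sample, without replacement
nums.union(evens).collect()
nums.intersection(evens).collect()
nums.subtract(evens).collect()
nums.cartesian(evens).count()                 # number of (x, y) pairs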
Map example
myRdd = sc.parallelize([1, 2, 3, 4])
myRdd.map(lambda x: x + x)
• Results: 2, 4, 6, 8
Transformations are lazily evaluated.
RDD Actions
• collect
• count
• countByValue
• take
• top
• reduce (see the sketch below)
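And the actions, again on a small made-up RDD:
nums = sc.parallelize([3, 1, 2, 2, 5])

nums.collect()                    # bring everything back to the driver: [3, 1, 2, 2, 5]
nums.count()                      # 5
nums.countByValue()               # {3: 1, 1: 1, 2: 2, 5: 1}
nums.take(3)                      # first 3 elements
nums.top(2)                       # 2 largest elements: [5, 3]
nums.reduce(lambda x, y: x + y)   # 13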
Reduce
sum = rdd.reduce(lambda x, y: x + y)
Aggregate
sumCount = nums.aggregate(
    (0, 0),
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))
avg = sumCount[0] / float(sumCount[1])
Imports
from pyspark import SparkConf, SparkContext
import collections
Setup the Context
conf = SparkConf().setMaster("local").setAppName("foo")
sc = SparkContext(conf = conf)
First RDD
lines = sc.textFile("hdfs:///tmp/shakepeare.txt")
Extract the Data
ratings = lines.map(lambda x: x.split()[2])
Aggregate
result = ratings.countByValue()
Sort and Print the results
sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.iteritems():
    print "%s %i" % (key, value)
Run the Program
spark-submit ratings-counter.py
Key, Value RDD
# Create a key, value RDD where each value is 1.
totalsByAge = rdd.map(lambda x: (x, 1))
With Key, Value RDDs, we can:
- reduceByKey()
- groupByKey()
- sortByKey()
- keys(), values()
rdd.reduceByKey(lambda x, y: x + y)
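A small sketch of these pair-RDD operations, on made-up (key, value) data:
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("c", 1)])

pairs.reduceByKey(lambda x, y: x + y).collect()   # counts per key, e.g. ('a', 2)
pairs.groupByKey().mapValues(list).collect()      # all values gathered per key
pairs.sortByKey().collect()                       # sorted by key
pairs.keys().collect()
pairs.values().collect()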
Set operations on Key, Value RDDs
With two key, value RDDs, we can:
• join
• rightOuterJoin
• leftOuterJoin
• cogroup
• subtractByKey (see the sketch below)
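A sketch with two small, made-up pair RDDs:
left = sc.parallelize([("a", 1), ("b", 2)])
right = sc.parallelize([("a", 10), ("c", 30)])

left.join(right).collect()            # keys present in both: [('a', (1, 10))]
left.leftOuterJoin(right).collect()   # all keys from left; missing right values are None
left.rightOuterJoin(right).collect()  # all keys from right
left.cogroup(right).mapValues(lambda v: (list(v[0]), list(v[1]))).collect()
left.subtractByKey(right).collect()   # [('b', 2)]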
Mapping just values
When mapping just values in a Key, Value RDD, it is more efficient
to use:
• mapValues()
• flatMapValues()
as this preserves the original partitioning (see the sketch below).
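For example, on a made-up pair RDD:
kv = sc.parallelize([("normal.", 10), ("attack", 3)])

kv.mapValues(lambda v: v * 2).collect()        # keys untouched, partitioning preserved
kv.flatMapValues(lambda v: range(v)).count()   # one output record per generated value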
Persistence
result = input.map(lambda x: x * x)
result.persist().is_cached
print(result.count())
print(",".join(str(x) for x in result.collect()))
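persist() also takes an explicit storage level; a minimal sketch on the same result RDD, with MEMORY_AND_DISK chosen purely as an example:
from pyspark import StorageLevel

result.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk when it does not fit in memory
print(result.count())    # the first action materializes and caches the RDD
print(result.count())    # later actions reuse the cached partitions
result.unpersist()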
Hands-On
Download the Data
#remove existing copies of dataset from HDFS
hadoop fs -rm /tmp/kddcup.data_10_percent.gz
wget http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz \
  -O /tmp/kddcup.data_10_percent.gz
Load data in HDFS
hadoop fs -put /tmp/kddcup.data_10_percent.gz /tmp
hadoop fs -ls -h /tmp/kddcup.data_10_percent.gz
rm /tmp/kddcup.data_10_percent.gz
First RDD
input_file = "hdfs:///tmp/kddcup.data_10_percent.gz"
raw_rdd = sc.textFile(input_file)
Retrieving version and configuration information
sc.version
sc.getConf().get("spark.home")

import os
os.environ.get("PYTHONPATH")
os.environ.get("SPARK_HOME")
Count the number of rows in the dataset
print raw_rdd.count()
Inspect what the data looks like
print raw_rdd.take(5)
Spreading data across the cluster
a = range(100)
data = sc.parallelize(a)
Filtering lines in the data
normal_raw_rdd = raw_rdd.filter(lambda x: 'normal.' in x)
Count the filtered RDD
normal_count = normal_raw_rdd.count()
print normal_count
Impor&ng local libraries
from pprint import pprint
csv_rdd = raw_rdd.map(lambda x: x.split(","))
head_rows = csv_rdd.take(5)
pprint(head_rows[0])
Using non-lambda functions in transformations
def parse_interaction(line):
    elems = line.split(",")
    tag = elems[41]
    return (tag, elems)

key_csv_rdd = raw_rdd.map(parse_interaction)
head_rows = key_csv_rdd.take(5)
pprint(head_rows[0])
Using collect() on the Spark driver
all_raw_rdd = raw_rdd.collect()
Extracting a smaller sample of the RDD
raw_rdd_sample = raw_rdd.sample(False, 0.1, 1234)
print raw_rdd_sample.count()
print raw_rdd.count()
Local sample for local processing
raw_data_local_sample = raw_rdd.takeSample(False, 1000, 1234)
normal_data_sample = [x.split(",") for x in raw_data_local_sample if "normal." in x]
normal_data_sample_size = len(normal_data_sample)
normal_ratio = normal_data_sample_size/1000.0
print normal_ratio
Subtracting one RDD from another
attack_raw_rdd = raw_rdd.subtract(normal_raw_rdd)
print attack_raw_rdd.count()
Listing the distinct protocols
protocols = csv_rdd.map(lambda x: x[1]).distinct()
print protocols.collect()
Listing the distinct services
services = csv_rdd.map(lambda x: x[2]).distinct()
print services.collect()
Cartesian product of two RDDs
product = protocols.cartesian(services).collect()
print len(product)
Separating the normal and attack events
normal_csv_data = csv_rdd.filter(lambda x: x[41]=="normal.")
attack_csv_data = csv_rdd.filter(lambda x: x[41]!="normal.")
Calculating total normal event duration and attack event duration
normal_duration_data = normal_csv_data.map(lambda x: int(x[0]))
total_normal_duration = normal_duration_data.reduce(lambda x, y: x + y)
print total_normal_duration
attack_duration_data = attack_csv_data.map(lambda x: int(x[0]))
total_attack_duration = attack_duration_data.reduce(lambda x, y: x + y)
print total_attack_duration
Calculating average duration of normal and attack events
normal_count = normal_duration_data.count()
attack_count = attack_duration_data.count()
print round(total_normal_duration/float(normal_count),3)
print round(total_attack_duration/float(attack_count),3)
Using the aggregate() function to calculate the average in a single scan
normal_sum_count = normal_duration_data.aggregate(
(0,0), # the initial value
(lambda acc, value: (acc[0] + value, acc[1] + 1)), # combine val/acc
(lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
)
print round(normal_sum_count[0]/float(normal_sum_count[1]),3)
attack_sum_count = attack_duration_data.aggregate(
(0,0), # the initial value
(lambda acc, value: (acc[0] + value, acc[1] + 1)), # combine value with acc
(lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])) # combine accumulators
)
print round(attack_sum_count[0]/float(attack_sum_count[1]),3)
Creating a key, value RDD
key_value_rdd = csv_rdd.map(lambda x: (x[41], x))
print key_value_rdd.take(1)
Reduce by key and aggregate
key_value_duration = csv_rdd.map(lambda x: (x[41], float(x[0])))
durations_by_key = key_value_duration.reduceByKey(lambda x, y: x + y)
print durations_by_key.collect()
Count by key
counts_by_key = key_value_rdd.countByKey()
print counts_by_key
Combine by key to get duration and attempts by type
sum_counts = key_value_duration.combineByKey(
(lambda x: (x, 1)),
# the initial value, with value x and count 1
(lambda acc, value: (acc[0]+value, acc[1]+1)),
# how to combine a pair value with the accumulator: sum value, and increment count
(lambda acc1, acc2: (acc1[0]+acc2[0], acc1[1]+acc2[1]))
# combine accumulators
)
print sum_counts.collectAsMap()
Sorted average duration by type
duration_means_by_type = sum_counts.map(
    lambda (key, value): (key, round(value[0] / value[1], 3))).collectAsMap()
# Print them sorted
for tag in sorted(duration_means_by_type, key=duration_means_by_type.get, reverse=True):
    print tag, duration_means_by_type[tag]
Creating a Schema and creating a DataFrame
from pyspark.sql import Row

row_data = csv_rdd.map(lambda p: Row(
duration=int(p[0]),
protocol_type=p[1],
service=p[2],
flag=p[3],
src_bytes=int(p[4]),
dst_bytes=int(p[5])
)
)
interactions_df = sqlContext.createDataFrame(row_data)
interactions_df.registerTempTable("interactions")
Querying with SQL
SELECT duration, dst_bytes FROM interactions
WHERE protocol_type = 'tcp'
AND duration > 1000
AND dst_bytes = 0
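To actually run this from the notebook, query the registered interactions table through sqlContext; a minimal sketch:
tcp_interactions = sqlContext.sql("""
    SELECT duration, dst_bytes FROM interactions
    WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
""")
tcp_interactions.show()

# Or pull the rows back to the driver and format them yourself
for row in tcp_interactions.collect():
    print("Duration: %s, Dest. bytes: %s" % (row.duration, row.dst_bytes))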
Printing the Schema
interactions_df.printSchema()
Counting events by protocol
interactions_df.select("protocol_type", "duration", "dst_bytes") \
    .groupBy("protocol_type").count().show()

interactions_df.select("protocol_type", "duration", "dst_bytes") \
    .filter(interactions_df.duration > 1000) \
    .filter(interactions_df.dst_bytes == 0) \
    .groupBy("protocol_type").count().show()
Modifying the labels
def get_label_type(label):
    if label != "normal.":
        return "attack"
    else:
        return "normal"
row_labeled_data = csv_rdd.map(lambda p: Row(
duration=int(p[0]),
protocol_type=p[1],
service=p[2],
flag=p[3],
src_bytes=int(p[4]),
dst_bytes=int(p[5]),
label=get_label_type(p[41])
)
)
interactions_labeled_df = sqlContext.createDataFrame(row_labeled_data)
Counting by labels
interactions_labeled_df.select("label", "protocol_type") \
    .groupBy("label", "protocol_type").count().show()
Group by label and protocol
interactions_labeled_df.select("label", "protocol_type") \
    .groupBy("label", "protocol_type").count().show()

interactions_labeled_df.select("label", "protocol_type", "dst_bytes") \
    .groupBy("label", "protocol_type",
             interactions_labeled_df.dst_bytes == 0).count().show()
Questions
More resources
• More tutorials: http://hortonworks.com/tutorials
• Hortonworks gallery: http://hortonworks-gallery.github.io
My co-ordinates
• Twitter: @saptak
• Website: http://saptak.in