
Apache Spark with Hortonworks Data Platform - Seattle Meetup

- Setup
- YARN Queues for Spark
- Zeppelin
- Concepts: RDD, DataFrame
- Hands-On

1. Apache Spark with Hortonworks Data Platform - Saptak Sen [@saptak], Technical Product Manager
2. What we will do • Setup • YARN Queues for Spark • Zeppelin Notebook • Spark Concepts • Hands-On. Get the slides from: http://l.saptak.in/sparkseattle
3. What is Apache Spark? A fast and general engine for large-scale data processing.
4. Setup
5. Download Hortonworks Sandbox: http://hortonworks.com/sandbox
6. Configure Hortonworks Sandbox
7. Open network settings for the VM
8. Open a port for Zeppelin
9. Log in to Ambari. Ambari is at port 8080; the username and password are admin/admin.
10. Open the YARN Queue Manager
11. Create a dedicated queue for Spark. Set the name to root.spark, Capacity to 50%, and Max Capacity to 95%.
12. Save and Refresh Queues. Adjust the capacity of the default queue to 50% if required before saving.
13. Open the Apache Spark configuration
14. Set the 'spark' queue as the default queue for Spark
15. Restart All Affected Services
16. SSH into your Sandbox
17. Install the Zeppelin service for Ambari
    VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
    sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN
18. Restart Ambari to enumerate the new service
    service ambari restart
19. Go to Ambari to add the Zeppelin service
20. Select Zeppelin Notebook, click Next, and accept the default configurations unless you know what you are doing.
21. Wait for the Zeppelin deployment to complete
22. Launch Zeppelin from your browser. Zeppelin is at port 9995.
23. Concepts
24. RDD • Resilient • Distributed • Dataset
25. Program Flow
26. SparkContext • Created by your driver program • Responsible for making your RDDs resilient and distributed • Instantiates RDDs • The Spark shell creates an "sc" object for you
27. Creating RDDs
    myRDD = sc.parallelize([1, 2, 3, 4])
    myRDD = sc.textFile("hdfs:///tmp/shakespeare.txt")  # also file://, s3n://
    hiveCtx = HiveContext(sc)
    rows = hiveCtx.sql("SELECT name, age FROM users")
28. Creating RDDs • Can also create from: • JDBC • Cassandra • HBase • Elasticsearch • JSON, CSV, sequence files, object files, ORC, Parquet, Avro
29. Passing functions to Spark
    Many RDD methods accept a function as a parameter.
    myRdd.map(lambda x: x*x)
    is the same thing as:
    def sqrN(x): return x*x
    myRdd.map(sqrN)
30. RDD Transformations • map • flatMap • filter • distinct • sample • union, intersection, subtract, cartesian
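    A minimal sketch (toy data, illustrative names; not from the deck) exercising a few of these transformations:
    words = sc.parallelize(["to", "be", "or", "not", "to", "be"])
    print(words.distinct().collect())                    # the four unique words, in some order
    print(words.filter(lambda w: w != "to").collect())   # drops both occurrences of "to"
    lines = sc.parallelize(["to be", "or not"])
    print(lines.flatMap(lambda l: l.split()).collect())  # one element per word: ['to', 'be', 'or', 'not']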
31. Map example
    myRdd = sc.parallelize([1, 2, 3, 4])
    myRdd.map(lambda x: x + x)
    Results: 2, 4, 6, 8
    Transformations are lazily evaluated.
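    To make the lazy evaluation concrete, a small sketch continuing from the myRdd above (the doubled name is illustrative):
    doubled = myRdd.map(lambda x: x + x)  # builds lineage only; nothing is computed yet
    print(doubled.collect())              # the action forces evaluation: [2, 4, 6, 8]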
32. RDD Actions • collect • count • countByValue • take • top • reduce
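    A minimal sketch (toy data) showing what each of these actions returns:
    nums = sc.parallelize([3, 1, 2, 3])
    print(nums.count())                     # 4
    print(nums.countByValue())              # dict-like: {3: 2, 1: 1, 2: 1}
    print(nums.take(2))                     # first two elements in partition order
    print(nums.top(2))                      # the two largest: [3, 3]
    print(nums.reduce(lambda x, y: x + y))  # 9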
33. Reduce
    sum = rdd.reduce(lambda x, y: x + y)
    Aggregate
    sumCount = nums.aggregate(
        (0, 0),
        (lambda acc, value: (acc[0] + value, acc[1] + 1)),
        (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))
    return sumCount[0] / float(sumCount[1])
34. Imports
    from pyspark import SparkConf, SparkContext
    import collections
    Set up the context
    conf = SparkConf().setMaster("local").setAppName("foo")
    sc = SparkContext(conf=conf)
35. First RDD
    lines = sc.textFile("hdfs:///tmp/shakespeare.txt")
    Extract the data
    ratings = lines.map(lambda x: x.split()[2])
36. Aggregate
    result = ratings.countByValue()
    Sort and print the results
    sortedResults = collections.OrderedDict(sorted(result.items()))
    for key, value in sortedResults.iteritems():
        print "%s %i" % (key, value)
37. Run the program
    spark-submit ratings-counter.py
38. Key, value RDDs
    # a pair RDD of (key, value) tuples where the value is 1
    totalsByAge = rdd.map(lambda x: (x, 1))
    With key, value RDDs, we can use:
    • reduceByKey() • groupByKey() • sortByKey() • keys(), values()
    rdd.reduceByKey(lambda x, y: x + y)
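    A quick sketch (toy pairs, illustrative names) of these pair-RDD methods:
    pairs = sc.parallelize([("b", 2), ("a", 1), ("b", 3)])
    print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 1), ('b', 5)] in some order
    print(pairs.sortByKey().collect())                      # [('a', 1), ('b', 2), ('b', 3)]
    print(pairs.groupByKey().mapValues(list).collect())     # [('a', [1]), ('b', [2, 3])]
    print(pairs.keys().collect())                           # ['b', 'a', 'b']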
39. Set operations on key, value RDDs
    With two key, value RDDs, we can use:
    • join • rightOuterJoin • leftOuterJoin • cogroup • subtractByKey
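    A minimal sketch (toy data) of the join-style operations:
    x = sc.parallelize([("a", 1), ("b", 2)])
    y = sc.parallelize([("a", 3), ("c", 4)])
    print(x.join(y).collect())           # inner join on keys: [('a', (1, 3))]
    print(x.leftOuterJoin(y).collect())  # keeps 'b', paired with None on the right
    print(x.subtractByKey(y).collect())  # pairs whose key is absent from y: [('b', 2)]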
40. Mapping just the values
    When mapping only the values in a key, value RDD, it is more efficient to use:
    • mapValues() • flatMapValues()
    as they preserve the original partitioning.
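    For example (toy pairs):
    pairs = sc.parallelize([("a", 1), ("b", 2)])
    print(pairs.mapValues(lambda v: v * 10).collect())        # [('a', 10), ('b', 20)]; keys untouched
    print(pairs.flatMapValues(lambda v: range(v)).collect())  # [('a', 0), ('b', 0), ('b', 1)]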
41. Persistence
    result = input.map(lambda x: x * x)
    result.persist()
    print(result.is_cached)
    print(result.count())
    print(",".join(str(x) for x in result.collect()))
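    persist() can also take an explicit storage level; a sketch continuing from the result RDD above (the level chosen here is illustrative):
    from pyspark import StorageLevel
    result.unpersist()                            # drop the earlier in-memory copy first
    result.persist(StorageLevel.MEMORY_AND_DISK)  # spill partitions to disk when memory is tight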
42. Hands-On
43. Download the data
    # remove existing copies of the dataset from HDFS
    hadoop fs -rm /tmp/kddcup.data_10_percent.gz
    wget http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz -O /tmp/kddcup.data_10_percent.gz
44. Load the data into HDFS
    hadoop fs -put /tmp/kddcup.data_10_percent.gz /tmp
    hadoop fs -ls -h /tmp/kddcup.data_10_percent.gz
    rm /tmp/kddcup.data_10_percent.gz
45. First RDD
    input_file = "hdfs:///tmp/kddcup.data_10_percent.gz"
    raw_rdd = sc.textFile(input_file)
46. Retrieving version and configuration information
    import os
    print(sc.version)
    print(sc.getConf().get("spark.home"))
    print(os.environ.get("PYTHONPATH"))
    print(os.environ.get("SPARK_HOME"))
47. Count the number of rows in the dataset
    print raw_rdd.count()
48. Inspect what the data looks like
    print raw_rdd.take(5)
49. Spreading data across the cluster
    a = range(100)
    data = sc.parallelize(a)
50. Filtering lines in the data
    normal_raw_rdd = raw_rdd.filter(lambda x: 'normal.' in x)
51. Count the filtered RDD
    normal_count = normal_raw_rdd.count()
    print normal_count
52. Importing local libraries
    from pprint import pprint
    csv_rdd = raw_rdd.map(lambda x: x.split(","))
    head_rows = csv_rdd.take(5)
    pprint(head_rows[0])
53. Using non-lambda functions in transformations
    def parse_interaction(line):
        elems = line.split(",")
        tag = elems[41]
        return (tag, elems)
    key_csv_rdd = raw_rdd.map(parse_interaction)
    head_rows = key_csv_rdd.take(5)
    pprint(head_rows[0])
54. Using collect() on the Spark driver
    all_raw_rdd = raw_rdd.collect()
    Note that collect() pulls the entire dataset into the driver's memory.
55. Extracting a smaller sample of the RDD
    raw_rdd_sample = raw_rdd.sample(False, 0.1, 1234)
    print raw_rdd_sample.count()
    print raw_rdd.count()
56. Local sample for local processing
    raw_data_local_sample = raw_rdd.takeSample(False, 1000, 1234)
    normal_data_sample = [x.split(",") for x in raw_data_local_sample if "normal." in x]
    normal_data_sample_size = len(normal_data_sample)
    normal_ratio = normal_data_sample_size / 1000.0
    print normal_ratio
57. Subtracting one RDD from another
    attack_raw_rdd = raw_rdd.subtract(normal_raw_rdd)
    print attack_raw_rdd.count()
58. Listing the distinct protocols
    protocols = csv_rdd.map(lambda x: x[1]).distinct()
    print protocols.collect()
59. Listing the distinct services
    services = csv_rdd.map(lambda x: x[2]).distinct()
    print services.collect()
60. Cartesian product of two RDDs
    product = protocols.cartesian(services).collect()
    print len(product)
61. Filtering out the normal and attack events
    normal_csv_data = csv_rdd.filter(lambda x: x[41] == "normal.")
    attack_csv_data = csv_rdd.filter(lambda x: x[41] != "normal.")
62. Calculating the total normal and attack event durations
    normal_duration_data = normal_csv_data.map(lambda x: int(x[0]))
    total_normal_duration = normal_duration_data.reduce(lambda x, y: x + y)
    print total_normal_duration
    attack_duration_data = attack_csv_data.map(lambda x: int(x[0]))
    total_attack_duration = attack_duration_data.reduce(lambda x, y: x + y)
    print total_attack_duration
63. Calculating the average duration of normal and attack events
    normal_count = normal_duration_data.count()
    attack_count = attack_duration_data.count()
    print round(total_normal_duration / float(normal_count), 3)
    print round(total_attack_duration / float(attack_count), 3)
64. Use the aggregate() function to calculate the average in a single scan
    normal_sum_count = normal_duration_data.aggregate(
        (0, 0),  # the initial value
        (lambda acc, value: (acc[0] + value, acc[1] + 1)),  # combine a value with the accumulator
        (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))  # combine accumulators
    )
    print round(normal_sum_count[0] / float(normal_sum_count[1]), 3)
    attack_sum_count = attack_duration_data.aggregate(
        (0, 0),
        (lambda acc, value: (acc[0] + value, acc[1] + 1)),
        (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
    )
    print round(attack_sum_count[0] / float(attack_sum_count[1]), 3)
65. Creating a key, value RDD
    key_value_rdd = csv_rdd.map(lambda x: (x[41], x))
    print key_value_rdd.take(1)
66. Reduce by key and aggregate
    key_value_duration = csv_rdd.map(lambda x: (x[41], float(x[0])))
    durations_by_key = key_value_duration.reduceByKey(lambda x, y: x + y)
    print durations_by_key.collect()
67. Count by key
    counts_by_key = key_value_rdd.countByKey()
    print counts_by_key
68. Combine by key to get duration and attempts by type
    sum_counts = key_value_duration.combineByKey(
        (lambda x: (x, 1)),  # create the combiner: the first value x with count 1
        (lambda acc, value: (acc[0] + value, acc[1] + 1)),  # merge a value into the accumulator: add the value, increment the count
        (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))  # merge accumulators across partitions
    )
    print sum_counts.collectAsMap()
69. Sorted average duration by type
    duration_means_by_type = sum_counts.map(
        lambda (key, value): (key, round(value[0] / value[1], 3))).collectAsMap()
    # print them sorted
    for tag in sorted(duration_means_by_type, key=duration_means_by_type.get, reverse=True):
        print tag, duration_means_by_type[tag]
70. Creating a schema and a DataFrame
    from pyspark.sql import Row
    row_data = csv_rdd.map(lambda p: Row(
        duration=int(p[0]),
        protocol_type=p[1],
        service=p[2],
        flag=p[3],
        src_bytes=int(p[4]),
        dst_bytes=int(p[5])
    ))
    interactions_df = sqlContext.createDataFrame(row_data)
    interactions_df.registerTempTable("interactions")
71. Querying with SQL
    SELECT duration, dst_bytes FROM interactions
    WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
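    A minimal sketch of running that query from PySpark against the temp table registered on slide 70 (the tcp_interactions name is illustrative):
    tcp_interactions = sqlContext.sql("""
        SELECT duration, dst_bytes FROM interactions
        WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
    """)
    tcp_interactions.show()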
72. Printing the schema
    interactions_df.printSchema()
73. Counting events by protocol
    interactions_df.select("protocol_type", "duration", "dst_bytes") \
        .groupBy("protocol_type").count().show()
    interactions_df.select("protocol_type", "duration", "dst_bytes") \
        .filter(interactions_df.duration > 1000) \
        .filter(interactions_df.dst_bytes == 0) \
        .groupBy("protocol_type").count().show()
74. Modifying the labels
    def get_label_type(label):
        if label != "normal.":
            return "attack"
        else:
            return "normal"
    row_labeled_data = csv_rdd.map(lambda p: Row(
        duration=int(p[0]),
        protocol_type=p[1],
        service=p[2],
        flag=p[3],
        src_bytes=int(p[4]),
        dst_bytes=int(p[5]),
        label=get_label_type(p[41])
    ))
    interactions_labeled_df = sqlContext.createDataFrame(row_labeled_data)
75. Counting by labels
    interactions_labeled_df.select("label", "protocol_type") \
        .groupBy("label", "protocol_type").count().show()
76. Group by label and protocol
    interactions_labeled_df.select("label", "protocol_type") \
        .groupBy("label", "protocol_type").count().show()
    interactions_labeled_df.select("label", "protocol_type", "dst_bytes") \
        .groupBy("label", "protocol_type", interactions_labeled_df.dst_bytes == 0).count().show()
77. Questions
78. More resources
    • More tutorials: http://hortonworks.com/tutorials
    • Hortonworks gallery: http://hortonworks-gallery.github.io
    My coordinates
    • Twitter: @saptak
    • Website: http://saptak.in
79. © Creative Commons Attribution-ShareAlike 3.0 Unported License
