
Introduction to Apache Spark / PUT 06.2014


  1. 1. APACHE SPARK: LARGE-SCALE DATA PROCESSING ENGINE • Bartosz Bogacki <bbogacki@bidlab.pl>
  2. 2. CTO, CODER, ROCK CLIMBER • current: • Chief Technology Officer at Bidlab • previous: • IT Director at InternetowyKantor.pl SA • Software Architect / Project Manager at Wolters Kluwer Polska • find out more (if you care): • linkedin.com/in/bartoszbogacki
  3. 3. WE PROCESS MORE THAN 200GB OF LOGS DAILY Did I mention that… ?
  4. 4. WHY? • To discover inventory and potential • To optimize traffic • To optimize campaigns • To learn about trends • To calculate conversions
  5. 5. APACHE SPARK !
  6. 6. HISTORY • 2013-06-19 Project enters Apache incubation • 2014-02-19 Project established as an Apache Top Level Project • 2014-05-30 Spark 1.0.0 released
  7. 7. • "Apache Spark is a (lightning-) fast and general-purpose cluster computing system" • Engine compatible with Apache Hadoop • Up to 100x faster than Hadoop • Less code to write, more elastic • Active community (117 developers contributed to release 1.0.0)
  8. 8. KEY CONCEPTS • Runs on the standalone Spark cluster manager, YARN, or Mesos • HDFS / S3 support built-in • RDD - Resilient Distributed Dataset • Transformations & Actions • Written in Scala, API for Java / Scala / Python
  9. 9. ECOSYSTEM • Spark Streaming • Shark • MLlib (machine learning) • GraphX • Spark SQL
  10. 10. RDD • Collections of objects • Stored in memory (or disk) • Spread across the cluster • Auto-rebuild on failure
  11. 11. TRANSFORMATIONS • map / flatMap • filter • union / intersection / join / cogroup • distinct • many more…
  12. 12. ACTIONS • reduce / reduceByKey • foreach • count / countByKey • first / take / takeOrdered • collect / saveAsTextFile / saveAsObjectFile
  13. 13. EXAMPLES
      val s1 = sc.parallelize(Array(1,2,3,4,5))
      val s2 = sc.parallelize(Array(3,4,6,7,8))
      val s3 = sc.parallelize(Array(1,2,2,3,3,3))

      s2.map(num => num * num)  // => 9, 16, 36, 49, 64
      s1.reduce((a,b) => a + b) // => 15
      s1 union s2               // => 1, 2, 3, 4, 5, 3, 4, 6, 7, 8
      s1 subtract s2            // => 1, 5, 2
      s1 intersection s2        // => 4, 3
      s3.distinct               // => 1, 2, 3
  14. 14. EXAMPLES
      val set1 = sc.parallelize(Array[(Integer,String)](
        (1,"bartek"), (2,"jacek"), (3,"tomek")))
      val set2 = sc.parallelize(Array[(Integer,String)](
        (2,"nowak"), (4,"kowalski"), (5,"iksiński")))

      set1 join set2
      // => (2,(jacek,nowak))

      set1 leftOuterJoin set2
      // => (1,(bartek,None)), (2,(jacek,Some(nowak))), (3,(tomek,None))

      set1 rightOuterJoin set2
      // => (4,(None,kowalski)), (5,(None,iksiński)), (2,(Some(jacek),nowak))
  15. 15. EXAMPLES
      set1.cogroup(set2).sortByKey()
      // => (1,(ArrayBuffer(bartek),ArrayBuffer())),
      //    (2,(ArrayBuffer(jacek),ArrayBuffer(nowak))),
      //    (3,(ArrayBuffer(tomek),ArrayBuffer())),
      //    (4,(ArrayBuffer(),ArrayBuffer(kowalski))),
      //    (5,(ArrayBuffer(),ArrayBuffer(iksiński)))

      set2.map((t) => (t._1, t._2.length))
      // => (2,5), (4,8), (5,8)

      val set3 = sc.parallelize(Array[(String,Long)](
        ("onet.pl",1), ("onet.pl",1), ("wp.pl",1)))

      set3.reduceByKey((n1,n2) => n1 + n2)
      // => (onet.pl,2), (wp.pl,1)
  16. 16. HANDS ON
  17. 17. RUNNING EC2 SPARK CLUSTER ./spark-ec2 -k spark-key -i spark-key.pem -s 5 -t m3.2xlarge launch cluster-name --region=eu-west-1
  18. 18. SPARK CONSOLE
  19. 19. LINKING WITH SPARK
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.0.0</version>
      </dependency>

      If you want to use HDFS:
        groupId = org.apache.hadoop
        artifactId = hadoop-client
        version = <your-hdfs-version>

      If you want to use Spark Streaming:
        groupId = org.apache.spark
        artifactId = spark-streaming_2.10
        version = 1.0.0
  20. 20. INITIALIZING • SparkConf conf = new SparkConf() .setAppName("TEST") .setMaster("local"); • JavaSparkContext sc = new JavaSparkContext(conf);
  21. 21. CREATING RDD • List<Integer> data = Arrays.asList(1, 2, 3, 4, 5); • JavaRDD<Integer> distData = sc.parallelize(data);
  22. 22. CREATING RDD • JavaRDD<String> logLines = sc.textFile("data.txt");
  23. 23. CREATING RDD • JavaRDD<String> logLines = sc.textFile("hdfs://<HOST>:<PORT>/daily/data-20-00.txt"); • JavaRDD<String> logLines = sc.textFile("s3n://my-bucket/daily/data-*.txt");
  24. 24. TRANSFORM
      JavaRDD<Log> logs = logLines.map(new Function<String, Log>() {
        public Log call(String s) {
          return LogParser.parse(s);
        }
      }).filter(new Function<Log, Boolean>() {
        public Boolean call(Log log) {
          return log.getLevel() == 1;
        }
      });
  25. 25. ACTION :) logs.count();
  26. 26. TRANSFORM-ACTION
      List<Tuple2<String, Integer>> result = sc.textFile("/data/notifies-20-00.txt")
        .mapToPair(new PairFunction<String, String, Integer>() {
          @Override
          public Tuple2<String, Integer> call(String line) throws Exception {
            NotifyRequest nr = LogParser.parseNotifyRequest(line);
            return new Tuple2<String, Integer>(nr.getFlightId(), 1);
          }
        })
        .reduceByKey(new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer v1, Integer v2) throws Exception {
            return v1 + v2;
          }
        })
        .sortByKey()
        .collect();
  27. 27. FUNCTIONS, PAIRFUNCTIONS, ETC.
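      A minimal sketch of how these function interfaces look in the Java API (the
      string-length, pairing, and sum functions are made up for illustration):

      import org.apache.spark.api.java.function.Function;
      import org.apache.spark.api.java.function.Function2;
      import org.apache.spark.api.java.function.PairFunction;
      import scala.Tuple2;

      // Function<T, R>: one input, one output -- used by map() and filter()
      Function<String, Integer> length = new Function<String, Integer>() {
        public Integer call(String s) { return s.length(); }
      };

      // PairFunction<T, K, V>: turns an element into a key/value pair -- used by mapToPair()
      PairFunction<String, String, Integer> toPair = new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2<String, Integer>(s, 1);
        }
      };

      // Function2<T1, T2, R>: two inputs, one output -- used by reduce() and reduceByKey()
      Function2<Integer, Integer, Integer> sum = new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer a, Integer b) { return a + b; }
      };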
  28. 28. BROADCAST VARIABLES • "allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks"
      Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});

      broadcastVar.value(); // returns [1, 2, 3]
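      A minimal sketch of how a broadcast variable is typically consumed inside a
      transformation; the country-code lookup map and the codes RDD are made up for
      illustration, not taken from the slides:

      import java.util.*;
      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.api.java.function.Function;
      import org.apache.spark.broadcast.Broadcast;

      Map<String, String> countryNames = new HashMap<String, String>();
      countryNames.put("PL", "Poland");
      countryNames.put("DE", "Germany");

      // one copy per executor, instead of one copy shipped with every task
      final Broadcast<Map<String, String>> lookup = sc.broadcast(countryNames);

      JavaRDD<String> codes = sc.parallelize(Arrays.asList("PL", "DE", "PL"));
      JavaRDD<String> resolved = codes.map(new Function<String, String>() {
        public String call(String code) {
          return lookup.value().get(code); // read-only access on the worker
        }
      });
      // resolved.collect() => [Poland, Germany, Poland]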
  29. 29. ACCUMULATORS • variables that are only “added” to through an associative operation (add()) • only the driver program can read the accumulator’s value
      Accumulator<Integer> accum = sc.accumulator(0);

      sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> accum.add(x));

      accum.value(); // returns 10
  30. 30. SERIALIZATION • All objects stored in RDDs or referenced by the functions you ship to the cluster have to be serializable • Otherwise: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException
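      A minimal sketch of what this means in practice, using a hypothetical version
      of the Log class from the TRANSFORM slide:

      // classes whose instances travel to executors (inside closures or stored
      // in RDDs) must implement java.io.Serializable, or the job fails with
      // NotSerializableException
      public class Log implements java.io.Serializable {
        private int level;
        private String message;

        public Log(int level, String message) {
          this.level = level;
          this.message = message;
        }

        public int getLevel() { return level; }
        public String getMessage() { return message; }
      }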
  31. 31. USE KRYO SERIALIZER
      public class MyRegistrator implements KryoRegistrator {
        @Override
        public void registerClasses(Kryo kryo) {
          kryo.register(BidRequest.class);
          kryo.register(NotifyRequest.class);
          kryo.register(Event.class);
        }
      }

      sparkConfig.set("spark.serializer",
          "org.apache.spark.serializer.KryoSerializer");
      sparkConfig.set("spark.kryo.registrator",
          "pl.instream.dsp.offline.MyRegistrator");
      sparkConfig.set("spark.kryoserializer.buffer.mb", "10");
  32. 32. CACHE
      JavaPairRDD<String, Integer> cachedSet = sc.textFile("/data/notifies-20-00.txt")
        .mapToPair(new PairFunction<String, String, Integer>() {
          @Override
          public Tuple2<String, Integer> call(String line) throws Exception {
            NotifyRequest nr = LogParser.parseNotifyRequest(line);
            return new Tuple2<String, Integer>(nr.getFlightId(), 1);
          }
        })
        .cache();
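      Caching pays off when the same RDD feeds more than one action; a minimal
      sketch reusing cachedSet from the slide above:

      // the file is read and parsed once, on the first action;
      // the second action reuses the in-memory partitions
      long distinctFlights = cachedSet.distinct().count();
      long totalNotifies = cachedSet.count();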
  33. 33. RDD PERSISTENCE • MEMORY_ONLY • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2, … • OFF_HEAP (Tachyon, experimental)
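      cache() is shorthand for the MEMORY_ONLY level; the other levels listed above
      are selected with persist(). A minimal sketch, using the same StorageLevels
      helper class that appears in the streaming example later in the deck:

      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.api.java.StorageLevels;

      // serialized storage trades CPU time for a smaller memory footprint;
      // the *_AND_DISK levels spill partitions that do not fit in memory to disk
      JavaRDD<String> logLines = sc.textFile("data.txt")
          .persist(StorageLevels.MEMORY_AND_DISK_SER);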
  34. 34. PARTITIONS • RDD is partitioned • You may (and probably should) control the number and size of partitions with the coalesce() method (see the sketch below) • By default 1 input file = 1 partition
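      A minimal sketch of controlling the partition count; the numbers 32 and 128
      are purely illustrative:

      // many small input files produce many small partitions;
      // coalesce() merges them without a shuffle,
      // repartition() redistributes the data with a full shuffle
      JavaRDD<String> logLines = sc.textFile("s3n://my-bucket/daily/data-*.txt");

      JavaRDD<String> merged = logLines.coalesce(32);
      JavaRDD<String> rebalanced = logLines.repartition(128);

      System.out.println(rebalanced.rdd().partitions().length); // => 128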
  35. 35. PARTITIONS • If your partitions are too big, you’ll face:
      [GC 5208198K(5208832K), 0,2403780 secs]
      [Full GC 5208831K->5208212K(5208832K), 9,8765730 secs]
      [Full GC 5208829K->5208238K(5208832K), 9,7567820 secs]
      [Full GC 5208829K->5208295K(5208832K), 9,7629460 secs]
      [GC 5208301K(5208832K), 0,2403480 secs]
      [Full GC 5208831K->5208344K(5208832K), 9,7497710 secs]
      [Full GC 5208829K->5208366K(5208832K), 9,7542880 secs]
      [Full GC 5208831K->5208415K(5208832K), 9,7574860 secs]
      WARN storage.BlockManagerMasterActor: Removing BlockManager
      BlockManagerId(0, ip-xx-xx-xxx-xxx.eu-west-1.compute.internal, 60048, 0)
      with no recent heart beats: 64828ms exceeds 45000ms
  36. 36. RESULTS • result.saveAsTextFile("hdfs://<HOST>:<PORT>/out.txt") • result.saveAsObjectFile("/result/out.obj") • collect()
  37. 37. PROCESS RESULTS PARTITION BY PARTITION
      for (Partition partition : result.rdd().partitions()) {
        List<String> subresult[] = result.collectPartitions(
            new int[] { partition.index() });
        for (String line : subresult[0]) {
          System.out.println(line);
        }
      }
  38. 38. SPARK STREAMING
  39. 39. "SPARK STREAMING IS AN EXTENSION OF THE CORE SPARK API THAT ENABLES HIGH-THROUGHPUT, FAULT-TOLERANT STREAM PROCESSING OF LIVE DATA STREAMS."
  40. 40. HOW DOES IT WORK?
  41. 41. DSTREAMS • a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream • represented as a continuous sequence of RDDs
  42. 42. INITIALIZING • SparkConf conf = new SparkConf().setAppName("Real-Time Analytics").setMaster("local"); • JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(TIME_IN_MILLIS));
  43. 43. CREATING DSTREAM • JavaReceiverInputDStream<String> logLines = jssc.socketTextStream(sourceAddr, sourcePort, StorageLevels.MEMORY_AND_DISK_SER);
  44. 44. DATA SOURCES • plain TCP sockets • Apache Kafka • Apache Flume • ZeroMQ
  45. 45. TRANSFORMATIONS • map, flatMap, filter, union, join, etc. • transform • updateStateByKey
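      A minimal sketch of stateless transformations, which run independently on every
      micro-batch; logLines is the JavaReceiverInputDStream<String> from the CREATING
      DSTREAM slide, and the word count mirrors the Scala example near the end of the deck:

      import java.util.Arrays;
      import org.apache.spark.api.java.function.FlatMapFunction;
      import org.apache.spark.streaming.api.java.JavaDStream;
      import org.apache.spark.streaming.api.java.JavaPairDStream;

      JavaDStream<String> words = logLines.flatMap(new FlatMapFunction<String, String>() {
        public Iterable<String> call(String line) {
          return Arrays.asList(line.split(" "));
        }
      });

      JavaPairDStream<String, Integer> wordCounts = words
        .mapToPair(new PairFunction<String, String, Integer>() {
          public Tuple2<String, Integer> call(String word) {
            return new Tuple2<String, Integer>(word, 1);
          }
        })
        .reduceByKey(new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer a, Integer b) { return a + b; }
        });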
  46. 46. WINDOW OPERATIONS • window • countByWindow / countByValueAndWindow • reduceByWindow / reduceByKeyAndWindow
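      The same aggregation computed over a sliding window, continuing the sketch above;
      the 60-second window and 10-second slide are illustrative values, and both must be
      multiples of the batch interval:

      JavaPairDStream<String, Integer> windowedCounts = wordCounts.reduceByKeyAndWindow(
          new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) { return a + b; }
          },
          new Duration(60000),  // window length
          new Duration(10000)); // slide interval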
  47. 47. OUTPUT OPERATIONS • print • foreachRDD • saveAsObjectFiles • saveAsTextFiles
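      A minimal sketch of wiring up output and starting the job, continuing the
      streaming sketch above; the foreachRDD body only logs the batch size as an
      illustration, and nothing runs until the streaming context is started:

      wordCounts.print(); // prints the first elements of every batch

      wordCounts.foreachRDD(new Function<JavaPairRDD<String, Integer>, Void>() {
        public Void call(JavaPairRDD<String, Integer> rdd) {
          System.out.println("batch size: " + rdd.count());
          return null;
        }
      });

      jssc.start();            // start receiving and processing data
      jssc.awaitTermination(); // block until the streaming job is stopped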
  48. 48. THINGS TO REMEMBER
  49. 49. USE SPARK-SHELL TO LEARN
  50. 50. PROVIDE ENOUGH RAM TO WORKERS
  51. 51. PROVIDE ENOUGH RAM TO EXECUTOR
  52. 52. SET FRAME SIZE / BUFFERS ACCORDINGLY
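      A hedged sketch of the configuration behind slides 50-52; the values are purely
      illustrative, and the exact keys should be checked against the Spark version in
      use (these are the ones used around Spark 1.0):

      SparkConf conf = new SparkConf()
          .setAppName("TEST")
          .set("spark.executor.memory", "4g")            // RAM per executor
          .set("spark.akka.frameSize", "64")             // max task/result frame size, in MB
          .set("spark.kryoserializer.buffer.mb", "10");  // Kryo buffer, in MB
      // in standalone mode, worker RAM is set separately,
      // e.g. via SPARK_WORKER_MEMORY in conf/spark-env.sh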
  53. 53. USE KRYO SERIALIZER
  54. 54. SPLIT DATA INTO AN APPROPRIATE NUMBER OF PARTITIONS
  55. 55. PACKAGE YOUR APPLICATION IN AN UBER-JAR
  56. 56. DESIGN YOUR DATA FLOW AND…
  57. 57. BUILD A FRAMEWORK TO PROCESS DATA EFFICIENTLY
  58. 58. IT’S EASIER WITH SCALA!
      // word count example
      inputLine.flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
  59. 59. HOW DO WE USE SPARK?
  60. 60. HOW DO WE USE SPARK?
  61. 61. HOW DO WE USE SPARK?
  62. 62. THANKS!
  63. 63. we’re hiring ! mail me: bbogacki@bidlab.pl
