
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachnik at Big Data Spain 2017

Apache Spark is a great solution for building Big Data applications. It provides really fast SQL-like processing, a machine learning library, and a streaming module for near-real-time processing of data streams. Unfortunately, during application development and production deployments we often encounter many difficulties in mixing various data sources or bulk loading computed data to SQL or NoSQL databases.

https://www.bigdataspain.org/2017/talk/apache-spark-vs-rest-of-the-world-problems-and-solutions

Big Data Spain 2017
16th - 17th November, Kinépolis Madrid


  1. Apache Spark vs rest of the world - Problems and Solutions. Arkadiusz Jachnik
  2. About Arkadiusz: Senior Data Scientist at AGORA SA (user profiling & content personalization, recommendation system); PhD student at Poznan University of Technology (multi-class & multi-label classification, multi-output prediction, recommendation algorithms).
  3. Agora's BigData Team: Arek (it's me!), Wojtek, Paweł, Paweł, Dawid, Bartek, Jacek, Daniel, and my boss Luiza :) We are all here at #BDS! I invite you to the talks of these guys :)
  4. Agora - a Polish media company: internet, press, magazines, radio, cinemas, advertising, TV, books.
  5. Spark in Agora's BigData Platform: data collecting and integration, user profiling system, data analytics, recommendation system, data enrichment and content structurisation - all on a Hadoop cluster (own Spark build, v2.2; Structured Streaming, Spark SQL, MLlib, Spark Streaming; over 3 years of experience).
  6. Problems discussed today:
     1. Processing parts of data and loading them from Spark to a relational database in parallel
     2. Bulk loading to an HBase database
     3. From a relational database to a Spark DataFrame (with user-defined functions)
     4. From HBase to Spark via a Hive external table (with timestamps of HBase cells)
     5. Spark Streaming with Kafka - how to implement your own offset manager
  7. I will show some code…
     • I will show real technical problems we have encountered during Spark deployments
     • We have used Spark at Agora for over 3 years, so we have solid experience
     • I will present practical solutions, showing some code in Scala
     • Scala is natural for Spark
  8. 1. Processing and writing parts of data in parallel. Problem description:
     • We have a huge, processed DataFrame of computed recommendations for users
     • There are 4 defined types of recommendations
     • For each type we want to take the top-K recommendations for each user
     • Recommendations of each type should be loaded into a different PostgreSQL table
     Example input (User | Recommendation type | Article | Score):
     Grzegorz  | TYPE_3 | Article F | 1.0
     Bożena    | TYPE_4 | Article B | 0.2
     Grażyna   | TYPE_2 | Article B | 0.2
     Grzegorz  | TYPE_3 | Article D | 0.9
     Krzysztof | TYPE_3 | Article D | 0.4
     Grażyna   | TYPE_2 | Article C | 0.9
     Grażyna   | TYPE_1 | Article D | 0.3
     Bożena    | TYPE_2 | Article E | 0.9
     Grzegorz  | TYPE_1 | Article E | 1.0
     Grzegorz  | TYPE_1 | Article A | 0.7
  9. Code intro: input & output. For each recommendation type we take the top-K recommendations per user and save them to a separate table:
     TYPE 1 - 5 recos per user -> save to table_1 (e.g. Grzegorz: Article A 1.0, Article F 0.9, Article C 0.9, Article D 0.8, Article B 0.75; Bożena: ...)
     TYPE 2 - 4 recos per user -> save to table_2 (e.g. Krzysztof: Article F 1.0, Article D 1.0, Article C 0.8, Article B 0.85; Grażyna: Article C 1.0, ...)
     TYPE 3 - 3 recos per user -> save to table_3 (e.g. Grzegorz: Article E 1.0, Article B 0.75, Article A 0.8; Bożena: Article E 0.9, Article A 0.75, Article C 0.75)
     TYPE 4 - 2 recos per user -> save to table_4 (e.g. Grażyna: Article A 1.0, Article F 0.9; Bożena: Article B 0.9, Article D 0.9; Grzegorz: Article B 1.0, Article E 0.95)
  10. Standard approach:
      recoTypes.foreach(recoType => {
        val topNrecommendations = processedData.where($"type" === recoType.code)
          .withColumn("row_number",
            row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
          .where(col("row_number") <= recoType.recoNum).drop("row_number")
        RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
      })
      Result: no parallelism across the types; within each per-type job there is parallelism, but most of the tasks are skipped.
  11. Maybe we can add .par?
      recoTypes.par.foreach(recoType => {
        val topNrecommendations = processedData.where($"type" === recoType.code)
          .withColumn("row_number",
            row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
          .where(col("row_number") <= recoType.recoNum).drop("row_number")
        RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
      })
      Result: parallelism, but too many tasks :(
  12. Our trick (supporting definitions are sketched below):
      parallelizeProcessing(recoTypes, (recoType: RecoType) => {
        val topNrecommendations = processedData.where($"type" === recoType.code)
          .withColumn("row_number",
            row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
          .where(col("row_number") <= recoType.recoNum).drop("row_number")
        RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
      })

      def parallelizeProcessing(recoTypes: Seq[RecoType], f: RecoType => Unit) = {
        // execute the Spark action for the first type alone...
        f(recoTypes.head)
        // ...then parallelize the rest
        if (recoTypes.tail.nonEmpty) recoTypes.tail.par.foreach(f(_))
      }
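      The slide does not show the supporting types, so the following is a minimal, hypothetical sketch of what RecoType and the driver values might look like, just to make the snippet above self-contained (the names and numbers are assumptions matching slides 8-9, not code from the talk). The presumed point of the trick is that the first type is processed alone, so the shared upstream work on processedData is done once before the remaining, now cheaper jobs are submitted concurrently via .par.

      // Hypothetical shape of the per-type configuration used above
      case class RecoType(code: String, recoNum: Int, tableName: String)

      // Assumed example values matching the earlier slides
      val recoTypes: Seq[RecoType] = Seq(
        RecoType("TYPE_1", 5, "table_1"),
        RecoType("TYPE_2", 4, "table_2"),
        RecoType("TYPE_3", 3, "table_3"),
        RecoType("TYPE_4", 2, "table_4")
      )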
  13. 2. Fast bulk-loading to HBase. Problems with the standard HBase client (inserts with the Put class):
      • Difficult integration with Spark
      • Complicated parallelization
      • For non pre-split tables, problems with *Region*Exceptions
      • Slow for millions of rows
      The usual pattern is Spark DataFrame / RDD -> .foreachPartition -> hTable.put(…) in every partition (see the sketch below).
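      For reference, a minimal sketch of that foreachPartition + Put pattern the slide criticizes, assuming a DataFrame df with string columns key and value, a table my_table and a column family cf (these names are hypothetical, not from the talk):

      import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
      import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
      import org.apache.hadoop.hbase.util.Bytes

      df.rdd.foreachPartition { rows =>
        // one connection and table handle per partition
        val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = connection.getTable(TableName.valueOf("my_table"))
        try {
          rows.foreach { row =>
            val put = new Put(Bytes.toBytes(row.getAs[String]("key")))
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"),
              Bytes.toBytes(row.getAs[String]("value")))
            table.put(put) // one Put per row: this is what gets slow for millions of rows
          }
        } finally {
          table.close()
          connection.close()
        }
      }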
  14. Idea. Our approach is based on https://github.com/zeyuanxy/spark-hbase-bulk-loading
      Input RDD:
      data: RDD[(            // pair RDD
        Array[Byte],         // HBase row key
        Map[                 // data:
          String,            // column family
          Array[(
            String,          // column name
            (String,         // cell value
             Long)           // timestamp
          )]
        ]
      )]
      General idea: we have to save our RDD data as HFiles (HBase data is stored in such files) and load them into the given pre-existing table.
      General steps (a sketch of step 1 follows this slide):
      1. Implement a Spark Partitioner that defines how our key-value pair RDD should be partitioned by HBase row key
      2. Repartition and sort the RDD within column families and starting row keys for every HBase region
      3. Save the RDD to HDFS as HFiles with the rdd.saveAsNewAPIHadoopFile method
      4. Load the files into the table with LoadIncrementalHFiles (HBase API)
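      Step 1 is only named on the slide; here is a minimal, assumed sketch of such a partitioner, keyed by the region start keys obtained from the RegionLocator. The version used in the talk also takes a fraction argument to split large regions into several HFiles, which is omitted here.

      import org.apache.hadoop.hbase.util.Bytes
      import org.apache.spark.Partitioner

      // One Spark partition per HBase region, chosen by comparing the row key
      // with the sorted region start keys.
      class HFilePartitioner(regionStartKeys: Array[Array[Byte]]) extends Partitioner {

        override def numPartitions: Int = regionStartKeys.length

        override def getPartition(key: Any): Int = {
          // keys of the bulk-load RDD look like ((rowKey, qualifier), timestamp)
          val rowKey = key match {
            case ((row: Array[Byte], _), _) => row
            case row: Array[Byte]           => row
          }
          // index of the last region whose start key is <= rowKey
          val idx = regionStartKeys.lastIndexWhere(start => Bytes.compareTo(start, rowKey) <= 0)
          math.max(idx, 0)
        }
      }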
  15. Implementation (a sketch of getPartitionedRdd follows this slide):
      // Prepare HBase connection, table and region locator
      // (hConnection, tableName, hTable prepared earlier ...)
      val regionLocator = hConnection.getRegionLocator(tableName)
      val columnFamilies = hTable.getTableDescriptor
        .getFamiliesKeys.map(Bytes.toString(_))

      // Prepare the Spark partitioner for HBase regions
      val partitioner = new HFilePartitioner(regionLocator.getStartKeys, fraction)

      // Repartition and sort data within partitions by the partitioner
      val rdds = for {
        family <- columnFamilies
        rdd = data
          .collect { case (key, dataMap) if dataMap.contains(family) => (key, dataMap(family)) }
          .flatMap { case (key, familyDataMap) =>
            familyDataMap.map { case (column: String, valueTs: (String, Long)) =>
              (((key, Bytes.toBytes(column)), valueTs._2), Bytes.toBytes(valueTs._1))
            }
          }
      } yield getPartitionedRdd(rdd, family, partitioner)
      val rddToSave = rdds.reduce(_ ++ _)

      // Prepare the map-reduce job for the bulk load
      HFileOutputFormat2.configureIncrementalLoad(job, hTable, regionLocator)

      // Prepare the path for the HFiles output
      val fs = FileSystem.get(hbaseConfig)
      val hFilePath = new Path(...)

      try {
        // Save HFiles to HDFS via saveAsNewAPIHadoopFile
        rddToSave.saveAsNewAPIHadoopFile(hFilePath.toString,
          classOf[ImmutableBytesWritable], classOf[KeyValue],
          classOf[HFileOutputFormat2], job.getConfiguration)

        // Prepare HFiles for incremental load by setting
        // folder permissions to read/write/exec for all...
        setRecursivePermission(hFilePath)

        // Load HFiles into the HBase table
        val loader = new LoadIncrementalHFiles(hbaseConfig)
        loader.doBulkLoad(hFilePath, hConnection.getAdmin, hTable, regionLocator)
      } // finally close resources, ...
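      The getPartitionedRdd helper is referenced above but not shown on the slide. A minimal sketch of how it could look, under the assumption that it repartitions by a region-aware Partitioner (such as the simplified one sketched earlier) and sorts cells the way HFile writers expect (row and qualifier ascending, timestamp descending):

      import org.apache.hadoop.hbase.KeyValue
      import org.apache.hadoop.hbase.io.ImmutableBytesWritable
      import org.apache.hadoop.hbase.util.Bytes
      import org.apache.spark.Partitioner
      import org.apache.spark.rdd.RDD

      def getPartitionedRdd(
          rdd: RDD[(((Array[Byte], Array[Byte]), Long), Array[Byte])],
          family: String,
          partitioner: Partitioner): RDD[(ImmutableBytesWritable, KeyValue)] = {

        // HFile writers require cells in KeyValue order:
        // row asc, qualifier asc, timestamp desc
        implicit val kvOrdering: Ordering[((Array[Byte], Array[Byte]), Long)] =
          new Ordering[((Array[Byte], Array[Byte]), Long)] {
            def compare(a: ((Array[Byte], Array[Byte]), Long),
                        b: ((Array[Byte], Array[Byte]), Long)): Int = {
              val rowCmp = Bytes.compareTo(a._1._1, b._1._1)
              if (rowCmp != 0) rowCmp
              else {
                val qualCmp = Bytes.compareTo(a._1._2, b._1._2)
                if (qualCmp != 0) qualCmp else b._2.compareTo(a._2)
              }
            }
          }

        val familyBytes = Bytes.toBytes(family)
        rdd
          .repartitionAndSortWithinPartitions(partitioner)
          .map { case (((row, qualifier), ts), value) =>
            (new ImmutableBytesWritable(row),
             new KeyValue(row, familyBytes, qualifier, ts, value))
          }
      }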
  16. Keep in mind:
      • Set the HBase parameter hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily optimally (default 32); for large data, too small a value of this parameter may cause IllegalArgumentException: Size exceeds Integer.MAX_VALUE (see the sketch below)
      • Create HBase tables with splits adapted to the expected row keys - for example, for row keys of HEX IDs create the table with splits like:
        create 'hbase_table_name', 'col-fam', {SPLITS => ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']}
        - for further single puts this minimizes *Region*Exceptions
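      A hedged example of tuning that parameter in code before the bulk load (the value 128 is only an illustration, not a recommendation from the talk):

      import org.apache.hadoop.hbase.HBaseConfiguration

      val hbaseConfig = HBaseConfiguration.create()
      // allow more HFiles per region and column family than the default 32
      hbaseConfig.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 128)
      // ...then pass hbaseConfig to LoadIncrementalHFiles as in the implementation slide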
  17. 3. Loading data from Postgres to Spark.
      This is possible for data from Hive:
      val toUpperCase: String => String = _.toUpperCase
      val toUpperCaseUdf = udf(toUpperCase)
      val data: DataFrame = sparkSession.sql(
        "SELECT id, toUpperCaseUdf(code) FROM types")
      But this is not possible for data from JDBC (for example PostgreSQL), because the subquery is executed by Postgres (not Spark), which knows nothing about Spark UDFs:
      val toUpperCase: String => String = _.toUpperCase
      val toUpperCaseUdf = udf(toUpperCase)
      val jdbcUrl = s"jdbc:postgresql://host:port/database"
      val data: DataFrame = sparkSession.read
        .jdbc(jdbcUrl, "(SELECT toUpperCaseUdf(code) " +
          "FROM codes) as codesData", connectionConf)
      Here you can also specify just the Postgres table name instead of a subquery - but then how to parallelize the data loading?
  18. Our solution: try to load 'raw' data without UDFs and then use .withColumn with the UDF as an expression (.jdbc produces a DataFrame):
      val toUpperCase: String => String = _.toUpperCase
      val toUpperCaseUdf = udf(toUpperCase)
      val jdbcUrl = s"jdbc:postgresql://host:port/database"
      val data: DataFrame = sparkSession.read
        .jdbc(jdbcUrl, "(SELECT code " +
          "FROM codes) as codesData", connectionConf)
        .withColumn("upperCode", expr("toUpperCaseUdf(code)"))
      But it's one partition! So we also split the table read across executors on a selected column:
      val jdbcUrl = s"jdbc:postgresql://host:port/database"
      val data: DataFrame = sparkSession.read
        .jdbc(
          url = jdbcUrl,
          table = "(SELECT code, type_id " +
            "FROM codes) as codesData",
          columnName = "type_id",
          lowerBound = 1L,
          upperBound = 100L,
          numPartitions = 10,
          connectionProperties = connectionConf)
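      One detail worth noting (our addition, not from the slides): referring to a UDF by name inside expr(...) or a SQL string only works after the function has been registered; the udf(...) wrapper alone is usable only as a Column function. A minimal sketch of both variants, reusing the sparkSession and data from the snippet above:

      import org.apache.spark.sql.functions.{col, expr, udf}

      val toUpperCase: String => String = _.toUpperCase

      // variant 1: register the UDF by name, then use it in expr(...) / SQL strings
      sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
      val withUpper1 = data.withColumn("upperCode", expr("toUpperCaseUdf(code)"))

      // variant 2: skip registration and apply the Column-based UDF directly
      val toUpperCaseUdf = udf(toUpperCase)
      val withUpper2 = data.withColumn("upperCode", toUpperCaseUdf(col("code")))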
  19. Is it working? On our test data the first read produces 1 partition, the second 4 partitions:
      spark.read.jdbc(
        url = "jdbc:mysql://localhost:3306/test",
        table = "users",
        properties = connectionProperties)
        .cache()

      spark.read.jdbc(
        url = "jdbc:mysql://localhost:3306/test",
        table = "users",
        columnName = "type",
        lowerBound = 1L,
        upperBound = 100L,
        numPartitions = 4,
        connectionProperties = connectionProperties)
        .cache()
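      A simple, assumed way to verify the partition counts mentioned above, reusing the same reads and connectionProperties:

      val single = spark.read.jdbc(
        url = "jdbc:mysql://localhost:3306/test",
        table = "users",
        properties = connectionProperties)

      val split = spark.read.jdbc(
        url = "jdbc:mysql://localhost:3306/test",
        table = "users",
        columnName = "type",
        lowerBound = 1L,
        upperBound = 100L,
        numPartitions = 4,
        connectionProperties = connectionProperties)

      // prints 1 and (up to) 4 respectively
      println(single.rdd.getNumPartitions)
      println(split.rdd.getNumPartitions)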
  20. 4. From HBase to Spark via Hive. A commonly used method for loading data from HBase to Spark is a Hive external table (see the usage sketch below):
      CREATE TABLE hive_view_on_hbase (
        key int,
        value string
      )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        "hbase.columns.mapping" = ":key, cf1:val"
      )
      TBLPROPERTIES (
        "hbase.table.name" = "xyz"
      );
      Example: an HBase table with column family "cities" (counts of visits per city), mapped by the Hive-HBase handler to rows (user_id, cities_map, last_city):
      72A9DBA74524 -> map(Poznan->40, Warsaw->5, Cracow->1, Gdansk->3), last_city = ?
      58383B36275A -> map(Warsaw->120, Cracow->60, Gdansk->5), last_city = ?
      009D22419988 -> map(Poznan->75, Warsaw->1), last_city = ?
      But how to get the last (most recent) values? Where are the timestamps?
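      For completeness, a minimal assumed sketch of reading such a Hive view from Spark once the external table exists (it requires a SparkSession built with enableHiveSupport()):

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("hbase-via-hive")
        .enableHiveSupport()   // use the Hive metastore where hive_view_on_hbase is defined
        .getOrCreate()

      // the HBase table is queried through the Hive storage handler like a normal table
      val users = spark.sql("SELECT key, value FROM hive_view_on_hbase")
      users.show()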
  21. Our case:
      • We use the HDP distribution of the Hadoop cluster with HBase 1.1.x
      • It is possible to add to the Hive view on an HBase table the latest timestamp of the row modification:
      CREATE TABLE hive_view_on_hbase (
        key int,
        value string,
        ts timestamp
      )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        'hbase.columns.mapping' = ':key, cf1:val, :timestamp'
      )
      TBLPROPERTIES (
        'hbase.table.name' = 'xyz'
      );
      • But how to extract the timestamp of each cell?
      • Answer: rewrite the Hive-HBase-Handler that is responsible for creating the Hive views on HBase tables :) … but first …
      • Do not download the Hive source code from the Hive GitHub repository - check your Hadoop distribution! (For example, HDP has its own code branch.) See the raw-client sketch below for where the per-cell timestamps actually live.
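      As a point of reference (our addition, not part of the talk), the plain HBase client already exposes per-cell timestamps; this is the information a patched Hive-HBase-Handler would need to surface. The table, row key and column names below are the hypothetical ones from the example above:

      import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
      import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
      import org.apache.hadoop.hbase.util.Bytes

      val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = connection.getTable(TableName.valueOf("xyz"))
      try {
        val result = table.get(new Get(Bytes.toBytes("72A9DBA74524")))
        // the latest cell of cities:Poznan carries its own timestamp
        val cell = result.getColumnLatestCell(Bytes.toBytes("cities"), Bytes.toBytes("Poznan"))
        if (cell != null) {
          val value = Bytes.toString(CellUtil.cloneValue(cell))
          println(s"value=$value, timestamp=${cell.getTimestamp}")
        }
      } finally {
        table.close()
        connection.close()
      }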
  22. There is a patch in the Hive repo… …but it is still not reviewed and merged :(
  23. There is a lot of code… …but we have some tips on how to change the Hive-HBase-Handler:
      • The parsing of the hbase.columns.mapping columns is located in HBaseSerDe.java, which returns a ColumnMappings object
      • The LazyHBaseRow class stores the data of an HBase row
      • Timestamps of the processed HBase cells can be read from the rows loaded by the scanner in the LazyHBaseCellMap class
      • The column parser and the HBase scanner are initialized in HBaseStorageHandler.java
  24. 5. Spark + Kafka: own offset manager. Problem description:
      • Spark output operations are at-least-once
      • For exactly-once semantics, you must store offsets after an idempotent output, or in an atomic transaction alongside the output
      • Options:
        1. Checkpoints: + easy to enable via Spark checkpointing; - the output operation must be idempotent; - you cannot recover from a checkpoint if the application code has changed
        2. Own data store: + works regardless of changes to your application code; + you can use data stores that support transactions; + exactly-once semantics
      For a single Spark batch: process and save the data, then save the offsets. (Image source: Spark Streaming documentation, https://spark.apache.org/docs/latest/streaming-programming-guide.html)
  25. Some code with Spark Streaming (single Spark batch: process and save the data, then save the offsets):
      val ssc: StreamingContext = new StreamingContext(…)
      val stream: DStream[ConsumerRecord[String, String]] = ...

      stream.foreachRDD(rdd => {
        // process and save the data
        val toSave: Seq[String] = rdd.collect().map(_.value())
        saveData(toSave)
        // save the offsets
        offsetsStore.saveOffsets(rdd, ...)
      })
  26. Some code with Spark Streaming:
      val ssc: StreamingContext = new StreamingContext(...)
      val stream: DStream[ConsumerRecord[String, String]] =
        kafkaStream(topic, zkPath, ssc, offsetsStore, kafkaParams)

      stream.foreachRDD(rdd => {
        val toSave: Seq[String] = rdd.collect().map(_.value())
        saveData(toSave)
        offsetsStore.saveOffsets(rdd, zkPath)
      })

      def kafkaStream(topic: String, zkPath: String, ssc: StreamingContext,
                      offsetsStore: MyOffsetsStore, kafkaParams: Map[String, Object])
          : DStream[ConsumerRecord[String, String]] = {
        // resume from stored offsets if there are any, otherwise subscribe from scratch
        offsetsStore.readOffsets(topic, zkPath) match {
          case Some(offsetsMap) =>
            KafkaUtils.createDirectStream[String, String](ssc,
              LocationStrategies.PreferConsistent,
              ConsumerStrategies.Assign[String, String](offsetsMap.map(_._1), kafkaParams, offsetsMap))
          case None =>
            KafkaUtils.createDirectStream[String, String](ssc,
              LocationStrategies.PreferConsistent,
              ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams))
        }
      }
  27. Code of the offset store:
      class MyOffsetsStore(zkHosts: String) {

        val zkUtils = ZkUtils(zkHosts, 10000, 10000, false)

        def saveOffsets(rdd: RDD[_], zkPath: String): Unit = {
          val offsetsRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          offsetsRanges.groupBy(_.topic).foreach { case (topic, offsetsRangesPerTopic) =>
            val offsetsRangesStr = offsetsRangesPerTopic
              .map(offRang => s"${offRang.partition}:${offRang.untilOffset}").mkString(",")
            zkUtils.updatePersistentPath(zkPath, offsetsRangesStr)
          }
        }

        def readOffsets(topic: String, zkPath: String): Option[Map[TopicPartition, Long]] = {
          val (offsetsRangesStrOpt, _) = zkUtils.readDataMaybeNull(zkPath)
          offsetsRangesStrOpt match {
            case Some(offsetsRangesStr) =>
              Some(offsetsRangesStr.split(",").map(s => s.split(":")).map {
                case Array(partitionStr, offsetStr) =>
                  new TopicPartition(topic, partitionStr.toInt) -> offsetStr.toLong
              }.toMap)
            case None => None
          }
        }
      }
  28. Thank you! Questions? arkadiusz.jachnik@agora.pl www.linkedin.com/in/arkadiusz-jachnik
