Apache Spark is a great solution for building Big Data applications. It provides fast SQL-like processing, a machine learning library, and a streaming module for near-real-time processing of data streams. Unfortunately, during application development and production deployment we often encounter difficulties in mixing various data sources or bulk loading computed data into SQL or NoSQL databases.
https://www.bigdataspain.org/2017/talk/apache-spark-vs-rest-of-the-world-problems-and-solutions
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha... (Big Data Spain)
In this presentation, attendees will see how to speed up existing Hadoop and Spark deployments by just making Apache Ignite responsible for RAM utilization. No code modifications, no new architecture from scratch!
https://www.bigdataspain.org/2017/talk/boost-hadoop-and-spark-with-in-memory-technologies
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ... (Databricks)
Current workshop diagnostics are based on manually generated decision trees. This approach is increasingly reaching its limits due to growing variant diversity and the increasing complexity of vehicle systems. This session will describe BMW's new Apache Spark-enabled approach: use the available data from cars and workshops to train models that are able to predict the right part to switch, or the action to take.
You'll get an overview and presentation of BMW's complete pipeline, including ETL, model training based on Spark 2.1, serializing results along with metadata, and serving the gained insights as a web app. You'll also hear how Spark helped BMW leverage the information from millions of observations and thousands of features, learn what pitfalls they experienced (e.g. setting up a working dev toolchain, working with 50K features, parallelizing well), and how you can avoid them.
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop (MapR Technologies)
http://bit.ly/1BTaXZP – Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and as such there's been plenty of hype about it in recent months. But how much of the discussion is marketing spin, and what are the facts? MapR and Databricks, the company that created and led the development of the Spark stack, will cut through the noise to uncover practical advantages of having the full set of Spark technologies at your disposal and reveal the benefits of running Spark on Hadoop.
This presentation was given at a webinar hosted by Data Science Central and co-presented by MapR + Databricks.
To see the webinar, please go to: http://www.datasciencecentral.com/video/let-spark-fly-advantages-and-use-cases-for-spark-on-hadoop
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python (Miklos Christine)
Apache Spark is the next big data processing tool for data scientists. As seen in a recent StackOverflow analysis, it's the hottest big data technology on their site! In this talk, I'll use the PySpark interface to leverage the speed and performance of Apache Spark. I'll focus on the end-to-end workflow for getting data into a distributed platform, and leverage Spark to process the data for advanced analytics. I'll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I'll explain the performance differences between Scala and Python, and how Spark has bridged the gap in performance. I'll focus on PySpark as the interface to the platform, and walk through a demo to showcase the APIs.
Talk Overview:
Spark's architecture: what's out now and what's in Spark 2.0. Spark APIs: the most common APIs used in Spark. Common misconceptions and proper techniques for using Spark.
Demo:
Walk through ETL of the Reddit dataset. SparkSQL analytics and visualizations of the dataset using Matplotlib. Sentiment analysis on Reddit comments.
Best Practices for Building and Deploying Data Pipelines in Apache Spark (Databricks)
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask (Víctor Zabalza)
# Talk given at PyCon UK 2017
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will detail the inner workings of a Python package that we have built which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
Lens: Data exploration with Dask and Jupyter widgets (Víctor Zabalza)
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will present Lens (https://github.com/asidatascience/lens), a Python package which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
Cosmos DB Real-time Advanced Analytics Workshop (Databricks)
The workshop implements an innovative fraud detection solution as a PoC for a bank that provides payment processing services for commerce to merchant customers across the globe, helping them save costs by applying machine learning and advanced analytics to detect fraudulent transactions. Since their customers are around the world, the right solution should minimize any latency experienced using their service by distributing as much of the solution as possible, as closely as possible, to the regions in which their customers use the service. The workshop designs a data pipeline solution that leverages Cosmos DB for both the scalable ingest of streaming data and the globally distributed serving of both pre-scored data and machine learning models. Cosmos DB's major advantage when operating at a global scale is its high concurrency with low latency and predictable results.
This combination is unique to Cosmos DB and ideal for the bank's needs. The solution leverages the Cosmos DB change data feed in concert with Azure Databricks Delta and Spark capabilities to enable a modern data warehouse solution that can be used to create risk-reduction solutions for scoring transactions for fraud in an offline, batch approach and in a near-real-time, request/response approach. https://github.com/Microsoft/MCW-Cosmos-DB-Real-Time-Advanced-Analytics Takeaway: how to leverage Azure Cosmos DB and Azure Databricks along with Spark ML to build innovative advanced analytics pipelines.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ... (Databricks)
Of all the developers' delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of the three sets of APIs (RDDs, DataFrames, and Datasets) available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practice; 2) their performance and optimization benefits; and 3) scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (This talk is a vocalization of the blog, along with the latest developments in Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Unlocking Your Hadoop Data with Apache Spark and CDH5 (SAP Concur)
The Spark/Mesos Seattle Meetup group shares the latest presentation from their recent meetup event, showcasing real-world implementations of working with Spark within the context of your Big Data infrastructure.
Sessions are demo-heavy and slide-light, focusing on getting your development environment up and running, configuration issues, SparkSQL vs. Hive, etc.
To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/
Project Tungsten: Bringing Spark Closer to Bare Metal (Databricks)
As part of the Tungsten project, Spark has started an ongoing effort to dramatically improve performance to bring the execution closer to bare metal. In this talk, we’ll go over the progress that has been made so far and the areas we’re looking to invest in next. This talk will discuss the architectural changes that are being made as well as some discussion into how Spark users can expect their application to benefit from this effort. The focus of the talk will be on Spark SQL but the improvements are general and applicable to multiple Spark technologies.
Koalas: Making an Easy Transition from Pandas to Apache Spark (Databricks)
In this talk, we present Koalas, a new open-source project that aims at bridging the gap between big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python.
Pandas is the standard tool for data science in Python, and it is typically the first step for data scientists to explore and manipulate a data set. The problem is that pandas does not scale well to big data. It was designed for small data sets that a single machine can handle.
When data scientists work today with very large data sets, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas dataframes.
Through live demonstrations and code samples, you will understand: how to effectively leverage both pandas and Spark inside the same code base; how to leverage powerful pandas concepts such as lightweight indexing with Spark; and technical considerations for unifying the different behaviors of Spark and pandas.
Assessing Graph Solutions for Apache Spark (Databricks)
Users have several options for running graph algorithms with Apache Spark. To support a graph data architecture on top of its linear-oriented DataFrames, the Spark platform offers GraphFrames. However, because GraphFrames are immutable and not a native graph, they might not offer the features or performance needed for certain use cases. Another option is to connect Spark to a real-time, scalable, and distributed native graph database such as TigerGraph.
In this session, we compare three options — GraphX, Cypher for Apache Spark, and TigerGraph — for different types of workload requirements and data sizes, to help users select the right solution for their needs. We also look at the data transfer and loading time for TigerGraph.
Introduction to the Hadoop Ecosystem (FrOSCon Edition) (Uwe Printz)
Talk held at FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Enabling Biobank-Scale Genomic Processing with Spark SQL (Databricks)
With the size of genomic data doubling every seven months, existing tools in the genomic space designed for the gigabyte scale tip over when used to process the terabytes of data being made available by current biobank-scale efforts. To enable common genomic analyses at massive scale while being flexible to ad-hoc analysis, Databricks and Regeneron Genetics Center have partnered to launch an open-source project.
The project includes optimized DataFrame readers for loading genomics data formats, as well as Spark SQL functions to perform statistical tests and quality control analyses on genomic data. We discuss a variety of real-world use cases for processing genomic variant data, which represents how an individual’s genomic sequence differs from the average human genome. Two use cases we will discuss are: joint genotyping, in which multiple individuals’ genomes are analyzed as a group to improve the accuracy of identifying true variants; and variant effect annotation, which annotates variants with their predicted biological impact. Enabling such workflows on Spark follows a straightforward model: we ingest flat files into DataFrames, prepare the data for processing with common Spark SQL primitives, perform the processing on each partition or row with existing genomic analysis tools, and save the results to Delta or flat files.
Patrick Wendell, a founding committer of Spark, gave this talk at Strata London 2015 about Apache Spark.
These slides provide an introduction to Spark and delve into future developments, including DataFrames, the Datasource API, the Catalyst logical optimizer, and Project Tungsten.
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop (MongoDB)
What's the Scoop on MongoDB & Hadoop
Jake Angerman, Sr. Solutions Architect, MongoDB
MongoDB Evenings Dallas
March 30, 2016 at the Addison Treehouse, Dallas, TX
Apache Parquet - Apache Big Data North America 2017 (techmaddy)
Apache Parquet makes the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem. Apache Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested namespaces. Apache Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Apache Parquet allows compression schemes to be specified on a per-column level and is future-proofed to allow adding more encodings as they are invented and implemented. This talk highlights the internal implementation of Apache Parquet.
This is an Apache Pig & Pig Latin session.
We provide training on Big Data & Hadoop, Hadoop Admin, MongoDB, Data Analytics with R, Python, etc.
Our Big Data & Hadoop course consists of an introduction to Hadoop and Big Data, HDFS architecture, MapReduce, YARN, Pig Latin, Hive, HBase, Mahout, ZooKeeper, Oozie, Flume, Spark, and NoSQL, with quizzes and assignments.
To watch the video or know more about the course, please visit http://www.knowbigdata.com/page/big-data-and-hadoop-online-instructor-led-training
JSON_TO_HIVE_SCHEMA_GENERATOR is a tool that effortlessly converts your JSON data to a Hive schema, which can then be used with Hive to carry out processing of the data. It is designed to automatically generate a Hive schema from JSON data. It takes into account various issues (multiple JSON objects per file, NULL values, the absence of certain fields, etc.) and can parse millions of records to obtain a schema definition for data with nested structures.
Follow : https://github.com/jainpayal12/Json_To_HiveSchema_Generator.git
SAP PowerDesigner Masterclass for the UK SAP Database & Technology User Group... (George McGeachie)
An opportunity I could not miss - a 2-hour presentation on modelling to an audience of database experts!
Starting with a brief look at using Visio and/or Excel for data modelling and governance, I talked about the extras we can gain by using PowerDesigner to design databases.
Of course, it's not 'just' databases we're concerned with; the relationships those databases have with our business and technical architecture are also important.
The next key topic is the role of data models and others (such as the Requirements model) in governance and design.
Next, it's mapping data sources and targets to demonstrate and create data lineage, showing how PowerDesigner supports multiple DBMS versions (and what you can do to change how it does that), creating a cube for Business Objects, and finally (almost) focusing on the support provided for SAP IQ.
Finally, I described some real-world uses of PowerDesigner.
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017 (Big Data Spain)
Insights can only be as good as the data. The data quality domain is enormously large, so you need to understand your company pain points to know what to focus on first.
https://www.bigdataspain.org/2017/talk/big-data-big-quality
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Scaling a backend for a big data and blockchain environment by Rafael Ríos at... (Big Data Spain)
2gether is a financial platform based on Blockchain, Big Data and Artificial Intelligence that allows interaction between users and third-party services in a single interface.
https://www.bigdataspain.org/2017/talk/scaling-a-backend-for-a-big-data-and-blockchain-environment
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017 (Big Data Spain)
All modern Big Data solutions, like Hadoop, Kafka or the rest of the ecosystem tools, are designed as distributed processes and as such include some sort of redundancy for High Availability.
https://www.bigdataspain.org/2017/talk/disaster-recovery-for-big-data
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ... (Big Data Spain)
This talk shows the power of this new set of tools for data science. It is really easy to start applying these techniques in your current workflow.
https://www.bigdataspain.org/2017/talk/data-science-for-lazy-people-automated-machine-learning
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ... (Big Data Spain)
GPUs in the cloud as Infrastructure as a Service (IaaS) seem like a commodity. However, efficiently distributing deep learning tasks across several GPUs is challenging.
https://www.bigdataspain.org/2017/talk/training-deep-learning-models-on-multiple-gpus-in-the-cloud
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D... (Big Data Spain)
Unbalanced data is a specific data configuration that appears commonly in nature. Applying machine learning techniques to this kind of data is a difficult process, usually addressed by unbalanced reduction techniques.
https://www.bigdataspain.org/2017/talk/unbalanced-data-same-algorithms-different-techniques
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
State of the art time-series analysis with deep learning by Javier Ordóñez at... (Big Data Spain)
Time series related problems have traditionally been solved using engineered features obtained by heuristic processes.
https://www.bigdataspain.org/2017/talk/state-of-the-art-time-series-analysis-with-deep-learning
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Trading at market speed with the latest Kafka features by Iñigo González at B... (Big Data Spain)
Not long ago, only banks and hedge funds could afford automated and High Frequency Trading, that is, the ability to send buy orders for commodities at microsecond intervals.
https://www.bigdataspain.org/2017/talk/trading-at-market-speed-with-the-latest-kafka-features
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data... (Big Data Spain)
The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day.
https://www.bigdataspain.org/2017/talk/apache-samza-jake-maes
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
The Analytic Platform behind IBM's Watson Data Platform by Luciano Resende a... (Big Data Spain)
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale.
https://www.bigdataspain.org/2017/talk/the-analytic-platform-behind-ibms-watson-data-platform
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da... (Big Data Spain)
Artificial Intelligence and Data-centric businesses.
https://www.bigdataspain.org/2017/talk/tbc
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Why big data didn't end causal inference by Totte Harinen at Big Data Spain 2017 (Big Data Spain)
Ten years ago there were rumours of the death of causal inference. Big data was supposed to enable us to rely on purely correlational data to predict and control the world.
https://www.bigdataspain.org/2017/talk/why-big-data-didnt-end-causal-inference
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at... (Big Data Spain)
The Internet Meme Index will be the new standard for analyzing and predicting the fads and sensations that circulate on the Internet.
https://www.bigdataspain.org/2017/talk/meme-index-analyzing-fads-and-sensations-on-the-internet
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat... (Big Data Spain)
Geotab is a leader in the expanding world of the Internet of Things (IoT) and the telematics industry, powered by Big Data.
https://www.bigdataspain.org/2017/talk/vehicle-big-data-that-drives-smart-city-advancement
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P... (Big Data Spain)
The talk will focus on explaining why operational databases do not scale due to limitations in legacy transactional management.
https://www.bigdataspain.org/2017/talk/end-of-the-myth-ultra-scalable-transactional-management
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart... (Big Data Spain)
In recent years Machine Learning (ML) and especially Deep Learning (DL) have achieved great success in many areas such as visual recognition, NLP or even aiding in medical research.
https://www.bigdataspain.org/2017/talk/attacking-machine-learning-used-in-antivirus-with-reinforcement
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ... (Big Data Spain)
The primary function of the banking sector is promoting economic activity, which means "commerce": exchanging what someone produces or has for something that someone consumes or desires.
https://www.bigdataspain.org/2017/talk/more-people-less-banking-blockchain
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017 (Big Data Spain)
Bol.com has been an early Hadoop user: since 2008, when its cluster was first built for a recommendation algorithm.
https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can... (Big Data Spain)
In an era of growing data complexity and volume and the advent of Big Data, feature selection has a key role to play in helping reduce high-dimensionality in machine learning problems.
https://www.bigdataspain.org/2017/talk/feature-selection-for-big-data-advances-and-challenges
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachnik at Big Data Spain 2017
1.
2. Apache Spark vs rest of the world
- Problems and Solutions
Arkadiusz Jachnik
3. #BigDataSpain 2017
About Arkadiusz
• Senior Data Scientist at AGORA SA
- user profiling & content personalization
- recommendation system
• PhD Student at
Poznan University of Technology
- multi-class & multi-label classification
- multi-output prediction
- recommendation algorithms
4. #BigDataSpain 2017
Agora’s BigData Team
[Team photo] my boss Luiza :) it's me! We are all here at #BDS! I invite you to the talk of these guys :)
Arek, Wojtek, Paweł, Paweł, Dawid, Bartek, Jacek, Daniel
6. #BigDataSpain 2017
Spark in Agora's BigData Platform
[Architecture diagram] Platform components: data collecting and integration; user profiling system; recommendation system; data analytics; data enrichment and content structurisation. Runs on an own-build Hadoop cluster (v2.2) with Spark: Structured Streaming, Spark Streaming, Spark SQL, MLlib. Over 3 years of experience.
7. #BigDataSpain 2017
Problems discussed today
1. Processing parts of data and loading them from Spark into a relational database in parallel
2. Bulk loading into an HBase database
3. From a relational database to a Spark DataFrame (with user-defined functions)
4. From HBase to Spark via a Hive external table (with timestamps of HBase cells)
5. Spark Streaming with Kafka: how to implement your own offset manager
8. #BigDataSpain 2017
I will show some code…
• I will show real technical problems we have encountered during Spark deployments
• We have used Spark at Agora for over 3 years, so we have a lot of experience
• I will present practical solutions, showing some code in Scala
• Scala is natural for Spark
9. 1. Processing and writing parts of data in parallel
Problem description:
• We have a huge processed DataFrame of computed recommendations for users
• There are 4 defined types of recommendations
• For each type we want to take the top-K recommendations for each user
• Recommendations of each type should be loaded into a different PostgreSQL table
#BigDataSpain 2017
User | Recommendation type | Article | Score
Grzegorz | TYPE_3 | Article F | 1.0
Bożena | TYPE_4 | Article B | 0.2
Grażyna | TYPE_2 | Article B | 0.2
Grzegorz | TYPE_3 | Article D | 0.9
Krzysztof | TYPE_3 | Article D | 0.4
Grażyna | TYPE_2 | Article C | 0.9
Grażyna | TYPE_1 | Article D | 0.3
Bożena | TYPE_2 | Article E | 0.9
Grzegorz | TYPE_1 | Article E | 1.0
Grzegorz | TYPE_1 | Article A | 0.7
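The deck never shows the schema of processedData explicitly; judging from the table above and the code on the following slides (which use the columns name, type, and score), a minimal illustrative sketch of the input, assuming an active SparkSession named spark, could be:

case class Recommendation(name: String, `type`: String, article: String, score: Double)

// Hypothetical sample rows matching the table above; in production this
// DataFrame is the output of the recommendation pipeline, not hand-built.
val processedData = spark.createDataFrame(Seq(
  Recommendation("Grzegorz", "TYPE_3", "Article F", 1.0),
  Recommendation("Bożena", "TYPE_4", "Article B", 0.2),
  Recommendation("Grażyna", "TYPE_2", "Article B", 0.2)
))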
10. #BigDataSpain 2017
Code intro: input & output
9
Grzegorz, Article A, 1.0
Grzegorz, Article F, 0.9
Grzegorz, Article C, 0.9
Grzegorz, Article D, 0.8
Grzegorz, Article B, 0.75
Bożena, ... ...
TYPE1
5recos.peruser
save table_1
Krzysztof, Article F, 1.0
Krzysztof, Article D, 1.0
Krzysztof, Article C, 0.8
Krzysztof, Article B, 0.85
Grażyna, Article C, 1.0
Grażyna, ... ...
TYPE2
4recos.peruser
save table_2
Grzegorz, Article E, 1.0
Grzegorz, Article B, 0.75
Grzegorz, Article A, 0.8
Bożena, Article E, 0.9
Bożena, Article A, 0.75
Bożena, Article C 0.75
TYPE3
3recos.peruser
save table_3
Grażyna, Article A, 1.0
Grażyna, Article F, 0.9
Bożena, Article B, 0.9
Bożena, Article D, 0.9
Grzegorz, Article B, 1.0
Grzegorz, Article E, 0.95
TYPE4
2recos.peruser
save table_4
11. #BigDataSpain 2017
Standard approach
recoTypes.foreach(recoType => {
  val topNrecommendations = processedData.where($"type" === recoType.code)
    .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
    .where(col("row_number") <= recoType.recoNum).drop("row_number")
  RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
[Spark UI screenshots: no parallelism vs. parallelism, but most of the tasks skipped]
12. #BigDataSpain 2017
Maybe we can add .par?
recoTypes.par.foreach(recoType => {
  val topNrecommendations = processedData.where($"type" === recoType.code)
    .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
    .where(col("row_number") <= recoType.recoNum).drop("row_number")
  RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
[Spark UI screenshot: parallelism, but too many tasks :(]
13. #BigDataSpain 2017
Our trick
parallelizeProcessing(recoTypes, (recoType: RecoType) => {
  val topNrecommendations = processedData
    .where($"type" === recoType.code)
    .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
    .where(col("row_number") <= recoType.recoNum)
    .drop("row_number")
  RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})

def parallelizeProcessing(recoTypes: Seq[RecoType], f: RecoType => Unit): Unit = {
  f(recoTypes.head)
  if (recoTypes.tail.nonEmpty) recoTypes.tail.par.foreach(f)
}
Execute the Spark action for the first type first, then parallelize the rest.
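Our reading of why this works: the first, sequential call pays for the shared upstream computation once, and the parallel jobs for the remaining types then reuse its results (compare the "tasks skipped" annotation above). A minimal sketch of the same idea with explicit caching - the .cache() call is our addition, not from the talk:

// Cache the shared DataFrame: the first action computes and caches it,
// the parallel jobs for the remaining types then mostly read cached data.
processedData.cache()

def parallelizeProcessing(recoTypes: Seq[RecoType], f: RecoType => Unit): Unit = {
  f(recoTypes.head)                 // first Spark action fills the cache
  if (recoTypes.tail.nonEmpty)
    recoTypes.tail.par.foreach(f)   // Spark actions are safe to run from multiple threads
}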
14. 2. Fast bulk-loading to HBase
Problems with the standard HBase client (inserts with the Put class):
• Difficult integration with Spark
• Complicated parallelization
• For non-pre-split tables, problems with *Region*Exceptions
• Slow for millions of rows
(Diagram: Spark DataFrame / RDD → .foreachPartition → parallel hTable.put(…) calls, one per partition.)
15. #BigDataSpain 2017
Idea
Our approach is based on:
https://github.com/zeyuanxy/spark-hbase-bulk-loading
Input RDD:
data: RDD[(          // pair RDD
  Array[Byte],       // HBase row key
  Map[               // data:
    String,          // column family
    Array[(
      String,        // column name
      (String,       // cell value
       Long)         // timestamp
    )]
  ]
)]
General idea:
We have to save our RDD data as HFiles (the files in which HBase stores its data) and load them into a given pre-existing table.
General steps:
1. Implement a Spark Partitioner that defines how the data in a key-value pair RDD should be partitioned by HBase row key (a sketch of this partitioner follows the implementation below)
2. Repartition and sort the RDD within partitions, according to the column families and the start row keys of every HBase region
3. Save the RDD to HDFS as HFiles with the rdd.saveAsNewAPIHadoopFile method
4. Load the files into the table with LoadIncrementalHFiles (HBase API)
16. #BigDataSpain 2017
Implementation
// Prepare hConnection, tableName, hTable ...
val regionLocator = hConnection.getRegionLocator(tableName)
val columnFamilies = hTable.getTableDescriptor.getFamiliesKeys.map(Bytes.toString(_))
val partitioner = new HFilePartitioner(regionLocator.getStartKeys, fraction)

// prepare partitioned RDD
val rdds = for {
  family <- columnFamilies
  rdd = data
    .collect { case (key, dataMap) if dataMap.contains(family) => (key, dataMap(family)) }
    .flatMap { case (key, familyDataMap) =>
      familyDataMap.map { case (column: String, valueTs: (String, Long)) =>
        (((key, Bytes.toBytes(column)), valueTs._2), Bytes.toBytes(valueTs._1))
      }
    }
} yield getPartitionedRdd(rdd, family, partitioner)
val rddToSave = rdds.reduce(_ ++ _)

// prepare map-reduce job for bulk-load
HFileOutputFormat2.configureIncrementalLoad(job, hTable, regionLocator)

// prepare path for HFiles output
val fs = FileSystem.get(hbaseConfig)
val hFilePath = new Path(...)
try {
  rddToSave.saveAsNewAPIHadoopFile(hFilePath.toString,
    classOf[ImmutableBytesWritable], classOf[KeyValue],
    classOf[HFileOutputFormat2], job.getConfiguration)
  // prepare HFiles for incremental load by setting
  // folder permissions to read/write/exec for all...
  setRecursivePermission(hFilePath)
  val loader = new LoadIncrementalHFiles(hbaseConfig)
  loader.doBulkLoad(hFilePath, hConnection.getAdmin, hTable, regionLocator)
} // finally close resources, ...
The steps in the code above: prepare the HBase connection, table and region locator; build a Spark partitioner for the HBase regions; repartition and sort the data within partitions using that partitioner; save the HFiles to HDFS via saveAsNewAPIHadoopFile; finally load the HFiles into the HBase table.
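The snippet above refers to HFilePartitioner and getPartitionedRdd without showing them. A minimal sketch of what they may look like - our reconstruction, modeled on the zeyuanxy/spark-hbase-bulk-loading project linked above, not code from the talk:

import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Routes each ((rowKey, column), timestamp) key to the HBase region that will
// host the row key; `fraction` sub-splits each region into several partitions.
class HFilePartitioner(startKeys: Array[Array[Byte]], fraction: Int) extends Partitioner {
  override def numPartitions: Int = startKeys.length * fraction
  override def getPartition(key: Any): Int = {
    val rowKey = key.asInstanceOf[((Array[Byte], Array[Byte]), Long)]._1._1
    val region = math.max(0, startKeys.lastIndexWhere(Bytes.compareTo(_, rowKey) <= 0))
    region * fraction + ((java.util.Arrays.hashCode(rowKey) & Int.MaxValue) % fraction)
  }
}

// HFiles must be written in sorted key order, hence repartitionAndSortWithinPartitions.
implicit val bytesOrdering: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
  def compare(a: Array[Byte], b: Array[Byte]): Int = Bytes.compareTo(a, b)
}

def getPartitionedRdd(rdd: RDD[(((Array[Byte], Array[Byte]), Long), Array[Byte])],
                      family: String,
                      partitioner: HFilePartitioner): RDD[(ImmutableBytesWritable, KeyValue)] =
  rdd
    .repartitionAndSortWithinPartitions(partitioner)
    .map { case (((row, column), ts), value) =>
      (new ImmutableBytesWritable(row),
       new KeyValue(row, Bytes.toBytes(family), column, ts, value))
    }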
17. #BigDataSpain 2017
Keep in mind
• Tune the HBase parameter hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily (default 32)
• For large data, too small a value of this parameter may cause IllegalArgumentException: Size exceeds Integer.MAX_VALUE
• Create HBase tables with splits adapted to the expected row keys
- example: for row keys built from hex IDs, create the table with splits like:
create 'hbase_table_name', 'col-fam', {SPLITS => ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']}
- for subsequent single puts this minimizes *Region*Exceptions
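The same pre-splitting can also be done from code. A hedged sketch using the HBase 1.x Admin API (table and column-family names taken from the shell example above; hConnection is assumed to be an open HBase Connection):

import org.apache.hadoop.hbase.{HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.util.Bytes

// One split point per hex character: regions (-inf,'0'), ['0','1'), ..., ['f',+inf)
val splits: Array[Array[Byte]] =
  "0123456789abcdef".map(c => Bytes.toBytes(c.toString)).toArray

val descriptor = new HTableDescriptor(TableName.valueOf("hbase_table_name"))
descriptor.addFamily(new HColumnDescriptor("col-fam"))
hConnection.getAdmin.createTable(descriptor, splits)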
18. #BigDataSpain 2017
3. Loading data from Postgres to Spark
This is possible for data from Hive (note: to call a UDF by name in a SQL string, it must be registered):
val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val data: DataFrame = sparkSession.sql(
  "SELECT id, toUpperCaseUdf(code) FROM types"
)
But this is not possible for data from JDBC (for example PostgreSQL):
val toUpperCase: String => String = _.toUpperCase
val toUpperCaseUdf = udf(toUpperCase)
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl,
    "(SELECT toUpperCaseUdf(code) " +
    "FROM codes) as codesData",
    connectionConf)
This subquery is executed by Postgres (not Spark) - here you can specify only a Postgres table name or subquery, so the Spark UDF is unknown to the database. And how do we parallelize the data loading?
19. #BigDataSpain 2017
Our solution
Try to load the 'raw' data without UDFs, and then use .withColumn with the UDF as an expression (.jdbc produces a DataFrame, so the UDF is applied by Spark; the UDF must be registered for expr to resolve it):
val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl,
    "(SELECT code " +
    "FROM codes) as codesData",
    connectionConf)
  .withColumn("upperCode",
    expr("toUpperCaseUdf(code)"))
But this still loads everything through one partition! To parallelize the load, we split the table read across executors on a selected numeric column:
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(
    url = jdbcUrl,
    table = "(SELECT code, type_id " +
      "FROM codes) as codesData",
    columnName = "type_id",
    lowerBound = 1L,
    upperBound = 100L,
    numPartitions = 10,
    connectionProperties = connectionConf)
21. #BigDataSpain 2017
4. From HBase to Spark by Hive
There are commonly used method for loading
data from HBase to Spark by Hive external
table:
CREATE TABLE hive_view_on_hbase (
key int,
value string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key, cf1:val"
)
TBLPROPERTIES (
"hbase.table.name" = "xyz"
);
Example HBase table, column family cities (one counter column per city):

row key      | Poznan | Warsaw | Cracow | Gdansk
72A9DBA74524 | 40     | 5      | 1      | 3
58383B36275A |        | 120    | 60     | 5
009D22419988 | 75     | 1      |        |

Hive view on this table (via the Hive-HBase-Handler):

user_id      | cities_map                                       | last_city
72A9DBA74524 | map(Poznan->40, Warsaw->5, Cracow->1, Gdansk->3) | ?
58383B36275A | map(Warsaw->120, Cracow->60, Gdansk->5)          | ?
009D22419988 | map(Poznan->75, Warsaw->1)                       | ?

But how to get the last (most recent) values? Where are the timestamps?
22. #BigDataSpain 2017
Our case
• We use the HDP distribution of the Hadoop cluster, with HBase 1.1.x
• It is possible to add the latest timestamp of a row modification to the Hive view on an HBase table:
CREATE TABLE hive_view_on_hbase (
key int,
value string,
ts timestamp
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping' = ':key, cf1:val, :timestamp'
)
TBLPROPERTIES (
'hbase.table.name' = 'xyz'
);
• But how to extract the timestamp of each cell?
• Answer: rewrite the Hive-HBase-Handler that is responsible for creating the Hive views on HBase tables :) … but first …
• Do not download the Hive source code from the Hive GitHub repository - check your Hadoop distribution first! (for example, HDP has its own code branch)
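Once the view exposes the row timestamp, it can be queried from Spark like any other Hive table. A minimal sketch, assuming the hive_view_on_hbase table above exists in the metastore and the SparkSession has Hive support enabled:

import org.apache.spark.sql.functions.max

// The row-level timestamp is now just a column, e.g. for the most recent
// modification per key:
val view = sparkSession.sql("SELECT key, value, ts FROM hive_view_on_hbase")
view.groupBy("key").agg(max("ts").as("last_modified")).show()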
24. #BigDataSpain 2017
There is a lot of code…
…but we have some tips on how to change the Hive-HBase-Handler:
• The functions that parse the columns of hbase.columns.mapping are located in HBaseSerDe.java, which returns a ColumnMappings object
• The LazyHBaseRow class stores the data of an HBase row
• Timestamps of the processed HBase cells can be read from the rows loaded (by the scanner) in the LazyHBaseCellMap class
• The column parser and the HBase scanner are initialized in HBaseStorageHandler.java
25. #BigDataSpain 2017
5. Spark + Kafka: own offset manager
Problem description:
• Spark output operations are at-least-once
• For exactly-once semantics, you must store offsets after an idempotent output, or in an atomic transaction alongside the output
• Options:
1. Checkpoints
+ easy to enable via Spark checkpointing
- the output operation must be idempotent
- you cannot recover from a checkpoint if the application code has changed
2. Own data store
+ independent of changes to your application code
+ you can use data stores that support transactions
+ exactly-once semantics
(Diagram: a single Spark batch - process and save data, then save offsets. Image source: Spark Streaming documentation, https://spark.apache.org/docs/latest/streaming-programming-guide.html)
26. #BigDataSpain 2017
Some code with Spark Streaming
val ssc: StreamingContext = new StreamingContext(…)
val stream: DStream[ConsumerRecord[String, String]] = ...

stream.foreachRDD(rdd => {
  val toSave: Seq[String] = rdd.collect().map(_.value())
  saveData(toSave)
  offsetsStore.saveOffsets(rdd, ...)
})
27. #BigDataSpain 2017
Some code with Spark Streaming
val ssc: StreamingContext = new StreamingContext(...)
val stream: DStream[ConsumerRecord[String, String]] =
  kafkaStream(topic, zkPath, ssc, offsetsStore, kafkaParams)

stream.foreachRDD(rdd => {
  val toSave: Seq[String] = rdd.collect().map(_.value())
  saveData(toSave)
  offsetsStore.saveOffsets(rdd, zkPath)
})

def kafkaStream(topic: String, zkPath: String, ssc: StreamingContext, offsetsStore: MyOffsetsStore,
                kafkaParams: Map[String, Object]): DStream[ConsumerRecord[String, String]] = {
  offsetsStore.readOffsets(topic, zkPath) match {
    case Some(offsetsMap) =>
      KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
        ConsumerStrategies.Assign[String, String](offsetsMap.map(_._1), kafkaParams, offsetsMap))
    case None =>
      KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams))
  }
}
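For completeness, an illustrative kafkaParams map - our assumptions, using standard Kafka consumer settings; auto-commit is disabled because offsets are stored in ZooKeeper by MyOffsetsStore:

import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092,broker2:9092", // hypothetical brokers
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "reco-consumer",                      // hypothetical group id
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)  // offsets managed by MyOffsetsStore
)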
28. #BigDataSpain 2017
Code of offset store
class MyOffsetsStore(zkHosts: String) {

  val zkUtils = ZkUtils(zkHosts, 10000, 10000, false)

  // Offsets are stored under zkPath as a string of "partition:untilOffset" pairs.
  def saveOffsets(rdd: RDD[_], zkPath: String): Unit = {
    val offsetsRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetsRanges.groupBy(_.topic).foreach { case (topic, offsetsRangesPerTopic) =>
      val offsetsRangesStr = offsetsRangesPerTopic
        .map(offRange => s"${offRange.partition}:${offRange.untilOffset}").mkString(",")
      zkUtils.updatePersistentPath(zkPath, offsetsRangesStr)
    }
  }

  // Reads the stored string back and rebuilds the TopicPartition -> offset map.
  def readOffsets(topic: String, zkPath: String): Option[Map[TopicPartition, Long]] = {
    val (offsetsRangesStrOpt, _) = zkUtils.readDataMaybeNull(zkPath)
    offsetsRangesStrOpt match {
      case Some(offsetsRangesStr) =>
        Some(offsetsRangesStr.split(",").map(_.split(":")).map {
          case Array(partitionStr, offsetStr) =>
            new TopicPartition(topic, partitionStr.toInt) -> offsetStr.toLong
        }.toMap)
      case None => None
    }
  }
}